Hadoop Fair Sojourn Protocol Based Job Scheduling On Hadoop
Mr. Dileep Kumar J. S. 1, Mr. Madhusudhan Reddy G. R. 2
1 Department of Computer Science, M.S. Engineering College, Bangalore, Karnataka
2 Assistant Professor, Department of Computer Science, M.S. Engineering College, Bangalore, Karnataka

ABSTRACT: We present the Hadoop Fair Sojourn Protocol (HFSP), a scheduler that brings size-based scheduling to a real, multi-server, complex, and widely used system such as Hadoop. The scheduling discipline is based on the concepts of virtual time and job aging. These techniques are conceived to operate in a multi-server system with tolerance to failures, scale-out upgrades, and multi-phase jobs, a peculiarity of MapReduce. Our results, which are based on realistic workloads generated via a standard benchmarking suite, point to a significant decrease in system response times with respect to the widely used Hadoop Fair Scheduler without impacting fairness, and show that HFSP is largely tolerant to job size estimation errors. Overall, the results indicate that size-based scheduling is a realistic option for Hadoop clusters, because HFSP sustains even rough approximations of job sizes.

Keywords: MapReduce, Performance, Data Analysis, Scheduling.

1. INTRODUCTION

MapReduce has become a popular model for data-intensive computation in recent years. By breaking each job down into small map and reduce tasks and executing them in parallel across a large number of machines, MapReduce can significantly reduce the running time of data-intensive jobs. However, despite recent efforts toward designing resource-efficient MapReduce schedulers, existing solutions that focus on scheduling at the task level still offer sub-optimal job performance. This is because tasks can have highly varying resource requirements during their lifetime, which makes it difficult for task-level schedulers to effectively utilize available resources to reduce job execution time.
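The model described above can be illustrated with a toy word-count job. This is a self-contained sketch, not Hadoop's implementation: Hadoop distributes map tasks across machines, while here a thread pool stands in for the cluster, and all names are illustrative.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_task(chunk):
    """Map phase: emit (word, 1) pairs for one input split."""
    return [(word, 1) for word in chunk.split()]

def reduce_task(word, counts):
    """Reduce phase: aggregate all values for one key."""
    return word, sum(counts)

def run_job(splits, workers=4):
    # Map tasks run in parallel, one per input split.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(map_task, splits)
    # Shuffle: group intermediate pairs by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    # Reduce tasks can only start after the map phase completes.
    return dict(reduce_task(w, c) for w, c in groups.items())

print(run_job(["big data big", "data jobs"]))
# -> {'big': 2, 'data': 2, 'jobs': 1}
```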
The advent of large-scale data analytics, fostered by parallel frameworks such as Hadoop, Spark, and Naiad, has created the need to manage the resources of compute clusters operating in a shared, multi-tenant environment. Within the same company, many users share the same cluster, because this avoids redundancy in physical deployments and in data storage, and may represent enormous cost savings. Initially designed for a few very large batch processing jobs, data-intensive scalable computing frameworks such as MapReduce are nowadays used by many companies for production, recurrent, and even experimental data analysis jobs. This heterogeneity is substantiated by recent studies that analyze a variety of production-level workloads. An important fact that emerges from previous works is that there exists a stringent need for short system response times. Many operations, such as data exploration, preliminary analyses, and algorithm tuning, often involve interactivity, in the sense that there is a human in the loop seeking answers through a trial-and-error process. In addition, workflow schedulers such as Oozie contribute to workload heterogeneity by generating a number of small orchestration jobs. At the same time, there are many batch jobs working on big datasets: such jobs are a fundamental part of the workloads, since they transform data into value. Due to the heterogeneity of the workload, it is very important to find the right trade-off in assigning resources to interactive and batch jobs. Our solution implements a size-based, preemptive scheduling discipline. The scheduler allocates cluster resources such that job size information, which is not available a priori, is inferred while the job makes
progress toward its completion. Scheduling decisions use the concept of virtual time, in which jobs make progress according to an aging function: cluster resources are focused on jobs according to their priority, computed through aging. This ensures that neither small nor large jobs suffer from starvation. The outcome of our work materializes as a full-fledged scheduler implementation that integrates seamlessly in Hadoop: we called our scheduler HFSP, to acknowledge an influential work in the size-based scheduling literature.

2. Literature Survey

This section reviews the existing system, how to overcome its limitations, and the components of the proposed system. An important fact that emerges is that there exists a stringent need for short system response times. Many operations, such as data exploration, preliminary analyses, and algorithm tuning, often involve interactivity, in the sense that there is a human in the loop seeking answers through a trial-and-error process. In addition, workflow schedulers such as Oozie [1] contribute to workload heterogeneity by generating a number of small orchestration jobs. At the same time, there are many batch jobs working on big datasets: such jobs are a fundamental part of the workloads, since they transform data into value. Due to the heterogeneity of the workloads, it is very important to find the right trade-off in assigning resources to interactive and batch jobs. There are mainly two different strategies used to schedule jobs in a cluster. The first strategy is to split the cluster resources equally among all the running jobs. A remarkable example of this strategy is the Hadoop Fair Scheduler [2], [3]. While this strategy preserves fairness among jobs, when the system is overloaded it may increase the response times of the jobs. The second strategy is to serve one job at a time, thus avoiding resource splitting.
An example of this strategy is First-In-First-Out (FIFO), in which the job that arrived first is served first. The problem with this strategy is that, being blind to job size, its scheduling choices lead inevitably to poor performance: small jobs may find large jobs ahead of them in the queue, and thus incur response times that are disproportionate to their size. As a consequence, interactivity is difficult to obtain. Both strategies have drawbacks that prevent them from being used directly in production without precautions. Commonly, a manual configuration of both the scheduler and the system parameters is required to overcome such drawbacks. This involves the manual setup of a number of pools to divide the resources among different job categories, and the fine-tuning of the parameters governing resource allocation. This process is tedious and error-prone, and cannot adapt easily to changes in workload composition and cluster configuration. In addition, it is often the case that clusters are over-dimensioned: this simplifies resource allocation (with abundance, managing resources is less critical), but has the downside of costly deployments and maintenance for resources that are often left unused [4]. We present the design of a new scheduling protocol that caters to a fair and efficient utilization of cluster resources, while striving to achieve short response times. Our approach satisfies both the interactivity requirements of small jobs and the performance requirements of large jobs, which can thus coexist in a cluster without requiring manual setups and complex tuning: our technique automatically adapts to resource and workload dynamics. Our solution implements a size-based, preemptive scheduling discipline. The scheduler allocates cluster resources such that job size information, which is not available a priori, is inferred while the job makes progress towards its completion.
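The contrast between the two strategies can be made concrete with a minimal single-server sketch (a toy model, not the Hadoop schedulers themselves): three jobs arrive together, one large and two small, and we compare their sojourn times under FIFO and processor sharing.

```python
def fifo_sojourn(sizes):
    """All jobs arrive at t=0; served one at a time, in submission order."""
    t, done = 0.0, []
    for s in sizes:
        t += s
        done.append(t)
    return done

def ps_sojourn(sizes):
    """All jobs arrive at t=0; the server is shared equally among active jobs."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    finish = [0.0] * len(sizes)
    t, prev, active = 0.0, 0.0, len(sizes)
    for i in order:
        # The next-smallest job finishes after its residual work is served
        # at rate 1/active (the server is split among `active` jobs).
        t += (sizes[i] - prev) * active
        finish[i] = t
        prev = sizes[i]
        active -= 1
    return finish

# One large job submitted just before two small ones (sizes in task-minutes).
sizes = [10, 1, 1]
print(fifo_sojourn(sizes))  # small jobs wait behind the large one
print(ps_sojourn(sizes))    # every job slows every other job down
```

With these sizes, FIFO makes the two 1-minute jobs wait 11 and 12 minutes, while PS completes them in 3 minutes each but stretches the large job to 12 minutes: neither strategy handles a mixed workload gracefully.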
Scheduling decisions use the concept of virtual time, in which jobs make progress according to an aging function: cluster resources are focused on jobs according to their priority, computed through aging. This ensures that neither small nor large jobs suffer from starvation. The outcome of the work materializes as a full-fledged scheduler implementation that integrates seamlessly in Hadoop [5].

3. Traditional Scheduling

Processor Sharing (PS) and First Come First Serve (FCFS) are arguably the two most simple and ubiquitous scheduling disciplines in use in many systems. For instance, Fair and FIFO are two schedulers for Hadoop, the first inspired by PS and the second by FCFS. In FCFS, jobs are scheduled in the order
of their submission, while in PS resources are divided equally so that each active job keeps progressing. In a loaded system, these disciplines have severe shortcomings: in FCFS, large running jobs can significantly delay small ones; in PS, each additional job delays the completion of all the others. In order to improve the performance of the system in terms of delay, it is important to consider the size of the jobs. Size-based scheduling adopts the idea of giving priority to small jobs, so that they will not be slowed down by large ones. The Shortest Remaining Processing Time (SRPT) policy, which prioritizes jobs that need the least amount of work to complete, is the one that minimizes the mean response time (or sojourn time), that is, the time that passes between a job's submission and its completion [6]. Figure 1 (Comparison between PS and SRPT) provides an example: two small jobs, j2 and j3, are submitted while a large job j1 is running. While in PS the three jobs run (slowly) in parallel, in a size-based discipline j1 is preempted; the result is that j2 and j3 complete earlier. Like most size-based scheduling techniques, SRPT temporarily suspends the progress of lower-priority jobs. Fortunately, this is not a problem in a batch system like Hadoop, for which results are usable only after the job is completed. While policies like SRPT improve mean response time, they may incur starvation: if small jobs are continuously submitted, large ones may never receive service [7], [8]; this results in job mistreatment. To avoid starvation, a common solution is to perform job aging. With job aging, the system virtually decreases the size of jobs waiting in the queue, and keeps them sorted according to their virtual size, serving the one with the currently smallest virtual size. When job size is perfectly known a priori, the FSP discipline exploits aging to provide a strong dominance fairness guarantee: no job completes in FSP later than it would in PS. FSP also guarantees excellent results in terms of both job response time and fairness when job sizes are not known exactly. For these reasons, the design of HFSP is guided by the abstract ideas behind FSP.

4. Hadoop Fair Sojourn Protocol (HFSP)

The Hadoop Fair Sojourn Protocol (HFSP) is a size-based scheduler with aging for Hadoop. Implementing HFSP raises a number of challenges: a few come from MapReduce itself, for example the fact that a job is composed of tasks, while others come from the size-based nature of the scheduler in a context where the size of the jobs is not known a priori.

Jobs: In MapReduce, jobs are scheduled at the granularity of tasks, and they consist of two separate phases called MAP and REDUCE. We evaluate job sizes by running a subset of sample tasks for each job; however, reduce tasks can be launched only after the map phase is complete. The scheduler thus logically splits the job into the two phases and treats them independently: it considers the job as composed of two parts with different sizes, one for the MAP phase and one for the REDUCE phase. When a resource is available for scheduling a MAP (resp. REDUCE) task, the scheduler sorts jobs based
on their virtual MAP (resp. REDUCE) sizes, and grants the resource to the job with the smallest size for that phase.

Estimated and virtual size: The size of each phase, to which we will refer as the real size, is unknown until the phase itself is complete. The scheduler therefore works using an estimated size; starting from this estimate, the scheduler applies job aging, i.e., it computes the virtual size, based on the time spent by the job in the waiting queue. The estimated and the virtual sizes are calculated by two different modules: the estimation module, which outputs the estimated size, and the aging module, which takes the estimated size as input and applies an aging function.

5. Modules in Phase

1. The Estimation Module
2. The Aging Module
3. The Scheduling Policy

The Estimation Module: The role of the estimation module is to assign a size to a job phase such that, given two jobs, the scheduler can discriminate the smaller one for that phase. When a new job is submitted, the module assigns each phase an initial size Si, which is based on the number of its tasks. The initial size is necessary to quickly infer job priorities. A more accurate estimate is computed immediately after job submission, through a training stage: in this stage, a subset of t tasks, called the training tasks, is executed, and their execution time is used to update Si to a final estimated size Sf. Choosing t induces the following trade-off: a small value reduces the time spent in the training stage at the expense of inaccurate estimates, while a large value increases the estimation accuracy but results in a longer training stage. The scheduler is designed to work with rough estimates, therefore a small t is sufficient for obtaining good performance.

The Aging Module: The aging module takes as input the estimated sizes to compute virtual sizes.
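The two-step estimation described above can be sketched as follows. The function names, the default per-task duration, and the averaging rule are illustrative assumptions: the text only specifies that Si derives from the task count and Sf from the observed execution times of the t training tasks.

```python
def initial_size(num_tasks, default_task_duration=60.0):
    """Initial estimate S_i, proportional to the task count, so that job
    priorities can be inferred immediately at submission time.
    (default_task_duration is an assumed placeholder, not an HFSP constant.)"""
    return num_tasks * default_task_duration

def final_size(num_tasks, training_durations):
    """Final estimate S_f: once the t training tasks have run, scale their
    average observed duration by the total number of tasks in the phase."""
    avg = sum(training_durations) / len(training_durations)
    return num_tasks * avg

# A 100-task map phase; t = 3 training tasks observed to take ~42 s each.
s_i = initial_size(100)
s_f = final_size(100, [40.0, 44.0, 42.0])
print(s_i, s_f)  # 6000.0 4200.0
```

The trade-off on t is visible here: more training durations sharpen the average that Sf is built from, but each extra training task delays the moment Sf becomes available.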
The use of virtual size is a technique applied in many practical implementations of well-known schedulers: it consists in keeping track of the amount of remaining work for each job phase in a virtual fair system, and updating it every time the scheduler is called. The result is that even if the job doesn't receive resources, and thus its real size does not decrease, in the virtual system the job's virtual size slowly decreases with time. Job aging avoids starvation, achieves fairness, and requires minimal computational load, since the virtual size does not incur costly updates.

The Scheduling Policy: In this section we describe how the estimation and the aging modules coexist to create a Hadoop scheduler that strives to be both efficient and fair.

DEV: This workload is indicative of a development environment, whereby users rapidly submit several small jobs to build their data analysis tasks, together with jobs that operate on larger datasets. This workload is inspired by the Facebook 2009 trace observed by Chen et al.; the mean interval between job arrivals is µ = 30 s.
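A minimal sketch of job aging with virtual sizes follows. The 1/n virtual-sharing rate is an assumed aging function for illustration, not HFSP's exact one; the point is that a waiting job's virtual size shrinks over time even when it receives no real resources.

```python
class AgedJob:
    """Tracks a job phase's virtual size in a simulated fair system: even
    when the job receives no real resources, its virtual size shrinks over
    time, so large jobs cannot be starved indefinitely."""
    def __init__(self, name, estimated_size):
        self.name = name
        self.virtual_size = estimated_size

    def age(self, elapsed, active_jobs):
        # In the virtual fair system, each of the active jobs would receive
        # an equal 1/n share of the cluster during the elapsed interval.
        self.virtual_size = max(0.0, self.virtual_size - elapsed / active_jobs)

def pick_next(jobs, elapsed):
    """Age every waiting job, then grant the free resource to the job
    with the smallest current virtual size."""
    for job in jobs:
        job.age(elapsed, len(jobs))
    return min(jobs, key=lambda j: j.virtual_size)

jobs = [AgedJob("large", 1000.0), AgedJob("small", 4.0)]
print(pick_next(jobs, elapsed=10.0).name)  # prints "small"
```

The update is one subtraction per job at each scheduling call, which is why the text notes that aging adds minimal computational load; given enough waiting time, even the large job's virtual size reaches zero and it gets served.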
TEST: This workload represents a test environment, whereby users evaluate and test their data analysis tasks on a rather uniform range of dataset sizes, with 20% of the jobs operating on a large dataset. The mean interval between jobs is µ = 60 s.

PROD: This workload is representative of a production environment, whereby data analysis tasks operate predominantly on large datasets. The mean interval between jobs is µ = 60 s.

6. Conclusion

This work was motivated by the increasing demand for system responsiveness, driven by both interactive data analysis tasks and long-running batch processing jobs, as well as for a fair and efficient allocation of system resources. We presented a novel approach to the resource allocation problem, based on the idea of size-based scheduling. The effort materialized in a full-fledged scheduler that we called HFSP, the Hadoop Fair Sojourn Protocol, which implements a size-based discipline that satisfies system responsiveness and fairness requirements simultaneously. The work raised many challenges: evaluating job sizes online without wasting resources, avoiding job starvation for both small and large jobs, and guaranteeing short response times despite estimation errors were the most noteworthy. Our future work is related to job preemption. We are currently investigating a novel technique to fill the gap between killing running tasks and waiting for tasks to finish. Indeed, killing a task too late is a huge waste of work, and waiting for a task to complete when it has just started is detrimental as well. Our next goal is thus to provide a new set of primitives to suspend and resume tasks, to achieve better preemption.

References

[1] Apache, Oozie Workflow Scheduler.
[2] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012.
[3] Apache, The Hadoop fair scheduler.
[4] Y. Chen, S. Alspaugh, and R. Katz, Interactive query processing in big data systems: A cross-industry study of MapReduce workloads, in Proc. of VLDB.
[5] E. Friedman and S. Henderson, Fairness and efficiency in web server protocols, in Proc. of ACM SIGMETRICS.
[6] L. E. Schrage and L. W. Miller, The queue M/G/1 with shortest remaining processing time discipline.
[7] W. Stallings, Operating Systems. Prentice Hall.
[8] Microsoft, The Naiad system.
[9] D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi, Naiad: A timely dataflow system, in Proceedings of the 24th ACM Symposium on Operating Systems Principles, 2013.
[10] K. Ren et al., Hadoop's adolescence: An analysis of Hadoop usage in scientific workloads, in Proc. of VLDB.
[11] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, Effective straggler mitigation: Attack of the clones, in NSDI, vol. 13, 2013.
An Adaptive Scheduling Algorithm for Dynamic Heterogeneous Hadoop Systems
An Adaptive Scheduling Algorithm for Dynamic Heterogeneous Hadoop Systems Aysan Rasooli, Douglas G. Down Department of Computing and Software McMaster University {rasooa, downd}@mcmaster.ca Abstract The
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
Enhancing MapReduce Functionality for Optimizing Workloads on Data Centers
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,
Hadoop Fair Scheduler Design Document
Hadoop Fair Scheduler Design Document October 18, 2010 Contents 1 Introduction 2 2 Fair Scheduler Goals 2 3 Scheduler Features 2 3.1 Pools........................................ 2 3.2 Minimum Shares.................................
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
Evaluating Task Scheduling in Hadoop-based Cloud Systems
2013 IEEE International Conference on Big Data Evaluating Task Scheduling in Hadoop-based Cloud Systems Shengyuan Liu, Jungang Xu College of Computer and Control Engineering University of Chinese Academy
Using In-Memory Computing to Simplify Big Data Analytics
SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed
HDFS Space Consolidation
HDFS Space Consolidation Aastha Mehta*,1,2, Deepti Banka*,1,2, Kartheek Muthyala*,1,2, Priya Sehgal 1, Ajay Bakre 1 *Student Authors 1 Advanced Technology Group, NetApp Inc., Bangalore, India 2 Birla Institute
Fair Scheduler. Table of contents
Table of contents 1 Purpose... 2 2 Introduction... 2 3 Installation... 3 4 Configuration...3 4.1 Scheduler Parameters in mapred-site.xml...4 4.2 Allocation File (fair-scheduler.xml)... 6 4.3 Access Control
Massive Cloud Auditing using Data Mining on Hadoop
Massive Cloud Auditing using Data Mining on Hadoop Prof. Sachin Shetty CyberBAT Team, AFRL/RIGD AFRL VFRP Tennessee State University Outline Massive Cloud Auditing Traffic Characterization Distributed
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
FAULT TOLERANCE FOR MULTIPROCESSOR SYSTEMS VIA TIME REDUNDANT TASK SCHEDULING
FAULT TOLERANCE FOR MULTIPROCESSOR SYSTEMS VIA TIME REDUNDANT TASK SCHEDULING Hussain Al-Asaad and Alireza Sarvi Department of Electrical & Computer Engineering University of California Davis, CA, U.S.A.
Scheduling Algorithms in MapReduce Distributed Mind
Scheduling Algorithms in MapReduce Distributed Mind Karthik Kotian, Jason A Smith, Ye Zhang Schedule Overview of topic (review) Hypothesis Research paper 1 Research paper 2 Research paper 3 Project software
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications
Performance Prediction, Sizing and Capacity Planning for Distributed E-Commerce Applications by Samuel D. Kounev ([email protected]) Information Technology Transfer Office Abstract Modern e-commerce
CDH AND BUSINESS CONTINUITY:
WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable
Cloud Computing based on the Hadoop Platform
Cloud Computing based on the Hadoop Platform Harshita Pandey 1 UG, Department of Information Technology RKGITW, Ghaziabad ABSTRACT In the recent years,cloud computing has come forth as the new IT paradigm.
Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
The Improved Job Scheduling Algorithm of Hadoop Platform
The Improved Job Scheduling Algorithm of Hadoop Platform Yingjie Guo a, Linzhi Wu b, Wei Yu c, Bin Wu d, Xiaotian Wang e a,b,c,d,e University of Chinese Academy of Sciences 100408, China b Email: [email protected]
YARN Apache Hadoop Next Generation Compute Platform
YARN Apache Hadoop Next Generation Compute Platform Bikas Saha @bikassaha Hortonworks Inc. 2013 Page 1 Apache Hadoop & YARN Apache Hadoop De facto Big Data open source platform Running for about 5 years
4. Fixed-Priority Scheduling
Simple workload model 4. Fixed-Priority Scheduling Credits to A. Burns and A. Wellings The application is assumed to consist of a fixed set of tasks All tasks are periodic with known periods This defines
