
A Survey of Cloud Computing

Guanfeng

Nov 7, 2010

Abstract

The principal service provided by cloud computing is that the underlying infrastructure of cloud-based platforms and systems, which often consists of compute resources such as storage, processors, RAM, and load balancers, is entirely abstracted from the consumer of the software or services. This characteristic calls for well-designed programming models, runtime systems, and communication protocols built around the requirement of efficiently sharing computing capacity among large numbers of different nodes. Such systems provide several means to reduce complexity and allow programmers to employ parallel and distributed resources without introducing their own cumbersome and incompatible mechanisms. In this survey, we analyze and summarize several techniques proposed in this research field. Our study focuses on programming models, algorithms, and protocols used in real cloud platforms, examining their system implementations on different resource bases, their concurrency and fault tolerance mechanisms, and their approaches to load balancing and schedule management.

1 Introduction

With the vigorous development of the Internet industry and the rapid increase in the volume of data that needs to be processed, people have become more and more interested in utilizing parallel computing resources to address their needs.

However, programming against or even maintaining a distributed system is complicated and expensive. The differing demands of individual customers, such as system availability and performance, and the variety of computing resources involved, such as commodity PC clusters, multi-core and shared-memory systems, or hybrids of the two, make the topic considerably harder. Thus, research in cloud computing has for several years focused on resource-abstraction-layer modeling and systems building, that is, on devising programming models and related protocols that relieve users of intricate resource management under reasonably general conditions. Several effective solutions have been proposed, such as MapReduce [1], Phoenix [2], LATE [3], Centrifuge [4], and many others.

MapReduce was proposed as a programming model for datacenter algorithms and can be applied to many real-world tasks. Inverted index construction, a commonly used program, can easily be computed on a large cluster with MapReduce. A computation consists of two parts, Map and Reduce, both defined by the user. The model first splits the input data records into pieces. The master node then dynamically selects slave workers among the candidate nodes and assigns one or more Map tasks, whose functionality is specified by the user, to each of them. Each worker passes over its received input data and processes each record in the way determined by the Map function. Subsequently, the worker stores its intermediate key/value pairs and partitions them into a fixed number of regions, determined by a user-supplied partitioning function. Notified by the master node, Reduce workers learn where the corresponding hashed partitions reside. After remotely reading the processed pairs (in a real cluster infrastructure) as input to the second phase, Reduce, they sort the intermediate values and group them by key. Finally, the output files of all Reduce invocations constitute the result of the user-defined computation.
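To make the data flow above concrete, the following minimal single-process sketch builds an inverted index by running Map, hash-partitioning the intermediate pairs, and then grouping and reducing each partition. It is our own illustration rather than the paper's code; the function names map_fn and reduce_fn and the partition count R are assumptions.

# Minimal single-process sketch of the MapReduce data flow for an
# inverted index: Map emits (word, doc_id), the partitioner hashes the
# key into R regions, Reduce collects the document list per word.
from collections import defaultdict

R = 4  # number of Reduce partitions (assumed for illustration)

def map_fn(doc_id, text):
    for word in text.split():
        yield word, doc_id

def reduce_fn(word, doc_ids):
    return word, sorted(set(doc_ids))

def run_job(documents):
    # Map phase: partition intermediate pairs by hash(key) % R.
    partitions = [defaultdict(list) for _ in range(R)]
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            partitions[hash(key) % R][key].append(value)
    # Reduce phase: each partition is grouped by key and reduced.
    result = {}
    for part in partitions:
        for key in sorted(part):
            word, postings = reduce_fn(key, part[key])
            result[word] = postings
    return result

if __name__ == "__main__":
    docs = {"d1": "cloud computing survey", "d2": "cloud storage"}
    print(run_job(docs))  # e.g. {'cloud': ['d1', 'd2'], 'computing': ['d1'], ...}

In a real cluster the partitions would be written to local disk and fetched remotely by the Reduce workers, but the Map, partition, and Reduce stages follow the same structure.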

2 Infrastructure

Platforms and runtime systems built on different computing resources have different characteristics and demands. The MapReduce programming model proposed by Google is implemented on commodity PC clusters. During execution it must be concerned with the locations of intermediate files, network traffic, communication protocols between machines, and hardware failures. Some of these concerns do not apply to other resource bases; for example, Phoenix, an implementation of MapReduce for shared-memory systems, focuses instead on buffer management, a quite different fault recovery mechanism, and performance management (such as using prefetching to optimize locality). For both multi-core chips and conventional symmetric multi-processor environments, Phoenix manages Map-Reduce buffers and Reduce-Merge buffers, with well-specified access permissions, to store intermediate and final task output key/value pairs separately. Another system, Centrifuge, was proposed by Microsoft for in-memory server pools. This datacenter lease manager provides a usable replacement for an internal load balancer while guaranteeing system correctness. Moreover, the nodes inside a computing cluster may not be homogeneous and can make progress at different speeds; Longest Approximate Time to End (LATE) is a scheduling algorithm designed for this kind of heterogeneous situation.

3 Availability and Fault Tolerance

In the MapReduce implementation, the master node periodically sends an ICMP echo request to each worker and waits for a response. If no answer arrives within a certain time, the master assumes the worker has crashed, and any incomplete Map or Reduce tasks previously assigned to that worker are re-executed on other nodes.
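As a rough illustration of this detection-and-reexecution loop (our own sketch, not Google's code; the data structures and the timeout value are assumptions), the master could track the last time each worker answered a ping and requeue that worker's in-flight tasks once a deadline passes.

# Sketch of master-side failure detection: workers that miss the ping
# deadline are declared dead and their in-flight tasks go back to the
# pending queue for re-execution on other nodes.
import time

PING_TIMEOUT = 30.0  # seconds without a reply before a worker is presumed dead

class Master:
    def __init__(self):
        self.last_reply = {}   # worker_id -> timestamp of last ping reply
        self.running = {}      # worker_id -> set of task ids in flight
        self.pending = []      # tasks waiting to be (re)assigned

    def record_ping_reply(self, worker_id):
        self.last_reply[worker_id] = time.time()

    def check_workers(self):
        now = time.time()
        for worker_id, last in list(self.last_reply.items()):
            if now - last > PING_TIMEOUT:
                # Worker presumed crashed: requeue its incomplete tasks.
                self.pending.extend(self.running.pop(worker_id, set()))
                del self.last_reply[worker_id]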

Even in the face of a large-scale failure, this re-execution scheme still guarantees that the computation terminates. The master node itself writes periodic checkpoints of all computation state, which can be reloaded into a replacement copy if the master fails. Assuming deterministic Map and Reduce functions, the MapReduce implementation produces the same output, only with a longer execution time, and thereby meets the fault tolerance requirement.

The Phoenix system uses a different approach to worker failure detection. Since a shared-memory system cannot, and has no need to, use network-layer utilities to test the vital signs of workers, the scheduler simply marks a worker as a transient failure if it does not complete a task within a reasonable amount of time. The failure is upgraded to permanent if the task fails several times or the worker fails to finish successfully at a high frequency. Workers with the permanent-failure label are no longer trusted and receive no further tasks. A potential problem here, in our view, is how to determine the reasonable processing time slot. There are pros and cons both to letting users indicate this value and to letting the system determine it entirely on its own. If the system sets the time range too short, even if not far from the reasonable value, many computing resources are wasted re-executing the same tasks. Conversely, if the time range is much longer than the task needs, it increases the time to detect a failure and degrades performance as a consequence.

In the Centrifuge model, the Manager service, which implements the logic of lease management, partitioning, and adaptive load balancing, consists of two parts: a Paxos group, and a set consisting of a leader server and standby servers. The Paxos group is a general mechanism for distributed consensus; in this implementation it elects a leader from the candidates and provides a highly available store. Every operation the leader executes that changes its internal state must be reported to the Paxos group. This sequence of requests is stored at Paxos and handed to a designated standby server when that server is elected as the new leader after the old one crashes.
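A minimal sketch of that hand-off follows; it is our own illustration, and the ReplicatedLog class here merely stands in for the Paxos group rather than reflecting Centrifuge's actual interface. Every state-changing operation is appended to the replicated log before it is applied, so a standby promoted to leader can rebuild the state by replaying the log.

# Sketch of leader state replication through a consensus-backed log: the
# leader records every state-changing operation in the log (standing in for
# the Paxos group) before applying it, and a newly elected leader replays
# the log to reconstruct the previous leader's internal state.
class ReplicatedLog:
    """Stand-in for the highly available store provided by the Paxos group."""
    def __init__(self):
        self.entries = []

    def append(self, operation):
        self.entries.append(operation)   # in reality: agreed on by consensus

class Manager:
    def __init__(self, log):
        self.log = log
        self.state = {}

    def execute(self, key, value):
        self.log.append((key, value))    # report the operation first
        self.state[key] = value          # then apply it locally

    @classmethod
    def promote_standby(cls, log):
        # A standby elected as the new leader replays the stored sequence.
        leader = cls(log)
        for key, value in log.entries:
            leader.state[key] = value
        return leader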

For efficiency, the Owner and Lookup libraries send requests only to the valid leader. They broadcast their messages to all servers only if the leader they previously contacted has become unresponsive. An update occurs when a library receives a response from a new Manager node; from then on, that node becomes the sole destination of all subsequent requests.

4 Performance

In a computer cluster, network bandwidth is an important factor to account for. When a large number of Map workers must read their input files from remote disks, they consume a great deal of scarce network capacity and inevitably lower performance. Google's implementation therefore feeds the location information of the source input data into the master node's decisions and tries to assign as many Map tasks as possible to physical nodes that hold the corresponding input data on their local disks. In addition, it spawns redundant (backup) executions to counter potential stragglers toward the end of both stages. A straggler here is a machine that spends an unreasonably long time executing its tasks during the last few waves of the computation. During this period, customers are left waiting for just a few workers, which may even require the help of the fault tolerance mechanisms later, and this obviously prolongs the total execution time seen by users. Having the master let several machines perform the same operation is proposed to avoid this poor outcome and speed up the processing rate. Another way this model improves performance is by decreasing task granularity. Ideally, the more task pieces there are to schedule, the more balanced a state can be reached, and, within the memory capacity of the master node, finer granularity accelerates recovery as well. However, in our view the overhead introduced by the additional partitioning operations should also be considered.
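A locality preference like the one just described can be sketched as follows; this is our own simplification, with the data structures and fallback order assumed rather than taken from the paper. The master first looks for an idle worker co-located with the split's replica and only then assigns the task to any idle worker.

# Sketch of locality-aware Map task assignment: prefer an idle worker on a
# host that already stores the input split, otherwise fall back to any
# idle worker (rack-level preference is omitted for brevity).
def assign_map_task(task, split_locations, idle_workers):
    """task: task id; split_locations: hosts holding the input split;
    idle_workers: dict host -> worker object with an assign() method."""
    for host in split_locations:
        worker = idle_workers.get(host)
        if worker is not None:
            worker.assign(task)          # data-local assignment, no network read
            return worker
    if idle_workers:
        host, worker = next(iter(idle_workers.items()))
        worker.assign(task)              # remote read; consumes network bandwidth
        return worker
    return None                          # no idle worker; task stays queued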

The situation for shared-memory systems is different again. Instead of taking physical location information into account, the runtime employs a prefetching scheme, similar to double buffering in streaming models, to optimize performance. Moreover, Phoenix uses hardware compression of intermediate pairs to conserve bandwidth and cache resources. In heterogeneous environments, the LATE algorithm provides a modified scheduling mechanism to achieve good performance on its target infrastructure; we discuss it in the next section.

5 Scheduling and Load Balancing

Phoenix and the original Hadoop system [5] have much in common in their scheduling: they split input data, assign Map tasks to workers, and notify Reduce workers of file locations or memory addresses. Nevertheless, a new algorithm for heterogeneous computing resources can still be designed that outperforms the existing ones. LATE begins by challenging the assumptions of homogeneity and linear task progress. Hadoop's scheduling algorithm ranks tasks in the following order: failed tasks, unscheduled tasks, and straggler tasks. A monitor inside the system selects speculative tasks by their progress score. The progress score is a real number from 0 to 1; for a Map task it is the fraction of input read, while a Reduce task's execution is split evenly across its three phases. A task that has run for longer than one minute and whose score is less than the average minus 0.2 is marked as a straggler. The master also guarantees that at most one speculative copy of each task is running at a time. LATE instead picks stragglers by estimating and comparing the tasks' finish times in the future. It provides a simple heuristic to estimate the remaining time and lets users plug in their own estimation methods as well. The scheduler computes the progress rate as ProgressScore / ExecutionTime, and the estimated time left as (1 - ProgressScore) / ProgressRate.
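Under those definitions, LATE's heuristic for choosing which running task to back up can be sketched as follows. This is a minimal illustration with assumed field names; the real scheduler also applies the thresholds and the cap described next.

# Sketch of LATE's remaining-time heuristic: progress_rate = score / elapsed,
# time_left = (1 - score) / progress_rate. The running task with the longest
# estimated time left is the preferred candidate for speculative execution.
def time_left(progress_score, elapsed_seconds):
    if progress_score <= 0.0:
        return float("inf")              # no progress yet: longest estimated time left
    progress_rate = progress_score / elapsed_seconds
    return (1.0 - progress_score) / progress_rate

def pick_speculative_candidate(running_tasks):
    """running_tasks: list of (task_id, progress_score, elapsed_seconds)."""
    return max(running_tasks, key=lambda t: time_left(t[1], t[2]), default=None)

# Example: the slower task (lower rate, more work left) is backed up first.
tasks = [("t1", 0.9, 60.0), ("t2", 0.2, 60.0)]
print(pick_speculative_candidate(tasks))  # -> ("t2", 0.2, 60.0)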

A task is considered a straggler if its progress rate is below the SlowTaskThreshold. Furthermore, the system only launches speculative tasks on fast nodes; a SlowNodeThreshold is used to distinguish fast nodes from slow ones. Another parameter, SpeculativeCap, limits the total number of speculative tasks that can run at the same time. Although these three parameters are set to the 25th percentile of task progress rates, the 25th percentile of node progress, and 10% of the available task slots respectively, and this choice performs well in experiments, the authors should still give more justification for these values and for what the best combination is.

For in-memory server pools, Centrifuge is proposed as a replacement for an internal network load balancer. It provides a Manager service, a Lookup library, and an Owner library. When a server wants to publish an event to a specified topic, it queries its Lookup library using the hash of the topic name, and the library returns the address of the appropriate server hosting that topic. The servers linked against the Owner library receive leases on these topics, based on the same hash keys. When such a server receives a publish message, it must check with its Owner library that it still holds a valid lease before forwarding the event to all parties. The Manager service's job is to partition a flat key namespace into variable-length ranges across all servers linked against the Owner library, and to distribute the corresponding partitioning assignments through a lease protocol. The Manager reassigns leases both for adaptive load management and in response to requests from Owner libraries. The Lookup library maintains, for each range, the lease generation number and the corresponding Owner node; it contacts the Manager once every 30 seconds to fetch incremental changes and update its table. To solve the problem of message races, Centrifuge introduces a protocol with two sequence numbers, somewhat like a vector clock: if one side receives a message that does not carry the receiver's most recent sequence number, the message was sent before the previous message was processed and should be dropped.
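The following sketch (our own simplification with hypothetical class and field names, not Centrifuge's API) illustrates the two checks described above: routing a topic to an owner by hashing it into a key range, and refusing to act on a publish unless the owner still holds a lease of the expected generation for that range.

# Sketch of Centrifuge-style routing and lease checking: the key namespace
# is split into ranges, each range has an owner and a lease generation, and
# an owner only acts on a request if its lease on the range is still valid.
import hashlib

def topic_key(topic_name):
    # Hash the topic name into a flat 64-bit key space.
    return int.from_bytes(hashlib.sha1(topic_name.encode()).digest()[:8], "big")

class LookupTable:
    """Maps key ranges to (owner_address, lease_generation); refreshed from
    the Manager periodically (every 30 seconds in the paper)."""
    def __init__(self, ranges):
        # ranges: list of (range_start, range_end, owner_address, generation)
        self.ranges = ranges

    def owner_for(self, topic_name):
        key = topic_key(topic_name)
        for start, end, owner, generation in self.ranges:
            if start <= key <= end:
                return owner, generation
        return None, None

class OwnerLibrary:
    def __init__(self):
        self.leases = {}   # range_start -> generation currently held

    def holds_valid_lease(self, range_start, generation):
        # Forwarding is allowed only if the held lease matches the generation
        # the client observed; otherwise the publish must be rejected.
        return self.leases.get(range_start) == generation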

6 Discussion

At present, two kinds of solutions dominate the cloud computing area. The first is IaaS, Infrastructure as a Service, such as Rackspace Cloud [6], Joyent SmartMachines [7], and Amazon EC2 [8]. The keys to this model are server virtualization and highly automated management. Through large-scale deployment of virtualized hosts backed by high-performance storage clusters, these offerings can serve customers with effectively unlimited virtual servers in a short time, and users obtain full control and can run any application on top of them. Nevertheless, IaaS still requires users to handle several issues within their own framework or architecture, such as security, replication among data servers, and fault tolerance in application servers. The other solution is PaaS, Platform as a Service, an outgrowth of Software as a Service, such as Google App Engine [9], SmartPlatform Beta [10], and perhaps OpenStack in the future [11]. With this solution, users no longer need to worry about problems of scalability or performance stemming from the platform itself. One potential pitfall is that PaaS carries some risk of lock-in if the offering requires proprietary service interfaces or development languages. Still, it saves clients considerable time and trouble, so that they can concentrate on the business logic they really care about, and it may well become the main trend in the future.

There are still many open research issues in cloud computing. First, we believe some international standards should be established to avoid vendor lock-in, since switching to other products or services entails substantial costs for both developers and customers. Second, security and privacy are large and practical concerns for any Internet service.

Cloud computing is no exception: it must guard against malicious attacks and other hacking behavior. What is more, the proposed programming models, such as MapReduce, do not perform well on diverse data, which poses a challenge to both consistency and efficiency [12]. Furthermore, the lack of a development methodology for such systems and services contributes to serious efficiency and bottleneck problems; this kind of development should be automated in most cases. In addition, an abstraction layer focused on cluster management for virtual machines could be separated out of the present model structure. Last but not least, the scale of cloud computing services needs to keep expanding to support massive enterprise applications, and well-performing substitutes are needed for in-house management and monitoring tools.

References

[1] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), Dec. 2004.
[2] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceedings of the 13th IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 13-24, Feb. 2007.
[3] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of OSDI, pages 29-42, San Diego, CA, Dec. 2008.
[4] A. Adya, J. Dunagan, and A. Wolman. Centrifuge: Integrated Lease Management and Partitioning for Cloud Services. In Proceedings of USENIX NSDI, Apr. 2010.
[5] Apache Hadoop.
[6] Rackspace Cloud.
[7] Joyent SmartMachines.
[8] Amazon EC2.
[9] Google App Engine.
[10] SmartPlatform Beta.
[11] OpenStack.
[12] N. Rapolu, K. Kambatla, S. Jagannathan, and A. Grama. Transactional Support in MapReduce for Speculative Parallelism.
