A Survey of Cloud Computing

Guanfeng
Nov 7, 2010

Abstract

The principal characteristic of cloud computing is that the underlying infrastructure of a cloud-based platform or system, which often consists of compute resources such as storage, processors, RAM, and load balancers, is entirely abstracted from the consumer of the software or services. This characteristic calls for well-designed programming models, runtime systems, and communication protocols that can efficiently share computing capacity among large numbers of heterogeneous nodes. Such systems reduce complexity and allow programmers to employ parallel and distributed resources without introducing their own cumbersome and incompatible mechanisms. In this survey, we analyze and summarize several techniques proposed in this research field. Our study focuses on the programming models, algorithms, and protocols used in real cloud platforms: their system implementations on different resource bases, their concurrency and fault tolerance mechanisms, and their approaches to load balancing and scheduling.

1 Introduction

With the vigorous development of the Internet industry and the rapid increase in the volume of data to be processed, people have become increasingly interested in utilizing parallel computing resources to address their needs. However, programming or even maintaining a distributed system is complicated and expensive. The differing demands of individual customers, such as availability and performance, and the variety of computing resources, such as commodity PC clusters, multi-core and shared-memory systems, or hybrids of both, make this topic even more difficult. Thus, research in cloud computing has for several years focused on building resource-abstraction layers, that is, on devising programming models and related protocols that free users from intricate resource management under relatively general conditions. Several efficient solutions have been proposed, such as MapReduce [1], Phoenix [2], LATE [3], Centrifuge [4], and many others.

MapReduce was proposed as a programming model for datacenter algorithms and can be employed by many real-world tasks. Inverted index construction, one commonly used program, can be easily implemented on a large cluster with MapReduce computations. A computation consists of two parts, Map and Reduce, both defined by the user. The model first splits the input data records into pieces. The master node then dynamically selects slave workers among the candidate nodes and assigns one or more Map tasks, whose functionality is specified by the user, to each of them. Each worker passes over its input data and processes each pair in the way determined by the Map function. The worker then stores its intermediate key/value pairs and partitions them into a certain number of regions, as decided by a user-supplied partitioning function. Notified by the master node, the Reduce workers learn of the existence and location of the corresponding hashed results. After remotely reading (in a real cluster infrastructure) the processed pairs as input to the second phase, a.k.a. Reduce, they sort the intermediate values and group them by key.
Finally, the output files of the Reduce function together constitute the result of the user-defined computation.
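The two-phase computation described above can be sketched in a single process as follows. This is a minimal illustration of the programming model, not Google's implementation; the function names and the inverted-index example are our own.

```python
from collections import defaultdict

# Minimal single-process sketch of the MapReduce model: a user-defined
# Map function emits key/value pairs, intermediate pairs are partitioned
# by a hash of the key, and a user-defined Reduce function merges all
# values for each key. The example builds an inverted index
# (word -> sorted list of document ids).

def map_fn(doc_id, text):
    # User-defined Map: emit (word, doc_id) for every word in a document.
    for word in text.split():
        yield word.lower(), doc_id

def reduce_fn(word, doc_ids):
    # User-defined Reduce: merge all doc ids seen for one word.
    return word, sorted(set(doc_ids))

def map_reduce(inputs, map_fn, reduce_fn, n_partitions=4):
    # Map phase: apply map_fn to every input record and partition the
    # intermediate pairs into regions by hashing the key.
    partitions = [defaultdict(list) for _ in range(n_partitions)]
    for key, value in inputs:
        for k, v in map_fn(key, value):
            partitions[hash(k) % n_partitions][k].append(v)
    # Reduce phase: within each partition, sort keys, group values by
    # key, and apply reduce_fn to each group.
    results = {}
    for part in partitions:
        for k in sorted(part):
            out_key, merged = reduce_fn(k, part[k])
            results[out_key] = merged
    return results

docs = [(1, "the cloud"), (2, "the cluster"), (3, "cloud cluster cloud")]
index = map_reduce(docs, map_fn, reduce_fn)
# index maps each word to the documents containing it,
# e.g. index["cloud"] == [1, 3]
```

In a real cluster, each partition would be written to local disk and fetched remotely by a Reduce worker; here the partitions simply live in memory.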
2 Infrastructure

Platforms and runtime systems built on different computing resources have different characteristics and demands. The MapReduce programming model proposed by Google is implemented on commodity PC clusters. During execution, it must take into account the locations of intermediate files, network traffic, the communication protocols between machines, and hardware failures. Some of these concerns do not arise on other resource bases. Phoenix, for example, an implementation of MapReduce for shared-memory systems, focuses instead on buffer management, a quite different fault recovery mechanism, and performance management (such as using prefetching to optimize locality). For both multi-core chips and conventional symmetric multiprocessor environments, Phoenix manages Map-Reduce buffers and Reduce-Merge buffers, whose access permissions are well specified, to store intermediate and final output key/value pairs separately. Another programming model, Centrifuge, was proposed by Microsoft for in-memory server pools. This datacenter lease manager provides a usable replacement for an internal load balancer while guaranteeing system correctness. However, the nodes inside a computing cluster may not be homogeneous and will likely make progress at different speeds. Longest Approximate Time to End (LATE) is a scheduling algorithm designed for this kind of heterogeneous situation.

3 Availability and Fault Tolerance

In the MapReduce implementation, the master node periodically sends an ICMP echo request to each worker and waits for a response. After waiting a certain time without receiving any answer, the master assumes the worker has crashed. The incomplete Map or Reduce tasks previously assigned to that worker would
be re-executed by other nodes. Even when encountering a large-scale failure, this method still guarantees that the computation terminates. As for the master node, it writes periodic checkpoints of all computation state, which can be reloaded into a substitute copy if the master fails. Assuming deterministic functions and input, the MapReduce implementation produces the same output, only with a longer execution time, and thereby meets the requirement of fault tolerance.

The Phoenix system uses a different approach to worker failure detection. Since a shared-memory system cannot, and has no need to, use network-layer utilities to test the vital signs of workers, the scheduler simply marks a worker as a transient failure if it does not complete a task within a reasonable amount of time. The failure is upgraded to permanent if the task fails several times or the worker frequently fails to finish. Workers with this label are never trusted again, and no further tasks are assigned to them. A potential problem here, we think, is how to determine the reasonable processing time slot; there are both pros and cons to letting users specify this value versus having the system determine it entirely by itself. If the system sets the time range too short, even if not far from the reasonable value, many computing resources are wasted re-executing the same tasks. Conversely, if the time range is much longer than the task needs, it takes longer to detect a failure, causing performance degradation as a consequence.

In the Centrifuge model, the Manager service, which decides the logic of lease management, partitioning, and adaptive load balancing, consists of two parts: a Paxos group, and a set comprising a leader server and standby servers. The Paxos group determines distributed consensus in general; in this implementation, it picks a leader from among all candidates, providing a highly available store.
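The timeout-based escalation that Phoenix uses, described earlier in this section, can be sketched as follows. The class name and the failure threshold are our own illustrative assumptions, not Phoenix's actual code.

```python
# Illustrative sketch (not Phoenix's actual implementation) of
# timeout-based failure escalation: a worker that misses its deadline is
# counted as a transient failure, and repeated failures upgrade it to a
# permanent failure, after which it is never scheduled again.

TRANSIENT_LIMIT = 3  # assumed threshold: timeouts before permanent

class WorkerStatus:
    def __init__(self):
        self.failures = 0
        self.permanent = False

    def record_timeout(self):
        # Called when a task exceeds the "reasonable" time slot.
        self.failures += 1
        if self.failures >= TRANSIENT_LIMIT:
            self.permanent = True  # never trusted again

    def schedulable(self):
        # Permanently failed workers receive no further tasks.
        return not self.permanent

w = WorkerStatus()
w.record_timeout()   # first timeout: still only a transient failure
w.record_timeout()
w.record_timeout()   # third timeout: escalated to permanent
```

The open question raised above, how long the "reasonable" time slot should be, corresponds here to whoever decides when `record_timeout` fires.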
Every operation the leader executes that changes its internal state must be reported to the Paxos group. These sequences of requests are stored at Paxos and given to a
specified standby server when that standby server is elected as the new leader after the old one crashes. For efficiency, the Owner and Lookup libraries send requests only to the valid leader; they broadcast messages to all servers only if the leader they previously contacted is unresponsive. An update occurs when a library receives a response from a new Manager node, which then becomes the sole destination of all subsequent requests.

4 Performance

In a computer cluster, network bandwidth is an important factor to take into account. When a large number of Map workers must read their input files from remote disks, they consume plenty of scarce network resources and certainly lower performance. Thus, Google's implementation feeds the location information of the input data into the master node's decisions and tries to assign as many Map tasks as possible to physical nodes whose local disks already hold the input data. In addition, the implementation spawns redundant executions to cope with potential stragglers toward the end of both stages. A straggler here means a machine that spends an unreasonably long time executing its tasks during the last few waves of the computation. In this period, customers have to wait on only a few workers, which may even need the help of the fault tolerance mechanism later, obviously prolonging the total execution time seen by users. Letting the master have several machines perform the same operation avoids this poor circumstance and speeds up processing. Another way this model improves performance is by decreasing task granularity: ideally, the more task pieces that can be scheduled, the more balanced the resulting state. Within the memory capacity of the master node, this method accelerates recovery as well. However, we think the overhead introduced by the additional partitioning operations should also be taken into account.
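The redundant-execution idea from this section can be sketched as follows. The function name, the "last wave" condition, and the 5% threshold are our own illustrative assumptions, not Google's actual scheduler.

```python
# Illustrative sketch of backup (speculative) execution near the end of
# a job: once only a small tail of in-progress tasks remains, the master
# schedules a duplicate copy of each on an idle worker and accepts
# whichever copy finishes first. Thresholds are assumptions.

def pick_backup_tasks(in_progress, total_tasks, idle_workers,
                      tail_fraction=0.05):
    """Return the tasks to duplicate when the job is in its last waves."""
    # Only start backups once the remaining tasks are a small tail of
    # the whole job (assumed threshold: 5% of all tasks).
    if len(in_progress) > tail_fraction * total_tasks:
        return []
    # Duplicate at most as many stragglers as there are idle workers.
    return in_progress[:len(idle_workers)]

remaining = ["map-97", "map-98", "map-99"]
backups = pick_backup_tasks(remaining, total_tasks=100,
                            idle_workers=["w1", "w2"])
# backups == ["map-97", "map-98"]
```

The trade-off the text mentions shows up directly: a smaller `tail_fraction` wastes fewer duplicate executions but leaves stragglers running longer before a backup is launched.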
The shared-memory system presents a different situation. Instead of taking physical location information into account, the runtime employs a prefetching scheme, similar to double-buffering in streaming models, to optimize performance. Moreover, Phoenix uses hardware compression of intermediate pairs to preserve bandwidth and cache resources. In heterogeneous environments, the LATE algorithm modifies the scheduling mechanism to approach optimal performance on its target infrastructure; we discuss it in the following section.

5 Scheduling and Load Balancing

Phoenix and the original Hadoop system [5] have much in common in scheduling: splitting input data, assigning Map tasks to workers, and notifying Reduce workers of file locations or memory addresses. Nevertheless, a new algorithm for heterogeneous computing resources was designed and outperforms the existing ones. LATE first challenges the assumptions of homogeneity and linear progress. Hadoop's scheduling algorithm ranks tasks in the following order: failed tasks, unscheduled tasks, and straggler tasks. A monitor inside the system selects a speculative task by its progress score, a real number from 0 to 1; for Reduce tasks, the score is divided equally among the task's three phases. A task that has run for longer than one minute and whose score is less than the average minus 0.2 is marked as a straggler. The master node also guarantees that at most one speculative copy of each task is running at a time.

LATE instead picks stragglers by estimating and comparing tasks' finish times in the future. The authors provide a simple heuristic to evaluate the remaining time and also let users plug their own estimators into the model. The scheduler calculates the progress rate as (progress score) / (execution time), and the estimated time left as (1 - progress score) / (progress rate). It determines whether a task is a straggler depending on whether its progress rate is below the SlowTaskThreshold. Furthermore, the system launches speculative tasks only on fast nodes; a SlowNodeThreshold distinguishes fast nodes from slow ones. Another parameter, SpeculativeCap, limits the total number of speculative tasks that can run at the same time. Although the three parameters were set to the 25th percentile of task progress rates, the 25th percentile of node progress, and 10% of available task slots, respectively, and perform well in experiments, the authors should still better justify these choices and identify the best combination.

For in-memory server pools, Centrifuge is proposed as a replacement for an internal network load balancer. It provides a Manager service, a Lookup library, and an Owner library. When a server wants to publish an event to a specified topic, it queries its Lookup library with the hash value of the topic name; the library returns the address of the appropriate server hosting that topic. The servers linked with the Owner library receive leases on these topics, also based on the same hash keys. Before such a server forwards an event to all parties upon receiving a publish message, it must check with its Owner library that it holds a valid lease. The Manager service's job is to partition a flat key namespace into variable-length ranges among all the servers linked with Owner libraries. It then distributes the corresponding partitioning assignments with a lease protocol, and reassigns leases both for adaptive load management and at the request of Owner libraries. The Lookup library maintains, for each range, the lease generation number and the corresponding Owner node; it contacts the Manager once every 30 seconds to fetch incremental changes and keep its table up to date.
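The hash-based lookup just described, where a flat key namespace is split into ranges each owned by one server, can be sketched as follows. This is our own simplified illustration, not Centrifuge's actual library; the namespace size and data layout are assumptions.

```python
import bisect
import hashlib

# Simplified illustration (not Centrifuge's actual code) of a Lookup
# table mapping a flat hash namespace, split into contiguous ranges,
# to owner servers. Each entry is (range_end_exclusive, owner, lease
# generation); the generation number would be used to detect stale
# leases, which this sketch does not model.

SPACE = 2**32  # assumed size of the flat key namespace

def key_hash(topic):
    # Hash a topic name into the flat namespace.
    return int(hashlib.sha1(topic.encode()).hexdigest(), 16) % SPACE

class LookupTable:
    def __init__(self, ranges):
        # ranges: list of (range_end_exclusive, owner, generation),
        # sorted by range end, covering the whole namespace.
        self.ends = [r[0] for r in ranges]
        self.ranges = ranges

    def owner_of(self, topic):
        # Find the first range whose end exceeds the topic's hash.
        i = bisect.bisect_right(self.ends, key_hash(topic))
        return self.ranges[i][1]

# Two ranges splitting the namespace between two hypothetical servers.
table = LookupTable([(SPACE // 2, "serverA", 7), (SPACE, "serverB", 7)])
owner = table.owner_of("news/sports")  # "serverA" or "serverB"
```

A real Lookup library would additionally refresh this table from the Manager (every 30 seconds, per the text) and attach the lease generation to each request so the Owner can reject stale routing.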
To solve the problem of message races, Centrifuge introduces a protocol using two sequence numbers, similar in some sense to a vector clock. If one side receives a message that does not contain the receiver's most recent sequence number, that message was sent before the previous message was processed and
should be dropped.

6 Discussion

At present, two kinds of solutions dominate the cloud computing area. The first is IaaS, Infrastructure as a Service, such as Rackspace Cloud [6], Joyent SmartMachines [7], and Amazon EC2 [8]. The keys to this model are server virtualization and highly automated management. Through large-scale deployment of virtualized hosts backed by high-performance storage clusters, these solutions can serve customers with virtually unlimited virtual servers at short notice, and users get full control and can run any of their applications on top. Nevertheless, IaaS still leaves users to cope with several issues within their framework or architecture, such as security, replication among data servers, and fault tolerance in application servers. The other solution is PaaS, Platform as a Service, an outgrowth of Software as a Service, with examples including Google App Engine [9], SmartPlatform Beta [10], and perhaps OpenStack in the future [11]. With this solution, users need not worry about scalability or performance problems arising from the platform itself. One potential pitfall is that PaaS may involve some risk of lock-in if offerings require proprietary service interfaces or development languages. Still, it saves clients considerable time and trouble, letting them spend far more of their time on the business logic they really care about, and it could become the main trend in the future.

There are still many open research issues in the cloud computing field. First of all, we hold the opinion that international standards should be established to avoid vendor lock-in, because switching to other products or services imposes substantial costs on both developers and customers. Secondly, security and privacy are big, practical topics for any Internet service, and cloud computing is no exception: it must guard against malicious attacks and other hacking behavior. What is more, the proposed programming models, such as MapReduce, do not perform well on diverse data, which poses a challenge to both consistency and efficiency [12]. Furthermore, the lack of a development methodology for such systems and services contributes to serious efficiency and bottleneck issues; this kind of development should be automated in most cases. In addition, an abstraction layer focusing on cluster management for virtual machines could be separated from the present model structure. Last but not least, the scale of cloud computing services needs to keep expanding to support massive enterprise applications, and well-performing substitutes for in-house management and monitoring tools are also needed.
References

[1] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), Dec 2004.
[2] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 13-24, Feb 2007.
[3] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In Proc. OSDI, pages 29-42, San Diego, CA, Dec 2008.
[4] A. Adya, J. Dunagan, and A. Wolman. Centrifuge: Integrated Lease Management and Partitioning for Cloud Services. In Proceedings of USENIX NSDI, Apr 2010.
[5] Hadoop.
[6] Rackspace Cloud.
[7] Joyent SmartMachines.
[8] Amazon EC2.
[9] Google App Engine.
[10] SmartPlatform Beta.
[11] OpenStack.
[12] N. Rapolu, K. Kambatla, S. Jagannathan, and A. Grama. Transactional Support in MapReduce for Speculative Parallelism.
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationHypertable Architecture Overview
WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for
More informationCSE-E5430 Scalable Cloud Computing Lecture 11
CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 30.11-2015 1/24 Distributed Coordination Systems Consensus
More informationMap Reduce / Hadoop / HDFS
Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview
More informationHadoop Parallel Data Processing
MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for
More informationA Secure Strategy using Weighted Active Monitoring Load Balancing Algorithm for Maintaining Privacy in Multi-Cloud Environments
IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 10 April 2015 ISSN (online): 2349-784X A Secure Strategy using Weighted Active Monitoring Load Balancing Algorithm for Maintaining
More informationCloud Computing - Architecture, Applications and Advantages
Cloud Computing - Architecture, Applications and Advantages 1 Arun Mani Tripathi 2 Rizwan Beg NIELIT Ministry of C&I.T., Govt. of India 2 Prof. and Head, Department 1 of Computer science and Engineering,Integral
More informationGigaSpaces Real-Time Analytics for Big Data
GigaSpaces Real-Time Analytics for Big Data GigaSpaces makes it easy to build and deploy large-scale real-time analytics systems Rapidly increasing use of large-scale and location-aware social media and
More informationA Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems
A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems Aysan Rasooli Department of Computing and Software McMaster University Hamilton, Canada Email: rasooa@mcmaster.ca Douglas G. Down
More informationTutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
More informationDaniel J. Adabi. Workshop presentation by Lukas Probst
Daniel J. Adabi Workshop presentation by Lukas Probst 3 characteristics of a cloud computing environment: 1. Compute power is elastic, but only if workload is parallelizable 2. Data is stored at an untrusted
More informationMAPREDUCE Programming Model
CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce
More informationF1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013
F1: A Distributed SQL Database That Scales Presentation by: Alex Degtiar (adegtiar@cmu.edu) 15-799 10/21/2013 What is F1? Distributed relational database Built to replace sharded MySQL back-end of AdWords
More information24/11/14. During this course. Internet is everywhere. Frequency barrier hit. Management costs increase. Advanced Distributed Systems Cloud Computing
Advanced Distributed Systems Cristian Klein Department of Computing Science Umeå University During this course Treads in IT Towards a new data center What is Cloud computing? Types of Clouds Making applications
More informationThe Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have
More informationCentrifuge: Integrated Lease Management and Partitioning for Cloud Services
Centrifuge: Integrated Lease Management and Partitioning for Cloud Services Atul Adya, John Dunagan, Alec Wolman Google, Microsoft Research Abstract: Making cloud services responsive is critical to providing
More informationBSPCloud: A Hybrid Programming Library for Cloud Computing *
BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,
More informationWelcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
More informationCloud Computing. Chapter 1 Introducing Cloud Computing
Cloud Computing Chapter 1 Introducing Cloud Computing Learning Objectives Understand the abstract nature of cloud computing. Describe evolutionary factors of computing that led to the cloud. Describe virtualization
More informationProcessing of Hadoop using Highly Available NameNode
Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale
More informationJournal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationImproving MapReduce Performance in Heterogeneous Environments
Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica University of California, Berkeley {matei,andyk,adj,randy,stoica}@cs.berkeley.edu
More informationCloud Computing. Chapter 1 Introducing Cloud Computing
Cloud Computing Chapter 1 Introducing Cloud Computing Learning Objectives Understand the abstract nature of cloud computing. Describe evolutionary factors of computing that led to the cloud. Describe virtualization
More informationHadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
More informationGraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
More informationAnalysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms
Volume 1, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Analysis and Research of Cloud Computing System to Comparison of
More informationCloud Computing. Chapter 1 Introducing Cloud Computing
Cloud Computing Chapter 1 Introducing Cloud Computing Learning Objectives Understand the abstract nature of cloud computing. Describe evolutionary factors of computing that led to the cloud. Describe virtualization
More informationA New Approach of CLOUD: Computing Infrastructure on Demand
A New Approach of CLOUD: Computing Infrastructure on Demand Kamal Srivastava * Atul Kumar ** Abstract Purpose: The paper presents a latest vision of cloud computing and identifies various commercially
More informationNetworking in the Hadoop Cluster
Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop
More informationBig Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
More informationComparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
More informationRelational Databases in the Cloud
Contact Information: February 2011 zimory scale White Paper Relational Databases in the Cloud Target audience CIO/CTOs/Architects with medium to large IT installations looking to reduce IT costs by creating
More informationInternational Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationCloud Computing for SCADA
Cloud Computing for SCADA Moving all or part of SCADA applications to the cloud can cut costs significantly while dramatically increasing reliability and scalability. A White Paper from InduSoft Larry
More informationAnalysis and Optimization of Massive Data Processing on High Performance Computing Architecture
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National
More informationCluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful.
Architectures Cluster Computing Job Parallelism Request Parallelism 2 2010 VMware Inc. All rights reserved Replication Stateless vs. Stateful! Fault tolerance High availability despite failures If one
More informationCloud Computing 159.735. Submitted By : Fahim Ilyas (08497461) Submitted To : Martin Johnson Submitted On: 31 st May, 2009
Cloud Computing 159.735 Submitted By : Fahim Ilyas (08497461) Submitted To : Martin Johnson Submitted On: 31 st May, 2009 Table of Contents Introduction... 3 What is Cloud Computing?... 3 Key Characteristics...
More informationSolving I/O Bottlenecks to Enable Superior Cloud Efficiency
WHITE PAPER Solving I/O Bottlenecks to Enable Superior Cloud Efficiency Overview...1 Mellanox I/O Virtualization Features and Benefits...2 Summary...6 Overview We already have 8 or even 16 cores on one
More information