Cloud Federation to Elastically Increase MapReduce Processing Resources




A. Panarello, A. Celesti, M. Villari, M. Fazio and A. Puliafito
{apanarello, acelesti, mfazio, mvillari, apuliafito}@unime.it
DICIEAMA, University of Messina
Contrada di Dio, S. Agata, 98166 Messina, Italy

The Second International FedICI'2014 Workshop: Federative and Interoperable Cloud Infrastructures

Outline
- Cloud federation: introduction
- How cloud federation can elastically increase providers' MapReduce resources
- Case study: a video transcoding service
- System prototype (Hadoop, CLEVER, Amazon S3)
- Main factors involved in job submission
- Conclusion and future work

Toward Cloud Federation
- Currently, only the major cloud providers (e.g., Amazon, Google, Rackspace) hold big datacenters, i.e., large virtualization infrastructures.
- Small cloud providers cannot directly compete with these market leaders; they have to buy services from these mega-providers.
- The largest share of the business is in the hands of the mega-providers.
- Possible solution: cloud federation.

Evolution of the Cloud Ecosystem
- From independent clouds to cloud federation.
- Cloud federation: a mesh of interconnected cloud providers forming a universal decentralized computing environment, where everything is driven by constraints and agreements in a ubiquitous, multi-provider infrastructure.
- Different distributed services (e.g., IaaS, PaaS, SaaS).
- One of the main challenges: minimizing the barriers to delivering services across different administrative domains.

Why Federate Cloud Providers? Multiple reasons:
- Clouds can benefit from a market in which they can buy/sell resources.
- A cloud has saturated its own resources and needs external assets.
- A cloud needs particular types of services or resources that it does not hold.
- A cloud wants to perform software consolidation in order to save energy costs.
- A cloud wants to move part of its processing to other providers (e.g., for security, performance, or for the deployment of particular location-dependent services).
- And so on...

Motivation
- MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
- The major MapReduce frameworks are not cloud-like:
  - they are often not resilient;
  - they often do not scale up/down;
  - they often require manual configuration.
- Objectives:
  - make a MapReduce framework cloud-like;
  - investigate the main concerns regarding job submission in a federated cloud environment.

MapReduce Distributed Processing in Cloud Federation: A Reference Scenario (1)
Actors:
- Multiple Cloud Providers (CPs), each running a MapReduce system in its own administrative domain.
- A public Cloud Storage Provider (CSP), offering storage services and supporting multi-part data download.
- Clients, each submitting a parallel processing request (job) to a particular CP (i.e., the home CP).
The input data is stored in a CSP (e.g., Amazon S3, Dropbox, Drive) to minimize the transmission overhead between federated CPs.

MapReduce Distributed Processing in Cloud Federation: A Reference Scenario (2)
- The client contacts the home CP, which offers a particular parallel processing service, and submits a job (specifying where the input data is stored and how to process it).
- The home CP establishes a federation with other foreign CPs and sends them sub-job instructions.
- Exploiting multi-part download, each federated CP downloads chunks of the data and processes them using its local MapReduce system.
- Each federated CP uploads its output to the CSP and sends a notification to the home CP.
- Finally, the client merges the processed chunks (if required) and reads the whole output.
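The slides do not show how the multi-part download is partitioned; a minimal sketch, assuming each federated CP is given one contiguous byte range of the input object (as an HTTP Range request against the CSP would use), could look like this:

```python
def chunk_ranges(total_size, num_domains):
    """Split a file of total_size bytes into one contiguous byte range
    per federated domain, suitable for HTTP Range requests
    (e.g., "Range: bytes=0-26214399") against a CSP such as Amazon S3."""
    base, rest = divmod(total_size, num_domains)
    ranges, start = [], 0
    for i in range(num_domains):
        size = base + (1 if i < rest else 0)  # spread the remainder bytes
        ranges.append((start, start + size - 1))  # inclusive end, HTTP-style
        start += size
    return ranges

# Example: a 100 MB movie split across 4 federated CPs
print(chunk_ranges(100 * 1024**2, 4))
```

Range boundaries here are purely size-based; a real transcoding job would also have to align chunk boundaries with the video container's keyframes.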

A Video Transcoding Use Case
- A user would like to watch a movie, stored in a CSP, using his/her mobile phone.
- Unfortunately, the movie is stored as an HD file and the user's device is not able to play it.
- Thus, the client submits a video transcoding job, to reduce the resolution of the movie, to a particular home CP.
- The job submission specifies where the input movie is stored and how to process it.
- The home CP establishes a federation with other foreign CPs, submitting a sub-job to each of them.
- Each foreign CP downloads a chunk of the file, processes it, uploads it to the CSP, and sends a notification to the home CP.
- Once the home CP has received all the notifications, it generates a SMIL file, i.e., an XML file that allows the video to be played without merging the chunks.
- The home CP uploads the SMIL file to the CSP.
- The client is now able to play the movie.
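The slides only say that a SMIL file is generated; as a sketch, a minimal playlist that plays the transcoded chunks back to back (chunk URLs are hypothetical) could be produced like this:

```python
def make_smil(chunk_urls):
    """Build a minimal SMIL playlist that plays the transcoded chunks in
    sequence, so the client never has to merge them on the device."""
    clips = "\n".join(f'      <video src="{u}"/>' for u in chunk_urls)
    return (
        "<smil>\n"
        "  <body>\n"
        "    <seq>\n"   # <seq> plays its children one after another
        f"{clips}\n"
        "    </seq>\n"
        "  </body>\n"
        "</smil>"
    )

urls = [f"https://s3.example.com/movie/chunk{i}.mp4" for i in range(3)]
print(make_smil(urls))
```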

System Prototype (1)
System components:
- Hadoop as the MapReduce framework
- CLEVER as the middleware that makes Hadoop cloud-like, with federation capabilities across CPs
- Amazon S3 as the public CSP
Hadoop master/slave architecture:
- It consists of a single master JobTracker node and several slave TaskTracker nodes.
- To speed up processing, it relies on a distributed file system, HDFS, whose NameNode and DataNodes are typically deployed on the same machines running the JobTracker and the TaskTrackers, respectively.

System Prototype (2)
CLEVER:
- The CLoud-Enabled Virtual EnviRonment (CLEVER) is a Message-Oriented Middleware for Cloud computing (MOM4C) that makes it possible to arrange federated cloud systems.
- A Cluster Manager (CM) acts as the interface with the client and manages several Host Managers (HMs).
- Inter-module communication takes place by means of XMPP Multi-User Chat (MUC) rooms.
- Pluggable architecture: agents can be added to control third-party components (sensor networks, virtualization, parallel processing, storage, etc.).

System Prototype (3)
Advantages of integrating Hadoop in CLEVER:
- Typically, Hadoop uses the TCP/IP layer for communication, so firewalls can block inter-domain communication. Solution: by integrating Hadoop in CLEVER, communication can be sent over port 80 thanks to XMPP.
- The system can automatically scale.
The two main software agents are the Hadoop Master Node (HMN) and the Hadoop Slave Node (HSN), running in the CM and in the HMs, respectively.
Two possible configurations: HMs with HSNs on physical hosts (PHs) or in VMs (more resilient).

Experiments (1)
Objective: understanding the main concerns regarding job submission in the federated cloud environment. The processing time of a Hadoop cluster was out of the scope of this paper (many works are available in the literature).
Testbed specification:
- 4 CLEVER/Hadoop administrative domains (i.e., A, B, C, and D) deployed on 4 servers
- CPU: Intel(R) Core(TM)2 CPU 6300 @ 1.86 GHz, 3 GB RAM, running Linux Ubuntu 12.04 x86_64 and VirtualBox
- Overall system deployed in 10 VMs (1 VM in domain A and 3 in each of domains B, C, and D)
- Amazon S3 as the CSP
- Each experiment repeated 50 times in order to compute mean values and confidence intervals

Experiments (2)
Timeline:
- T0: a client submits a video transcoding job to the home CP.
- T1: the home CP receives the request and decides to establish a federation with the other CPs, retrieving domain information.
- T2: the home CP performs a job assignment involving the whole federated environment. By means of the JobTracker it creates the video transcoding job and assigns the sub-jobs to the other federated domains.
- T3: each involved federated CP downloads only particular video chunks from Amazon S3 using the multi-part download mechanism.
- T4: each CP uploads the previously downloaded video chunks into the HDFS of its local domain for processing.
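The slides do not specify the policy used at T2 to assign sub-jobs to the federated domains; a simple round-robin assignment, chosen here purely for illustration, could be sketched as:

```python
def assign_subjobs(chunk_ids, domains):
    """Illustrative T2 step: distribute video chunks over the federated
    domains round-robin. The actual assignment policy is not stated in
    the slides; this is an assumed, minimal choice."""
    plan = {d: [] for d in domains}
    for i, chunk in enumerate(chunk_ids):
        plan[domains[i % len(domains)]].append(chunk)
    return plan

# 8 chunks spread over the three foreign domains of the testbed
print(assign_subjobs(list(range(8)), ["B", "C", "D"]))
```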

Experiments (3)
The average time required to retrieve domain information (t1-t0) and to forward the request in parallel to the federated CPs (t2-t1) is roughly 5 seconds.

Experiments (4)
Distribution histogram of the mean times required to download 20 MB, 10 MB, and 7 MB block sizes from Amazon S3 (t3-t2) in each CP, considering one administrative domain. Looking at the summary distribution histogram, it is evident that, thanks to federation, increasing the number of administrative domains reduces the download time, since each domain handles smaller chunks.

Experiments (5)
The average time to upload chunks into the HDFS of each domain (t4-t3) changes according to the number of active DataNodes and the video file sizes. We can notice that increasing the number of Hadoop DataNodes also increases the upload time. We can explain this trend by recalling that Hadoop was configured with a replication factor of 2. In fact, with a single active DataNode the upload time is very low, because the system does not need to replicate the file. Due to Hadoop's data replication mechanism, increasing the number of DataNodes leads to a linear increase of the upload time.
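The replication effect can be captured by a crude first-order model: HDFS cannot place more replicas than there are DataNodes, so the bytes physically written jump once a second DataNode becomes available. This sketch ignores the pipelining and coordination overhead that also contributes to the measured growth.

```python
def hdfs_write_bytes(file_size, replication, datanodes):
    """Rough estimate of the bytes physically written when a file is
    uploaded into HDFS: each block is stored min(replication, datanodes)
    times. With replication = 2 (the testbed setting), a single DataNode
    writes the file once; two or more DataNodes write every byte twice."""
    return file_size * min(replication, datanodes)

print(hdfs_write_bytes(20 * 1024**2, 2, 1))  # single DataNode: no replica
print(hdfs_write_bytes(20 * 1024**2, 2, 3))  # replication kicks in
```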

Conclusion and Future Work
- The main result has been understanding how a MapReduce parallel processing system can be deployed in a federated cloud environment.
- Experiments highlighted the overhead introduced by the system during job submission.
- In future work, we plan to integrate resource provisioning policies to make the establishment of federation relationships between CPs more flexible.
- For those interested in CLEVER, a guide on how to use the middleware and how to develop agents is available on the official web site: http://clever.unime.it

Questions?