Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000


Alexandra Carpen-Amarie, Diana Moise, Bogdan Nicolae
KerData Team, INRIA

Outline 1 Cloud Computing 2 3 4 VM management MapReduce applications

MapReduce in the Cloud
The Cloud:
- Shared computing and storage resources
- Easily accessible
- Pay-per-use model
- Elastic
- Reliable
MapReduce:
- Parallel programming model for large clusters
- Processes large amounts of data
- Provides a clean abstraction for the programmer, hiding:
  - Communication between nodes
  - Parallelization (scheduling and data distribution)
  - Fault tolerance
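To make the programming model concrete, here is a minimal single-process sketch of a word-count job. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and the shuffle step that a real framework performs across nodes is simulated with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce sequentially in a single process."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(run_job(["the cloud is elastic", "the cloud is shared"]))
```

The programmer only writes the map and reduce functions; scheduling, data distribution and fault tolerance are what the framework is expected to provide around them.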

Global view of the experiment (diagram; surviving label: Nimbus)

The BlobSeer data management system
BlobSeer:
- Data striping
- High throughput under concurrency
- Versioning-based concurrency control
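The two ideas of striping and versioning can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not the BlobSeer API: the class `ToyBlobStore`, its methods, the round-robin chunk placement and the restriction to chunk-aligned writes are all hypothetical simplifications.

```python
class ToyBlobStore:
    """Hypothetical in-memory model of a striped, versioned blob (not BlobSeer's API)."""

    def __init__(self, num_providers, chunk_size):
        self.chunk_size = chunk_size
        # One dict per storage provider: chunk_id -> bytes.
        self.providers = [dict() for _ in range(num_providers)]
        # versions[v] lists one (provider_index, chunk_id) reference per chunk.
        # Version 0 is the empty blob.
        self.versions = [[]]
        self._next_chunk_id = 0

    def _store_chunk(self, data):
        """Place a new chunk on a provider chosen round-robin (assumed policy)."""
        chunk_id = self._next_chunk_id
        self._next_chunk_id += 1
        provider = chunk_id % len(self.providers)
        self.providers[provider][chunk_id] = data
        return provider, chunk_id

    def write(self, offset, data):
        """Write chunk-aligned data and return the number of the new version."""
        size = self.chunk_size
        if offset % size or len(data) % size:
            raise ValueError("this sketch only supports chunk-aligned writes")
        layout = list(self.versions[-1])  # copy metadata only, never chunk data
        first = offset // size
        count = len(data) // size
        while len(layout) < first + count:
            layout.append(None)  # hole, read back as zeros
        for i in range(count):
            layout[first + i] = self._store_chunk(data[i * size:(i + 1) * size])
        self.versions.append(layout)  # older versions stay readable
        return len(self.versions) - 1

    def read(self, version):
        """Return the full content of the blob as of the given version."""
        parts = []
        for ref in self.versions[version]:
            if ref is None:
                parts.append(bytes(self.chunk_size))
            else:
                provider, chunk_id = ref
                parts.append(self.providers[provider][chunk_id])
        return b"".join(parts)


if __name__ == "__main__":
    store = ToyBlobStore(num_providers=3, chunk_size=4)
    v1 = store.write(0, b"aaaabbbbcccc")
    v2 = store.write(4, b"XXXX")
    print(store.read(v1), store.read(v2))
```

Because a write never modifies existing chunks, a reader of an older version is not disturbed by a concurrent writer, which is the intuition behind versioning-based concurrency control.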

BlobSeer deployment
- Scripts: /home/acarpena/bsscripts
- Configuration settings: blobseer/env.sh
- Deploy the system: launchdepl/runblobseer.sh
- Challenges:
  - Creating dynamic configuration files across multiple sites
  - Gathering results

The Nimbus cloud environment

Nimbus deployment
- Initial scripts: developed by Pierre Riteau
- Modifications:
  - Cloud spanning multiple Grid 5000 sites
  - BlobSeer as a backend for Cumulus
  - Automatic deactivation of the existing propagation mechanisms and their replacement with BlobSeer
- Script: /nimbus/deploy-nimbus-cloud.rb
- Challenges:
  - Integrating BlobSeer-related configuration files
  - Networking constraints in Grid 5000

VM cluster configuration
- One-click clusters in Nimbus
- Modifications:
  - Wrapper scripts to automatically configure clusters
  - Deploy a customized image
- Connect to the Nimbus client
- Create a VM cluster: /nimbus/cloud-client-scripts/run-all.sh

The Hadoop MapReduce framework

Running MapReduce applications in the cloud
Distributed Sort:
- Sorts key-value pairs
- The most widely used benchmark
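As an illustration of how a sort can be expressed as a MapReduce job, the sketch below range-partitions key-value pairs so that each reducer sorts one contiguous key range, and the concatenated reducer outputs are globally sorted. This is an assumed, simplified scheme run in a single process; the function names are hypothetical and the split points are taken from the full key set rather than from a sample.

```python
import bisect


def choose_splits(keys, num_reducers):
    """Pick num_reducers - 1 split points that cut the sorted keys into ranges."""
    ordered = sorted(keys)
    if not ordered:
        return []
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def partition(pairs, splits):
    """Send each pair to the reducer that owns the range containing its key."""
    buckets = [[] for _ in range(len(splits) + 1)]
    for key, value in pairs:
        buckets[bisect.bisect_right(splits, key)].append((key, value))
    return buckets


def distributed_sort(pairs, num_reducers):
    """Sort each range independently, then concatenate the ranges in order."""
    splits = choose_splits([key for key, _ in pairs], num_reducers)
    result = []
    for bucket in partition(pairs, splits):
        # Each bucket would be sorted by a separate reducer in a real job.
        result.extend(sorted(bucket, key=lambda kv: kv[0]))
    return result


if __name__ == "__main__":
    data = [(5, "e"), (1, "a"), (4, "d"), (2, "b"), (3, "c"), (0, "z")]
    print(distributed_sort(data, num_reducers=3))
```

Range partitioning is what makes the final concatenation valid: every key in one bucket is smaller than or equal to every key in the next bucket.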

VM management challenges
Typical scenario:
- The user uploads a customized VM image to the Cloud repository.
- The VM image is propagated to many compute nodes.
- The same VM image is deployed simultaneously on all nodes.
Limitations of existing approaches:
- Image propagation delays
- Huge storage space needed
- Heavy network traffic

BlobSeer-based efficient VM image management
Principles:
- Optimize VM disk access: on-demand image mirroring
- Reduce contention by striping the image
Evaluation:
- Experiments performed on Grid 5000
- 50 storage nodes
- Up to 150 compute nodes
Figure: average time per instance to boot (s) versus number of concurrent instances, comparing taktuk pre-propagation, qcow2 over PVFS (256K stripe), and our approach (256K chunks).
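On-demand mirroring can be sketched as a local cache that fetches an image chunk from the repository only the first time the VM reads it. The class below is a hypothetical illustration, not the authors' implementation: `LazyImageMirror` and the `fetch_chunk` callback are assumed names, and only the 256K default chunk size comes from the evaluation above.

```python
class LazyImageMirror:
    """Hypothetical local mirror that fetches image chunks on first access."""

    def __init__(self, fetch_chunk, image_size, chunk_size=256 * 1024):
        self.fetch_chunk = fetch_chunk  # callback: chunk index -> bytes (assumed)
        self.image_size = image_size
        self.chunk_size = chunk_size
        self.local = {}   # chunk index -> bytes already mirrored locally
        self.fetched = 0  # number of remote fetches performed so far

    def _chunk(self, index):
        """Return one chunk, contacting the repository only on a local miss."""
        if index not in self.local:
            self.local[index] = self.fetch_chunk(index)
            self.fetched += 1
        return self.local[index]

    def read(self, offset, length):
        """Read a byte range, mirroring only the chunks that the range touches."""
        end = min(offset + length, self.image_size)
        if offset >= end:
            return b""
        first = offset // self.chunk_size
        last = (end - 1) // self.chunk_size
        data = b"".join(self._chunk(i) for i in range(first, last + 1))
        start = offset - first * self.chunk_size
        return data[start:start + (end - offset)]


if __name__ == "__main__":
    image = bytes(range(256)) * 4096  # 1 MiB stand-in for a VM image
    size = 256 * 1024
    mirror = LazyImageMirror(lambda i: image[i * size:(i + 1) * size], len(image))
    mirror.read(0, 512)  # boot-time read touches a single chunk
    print(mirror.fetched, "of", len(image) // size, "chunks fetched")
```

A booting VM typically reads only part of its disk image, so fetching lazily avoids transferring the whole image to every node before any instance can start.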

BlobSeer-based cloud data service
Features:
- Cumulus: open source implementation of the Amazon S3 API
- BlobSeer: concurrency support, improved scalability through multiple servers
Evaluation:
- 8 Cumulus servers
- 10 storage nodes, 5 metadata nodes
- 1 GB file transferred
- Up to 60 concurrent clients
Figure: aggregated throughput (MB/s) versus number of clients, for reads and writes.

Improving Grid 5000 utilization
Evaluation:
- Measure the run time of Grep
- 12.5 GB of input stored in HDFS
- Run Hadoop on a number of nodes/VMs ranging from 1 to 200
Experimental setup:
- Grid 5000: 200 physical nodes
- Nimbus: 200 VMs, only 60 physical nodes
Figure: job completion time (s) versus number of machines, for physical nodes and for VMs.
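For readers unfamiliar with the Grep benchmark, the sketch below shows the general shape of such a job: each map task scans one input split for matches of a regular expression, and the reduce step counts how often each match occurs. It is a single-process approximation with hypothetical function names, not the Hadoop example code.

```python
import re
from collections import Counter


def grep_map(line, pattern):
    """Map: emit (matched_text, 1) for every match of the pattern in one line."""
    return [(m.group(0), 1) for m in pattern.finditer(line)]


def grep_job(splits, regex):
    """Run the map step over every split and reduce by summing counts per match."""
    pattern = re.compile(regex)
    counts = Counter()
    for split in splits:  # each split would be handled by one map task
        for line in split:
            for match, one in grep_map(line, pattern):
                counts[match] += one
    return counts.most_common()  # most frequent matches first


if __name__ == "__main__":
    splits = [["error disk", "ok"], ["error net error cpu"]]
    print(grep_job(splits, r"err\w+|ok"))
```

Since each split is scanned independently, adding machines lets more splits be processed in parallel, which is why completion time is reported against the number of nodes or VMs.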

Q&A