A Cost-Evaluation of MapReduce Applications in the Cloud

1/23 A Cost-Evaluation of MapReduce Applications in the Cloud
Diana Moise, Alexandra Carpen-Amarie, Gabriel Antoniu, Luc Bougé
KerData team

2/23 1 MapReduce applications - case study 2 3 4 5

3/23 MapReduce applications - case study
What is MapReduce?
- Parallel programming model for large clusters
- Processes large amounts of data
- Provides a clean abstraction for the programmer; the framework handles:
  - Communication between nodes
  - Parallelization (scheduling and data distribution)
  - Fault tolerance

4/23 MapReduce applications - case study
MapReduce applications - Distributed Grep
- Scans the input to find occurrences of a given expression
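As an illustration of Distributed Grep, here is a minimal single-process sketch in Python. It is not the Hadoop implementation: the function names (`grep_map`, `grep_reduce`, `distributed_grep`) and the sample log lines are hypothetical, and the input splits are simulated as in-memory lists rather than blocks stored in a distributed file system.

```python
import re
from collections import defaultdict


def grep_map(lines, pattern):
    """Map phase: emit (line, 1) for every line that matches the pattern."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line, 1


def grep_reduce(line, counts):
    """Reduce phase: sum the counts emitted for one matching line."""
    return line, sum(counts)


def distributed_grep(splits, pattern):
    """Run map on each split, group by key (shuffle), then reduce."""
    grouped = defaultdict(list)
    for lines in splits:  # on a cluster, each split would go to a different mapper
        for key, value in grep_map(lines, pattern):
            grouped[key].append(value)
    return dict(grep_reduce(key, values) for key, values in grouped.items())


# Hypothetical input, already divided into two splits.
splits = [
    ["error: disk full", "ok"],
    ["ok", "error: disk full", "error: timeout"],
]
result = distributed_grep(splits, r"error")
print(result)
```

Each split is scanned independently, which is what makes the job easy to parallelize: only the matching lines travel from the map phase to the reduce phase.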

5/23 MapReduce applications - case study
MapReduce applications - Distributed Sort
- Sorts key-value pairs
- Most widely used benchmark
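Distributed Sort can be sketched in the same spirit. The version below assumes a range partitioner: every key is routed to a reducer according to fixed boundaries, each reducer sorts its own partition, and the concatenation of the partitions is globally sorted. The boundaries, the function names and the sample pairs are illustrative assumptions, not values from the slides.

```python
import bisect


def partition(key, boundaries):
    """Return the index of the reducer responsible for this key."""
    return bisect.bisect_right(boundaries, key)


def distributed_sort(splits, boundaries):
    """Identity map, range partitioning, then a local sort per reducer."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for split in splits:  # map phase: pairs are forwarded unchanged
        for key, value in split:
            partitions[partition(key, boundaries)].append((key, value))
    output = []
    for part in partitions:  # reduce phase: each reducer sorts its own range
        output.extend(sorted(part, key=lambda pair: pair[0]))
    return output


# Hypothetical key-value pairs spread over two input splits.
splits = [[(42, "a"), (7, "b")], [(19, "c"), (3, "d"), (88, "e")]]
sorted_pairs = distributed_sort(splits, boundaries=[10, 50])
print(sorted_pairs)
```

Because the partitions cover disjoint, ordered key ranges, no merge step across reducers is needed.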

6/23 Amazon Elastic Compute Cloud (EC2)...
- The most widely-used IaaS...
- Pay-per-use model:
  - rented resources in the Cloud
  - data transfers to/from the Cloud
  - data transfers between VMs are free of charge

7/23 EC2 Costs
- Resource costs, with per-second charges
- Data transfers:
  - $0.10 per GB for incoming data
  - $0.15 per GB for downloaded data
  - free download for less than 1 GB of data
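The data-transfer rates above can be turned into a small calculator. This is a sketch under one stated assumption: the free-download rule is read as "a download smaller than 1 GB costs nothing", since the slide does not say whether the first GB of a larger download is also free. The function and constant names are hypothetical.

```python
INCOMING_USD_PER_GB = 0.10     # rate for data sent into the Cloud (from the slide)
DOWNLOAD_USD_PER_GB = 0.15     # rate for data downloaded from the Cloud (from the slide)
FREE_DOWNLOAD_LIMIT_GB = 1.0   # downloads below this size are free (assumed reading)


def transfer_cost(incoming_gb, downloaded_gb):
    """Return the data-transfer cost in USD for one job."""
    upload = incoming_gb * INCOMING_USD_PER_GB
    if downloaded_gb < FREE_DOWNLOAD_LIMIT_GB:
        download = 0.0
    else:
        download = downloaded_gb * DOWNLOAD_USD_PER_GB
    return upload + download


# Example: upload 12.5 GB and download a 0.5 GB result (the output size is hypothetical).
small_download = transfer_cost(12.5, 0.5)
# Example: upload 12.5 GB and download 12.5 GB (the output size is hypothetical).
large_download = transfer_cost(12.5, 12.5)
print(small_download, large_download)
```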

8/23 Two goals:
- Measure the overhead of porting MapReduce applications to the Cloud
- Estimate the cost of running MapReduce applications in the Cloud
Run MapReduce applications with Hadoop on 2 platforms: the Grid and the Cloud

9/23 - Tools
- OAR: fine-grain reservations
- Kadeploy: deploy customized images
- API: access resources through HTTP
- Taktuk: launch parallel remote executions

10/23 Reference open-source IaaS cloud

11/23 Who runs on?

12/23 Hadoop
- Yahoo!'s implementation of MapReduce
- Open-source Java project
- Large-scale computation and data processing
- Works on commodity hardware

13/23
- Core
- Distributed File System (HDFS)
- MR framework

14/23 In-production use at...

15/23 - Running on 220 nodes from Rennes and Orsay
- automatic deployment
- one namenode
- one jobtracker
- datanodes co-deployed with tasktrackers
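The layout described on this slide can be sketched as a role assignment over a list of reserved nodes. The helper below is hypothetical and makes two assumptions the slide does not confirm: the namenode and the jobtracker each get a dedicated node, and the hostnames are invented for the example.

```python
def assign_roles(nodes):
    """Map a list of hostnames to Hadoop roles.

    Assumed layout: the first node is the namenode, the second is the
    jobtracker, and every remaining node runs both a datanode and a
    tasktracker (co-deployment).
    """
    if len(nodes) < 3:
        raise ValueError("need at least one namenode, one jobtracker and one worker")
    namenode, jobtracker, *workers = nodes
    return {
        "namenode": namenode,
        "jobtracker": jobtracker,
        "workers": {host: ["datanode", "tasktracker"] for host in workers},
    }


# Hypothetical hostnames for a 220-node reservation.
nodes = [f"node-{i}" for i in range(220)]
layout = assign_roles(nodes)
print(layout["namenode"], layout["jobtracker"], len(layout["workers"]))
```

Co-deploying datanodes with tasktrackers lets a task read its input block from the local disk instead of over the network.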

16/23 - Running on 60 nodes on parapide and parapluie deployment Ruby, API deployment customized image with and HDFS IP addresses belonging to Rennes routed private network public key authentication Client deploys a cluster of images

17/23 Performance evaluation
- Goal: compare Hadoop's performance in the 2 setups
- Measure run time for Grep and Sort
- 12.5 GB of input stored in HDFS
- Run on a number of nodes/VMs ranging from 1 to 200

18/23 Performance evaluation
[Figure panels: Grep, Sort]

19/23 Cost evaluation
- Goal: estimate the cost of running Grep and Sort in the Cloud
- 12.5 GB of input stored in HDFS
- Run on a number of VMs ranging from 1 to 200
- Costs:
  - CPU cost = number of VMs × runtime × VM cost
  - data transfers = (input size + output size) in GB × cost per GB
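The two cost formulas translate directly into code. The sketch below assumes that the total cost of a job is the sum of the CPU cost and the data-transfer cost, and that the VM cost is expressed per second. The runtime, VM rate and output size in the example are hypothetical placeholders, not measurements from the evaluation.

```python
def cpu_cost(num_vms, runtime_s, vm_cost_per_s):
    """CPU cost = number of VMs x runtime x VM cost."""
    return num_vms * runtime_s * vm_cost_per_s


def data_transfer_cost(input_gb, output_gb, cost_per_gb):
    """Data transfers = (input size + output size) x cost per GB."""
    return (input_gb + output_gb) * cost_per_gb


def total_cost(num_vms, runtime_s, vm_cost_per_s, input_gb, output_gb, cost_per_gb):
    """Assumed total: CPU cost plus data-transfer cost."""
    return (cpu_cost(num_vms, runtime_s, vm_cost_per_s)
            + data_transfer_cost(input_gb, output_gb, cost_per_gb))


# Hypothetical job: 100 VMs for 120 s at $0.00003 per VM-second,
# 12.5 GB of input, 12.5 GB of output, $0.10 per GB transferred.
example = total_cost(100, 120, 0.00003, 12.5, 12.5, 0.10)
print(round(example, 2))
```

Adding VMs multiplies the first factor of the CPU cost, so the total only drops if the runtime shrinks faster than the VM count grows.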

20/23 Cost evaluation [1]
[Figure panels: Grep cost, Sort cost]

21/23 Cost evaluation [2]
- The cost of running Grep and Sort on 100 machines for two types of VMs
- The overhead of running Grep and Sort on the Cloud compared to running on the Grid

22/23 Context: executing MapReduce applications in grids and clouds
- 2 setups:
  1. Hadoop running on the Grid
  2. Hadoop running on the cloud deployed on the Grid
- Evaluation:
  - performance
  - costs
  - impact of VM types

23/23 Thank you!