1/23 A Cost-Evaluation of MapReduce Applications in the Cloud. Diana Moise, Alexandra Carpen-Amarie, Gabriel Antoniu, Luc Bougé (KerData team)
2/23 1 MapReduce applications - case study 2 3 4 5
3/23 MapReduce applications - case study. What is MapReduce? A parallel programming model for large clusters; processes large amounts of data; provides a clean abstraction for the programmer over: communication between nodes, parallelization (scheduling and data distribution), fault tolerance.
4/23 MapReduce applications - case study. MapReduce applications - Distributed Grep: scans the input to find occurrences of a certain expression.
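A minimal sketch of Distributed Grep expressed as a map and a reduce function. It is plain Python that simulates the phases sequentially, not code against any MapReduce framework API; the function names (`grep_map`, `grep_reduce`, `run_grep`) and the in-memory layout of the input splits are assumptions made for illustration.

```python
import re
from collections import defaultdict

def grep_map(lines, pattern):
    """Map: emit (line, 1) for every input line that matches the pattern."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line, 1

def grep_reduce(line, counts):
    """Reduce: sum the number of times each matching line was seen."""
    return line, sum(counts)

def run_grep(splits, pattern):
    """Sequentially simulate the map, shuffle and reduce phases."""
    grouped = defaultdict(list)
    for lines in splits:                      # one map task per input split
        for key, value in grep_map(lines, pattern):
            grouped[key].append(value)        # shuffle: group values by key
    return dict(grep_reduce(k, v) for k, v in grouped.items())

# Hypothetical input: two splits of a log file.
splits = [
    ["error: disk full", "ok"],
    ["error: disk full", "warning: slow"],
]
result = run_grep(splits, r"error")
```

In a real deployment each split would be processed by a separate map task on a different node; the grouping step stands in for the shuffle.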
5/23 MapReduce applications - case study. MapReduce applications - Distributed Sort: sorts key-value pairs; the most used benchmark.
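A minimal sketch of one common way to express a sort in MapReduce: an identity map, a range partitioner that sends each key range to one reducer, and a local sort in every reducer. The slides do not say which sort implementation was benchmarked, so the partitioning scheme, the function names and the boundary values below are assumptions.

```python
from bisect import bisect_right

def partition(key, boundaries):
    """Range partitioner: return the index of the reducer that owns this key."""
    return bisect_right(boundaries, key)

def run_sort(pairs, boundaries):
    """Sequentially simulate identity map, range partitioning and per-reducer sort."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in pairs:                  # identity map + partition
        buckets[partition(key, boundaries)].append((key, value))
    output = []
    for bucket in buckets:                    # each reducer sorts its own key range
        output.extend(sorted(bucket, key=lambda kv: kv[0]))
    return output                             # concatenated outputs are globally sorted

# Hypothetical input and reducer boundaries (3 reducers).
pairs = [(17, "c"), (3, "a"), (25, "d"), (10, "b")]
sorted_pairs = run_sort(pairs, boundaries=[10, 20])
```

Because the key ranges assigned to the reducers do not overlap and are ordered, concatenating the reducer outputs gives a total order without a final merge step.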
6/23 Amazon Elastic Compute Cloud (EC2): the most widely used IaaS. Pay-per-use model: charges apply to rented resources in the Cloud and to data transfers to/from the Cloud; data transfers between VMs are free of charge.
7/23 EC2 Costs. Resource costs: per-second charges. Data transfers: $0.10 per GB for incoming data; $0.15 per GB for downloaded data; free download for less than 1 GB of data.
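The data-transfer pricing above can be sketched as a small function. The rates are the ones quoted on the slide; reading "free download for less than 1 GB" as "a download under 1 GB costs nothing, anything larger is charged in full" is an assumption, as are the function name and the example sizes.

```python
INCOMING_RATE = 0.10           # $ per GB uploaded to the Cloud (from the slide)
DOWNLOAD_RATE = 0.15           # $ per GB downloaded from the Cloud (from the slide)
FREE_DOWNLOAD_LIMIT_GB = 1.0   # downloads below this size are free (assumed reading)

def transfer_cost(upload_gb, download_gb):
    """Return the data-transfer cost in dollars for one job."""
    upload = upload_gb * INCOMING_RATE
    if download_gb < FREE_DOWNLOAD_LIMIT_GB:
        download = 0.0
    else:
        download = download_gb * DOWNLOAD_RATE
    return upload + download

# Hypothetical jobs: a small result set versus an output as large as the input.
small_output_job = transfer_cost(upload_gb=12.5, download_gb=0.5)
large_output_job = transfer_cost(upload_gb=12.5, download_gb=12.5)
```

Transfers between VMs do not appear in the function because they are free of charge.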
8/23 Two goals: (1) measure the overhead of porting MapReduce applications to the Cloud; (2) estimate the cost of running MapReduce applications in the Cloud. Run MapReduce applications with Hadoop on 2 platforms: Grid'5000 and the Cloud.
9/23 Grid'5000 - Tools. OAR: fine-grain reservations. Kadeploy: deploy customized images. API: access resources through HTTP. Taktuk: launch parallel remote executions.
10/23 Reference open-source IaaS cloud
11/23 Who runs on?
12/23 Hadoop: Yahoo!'s implementation of MapReduce. Open-source Java project. Large-scale computation and data processing. Works on commodity hardware.
13/23 Core; Distributed File System (HDFS); MR framework
14/23 In-production use at...
15/23 Hadoop on Grid'5000 - Running on 220 nodes from Rennes and Orsay. Automatic deployment: one namenode; one jobtracker; datanodes co-deployed with tasktrackers.
16/23 Hadoop in the Cloud - Running on 60 nodes on parapide and parapluie. Deployment: Ruby, API; customized image with Hadoop and HDFS; IP addresses belonging to the Rennes routed private network; public key authentication. Client deploys a cluster of images.
17/23 Performance evaluation. Goal: compare Hadoop's performance in the 2 setups. Measure run time for Grep and Sort; 12.5 GB of input stored in HDFS; run on a number of nodes/VMs ranging from 1 to 200.
18/23 Performance evaluation. Figures: Grep; Sort.
19/23 Cost evaluation. Goal: estimate the cost of running Grep and Sort in the Cloud. 12.5 GB of input stored in HDFS; run on a number of VMs ranging from 1 to 200. Costs: CPU cost = number of VMs × runtime × VM cost; data transfer cost = (input size + output size) in GB × cost per GB.
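The cost model on this slide can be sketched as follows. The slide uses a single per-GB cost for input plus output; this sketch instead applies the separate incoming and download rates quoted on the EC2 costs slide, which is an assumption. The VM price, the runtime and the output size in the example are made-up placeholders, not measurements from the study.

```python
def cpu_cost(num_vms, runtime_hours, vm_price_per_hour):
    """CPU cost = number of VMs x runtime x VM cost (price and runtime share one time unit)."""
    return num_vms * runtime_hours * vm_price_per_hour

def data_transfer_cost(input_gb, output_gb, in_rate=0.10, out_rate=0.15):
    """Cost of uploading the input to the Cloud and downloading the output."""
    return input_gb * in_rate + output_gb * out_rate

def total_cost(num_vms, runtime_hours, vm_price_per_hour, input_gb, output_gb):
    """Total job cost: CPU cost plus data-transfer cost."""
    return (cpu_cost(num_vms, runtime_hours, vm_price_per_hour)
            + data_transfer_cost(input_gb, output_gb))

# Placeholder values: 100 VMs for half an hour at a hypothetical $0.10 per VM-hour,
# 12.5 GB of input and an output assumed to be as large as the input.
example = total_cost(num_vms=100, runtime_hours=0.5, vm_price_per_hour=0.10,
                     input_gb=12.5, output_gb=12.5)
```

Keeping the two terms separate shows the trade-off the evaluation explores: adding VMs shortens the runtime but multiplies the per-VM charge, while the transfer term depends only on the data sizes.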
20/23 Cost evaluation [1]. Figures: Grep cost; Sort cost.
21/23 Cost evaluation [2]. The cost of running Grep and Sort on 100 machines for two types of VMs. The overhead of running Grep and Sort in the Cloud compared to running on the Grid.
22/23 Context: executing MapReduce applications in grids and clouds. 2 setups: (1) Hadoop running on Grid'5000; (2) Hadoop running on the cloud deployed on Grid'5000. Evaluation: performance; costs; impact of VM types.
23/23 Thank you!