Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000
Alexandra Carpen-Amarie, Diana Moise, Bogdan Nicolae
KerData Team, INRIA
Outline 1 Cloud Computing 2 3 4 VM management MapReduce applications
MapReduce in the Cloud
  Cloud
    Shared computing and storage resources
    Easily accessible
    Pay-per-use model
    Elastic
    Reliable
  MapReduce
    Parallel programming model for large clusters
    Processes large amounts of data
    Provides a clean abstraction for the programmer, hiding:
      Communication between nodes
      Parallelization (scheduling and data distribution)
      Fault tolerance
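To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word count as the job. It is an illustration only, not Hadoop's API: `run_mapreduce`, `word_count_mapper`, and `word_count_reducer` are hypothetical names, and a real framework would run the phases on many nodes and handle scheduling, data distribution, and failures.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, group values by key (shuffle), then reduce, in one process."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def word_count_reducer(word, counts):
    """Sum the partial counts emitted for one word."""
    return sum(counts)


counts = run_mapreduce(
    ["the cloud", "the grid and the cloud"],
    word_count_mapper,
    word_count_reducer,
)
```

The programmer only writes the mapper and the reducer; everything inside `run_mapreduce` stands in for what the framework provides.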
Global view of the experiment Nimbus
The BlobSeer data management system
  Data striping
  High throughput under concurrency
  Versioning-based concurrency control
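The two ideas of striping and versioning can be sketched with a toy in-memory store. This is not BlobSeer's API or its actual metadata scheme: the class `StripedBlobStore`, its methods, and the round-robin placement are assumptions made for illustration. Each version is a list of chunk references, so a new version shares unchanged chunks with its predecessor and older versions stay readable.

```python
class StripedBlobStore:
    """Toy blob store: chunks are striped round-robin, versions are immutable."""

    def __init__(self, num_providers, chunk_size):
        self.chunk_size = chunk_size
        # Each dict stands in for one storage node (hypothetical layout).
        self.providers = [dict() for _ in range(num_providers)]
        self.versions = []  # versions[v] is a list of (provider, chunk_id)
        self._next_chunk_id = 0

    def _store_chunk(self, chunk):
        """Place one chunk on the next provider and return its reference."""
        chunk_id = self._next_chunk_id
        self._next_chunk_id += 1
        provider = chunk_id % len(self.providers)
        self.providers[provider][chunk_id] = chunk
        return provider, chunk_id

    def write(self, data):
        """Store data as a new blob version and return its version number."""
        layout = [
            self._store_chunk(data[offset:offset + self.chunk_size])
            for offset in range(0, len(data), self.chunk_size)
        ]
        self.versions.append(layout)
        return len(self.versions) - 1

    def overwrite_chunk(self, base_version, index, chunk):
        """Create a new version that replaces one chunk and shares the rest."""
        layout = list(self.versions[base_version])
        layout[index] = self._store_chunk(chunk)
        self.versions.append(layout)
        return len(self.versions) - 1

    def read(self, version):
        """Reassemble a given version from its chunk references."""
        return b"".join(self.providers[p][c] for p, c in self.versions[version])


store = StripedBlobStore(num_providers=3, chunk_size=4)
v0 = store.write(b"abcdefghij")
v1 = store.overwrite_chunk(v0, 1, b"WXYZ")
```

Because a writer never modifies existing chunks, readers of `v0` are not disturbed by the write that produces `v1`, which is the intuition behind versioning-based concurrency control.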
BlobSeer deployment
  Scripts: /home/acarpena/bsscripts
  Configuration settings: blobseer/env.sh
  Deploy the system: launchdepl/runblobseer.sh
  Challenges:
    Creating dynamic configuration files on multiple sites
    Gathering results
The Nimbus cloud environment
Nimbus deployment
  Initial scripts: developed by Pierre Riteau
  Modifications:
    Cloud spanning multiple Grid 5000 sites
    BlobSeer as a backend for Cumulus
    Automatic deactivation of existing propagation mechanisms, replaced with BlobSeer
  /nimbus/deploy-nimbus-cloud.rb
  Challenges:
    Integrating BlobSeer-related configuration files
    Networking constraints in Grid 5000
VM cluster configuration
  One-click clusters in Nimbus
  Modifications:
    Wrapper scripts to automatically configure clusters
  Deploy a customized image:
    Connect to the Nimbus client
    Create a VM cluster: /nimbus/cloud-client-scripts/run-all.sh
The Hadoop MapReduce framework
Running MapReduce applications in the cloud
  Distributed Sort
    Sorts key-value pairs
    Most widely used benchmark
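As a rough illustration of how a sort can be spread over several reducers, the sketch below range-partitions key-value pairs using splitters taken from a sample of the keys, sorts each partition independently, and concatenates the results. It is a simplified single-process model, not the benchmark's actual code: `distributed_sort` and its parameters are hypothetical names, and each partition would be handled by a separate reduce task in a real run.

```python
import bisect
import random


def distributed_sort(pairs, num_reducers, sample_size=100, seed=0):
    """Sort (key, value) pairs by key via range partitioning."""
    rng = random.Random(seed)
    keys = [key for key, _ in pairs]
    # Sample the keys to estimate their distribution.
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Choose num_reducers - 1 splitters that cut the sample into ranges.
    splitters = [
        sample[(i * len(sample)) // num_reducers]
        for i in range(1, num_reducers)
    ] if sample else []
    # "Shuffle": route each pair to the partition that owns its key range.
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[bisect.bisect_right(splitters, key)].append((key, value))
    # Each partition is sorted on its own; partitions are already ordered
    # relative to each other, so concatenation yields a global order.
    result = []
    for partition in partitions:
        result.extend(sorted(partition, key=lambda kv: kv[0]))
    return result


pairs = [(5, "e"), (1, "a"), (3, "c"), (2, "b"), (4, "d")]
sorted_pairs = distributed_sort(pairs, num_reducers=3)
```

Sampling keeps the partitions roughly balanced when the key distribution is skewed; the output is correctly ordered whatever splitters are chosen.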
VM management challenges
  Typical scenario:
    The user uploads a customized VM image to the Cloud repository.
    The VM image is propagated to many compute nodes.
    The same VM image is deployed simultaneously on all nodes.
  Limitations of existing approaches:
    Image propagation delays
    Huge storage space needed
    Heavy network traffic
BlobSeer-based efficient VM image management
  Principles:
    Optimize VM disk access: on-demand image mirroring
    Reduce contention by striping the image
  Evaluation:
    Experiments performed on Grid 5000
    50 storage nodes, up to 150 compute nodes
  [Figure: average time per instance to boot (s) vs. number of concurrent instances; series: taktuk pre-propagation; qcow2 over PVFS, 256K stripe; our approach, 256K chunks]
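The idea of on-demand mirroring can be sketched as a local view of the image that fetches a chunk from the repository only the first time a read touches it. This is a read-only toy model under stated assumptions, not the actual implementation: `LazyImageMirror` is a hypothetical class, and the remote repository is simulated by an in-memory byte string.

```python
class LazyImageMirror:
    """Local mirror of a remote image that fetches chunks on first access."""

    def __init__(self, remote_image, chunk_size):
        self.remote_image = remote_image  # stands in for the remote repository
        self.chunk_size = chunk_size
        self.local_chunks = {}            # chunk index -> locally mirrored bytes
        self.remote_fetches = 0           # counts simulated remote transfers

    def _chunk(self, index):
        """Return a chunk, fetching it remotely only if not yet mirrored."""
        if index not in self.local_chunks:
            start = index * self.chunk_size
            self.local_chunks[index] = self.remote_image[start:start + self.chunk_size]
            self.remote_fetches += 1
        return self.local_chunks[index]

    def read(self, offset, length):
        """Read a byte range, touching only the chunks that cover it."""
        if length <= 0:
            return b""
        first = offset // self.chunk_size
        last = (offset + length - 1) // self.chunk_size
        data = b"".join(self._chunk(i) for i in range(first, last + 1))
        skip = offset - first * self.chunk_size
        return data[skip:skip + length]


image = bytes(range(256)) * 4              # 1024-byte stand-in for a VM image
mirror = LazyImageMirror(image, chunk_size=256)
first_read = mirror.read(300, 10)          # touches chunk 1 only
second_read = mirror.read(250, 20)         # touches chunks 0 and 1; 1 is cached
```

A VM that reads only a small part of its disk while booting transfers only those chunks, rather than the whole image being copied to every node beforehand.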
BlobSeer-based cloud data service
  Features:
    Cumulus: open source implementation of the Amazon S3 API
    BlobSeer: concurrency support, improved scalability through multiple servers
  Evaluation:
    8 Cumulus servers
    10 storage nodes, 5 metadata nodes
    1 GB file transferred
    Up to 60 concurrent clients
  [Figure: aggregated throughput (MB/s) vs. number of clients; series: read, write]
Improving Grid 5000 utilization
  Evaluation:
    Measure run time for Grep
    12.5 GB of input stored in HDFS
    Run Hadoop on a number of nodes/VMs ranging from 1 to 200
  Experimental setup:
    Grid 5000: 200 physical nodes
    Nimbus: 200 VMs, only 60 physical nodes
  [Figure: job completion time (s) vs. number of machines; series: Nodes, VMs]
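For readers unfamiliar with the Grep workload, the sketch below models it as counting the matches of a regular expression across input splits and reporting them by frequency. It is a single-process illustration, not the Hadoop job used in the experiment: `grep_map` and `grep_job` are hypothetical names, and the sample log lines are made up for the example.

```python
import re
from collections import Counter


def grep_map(line, pattern):
    """Map step: count each matched string found in one input line."""
    return Counter(match.group(0) for match in pattern.finditer(line))


def grep_job(splits, regex):
    """Combine per-line counts over all splits, most frequent match first."""
    pattern = re.compile(regex)
    total = Counter()
    for split in splits:  # each split would go to a separate map task
        for line in split:
            total.update(grep_map(line, pattern))
    return total.most_common()


# Made-up input, split in two as if stored in two blocks.
splits = [
    ["error: disk", "ok"],
    ["error: net", "warning: cpu", "error: cpu"],
]
matches = grep_job(splits, r"error|warning")
```

Since each split is scanned independently, the job parallelizes naturally as the number of nodes or VMs grows.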
Q&A