Performance and Energy Efficiency of. Hadoop deployment models



Similar documents
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

A Cost-Evaluation of MapReduce Applications in the Cloud

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Chapter 7. Using Hadoop Cluster and MapReduce

MapReduce: Simplified Data Processing on Large Clusters. Jeff Dean, Sanjay Ghemawat Google, Inc.

Energy Efficient MapReduce

Evaluating MapReduce and Hadoop for Science

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Benchmarking Hadoop & HBase on Violin

Hadoop Scheduler w i t h Deadline Constraint

Hadoop Architecture. Part 1

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Dell Reference Configuration for Hortonworks Data Platform

Mobile Cloud Computing for Data-Intensive Applications

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Big Fast Data Hadoop acceleration with Flash. June 2013

Apache Hadoop. Alexandru Costan

Big Data in the Enterprise: Network Design Considerations

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

HiBench Introduction. Carson Wang Software & Services Group

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Understanding Hadoop Performance on Lustre

Hadoop Parallel Data Processing

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

Open source Google-style large scale data analysis with Hadoop


What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Apache Hadoop new way for the company to store and analyze big data

Scaling the Deployment of Multiple Hadoop Workloads on a Virtualized Infrastructure

Hadoop Cluster Applications

An improved task assignment scheme for Hadoop running in the clouds


Parameterizable benchmarking framework for designing a MapReduce performance model

MapReduce, Hadoop and Amazon AWS

Adaptive Task Scheduling for Multi Job MapReduce

CSE-E5430 Scalable Cloud Computing Lecture 2

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Increasing Hadoop Performance with SanDisk Solid State Drives (SSDs)

A Comparative Analysis of Join Algorithms Using the Hadoop Map/Reduce Framework

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Telecom Data processing and analysis based on Hadoop

A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Big Data Analytics* Outline. Issues. Big Data

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Cloud Federation to Elastically Increase MapReduce Processing Resources

Hadoop Size does Hadoop Summit 2013

MAPREDUCE [1] is proposed by Google in 2004 and

MapReduce. Introduction and Hadoop Overview. 13 June Lab Course: Databases & Cloud Computing SS 2012

PassTest. Bessere Qualität, bessere Dienstleistungen!

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Maximizing Hadoop Performance with Hardware Compression

SURVEY ON SCIENTIFIC DATA MANAGEMENT USING HADOOP MAPREDUCE IN THE KEPLER SCIENTIFIC WORKFLOW SYSTEM

NoSQL and Hadoop Technologies On Oracle Cloud

Introduction. Various user groups requiring Hadoop, each with its own diverse needs, include:

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Hadoop Scalability and Performance Testing in Heterogeneous Clusters

HadoopRDF : A Scalable RDF Data Analysis System

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Big Data Introduction

MapReduce. Tushar B. Kute,

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Big Data Analytics for Net Flow Analysis in Distributed Environment using Hadoop

Introduction to Cloud Computing

From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

Intro to Map/Reduce a.k.a. Hadoop

MapReduce with Apache Hadoop Analysing Big Data

Introduction to Hadoop

A very short Intro to Hadoop

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Big Data Management in the Clouds. Alexandru Costan IRISA / INSA Rennes (KerData team)

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

Accelerating and Simplifying Apache

Apache Hadoop: Past, Present, and Future

Introduction to Cloud Computing

Open source large scale distributed data management with Google s MapReduce and Bigtable

HYPER-CONVERGED INFRASTRUCTURE STRATEGIES

Applied Storage Performance For Big Analytics. PRESENTATION TITLE GOES HERE Hubbert Smith LSI

Performance measurement of a Hadoop Cluster

Intel Distribution for Apache Hadoop* Software: Optimization and Tuning Guide

Transcription:

Performance and Energy Efficiency of Hadoop deployment models

Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary

MapReduce Introduced by Google Programming model for generating and processing large data sets Popular framework for large scale data analysis Data generated are often handled as large graphs

MapReduce Map() map (in_key, in_value) -> list(out_key, intermediate_value) Processes input key/value pair Produces set of intermediate pairs Reduce() reduce (out_key, list(intermediate_value)) -> list(out_value) Combines all values for a particular key Produces a set of merged output values

MapReduce Single Execution

MapReduce Parallel Execution

Apache Hadoop

Apache Hadoop Implementation of MapReduce An open source project Popular to the point of becoming the standard

Hadoop Deployment Models

Hadoop Deployment Models Traditional Model: Collocated data and compute services Alternate Model: Separate data and compute services

Hadoop Deployment Models Collocated Services Physical Clusters Virtual Clusters Separate Services Physical Clusters Virtual Clusters

Hadoop Deployment Models Master and Slaves can be either servers or VMs

Hadoop Deployment Models Traditional Compute: MapReduce Layer JobTracker manages MapReduce jobs based on available map/reduce capacity. Data: Hadoop Distributed File System (HDFS) NameNode system manages DataNode services.

Hadoop Deployment Models Alternate Compute: MapReduce Layer Data: Hadoop Distributed File System (HDFS) TaskTracker and DataNode services run on separate dedicated sets of nodes.

Hadoop Deployment Models Metrics Performance: Application Execution Time Power Consumption: Energy efficiency Power metered servers Performance-to-Power Ratio

Experiment

Experiment Benchmarks TeraGen Generates large amounts of data blocks Write intensive TeraSort Sorts data generated by TeraGen CPU bound during map phase I/O bound during reduce phase Wikipedia Data Processing Represents data intensive scientific application (filtering, reordering, merging)

Experiment Test Platform 33 HP DL165 G7 Servers of parapluie cluster 3 Sun Fire X2270 Servers for VM management under Snooze system 161 VMs External network file system (NFS) server hosting data sets for Wikipedia processing

Experiment Power Measurement Total power consumption of parapluie cluster

Experiment Metrics Application Execution Time Performance-to-Power Ratio Application progress correlation with power consumption

Experiment Metrics Performance-to-Power Ratio Compare power efficiency of Hadoop models Performance: inverse of execution time 1 / Texecution

Experiment Metrics Application progress correlation with power consumption Workload s power consumption profiles

Results

Results Traditional Deployment (Execution Time)

Results Traditional Deployment (Execution Time) Significant performance degradation on VMs On servers: Filter 1.3 to 3.2 times faster Reorder 2.1 to 2.5 times faster Merge 2.3 to 3.3 times faster TeraGen and TeraSort up to 2.7 times faster

Results Traditional Deployment (Execution Time) I/O heavy benchmarks perform poorly in virtualized environments Overhead compounded with multiple (read: 5) VMs per server Competing for resources

Results Alternate Deployment (Performance to Power Ratio)

Results Alternate Deployment (Performance to Power Ratio) Collocation consistently holds highest performance to power ratio Impact of separating data and compute services heavily depends on data-compute ratio Adding more compute servers did not yield significant improvement

Results Application Power Consumption Profiles TeraGen and TeraSort percentage of remaining map/reduce and power consumption with collocated data and compute layers on servers for 500GB. Map and reduce completion correlates with decrease in power consumption. Trends similar for other data sets not shown.

Results Application Power Consumption Profiles Remaining percentage of maps and reduces correlate with power consumption When map and reduce complete, power consumption decreases Indication of underutilized servers

Results Application Power Consumption Profiles TeraGen: High, steady power consumption between 100% and 40% high CPU utilization TeraSort: Similar behavior Long shuffle and reduce phase creates more fluctuations in power consumption

Results Application Power Consumption Profiles Different power profiles show granularity where energy saving mechanisms might be considered.

Results Application Power Consumption Profiles Remaining percentage of map/reduce and power consumption for Hadoop Wikipedia data processing with 80 data and 30 compute VMs. Power consumption drops as the map and reduce complete.

Results Application Power Consumption Profiles Similar results for collocated scenario and other ratios of separated data and compute services

Results Application Power Consumption Profiles Power consumption profile is significantly different from TeraGen and TeraSort Steady map phase Smooth reduce phase Indicates power consumption profiles are heavily application specific

Summary

Summary Key Findings Hadoop on VMs yields significant performance degradation with increasing data scales for both compute and data intensive applications

Summary Key Findings Separation of data and compute layers reduces the performance-to-power ratio Degree of reduction depends on: Application Data Size Data to Compute ratio

Summary Key Findings Power consumption profiles are application specific and correlate with the map and reduce phases Opportunities for applying energy saving mechanisms

The End Thank you