Reference Architecture and Best Practices for Virtualizing Hadoop Workloads
Justin Murray, VMware
Agenda
- The Hadoop Journey
- Why Virtualize Hadoop?
- Elasticity and Scalability
- Performance Tests
- Storage Reference Architectures
- Isilon Architecture and Benefits
- vSphere Big Data Extensions
- Conclusion and Q&A
The Customer Journey with Hadoop
The Hadoop Journey (chart: Hadoop adoption becomes more integrated as clusters scale from a single node to tens and then hundreds of nodes)
Why Virtualize Hadoop?
Customer Example: Enterprise Adoption of Hadoop
- Each department runs its own Production, Test, and Experimentation clusters (Dept A: recommendation engine; Dept B: ad targeting), fed by transaction data, social data, log files, and historical customer behavior
- Production SLA: jobs complete in 15 minutes; bandwidth is limited to 30 nodes at peak
- Issues:
  1. Multiple clusters to manage
  2. Redundant copies of common data in separate clusters
  3. Peak compute and I/O resources are limited to the number of nodes in each independent cluster
What if you could consolidate?
- Without virtualization: multiple copies of common data (e.g. historical data, log data) sit in separate Hadoop clusters for production (recommendation engine, ad targeting), test/dev, and experimentation
- Consolidate and virtualize: one physical platform supports multiple virtual big data clusters
- A single copy of the common data reduces storage requirements while maintaining good isolation between the different MapReduce clusters
Big Data Extensions Value Propositions
- Operational simplicity with performance: rapid deployment, self-service tools, performance, elastic scaling
- Maximize resource utilization: avoid dedicated hardware-based isolation, increase resource utilization, true multi-tenancy
- Architect a scalable platform: deployment choice, maintain management flexibility at scale, control costs, leverage vSphere features
Hadoop 2.0: YARN (Yet Another Resource Negotiator)
A Virtualized Hadoop 2.0 Cluster
vSphere Big Data Extensions - Deploy Hadoop Clusters in Minutes
- Server preparation, OS installation, network configuration, and Hadoop installation and configuration move from a manual process to a fully automated one, driven from the GUI
Elastic, Multi-Tenant Hadoop with Virtualization
- Combined compute and storage: an unmodified Hadoop node whose lifecycle is tied to the DataNode, so elasticity is limited
- Separate compute from storage: compute is separated from the data, so the compute tier is stateless and can scale elastically
- Separate virtual compute clusters per tenant: a compute cluster per tenant provides stronger security and resource isolation
Performance and Reference Architectures
Native vs. Virtual Performance: 32 hosts, 16 disks per host (Source: http://www.vmware.com/resources/techresources/10360)
Reference Architecture: 32-Server Performance Test
- Up to four VMs per server
- vCPUs per VM fit within the socket size (e.g. 4 VMs x 4 vCPUs, or 2 VMs x 8 vCPUs)
- Memory per VM fits within the NUMA node size
- Tests were done in 2013 using Hadoop 1.0
I/O Profile of a Hadoop MapReduce Job (TeraSort example application)
- Map tasks read the DFS input data from HDFS: about 12% of disk bandwidth
- Map-side spills (spill*.out), sorting/combining of intermediate map output, logs, and the shuffle of map output (map_*.out) to the reducers: about 75% of disk bandwidth
- Reducers write the DFS output data back to HDFS: about 12% of disk bandwidth
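To make those I/O phases concrete, here is a minimal sketch of driving the TeraGen and TeraSort example jobs from Java, assuming the standard Tool classes shipped in hadoop-mapreduce-examples; the HDFS paths and row count are illustrative only.

    // Generate input data, then sort it, so the DFS-read, spill/shuffle, and
    // DFS-write phases described above can be observed on the data disks.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.examples.terasort.TeraGen;
    import org.apache.hadoop.examples.terasort.TeraSort;
    import org.apache.hadoop.util.ToolRunner;

    public class TeraSortDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // DFS input data (the ~12% read share in the profile above).
            ToolRunner.run(conf, new TeraGen(),
                    new String[] {"100000000", "/bench/terasort-input"});

            // Map spills and the shuffle dominate disk bandwidth (~75%); the
            // reduce output written back to HDFS is the remaining ~12%.
            ToolRunner.run(conf, new TeraSort(),
                    new String[] {"/bench/terasort-input", "/bench/terasort-output"});
        }
    }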
The Combined Model: Standard Deployment
- A single Hadoop virtual node on the host runs both the NodeManager and the DataNode
- The OS image disk can be placed on shared storage (SAN/NAS); the Hadoop data disks use the host's local disks
Combined Model: Two Virtual Machines on a Host Server
- Two Hadoop virtual nodes on the host, each running its own NodeManager and DataNode
- Each VM has its own OS image disk (which can sit on shared SAN/NAS storage); the data disks are split between the VMs from the host's local disks
The Data-Compute Separation Deployment Model
- Hadoop virtual node 1 runs only the NodeManager (compute); Hadoop virtual node 2 runs only the DataNode (storage)
- Each VM has its own OS image disk; the DataNode VM owns the host's local data disks
Data Paths: Combined vs. Data-Compute Separation
- Combined model: the NodeManager and DataNode share one VM, so the HDFS read/write path stays inside that VM
- Separated model: HDFS traffic between the NodeManager VM and the DataNode VM passes through the virtualization host's virtual switch, so it remains on the host when the two VMs are co-located
Alternative Storage for Data/Compute Separation
- The NodeManager and DataNode still run in separate VMs on the host
- In this variant the DataNode's disks are provisioned from shared storage (SAN/NAS) rather than from the host's local disks
Disk Isolation for Performance
- The NodeManager and DataNode run in separate VMs on the host
- Temp data (spills and shuffle output) is written to one set of local JBOD disks, while HDFS data lives on a separate set of local JBOD disks; the OS images can stay on shared storage (SAN/NAS)
- A configuration sketch of this layout follows below
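A rough sketch of that disk layout in Hadoop 2.x configuration terms: the property names are the standard YARN and HDFS ones, while the /mnt/... mount points are hypothetical. In a real cluster these values would normally live in yarn-site.xml and hdfs-site.xml rather than be set in code.

    import org.apache.hadoop.conf.Configuration;

    public class DiskIsolationConfig {
        public static Configuration build() {
            Configuration conf = new Configuration();

            // Temp data (map spills, shuffle) on its own JBOD group.
            conf.set("yarn.nodemanager.local-dirs",
                     "/mnt/temp-disk1/nm-local,/mnt/temp-disk2/nm-local");

            // HDFS block data on a separate JBOD group.
            conf.set("dfs.datanode.data.dir",
                     "/mnt/hdfs-disk1/dn,/mnt/hdfs-disk2/dn,/mnt/hdfs-disk3/dn");

            return conf;
        }
    }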
Data/Compute Separation with Isilon
- The ResourceManager and NodeManagers run as Hadoop virtual nodes on the vSphere host, with only OS images and temp space on the host's storage
- HDFS itself (the NameNode and DataNode functions) is served by the Isilon array, so no DataNode VMs are required on the compute hosts
Larger Architecture with Data Compute Separated
Hybrid storage model - the best of both worlds
- Shared storage for master nodes: the NameNode, ResourceManager, ZooKeeper, etc. run on shared storage and can leverage vSphere vMotion, HA, and FT (a configuration sketch for the master-node side follows below)
- Local storage for worker nodes: NodeManager/DataNode VMs use local disks for lower cost and scalable bandwidth, and temp data is written to local storage for best performance
- NFS storage for HDFS data is a very good alternative to local disks
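As a small illustration of the master-node side of the hybrid model, the sketch below places the NameNode's metadata directory on a path backed by a shared (NFS or SAN) datastore; dfs.namenode.name.dir is the standard HDFS property, and the mount path is hypothetical.

    import org.apache.hadoop.conf.Configuration;

    public class NameNodeOnSharedStorage {
        public static Configuration build() {
            Configuration conf = new Configuration();
            // The virtual disk behind this path lives on a shared datastore, so
            // the NameNode VM can be protected with vMotion, HA, or FT.
            conf.set("dfs.namenode.name.dir", "/mnt/shared-datastore/nn-metadata");
            return conf;
        }
    }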
vSphere Big Data Extensions and Project Serengeti
Big Data Extensions - Highlights
- Serengeti: an open source project and a tool that simplifies virtualized Hadoop deployment and operations
- Hadoop Virtualization Extensions (HVE): virtualization changes for core Hadoop, contributed back to Apache Hadoop (a configuration sketch follows below)
- Virtual Hadoop Manager (VHM): advanced resource management on vSphere
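A hedged sketch of the node-group topology settings that HVE introduces, using the property names documented for the Hadoop Virtualization Extensions; exact keys can differ between Hadoop distributions, so treat this as illustrative rather than definitive.

    import org.apache.hadoop.conf.Configuration;

    public class HveTopologyConfig {
        public static Configuration build() {
            Configuration conf = new Configuration();

            // Insert the node-group (physical host) layer between rack and node.
            conf.set("net.topology.impl",
                     "org.apache.hadoop.net.NetworkTopologyWithNodeGroup");
            conf.setBoolean("net.topology.nodegroup.aware", true);

            // Place HDFS replicas so that no two copies land on VMs that share
            // the same physical host.
            conf.set("dfs.block.replicator.classname",
                     "org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup");

            return conf;
        }
    }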
Introducing vSphere Big Data Extensions (BDE)
- A vCenter plugin
- Hadoop as a Service with vCloud Automation Center
Brief Tour of Big Data Extensions
One Click to Scale out the Cluster on the Fly
BDE Allows Flexible Configurations
- Number of nodes and resource configuration
- Storage configuration: choice of shared or local
- High Availability options
- Placement policies
External HDFS: Simple to Set Up
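A minimal sketch of what that setup amounts to for a compute-only cluster: point fs.defaultFS at the external HDFS endpoint, for example an Isilon SmartConnect name. The hostname and port below are assumptions for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ExternalHdfsClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://isilon.example.com:8020");

            // Quick sanity check that the external HDFS endpoint is reachable.
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Root exists: " + fs.exists(new Path("/")));
        }
    }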
How BDE works
vSphere Configuration
- Provision the virtual machines at the right size
- Reserve 6% of physical memory on the ESXi server for vSphere's own use, and avoid over-commitment
- Enable NUMA and keep each virtual machine's memory and vCPU count within a NUMA node; the NUMA scheduler is important for virtualized Hadoop performance, and a poor configuration can degrade performance
- Data should preferably be distributed across NUMA nodes
- A worked sizing example follows below
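A worked sizing example for the guidelines above, using illustrative host numbers (256 GB of RAM, two NUMA nodes, two Hadoop VMs per NUMA node); the 6% figure is the reservation recommended on this slide.

    public class VmSizing {
        public static void main(String[] args) {
            double hostMemoryGb = 256.0;   // physical RAM in the ESXi host (assumed)
            int numaNodes = 2;             // sockets / NUMA nodes (assumed)
            int vmsPerNumaNode = 2;        // Hadoop VMs per NUMA node (assumed)

            // Hold back ~6% of physical memory for the hypervisor itself.
            double usableGb = hostMemoryGb * (1.0 - 0.06);

            // Keep each VM's memory within its NUMA node to avoid remote accesses.
            double perVmGb = usableGb / numaNodes / vmsPerNumaNode;

            System.out.printf("Usable memory: %.1f GB, per VM: %.1f GB%n",
                    usableGb, perVmGb);   // roughly 240.6 GB usable, ~60.2 GB per VM
        }
    }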
VMware vSphere BDE and Hadoop Resources
- VMware vSphere BDE web site: http://www.vmware.com/bde
- Virtualized Hadoop Performance with VMware vSphere 5.1: http://www.vmware.com/resources/techresources/10220
- A Benchmarking Case Study of Virtualized Hadoop Performance on vSphere 5: http://vmware.com/files/pdf/w-hadoop-performance-vsphere5.pdf
- Hadoop Virtualization Extensions (HVE): http://www.vmware.com/files/pdf/hadoop-virtualization-extensions-on-ware-vsphere-5.pdf
- Apache Hadoop High Availability Solution on VMware vSphere 5.1: http://vmware.com/files/pdf/apache-hadoop-ware-ha-solution.pdf
Conclusions
- Hadoop workloads run very well on VMware vSphere; various performance studies have shown that any difference between virtualized and native performance is minimal
- Follow the general best-practice guidelines that VMware has published
- vSphere Big Data Extensions enhances the Hadoop experience on the VMware virtualization platform: rapid provisioning of Hadoop components in virtual machines, with algorithms for the best layout of Hadoop data and cluster components built into the BDE HVE components
- Design patterns such as data-compute separation can provide elasticity for your Hadoop cluster on demand
- User self-service is available for Hadoop using tools such as vCloud Automation Center integrated with BDE
Thank You
Justin Murray, jmurray@vmware.com
Backup Slides
Today's Challenges on Hadoop Infrastructure
- Fixed coupling of compute and storage leads to low utilization and inflexibility
- Compute nodes, data nodes, and storage are linked together in a fixed ratio determined by the server hardware specification
- Not all jobs are created equal (data-intensive vs. compute-intensive)
- An inflexible infrastructure leads to waste: too little compute power means slow processing, while too much compute power sits idle
- The problem compounds with larger clusters
So what happens?
Getting more out of your infrastructure
- Decouple the linkage between compute and storage: stateless compute can grow and shrink elastically
- Data locality is preserved by placing the compute where the data resides
- Extra compute capacity can be used for other workloads, so the same hosts can run Hadoop and non-Hadoop workloads
- The compute layer scales independently on top of a shared storage layer
Elasticity and Scalability
Elastic Scalability & Multiple Workloads
- Deploy separate compute clusters for different tenants (e.g. experimentation, production recommendation engine, production) that share the same HDFS data layer
- Commission or decommission compute nodes according to priority and available resources, drawing on a dynamic resource pool provided by VMware vSphere + Big Data Extensions
- A sketch of the underlying YARN decommission mechanism follows below
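For reference, a sketch of the plain-YARN mechanism that such elastic scaling builds on: add a NodeManager host to the exclude file named by yarn.resourcemanager.nodes.exclude-path and then refresh the ResourceManager's node list (for example with "yarn rmadmin -refreshNodes"). BDE's Virtual Hadoop Manager automates this flow; the file path and hostname below are hypothetical.

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class DecommissionNode {
        public static void main(String[] args) throws Exception {
            Path excludeFile = Paths.get("/etc/hadoop/conf/yarn.exclude");
            String hostToRemove = "compute-worker-07.example.com";

            // Append the host to the exclude list read by the ResourceManager.
            Files.write(excludeFile,
                    (hostToRemove + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);

            // The automation (or an administrator) then runs
            // "yarn rmadmin -refreshNodes" so no new containers are scheduled
            // on the excluded host and it can be powered down or reclaimed.
        }
    }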
Hadoop 1.0
- The JobTracker and NameNode coordinate the job: the input file is divided into 64 MB splits (Split 1, 2, 3)
- Worker nodes 1-3 each run a TaskTracker and a DataNode; Tasks 1-3 are scheduled on the worker nodes that hold Blocks 1-3 (64 MB each) of the input file
- A quick check of the split arithmetic follows below
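A back-of-the-envelope check of the numbers on this slide: with a 64 MB block size and the default split size equal to the block size, a 192 MB input file produces three splits and therefore three map tasks, one per worker node holding a block.

    public class SplitCount {
        public static void main(String[] args) {
            long blockSizeBytes = 64L * 1024 * 1024;   // 64 MB, as on the slide
            long fileSizeBytes  = 192L * 1024 * 1024;  // illustrative input size

            // Ceiling division: number of splits, hence number of map tasks.
            long splits = (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
            System.out.println("Map tasks (one per split): " + splits);  // prints 3
        }
    }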