Can Storage Fix Hadoop




Can Storage Fix Hadoop?
John Webster, Senior Partner
9/18/2013

Agenda
- What is the Internet Data Center, and how is it different from the Enterprise Data Center?
- How is the Apache Software Foundation (ASF) addressing the issues?
- What needs fixing from the perspective of Enterprise Storage vendors and the Enterprise Storage world?
- What are the proposed fixes?
- Can Hadoop fix Enterprise storage?
- Can the Internet Data Center/Enterprise Data Center chasm be crossed?
FYI: I will use vendor names and products as examples only; no explicit or implied endorsements.

The Data Center Chasm
[Figure: Internet Data Center, Enterprise Data Center]

Defining the Data Center Chasm

Internet Data Center
- Embraces open source
- Automates IT
- Comfortable with systems that run in failure mode
- Cheap-and-deep hardware; inefficiency is not an obvious issue
- More willing to build their own systems and self-support
- Manages storage from a systems perspective

Enterprise Data Center
- Prefers proprietary but is learning open source
- Approaches IT automation conservatively
- Doesn't get failure mode
- Hardware efficiency-conscious
- More willing to buy from proprietary vendors and deal with them for support
- Sees value in the storage environment as a place for data and storage management


What Has the ASF Fixed in HDFS?
- NameNode SPOF: NameNode active/standby failover support
- Snapshot: read-only copy-on-write (COW), included in the latest v2 beta (2.1.0)
- NFS support: support for NFSv3 in the latest v2 beta (2.1.0)
- DR support: Distributed Copy (distcp)

What Needs to Be Fixed: The Enterprise Storage Vendor Perspective
- Hadoop NameNode is a single point of failure in v1; manual failover in v2 (beta)
- JobTracker is also a single point of failure
- For data integrity and protection, HDFS creates three full clone copies of data: 3x the storage for each file, slow and inefficient
- If all three copies are corrupted, you're still hosed (reload and start over)
- 60% of Enterprise Hadoop projects fail or are put on hold
- Steep learning curve: six months is not uncommon for those that actually go from pilot to production
- No storage tiering
- Limited (if any) ways to respond to corporate security and data governance policies
- Difficult to move between cloud and data center
- Fundamentally a batch process
- Data in/out processes can take longer than the actual query process
- Inability to disaggregate storage from compute so that the two can be scaled independently
- Dearth of applications built on top
- Dearth of people available in the job market to run this beast, and the ones that can go for big bucks
- ...and more, leading some analysts to believe that Big Data has entered the trough of disillusionment


Hadoop External Storage: EMC Isilon Example
- Shared storage replaces node-level DAS
- HDFS implemented as an over-the-wire protocol on OneFS
- Isilon cluster nodes emulate NameNodes and DataNodes
- NameNode SPOF eliminated
- Decoupled storage and compute layers
- Data protection and DR by OneFS
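The point about decoupled storage and compute layers can be illustrated with a small sizing sketch. This is not taken from the slides or from any vendor tool: the function names and every number below (capacity per node, cores per node, workload size) are hypothetical and chosen only to show the arithmetic. With node-level DAS, the node count is driven by whichever resource runs out first; with decoupled layers, each layer is sized on its own.

```python
import math

def coupled_nodes(storage_tb, cores, tb_per_node, cores_per_node):
    """Nodes needed when every node carries both disks and CPUs (DAS model).

    The cluster must satisfy the larger of the two requirements, so one
    resource is usually over-provisioned.
    """
    return max(math.ceil(storage_tb / tb_per_node),
               math.ceil(cores / cores_per_node))

def decoupled_nodes(storage_tb, cores, tb_per_storage_node, cores_per_compute_node):
    """(storage nodes, compute nodes) when the two layers scale independently."""
    return (math.ceil(storage_tb / tb_per_storage_node),
            math.ceil(cores / cores_per_compute_node))

# Hypothetical storage-heavy workload: 1200 TB of data, 256 cores of compute.
print(coupled_nodes(1200, 256, tb_per_node=24, cores_per_node=16))
print(decoupled_nodes(1200, 256, tb_per_storage_node=24, cores_per_compute_node=16))
```

Under these assumed figures the coupled cluster needs 50 full nodes to hold the data even though 16 nodes would cover the compute demand, while the decoupled layout buys 50 storage nodes and only 16 compute nodes.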

Hadoop External Storage: NetApp Example
[Diagram: NameNode, Secondary NameNode, and DataNodes/TaskTrackers running Hadoop MapReduce, connected over 10GbE; FAS Series (Data ONTAP) as NFS metadata store; E-Series data stores with 4 separate, shared-nothing partitions per chassis, attached via 6Gb SAS, 4-lane direct connect]
- Preserves shared-nothing architecture and HDFS
- Decouples compute and storage
- Hardware RAID: reduction in copies from 3 to 2
- NameNode metadata in a separate array for faster NameNode recovery
- DataNode drive failures do not blacklist the DataNode
- Applies built-in enterprise data and storage management functionality to Hadoop data
Source: NetApp
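The "reduction in copies from 3 to 2" claim can be checked with back-of-the-envelope arithmetic. The sketch below is an illustration only: the helper name, the 100 TB data set, and the 8 data + 2 parity RAID layout are assumptions, not figures from NetApp or from the slides. It compares the raw disk capacity needed for three plain HDFS copies against two copies stored on RAID-protected volumes.

```python
def raw_capacity_tb(usable_tb, copies, data_disks=1, parity_disks=0):
    """Raw disk capacity needed to hold `usable_tb` of data.

    `copies` is the HDFS replication factor. `data_disks` and `parity_disks`
    describe a hypothetical RAID group; the defaults model plain disks with
    no RAID overhead.
    """
    raid_overhead = (data_disks + parity_disks) / data_disks
    return usable_tb * copies * raid_overhead

# Hypothetical 100 TB data set.
das_three_copies = raw_capacity_tb(100, copies=3)
raid_two_copies = raw_capacity_tb(100, copies=2, data_disks=8, parity_disks=2)

print(das_three_copies)  # three full copies on plain disks
print(raid_two_copies)   # two copies on assumed 8+2 RAID groups
```

With these assumed numbers, three plain copies need 300 TB of raw disk and two RAID-protected copies need 250 TB; the actual saving depends on the RAID layout in use.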

Shared Storage as Secondary Storage
[Diagram, top to bottom: Network Layer; Compute Layer (control node plus nodes); Primary Storage Layer (1 to n); Secondary Storage Layer (SAN/NAS)]
- Data mirrored or migrated from primary to secondary storage
- Storage services also live in the secondary storage layer

Progression of Hadoop @ Yahoo!
Source: Yahoo!

Is Hadoop a New Storage Platform?

No
- It's a distributed computing platform for analytics

Yes
- HDFS: embedded, distributed file system (like scale-out NAS)
- Data protection and management built in (like Enterprise Storage)
- Storage performance at scale and low cost, with native intelligence
- Growing use case as a data repository for existing enterprise BI and Data Warehousing apps: the Data Lake

What Does the Enterprise Want from Big Data?
"If we could harness all of our data, we would be a much stronger business." *
* From a CompTIA survey in which two-thirds of respondents either agreed or strongly agreed with the statement

Can the Chasm Be Crossed?
[Figure: Internet Data Center, Enterprise Data Center]