Successfully Deploying Alternative Storage Architectures for Hadoop
Gus Horn, Iyer Venkatesan
NetApp





Agenda
- Hadoop and storage
- Alternative storage architecture for Hadoop
- Use cases and customer examples
- Guidelines and best practices
- NFS Connector for Hadoop
- Conclusion and next steps

Hadoop and Storage

Traditional Hadoop Storage Flow
- Ingest (logs, images, text) goes to Data Node A
- Ingest is replicated to Data Node B and Data Node C
- Replication R=3: Data Nodes A, B, and C each hold data1, data2, data3, data4
[Diagram: ingest passing through the network switch to the Name Node and Data Nodes A, B, and C]
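The flow on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Hadoop code: it assumes the ingest arrives over the network from an external client and that each replica is forwarded from the previous DataNode in a chain. The function names and node labels are hypothetical.

```python
def write_pipeline(block, nodes):
    """Return the network hops for one block as (source, target, block) tuples.

    Assumption: the client sends the block to the first DataNode, and each
    DataNode forwards it to the next one in the chain.
    """
    hops = []
    source = "client"
    for node in nodes:
        hops.append((source, node, block))
        source = node
    return hops


def network_transfers(block_count, replication):
    """Count block transfers over the network: one per stored copy."""
    return block_count * replication


blocks = ["data1", "data2", "data3", "data4"]

# Replication R=3, as on the slide: every block crosses the network three times.
for hop in write_pipeline("data1", ["A", "B", "C"]):
    print(hop)

three_copies = network_transfers(len(blocks), 3)
two_copies = network_transfers(len(blocks), 2)
print(three_copies, two_copies)
```

Under these assumptions, four blocks cause 12 transfers at R=3 and 8 at R=2, that is, one third fewer transfers when the replication count drops from three to two.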

Implications of Three Copies
- Network congestion
- Server congestion, RAM utilization
[Diagram: Servers A, B, and C connected over the network, each with CPU, memory (RAM, DIMM), memory controller, I/O controller, and disk drives. Each server holds the master copy of its own LUN and copies of the other two servers' LUNs.]
Hadoop uses server-based replication to keep three copies
- Causes high levels of I/O over the server system bus
- Causes poor disk utilization (1/3 of raw capacity)
Hadoop and memory
- Memory issues account for a large part of support calls (root cause = server memory contention)
- Reducing server replication reduces memory consumption for a more reliable, faster cluster
- Server replication can be messy

Alternative DAS Architecture
Dedicated storage with E-Series
- External DAS architecture
- Higher capacity and density
  - 180TB in 4U
  - Less footprint in datacenter
Two copies of data (not three)
- Less network congestion, better throughput
- Less data to manage, higher efficiency
High availability for Hadoop
- Reliable NameNode protection
- Jobs continue when nodes go off-line
- Faster cluster recovery

NetApp Storage Layout for HDFS
- Two 7-disk RAID 5 groups with two LUNs per node
- Dedicated set of disks per DataNode
- Shared-nothing architecture
- Spare disks shared globally
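To make the capacity trade-off concrete, the following sketch compares the usable share of raw disk capacity for three plain copies against two copies on 7-disk RAID 5 groups. It is a rough estimate under stated assumptions: RAID 5 is modeled as one disk of parity per group, and spare disks, file system overhead, and temporary space are ignored. The function name is hypothetical.

```python
from fractions import Fraction


def usable_fraction(replication, raid5_group_disks=None):
    """Share of raw capacity that holds unique data.

    replication: number of copies Hadoop keeps of each block.
    raid5_group_disks: disks per RAID 5 group, or None for plain disks.
    Assumption: RAID 5 spends one disk per group on parity.
    """
    raid_efficiency = Fraction(1)
    if raid5_group_disks is not None:
        raid_efficiency = Fraction(raid5_group_disks - 1, raid5_group_disks)
    return raid_efficiency / replication


three_copies_plain = usable_fraction(3)          # 1/3 of raw capacity
two_copies_raid5 = usable_fraction(2, 7)         # (6/7) / 2 = 3/7 of raw capacity

print(three_copies_plain, float(three_copies_plain))
print(two_copies_raid5, float(two_copies_raid5))
```

With these simplifications, two copies on 7-disk RAID 5 groups leave 3/7 (about 43%) of raw capacity for unique data, compared with 1/3 for three plain copies.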

Use Cases

Service Provider Leveraging Hadoop
- Significant growth in network log data from remote data centers that couldn't be consolidated
- Analytical queries can't be done with existing tools; stakeholders couldn't access data
- Faster consolidation, indexing, searching of log data
- Information needed for auditing and compliance
- New analytics capabilities
- Eight-node Hadoop cluster with open source search and indexing tools
[Diagram labels: Remote Servers, Central Servers, Hadoop HDFS/MapReduce, Archiving & Indexing Tools, Analytics Solution, UI + Search Tool, Analysts, Business Users]

Security Use Case in Government
Challenges
- Protect IT/data assets from cyber attacks
- Implementation: how to combine big data with cyber analytics
- Customer analytics application
Benefits
- Defensive perimeter around financial data to thwart potential attacks
- Better situational awareness
- Required both Hadoop and custom analytical application for complete solution

Alternative Architecture in Healthcare
Challenges
- Extract, Transform, Load (ETL) offload for increasing amounts of unstructured data
- Integration of Hadoop with traditional systems
Benefits
- Cost-effective ingest solution for semi-structured and unstructured data
- New treatment analytics capabilities
- Highly available Hadoop cluster
[Diagram labels: Images, insurance claims, patient records; Hadoop; Data Warehouse; Business Intelligence]

Other Customers and Use Cases
- Healthcare: hospitals, pharmaceutical, managed healthcare, clinical testing
- Transportation: airline, automotive
- Government: education, security
- Telco/SP: wireless hotspots, logs analysis
- Consumer: retail, household goods
- Financial Services: insurance, banking, mobile payments
- Manufacturing: electronics, industrial coating
- High Tech: semiconductor design and packaging, networking

Advantages of Alternative Architecture

Replication count
- External or Managed DAS: 2; reduction of hardware required by one third; single copy planned
- White Box DAS: 3 minimum

Application availability
- External or Managed DAS: enterprise hardware RAID 5, 6 and Dynamic Disk Pools; much higher uptime (five nines)
- White Box DAS: slower recovery from disk drive failure and NameNode failure; less uptime

Performance
- External or Managed DAS: consistent performance during healthy and unhealthy modes of operation; 33% less network traffic
- White Box DAS: degradation of up to 240% with a single drive failure

Fan-In Ratio
- External or Managed DAS: up to 8:1 (nodes per E-Series); SAS options: I-Band, FC
- White Box DAS: limited scalability, only with internal drives

Solution Architecture
- External or Managed DAS: validated designs and Technical Reports expediting time to market, reducing risk
- White Box DAS: iterative, time-consuming tuning process; multiple failure points; resource intensive

Growth Flexibility
- External or Managed DAS: storage and compute decoupled; non-disruptive lifecycle management
- White Box DAS: can only grow both simultaneously; disruptive migration and rebalancing

DataNode Management
- External or Managed DAS: non-disruptive DataNode replacement; no rebalancing or migration
- White Box DAS: disruptive DataNode replacement; must rebalance and/or migrate content

Best Practices from Customer Use Cases
- Start with the use case or business problem to leverage new data sources
- Determine the workload, technologies, infrastructure
- Enhance or update your data warehouse and BI tools (ETL offload and active archiving)
- Think about redesigning or updating the analytic platform

Best Practices
Minimize network overhead
- Replication factor of 2 and RAID 5
- Use compression wherever possible
Storage and Hadoop optimization
- Start with 4:1 storage to compute ratio
- Allocate 30% of storage capacity to map output
- Disk group layout
- Turn on rack awareness
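Rack awareness in Hadoop is commonly enabled by pointing the `net.topology.script.file.name` property in core-site.xml at an executable that maps host addresses to rack paths. The script below is a minimal sketch of such a mapper. The host addresses and rack names in the table are hypothetical placeholders, not values from this deck; a real cluster would load its own mapping.

```python
#!/usr/bin/env python3
"""Minimal rack topology script sketch for Hadoop rack awareness."""
import sys

# Hypothetical mapping from DataNode address to rack path.
RACK_BY_HOST = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.11": "/dc1/rack2",
    "10.0.2.12": "/dc1/rack2",
}

# Hadoop's conventional name for hosts without a known rack.
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes one or more host addresses as arguments and reads
    # whitespace-separated rack paths from standard output.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

Called as `topology.py 10.0.1.11 10.0.9.9` (file name assumed), the script would print the rack for the first address followed by the default rack for the unknown one.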

Best Practices
- Use E5560 (or later) as storage array, supporting four DataNodes
- Use FAS22xx for diskless and network boot, storage administration
- Separate network for data; separate for node interconnect
- Use Jumbo Frames and 10GbE
- Determine DataNodes by storage and job run requirements
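The sizing guidelines on these slides can be combined into a back-of-the-envelope estimate. The sketch below is one possible reading, not a NetApp sizing tool: it assumes that the 30% map output allocation is reserved out of usable capacity, that RAID 5 on 7-disk groups costs one disk of parity per group, and that each array serves four DataNodes. Function names and the example inputs are hypothetical.

```python
import math
from fractions import Fraction


def raw_capacity_tb(dataset_tb, replication=2,
                    map_output_share=Fraction(3, 10), raid5_group_disks=7):
    """Estimate raw disk capacity (TB) needed for a dataset.

    Assumptions: map_output_share of usable capacity is set aside for map
    output, and RAID 5 spends one disk per group on parity. Spares and
    file system overhead are ignored.
    """
    hdfs_tb = Fraction(dataset_tb) * replication          # copies kept by HDFS
    usable_tb = hdfs_tb / (1 - map_output_share)          # add map output reserve
    raid_efficiency = Fraction(raid5_group_disks - 1, raid5_group_disks)
    return usable_tb / raid_efficiency                    # add RAID 5 parity


def arrays_needed(datanodes, datanodes_per_array=4):
    """Number of storage arrays when each array serves a fixed number of DataNodes."""
    return math.ceil(datanodes / datanodes_per_array)


# Example with made-up inputs: an 84 TB dataset and a 10-DataNode cluster.
print(float(raw_capacity_tb(84)))
print(arrays_needed(10))
```

For the made-up inputs, 84 TB of data becomes 168 TB in HDFS, 240 TB of usable capacity, and 280 TB of raw capacity, served by three arrays.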

Best Practices (continued)
- Start a POC or pilot sooner rather than later
  - POC is for business validation
  - Pilot is for technology validation
- Focus on performance after deployment
- Application and cluster size determine most of the configuration

Putting the Stack Together
- Reporting/Dashboard/Visualization
- Applications and Analytics
- Data Management
- Servers, Networking, Hardware
- Storage and File Systems

Scenario for Storage and Analytics
1) Data is sitting on FAS, NFS-based storage
2) If Hadoop or MapReduce analysis is needed, HDFS-based storage has to be created
3) Data has to be moved to the newly created Hadoop storage
4) Analysis can now be done on the data
[Diagram labels: Enterprise Data; NetApp FAS Storage (NFS-based); Hadoop Analytics with MapReduce, HBase, Spark, YARN, HDFS]
Hadoop diagram courtesy Hortonworks

Introducing NetApp NFS Connector
- Directly on NFS data: MapReduce analytics natively on data sitting on FAS, NFS-based storage
- NFS Connector is a thin software application between MapReduce and NFS
[Diagram labels: Enterprise Data; NetApp FAS Storage (NFS-based); NFS Connector; Hadoop Analytics with MapReduce, HBase, Spark, YARN, HDFS]
Hadoop diagram courtesy Hortonworks

Next Steps
- Download information at netapp.com/hadoop: Technical Reports, Solution Guides, Cisco Validated Designs, Solution Briefs
- Start a POC: engage NetApp or partner
- Contact us: gustav.horn@netapp.com or iyerv@netapp.com or NetApp System Engineer

Thank You!