EMC's Enterprise Hadoop Solution. By Julie Lockner, Senior Analyst, and Terri McClure, Senior Analyst


White Paper: EMC's Enterprise Hadoop Solution: Isilon Scale-out NAS and Greenplum HD
By Julie Lockner, Senior Analyst, and Terri McClure, Senior Analyst
February 2012
This ESG White Paper was commissioned by EMC Corporation and is distributed under license from ESG.

Contents
Introduction
Market Trends for Big Data Analytics Needs
Leveraging Hadoop for Big Data Analytics
Utilizing Scale-out NAS for Big Data Analytics
Completing the Package: EMC's Isilon and Greenplum HD Consolidated Platform
The Bigger Truth

All trademark names are property of their respective companies. Information contained in this publication has been obtained from sources The Enterprise Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change from time to time. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of The Enterprise Strategy Group, Inc., is in violation of U.S. copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should you have any questions, please contact ESG Client Relations at 508.482.0188.

Introduction

As organizations focus on taking full advantage of the value contained in their information assets, they are finding that collecting that data is a double-edged sword. Faced with the challenge of managing data that is growing at an almost overwhelming rate (Facebook alone now collects more than 100 terabytes of data per day), most organizations regard managing data growth, provisioning storage, and performing fast, reliable big data analytics as their top priorities. ESG defines big data analytics as the practice of analyzing entire data sets at a time, not limited by how the data is structured, using purpose-built technology to complete simple-to-complex data analytics tasks in a timely, cost-effective manner. In a recent ESG survey of more than 100 organizations, respondents racing to keep up with data growth rates consistently pinpointed data management and data storage as key challenges and key obstacles to developing refined data analytics capabilities within their organizations. All agreed that, while definitely adding intrinsic value to their organizations' wealth of knowledge, big data is also placing their current IT infrastructures under extreme stress, and many are searching for a means of developing a scalable infrastructure within their data centers. So what exactly are these organizations looking for? With their traditional platforms severely limited in their ability to support big data analytics, more and more companies are researching new solutions to address these challenges, and are concluding that their solution must be a consolidated, scalable platform that can support big data applications with enterprise-class service.
An emerging MapReduce platform layered on a distributed file system (Hadoop and HDFS) is one of the solutions more recently being selected by companies to address their big data analytics needs. It is a solution specifically designed to scale out, allowing it to maintain consistent performance levels as the data set to be processed and analyzed grows.

Market Trends for Big Data Analytics Needs

According to ESG's latest data management survey, more than half (55%) of respondent organizations identified improved business agility as the main benefit they expected from deploying a new data analytics solution (see Figure 1).1 Following at a close second was the ability to complete analytics in a shorter period of time. With both speed and agility serving as key factors in the selection process, organizations are looking for tools and technology that will support and grow with them as they struggle under the information onslaught, and analytical platforms based on MapReduce and Apache Hadoop seem to be the frontrunners in the field. In fact, Apache Hadoop has rapidly taken the lead as the preferred solution for big data analytics across unstructured data, and Hadoop-based batch processing of unstructured and structured data at massive levels has led to a drastic change in the way organizations approach big data analytics.

1 Source: ESG Research Report, The Impact of Big Data on Data Analytics, September 2011.
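The MapReduce model mentioned above can be illustrated with a minimal, in-process sketch. This is plain Python, not Hadoop itself: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step aggregates each group (word count is the canonical example).

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key (word).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group; for word count, sum the 1s.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data analytics", "big data platforms"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'analytics': 1, 'platforms': 1}
```

In a real Hadoop deployment the same three phases run in parallel across the cluster, with HDFS holding the input and output data.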

Figure 1. Top Expected Benefits from Deploying a New Analytics Platform
Which of the following benefits does your organization expect to derive from deploying a new data analytics solution? (Percent of respondents, N=102, multiple responses accepted)
Improved business agility: 55%
Ability to complete analytics in a shorter period of time: 44%
Easier to manage: 43%
Ability to complete analytics on larger data sets: 34%
Reduced deployment time and cost: 34%
Ability to leverage existing resources (i.e., staff): 30%
Reduced infrastructure costs: 26%
Simplified data integration: 26%
Ability to accommodate new data types: 22%
Source: Enterprise Strategy Group, 2011.

Leveraging Hadoop for Big Data Analytics

Why are so many organizations leveraging Hadoop in their analytics processing? For one thing, Hadoop gives organizations the ability to store and analyze large volumes of data independent of whether the data is structured, unstructured, relational, or non-relational. Hadoop acts as a complement to traditional data analytics platforms such as relational databases: it augments their ability to perform interactive SQL analysis on data sources that were difficult or impossible to access or process due to the constraints the data placed on those platforms. Hadoop combines a MapReduce framework with the Hadoop Distributed File System (HDFS) and supports data processing and analytics tasks equally suited to relational and non-relationally structured data (e.g., text-based data, log files, machine-generated data, or web traffic details). As with any emerging technology, Hadoop still has some innate issues to overcome. For example:

The Apache open source Hadoop distribution has a single point of failure: the NameNode, which manages file metadata stored in HDFS. The NameNode is the centerpiece of an HDFS file system.
It keeps the directory tree of all files in the file system and serves as the tracker that indicates where, across the cluster, data resides. Client applications consult the NameNode when they wish to locate, add, copy, move, or delete a file; the NameNode responds by presenting a list of servers where the data resides. However, there is a distinct danger with this approach: when the NameNode goes down, the file system goes offline. This is an area the Hadoop open source community is looking to address, but in the meantime it is a concern for many organizations looking for a high availability component in their big data analytics platform.

Hadoop currently lacks enterprise-class data protection features. Developers must manually set the HDFS data replication parameter (default is 3) to identify the number of copies HDFS should make of each file for data protection, rather than leveraging RAID. Relying on developers to determine the number of copies to make can lead to very inefficient use of storage.

Hadoop requires an incremental investment in people who are knowledgeable and skilled at Hadoop.

Traditional Hadoop deployments may not be easily integrated into existing enterprise applications. Data is accessed via Hadoop and HDFS protocols or via a SQL interface, requiring the development of programming interfaces.

Hadoop requires an investment in building dedicated compute clusters, which often produces isolated storage/compute resources and poor utilization of storage or CPU resources, depending on which resource is needed. In a common Hadoop reference architecture, the compute cluster leverages direct attached storage (DAS) that is not easily shareable with other applications. Organizations must size the Hadoop compute cluster for high compute requirements and/or high storage capacity requirements during initial deployment; in the common DAS storage model, organizations cannot easily change the ratio of compute-to-storage after the cluster is deployed. Organizations that want to use Hadoop for different types of workloads will need to deploy multiple Hadoop clusters to align with their needs.

For organizations that would like a tiered storage layout to optimize the cost of data residing in a Hadoop cluster, administrators would need to manually configure clusters and HDFS folders to align with the tiers, and Hadoop developers would need to write programs to migrate data between tiers.

For organizations that require a data protection strategy with a disaster recovery (DR) plan, developers would need to coordinate with administrators to build backup processes at the HDFS layer. For DR, developers typically write data sets to two separate co-located Hadoop clusters.
This may pose a challenge for organizations that want to apply existing data protection and DR policies to data residing in a Hadoop cluster.

Based on these factors, Hadoop deployments, while strong in many areas, may represent serious challenges to already-overworked IT teams.

Utilizing Scale-out NAS for Big Data Analytics

Enter scale-out network attached storage (NAS). NAS systems are easy to install and deploy, affordable, and reliable, keeping support time to a minimum. Scale-out NAS is a good match for big data analytics environments. ESG defines big data in general as data sets that exceed the boundaries and sizes of normal processing capabilities, forcing a non-traditional approach. Scale-out NAS is that non-traditional approach from an external networked storage standpoint. It is designed to scale beyond the limits of traditional scale-up systems while maintaining performance and availability as the data set grows, something traditional storage systems just can't match. In fact, when using a shared networked storage system to support big data analytics, a scale-out architecture is a core requirement. Scale-out NAS offers ease of capacity expansion and ease of administration. Recent ESG research revealed that 86% of midmarket organizations and 84% of enterprise organizations are using NAS for some tier of storage.2 As shown in Figure 2, the capacity of NAS storage (45% of respondents' total disk storage capacity) outweighs storage area network (SAN) storage (36%) and direct attached storage (DAS) (31%).3 SAN systems transfer data over the network in the form of disk blocks, while NAS systems transfer file data. SAN and NAS systems are networked, while DAS is dedicated to the server to which it is attached.

2 Source: ESG Research Report, Scale-out Storage Market Trends, December 2010.
3 Ibid.
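The HDFS replication behavior discussed in the data protection points above corresponds, in Apache Hadoop, to the dfs.replication property, typically set cluster-wide in hdfs-site.xml (a minimal sketch; the value shown is the Hadoop default and is illustrative):

```xml
<configuration>
  <!-- Number of copies HDFS keeps of each block; the Hadoop default is 3. -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

Applications can also override this per file when writing, which is why the paper notes that storage efficiency ends up depending on developer discipline.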

Figure 2. Enterprise Storage Capacity, by Storage Type
Approximately what percentage of your organization's total disk-based storage capacity would you say is associated with each storage type? (Mean, N=306)
Network-attached storage (NAS): 45%
Storage area network (SAN): 36%
Direct-attached storage (DAS): 19%
Source: Enterprise Strategy Group, 2011.

Scale-out NAS systems scale horizontally, adding front-end processing power and back-end capacity via newly added processor or capacity nodes. Scale-out storage platforms can increase performance, capacity, or throughput by adding resources (e.g., processors, memory, host interfaces) as loosely coupled systems composed of nodes that work side by side, in parallel. Scale-out NAS systems are designed to scale performance and capacity independently, allowing users to maintain performance in rapidly growing file environments that require handling of very large file sizes, without the performance or management limitations associated with scale-up storage systems. Benefits include better performance, higher availability, higher storage utilization, easier overall management, and the need for fewer administrators (see Figure 3).

Commonalities between Hadoop and scale-out NAS offer an obvious opportunity for convergence into a solution that can ensure organizations will be able to:

Take advantage of enterprise data protection features such as snapshots, replication, and backup with the shared storage model applied to Hadoop analytic processing

Achieve higher utilization from storage and compute resources

Leverage Hadoop storage resources for additional applications and data center operations
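The storage-utilization point above can be made concrete with back-of-the-envelope arithmetic (illustrative numbers only): with HDFS's default 3x replication, usable capacity is one third of raw capacity, whereas a parity-protected layout spread across many nodes (an assumed N+2 scheme here, broadly similar in spirit to what scale-out NAS systems use) keeps the overhead much lower.

```python
def usable_capacity(raw_tb, overhead_factor):
    # overhead_factor = raw bytes stored per byte of user data.
    return raw_tb / overhead_factor

raw = 300.0  # TB of raw disk; illustrative figure, not from the paper

# HDFS default: every block stored three times -> factor of 3.
hdfs_usable = usable_capacity(raw, 3.0)

# Hypothetical parity protection across a 20-unit pool with 2 parity units:
# 20 units hold 18 units of user data -> overhead factor 20/18.
parity_usable = usable_capacity(raw, 20.0 / 18.0)

print(hdfs_usable, parity_usable)  # 100.0 vs 270.0 TB usable
```

The exact overhead of a given Isilon protection level differs, but the gap between mirroring three full copies and striping with parity is the efficiency argument being made here.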

Figure 3. Benefits Realized in Scale-out Storage Deployments
Which of the following benefits has your organization realized as the result of deploying scale-out storage? (Percent of respondents, N=56, multiple responses accepted)
Improved scalability: 57%
Improved performance (I/Os): 45%
Improved performance (throughput): 39%
Improved data availability: 36%
Faster deployments/provisioning times: 34%
Improved storage hardware utilization: 29%
Reduced operational expenditures: 27%
Ability to more effectively support specific applications: 27%
Improved data management: 27%
Ability to manage more storage capacity with fewer administrator resources: 25%
Reduced capital expenditures: 18%
Reduced training time/costs: 14%
Source: Enterprise Strategy Group, 2011.

Completing the Package: EMC's Isilon and Greenplum HD Consolidated Platform

EMC, in response to the growing adoption of Hadoop, is addressing a market need for a Hadoop platform that provides enterprise data protection and management features on purpose-built infrastructure. EMC is delivering HDFS protocol support for its Isilon scale-out NAS storage platform. In addition to supporting the HDFS protocol on Isilon's enterprise storage platform, EMC's solution includes testing and certification with Greenplum's distribution of Apache Hadoop to provide computation capabilities for the data stored in Isilon. EMC is offering a single-vendor support model for Hadoop via Isilon's scale-out NAS storage infrastructure, Greenplum HD's analytics capabilities (including parallel data access to the Greenplum database), and EMC consulting, configuration, training, and support services.
The consolidated platform combines EMC Isilon's native storage support of the HDFS protocol with Greenplum HD's data processing and analytics framework, giving organizations multiple benefits over Hadoop deployments based on the direct attached storage reference architecture.

High availability to support a wide range of use cases: Isilon's highly available OneFS operating system essentially eliminates the NameNode as a single point of failure in the HDFS storage layer of Hadoop. Because the metadata is distributed across the Isilon cluster, every Isilon node acts as a NameNode. This expands the addressable use cases for Hadoop big data analytics to business- and mission-critical applications that require higher levels of availability than a standard Hadoop configuration using DAS can provide.

Improved Data Access and Loading: Isilon OneFS supports HDFS as well as industry-standard protocols such as NFS and CIFS, among others. That improves the painful and resource-intensive data staging and loading process inherent to traditional Apache Hadoop deployments that use only HDFS, and eliminates the need for excess copies of files. Organizational big data is loaded onto Isilon's OneFS storage layer via industry-standard protocols like NFS, CIFS, HTTP, or FTP, yet Hadoop applications can directly access that data via HDFS without copying or moving the original source data into a different file system. OneFS then manages the protection of the information, relieving developers and administrators of the responsibility for properly configuring the performance and protection parameters.

Ease of Use and Deployment: By providing an integrated Hadoop platform delivered on purpose-built infrastructure, the Greenplum HD and Isilon solution will be faster and easier to deploy than the do-it-yourself hardware and software configuration process required of traditional Hadoop deployments, which can be time-consuming and resource-intensive.

Scale and Efficiency: EMC's segmented Hadoop storage and compute solution enables higher storage and CPU utilization rates than traditional DAS deployments of Hadoop by enabling customers to independently scale performance or capacity as needed and to create separate storage pools that meet varying performance requirements within a single system. Whether a customer's analytics needs require a more compute-intensive or a more capacity-driven configuration, both can be satisfied by the combined implementation of Greenplum HD and Isilon OneFS. Storage efficiencies are further extended through Isilon's SmartPools policy-based automated tiering.
By matching the value and performance requirements of data to the appropriate performance tier of storage (rather than leaving all of it on relatively expensive primary storage, as in a traditional DAS model), costs are reduced. Thanks to the ease of deploying Isilon scale-out NAS, extending that simplicity to Hadoop eliminates the need to develop applications via programming file I/O interfaces to store and analyze massive data volumes. It enables organizations to deploy an enterprise-class NAS storage platform that can scale with those volumes but is also general purpose, so organizations can store data only once and support other business users and workloads.

The Bigger Truth

As Hadoop continues to evolve, organizations are quickly discovering how it can be used to create a substantial competitive advantage by using its capabilities to access and mine valuable big data. However, along with its power and massive capabilities, Hadoop also introduces issues commonly found in any new and evolving technology. Companies are, therefore, looking for a consolidated, scalable platform that can support big data applications with enterprise-class service attributes. With EMC's delivery of its Isilon scale-out NAS storage platform and its support for the HDFS protocol, organizations now have an option that addresses the Hadoop issues while fully supporting its innate capabilities. But that is just the first step. The second step is leveraging HDFS support to deliver a fully integrated and tested solution. Playing upon the virtues of Hadoop, EMC's Isilon and Greenplum HD solution adds the capabilities of simple data loading and access using a native network file system interface, plus end-to-end manageability including simple cluster deployment, automatic failure detection and notification, multi-site management, and rolling upgrades. Finally, with its commitment to fully supporting the entire stack, EMC offers a single-vendor support model for each component, including hardware (storage, compute, and network) and software (Greenplum HD and the Greenplum database), for comprehensive data analytics. EMC's Isilon and Greenplum HD solution seems destined to become a viable option for enterprise-class Hadoop-based deployments for any organization harvesting big data and seeking to deploy big data analytics applications to support more mission-critical business processes.

20 Asylum Street Milford, MA 01757 Tel: 508.482.0188 Fax: 508.482.0218 www.enterprisestrategygroup.com