White Paper. EMC Isilon: A Scalable Storage Platform for Big Data. April 2014




White Paper

EMC Isilon: A Scalable Storage Platform for Big Data

By Nik Rouda, Senior Analyst and Terri McClure, Senior Analyst

April 2014

This ESG White Paper was commissioned by EMC Isilon and is distributed under license from ESG.

Contents

Big Data Needs Big Storage
Businesses Want Big Data to Have a Big Impact
Storage Selection Criteria for Big Data
Advantages of Isilon Scale-out Storage for Hadoop
Many Protocols, but Only One Copy of Data
In-place Analytics with Your Favorite Flavor of Hadoop
Enterprise-class Storage Increases Efficiency and Safety
The Bigger Truth

All trademark names are property of their respective companies. Information contained in this publication has been obtained by sources The Enterprise Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change from time to time. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of The Enterprise Strategy Group, Inc., is in violation of U.S. copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should you have any questions, please contact ESG Client Relations at 508.482.0188.

Big Data Needs Big Storage

A rising tide of information is being collected, processed, and analyzed by enterprises around the world, but this flood of data brings as many challenges as it does solutions. As companies become more data driven in a wide range of activities, they will need their production big data implementations to meet common enterprise requirements such as high performance, scalability, availability, security, and compliance. Underlying the databases and analytics engines, other parts of the technology stack are critical components for providing these qualities, not the least of which is the choice of storage platform. The storage system can make or break a big data implementation. EMC Isilon is a leader in scale-out storage and offers many advantages as a foundation for big data analytics, which are built on the company's years of experience in large enterprise data centers.

Businesses Want Big Data to Have a Big Impact

There has been no shortage of press on the many practical applications of big data in all industries and across all lines of business. Many of these stories are compelling anecdotes, and are often specific to the particular organization's goals and activities. However, some common data analytics trends can be found across industries. ESG recently conducted its 2014 IT Spending Intentions Survey and identified the top business benefits desired by respondent organizations from their investments in business intelligence and analytics.[1]

Figure 1. Business Benefits from Data Investments

What business benefits do you expect to gain from your investments in the area of business intelligence, analytics, and big data? (Percent of respondents, N=187, multiple responses accepted)

- Improved operational efficiency: 59%
- Reduced risk around business decisions and strategy: 42%
- Higher quality products/services: 41%
- Incremental cost savings: 39%
- More insights into future scenarios or outcomes: 36%
- More insights into historical results: 35%
- Faster tactical response to shifting customer views: 34%
- Uncover new market opportunities: 31%
- Quicker time to market for products/services: 30%
- Reduced risk of product defects: 26%

Source: Enterprise Strategy Group, 2014.

[1] Source: ESG Research Report, 2014 IT Spending Intentions Survey, February 2014. All ESG references and charts in this white paper have been taken from this research report.

Implicit in all these goals is the need to serve the business not just with more data, but also with timelier reporting. For many companies, the time needed to get an answer is the key criterion for the adoption of data-driven decision making. Quarterly batch reports no longer meet the need; instead, daily updates, real-time alerts, and ad hoc querying are becoming standard requirements for analysts and executives.

Vendors are now bringing a breadth of data technologies into play, ranging from traditional relational databases to NoSQL and Hadoop, and from advanced analytics applications to data visualization and reporting tools. With these tools come accompanying options in architecture models: commodity servers, ready-made appliances, or cloud services, and open source or proprietary software. Each of these choices will have an impact on the overall capabilities of the solution, affecting end-user perceptions of performance, flexibility, and availability.

These high expectations from business executives put a lot of pressure on enterprise IT departments to deliver a well-implemented solution. This isn't usually an easy task considering that big data initiatives often involve the integration of many new data sources, big data platforms, and analytics applications with existing data warehouses and transactional databases. This architectural complexity spans many IT disciplines, with dependencies on applications, servers, networks, and storage. Looking at the ESG research in Figure 2, it is clear that many of the top ten most-cited IT priorities are directly related to proper management of enterprise data, including big data.

Figure 2. Top Ten Most Important IT Priorities for 2014

Which of the following would you consider to be your organization's most important IT priorities over the next 12 months? (Percent of respondents, N=562, ten responses accepted)

- Increase use of server virtualization: 32%
- Information security initiatives: 32%
- Improve data backup and recovery: 29%
- Manage data growth: 25%
- Desktop virtualization: 24%
- Regulatory compliance initiatives: 23%
- Major application deployments or upgrades: 23%
- Use cloud infrastructure services: 23%
- Business intelligence/data analytics initiatives: 23%
- Build a private cloud infrastructure: 22%

Source: Enterprise Strategy Group, 2014.

These issues are sometimes glibly underestimated with the assumption that the Hadoop Distributed File System (HDFS) provides cheap and cheerful provisions for storing and managing massive volumes of big data. The truth is that storage requirements for the enterprise are becoming increasingly demanding, especially as more decision makers become reliant on big data insights.

Storage Selection Criteria for Big Data

As noted, the choice of storage platform underpins the overall efficacy of the technology stack, and will have ramifications that must be carefully evaluated. There are a number of factors to consider, including:

- Scalability and efficiency will have an obvious impact on the ability to ingest and store data. Particular attention should be paid to mechanisms that reduce total footprint, such as deduplication and compression, and to the redundancy required to preserve data against loss. The human capital required to manage the system should also be analyzed in the efficiency category because organizations cannot afford to keep adding staff to manage the environment as data grows.

- Total cost of ownership (TCO) matters as the big data initiative benefits are weighed against both capital and operating expense, including maintenance, support, footprint, and human capital. A reduced cost structure should lead to more data stored (because organizations can now afford to) and more valuable insights realized (as a benefit of having more data to analyze).

- Performance seems like an obvious requirement, but it can be elusive as more users do more comprehensive analysis with larger data volumes. Finding a storage system that can handle the I/O demands of the environment, including any extract, transform, and load (ETL) to other data repositories, is critical. Data location has a significant impact on this, especially if large amounts need to be moved around before analytics processing.

- Data protection, security, and governance utilities are becoming mandatory for big data environments. As data lakes or data hubs start to encapsulate all manner of sensitive data in one central location, that data clearly needs to be treated with great care. Compliance with relevant government and industry regulations must be addressed directly and explicitly. As a newer technology, Hadoop by itself isn't as mature in these areas as enterprises may require.

- Accessibility may be one of the least recognized attributes of the storage decision, but it can provide significant advantages in flexibility by enabling different groups or tools to harness the data without moving it into other platforms before processing can begin. Access controls also must be well developed and granular.

These are all important factors for deciding how appropriate a storage platform is for big data environments. The traditional options for storage platforms include commodity direct attached storage (DAS), storage area network (SAN), and network attached storage (NAS). Conventional wisdom has been to use commodity storage in the form of internal drives, but when the impact of storage infrastructure choices on data management and analytics is weighed, conventional wisdom falls short.

Advantages of Isilon Scale-out Storage for Hadoop

Today, there is still a relative immaturity of functionality and robustness in many big data technology stacks when it comes to storage. Although Hadoop and HDFS can simplify the model for scaling on commodity servers with DAS, some alternatives provide compelling advantages for the enterprise and help overcome some of the challenges associated with the traditional approach.

Challenges with the embedded storage/DAS approach include data protection, data leverage, elongated business processes, and, surprisingly, cost. On the data protection front, HDFS protects data by keeping multiple copies of it, meaning it consumes a lot of storage. Both data leverage and business processes are impacted by the fact that data is only accessible via HDFS and is not accessible to other applications that require other interfaces (i.e., RESTful object-based applications or NFS/CIFS/SMB file-based applications).
This means ETL operations need to be performed to ingest data or leverage data in other business processes, elongating those processes each time the ETL step runs. It also means that organizations must keep multiple repositories of the same data, in multiple data formats, to support different business processes. So on the surface, commodity DAS configurations may sound attractive and may indeed be a good fit for many organizations, but companies that need to analyze data from multiple sources or leverage it to support multiple business processes incur further costs for additional infrastructure and may need to investigate alternative approaches.

One alternative approach that helps overcome these challenges is the adoption of a shared storage platform that has been designed to meet enterprise IT operations requirements. EMC Isilon is a prime example: it brings Hadoop to your data, instead of moving all your data to Hadoop clusters. It lets users create a central data hub that supports multiple applications and business processes, reducing costs and business cycles by eliminating most ETL requirements.

Many Protocols, but Only One Copy of Data

Isilon is a flexible storage platform that supports multiprotocol access to a single data object, eliminating the upfront protocol decision because NFS, RESTful objects, HTTP, FTP, SMB, and HDFS are all supported. So users can ingest an object from a web app and access it via NFS to edit it. Or better yet, a user might take web logs directly from a web application, rather than exporting them to a spreadsheet, and access them via the native HDFS interface to run analytics. This ability to make only one copy available for multiple uses is a major benefit for reducing overall storage costs and cycle time because there is no need to export data to multiple systems for the various use cases. A single repository also greatly simplifies compliance audits, because auditors no longer need to chase after many distinct locations and sources.

Figure 3. Multi-protocol Access to Isilon's OneFS Operating System

Source: EMC/Isilon, 2014.
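The effect of keeping one copy for all protocols can be illustrated with simple arithmetic. The sketch below is only a rough model under stated assumptions: the dataset size, the number of consumers, the export rate, and all names in the code are hypothetical, and data protection overhead is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    copies: int          # number of full copies of the dataset
    storage_tb: float    # total capacity holding those copies
    export_hours: float  # time spent exporting data between systems


def per_consumer_copies(dataset_tb: float, consumers: int,
                        export_tb_per_hour: float) -> Footprint:
    """Every consumer after the first gets the data exported into its own repository."""
    if consumers < 1 or export_tb_per_hour <= 0:
        raise ValueError("need at least one consumer and a positive export rate")
    exports = consumers - 1
    return Footprint(
        copies=consumers,
        storage_tb=dataset_tb * consumers,
        export_hours=dataset_tb * exports / export_tb_per_hour,
    )


def single_shared_copy(dataset_tb: float, consumers: int) -> Footprint:
    """All consumers read the same copy through their own protocol; nothing is exported."""
    if consumers < 1:
        raise ValueError("need at least one consumer")
    return Footprint(copies=1, storage_tb=dataset_tb, export_hours=0.0)


# Hypothetical workload: 100 TB of web logs used by three consumers
# (a web application, an NFS-based tool, and a Hadoop job), with an
# assumed export rate of 5 TB per hour between systems.
separate = per_consumer_copies(100.0, consumers=3, export_tb_per_hour=5.0)
shared = single_shared_copy(100.0, consumers=3)

print(f"separate repositories: {separate.storage_tb:.0f} TB, "
      f"{separate.export_hours:.0f} h of exports")
print(f"single shared copy:    {shared.storage_tb:.0f} TB, "
      f"{shared.export_hours:.0f} h of exports")
```

In this model both the storage footprint and the export time grow with the number of consumers when each one needs its own copy, while the shared copy stays constant; real savings depend on the actual workloads and transfer rates.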

In-place Analytics with Your Favorite Flavor of Hadoop

Building on the support for multiprotocol access to a single data object, organizations can effectively do in-place analytics on data without a lengthy data ingest from other primary storage sources into the Hadoop system, which very often leads to a faster overall time to results. Although more specialized data layouts and approaches can sometimes be faster in querying and analysis, with Isilon data analysis can start immediately, and the effort and startup delay saved by skipping ETL can often outweigh the difference. In addition, concurrent instances of different Hadoop distributions can run in parallel on the same underlying storage system, giving much more flexibility to leverage the relative strengths of each, again without the need to move large quantities of data around.

Enterprise-class Storage Increases Efficiency and Safety

Although HDFS may be a reliable and scalable model for collecting and storing the high volumes and varieties of data in a typical big data environment, it isn't necessarily the most efficient. Some features that provide that robustness on commodity hardware may actually detract from overall efficiency. Mirroring with Hadoop direct attached storage is a good example: it typically causes three to five times redundancy, which significantly reduces the effective usage ratio of total drive capacity. Isilon, with built-in data protection, high availability, and general robustness, can instead run at 80% utilization of capacity (compared with 20-33% with HDFS), and this is further improved by data reduction of up to 30% with SmartDedupe. All this helps reduce the storage footprint, bringing associated cost reductions in energy and space consumption in the data center.
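To make the utilization figures above concrete, the following minimal sketch estimates the raw capacity needed to hold a given amount of data. The 80% utilization, the 20-33% range for HDFS, and the 30% data reduction come from the text; the 500 TB dataset and the function name are hypothetical, and the sketch assumes the best-case ("up to") SmartDedupe reduction applies to the whole dataset.

```python
def raw_capacity_needed(data_tb: float, utilization: float,
                        reduction: float = 0.0) -> float:
    """Raw capacity (TB) needed to store data_tb of logical data.

    utilization: fraction of raw capacity that holds unique data (0 < u <= 1).
    reduction:   fraction of the data removed by deduplication (0 <= r < 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return data_tb * (1 - reduction) / utilization


DATA_TB = 500.0  # hypothetical dataset size

scenarios = {
    # Three copies of every block: 1/3 (about 33%) of raw capacity is unique data.
    "HDFS, 3 copies": raw_capacity_needed(DATA_TB, utilization=1 / 3),
    # Five copies of every block: 1/5 (20%) of raw capacity is unique data.
    "HDFS, 5 copies": raw_capacity_needed(DATA_TB, utilization=1 / 5),
    # 80% utilization, as quoted for Isilon in the text.
    "Isilon, 80% utilization": raw_capacity_needed(DATA_TB, utilization=0.80),
    # Same, with the best-case 30% reduction assumed for SmartDedupe.
    "Isilon, 80% + 30% dedupe": raw_capacity_needed(
        DATA_TB, utilization=0.80, reduction=0.30),
}

for name, raw_tb in scenarios.items():
    print(f"{name:26s} {raw_tb:8.1f} TB raw")
```

Under these assumptions, the same 500 TB of data needs 1,500 to 2,500 TB of raw capacity with three to five HDFS copies, 625 TB at 80% utilization, and 437.5 TB if the full 30% reduction is achieved.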
Separating server and storage, so that each grows independently instead of always adding another fixed-unit commodity server, also allows more targeted scaling of the environment to meet the actual workloads. From a governance and security point of view, the Isilon storage system offers write once, read many (WORM) compliance for archival to meet government and industry regulations, standard Kerberos authentication, and access control lists (ACLs) to make sure the user touching the central data hub is authorized. All of these features combine to reduce initial cost of purchase, ongoing operational costs, and risk of failure or security breach of sensitive information.

The Bigger Truth

Having explored the rapid growth of big data in adoption and importance, and the potential impacts of the underlying infrastructure, it is clear that enterprises should rethink the architectural implications of their storage choices for their big data initiatives. There are multiple advantages in taking a shared storage approach, covering a wide range of desired characteristics including increased efficiency, reduced total cost, overall speed to answer, reduced risk of data loss or inappropriate access, and analytics flexibility. Isilon is breaking ground in challenging the default storage assumptions of big data practitioners, and its approach is well worth evaluating on its merits against the de facto standard of direct attached storage in commodity server hardware. Isilon's long history of building flexible, scalable storage platforms for demanding enterprise requirements helps it address many common challenges of big data storage, and this experience should serve customers well. In particular, current Isilon customers should experiment with running Hadoop on their existing systems; they may well find that the right answer is already in place.

20 Asylum Street Milford, MA 01757 Tel: 508.482.0188 Fax: 508.482.0218 www.esg-global.com