Modernizing Hadoop Architecture for Superior Scalability, Efficiency & Productive Throughput




DDN Technical Brief

A Fundamentally Different Approach to Enterprise Analytics Architecture: A Scalable Unit Design Leveraging Shared High-Throughput Storage to Minimize Compute TCO

Abstract: This paper examines the limitations of a traditional Hadoop architecture built on commodity compute with direct-attached storage (DAS). It then reviews the design imperatives of DataDirect Networks' hScaler Apache Hadoop appliance architecture and how it has been engineered to address the limitations that plague today's purely commodity approaches.

The Impetus for Today's Hadoop Design

At a time when commodity networking operated at 10 MB/s and individual disks could each achieve 80 MB/s of data transfer performance (and multiple disks could be configured either on a network or in a server chassis), the obvious mismatch in performance identified by data center engineers and analysts highlighted severe efficiency challenges in then-current system designs and the need for better approaches to data-intensive computing.

As a result of the imbalance between network and storage resources in standard data centers, and the perceived high cost of enterprise shared storage, data-intensive processing organizations began to embrace new methods of processing data, in which the processing routines are brought to the data, which lives in commodity computers that participate in distributed processing of large analytic queries. The most popular approach to this style of processing today is Apache Hadoop. Hadoop distributes applications across commodity hardware in a shared-nothing fashion, where each commodity server independently owns its data and data is replicated across several commodity nodes for resiliency and performance.

Hadoop implements a computational model known as map/reduce: data sets are divided into fragments, the fragments are distributed uniformly across a commodity processing cluster, and the nodes process them in parallel (a minimal example follows at the end of this section). This approach was developed to minimize the cost and performance overhead of data movement across commodity networks and to accelerate data processing.

Since the emergence of Hadoop, the limitations of hard drive physics have created a new imbalance: hard drive performance has not kept pace with increases in networking and processing performance (see Table 1). Today, as high-speed data center networking approaches 100 Gb/s, the comparatively slow increase in disk performance has made inefficient spinning disk technology the new data processing bottleneck for large-scale Hadoop implementations. While today's systems can still economically utilize the performance of spinning media (as opposed to SSDs, since the workload remains predominately throughput-oriented), the classic Hadoop function-shipping model is challenged by the ever-growing need for more node-local spinning disks, and the performance utilization of this media is being challenged by the scale-out approaches of today's Hadoop data protection and distribution software.

Table 1: Commodity Computing Advancements

                          2003    2013    Delta
    HDD Bandwidth (MB/s)    40     120      3x
    CPU Cores per Socket     2      16      8x
    Ethernet (Gb/s)          1      40     40x
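To make the map/reduce model described above concrete, the sketch below shows a minimal Hadoop job in Java: the map phase emits key/value pairs from each input fragment on whichever node processes it, and the reduce phase aggregates the values delivered for each key by the shuffle. This is a generic word-count illustration using the standard org.apache.hadoop.mapreduce API, not DDN or hScaler code; class names and paths are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map phase: runs against each input split, ideally on the node holding the data.
      public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            ctx.write(word, ONE); // emitted pairs are shuffled across the network to reducers
          }
        }
      }

      // Reduce phase: aggregates all values the shuffle delivered for one key.
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The framework handles fragment distribution, scheduling and the shuffle; the sections that follow examine where that machinery becomes a bottleneck.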

Hadoop Systems Components & Bottlenecks

To illustrate the areas of optimization that are possible with Apache Hadoop, we review the core design tenets and the associated impact of configuration choices on cluster efficiency.

Data Protection: The data protection layer in Hadoop is commonly implemented as three-way replicated storage, in which HDFS (the Hadoop Distributed File System, a Java-based namespace and data protection framework) writes data sequentially from the host to each of the unique nodes. This method of data protection can benefit from relinquishing the responsibility for replication from HDFS. By treating HDFS as a conventional file system, centralized storage can be employed to reduce the number of data copies to one, using high-speed RAID or erasure coding techniques to protect the data. Freeing the compute node from the burden of data replication can increase Hadoop node performance by up to 50%. An ancillary benefit of this approach is a reduction in the number of hard drives in the Hadoop architecture by as much as 60%, with the resulting economic, data center and environmental benefits. (A minimal configuration sketch appears at the end of this section.)

Job Affinity: In large cluster configurations, Hadoop jobs routinely have to process data that is not local to the node, breaking the map/reduce processing paradigm. The amount of data retrieved from other nodes on the network during a particular Hadoop job can be as high as 30%. The use of centralized, RDMA-attached storage can yield an 80% decrease in I/O wait times for remote data retrieval, compared to transferring data via TCP/IP.

Map/Reduce Shuffle Efficiency: While commodity networks are now capable of delivering performance at rates of 56 Gb/s and greater, conventional network protocols are unable to encapsulate data efficiently, and TCP/IP overhead continues to consume a substantial portion of CPU cycles during these data-intensive operations. Historically, SAN and HPC networking technologies have been applied to this problem, making compute nodes more efficient through protocols that maximize bandwidth while minimizing CPU overhead.

Table 2: Hadoop Compute Comparisons (in seconds)

    Dataset    1 x 40GbE    1 x 56Gb IB    Gain
    80GB            40           439        43%
    500GB          628           865        75%
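As a concrete illustration of the data protection point above: HDFS replication is governed by the dfs.replication property, and when a RAID- or erasure-code-protected shared store provides durability beneath the DataNodes, that factor could plausibly be lowered from the default of 3 to 1 so compute nodes stop shipping redundant copies. The Java sketch below shows an assumed per-write override using the standard Hadoop FileSystem API; a cluster-wide change would normally be made in hdfs-site.xml instead, and the path is illustrative. This is a sketch of the general technique, not DDN's implementation.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SingleCopyWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumption: the underlying shared storage (RAID/erasure-coded array) already
        // protects the data, so HDFS-level replication is reduced to a single copy.
        // With the default of 3, every block write costs roughly 3x the disk and network I/O.
        conf.set("dfs.replication", "1");

        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/data/example/output.bin"); // illustrative path
        try (FSDataOutputStream stream = fs.create(out)) {
          stream.writeBytes("single-copy write; durability delegated to the storage array\n");
        }
      }
    }

The essential caveat is that a replication factor of 1 is only safe when the storage layer itself is protected; on plain DAS it would remove HDFS's only redundancy.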

Although it may seem counter-intuitive that a Hadoop system demands high-speed networking when the processing is shipped to the data, the Shuffle phase of map/reduce operations can in fact move a large amount of data across a Hadoop cluster, and the speed of this operation is a direct byproduct of the networking and protocol choices made at the time of cluster architecture. Today, RDMA encapsulation of Shuffle data, using InfiniBand or RDMA over Converged Ethernet (RoCE) networking, is proving to deliver dramatic efficiency gains for Hadoop clusters.

Data Nodes and Compute Nodes: Consider the I/O profile of a typical Hadoop job. The job pauses and waits for the CPU before trying to fetch the next set of data. This serialization causes the I/O subsystem to swing between saturated and idle, wasting roughly 30% of a job's run time. Establishing compute-only nodes in a Hadoop environment can deliver material benefits versus the conventional one-node-fits-all approach. This model provides much better sequential access to data storage while dramatically reducing job resets and pauses. This parallelization is a new approach to job processing and can speed up jobs at a hyper-linear rate, making the cluster faster as it grows. By leveraging high-throughput, RDMA-connected storage, compute-only nodes can save as much as 30% of the time they would otherwise spend on data pipelining. (A schematic sketch of overlapping I/O with compute appears at the end of this section.)

Data Center Packaging: When discussing efficiency, it is easy to overlook the data center impact of commodity hardware. At a time when whole data centers are being built for map/reduce computing, the economics are increasingly difficult to ignore. By turning Hadoop system design convention on its head and implementing a highly efficient, highly dense architecture in which compute and disk resources are minimized, the resulting effect can be dramatic. Efficient configurations of Hadoop scalable compute-plus-storage units have demonstrated the ability to reduce data center footprint by as much as 60%.
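The sketch below is a schematic illustration, in plain Java, of the saturated/idle oscillation described under "Data Nodes and Compute Nodes" and of why pipelining helps: in the serialized pattern each block is fetched and only then processed, while in the pipelined pattern the next block is prefetched on a background thread while the current one is processed. The fetch and process bodies are simulated placeholders, not Hadoop or hScaler code, and the timings are illustrative only.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class PipelinedFetchSketch {
      // Placeholder for an I/O-bound block fetch (disk or network) -- illustrative only.
      static byte[] fetchBlock(int id) throws InterruptedException {
        Thread.sleep(50);            // simulated I/O latency
        return new byte[64 * 1024];
      }

      // Placeholder for the CPU-bound work on one block.
      static void process(byte[] block) throws InterruptedException {
        Thread.sleep(50);            // simulated compute time
      }

      public static void main(String[] args) throws Exception {
        final int blocks = 10;
        ExecutorService io = Executors.newSingleThreadExecutor();

        // Serialized pattern: fetch, then compute, then fetch again.
        // The I/O subsystem alternates between saturated and idle.
        long t0 = System.nanoTime();
        for (int i = 0; i < blocks; i++) {
          process(fetchBlock(i));
        }
        System.out.printf("serialized: %d ms%n", (System.nanoTime() - t0) / 1_000_000);

        // Pipelined pattern: prefetch the next block while processing the current one,
        // keeping both the I/O path and the CPU busy.
        long t1 = System.nanoTime();
        Future<byte[]> next = io.submit(() -> fetchBlock(0));
        for (int i = 0; i < blocks; i++) {
          byte[] current = next.get();
          if (i + 1 < blocks) {
            final int nextId = i + 1;
            next = io.submit(() -> fetchBlock(nextId));
          }
          process(current);
        }
        System.out.printf("pipelined:  %d ms%n", (System.nanoTime() - t1) / 1_000_000);
        io.shutdown();
      }
    }

Under these assumptions the pipelined loop runs in roughly half the time of the serialized one, which is the same effect the text attributes to dedicated compute-only nodes backed by high-throughput shared storage.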

Introducing hScaler: A Fundamentally New Approach to Enterprise Analytics

hScaler is a highly engineered, tightly integrated hardware/software appliance that features the Hortonworks distribution of the Apache Hadoop platform. It leverages DDN's Storage Fusion Architecture family of high-throughput, RDMA-attached storage systems to address many of the inefficiencies that exist in today's Apache Hadoop environments. These inefficiencies continue to grow as CPU and networking advances outpace the legacy methods of data storage management and delivery in commodity Hadoop clusters. DDN's hScaler product was, first and foremost, engineered to be a simple-to-deploy, simple-to-operate, scale-out analytics platform that features high availability and is factory-delivered to minimize time to insight. To be competitive in a market dominated by commodity economics, hScaler leverages the power of the world's fastest storage technology to exploit industry-standard componentry. Key aspects of the product include:

- Turnkey appliance and Hadoop process management through DDN's DirectMon analytics cluster management utility.
- Fully integrated Hadoop and high-speed ETL tools, all supported and managed by DDN in a "one throat to choke" model.
- A scalable unit design, in which compute and DDN's SFA storage are built into an appliance bundle. These appliances can be iterated out onto a network to achieve aggregate performance and capacity equivalent to an 8,000-node Hadoop cluster.
- Flexible configuration: compute and storage can be added to each scalable unit independently, ensuring that the least amount of infrastructure is consumed for any performance and capacity profile.
- A unique approach to Hadoop in which compute nodes and data nodes are scaled independently. This reengineering of the system and job scheduling design opens up the compute node to perform much more complex transforms of the data, in a nearly embarrassingly parallel scalability model that alone accelerates cluster performance by upwards of 30%.

At the core of hScaler is DDN's flagship SFA12K-40 storage appliance. The system is capable of delivering up to 40 GB/s of throughput and over 1.4M IOPS, making it the world's fastest storage appliance. It is configurable with both spinning and flash disks, enabling Hadoop to deliver performance tailored to the composition of the data and the processing requirements. The system also features the highest levels of data center density in the industry, housing up to 1,680 HDDs in just two data center racks; the SFA12K-40 is up to 300% more dense than competing storage systems. DDN SFA products demonstrate up to 800% greater performance than legacy enterprise storage and uniquely enable configurations in which powerful, high-throughput storage can be cost-effectively coupled with today's data-hungry Hadoop compute nodes at speeds greater than direct-attached storage. Real-time SFA performance mitigates the impact of drive or enclosure failures to preserve sustained cluster processing performance.

Summary

While Hadoop and the map/reduce paradigm have advanced time to insight by orders of magnitude, enterprises still find Hadoop technology challenging to adopt, owing to the number of new Hadoop concepts involved and the substantial difficulty of implementing them on commodity clusters. The root cause of today's hesitation lies in complex deployment methods, which push IT departments toward a hands-off approach because the majority of the architecture work is done by highly skilled data scientists. With hScaler, DDN has engineered simplicity and efficiency into this next-generation Hadoop appliance, delivering a Hadoop experience that is not only IT-friendly but focused on deriving business value at scale. By offloading every aspect of Hadoop I/O and data protection, and by packaging the cluster with a highly resilient, dense and high-throughput data storage platform, DDN has increased map/reduce performance by up to 700%. This enables hScaler to deliver new efficiencies and substantial savings to your bottom line.

About DDN

DataDirect Networks (DDN) is the world leader in massively scalable storage. We are the leading provider of data storage and processing solutions and professional services that enable content-rich and high-growth IT environments to achieve the highest levels of systems scalability, efficiency and simplicity. DDN enables enterprises to extract value and deliver results from their information. Our customers include the world's leading online content and social networking providers; high-performance cloud and grid computing, life sciences and media production organizations; and security and intelligence organizations. Deployed in thousands of mission-critical environments worldwide, DDN's solutions have been designed, engineered and proven in the world's most scalable data centers to ensure competitive business advantage for today's information-powered enterprise. For more information, go to www.ddn.com or call +1-800-837-2298.

© 2013 DataDirect Networks, Inc. All Rights Reserved. DataDirect Networks, hScaler, DirectMon, Storage Fusion Architecture, SFA, and SFA12K are trademarks of DataDirect Networks. All other trademarks are the property of their respective owners. Version-1 2/13