DDN Technical Brief

Modernizing Hadoop Architecture for Superior Scalability, Efficiency & Productive Throughput

A Fundamentally Different Approach to Enterprise Analytics Architecture: A Scalable Unit Design Leveraging Shared High-Throughput Storage to Minimize Compute TCO

Abstract: This paper examines the limitations of the traditional Hadoop architecture, which is built on commodity compute with Direct Attached Storage (DAS). It then reviews the design imperatives of the DataDirect Networks hscaler Apache Hadoop appliance architecture and how it has been engineered to eliminate the limitations that constrain today's purely commodity approaches.
The Impetus for Today's Hadoop Design

At a time when commodity networking operated at 10MB/s and individual disks were capable of 80MB/s of data transfer (with multiple disks configurable either across a network or within a single server chassis), the obvious performance mismatch identified by data center engineers and analysts highlighted severe efficiency challenges in then-current systems designs and the need for better approaches to data-intensive computing.

As a result of this imbalance between network and storage resources in standard data centers, and the perceived high cost of enterprise shared storage, data-intensive organizations began to embrace new methods of processing data in which the processing routines are brought to the data, which lives in commodity computers that participate in the distributed processing of large analytic queries. The most popular approach of this style today is Apache Hadoop [Hadoop]. Hadoop supports the distribution of applications across commodity hardware in a shared-nothing fashion, where each commodity server independently owns its data and where data is replicated across several commodity nodes for resiliency and performance. Hadoop implements a computational model known as map/reduce: data sets are divided into fragments, the fragments are distributed uniformly across a commodity processing cluster, and the nodes process them in parallel (a minimal sketch follows Table 1 below). This approach was developed to minimize the cost and performance overhead of data movement across commodity networks and to accelerate data processing.

Since the emergence of Hadoop, the limitations of hard drive physics have created a new imbalance: hard drive performance advancements have not kept pace with increases in networking and processing performance (see Table 1). Today, as high-speed data center networking approaches 100Gb/s, the gradual increase in disk performance has produced a new bottleneck, whereby inefficient spinning disk technologies have become the data processing constraint for large-scale Hadoop implementations. While today's systems are still capable of economically utilizing the performance of spinning media (as opposed to SSDs, since the workload is still predominantly throughput-oriented), the classic Hadoop function-shipping model is challenged by the ever-growing need for more node-local spinning disks, and the performance utilization of this media is being challenged by the scale-out approaches of today's Hadoop data protection and distribution software.

                          2003    2013    Delta
  HDD Bandwidth (MB/s)      40     120       3x
  CPU Cores per Socket       2      16       8x
  Ethernet (Gb/s)            1      40      40x

Table 1: Commodity Computing Advancements
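To make the map/reduce model concrete, the sketch below expresses the canonical word-count job against the Apache Hadoop Java API (the Hadoop 2.x org.apache.hadoop.mapreduce classes are assumed). The map phase runs against individual fragments of the input, ideally on the nodes that already hold them, and the framework shuffles the emitted key/value pairs to the reduce phase. This is an illustrative sketch of the programming model, not anything specific to the architecture discussed later in this brief.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each task processes one fragment (input split) of the data set,
    // ideally on the node that already stores that fragment.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emitted pairs are shuffled to reducers
            }
        }
    }

    // Reduce phase: receives all values for a given key after the shuffle.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```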
Hadoop System Components & Bottlenecks

To illustrate the areas of optimization that are possible with Apache Hadoop, we review the core design tenets and the impact of the associated configuration choices on cluster efficiency.

Data Protection: Today's data protection layer in Hadoop is commonly implemented as a three-way replicated storage configuration, in which HDFS (the Hadoop Distributed File System, a Java-based namespace and data protection framework) pipelines each write sequentially from the host to three distinct nodes. Hadoop can benefit from relinquishing this replication responsibility from HDFS. By treating HDFS as a conventional file system, centralized storage can be employed to reduce the number of data copies to one, using high-speed RAID or erasure coding techniques to protect the data. Freeing the compute node from the burden of data replication can increase Hadoop node performance by up to 50%. An ancillary benefit of this approach is a reduction in the number of hard drives in the Hadoop architecture by as much as 60%, with the resulting economic, data center and environmental benefits (a configuration sketch follows Table 2 below).

Job Affinity: In large cluster configurations, Hadoop jobs are routinely forced to process data that is not local to the node, breaking the map/reduce processing paradigm. The amount of data retrieved from other nodes over the network during a particular Hadoop job can be as high as 30%. The use of centralized, RDMA-attached storage can deliver an 80% decrease in I/O wait times for remote data retrieval compared to transferring the data via TCP/IP.

Map/Reduce Shuffle Efficiency: While commodity networks are now capable of delivering 56Gb/s and greater, conventional network protocols are unable to encapsulate data efficiently, and TCP/IP overhead continues to consume a substantial portion of CPU cycles during these data-intensive operations. Historically, SAN and HPC networking technologies have been applied to this problem, making compute nodes more efficient through protocols that maximize bandwidth while minimizing CPU overhead.

  Dataset    1 x 40GbE    1 x 56Gb IB    Gain
  80GB              40            439     43%
  500GB            628            865     75%

Table 2: Hadoop Compute Comparisons (in seconds)
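The data-protection point above ultimately reduces to a single HDFS setting: the per-file replication factor. The sketch below is a minimal illustration, assuming an underlying storage layer that already protects data with RAID or erasure coding as described above. It uses only the standard dfs.replication property and the stock Hadoop FileSystem client API; the file path shown is hypothetical and nothing here is hscaler-specific.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleCopyWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Default HDFS behavior is three full copies of every block.
        // When the underlying storage already protects data with RAID or
        // erasure coding, a single logical copy is sufficient.
        conf.setInt("dfs.replication", 1);

        FileSystem fs = FileSystem.get(conf);

        // Files created through this client inherit the replication factor of 1.
        Path out = new Path("/analytics/ingest/events.log");  // hypothetical path
        fs.create(out).close();

        // Existing files can also be converted in place.
        fs.setReplication(out, (short) 1);

        System.out.println("Replication for " + out + ": "
                + fs.getFileStatus(out).getReplication());
    }
}
```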
While it may seem counter-intuitive that a Hadoop system demands high-speed networking when the processing is shipped to the data, the Shuffle phase of map/reduce operations can in fact reorient a large amount of data across a Hadoop cluster, and the speed of this operation is a direct byproduct of the networking and protocol choices made at cluster design time. Today, RDMA encapsulation of Shuffle data, using InfiniBand or RDMA over Converged Ethernet networking, is proving to deliver dramatic efficiency gains for Hadoop clusters.

Data Nodes and Compute Nodes: Consider the I/O profile of a typical Hadoop job. The job pauses and waits for the CPU before fetching the next set of data. This process serialization causes the I/O subsystem to swing between saturated and idle, an inefficiency that wastes about 30% of a job's run time. Establishing compute-only nodes in a Hadoop environment can deliver material benefits over the conventional one-node-fits-all approach. This model provides much better sequential access to data storage while dramatically reducing job resets and pauses. This parallelization is a radically new approach to job processing and can speed up jobs at a hyper-linear rate, making the cluster faster as it grows. By leveraging high-throughput, RDMA-connected storage, compute-only nodes can save as much as 30% of the time they would otherwise spend on data pipelining (a sketch of this pipelining pattern appears at the end of this section).

Data Center Packaging: When discussing efficiency, it is easy to overlook the data center impact of commodity hardware. At a time when whole data centers are being built for map/reduce computing, the economics are increasingly difficult to ignore. By turning Hadoop design convention on its head and implementing a highly efficient, highly dense architecture (where compute and disk resources are minimized), the resulting effect can be dramatic. Efficient configurations of Hadoop scalable compute + storage units have demonstrated the ability to reduce data center impact by as much as 60%.
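The pipelining argument above can be illustrated with a small, generic prefetching pattern: read block N+1 while the CPU is still processing block N, so the I/O subsystem never swings between saturated and idle. The sketch below is a hypothetical, self-contained Java illustration of that overlap; it is not hscaler's scheduler or Hadoop internals, and the 64MB block size is simply a representative HDFS-era value.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PrefetchingReader {
    private static final int BLOCK_SIZE = 64 * 1024 * 1024;  // representative 64 MB block

    public static void main(String[] args) throws Exception {
        ExecutorService io = Executors.newSingleThreadExecutor();
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {

            // Kick off the read of the first block.
            Future<byte[]> pending = io.submit(() -> readBlock(in));

            while (true) {
                byte[] current = pending.get();           // wait for the in-flight read
                if (current.length == 0) break;           // end of input

                pending = io.submit(() -> readBlock(in)); // start reading the NEXT block...
                process(current);                         // ...while the CPU works on this one
            }
        } finally {
            io.shutdown();
        }
    }

    // Reads up to one block; returns an empty array at end of stream.
    private static byte[] readBlock(DataInputStream in) throws IOException {
        byte[] buf = new byte[BLOCK_SIZE];
        int n = in.read(buf);
        if (n <= 0) return new byte[0];
        byte[] exact = new byte[n];
        System.arraycopy(buf, 0, exact, 0, n);
        return exact;
    }

    // Stand-in for the map-side work done on each block.
    private static void process(byte[] block) {
        long checksum = 0;
        for (byte b : block) checksum += b;
        System.out.println("processed " + block.length + " bytes, checksum " + checksum);
    }
}
```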
Introducing hscaler: A Fundamentally New Approach to Enterprise Analytics

hscaler is a highly engineered, tightly integrated hardware/software appliance that features the Hortonworks distribution of the Apache Hadoop platform. It leverages DDN's Storage Fusion Architecture family of high-throughput, RDMA-attached storage systems to address the many inefficiencies that exist in today's Apache Hadoop environments, inefficiencies which continue to grow as CPU and networking advances outpace the legacy methods of data storage management and delivery in commodity Hadoop clusters. DDN's hscaler product was engineered, first and foremost, to be a simple-to-deploy, simple-to-operate, scale-out analytics platform that features high availability and is factory-delivered to minimize time-to-insight. To be competitive in a market dominated by commodity economics, hscaler leverages the power of the world's fastest storage technology to exploit the power of industry-standard componentry. Key aspects of the product include:

- Turnkey appliance and Hadoop process management through DDN's DirectMon analytics cluster management utility.
- Fully integrated Hadoop and high-speed ETL tools, all supported and managed by DDN in a "one throat to choke" model.
- A scalable unit design, where compute and DDN's SFA storage are built into an appliance bundle. These appliances can be iterated out onto a network to achieve aggregate performance and capacity equivalent to an 8,000-node Hadoop cluster. Configuration is flexible: compute and storage can be added to each scalable unit independently, ensuring that the least amount of infrastructure is consumed for any performance and capacity profile.
- A unique approach to Hadoop whereby compute nodes and data nodes are scaled independently. This re-engineering of the system and job scheduling design opens up the compute node to perform much more complex transforms of the data in a nearly embarrassingly parallel fashion, which alone accelerates cluster performance by upwards of 30%.

At the core of hscaler is DDN's flagship SFA12K-40 storage appliance. The system is capable of delivering up to 40GB/s of throughput and over 1.4M IOPS, making it the world's fastest storage appliance. It is configurable with both spinning and Flash disks, enabling Hadoop to deliver performance tailored to the composition of the data and the processing requirements. The system also features the highest levels of data center density in the industry, housing up to 1,680 HDDs in just two data center racks; the SFA12K-40 is up to 300% more dense than competing storage systems. DDN SFA products demonstrate up to 800% greater performance than legacy enterprise storage and uniquely enable configurations where powerful, high-throughput storage can be cost-effectively coupled with today's data-hungry Hadoop compute nodes at speeds greater than direct-attached storage. SFA's real-time architecture mitigates the performance impact of drive or enclosure failures, preserving sustained cluster processing performance.
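hscaler's own connectors and management tooling are not detailed in this brief, but the general pattern of running map/reduce against shared, externally protected storage can be sketched with stock Hadoop configuration alone. The example below assumes, purely for illustration, that the shared storage is exposed to every compute node as an identical POSIX mount (/mnt/shared is a hypothetical path); it uses the Hadoop 2.x fs.defaultFS property and reuses the mapper and reducer classes from the earlier word-count sketch, with durability assumed to come from the array's RAID or erasure coding rather than HDFS replication.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SharedStorageJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Use the node-local view of the shared mount as the default file system
        // instead of a DAS-backed HDFS namespace. "file:///" is the stock Hadoop
        // local/POSIX file system; /mnt/shared is a hypothetical mount point that
        // every compute node is assumed to see identically.
        conf.set("fs.defaultFS", "file:///");

        // No HDFS replication is involved; protection comes from the array's
        // RAID / erasure coding rather than from three block copies.
        Job job = Job.getInstance(conf, "analytics-on-shared-storage");
        job.setJarByClass(SharedStorageJob.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);  // from the earlier sketch
        job.setReducerClass(WordCount.SumReducer.class);      // from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/mnt/shared/ingest"));     // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/mnt/shared/results"));  // hypothetical

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```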
Summary

While Hadoop and the map/reduce paradigm have improved time-to-insight by orders of magnitude, today's enterprises remain challenged to adopt Hadoop technology, due to the complexity of absorbing so many new Hadoop concepts and the substantial challenges of implementing them on commodity clusters. The root cause of today's hesitation lies in complex deployment methods, which cause IT departments to take a hands-off approach because the majority of the architecture work is done by highly skilled data scientists. With hscaler, DDN has engineered simplicity and efficiency into this next-generation Hadoop appliance, delivering a Hadoop experience that is not only IT-friendly but focused on deriving business value at scale. By offloading every aspect of Hadoop I/O and data protection, and by packaging the cluster with a highly resilient, dense and high-throughput data storage platform, DDN has increased map/reduce performance by up to 700%. This enables hscaler to deliver new efficiencies and substantial savings to your bottom line.

About DDN

DataDirect Networks (DDN) is the world leader in massively scalable storage. We are the leading provider of data storage and processing solutions and professional services that enable content-rich and high-growth IT environments to achieve the highest levels of systems scalability, efficiency and simplicity. DDN enables enterprises to extract value and deliver results from their information. Our customers include the world's leading online content and social networking providers, high-performance cloud and grid computing, life sciences, media production, and security and intelligence organizations. Deployed in thousands of mission-critical environments worldwide, DDN's solutions have been designed, engineered and proven in the world's most scalable data centers to ensure competitive business advantage for today's information-powered enterprise. For more information, go to www. or call +1-800-837-2298.

2013, DataDirect Networks, Inc. All Rights Reserved. DataDirect Networks, hscaler, DirectMon, Storage Fusion Architecture, SFA, and SFA12K are trademarks of DataDirect Networks. All other trademarks are the property of their respective owners. Version 1, 2/13.