DDN Solution Brief

Accelerate Hadoop™ Analytics with the SFA Big Data Platform

Organizations that need to extract value from all of their data can leverage the award-winning SFA platform to accelerate their Hadoop infrastructure and gain a deeper understanding of their business.
The Big Data Challenge and Opportunity in Hadoop Analytics

Organizations in a wide range of industries rely on advanced analytics to gain important insights from rapidly growing data sets and to make faster, more informed decisions. The ability to perform detailed and complex analytics on Big Data using Apache Hadoop is integral to success in fields such as Life Sciences, Financial Services, and Government.

- Life Sciences. Hadoop-based analytics is being used to detect drug interactions, identify the best courses of treatment, and determine a patient's likelihood of developing a disease.
- Financial Services. Visionary hedge funds, proprietary trading firms, and other leading institutions are turning to Hadoop for market monitoring, risk modeling, fraud detection, and compliance reporting.
- Government. Federal and local government agencies are turning to Hadoop to satisfy diverse mission goals. The Intelligence community works daily to find hidden associations across multiple large data sets. Law enforcement analyzes all available information to better deploy limited resources, ensuring that police are positioned to respond rapidly while providing a visible presence to deter crime.

As Hadoop-based analytics becomes an essential part of operations like these, the performance, scalability, and reliability of the infrastructure that supports Hadoop have become increasingly business critical. Science projects are no longer enough to get the job done: Hadoop infrastructure has to be efficient, reliable, and IT friendly. Shared storage solutions can eliminate bottlenecks to increase Hadoop performance while providing greater reliability and the familiar feature set that IT teams depend on. As a leader in Big Data storage, DDN is the perfect storage partner for your Hadoop infrastructure needs.

What is Analytics and How Does Hadoop Enable It?

Analytics is about turning data into actionable information. This process includes finding important details, making associations, and working toward a recommendation that you can execute on. Hadoop is becoming the preferred platform for efficiently processing the large volumes of data from diverse sources that are needed to drive these decisions. Hadoop provides storage and analysis of large, semi-structured and unstructured data sets, and it offers a rich ecosystem of add-on components, such as Apache HCatalog, Mahout, Pig, and Accumulo, that allow for integration with other platforms and simplified access to complex data sets. Hadoop has established itself as a standard for organizations working to solve Big Data challenges because it:

- Scales in both performance and capacity
- Has a growing solution ecosystem that increases capability and flexibility
- Provides established APIs and interfaces that accelerate development
The Hadoop software consists of two main components:

- MapReduce. A programming model for processing problems against huge data sets in parallel. Problems are divided into parallel tasks by a JobTracker, and each task is assigned to a TaskTracker on a Hadoop node for execution. In the map part of the process, queries are processed in parallel on many nodes. During the reduce part of the process, results are gathered, organized, and presented.
- Hadoop Distributed File System (HDFS). The distributed file system used for data management by Hadoop. A single NameNode manages metadata for a Hadoop cluster, while a DataNode process on each cluster node is responsible for the subset of the total data set that resides there.
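To make the map and reduce phases concrete, the sketch below shows the canonical word-count job written against the standard Hadoop MapReduce API: each TaskTracker runs the mapper over its local slice of the input, emitting (word, 1) pairs, and the reducers gather and sum the partial counts. The class names and argument layout are illustrative choices, not part of any DDN deliverable.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on many nodes, each over its local data block.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }

  // Reduce phase: results for each word are gathered from all mappers and summed.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this is submitted with the standard hadoop jar command, passing HDFS input and output paths as arguments.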
A standard Hadoop installation runs on a cluster of compute nodes in which each node contains compute and internal (direct-attached) storage. For data protection and disaster recovery, Hadoop by default maintains three copies of all data on separate nodes. This operational model is quite different from what most IT teams are accustomed to, and, as data sets grow in size, storing three copies of data consumes a huge amount of storage, not to mention electricity and floor space. As more mainstream organizations adopt Hadoop, a new set of capabilities is needed to allow Hadoop to integrate better with standard IT practices. Replacing Hadoop's direct-attached storage model with shared storage may be the fastest way to rationalize Hadoop deployment and simplify the integration of Hadoop solutions with your existing IT infrastructure and practices.

Shared Storage Accelerates Hadoop and Increases Operational Efficiency

As Hadoop becomes an integral part of business processes, IT teams are looking for Hadoop infrastructure solutions that deliver:

- Enterprise-class hardware
- Enterprise integration
- High availability
- Efficient CAPEX and OPEX scaling
- Resource management, SLAs, and QoS

Moving to a shared storage infrastructure for Hadoop can address these concerns and provide significant advantages in terms of performance, scaling, and reliability while creating a more IT-friendly infrastructure. Making the investment in the enterprise-class hardware necessary to support shared infrastructure significantly reduces ongoing operational expenses and allows you to share resources across multiple applications and business units.

Performance: Shared storage can achieve better storage performance with far fewer spinning disks. Each Hadoop node typically includes a few commodity disk drives organized either as a JBOD or in a single RAID group; solid-state disks are rarely used, to keep per-node costs down. The storage performance of any single Hadoop node is low, and high aggregate storage performance is only achieved by having a large number of nodes and a very large number of disks. In recent testing using TestDFSIO, a distributed I/O benchmark tool that writes and reads random data from HDFS, the DDN Storage Fusion Architecture (SFA) demonstrated a 50% to 100% or greater improvement in HDFS performance versus commodity servers with local storage. (A simplified, single-stream timing sketch of this kind of measurement appears at the end of this section.)

Scalability: By deploying shared storage for Hadoop, compute and storage capacity scale independently, with greater flexibility to choose the best solution for each. Pairing dense compute infrastructure with dense shared storage can significantly shrink your overall Hadoop footprint. Data growth in Hadoop environments often exceeds growth in computing demand, so scaling out compute and disk in lockstep, as in a standard Hadoop deployment, means paying for CPU capacity just to get storage. Because a standard Hadoop installation keeps three copies of all data, it requires 3X the raw storage to satisfy a given usable capacity, making the addition of new capacity an expensive proposition: 1PB of usable capacity calls for 3PB of raw disk under triple replication, versus roughly 1.25PB with typical RAID 6 (8+2) protection. Shared storage provides storage resiliency with much lower capacity overhead.

Reliability: Placing the storage for Hadoop's NameNode and JobTracker, which are particularly vulnerable to failures, on reliable shared storage protects both performance and availability; service can be restored more quickly should one of these services fail. All things being equal, having 3X the disks means 3X the disk failures. With each disk failure, a Hadoop node is compromised; while Hadoop can continue to run, at some point performance suffers and results are delayed. Shared storage provides the same usable capacity from far fewer disk spindles with better overall reliability. When a disk does fail, advanced storage systems like the DDN SFA12K can rebuild the missing data from parity without impacting Hadoop performance. When a compute node fails, it is easy to re-assign its storage to a spare. (The configuration sketch at the end of this section shows the HDFS settings these points touch.)

IT Friendliness: Shared storage is familiar and IT friendly. Which would you rather manage: 1,000 disks in a single discrete storage system, or 3,000 disks spread across hundreds of compute nodes? Shared storage eliminates the mismatch between Hadoop and the rest of your IT infrastructure, making it easier to integrate Hadoop with your operations.

- Management. Manage all storage from a single interface and scale capacity quickly.
- Data protection. Take advantage of built-in data integrity and data protection functions such as RAID, snapshots, and off-site replication, offloading that work from Hadoop.
- Flexibility. Pull compute resources into a Hadoop cluster for intensive jobs and release them when the job is complete.
- Multiple workloads. Support other workloads without affecting Hadoop performance.
- Cost. Fewer spinning disks, a smaller storage footprint, reduced complexity, and simplified management decrease energy consumption, save datacenter space, and cut management costs.
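For a sense of what a TestDFSIO-style measurement looks like, the sketch below times a single sequential write and read through the HDFS client API and reports throughput. It is a simplified, single-stream illustration only, assuming a reachable HDFS deployment; the real benchmark runs many such streams in parallel as a MapReduce job, and the path and sizes shown are hypothetical.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsThroughputSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/benchmarks/sketch/testfile");  // hypothetical HDFS path

    byte[] buffer = new byte[1 << 20];   // 1 MiB buffer (TestDFSIO generates its own data)
    long totalBytes = 1L << 30;          // write 1 GiB in total

    // Time a sequential write through the HDFS client.
    long start = System.nanoTime();
    try (FSDataOutputStream out = fs.create(file, true)) {
      for (long written = 0; written < totalBytes; written += buffer.length) {
        out.write(buffer);
      }
    }
    double writeSecs = (System.nanoTime() - start) / 1e9;
    System.out.printf("write: %.1f MB/s%n", (totalBytes / 1e6) / writeSecs);

    // Time a sequential read of the same file.
    start = System.nanoTime();
    try (FSDataInputStream in = fs.open(file)) {
      while (in.read(buffer) != -1) { /* discard the data */ }
    }
    double readSecs = (System.nanoTime() - start) / 1e9;
    System.out.printf("read:  %.1f MB/s%n", (totalBytes / 1e6) / readSecs);
  }
}
```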
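The reliability and replication points above ultimately come down to a handful of HDFS settings. The sketch below shows the relevant Hadoop 1.x property keys programmatically; in practice they live in hdfs-site.xml on the cluster nodes. The mount points and the replication value of 2 are hypothetical examples, assuming RAID-protected shared storage, not DDN-prescribed values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedStorageSettingsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // NameNode metadata directory on a shared-storage mount (hypothetical path).
    // This key takes effect in hdfs-site.xml on the NameNode host; it is set
    // here only to illustrate the key and the kind of value it takes.
    conf.set("dfs.name.dir", "/mnt/sfa/namenode");

    // Default block replication. HDFS ships with 3; on RAID-protected shared
    // storage a lower factor may satisfy the same resiliency goal.
    conf.setInt("dfs.replication", 2);

    // Replication can also be adjusted per file after the fact
    // (the path below is hypothetical):
    FileSystem fs = FileSystem.get(conf);
    fs.setReplication(new Path("/data/example"), (short) 2);
  }
}
```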
Award-Winning SFA12K

Innovative, award-winning, and proven in the world's largest and most demanding production environments, DDN's Storage Fusion Architecture (SFA) utilizes the most advanced processor technology, buses, and memory with an optimized RAID engine and sophisticated data management algorithms. The SFA12K product family is designed to derive peak performance from Hadoop investments, with a massive I/O infrastructure and support for mixed drive media that maximize system performance and lower storage investment costs. The SFA12K product family is purpose-built to simplify and tame Big Data growth, enabling you to architect and scale your Hadoop environment more intelligently, efficiently, and cost effectively. For architects in global businesses coping with complex Big Data solutions, the SFA platform for Hadoop infrastructure is a highly reliable, high-performance platform that accelerates workflows, enabling you to analyze growing amounts of data without increasing costs.

Performance and scalability: Our state-of-the-art SFA12K storage engine is almost eight times faster than legacy enterprise storage. With an SFA12K you can leverage industry-leading SFA storage performance to satisfy Hadoop storage requirements with the fewest storage systems. A single system delivers up to 40GB/second of bandwidth, and bandwidth scales with each additional SFA12K system: an aggregate bandwidth of 1TB/second is possible with just 25 systems. The SFA platform is the fastest platform for Big Data, with the ability to extract the highest performance from all media. Higher performance means that you can deliver exceptional results from smaller Hadoop clusters.

Density: Reduce your Hadoop footprint, reclaim your datacenter, and resolve space and power limitations with the industry's densest storage platform. Each enclosure houses 60 drives in just 4U, delivering 2.4PB of raw storage per rack (with 4TB SATA drives). Our world-leading density and power efficiency mean that organizations can reduce their TCO.

Reliability: The SFA12K delivers world-class reliability features that protect the availability and integrity of your Hadoop data. A unique multi-RAID architecture combines up to 1,680 SATA, SAS, and SSD drives into a simply managed, multi-petabyte platform. The system performs multiple levels of parity generation and real-time data integrity verification and error correction in the background, without impacting Hadoop performance. DirectProtect further increases data resiliency and reliability by automatically detecting and correcting silent data corruption.
Lowest Total Cost of Ownership (TCO) in the industry: TCO that's 50% lower than other enterprise storage solutions makes the SFA12K a smarter choice for Hadoop shared storage infrastructure, and you can support workloads in addition to Hadoop from the same storage. The leading-edge SFA12K brings industry-leading performance, capacity, density, and reliability to your datacenter. The SFA12K is a performance powerhouse: the power, speed, and scalability of SFA deliver unparalleled performance improvements for Hadoop, in an IT-friendly platform with lower TCO. For business executives seeking to understand how an organization is perceived by customers and the world, the SFA platform for Hadoop infrastructure helps you gain insights and understand your business better and faster than ever before. Because the DDN SFA12K is the fastest shared storage platform, it is the ideal choice for accelerating Hadoop-based analytics to power better decisions.

About DDN

DataDirect Networks (DDN) is the world leader in massively scalable storage. We are the leading provider of data storage and processing solutions and professional services that enable content-rich and high-growth IT environments to achieve the highest levels of systems scalability, efficiency, and simplicity. DDN enables enterprises to extract value and deliver results from their information. Our customers include the world's leading online content and social networking providers; high-performance cloud and grid computing sites; and life sciences, media production, and security and intelligence organizations. Deployed in thousands of mission-critical environments worldwide, DDN's solutions have been designed, engineered, and proven in the world's most scalable data centers to ensure competitive business advantage for today's information-powered enterprise. For more information, go to www.ddn.com or call +1-800-837-2298.

© 2012, DataDirect Networks, Inc. All Rights Reserved. DataDirect Networks, SFA, and Storage Fusion Architecture are trademarks of DataDirect Networks. All other trademarks are the property of their respective owners. Version-1 1112