DDN Solution Brief

Accelerate Hadoop™ Analytics with the SFA Big Data Platform

Organizations that need to extract value from all of their data can leverage the award-winning SFA platform to accelerate their Hadoop infrastructure and gain a deeper understanding of their business.

The Big Data Challenge and Opportunity in Hadoop Analytics

Organizations in a wide range of industries rely on advanced analytics to gain important insights from rapidly growing data sets and to make faster, more informed decisions. The ability to perform detailed and complex analytics on Big Data using Apache Hadoop is integral to success in fields such as Life Sciences, Financial Services, and Government.

Life Sciences. Hadoop-based analytics is being used to detect drug interactions, identify the best courses of treatment, and determine a patient's likelihood of developing a disease.

Financial Services. Visionary hedge funds, proprietary trading firms, and other leading institutions are turning to Hadoop for market monitoring, risk modeling, fraud detection, and compliance reporting.

Government. Federal and local government agencies are turning to Hadoop to satisfy diverse mission goals. The Intelligence community works daily to find hidden associations across multiple large data sets. Law enforcement analyzes all available information to better deploy limited resources and to ensure that police are positioned to respond rapidly while providing a visible presence to deter crime.

As Hadoop-based analytics becomes an essential part of operations like these, the performance, scalability, and reliability of the infrastructure that supports Hadoop becomes increasingly business critical. Science projects are no longer enough to get the job done: Hadoop infrastructure has to be efficient, reliable, and IT friendly. Shared storage solutions can eliminate bottlenecks to increase Hadoop performance while providing greater reliability and the familiar feature set that IT teams depend on. As a leader in Big Data storage, DDN is the ideal storage partner for your Hadoop infrastructure needs.

What is Analytics and How Does Hadoop Enable It?

Analytics is about turning data into actionable information. The process includes finding important details, making associations, and working toward a recommendation that you can execute on. Hadoop is becoming the preferred platform for efficiently processing the large volumes of data from diverse sources needed to drive these decisions. Hadoop provides storage and analysis of large, semi-structured and unstructured data sets, and it offers a rich ecosystem of add-on components, such as Apache HCatalog, Mahout, Pig, and Accumulo, that allow for integration with other platforms and simplified access to complex data sets.

Hadoop has established itself as a standard for organizations working to solve Big Data challenges because it:
Scales in both performance and capacity
Has a growing solution ecosystem that increases capability and flexibility
Provides established APIs and interfaces that accelerate development
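As a hedged illustration of that last point, the sketch below writes and reads a file through the standard HDFS Java client API (org.apache.hadoop.fs.FileSystem). It is not code from this brief; the NameNode host, port, and path are illustrative assumptions for a generic cluster.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // NameNode host and port are placeholders for your cluster.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

            Path file = new Path("/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file, true)) { // overwrite if present
                out.writeBytes("hello, hadoop\n");
            }
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[64];
                int n = in.read(buf);
                System.out.print(new String(buf, 0, n, "UTF-8"));
            }
        }
    }

The same FileSystem interface is what MapReduce jobs and the ecosystem tools use underneath, which is why applications written against it run unchanged as the cluster grows.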

The Hadoop software consists of two main components:

MapReduce. A programming model and execution framework for processing problems against huge data sets. Problems are divided into parallel tasks by a JobTracker, and each task is assigned to a TaskTracker on a Hadoop node for execution. In the map part of the process, queries are processed in parallel on many nodes. During the reduce part of the process, results are gathered, organized, and presented.

Hadoop Distributed File System (HDFS). The distributed file system used for data management by Hadoop. A single NameNode manages metadata for a Hadoop cluster, while a DataNode process on each cluster node is responsible for the subset of the total data set that resides there.
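The canonical illustration of these two components is the word-count job from the Apache Hadoop documentation. The minimal sketch below uses the standard org.apache.hadoop.mapreduce API; it is not code from this brief, and the input and output paths are supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: runs in parallel across the cluster, emitting (word, 1).
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: gathers all counts for a word and sums them.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Input splits are read from whichever DataNodes hold them, so the I/O behavior of the underlying storage directly bounds job throughput.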
A standard Hadoop installation runs on a cluster of compute nodes in which each node provides both compute and internal (direct-attached) storage. For data protection and disaster recovery, Hadoop maintains three copies of all data on separate nodes. This operational model is quite different from what most IT teams are accustomed to, and, as data sets grow in size, storing three copies of data consumes a huge amount of storage, not to mention electricity and floor space.
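The copy count is governed by the real HDFS property dfs.replication, which defaults to three and can also be set per file. The sketch below illustrates the trade-off just described using the standard FileSystem API; the path is an illustrative assumption, and lowering replication on a RAID-protected shared backend is a deployment decision, not a recommendation from this brief.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf); // uses fs.defaultFS from core-site.xml

            // Hypothetical file; with the default dfs.replication of 3,
            // 1 PB of application data consumes roughly 3 PB of raw disk.
            Path file = new Path("/data/events.log");
            short replicas = fs.getFileStatus(file).getReplication();
            System.out.println(file + " is stored " + replicas + " times");

            // On a RAID-protected shared backend, per-file replication can be
            // lowered to trade HDFS-level redundancy for capacity overhead.
            fs.setReplication(file, (short) 2);
        }
    }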
As more mainstream organizations adopt Hadoop, a new set of capabilities is needed to allow Hadoop to integrate better with standard IT practices. Replacing Hadoop's direct-attached storage model with shared storage may be the fastest way to rationalize Hadoop deployment and simplify the integration of Hadoop solutions with your existing IT infrastructure and practices.

Shared Storage Accelerates Hadoop and Increases Operational Efficiency

As Hadoop becomes an integral part of business processes, IT teams are looking for Hadoop infrastructure solutions that deliver:
Enterprise-class hardware
Enterprise integration
High availability
Efficient CAPEX and OPEX scaling
Resource management, SLAs, and QoS

Moving to a shared storage infrastructure for Hadoop can address these concerns and provide significant advantages in performance, scaling, and reliability while creating a more IT-friendly infrastructure. Investing in the enterprise-class hardware necessary to support shared infrastructure significantly reduces ongoing operational expenses and allows you to share resources across multiple applications and business units.

Performance: Shared storage can achieve better storage performance with far fewer spinning disks. Each Hadoop node typically includes a few commodity disk drives organized either as a JBOD or in a single RAID group; solid-state disks are rarely used, in order to keep per-node costs down. The storage performance of any single Hadoop node is low, and high aggregate storage performance is achieved only by having a large number of nodes and a very large number of disks. In recent testing using TestDFSIO, a distributed I/O benchmark tool that writes and reads random data in HDFS, the DDN Storage Fusion Architecture (SFA) demonstrated a 50% to 100% or greater improvement in HDFS performance versus commodity servers with local storage. (A sample TestDFSIO invocation appears at the end of this section.)

Scalability: With shared storage for Hadoop, compute and storage capacity scale independently, giving you greater flexibility to choose the best solution for each. Pairing dense compute infrastructure with dense shared storage can significantly shrink your overall Hadoop footprint. Data growth in Hadoop environments often exceeds growth in computing demand, so scaling out compute and disk in lockstep, as in a standard Hadoop deployment, means paying for CPU capacity just to get storage. Because a standard Hadoop installation keeps three copies of data, it requires 3X the storage to satisfy a given requirement, making the addition of new capacity an expensive proposition. Shared storage provides storage resiliency with much lower capacity overhead.

Reliability: Placing the storage for Hadoop's NameNode and JobTracker, which are particularly vulnerable to failures, on reliable shared storage protects both performance and availability; service can be restored more quickly should one of these services fail. All things being equal, having 3X the disks means 3X the disk failures. With each disk failure, a Hadoop node is compromised. While Hadoop can continue to run, at some point performance suffers and results are delayed. Shared storage provides the same usable capacity from far fewer disk spindles with better overall reliability. When a disk does fail, advanced storage systems like the DDN SFA12K can regenerate the missing data from parity without impacting Hadoop performance. When a compute node fails, it's easy to re-assign its storage to a spare.

IT Friendliness: Shared storage is familiar and IT friendly. Which would you rather manage: 1,000 disks in a single discrete storage system, or 3,000 disks spread across hundreds of compute nodes? Shared storage eliminates the mismatch between Hadoop and the rest of your IT infrastructure, making it easier to integrate Hadoop with your operations.

Management. Manage all storage from a single interface and scale capacity quickly.
Data protection. Take advantage of built-in data integrity and data protection functions such as RAID, snapshots, and off-site replication, offloading that work from Hadoop.
Flexibility. Pull compute resources into a Hadoop cluster for intensive jobs and release them when the job is complete.
Multiple workloads. Support other workloads without affecting Hadoop performance.
Cost. Fewer spinning disks, a smaller storage footprint, reduced complexity, and simplified management decrease energy consumption, save datacenter space, and reduce management costs.
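For reference, a TestDFSIO run typically looks like the following from the command line. This is a hedged sketch, not the exact invocation behind the results cited above: the test jar's name and location vary by Hadoop version and distribution, and the file count and size are illustrative.

    # Write phase: create 16 files of 1,000 MB each in HDFS and measure throughput
    hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 16 -fileSize 1000

    # Read phase: read the same files back
    hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 16 -fileSize 1000

    # Remove the benchmark's working files when done
    hadoop jar hadoop-*test*.jar TestDFSIO -clean

The benchmark reports aggregate throughput and average I/O rate per map task, which makes it a convenient way to compare local-disk and shared-storage configurations.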

Award-Winning SFA12K

Innovative, award-winning, and proven in the world's largest and most demanding production environments, DDN's Storage Fusion Architecture (SFA) combines the most advanced processor, bus, and memory technology with an optimized RAID engine and sophisticated data-management algorithms. The SFA12K product family is designed to derive peak performance from Hadoop investments, with a massive I/O infrastructure and multi-media disk drives that maximize system performance and lower storage investment costs. The SFA12K product family is purpose-built to simplify and tame Big Data growth, enabling you to architect and scale your Hadoop environment more intelligently, efficiently, and cost-effectively. For architects in global businesses coping with complex Big Data solutions, the SFA platform is an extremely reliable, high-performing foundation for Hadoop infrastructure that accelerates workflows, letting you analyze growing amounts of data without increasing costs.

Performance and scalability: The state-of-the-art SFA12K storage engine is almost eight times faster than legacy enterprise storage. With the SFA12K you can leverage industry-leading SFA storage performance to satisfy Hadoop storage requirements with the fewest storage systems. A single system delivers up to 40GB/s of bandwidth, and bandwidth scales with each additional SFA12K system: an aggregate bandwidth of 1TB/s is possible with just 25 systems (25 x 40GB/s = 1TB/s). The SFA platform is the fastest platform for Big Data, with the ability to extract the highest performance from all media. Higher performance means that you can deliver exceptional results from smaller Hadoop clusters.

Density: Reduce your Hadoop footprint, reclaim your datacenter, and resolve space and power limitations with the industry's densest storage platform. Each enclosure houses 60 drives in just 4U, delivering 2.4PB of raw storage per rack with 4TB SATA drives (ten 4U enclosures per rack x 60 drives x 4TB = 2.4PB). World-leading density and power efficiency mean that organizations can reduce their TCO.

Reliability: The SFA12K delivers world-class reliability features that protect the availability and integrity of your Hadoop data. A unique multi-RAID architecture combines up to 1,680 SATA, SAS, and SSD drives into a simply managed, multi-petabyte platform. The system performs multiple levels of parity generation, real-time data-integrity verification, and error correction in the background without impacting Hadoop performance. DirectProtect further increases data resiliency and reliability by automatically detecting and correcting silent data corruption.

Lowest Total Cost of Ownership (TCO) in the industry: TCO that's 50% lower than other enterprise storage solutions makes the SFA12K a smarter choice for Hadoop shared storage infrastructure, and the same storage can support workloads beyond Hadoop. The leading-edge SFA12K brings industry-leading performance, capacity, density, and reliability to your datacenter. The SFA12K is a performance powerhouse: the power, speed, and scalability of SFA deliver unparalleled performance improvements for Hadoop in an IT-friendly platform with lower TCO.

For business executives seeking to understand how their organization is perceived by customers and the wider world, the SFA platform for Hadoop infrastructure helps you gain insights and understand your business better and faster than ever before. Because the DDN SFA12K is the fastest shared storage platform, it is the ideal choice for accelerating Hadoop-based analytics to power better decisions.

About DDN

DataDirect Networks (DDN) is the world leader in massively scalable storage. We are the leading provider of data storage and processing solutions and professional services that enable content-rich and high-growth IT environments to achieve the highest levels of systems scalability, efficiency, and simplicity. DDN enables enterprises to extract value and deliver results from their information. Our customers include the world's leading online content and social networking providers; high-performance cloud and grid computing sites; life sciences and media production organizations; and security and intelligence organizations. Deployed in thousands of mission-critical environments worldwide, DDN's solutions have been designed, engineered, and proven in the world's most scalable data centers to ensure competitive business advantage for today's information-powered enterprise. For more information, go to www.ddn.com or call +1-800-837-2298.

© 2012, DataDirect Networks, Inc. All Rights Reserved. DataDirect Networks, SFA, and Storage Fusion Architecture are trademarks of DataDirect Networks. All other trademarks are the property of their respective owners. Version-1 1112