Big Data Cloud Storage Technology Comparison
Tony Pearson, IBM Master Inventor and Senior Managing Consultant
June 26, 2012
© 2011 IBM Corporation

Agenda
- What is Big Data?
- InfoSphere BigInsights
- Infrastructure and Storage Considerations
- Concluding Thoughts

An Explosion of Data
- 1.3 billion RFID tags in 2005; 30 billion RFID tags today
- 2 billion Internet users by 2011
- 4.6 billion mobile phones worldwide
- Capital market data volumes grew 1,750% from 2003 to 2006
- Twitter processes 7 terabytes of data every day
- Facebook processes 10 terabytes of data every day
- World Data Centre for Climate: 220 terabytes of Web data, 9 petabytes of additional data

Information Overload But Lacking Insight
- 44x as much data and content over the coming decade: 800,000 petabytes in 2009, 35 zettabytes in 2020
- 80% of the world's data is unstructured
- 1 in 3 business leaders frequently make decisions based on information they don't trust, or don't have
- 1 in 2 business leaders say they don't have access to the information they need to do their jobs
- 83% of CIOs cited business intelligence and analytics as part of their visionary plans to enhance competitiveness
- 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions

The Big Data Opportunity
Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible.
- Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text
- Velocity: Streaming data and large volume data movement
- Volume: Scale from terabytes to zettabytes

Where did this begin? Apache Hadoop
Open source framework for harnessing large volumes of unstructured data
- Inspired by Google technologies (MapReduce, GFS)
- Originally built to address scalability problems of web search and analytics
Enables applications to run on thousands of nodes and leverage petabytes of data in a highly parallel, cost-effective manner
- CPU + Disks = Hadoop Node (processing plus storage)
- Nodes can be combined into clusters
- New nodes can be added dynamically
- Provides simple scalable growth
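The MapReduce model that inspired Hadoop can be illustrated with a minimal single-process sketch. This is plain Python, not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are illustrative assumptions. On a real cluster the map and reduce phases would run in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical sample input, two "documents" of two words each.
counts = word_count(["big data", "big insights"])
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, the work can be split across nodes without coordination, which is what lets a cluster grow by simply adding nodes.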

How IBM BigInsights extends Hadoop capability
Delivering enterprise-ready software
- Risk Exposure, Failure Analysis, Text Processing, Advanced Analytics, Log Analytics
- Performance & Availability, Extreme storage capacity, Security, Hardened Architecture
- Climate modelling, Scientific Research
- Management Disciplines, Developer Value
- InfoSphere BigInsights (Internet Scale Analytics)
- Traditional / Non-traditional data sources

Infrastructure for the range of BigInsights deployments
Value
- Characteristics: Optimized for cost-effective scale-out; classic Hadoop architecture; redundancy provided by Hadoop
- Typical customer use cases: Customer sentiment analysis; Internet behavior and buying pattern analysis
Enterprise
- Characteristics: Enterprise class features; options to support business critical workloads
- Typical customer use cases: Financial fraud detection; risk analysis; data warehouse offload for cold data
Performance
- Characteristics: Highest performance; compute and I/O intensive workload options
- Typical customer use cases: Email compliance analysis; credit card fraud detection; media analytics

Technology Comparison
Internal Storage in System x Servers
- Block-level access
- Use GPFS-Shared Nothing Cluster (SNC)
- Typical for most Hadoop installations
- Based on the IBM System x3630 M3: ultra-dense, storage-rich server for Big Data
External Storage: DCS3700
- Block-level access
- 60 drives in 4U drawer
- Designed for sequential workloads
- Use GPFS-Shared Nothing Cluster
SONAS
- File-level access
- Designed for unstructured data content used in Big Data analytics

BigInsights Hardware Foundation
Rack-Level Features
- Up to 20 System x3630 M3 nodes
- Up to 840TB storage
- Up to 240 cores
- Up to 3,840GB memory
- Up to two 10Gb Ethernet or 40Gb InfiniBand switches
- Scalable to multi-rack configurations
Available Enterprise and Performance Features
- Redundant storage
- Redundant networking
- High performance cores
- Increased memory
- High performance networking
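Dividing the rack-level maximums by the node count shows the per-node configuration they imply. The sketch below uses only the "up to" figures from this slide; the variable names are illustrative, and the results are upper bounds, so a given node configuration may be smaller.

```python
# Rack-level maximums as stated on the slide.
NODES_PER_RACK = 20
RACK_STORAGE_TB = 840
RACK_CORES = 240
RACK_MEMORY_GB = 3840

# Implied per-node maximums (simple division, no other assumptions).
per_node = {
    "storage_tb": RACK_STORAGE_TB / NODES_PER_RACK,
    "cores": RACK_CORES / NODES_PER_RACK,
    "memory_gb": RACK_MEMORY_GB / NODES_PER_RACK,
}

for name, value in per_node.items():
    print(f"{name}: {value:g}")
```

This works out to 42TB of storage, 12 cores and 192GB of memory per node at the top of the range.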

BigInsights Value Node Features
Value Data Node
- IBM System x3630 M3
- Two Intel Xeon E5620 CPUs
- Data: 12 x 2TB NL SAS HDDs
- OS: 1 x 2TB NL SAS HDD
- 48GB DDR3 RDIMMs
Value Management Node (JobTracker, NameNode, Console)
- IBM System x3630 M3
- Two Intel Xeon E5620 CPUs
- Data: 4 x 2TB NL SAS HDDs
- OS: 2 x 2TB NL SAS HDDs, RAID1
- 96GB DDR3 RDIMMs
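For capacity planning, the raw data capacity of a Value Data Node follows directly from the drive count. The usable figure below additionally assumes a replication factor of 3, which is the usual Hadoop default but is not stated on the slide; treat it as an assumption and substitute the factor your cluster actually uses.

```python
# From the slide: 12 data drives of 2TB each per Value Data Node.
DATA_DRIVES = 12
DRIVE_TB = 2

# ASSUMPTION: three copies of every block (common Hadoop default,
# not stated in the source slide).
REPLICATION_FACTOR = 3

raw_data_tb = DATA_DRIVES * DRIVE_TB
usable_tb = raw_data_tb / REPLICATION_FACTOR

print(f"Raw data capacity per node: {raw_data_tb} TB")
print(f"Usable at {REPLICATION_FACTOR}x replication: {usable_tb:g} TB")
```

Under that assumption, each node contributes 24TB raw and 8TB of unique data to the cluster.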

IBM Storage Product Positioning
[Figure: positioning chart with axes labeled Random to Sequential and Entry Level, Midrange, Enterprise. Products shown: XIV, DS5000, SVC, DS8000, Storwize V7000, N7000, N6000, SONAS, Storwize V7000 Unified, DCS3700, DS3500, N3000; several are marked SSD. Other labels: Primary Data; Flash & Stash; Mainframe Optimized; NAS for all servers; Distributed High Performance Computing, Big Data; Unified Storage.]

- Query languages like Pig and JAQL need good random I/O performance
- Sort requires better sequential throughput
- GPFS delivers twice the performance of HDFS for both of the above
- For document index lookups, client side caching is a big win: 17x throughput speedup
- Proven data integrity
- Replicated metadata services
[Figure: chart with an axis from 0 to 2000; labels not recoverable]
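The benefit of client side caching for repeated index lookups can be shown with a small application-level analogy. This is not how GPFS implements its cache; the function names and the sample lookup sequence are hypothetical, and the counter simply stands in for expensive reads from storage.

```python
from functools import lru_cache

backend_reads = 0  # counts how often we go to "storage"

def read_index_entry_from_storage(term):
    """Hypothetical stand-in for a slow read from a remote file system."""
    global backend_reads
    backend_reads += 1
    return f"postings-for-{term}"

@lru_cache(maxsize=1024)
def lookup(term):
    """Client-side cached lookup: repeated terms never reach storage."""
    return read_index_entry_from_storage(term)

# Five lookups over two distinct terms.
for term in ["hadoop", "gpfs", "hadoop", "hadoop", "gpfs"]:
    lookup(term)

print(f"lookups: 5, storage reads: {backend_reads}")
```

Only the first lookup of each term reaches storage; the more skewed the lookup distribution, the larger the share of requests served from the client's memory.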

File System                          GPFS                                          HDFS
Robust                               No single point of failure                    NameNode vulnerability
Data Integrity                       High                                          Evidence of data loss
Scale                                Thousands of nodes                            Thousands of nodes
POSIX Compliance                     Full, supports a wide range of applications   Limited
Data Management                      Security, Backup, Replication                 Limited
MapReduce Performance                Good                                          Good
Workload Isolation                   Supports disk isolation                       No support
Traditional Application Performance  Good                                          Poor performance with random reads and writes
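POSIX compliance and traditional application performance matter because many existing applications update files in place with ordinary file calls. The sketch below shows that access pattern using only the Python standard library. The temporary directory stands in for a POSIX mount point (on a cluster this could be a GPFS mount, which is an assumption here), and the helper names are illustrative. HDFS files, by contrast, are written once and extended by append, so this pattern does not map onto HDFS directly.

```python
import os
import tempfile

def overwrite_in_place(path, offset, data):
    """Random write: seek into an existing file and overwrite bytes there."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)

def read_at(path, offset, length):
    """Random read: seek to an offset and read a fixed number of bytes."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.dat")
    # Create a 16-byte file, then update 4 bytes in the middle.
    with open(path, "wb") as f:
        f.write(b"A" * 16)
    overwrite_in_place(path, 4, b"ZZZZ")
    result = read_at(path, 0, 16)

print(result)
```

An application written this way needs no changes to run against any file system that honors POSIX semantics.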

Evolution of the global namespace: GPFS Active File Management (AFM)
- 1993: GPFS introduced concurrent file system access from multiple nodes.
- 2005: Multi-cluster expands the global namespace by connecting multiple sites.
- 2011: AFM takes the global namespace truly global by automatically managing asynchronous replication of data.

High-level view of Scale-Out NAS Storage (SONAS)
Benchmark performance: 403,326 IOPS, single file system (SPECsfs2008.nfs)
- SONAS Release 1.2
- Single file system, over 900TB usable
- 10 Interface Nodes, each with:
  - Maximum 144 GB of memory
  - One active 10GbE port
- 8 Storage Pods, each with:
  - 2 storage nodes and 240 drives
  - Drive type: 15K RPM SAS hard drives
  - Data protection: the drives were configured in RAID ranks
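The benchmark configuration can be broken down into totals and rough averages. The sketch below uses only the figures on this slide; the variable names are illustrative. The per-node and per-drive values are simple averages for orientation, since file-level benchmark operations are not the same thing as disk I/Os and many are served from memory.

```python
# Figures as stated on the slide.
IOPS = 403_326
INTERFACE_NODES = 10
STORAGE_PODS = 8
STORAGE_NODES_PER_POD = 2
DRIVES_PER_POD = 240

total_storage_nodes = STORAGE_PODS * STORAGE_NODES_PER_POD
total_drives = STORAGE_PODS * DRIVES_PER_POD

# Rough averages only; not a measure of individual component limits.
iops_per_interface_node = IOPS / INTERFACE_NODES
iops_per_drive = IOPS / total_drives

print(f"storage nodes: {total_storage_nodes}, drives: {total_drives}")
print(f"ops per interface node: {iops_per_interface_node:,.1f}")
print(f"ops per drive: {iops_per_drive:.2f}")
```

The configuration therefore totals 16 storage nodes and 1,920 drives behind the 10 interface nodes.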

IBM Scale Out Network Attached Storage (SONAS)
Enterprise Class Solution for IP-based File System Storage
One global repository for application and user files
- One huge file system, or up to 256 file systems per SONAS
Enterprise solution for all applications, departments and users
- Provision and monitor usage by application, file, department or whatever makes sense to the business
- Includes ability to report usage and access patterns for chargeback
- Capacity managed centrally
- Extremely high utilization rates
Simplified management of petabytes of storage
Independently scalable performance and capacity eliminates trade-offs
IBM SONAS: Cloud-ready

Concluding Thought: IBM's Value
A complete stack for Big Data
- Others require multi-vendor solutions
Embracing the open source community
- Product support and additional offerings
- In-field expertise to ensure client success
Enterprise-class focus
- Performance tested
- Administrative and development tooling
- Deep integration with information management software inside and outside IBM
- Security and governance
- High availability and backup
System x and System Storage
- Industry leading innovation and technology
- Best in class reliability and availability
- #1 in customer satisfaction

Thank You!

About the Speaker
Tony Pearson
Master Inventor, Senior Managing Consultant, IBM System Storage
9000 S. Rita Road, Bldg 9070, Mail 9070
Tucson, AZ 85744
+1 520-799-4309 (Office)
tpearson@us.ibm.com
Tony Pearson is a Master Inventor and Senior Managing Consultant for the IBM System Storage product line. Tony joined IBM Corporation in 1986 in Tucson, Arizona, USA, and has lived there ever since. In his current role, Tony presents briefings on storage topics covering the entire System Storage product line, Tivoli storage software products, and topics related to Cloud Computing. He interacts with clients, speaks at conferences and events, and leads client workshops to help clients with strategic planning for IBM's integrated set of storage management software, hardware, and virtualization products.
Tony writes the Inside System Storage blog, which is read by hundreds of clients, IBM sales reps and IBM Business Partners every week. This blog was rated one of the top 10 blogs for the IT storage industry by Networking World magazine, and the #1 most read IBM blog on IBM's developerWorks. The blog has been published in a series of books, Inside System Storage: Volume I through IV.
Over the past years, Tony has worked in development, marketing and customer care positions for various storage hardware and software products. Tony has a Bachelor of Science degree in Software Engineering, and a Master of Science degree in Electrical Engineering, both from the University of Arizona. Tony holds 19 IBM patents for inventions on storage hardware and software products.

Additional Resources
- Email: tpearson@us.ibm.com
- Twitter: http://twitter.com/az990tony
- Blog: http://ibm.co/braez0
- Books: http://www.lulu.com/spotlight/990_tony
- IBM Expert Network: http://www.slideshare.net/az990tony

Trademarks and disclaimers
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency which is now part of the Office of Government Commerce. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office. UNIX is a registered trademark of The Open Group in the United States and other countries. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Linear Tape-Open, LTO, the LTO Logo, Ultrium, and the Ultrium logo are trademarks of HP, IBM Corp. and Quantum in the U.S. and other countries. Other product and service names might be trademarks of IBM or other companies. Information is provided "AS IS" without warranty of any kind. The customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.
Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here. Prices are suggested U.S. list prices and are subject to change without notice. Starting price may not include a hard drive, operating system or other features.
Contact your IBM representative or Business Partner for the most current pricing in your geography. Photographs shown may be engineering prototypes. Changes may be incorporated in production models. IBM Corporation 2012. All rights reserved. References in this document to IBM products or services do not imply that IBM intends to make them available in every country. Trademarks of International Business Machines Corporation in the United States, other countries, or both can be found on the World Wide Web at http://www.ibm.com/legal/copytrade.shtml. ZSP03490-USEN-00