Big Data Cloud Storage Technology Comparison. Tony Pearson IBM Master Inventor and Senior Managing Consultant. June 26, IBM Corporation


Big Data Cloud Storage Technology Comparison
Tony Pearson, IBM Master Inventor and Senior Managing Consultant
June 26, 2012
© 2011 IBM Corporation

Agenda
- What is Big Data?
- InfoSphere BigInsights
- Infrastructure and Storage Considerations
- Concluding Thoughts

An Explosion of Data
- 1.3 billion RFID tags in 2005; 30 billion RFID tags today
- 2 billion Internet users by 2011
- 4.6 billion mobile phones worldwide
- Capital market data volumes grew 1,750% from 2003 to 2006
- Twitter processes 7 terabytes of data every day
- Facebook processes 10 terabytes of data every day
- World Data Centre for Climate: 220 terabytes of Web data, 9 petabytes of additional data

Information Overload But Lacking Insight
- 44x as much data and content over the coming decade: from 800,000 petabytes in 2009 to 35 zettabytes in 2020
- 80% of the world's data is unstructured
- 1 in 3 business leaders frequently make decisions based on information they don't trust, or don't have
- 1 in 2 business leaders say they don't have access to the information they need to do their jobs
- 83% of CIOs cited "Business intelligence and analytics" as part of their visionary plans to enhance competitiveness
- 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
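As a quick check on the "44x" figure, the two data points on this slide (800,000 petabytes in 2009, 35 zettabytes in 2020) can be divided directly. The sketch below assumes decimal units (1 zettabyte = 1,000,000 petabytes) and also derives the compound annual growth rate those two points imply; that annual rate is not stated on the slide and is only an illustration.

```python
# Figures from the slide; decimal units are an assumption (1 ZB = 1,000,000 PB).
PB_PER_ZB = 1_000_000

data_2009_zb = 800_000 / PB_PER_ZB  # 800,000 PB expressed in zettabytes
data_2020_zb = 35.0                 # projected volume for 2020

# Overall growth between the two data points (the slide rounds this to 44x).
growth_factor = data_2020_zb / data_2009_zb

# Implied compound annual growth rate over the 11 years from 2009 to 2020.
years = 2020 - 2009
annual_growth = growth_factor ** (1 / years) - 1

print(f"growth factor: {growth_factor:.2f}x")
print(f"implied compound annual growth: {annual_growth:.1%}")
```

The growth factor comes out just under 44, which matches the rounded number on the slide, and the implied annual growth is roughly 41%.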

The Big Data Opportunity
Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible.
- Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text
- Velocity: Streaming data and large volume data movement
- Volume: Scale from Terabytes to Zettabytes

Where did this begin? Apache Hadoop
Open source framework for harnessing large volumes of unstructured data
- Inspired by Google technologies (MapReduce, GFS)
- Originally built to address scalability problems of web search and analytics
[Diagram labels: Processing, Storage]
Enables applications to run on thousands of nodes and leverage Petabytes of data in a highly parallel, cost effective manner
- CPU + Disks = Hadoop Node
- Nodes can be combined into clusters
- New nodes can be added dynamically
- Provides simple scalable growth
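The MapReduce model mentioned above can be illustrated with a small single-process sketch. This is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example. It shows the three conceptual steps: map each input to key/value pairs, group the pairs by key, then reduce each group.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all emitted values by key, as happens between map and reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle and reduce in sequence on a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big insights"]))
```

In a real cluster the map and reduce calls run in parallel on many nodes, each close to the disks that hold its share of the data; here everything runs in one process purely to show the data flow.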

How IBM BigInsights extends Hadoop capability
Delivering enterprise-ready software
- Use cases: Risk Exposure, Failure Analysis, Text Processing, Advanced Analytics, Log Analytics, Climate modelling, Scientific Research
- Enterprise capabilities: Performance & Availability, Extreme storage capacity, Security Hardened Architecture, Management Disciplines, Developer Value
- InfoSphere BigInsights (Internet Scale Analytics)
- Traditional / Non-traditional data sources

Infrastructure for the range of BigInsights deployments

Value
- Characteristics: Optimized for cost effective scale-out; Classic Hadoop architecture; Redundancy provided by Hadoop
- Typical customer use cases: Customer sentiment analysis; Internet behavior and buying pattern analysis

Enterprise
- Characteristics: Enterprise class features; Options to support business critical workloads
- Typical customer use cases: Financial fraud detection; Risk analysis; Data warehouse offload for cold data

Performance
- Characteristics: Highest performance; Compute and I/O intensive workload options
- Typical customer use cases: Email compliance analysis; Credit card fraud detection; Media analytics

Technology Comparison

Internal Storage in System x Servers
- Block-level access
- Use GPFS-Shared Nothing Cluster (SNC)
- Typical for most Hadoop installations
- Based on the IBM System x3630 M3: ultra-dense, storage-rich server for Big Data

External Storage: DCS3700
- Block-level access
- 60 drives in 4U drawer
- Designed for sequential workloads
- Use GPFS-Shared Nothing Cluster

SONAS
- File-level access
- Designed for unstructured data content used in Big Data analytics

BigInsights Hardware Foundation

Rack-Level Features
- Up to 20 System x3630 M3 nodes
- Up to 840TB storage
- Up to 240 cores
- Up to 3,840GB memory
- Up to two 10Gb Ethernet or 40Gb InfiniBand switches
- Scalable to multi-rack configurations

Available Enterprise and Performance Features
- Redundant storage
- Redundant networking
- High performance cores
- Increased memory
- High performance networking

BigInsights Value Node Features

Value Data Node
- IBM System x3630 M3
- Two Intel Xeon E5620 CPUs
- Data: 12 x 2TB NL SAS HDDs
- OS: 1 x 2TB NL SAS HDD
- 48GB DDR3 RDIMMs

Value Management Node (JobTracker, NameNode, Console)
- IBM System x3630 M3
- Two Intel Xeon E5620 CPUs
- Data: 4 x 2TB NL SAS HDDs
- OS: 2 x 2TB NL SAS HDD, RAID1
- 96GB DDR3 RDIMMs
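The Value Data Node figures lend themselves to a small capacity calculation. The sketch below takes the drive count and drive size from the slide; the replication factor of 3 is an assumption (it is the usual Hadoop default, not a number from this deck), and the ten-node cluster in the usage example is hypothetical.

```python
# Figures taken from the Value Data Node description on the slide.
DATA_DRIVES_PER_NODE = 12
DRIVE_SIZE_TB = 2

# Assumption, not from the slide: each block is stored three times,
# as with the common Hadoop default replication factor.
ASSUMED_REPLICATION = 3


def raw_data_tb(data_nodes: int) -> int:
    """Raw capacity of the data drives across the given number of data nodes."""
    return data_nodes * DATA_DRIVES_PER_NODE * DRIVE_SIZE_TB


def usable_data_tb(data_nodes: int, replication: int = ASSUMED_REPLICATION) -> float:
    """Capacity left for unique data once every block is kept `replication` times."""
    return raw_data_tb(data_nodes) / replication


if __name__ == "__main__":
    nodes = 10  # hypothetical cluster size, chosen only for illustration
    print(f"{nodes} data nodes: {raw_data_tb(nodes)} TB raw, "
          f"{usable_data_tb(nodes):.0f} TB usable at {ASSUMED_REPLICATION}x replication")
```

A single Value Data Node therefore holds 24 TB of raw data capacity; how much of that is usable depends on the replication or redundancy scheme actually configured, and on space reserved for intermediate data, which this sketch ignores.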

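IBM Storage Product Positioning
[Positioning chart; the placement of each product within the figure is not recoverable from the extracted text.]
- Tiers: Enterprise, Midrange, Entry Level
- Categories: Primary Data; Mainframe Optimized; Unified Storage (NAS for all servers); Distributed High Performance Computing, Big Data
- Axis: Random to Sequential
- Products shown: XIV, DS8000, DS5000, DS3500, SVC, Storwize V7000, Storwize V7000 Unified, N7000, N6000, N3000, SONAS, DCS3700
- SSD ("Flash & Stash") is marked on several products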

- Query languages like Pig and JAQL need good random I/O performance
- Sort requires better sequential throughput
- GPFS delivers twice the performance of HDFS for both of the above
- For document index lookups, client-side caching is a big win: 17x throughput speedup
- Proven data integrity
- Replicated metadata services
[Chart: vertical axis from 0 to 2000; series labels not recoverable]

File System                          GPFS                                          HDFS
Robust                               No single point of failure                    NameNode vulnerability
Data Integrity                       High                                          Evidence of data loss
Scale                                Thousands of nodes                            Thousands of nodes
POSIX Compliance                     Full; supports a wide range of applications   Limited
Data Management                      Security, Backup, Replication                 Limited
MapReduce Performance                Good                                          Good
Workload Isolation                   Supports disk isolation                       No support
Traditional Application Performance  Good                                          Poor performance with random reads and writes

Evolution of the global namespace: GPFS Active File Management (AFM)
- 1993: GPFS introduced concurrent file system access from multiple nodes.
- 2005: Multi-cluster expands the global namespace by connecting multiple sites.
- 2011: AFM takes the global namespace truly global by automatically managing asynchronous replication of data.

IBM NWA High level view of Scale-Out NAS Storage (SONAS)
Benchmark Performance: 403,326 IOPS, single file system (SPECsfs2008_nfs)
SONAS Release 1.2
- Single file system, over 900TB usable
- 10 Interface Nodes, each with:
  - Maximum 144 GB of memory
  - One active 10GbE port
- 8 Storage Pods, each with:
  - 2 Storage nodes and 240 drives
  - Drive type: 15K RPM SAS hard drives
  - Data Protection: the drives were configured in RAID ranks
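To put the benchmark configuration in proportion, the published totals can be divided across the components listed on the slide. The sketch below is plain arithmetic on the slide's figures; the per-node and per-drive averages are derived values for illustration only, not measured results, since caching and RAID overhead mean the load is not spread evenly across drives.

```python
# Figures from the SONAS benchmark slide.
TOTAL_IOPS = 403_326
INTERFACE_NODES = 10
STORAGE_PODS = 8
STORAGE_NODES_PER_POD = 2
DRIVES_PER_POD = 240

# Totals across the whole configuration.
total_drives = STORAGE_PODS * DRIVES_PER_POD
total_storage_nodes = STORAGE_PODS * STORAGE_NODES_PER_POD

# Simple averages; these are derived numbers, not measurements.
iops_per_interface_node = TOTAL_IOPS / INTERFACE_NODES
iops_per_drive = TOTAL_IOPS / total_drives

print(f"drives: {total_drives}, storage nodes: {total_storage_nodes}")
print(f"average IOPS per interface node: {iops_per_interface_node:,.1f}")
print(f"average IOPS per drive: {iops_per_drive:.1f}")
```

The configuration works out to 1,920 drives behind 16 storage nodes, or an average of roughly 210 IOPS per drive at the file-system level.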

IBM Scale Out Network Attached Storage (SONAS)
Enterprise Class Solution for IP-based File System Storage (IBM SONAS: Cloud-ready)

One global repository for application and user files
- One huge file system, or up to 256 file systems per SONAS

Enterprise solution for all applications, departments and users
- Provision and monitor usage by application, file, department or whatever makes sense to the business
- Includes ability to report usage and access patterns for chargeback
- Capacity managed centrally
- Extremely high utilization rates

Simplified management of petabytes of storage

Independently scalable performance and capacity eliminates trade-offs

Concluding Thought: IBM's Value

A complete stack for Big Data
- Others require multi-vendor solutions

Embracing the open source community
- Product support and additional offerings
- In-field expertise to ensure client success

Enterprise-class focus
- Performance tested
- Administrative and development tooling
- Deep integration with information management software inside and outside IBM
- Security and governance
- High availability and backup

System x and System Storage
- Industry leading innovation and technology
- Best in class reliability and availability
- #1 in customer satisfaction

Thank You!

About the Speaker

Mr. Tony Pearson
Master Inventor, Senior Managing Consultant
IBM System Storage
9000 S. Rita Road, Bldg 9070, Mail 9070
Tucson, AZ 85744
+1 520-799-4309 (Office)
tpearson@us.ibm.com

Tony Pearson is a Master Inventor and Senior Managing Consultant for the IBM System Storage product line. Tony joined IBM Corporation in 1986 in Tucson, Arizona, USA, and has lived there ever since. In his current role, Tony presents briefings on storage topics covering the entire System Storage product line, Tivoli storage software products, and topics related to Cloud Computing. He interacts with clients, speaks at conferences and events, and leads client workshops to help clients with strategic planning for IBM's integrated set of storage management software, hardware, and virtualization products.

Tony writes the Inside System Storage blog, which is read by hundreds of clients, IBM sales reps and IBM Business Partners every week. This blog was rated one of the top 10 blogs for the IT storage industry by Networking World magazine, and the #1 most read IBM blog on IBM's developerWorks. The blog has been published in a series of books, Inside System Storage: Volume I through IV.

Over the past years, Tony has worked in development, marketing and customer care positions for various storage hardware and software products. Tony has a Bachelor of Science degree in Software Engineering, and a Master of Science degree in Electrical Engineering, both from the University of Arizona. Tony holds 19 IBM patents for inventions on storage hardware and software products.

Additional Resources
- Email: tpearson@us.ibm.com
- Twitter: http://twitter.com/az990tony
- Blog: http://ibm.co/braez0
- Books: http://www.lulu.com/spotlight/990_tony
- IBM Expert Network: http://www.slideshare.net/az990tony

Trademarks and disclaimers

Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency which is now part of the Office of Government Commerce. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office. UNIX is a registered trademark of The Open Group in the United States and other countries. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Linear Tape-Open, LTO, the LTO Logo, Ultrium, and the Ultrium logo are trademarks of HP, IBM Corp. and Quantum in the U.S. and other countries. Other product and service names might be trademarks of IBM or other companies.

Information is provided "AS IS" without warranty of any kind. The customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.

Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here.

Prices are suggested U.S. list prices and are subject to change without notice. Starting price may not include a hard drive, operating system or other features. Contact your IBM representative or Business Partner for the most current pricing in your geography. Photographs shown may be engineering prototypes. Changes may be incorporated in production models.

IBM Corporation 2012. All rights reserved. References in this document to IBM products or services do not imply that IBM intends to make them available in every country. Trademarks of International Business Machines Corporation in the United States, other countries, or both can be found on the World Wide Web at http://www.ibm.com/legal/copytrade.shtml.

ZSP03490-USEN-00