Big Data Analytics with EMC Greenplum and Hadoop
Ofir Manor, Pre-Sales Technical Architect, EMC Greenplum
Big Data and the Data Warehouse
Potential
  - All internal operational data
  - External web site traffic
  - Mobile app traffic
  - Customer interactions from Facebook, Twitter, etc.
  - Sensor data
  - Deeper customer insights
  - Better analytics: better offerings, retention, fraud detection, etc.
  - Increased profit and growth
  - Less risk
Reality
  - DW slow to adapt
  - Hard to fit into the nightly window
  - Can't support real-time loading
  - Long-running queries are killed
  - A lot of hand-tuning: hints, indexes, materialized views, etc.
  - Sprawl of data duplication and shadow systems
  - Analytics done offline in small silos
  - Can't integrate with newer Big Data sources
Building the Industry's Only Complete Big Data Analytics Stack
  - Analytic Toolsets (Business Analytics, BI, Statistics, etc.)
  - Greenplum Chorus: Enterprise Collaboration Platform for Data
  - Greenplum Data Computing Appliances: Purpose-built for Big Data Analytics
  - Greenplum Database (Enterprise & Community Editions): World's Most Scalable MPP Database Platform
  - Greenplum HD (Hadoop Enterprise & Community Editions): Enterprise Analytics Platform for Unstructured Data
GREENPLUM DATABASE
Industry-Leading Massively Parallel Processing (MPP) Performance
Database Architecture Matters: Scale-Out vs. Scale-Up
  - Greenplum is a scale-out cloud architecture on standard commodity hardware
  - Others use a mainframe scale-up architecture on proprietary hardware
Greenplum Database: Extreme Performance on Commodity HW
Optimized for BI and Analytics
  - Provides automatic parallelization
    - Just load and query like any database
    - Tables are automatically distributed across nodes
    - No need for manual partitioning or tuning
  - Extremely scalable MPP shared-nothing architecture
    - All nodes can scan and process in parallel
    - Linear scalability by adding nodes
  - Flexible physical layout
    - Column-oriented or row-oriented, with various levels of compression
[Diagram labels: Interconnect, Loading]
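The automatic distribution and parallel scan described on this slide can be pictured with a small sketch. It is only an illustration in Python, not Greenplum code: the CRC32 hash, the function names (`segment_for`, `distribute`, `parallel_sum`) and the sample rows are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def segment_for(key, num_segments):
    """Map a distribution-key value to a segment number.

    CRC32 is an assumption chosen because it is deterministic;
    it is not the hash function Greenplum uses.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments):
    """Place each row on exactly one segment, chosen by its key."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return segments


def parallel_sum(segments, value_field):
    """Each segment sums its own rows; the partial sums are then combined."""
    partials = [sum(r[value_field] for r in seg) for seg in segments.values()]
    return sum(partials)


# Hypothetical sample data: 1,000 orders spread over 4 segments.
orders = [{"order_id": i, "amount": i % 7} for i in range(1000)]
segments = distribute(orders, "order_id", num_segments=4)
total = parallel_sum(segments, "amount")
```

In this sketch the per-segment sums run one after another; in a shared-nothing system each segment would compute its partial result on its own node, and only the small partial results would travel over the interconnect.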
Greenplum Database: Most Powerful Data Loading Capabilities
  - Industry-leading performance: >10 TB per hour per rack
  - Innovative, parallel-everything architecture: Scatter-Gather Streaming provides true linear scaling
  - Support for both large-batch and continuous real-time loading strategies
  - Enables complex data transformations in flight
  - Transparent interfaces to loading via support for files, applications, and services
Platform Independence Delivers Choice and Flexibility
Data Computing Appliance
  - Optimized price/performance
  - Minimum time-to-value
  - Ideal for production environments
Software-Only
  - On your x86 hardware
  - Flexibility for any workload
  - Ideal for QA or DR
Virtualized Infrastructure
  - Pool resources
  - Elastic scalability
  - Ideal for test & development
EMC GREENPLUM HD
Delivering Enterprise-Ready Apache Hadoop
What is Hadoop?
  - Open source Apache project (written in Java)
  - Provides distributed data and processing over commodity servers for unstructured data
  - Hadoop core components:
    - Distributed File System: distributes data
    - Map/Reduce: distributes computation (near the data)
Components
  HDFS        Hadoop Distributed File System
  MapReduce   Framework for writing scalable data applications
  Pig         Procedural language that abstracts lower-level MapReduce
  ZooKeeper   Highly reliable distributed coordination
  Hive        Data warehouse infrastructure built on top of Hadoop
  HBase       Database for random, real-time read/write access
  Oozie       Workflow/coordination to manage jobs
  Mahout      Scalable machine learning libraries
Hadoop Example: Yahoo! Search Assist
  - Insight: related concepts appear close together in the text corpus
  - Input: web pages
    - 1 billion pages, 10K bytes each
    - 10 TB of input data
  - Output: List(word, List(related words))
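To make the shape of this job concrete, here is a minimal sketch in plain Python. It is not Yahoo!'s algorithm and does not use the Hadoop API: the window size, the `map_page` and `reduce_related` names, and the two-page corpus are assumptions for illustration. The size check assumes "10K" means 10,000 bytes and that terabytes are decimal.

```python
from collections import Counter, defaultdict

# Input size from the slide (assumed: 10K = 10,000 bytes, 1 TB = 10**12 bytes).
PAGES = 1_000_000_000
BYTES_PER_PAGE = 10_000
input_tb = PAGES * BYTES_PER_PAGE / 10**12


def map_page(text, window=2):
    """Map step: emit (word, neighbor) for words within `window` positions."""
    words = text.lower().split()
    for i, word in enumerate(words):
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield word, words[j]


def reduce_related(pairs, top_n=3):
    """Reduce step: group by word and keep the most frequent neighbors."""
    counts = defaultdict(Counter)
    for word, neighbor in pairs:
        counts[word][neighbor] += 1
    return {w: [n for n, _ in c.most_common(top_n)] for w, c in counts.items()}


# Hypothetical two-page corpus standing in for the web crawl.
pages = ["big data analytics", "big data warehouse"]
pairs = [pair for page in pages for pair in map_page(page, window=1)]
related = reduce_related(pairs)
```

The map step touches one page at a time, so it can run wherever that page is stored; only the emitted pairs need to be grouped by word before the reduce step, which is the part a MapReduce framework handles between the two phases.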
Greenplum HD: Enterprise Edition
Enterprise-Ready Hadoop Platform for Unstructured Data
  - Faster: 2-5x faster than Apache Hadoop
  - Reliable: high availability, mirroring, snapshots
  - Easier to use: NFS mountable, system management
Hadoop and Database Co-Processing
  - Analytic productivity: applications, tools, Chorus
  - Data computing interfaces: SQL, MapReduce, in-database analytics, parallel data loading (batch or real-time)
  - Greenplum Database (SQL DB engine, compute and storage) and Hadoop (MapReduce engine, compute and storage), linked over the network by parallel data exchange
  - All data types: unstructured, structured, temporal, geospatial, sensor, and spatial data
Greenplum Data Computing Appliances: Application-Specific Configurations
  - DATABASE: Purpose-built, highly scalable data warehousing appliance that delivers leading price/performance
  - Greenplum Database combined with SAS high-performance computing to enable analytics on all the data
  - HADOOP: Greenplum Database combined with Hadoop to enable co-processing of structured and unstructured data
EMC* makes no representation and undertakes no obligations with regard to product planning information, anticipated product characteristics, performance specifications, or anticipated release dates (collectively, "Roadmap Information"). Roadmap Information is provided by EMC as an accommodation to the recipient solely for purposes of discussion and without intending to be bound thereby.
Connecting Functional Modules
  - GPDB (Greenplum Database Module): 4 servers optimally preconfigured with GP DB software for simple plug-and-play expansion of the database cluster
  - HD (Greenplum HD Module): 4 servers optimally preconfigured with GP HD software for simple plug-and-play expansion of the HDFS cluster
  - DIA (Data Integration Accelerator Module): 4 servers available for 3rd-party software that benefits from being on the shared interconnect for high-speed data access
Example 3-Rack Configuration
[Diagram: modules labeled HD, HD, DIA, GPDB, GPDB, HD, GPDB, HD, HD]
Sample Configuration with Greenplum Database Modules
Module Type                     GP DB Standard Module       GP DB High Capacity Module
Number of Modules               4             24            4             24
Number of Racks                 1             6             1             6
Usable Capacity (uncompressed)  36 TB         216 TB        124 TB        744 TB
Usable Capacity (compressed)    144 TB        864 TB        496 TB        2,976 TB
Scan Rate                       24 GB/sec     144 GB/sec    14 GB/sec     84 GB/sec
Data Load Rate                  10 TB/hour    60 TB/hour    10 TB/hour    60 TB/hour
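The figures in this table scale linearly with rack count, which a few lines of Python can check. This is a sketch under stated assumptions: the `scale` function and dictionary keys are made up for the example, the linear model is inferred from the table rather than stated on the slide, and the 4:1 compression ratio is simply what the table's two capacity rows imply (144 / 36 and 496 / 124).

```python
# Single-rack figures read from the table (4 modules = 1 rack).
PER_RACK = {
    "standard": {"usable_tb": 36, "scan_gb_s": 24, "load_tb_h": 10},
    "high_capacity": {"usable_tb": 124, "scan_gb_s": 14, "load_tb_h": 10},
}
# Ratio implied by the table; an inference, not a documented specification.
COMPRESSION_RATIO = 4


def scale(module_type, racks):
    """Project the single-rack figures to `racks` racks, assuming linear scaling."""
    base = PER_RACK[module_type]
    usable = base["usable_tb"] * racks
    return {
        "modules": 4 * racks,
        "usable_tb": usable,
        "compressed_tb": usable * COMPRESSION_RATIO,
        "scan_gb_s": base["scan_gb_s"] * racks,
        "load_tb_h": base["load_tb_h"] * racks,
    }


six_rack_standard = scale("standard", 6)
six_rack_high_capacity = scale("high_capacity", 6)
```

Both six-rack projections reproduce the 24-module columns of the table, which is consistent with the linear scalability claimed for the database earlier in the deck.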
Greenplum Data Computing Appliances: Seamless Infrastructure Integration
  - EMC Data Domain: efficient backup & restore
  - Isilon: scale-out storage for Big Data staging
  - EMC VMAX: SAN Mirror for advanced storage management
  - EMC VMAX SRDF and EMC Data Domain Replication: disaster recovery
GREENPLUM CHORUS
The World's First Enterprise Data Cloud Platform
Greenplum Chorus: Self-Service Analytic Infrastructure
  - Self-service provisioning
  - Data services
  - Collaborative analytics
How Do You Get Started?
Unlock the business value in big data
Our advanced analytics services will help you combine new, rich big data sources in powerful ways to discover new business insights.
  - Analytics Assessment
  - Greenplum Analytics Lab
  - Vision Workshop
  - Big Data Advisory Service
Powerful Big Data Partner Ecosystem
Greenplum: Current Success and Market Momentum
  - Leaders Quadrant in Gartner DW 2011
  - Mission-critical deployments across multiple industries
  - Installations from small (TBs) to very large (PBs)
  - Scalable analytics platform to complement the EDW
Customer Examples
Sample use cases across industries with Greenplum Database
  - Telecom, Media & Entertainment: analyze user behavior to eliminate network abuses
  - Retail: direct marketing/CRM
  - Financial Services: detect and prevent fraud; credit scoring and analysis to reduce credit risk
  - Pharmaceutical: analytics for drug discovery and development
  - Internet: clickstream analytics for ad targeting and market research
THANK YOU