Greenplum Database
Getting Started with Big Data Analytics
Ofir Manor, Pre-Sales Technical Architect, EMC Greenplum

Agenda
- Introduction to Greenplum
- Greenplum Database Architecture
- Flexible Database Configuration
- Beyond SQL Flexible Analytics
- Flexible Deployment
- Other considerations

THE ERA OF BIG DATA IS HERE
- "Big Data Is Less About Size, And More About Freedom" (Techcrunch)
- "Findings: Big Data Is More Extreme Than Volume" (Gartner)
- "Total data: bigger than big data" (451 Group)
- "Big Data! It's Real, It's Real-time, and It's Already Changing Your World" (IDC)

Industries Are Broadly Embracing Big Data
- Retail: CRM Customer Scoring, Store Siting and Layout, Fraud Detection / Prevention, Supply Chain Optimization
- Advertising & Public Relations: Demand Signaling, Ad Targeting, Sentiment Analysis, Customer Acquisition
- Financial Services: Algorithmic Trading, Risk Analysis, Fraud Detection, Portfolio Analysis
- Media & Telecommunications: Network Optimization, Customer Scoring, Churn Prevention, Fraud Prevention
- Manufacturing: Product Research, Engineering Analytics, Process & Quality Analysis, Distribution Optimization
- Energy: Smart Grid, Exploration
- Government: Market Governance, Counter-Terrorism, Econometrics, Health Informatics
- Healthcare & Life Sciences: Pharmaco-Genomics, Bio-Informatics, Pharmaceutical Research, Clinical Outcomes Research

The Power of Data Co-Processing
GREENPLUM DATABASE: Extreme Performance for Analytics
- Optimized for BI and analytics
  - Deep integration with statistical packages
  - High-performance parallel implementations
- Simple and automatic
  - Just load and query like any database
  - Tables are automatically distributed across nodes
- Extremely scalable
  - MPP shared-nothing architecture
  - All nodes can scan and process in parallel
  - Linear scalability by adding nodes

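The idea that tables are automatically distributed across nodes can be illustrated with a minimal Python sketch. It assumes rows are placed by hashing a distribution key and taking the result modulo the number of segments. The function names `segment_for` and `distribute`, the SHA-256 hash, and the sample keys are all illustrative assumptions, not Greenplum's actual hash function or API.

```python
import hashlib
from collections import Counter


def segment_for(key: str, num_segments: int) -> int:
    """Map a distribution-key value to a segment index.

    Illustrative only: SHA-256 stands in for whatever hash the database uses.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def distribute(keys, num_segments):
    """Return the number of rows that land on each segment."""
    counts = Counter(segment_for(k, num_segments) for k in keys)
    return [counts.get(s, 0) for s in range(num_segments)]


# Hypothetical table with 10,000 rows keyed by customer id, spread over 4 segments.
customer_ids = [f"customer-{i}" for i in range(10_000)]
rows_per_segment = distribute(customer_ids, 4)
print(rows_per_segment)
```

With a well-mixed hash, each segment receives roughly the same number of rows, which is what lets every node scan and process its own share in parallel.
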
GREENPLUM DATABASE: A Mature Enterprise Platform
- CLIENT ACCESS & TOOLS
  - Client Access: ODBC, JDBC, OLEDB, MapReduce, etc.
  - 3rd Party Tools: BI Tools, ETL Tools, Data Mining, etc.
  - Admin Tools: Greenplum Command Center, Greenplum Package Manager
- PRODUCT FEATURES
  - Loading & Ext. Access: Petabyte-Scale Loading, Trickle Micro-Batching, Anywhere Data Access
  - Storage & Data Access: Hybrid Storage & Execution (Row- & Column-Oriented), In-Database Compression, Multi-Level Partitioning, Indexes (Btree, Bitmap, etc.), External Table Support
  - Language Support: Comprehensive SQL, Native MapReduce, SQL 2003 OLAP Extensions, Programmable Analytics, Analytics Extensions (GeoSpatial, PL/R, PL/Java, PL/Python, PL/Perl)
- GREENPLUM DATABASE ADAPTIVE SERVICES
  - Multi-Level Fault Tolerance (RAID, Mirroring, DR with Data Domain Boost)
  - Online System Expansion
  - Workload Management
- CORE MPP ARCHITECTURE
  - Shared-Nothing MPP
  - Parallel Query Optimizer
  - Polymorphic Data Storage
  - Parallel Dataflow Engine
  - gNet Software Interconnect
  - Scatter/Gather Streaming Data Loading

Extremely Scalable MPP Shared-Nothing Architecture
[Diagram labels: SQL Client, Master, High-Speed Interconnect, four Segments]

Linear Scalability
- Each node has its own CPU and I/O resources
- Add nodes to scale
- Rebalance happens in the background
[Diagram labels: SQL Client, Master, High-Speed Interconnect, eight Segments]

GREENPLUM DATABASE: High Availability
- Master Server Data Protection
  - Replicated transaction logs for server failure
  - Optional RAID protection for drive failures
  - Upon server failure: standby server activated, administrator alerted, orchestrated failover
- Segment Server Data Protection
  - Mirrored segments for server failures
  - Optional RAID protection for drive failures
  - Upon server failure: mirrored segments take over with no loss of service
  - Fast online differential recovery
[Diagram labels: Master, Master, four Segments]

GREENPLUM DATABASE: Most Powerful Data Loading Capabilities
- Industry-leading performance at 10+ TB per hour per rack
- Scatter-Gather Streaming provides true linear scaling
- Support for both large-batch and continuous real-time loading strategies
- Enable complex data transformations in-flight
- Transparent interfaces to loading via support files, application, and services
[Chart: SINGLE RACK COMPARISON of Greenplum, Oracle Exadata, Netezza, Teradata]
Greenplum load rates scale linearly with the number of racks; others do not. For example, two racks = >20 TB/h.

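The linear-scaling claim above reduces to simple arithmetic, sketched here in Python. The only figure taken from the slide is the 10 TB/h per-rack lower bound. The function names and the 100 TB dataset are hypothetical, and real load rates depend on hardware, network, and data format.

```python
# Lower bound quoted on the slide ("10+ TB per hour per rack").
PER_RACK_TB_PER_HOUR = 10.0


def min_load_rate_tb_per_hour(racks: int, per_rack: float = PER_RACK_TB_PER_HOUR) -> float:
    """Lower-bound load rate, assuming the rate scales linearly with rack count."""
    if racks < 1:
        raise ValueError("racks must be at least 1")
    return racks * per_rack


def max_hours_to_load(dataset_tb: float, racks: int, per_rack: float = PER_RACK_TB_PER_HOUR) -> float:
    """Upper bound on load time for a dataset of the given size."""
    return dataset_tb / min_load_rate_tb_per_hour(racks, per_rack)


for racks in (1, 2, 4):
    rate = min_load_rate_tb_per_hour(racks)
    print(f"{racks} rack(s): at least {rate:.0f} TB/h")

# Hypothetical 100 TB dataset on two racks.
print(f"100 TB on 2 racks: at most {max_hours_to_load(100, 2):.1f} hours")
```
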
GREENPLUM DATABASE: Polymorphic Table Storage(TM)
[Diagram: TABLE CUSTOMER partitioned by month (Mar 11, Apr 11, May 11, Jun 11, Jul 11, Aug 11, Sept 11, Oct 11, Nov 11); column-oriented for COLD DATA, row-oriented for HOT DATA]
- Enable Information Lifecycle Management (ILM)
- Storage types can be mixed within a table or database
- Four table types: heap, row-oriented AO, column-oriented, external
- Block compression: Gzip (levels 1-9), QuickLZ
- Provide the choice of processing model for any table or partition

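The hot/cold split across partitions can be sketched as a small planning function in Python. This is an illustration under stated assumptions: the slide does not say where the boundary between hot and cold data falls, so the `hot_count` parameter, the function name `assign_storage`, and the choice of three hot months are all hypothetical.

```python
from typing import Dict, List


def assign_storage(partitions: List[str], hot_count: int) -> Dict[str, str]:
    """Pick a storage orientation per partition.

    `partitions` is ordered oldest to newest. The newest `hot_count`
    partitions are treated as hot (row-oriented); the rest as cold
    (column-oriented).
    """
    if hot_count < 0:
        raise ValueError("hot_count must not be negative")
    cutoff = max(len(partitions) - hot_count, 0)
    return {
        name: "column-oriented" if index < cutoff else "row-oriented"
        for index, name in enumerate(partitions)
    }


# Monthly partitions of the CUSTOMER table as shown on the slide.
months = ["Mar 11", "Apr 11", "May 11", "Jun 11", "Jul 11",
          "Aug 11", "Sept 11", "Oct 11", "Nov 11"]

# Assumption: the three most recent months are "hot".
plan = assign_storage(months, hot_count=3)
for name, storage in plan.items():
    print(f"{name}: {storage}")
```

As new months arrive, rerunning such a plan moves older partitions into the cold group, which is the lifecycle-management idea the slide refers to.
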
GREENPLUM DATABASE: In-Database Analytics
Bringing the power of parallelism to commonly used modeling and analytics functions
- In-database analytics: SAS HPA, Access, and Scoring Accelerator
- MADlib: an open-source library of advanced analytics functions
- Analytics extensions supported, including PostGIS (geospatial support), PL/R (statistical computing), PL/Java, PL/Perl, etc.

GREENPLUM PARTNERS: SAS and Greenplum
A Strategic Partnership for High-Performance Computing
- Access relational data sets for agile analysis: SAS/ACCESS provides fast, transparent and secure access to Greenplum data.
- Leverage database scalability for rapid model deployment: SAS Scoring Accelerator publishes models for execution in parallel across the Greenplum cluster.
- Build complex models at massive scales: the SAS High-Performance Analytics Appliance combines SAS In-Memory Analytics with Greenplum parallelism to produce record-breaking scalability and performance.

GREENPLUM DATABASE: MADlib
- Scalable in-database analytics
  - Data-parallel mathematical algorithms, statistical algorithms, and machine learning algorithms
  - Supports structured and unstructured data
- Delivered via open source
  - Accessibility
  - Skill development
  - Converge business, academic, and open-source communities

MADlib In-Database Analytical Functions
- Descriptive Statistics
  - Quantile
  - Profile
  - CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
  - FM (Flajolet-Martin) Sketch-based Estimator
  - MFV (Most Frequent Values) Sketch-based Estimator
  - Frequency
  - Histogram
  - Bar Chart
  - Box Plot Chart
- Modeling
  - Latent Dirichlet Allocation Topic Modeling
  - Correlation Matrix
  - Association Rule Mining
  - K-Means Clustering
  - Naïve Bayes Classification
  - Linear Regression
  - Logistic Regression
  - Support Vector Machines
  - SVD Matrix Factorisation
  - Decision Trees/CART

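To give a feel for one of the sketch-based estimators listed above, here is a minimal stand-alone Python sketch of a Count-Min structure. It is not MADlib's implementation or API: the class name, the default width and depth, and the use of SHA-256 to derive one hash per row are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter that never underestimates a count."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # One hash function per row, derived by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a cell, so the minimum is the tightest bound.
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))


# Hypothetical stream of values.
stream = ["retail"] * 50 + ["energy"] * 20 + ["telecom"] * 5
sketch = CountMinSketch()
for value in stream:
    sketch.add(value)

for value in ("retail", "energy", "telecom"):
    print(value, sketch.estimate(value))
```

The structure uses fixed memory regardless of how many distinct values appear, and its counters can be summed cell by cell across workers, which is one reason this kind of estimator suits data-parallel execution.
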
Greenplum Analytics Labs
Packaged solutions that produce business value and actionable results
- Accelerate analytics capabilities on your data with your analysts
- Leverage the expertise of Greenplum's Data Scientists
- Establish a strategic vision for analytics development

Greenplum Delivers Choice & Flexibility
- Greenplum Data Computing Appliance
  - Choose Greenplum Database and/or Hadoop modules in ¼ rack increments
  - Scale up by adding your choice of additional modules
  - Minimal time to value
- Greenplum Software Solutions
  - Greenplum Database, Hadoop, & Chorus on your x86 hardware
  - Flexibility for any workload or environment
  - Perpetual or subscription licenses

GREENPLUM DCA: Seamless Infrastructure Integration
- EMC Data Domain: Efficient Backup & Restore
- Isilon: Scale-Out Storage for Big Data Staging
- EMC VMAX or VNX: SAN Mirror for Advanced Storage Management
- EMC VMAX SRDF, EMC Data Domain Replication: For Disaster Recovery

GREENPLUM DATABASE: Simple To Manage
- Greenplum Command Center
  - Complete platform management and control
- Greenplum Package Manager
  - Automates install, uninstall, update, and query for analytics extensions
  - Supports package migration during upgrade, segment recovery, expansion, and standby initialization

Innovative Companies Using Greenplum
Powerful Partner Ecosystem
Discovix
Thank you
ofir.manor@emc.com
Downloads, documentation, whitepapers, etc.: http://www.greenplum.com
A copy of this presentation will be available on the event's web site.
Next Greenplum workshop in Hungary: 04 July, 2012. Register now at EMC Hungary or Avnet Hungary.