Advanced In-Database Analytics
  Tallinn, Sept. 25th, 2012
  Mikko-Pekka Bertling, BDM Greenplum EMEA
That sounds complicated?
Who can tell me how best to solve this?
What are the main mathematical functions?
  MULTIPLICATION
  DIVISION
  ADDITION
So What's the Problem?
  How many of you have tried to run complex queries that cannot complete?
  How many of you would love for your IT or other execs to just understand the basic maths?
  How many of you would like your analytics to be part of every business process?
The Platform
Data Science
Application Development
THE PLATFORM
  Greenplum UAP: Unified Analytics Platform for Big Data
  Greenplum Database for structured data
  Greenplum HD, enterprise-ready Hadoop for unstructured data
  Greenplum Chorus, the social platform for data science
Introducing EMC Greenplum Data Computing Appliance
  DATA IN. DECISIONS OUT.
  Delivering the fastest data loading and best price/performance ratio in the data warehousing industry
EMC Greenplum Data Computing Appliance
  Performance, scalability, reliability, and reduced TCO for data warehousing/business intelligence environments
  Extreme performance: optimized for fast query execution and unmatched data loading
  Rapidly deployable: purpose-built data warehousing appliance
  Reduced TCO: consolidate data marts for lower costs
  Private cloud-ready: data and computing are automatically optimized and distributed
  Highly available: self-healing and fully redundant
  Elastic scalability: expand capacity and performance online
  Advanced backup and disaster recovery: leverage industry-leading Data Domain backup and recovery
Benefits of an Appliance Approach
  EMC Greenplum Data Computing Appliance: compute, storage, database, network
  Rapidly deployable in days, not weeks or months
  Appliance packaging and pre-tuning assure predictable performance
  Dramatically simplifies data warehouse and analytics infrastructure
  Reduces administration overhead
  Scale-out architecture; simply expand capacity and performance as needed
  Designed for rapid analysis of data volumes from less than a terabyte, scaling into the petabytes
  One support structure
EMC Greenplum Database
  DATA IN: fastest data loading
    Scatter/Gather Streaming technology for the world's fastest data loading
    Eliminate data load bottlenecks
    Clean and integrate new data
    Several loading options, ranging from bulk load updates to microbatching for near real-time processing
  IN-DATABASE ANALYTICS: advanced analytics
    Optimized for fast query execution and linear scalability
    Move processing closer to data
    Shared-nothing, massively parallel processing (MPP) scale-out architecture
    Computing is automatically optimized and distributed across resources
    Provides the best concurrent multiworkload performance
  DECISIONS OUT
    Unified data access for greater insight and value from data
    Enable parallel analysis across the enterprise
    Open platform with broad language support
    Certified enterprise connectivity and integration with most business intelligence; extract, transform, and load (ETL); and management products
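The microbatching option mentioned above can be pictured with a minimal Python sketch. It only illustrates the idea of grouping a continuous stream into small batches that are loaded one after another; `microbatches` and `load_batch` are hypothetical names, and `load_batch` is a stand-in for whatever bulk-load call a real system would use, not a Greenplum API.

```python
from itertools import islice


def microbatches(records, batch_size):
    """Yield consecutive lists of at most batch_size records from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


loaded = []  # stands in for the target table


def load_batch(batch):
    """Hypothetical stand-in for one small bulk load."""
    loaded.append(list(batch))


# A stream of ten records, loaded four at a time.
for batch in microbatches(range(10), 4):
    load_batch(batch)
```

Smaller batches shorten the delay before new data becomes queryable, at the cost of more load operations.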
Industry's Fastest Data-Loading Rate
  Scatter/Gather Streaming technology for the industry's fastest data loading
  Eliminate data-load bottlenecks
  Remove additional loading tiers
  Parallel everywhere
  [Chart: data-loading rate in TB/hour for Netezza TwinFin, Teradata, Oracle Exadata, and EMC Greenplum Data Computing Appliance, with 5X and 2X callouts]
EMC Greenplum Data Computing Appliance Architecture
  Flexible framework for processing large datasets
  [Diagram: SQL and MapReduce access, two master servers, five segment servers]
  Massively parallel processing (MPP) architecture
  Shared-nothing architecture
  No single coordinator or performance bottleneck
  MPP everywhere
  Query optimization across segment servers
  Automated failover
  High reliability and availability
  Linear scalability
  I/O optimized
Shared-Nothing Architecture: Massively Parallel Processing (MPP)
  [Diagram labels: Interconnect, Loading]
  Most scalable database architecture
    Optimized for business intelligence and analytics
  Provides automatic parallelization
    No need for manual partitioning or tuning
    Just load and query like any database
  Tables are distributed across segments
    Each segment holds a subset of the rows
  Extremely scalable and I/O optimized
    All nodes can scan and process in parallel
    No I/O contention among segments
  Linear scalability by adding nodes
    Each node adds storage, query performance, and loading performance
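The distribution idea on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Greenplum's implementation: the segment count, the CRC32 hash, and the names `segment_for`, `distribute`, and `parallel_sum` are all hypothetical, and Python threads only mimic the scatter/gather shape of the work rather than giving true parallel speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_SEGMENTS = 4  # hypothetical segment count


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution key to one segment using a stable hash."""
    return crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments=NUM_SEGMENTS):
    """Place every row on exactly one segment, chosen by its key."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return segments


def local_sum(segment_rows, value_field):
    """Work done by one segment: scan only the rows it holds."""
    return sum(row[value_field] for row in segment_rows)


def parallel_sum(segments, value_field):
    """Scatter the scan to all segments, then gather the partial sums."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(lambda rows: local_sum(rows, value_field), segments))
    return sum(partials)


rows = [{"customer_id": i, "amount": i * 10} for i in range(1, 101)]
segments = distribute(rows, "customer_id")
total = parallel_sum(segments, "amount")
```

Because each row lives on exactly one segment, the partial sums can be computed independently and simply added together.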
High Availability: self-healing and rapid recovery
  [Diagram: two master servers, four segment servers]
  Master server data protection
    RAID protection for drive failures
    Replicated transaction logs for server failure
    On server failure: standby server activated, administrator alerted
  Segment server data protection
    RAID protection for drive failures
    Mirrored segments for server failures
    On server failure: mirrored segments take over with no loss of service
    Fast online differential recovery
Self-Healing Automatic Failover
  [Diagram: master servers, network interconnect, segment servers]
  Greenplum provides automatic failover using a self-healing physical block replication architecture
  Key benefits of this architecture:
    Automatic failure detection and failover to mirror segments
    Fast differential recovery and sync (while fully online/read-write)
    Improved write performance and reduced network load
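The failover and differential-recovery behaviour described above can be pictured with a toy Python model. It is a simplified sketch, not how Greenplum works internally: it replicates whole rows rather than physical blocks, and the classes `SegmentCopy` and `SegmentPair` and the host names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentCopy:
    host: str
    rows: list = field(default_factory=list)
    up: bool = True


@dataclass
class SegmentPair:
    primary: SegmentCopy
    mirror: SegmentCopy
    pending: list = field(default_factory=list)  # changes a down copy has missed

    def active(self):
        """Return the copy that serves requests: primary if up, else mirror."""
        if self.primary.up:
            return self.primary
        if self.mirror.up:
            return self.mirror
        raise RuntimeError("both copies of the segment are down")

    def write(self, row):
        """Apply a write to the active copy and replicate it if possible."""
        serving = self.active()
        other = self.mirror if serving is self.primary else self.primary
        serving.rows.append(row)
        if other.up:
            other.rows.append(row)
        else:
            self.pending.append(row)  # remember it for later recovery

    def recover(self, copy):
        """Differential recovery: replay only the missed changes."""
        replayed = len(self.pending)
        copy.rows.extend(self.pending)
        self.pending.clear()
        copy.up = True
        return replayed


pair = SegmentPair(SegmentCopy("host-a"), SegmentCopy("host-b"))
pair.write("row 1")
pair.primary.up = False               # simulated server failure
pair.write("row 2")                   # served by the mirror, no loss of service
replayed = pair.recover(pair.primary)
```

Only the changes made during the outage are replayed, which is what keeps recovery fast compared with a full copy.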
Integrated EMC Data Domain Backup
  [Diagram: EMC Greenplum Data Computing Appliance segment servers with NFS shares, connected by Twinax/Fibre Channel cables and two 10 Gb IP links to an EMC Data Domain DD880]
  Backup and recovery with EMC Data Domain/Greenplum native utility
  Reduces storage backup requirements
    Deduplicates data
  Fast, reliable data recovery
    Reduced recovery time
  Flexible and efficient
    Designate backup intervals
    Point-in-time copies
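Why deduplication reduces backup storage can be shown with a small Python sketch. It assumes fixed-size chunks identified by a SHA-256 digest purely for simplicity; this is not a description of how Data Domain segments or stores data, and `DedupStore` is a hypothetical name.

```python
import hashlib

CHUNK_SIZE = 8  # deliberately tiny so the example is easy to follow


class DedupStore:
    """Store each distinct chunk once; a backup is a list of chunk digests."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.chunks = {}   # digest -> chunk bytes
        self.backups = {}  # backup name -> ordered list of digests

    def backup(self, name, data):
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # skip chunks already stored
            recipe.append(digest)
        self.backups[name] = recipe

    def restore(self, name):
        """Rebuild a point-in-time copy from its chunk list."""
        return b"".join(self.chunks[digest] for digest in self.backups[name])

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())


store = DedupStore()
monday = b"AAAAAAAABBBBBBBBCCCCCCCC"
tuesday = b"AAAAAAAABBBBBBBBDDDDDDDD"  # only the last chunk changed
store.backup("monday", monday)
store.backup("tuesday", tuesday)
logical_bytes = len(monday) + len(tuesday)  # 48
physical_bytes = store.stored_bytes()       # 32
```

Both backups remain fully restorable even though the unchanged chunks are stored only once.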
Proven Deployments of EMC Greenplum Database
  Sample use cases across industries with Greenplum
  Telecommunications, Media, and Entertainment
    Understand customer behaviors to reduce customer churn rates and develop customer loyalty programs
  Retail
    Analyze supply chain to optimize and cut costs
  Internet
    Clickstream analytics for ad targeting and market research
  Financial Services
    Detect and prevent fraud
    Credit scoring to reduce credit risk
  Pharmaceutical
    Analytics for drug discovery and development
Greenplum Data Computing Appliance Is Complementary to Enterprise Data Warehouse
  Enterprise Data Warehouse
    Single source of truth
    One logical model
    Heavy data governance and quality
    Operational reporting
    Financial consolidation
  Greenplum Data Computing Appliance
    Source of all the raw data (often 10 times the size of the enterprise data warehouse)
    Self-service infrastructure to support multiple data marts and sandboxes
    Rapid analytic iteration and business-led solutions
The Need for Consolidation: Data in a Typical Enterprise
  Enterprise data warehouse: ~10% of data
  Data marts and personal databases: ~90% of data
  Data is everywhere: corporate enterprise data warehouse, hundreds of data marts, shadow databases, and spreadsheets
  The goal of centralizing all data in a single enterprise data warehouse has proven untenable
GREENPLUM DATABASE
MADlib In-Database Analytical Functions
  Descriptive Statistics
    Quantile
    Profile
    CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
    FM (Flajolet-Martin) Sketch-based Estimator
    MFV (Most Frequent Values) Sketch-based Estimator
    Frequency Histogram
    Bar Chart
    Box Plot Chart
    Correlation Matrix
  Modeling
    Latent Dirichlet Allocation Topic Modeling
    Association Rule Mining
    K-Means Clustering
    Naïve Bayes Classification
    Linear Regression
    Logistic Regression
    Support Vector Machines
    SVD Matrix Factorisation
    Decision Trees/CART
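To give a feel for one of the listed functions, here is a minimal Python sketch of the CountMin (Cormode-Muthukrishnan) technique for approximate frequency counts. It is not MADlib's code or API: the class name, the default width and depth, and the use of SHA-256 to derive the row hashes are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: depth rows of width counters, one hash per row."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        """Yield the (row, column) cell that each row's hash assigns to item."""
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        """Smallest counter across rows; never below the true count."""
        return min(self.table[row][col] for row, col in self._cells(item))


cms = CountMinSketch()
for word in ["apple", "pear", "apple", "fig", "apple"]:
    cms.add(word)
apple_estimate = cms.estimate("apple")
```

Hash collisions can only inflate a counter, so taking the minimum over the rows gives an estimate that may overshoot but never undershoots, using fixed memory regardless of how many distinct items arrive.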
GREENPLUM HD
Mahout Analytical Functions for Hadoop
  Sampling of algorithms in Mahout today:
    Collaborative Filtering
    User-based, Item-based recommenders
    K-Means Clustering
    Fuzzy K-Means Clustering
    Mean Shift Clustering
    Dirichlet Process Clustering
    Latent Dirichlet Allocation
    Singular Value Decomposition
    Parallel Frequent Pattern mining
    Complementary Naïve Bayes Classifier
    Random Forest Decision Tree-Based Classifier
    Java collections (previously Colt)
  Many more are included or are in development
  Plus, a robust and growing user community
Powerful Partner Ecosystem
  ANALYTICS
  BUSINESS INTELLIGENCE
  DATA INTEGRATION
  INDUSTRY
    Discovix
  TECHNOLOGY
Greenplum Analytics Lab
  Data Science
  Leverage the expertise of Greenplum's Data Scientists
Application Development
  Pivotal Labs
  The Execution Engine To Quickly Create And Deploy Big Data Applications
GREENPLUM DELIVERS THE PREDICTIVE ENTERPRISE
The Predictive Enterprise
  Data-Driven Decisions
  Deliver maximum business value from all the available data
  Predict outcomes using advanced analytics
  Leverage data science to gain deep insight about the business
  Turn insight into action with new applications
LET'S GET STARTED