Advanced In-Database Analytics




Advanced In-Database Analytics
Tallinn, Sept. 25th, 2012
Mikko-Pekka Bertling, BDM Greenplum EMEA 1

That sounds complicated? 2

Who can tell me how best to solve this? 3

What are the main mathematical functions? MULTIPLICATION, DIVISION, ADDITION 4

So What's the Problem?
- How many of you have tried to run complex queries that cannot complete?
- How many of you would love for your IT or other execs to just understand the basic maths?
- How many of you would like your analytics to be part of every business process? 5

The Platform Data Science Application Development 6

THE PLATFORM
Greenplum UAP: Unified Analytics Platform for Big Data
- Greenplum Database for structured data
- Greenplum HD, enterprise-ready Hadoop for unstructured data
- Greenplum Chorus, the social platform for data science 7

Introducing EMC Greenplum Data Computing Appliance
DATA IN. DECISIONS OUT.
Delivering the fastest data loading and best price/performance ratio in the data warehousing industry 8

EMC Greenplum Data Computing Appliance
Performance, scalability, reliability, and reduced TCO for data warehousing/business intelligence environments
- Extreme performance: optimized for fast query execution and unmatched data loading
- Rapidly deployable: purpose-built data warehousing appliance
- Reduced TCO: consolidate data marts for lower costs
- Private cloud-ready: data and computing are automatically optimized and distributed
- Highly available: self-healing and fully redundant
- Elastic scalability: expand capacity and performance online
- Advanced backup and disaster recovery: leverage industry-leading Data Domain backup and recovery 9

Benefits of an Appliance Approach
EMC GREENPLUM DATA COMPUTING APPLIANCE: Compute, Storage, Database, Network
- Rapidly deployable in days, not weeks or months
- Appliance packaging and pre-tuning assure predictable performance
- Dramatically simplifies data warehouse and analytics infrastructure
- Reduces administration overhead
- Scale-out architecture; simply expand capacity and performance as needed
- Designed for rapid analysis of data volumes from less than a terabyte, scaling into the petabytes
- One support structure 10

EMC Greenplum Database
Fastest data loading, advanced analytics
DATA IN
- Scatter/Gather Streaming technology for the world's fastest data loading
- Eliminate data load bottlenecks
- Clean and integrate new data
- Several loading options, ranging from bulk load updates to microbatching for near real-time processing
IN-DATABASE ANALYTICS
- Optimized for fast query execution and linear scalability
- Move processing closer to data
- Shared-nothing, massively parallel processing (MPP) scale-out architecture
- Computing is automatically optimized and distributed across resources
- Provides the best concurrent multiworkload performance
DECISIONS OUT
- Unified data access for greater insight and value from data
- Enable parallel analysis across the enterprise
- Open platform with broad language support
- Certified enterprise connectivity and integration with most business intelligence; extract, transform, and load (ETL); and management products 11

Industry's Fastest Data-Loading Rate
Scatter/Gather Streaming technology for the industry's fastest data loading
- Eliminate data-load bottlenecks
- Remove additional loading tiers
- Parallel everywhere
[Chart: data-loading rate in TB/hour for Netezza TwinFin, Teradata, Oracle Exadata, and EMC Greenplum Data Computing Appliance; annotations "5X" and "2X"] 12

EMC Greenplum Data Computing Appliance Architecture
Flexible framework for processing large datasets
[Diagram labels: SQL, MapReduce, Master (x2), Segment (x5)]
- Massively parallel processing (MPP) architecture
- Shared-nothing architecture
- No single coordinator or performance bottleneck
- MPP everywhere
- Query optimization across segment servers
- Automated failover
- High reliability and availability
- Linear scalability
- I/O optimized 13

Shared-Nothing Architecture
Massively Parallel Processing (MPP)
[Diagram labels: Interconnect, Loading]
Most scalable database architecture
- Optimized for business intelligence and analytics
- Provides automatic parallelization
  - No need for manual partitioning or tuning
  - Just load and query like any database
- Tables are distributed across segments
  - Each segment has a subset of the rows
Extremely scalable and I/O optimized
- All nodes can scan and process in parallel
- No I/O contention among segments
Linear scalability by adding nodes
- Each node adds storage, query performance, and loading performance 14
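
The row distribution and parallel scan described on this slide can be illustrated with a small sketch. It is not Greenplum code: the segment count, the CRC32 hash, and the names `segment_for`, `distribute`, and `parallel_sum` are all assumptions chosen for illustration, and the "parallel" step runs sequentially here.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical segment count, not taken from the slides


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution key to a segment using a stable hash (assumed scheme)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments=NUM_SEGMENTS):
    """Place each row on exactly one segment, so every segment holds a subset of the rows."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return segments


def parallel_sum(segments, value_field):
    """Each segment sums only its own rows; the partial sums are then combined.

    The per-segment work is independent, which is what lets a real MPP system
    run it on all nodes at once. This sketch simply loops over the segments.
    """
    partials = [sum(row[value_field] for row in rows) for rows in segments.values()]
    return sum(partials)


rows = [{"customer_id": i, "amount": i * 10} for i in range(1, 101)]
segments = distribute(rows, "customer_id")
total = parallel_sum(segments, "amount")
print(total)  # same answer as summing the undistributed table
```

Because no row lives on two segments and no segment reads another's data, adding segments shrinks each segment's share of the scan, which is the intuition behind the linear-scalability claim.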

High Availability
Self-healing and rapid recovery
[Diagram labels: Master (x2), Segment (x4)]
Master server data protection
- RAID protection for drive failures
- Replicated transaction logs for server failure
- On server failure: standby server activated, administrator alerted
Segment server data protection
- RAID protection for drive failures
- Mirrored segments for server failures
- On server failure: mirrored segments take over with no loss of service
- Fast online differential recovery 15

Self-Healing Automatic Failover
[Diagram labels: Master servers, Network Interconnect, Segment servers]
Greenplum provides automatic failover using a self-healing physical block replication architecture.
Key benefits of this architecture:
- Automatic failure detection and failover to mirror segments
- Fast differential recovery and sync (while fully online/read-write)
- Improved write performance and reduced network load 16
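
The failover and differential-recovery behaviour listed above can be sketched as a toy model. This is an assumption-laden illustration, not Greenplum's mechanism: the slides describe physical block replication, whereas this sketch replicates whole rows, and the class and method names (`Segment`, `MirroredSegment`, `recover`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    name: str
    rows: list = field(default_factory=list)
    alive: bool = True


class MirroredSegment:
    """A primary/mirror pair that keeps serving reads and writes if one copy fails."""

    def __init__(self, name):
        self.primary = Segment(f"{name}-primary")
        self.mirror = Segment(f"{name}-mirror")
        self.pending = []  # changes the failed copy has missed

    def write(self, row):
        targets = [s for s in (self.primary, self.mirror) if s.alive]
        if not targets:
            raise RuntimeError("both copies are down")
        for segment in targets:
            segment.rows.append(row)
        if len(targets) == 1:
            # Remember the change so recovery can replay only the difference.
            self.pending.append(row)

    def read(self):
        # Failover: use the mirror whenever the primary is unavailable.
        active = self.primary if self.primary.alive else self.mirror
        if not active.alive:
            raise RuntimeError("both copies are down")
        return list(active.rows)

    def recover(self):
        """Bring a failed copy back by replaying only the missed changes."""
        for segment in (self.primary, self.mirror):
            if not segment.alive:
                segment.rows.extend(self.pending)
                segment.alive = True
        replayed = len(self.pending)
        self.pending.clear()
        return replayed


seg = MirroredSegment("seg0")
seg.write("row-1")
seg.write("row-2")
seg.primary.alive = False      # simulate a server failure
seg.write("row-3")             # still accepted, served by the mirror
after_failure = seg.read()
replayed = seg.recover()       # only row-3 needs to be copied back
```

The point of tracking `pending` is that recovery cost depends on how much changed during the outage, not on the full size of the segment.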

Integrated EMC Data Domain Backup
[Diagram labels: EMC Greenplum Data Computing Appliance, segment server NFS shares, Twinax/Fibre Channel cables, two 10 Gb IP links, EMC Data Domain DD880]
Backup and recovery with EMC Data Domain/Greenplum native utility
- Reduces storage backup requirements: deduplicates data
- Fast, reliable data recovery: reduced recovery time
- Flexible and efficient: designate backup intervals, point-in-time copies 17
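
Why deduplication reduces backup storage can be shown with a minimal sketch. It assumes fixed-size chunks identified by SHA-256 digest; the slides do not describe how Data Domain actually segments or fingerprints data, so `DedupStore`, its chunk size, and its methods are hypothetical.

```python
import hashlib


class DedupStore:
    """Stores each distinct chunk once and keeps per-backup lists of chunk digests."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumed fixed chunk size
        self.chunks = {}              # digest -> chunk bytes, stored once
        self.backups = {}             # backup name -> ordered list of digests

    def backup(self, name, data):
        digests = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # skip chunks already stored
            digests.append(digest)
        self.backups[name] = digests

    def restore(self, name):
        """Rebuild a point-in-time copy from its chunk list."""
        return b"".join(self.chunks[d] for d in self.backups[name])

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())


store = DedupStore()
monday = b"A" * 8192 + b"B" * 4096
tuesday = monday + b"C" * 4096       # mostly unchanged since Monday
store.backup("monday", monday)
store.backup("tuesday", tuesday)

logical = len(monday) + len(tuesday)  # bytes backed up in total
physical = store.stored_bytes()       # bytes actually kept
```

Both backups stay individually restorable, yet the second one costs only the storage of the chunks that changed.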

Proven Deployments of EMC Greenplum Database
Sample use cases across industries with Greenplum
- Telecommunications, Media, and Entertainment: understand customer behaviors to reduce customer churn rates and develop customer loyalty programs
- Retail: analyze supply chain to optimize and cut costs
- Internet: clickstream analytics for ad targeting and market research
- Financial Services: detect and prevent fraud; credit scoring to reduce credit risk
- Pharmaceutical: analytics for drug discovery and development 18

Greenplum Data Computing Appliance Is Complementary to Enterprise Data Warehouse
Enterprise Data Warehouse
- Single source of truth
- One logical model
- Heavy data governance and quality
- Operational reporting
- Financial consolidation
Greenplum Data Computing Appliance
- Source of all the raw data (often 10 times the size of the enterprise data warehouse)
- Self-service infrastructure to support multiple data marts and sandboxes
- Rapid analytic iteration and business-led solutions 19

The Need for Consolidation: Data in a Typical Enterprise
- Enterprise data warehouse: ~10% of data
- Data marts and personal databases: ~90% of data
Data is everywhere: corporate enterprise data warehouse, hundreds of data marts, shadow databases, and spreadsheets.
The goal of centralizing all data in a single enterprise data warehouse has proven untenable. 20

GREENPLUM DATABASE
MADlib In-Database Analytical Functions
Descriptive Statistics
- Quantile
- Profile
- CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
- FM (Flajolet-Martin) Sketch-based Estimator
- MFV (Most Frequent Values) Sketch-based Estimator
- Frequency Histogram
- Bar Chart
- Box Plot Chart
Modeling
- Latent Dirichlet Allocation Topic Modeling
- Correlation Matrix
- Association Rule Mining
- K-Means Clustering
- Naïve Bayes Classification
- Linear Regression
- Logistic Regression
- Support Vector Machines
- SVD Matrix Factorisation
- Decision Trees/CART 21
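
To show what a sketch-based estimator such as CountMin does, here is a minimal Python version of the Count-Min idea. It is not MADlib's implementation or API: the class name, the default width and depth, and the use of SHA-256 to derive row hashes are assumptions (the formal analysis of Count-Min uses pairwise-independent hash functions, which this sketch replaces with a simpler stand-in).

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using a fixed-size table of counters."""

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash per row, derived by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate and never undercounts.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

    def merge(self, other):
        """Combine two sketches built on different data partitions."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same dimensions")
        merged = CountMinSketch(self.width, self.depth)
        for row in range(self.depth):
            merged.table[row] = [a + b for a, b in zip(self.table[row], other.table[row])]
        return merged


# Two partitions of the data, each summarized independently.
seg_a, seg_b = CountMinSketch(), CountMinSketch()
for word in ["x", "y", "x"]:
    seg_a.add(word)
for word in ["x", "z"]:
    seg_b.add(word)
merged = seg_a.merge(seg_b)
estimate_x = merged.estimate("x")  # at least the true count of 3
```

The `merge` step is what makes sketches a natural fit for a shared-nothing system: each segment can summarize its own rows in fixed memory, and the small summaries are added together instead of moving the data.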

GREENPLUM HD
Mahout Analytical Functions for Hadoop
Sampling of algorithms in Mahout today:
- Collaborative Filtering
- User-based, Item-based recommenders
- K-Means Clustering
- Fuzzy K-Means Clustering
- Mean Shift Clustering
- Dirichlet Process Clustering
- Latent Dirichlet Allocation
- Singular Value Decomposition
- Parallel Frequent Pattern mining
- Complementary Naïve Bayes Classifier
- Random Forest Decision Tree-Based Classifier
- Java collections (previously Colt)
Many more are included or are in development. Plus, a robust and growing user community. 22

Powerful Partner Ecosystem ANALYTICS BUSINESS INTELLIGENCE DATA INTEGRATION INDUSTRY Discovix TECHNOLOGY 23

So What's the Problem?
- How many of you have tried to run complex queries that cannot complete?
- How many of you would love for your IT or other execs to just understand the basic maths?
- How many of you would like your analytics to be part of every business process? 24

Greenplum Analytics Lab
Data Science
Leverage the expertise of Greenplum's Data Scientists 25

So What's the Problem?
- How many of you have tried to run complex queries that cannot complete?
- How many of you would love for your IT or other execs to just understand the basic maths?
- How many of you would like your analytics to be part of every business process? 26

Application Development
Pivotal Labs
The Execution Engine To Quickly Create And Deploy Big Data Applications 27

GREENPLUM DELIVERS THE PREDICTIVE ENTERPRISE 28

The Predictive Enterprise
Predictive Enterprise: Data Driven Decisions
- Deliver maximum business value from all the available data
- Predict outcomes using advanced analytics
- Leverage data science to gain deep insight about the business
- Turn insight into action with new applications 29

LET'S GET STARTED 30