In-Database Analytics Deep Dive with Teradata and Revolution R



Transcription:

In-Database Analytics Deep Dive with Teradata and Revolution R
Mario Inchiosa, Chief Scientist, Revolution Analytics
Tim Miller, Partner Integration Lab, Teradata

Agenda
> Introduction
> Revolution R Enterprise
> Case Study: Global Internet Marketplace
> Under the Hood
> Summary & Questions

Poll Question #1 (please choose all that apply)
What data storage/management software do you use?
> Hadoop
> Teradata
> LSF Clusters/Grids
> Servers

What is R?
> Most powerful statistical programming language: flexible, extensible and comprehensive for productivity
> Most widely used data analysis software: used by 2M+ data scientists, statisticians and analysts
> Create beautiful and unique data visualizations: as seen in New York Times, Twitter and Flowing Data
> Thriving open-source community: leading edge of analytics research
> Fills the talent gap: new graduates prefer R
White paper: R is Hot (bit.ly/r-ishot)

Exploding growth and demand for R
R Usage Growth (Rexer Data Miner Survey, 2007-2013; source: www.rexeranalytics.com)
> 70% of data miners report using R
> R is the first choice of more data miners than any other software
> R is the highest paid IT skill (Dice.com, Jan 2014)
> R most-used data science language after SQL (O'Reilly, Jan 2014)
> R is used by 70% of data miners (Rexer, Sep 2013)
> R is #15 of all programming languages (RedMonk, Jan 2014)
> R growing faster than any other data science language (KDnuggets, Aug 2013)
> More than 2 million users worldwide

Why Is Teradata Different? Server Based vs. In-Database Architectures
[Figure: a decision tree on debt-to-income ratio and income that separates good from bad credit risks, shown for both architectures]
> Desktop and Server Analytic Architecture (diagram labels: Sample Data, Analyst, Results)
> In-Database Analytic Architecture (diagram labels: SQL Request, Results)
> Exponential Performance Improvement

Challenges Running R in Parallel
> R is distributed across nodes or servers, and each instance runs independently of the other nodes/servers
  > Great for row-independent processing such as model scoring
  > However, for analytic functions that require all the data, such as model building, the onus is on the R programmer to understand data parallelism
Example: Median (Midpoint)
Data across the nodes: 1 1 1 1 2 9 1 7 9 3 9 9
> Node level (wrong): 1. Find the median per node (1 2 7 9). 2. Consolidate and find the midpoint of the results. 3. Produce the wrong answer: 4.5
> System level (right): 1. Sort all the data (1 1 1 1 1 2 3 7 9 9 9 9). 2. Take the midpoint. 3. Produce the right answer: 2.5
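The median example can be reproduced in a few lines of base R. This is only a sketch: the slide lists twelve values without showing which node holds which, so the split into four nodes of three values each is an assumption, chosen because it yields the per-node medians 1, 2, 7 and 9 quoted above.

```r
# Values from the slide; the grouping into four nodes is an assumption.
nodes <- list(
  node1 = c(1, 1, 1),
  node2 = c(1, 2, 9),
  node3 = c(1, 7, 9),
  node4 = c(3, 9, 9)
)

# Node-level approach: median per node, then the midpoint of those medians.
node_medians <- sapply(nodes, median)
node_level   <- median(node_medians)

# System-level approach: pool all the data, then take the median.
system_level <- median(unlist(nodes))

# A sum, by contrast, can be combined safely from per-node results.
sum_of_node_sums <- sum(sapply(nodes, sum))
pooled_sum       <- sum(unlist(nodes))

print(c(node_level = node_level, system_level = system_level))
print(c(sum_of_node_sums = sum_of_node_sums, pooled_sum = pooled_sum))
```

The two medians disagree while the two sums agree, which is the distinction the slide draws: some statistics decompose across partitions and some do not.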

R Operations on Data
> R operates on independent rows
  > Score models for a given observation
  > Parse a text field
  > Log(x)
> R operates on independent partitions
  > Fit a model to a partition such as region, time, product or store
> R operates on the entire data set
  > Global sales average
  > Regression on all customers
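The three classes of operation can be illustrated with open source R on a small in-memory data frame. The data set, column names and model below are hypothetical and exist only to make the distinction concrete; nothing here uses RRE.

```r
# Hypothetical sales data, generated for illustration only.
set.seed(1)
sales <- data.frame(
  region = rep(c("east", "west"), each = 50),
  price  = runif(100, 1, 10)
)
sales$units <- 100 - 5 * sales$price + rnorm(100)

# 1. Independent rows: each result depends only on its own row.
sales$log_price <- log(sales$price)

# 2. Independent partitions: one model per region, fitted separately.
fits_by_region <- lapply(
  split(sales, sales$region),
  function(part) lm(units ~ price, data = part)
)

# 3. Entire data set: the result needs every row.
global_mean <- mean(sales$units)
global_fit  <- lm(units ~ price, data = sales)
```

The first two classes parallelize trivially because no row or partition needs to see another. The third is where the data-parallelism concerns from the median example apply.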

Poll Question #2 (please choose all that apply)
What statistical programming tools do you use?
> R/RRE
> SAS
> SPSS
> Statistica
> KXEN

Revolution Analytics Who is Revolution Analytics?

OUR COMPANY: The leading provider of advanced analytics software and services based on open source R, since 2007
OUR SOFTWARE: The only Big Data, Big Analytics software platform based on the data science language R
SOME KUDOS: Visionary, Gartner Magic Quadrant for Advanced Analytics Platforms, 2014

Finance Insurance Manufacturing & High Tech Healthcare & Pharma Digital Economy Analytics Service Providers

Revolution R Enterprise is...
the only big data big analytics platform based on open source R, the de facto statistical computing language for modern analytics
> High Performance, Scalable Analytics
> Portable Across Enterprise Platforms
> Easier to Build & Deploy Analytics

R: Open Source that Drives Innovation, but It Has Some Limitations for Enterprises
Area | Open source R | Revolution R Enterprise | Benefit
Big Data | In-memory bound | Hybrid memory & disk scalability | Operates on bigger volumes & factors
Speed of Analysis | Single threaded | Parallel threading | Shrinks analysis time
Enterprise Readiness | Community support | Commercial support | Delivers full service production support
Analytic Breadth & Depth | 5000+ innovative analytic packages | Leverage open source packages plus Big Data ready packages | Supercharges R
Commercial Viability | Risk of deployment of open source | Commercial license | Eliminate risk with open source

Introducing Revolution R Enterprise (RRE)
The Big Data Big Analytics Platform
Components: DevelopR, DeployR, ConnectR, ScaleR, DistributedR
Big Data Big Analytics Ready
> Enterprise readiness
> High performance analytics
> Multi-platform architecture
> Data source integration
> Development tools
> Deployment tools

The Platform Step by Step: R Capabilities
R+CRAN
> Open source R interpreter (updated: R 3.1.1)
> Freely-available R algorithms
> Algorithms callable by RevoR
> Embeddable in R scripts
> 100% compatible with existing R scripts, functions and packages
RevoR
> Based on open source R
> Adds high-performance math
> Available on: Teradata Database; Hortonworks Hadoop; Cloudera Hadoop; MapR Hadoop; IBM Platform LSF Linux; Microsoft HPC Clusters; Windows & Linux Servers; Windows & Linux Workstations

The Platform Step by Step: Tools & Deployment
DevelopR
> Integrated development environment for R
> Visual step-into debugger
> Based on Visual Studio Isolated Shell
> Available on: Windows
DeployR
> Web services software development kit for integrating analytics via Java, JavaScript or .NET APIs
> Integrates R into application infrastructures
> Capabilities: invokes R scripts from web services calls; RESTful interface for easy integration; works with web & mobile apps, leading BI & visualization tools and business rules engines

DevelopR - Integrated Development Environment
> Script with type ahead and code snippets
> Solutions window for organizing code and data
> Sophisticated debugging with breakpoints, variable values etc.
> Packages installed and loaded
> Objects loaded in the R Environment
> Object details

DeployR - Integration with 3rd Party Software
[Figure: an R / statistical modeling expert and a deployment expert connected through DeployR to data analysis, business intelligence, mobile, web apps and cloud / SaaS]
> Seamless: bring the power of R to any web enabled application
> Simple: leverage common APIs including JS, Java, .NET
> Scalable: robustly scale user and compute workloads
> Secure: manage enterprise security with LDAP & SSO

The Platform Step by Step: Parallelization & Data Sourcing
ScaleR: ready-to-use high-performance big data big analytics
> Fully-parallelized analytics
> Data prep & data distillation
> Descriptive statistics & statistical tests
> Correlation & covariance matrices
> Predictive models: linear, logistic, GLM
> Machine learning
> Monte Carlo simulation
> Tools for distributing customized algorithms across nodes
ConnectR: high-speed & direct connectors
> Available for: high-performance XDF; SAS, SPSS, delimited & fixed format text data files; Hadoop HDFS (text & XDF); Teradata Database; ODBC
DistributedR: distributed computing framework
> Delivers portability across platforms
> Available on: Teradata Database; Hortonworks / Cloudera / MapR; Windows Servers / HPC Clusters; IBM Platform LSF Linux Clusters; Red Hat Linux Servers; SuSE Linux Servers

Revolution R Enterprise ScaleR: High Performance Big Data Analytics
Data Prep, Distillation & Descriptive Analytics
R Data Step
> Data import: Delimited, Fixed, SAS, SPSS, ODBC
> Variable creation & transformation using any R functions and packages
> Recode variables; Factor variables; Missing value handling
> Sort; Merge; Split
> Aggregate by category (means, sums)
Descriptive Statistics
> Min / Max; Mean; Median (approx.); Quantiles (approx.)
> Standard Deviation; Variance
> Correlation; Covariance; Sum of Squares (cross product matrix)
> Pairwise Cross tabs; Risk Ratio & Odds Ratio
> Cross-Tabulation of Data; Marginal Summaries of Cross Tabulations
Statistical Tests
> Chi Square Test
> Kendall Rank Correlation
> Fisher's Exact Test
> Student's t-test
Sampling
> Subsample (observations & variables)
> Random Sampling

Revolution R Enterprise ScaleR (continued)
Statistical Modeling / Machine Learning
Predictive Models
> Covariance / Correlation / Sum of Squares / Cross-product Matrix
> Multiple Linear Regression
> Logistic Regression
> Generalized Linear Models (GLM)
  > All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie
  > Standard link functions including: cauchit, identity, log, logit, probit
  > User defined distributions & link functions
> Classification & Regression Trees and Forests
> Gradient Boosted Trees
> Residuals for all models
Data Visualization
> Histogram; ROC Curves (actual data and predicted values); Lorenz Curve; Line and Scatter Plots; Tree Visualization
Variable Selection
> Stepwise Regression: Linear, Logistic, GLM
Simulation and HPC
> Monte Carlo
> Run open source R functions and packages across cores and nodes
Cluster Analysis
> K-Means
Classification & Regression
> Decision Trees; Decision Forests; Gradient Boosted Trees
Deployment
> Prediction (scoring)
> PMML Export

Write Once Deploy Anywhere.
DeployR, ConnectR, ScaleR, DistributedR
> EDW: Teradata Database
> Hadoop: Hortonworks, Cloudera, MapR
> Clustered Systems: IBM Platform LSF, Microsoft HPC
> Workstations & Servers: Windows, Linux
> In the Cloud: Amazon AWS
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE

Case Study - Global Internet Marketplace
Challenge:
> Model and score 250M customers
> Server-based workflow was taking 3 days
> Move calculation in-database to drastically reduce runtime, process twice as many customers, and increase lift

Existing Open Source R model
Binomial Logistic Regression
> 50+ independent variables, including categorical with indicator variables
> Train from small sample (many thousands): not a problem in and of itself
> Scoring across entire corpus (many hundred millions): slightly more challenging

Revolution R Enterprise model
Same Binomial Logistic Regression
> 50+ independent variables, including categorical with indicator variables
> Train from large sample (many millions): more accurately captures user patterns and increases lift
> Scoring across entire corpus (many hundred millions): completes in minutes

RRE Used to Optimize the Current Process
By moving the compute to the data
Before / After: reduced a 3-day process to 10 minutes

Benchmarking the Optimized Process
[Figure: scaling study of time vs. number of rows, comparing server-based (not in-DB) with in-DB]
NOTE:
> Teradata environment: 4 node, 1700 Appliance
> RRE environment: version 7.2, R 3.0.2

Optimization process
Recode Open Source R to Revolution R Enterprise
Before:
  trainit <- glm(as.formula(specs[[i]]), data = training.data, family = 'binomial', maxit = iters)
  fits <- predict(trainit, newdata = test.data, type = 'response')
After:
  trainit <- rxGlm(as.formula(specs[[i]]), data = training.data, family = 'binomial', maxIterations = iters)
  fits <- rxPredict(trainit, data = test.data, type = 'response')

Revolution R Enterprise
How RRE ScaleR Actually Works

Revolution R Enterprise: RevoR - Performance Enhanced R
Customers report 3-50x performance improvements compared to Open Source R without changing any code
Computation (4-core laptop) | Open Source R | Revolution R | Speedup
Linear Algebra [1]
Matrix Multiply | 176 sec | 9.3 sec | 18x
Cholesky Factorization | 25.5 sec | 1.3 sec | 19x
Linear Discriminant Analysis | 189 sec | 74 sec | 3x
General R Benchmarks [2]
R Benchmarks (Matrix Functions) | 22 sec | 3.5 sec | 5x
R Benchmarks (Program Control) | 5.6 sec | 5.4 sec | Not appreciable
[1] http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
[2] http://r.research.att.com/benchmarks/

Scalable and Parallelized Across Cores and Nodes

Scalability and Portability of PEMAs (Parallel External Memory Algorithms)
> Anatomy of a PEMA: 1) Initialize, 2) Process Chunk, 3) Aggregate, 4) Finalize
> Process a chunk of data at a time, giving linear scalability
> Process an unlimited number of rows of data in a fixed amount of RAM
> Independent of the compute context (number of cores, computers, distributed computing platform), giving portability across these dimensions
> Independent of where the data is coming from, giving portability with respect to data sources
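The four PEMA stages can be sketched in base R for a simple statistic. The example below computes a mean and variance one chunk at a time. All function names are hypothetical and this is not the ScaleR implementation; it only shows why the per-chunk state stays small no matter how many rows are processed.

```r
# 1) Initialize: an empty set of running totals.
initialize <- function() list(n = 0, sum = 0, sumsq = 0)

# 2) Process chunk: summarize one chunk without seeing any other chunk.
process_chunk <- function(chunk) {
  list(n = length(chunk), sum = sum(chunk), sumsq = sum(chunk^2))
}

# 3) Aggregate: combine two intermediate results into one.
aggregate_results <- function(a, b) {
  list(n = a$n + b$n, sum = a$sum + b$sum, sumsq = a$sumsq + b$sumsq)
}

# 4) Finalize: turn the running totals into the requested statistics.
finalize <- function(state) {
  m <- state$sum / state$n
  v <- (state$sumsq - state$n * m^2) / (state$n - 1)
  list(mean = m, var = v)
}

# Simulated data, split into chunks of 1000 rows to stand in for reads from disk.
set.seed(42)
x <- rnorm(10000, mean = 5, sd = 2)
chunks <- split(x, ceiling(seq_along(x) / 1000))

state <- initialize()
for (chunk in chunks) {
  state <- aggregate_results(state, process_chunk(chunk))
}
result <- finalize(state)
print(result)
```

Because the aggregate step is associative, the chunks could equally be processed on different cores or nodes and combined in any order. The sum-of-squares formula is used here for brevity; it can lose precision on badly scaled data, so a numerically stable update would be preferable in practice.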

ScaleR Performance
> Efficient computational algorithms
> Efficient memory management: minimize data copying and data conversion
> Heavy use of C++ templates; optimal code
> Efficient data file format; fast access by row and column
> Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
> Handle categorical variables efficiently

Speed and Scalability Comparison
> Unique PEMAs: parallel, external-memory algorithms
> High-performance, scalable replacements for R/SAS analytic functions
> Parallel/distributed processing eliminates CPU bottleneck
> Data streaming eliminates memory size limitations
> Works with in-memory and disk-based architectures

In-Database Billion Row Logistic Regression
> 114 seconds on Teradata 2650 (6 nodes, 72 cores), including time to read data
> Scales linearly with number of rows
> Scales linearly with number of nodes: 3x faster than on a 2-node Teradata system

Allstate compares SAS, Hadoop, and R for Big-Data Insurance Models
Generalized linear model, 150 million observations, 70 degrees of freedom
Approach | Platform | Time to fit
SAS | 16-core Sun Server | 5 hours
rmr/mapreduce | 10-node 80-core Hadoop Cluster | > 10 hours
R | 250 GB Server | Impossible (> 3 days)
Revolution R Enterprise | In-Teradata on 6-node 2650 | 3.3 minutes
Source: http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html

Poll Question #3 (please select one answer)
At what stage are you in your in-database analytics deployment project?
> Still researching tools and methods
> Evaluating/Selecting data storage/management platform
> Evaluating/Selecting analytics programming tools
> Launched the project/working on it now
> We're done and looking for another one!

RRE End-User's Perspective
Sample code for R Logistic Regression
Revolution R Enterprise has a new data source, RxTeradata (ODBC and TPT)
  # Change the data source if necessary
  tdconn <- "DRIVER= ; IP= ; DATABASE= ; UID= ; PWD= "
  teradatads <- RxTeradata(table = " ", connectionString = tdconn, ...)
Revolution R Enterprise has a new compute context, RxInTeradata
  # Change the compute context
  tdcompute <- RxInTeradata(connectionString = ..., shareDir = ..., remoteShareDir = ..., revoPath = ..., wait = ..., consoleOutput = ...)
  # Specify model formula and parameters
  rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data = teradatads)

Table Operators (Teradata 14.10+)
> Table User Defined Functions (UDFs) allow users to place a function in the FROM clause of a SELECT statement
> Table Operators extend the existing table UDF capability:
  > Table Operators are object oriented: inputs and outputs can be arbitrary, not fixed as Table UDFs require
  > Table Operators have a simpler row iterator interface: the interface simply produces output rows, providing a more natural application development interface than Table UDFs
  > Table Operators operate on a stream of rows: rows are buffered for high performance, eliminating row-at-a-time processing
  > Table Operators support PARTITION BY and ORDER BY, allowing the development of MapReduce-style operators in-database

RRE Architecture in Teradata 14.10+
  tdconnect <- RxTeradata(<data, connection string, ...>)
  tdcompute <- RxInTeradata(<data, server arguments, ...>)
[Figure: request/response flow into Teradata 14.10+. A Master Process in the PE layer communicates over the message passing layer with Worker Processes in the AMP layer, each attached to a data partition.]
* All communication is done by binary BLOBs
** PUT-based installer
1. RRE commands are sent to a Master Process, an External Stored Procedure (XSP) in the Parsing Engine that provides parallel coordination
2. RRE analytics are split into Worker Process tasks that run in a Table Operator (TO) on every AMP
   a. HPA analytics iterate over the data, and intermediate results are analyzed and managed by the XSP
   b. HPC analytics do not iterate, and final results from each AMP are returned to the XSP
3. Final combined results are assembled by the XSP and returned to the user
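The iterative (HPA) pattern in steps 1 to 3 can be mimicked in a single base R process. In the sketch below a list of data frames stands in for the AMP data partitions, a `worker` function stands in for the Table Operator task, and a loop stands in for the master process. The names, the simulated data and the choice of iteratively reweighted least squares for logistic regression are all assumptions made for illustration; the slides do not say which algorithm RRE uses internally.

```r
# Simulated data, for illustration only.
set.seed(7)
n <- 4000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
dat <- data.frame(x = x, y = y)

# Stand-in for rows distributed across four AMPs.
partitions <- split(dat, rep(1:4, length.out = n))

# Worker task: one pass over a single partition, returning small summaries.
worker <- function(part, beta) {
  X   <- cbind(1, part$x)
  eta <- drop(X %*% beta)
  mu  <- plogis(eta)
  w   <- mu * (1 - mu)
  z   <- eta + (part$y - mu) / w        # working response
  list(xtwx = crossprod(X, X * w),      # X' W X for this partition
       xtwz = crossprod(X, w * z))      # X' W z for this partition
}

# Master: send out the current coefficients, combine worker results, update, repeat.
beta <- c(0, 0)
for (iter in 1:25) {
  parts <- lapply(partitions, worker, beta = beta)
  xtwx  <- Reduce(`+`, lapply(parts, `[[`, "xtwx"))
  xtwz  <- Reduce(`+`, lapply(parts, `[[`, "xtwz"))
  new_beta  <- drop(solve(xtwx, xtwz))
  converged <- max(abs(new_beta - beta)) < 1e-8
  beta <- new_beta
  if (converged) break
}

# Reference fit on the pooled data with open source R.
reference <- coef(glm(y ~ x, family = binomial, data = dat))
print(rbind(partitioned = beta, pooled = unname(reference)))
```

Only the 2x2 and 2x1 summaries travel between worker and master on each iteration, never the rows themselves, which is the point of moving the compute to the data. A non-iterative (HPC) job would be the same loop run once, with each worker returning its final result.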

Summary
> High-performance, scalable, portable, fully-featured algorithms
> Integration with R ecosystem
> Compatibility with Big Data ecosystem

Questions?
Resources for you (available on RevolutionAnalytics.com):
> White Paper: Teradata and Revolution Analytics: For the Big Data Era, An Analytics Revolution
> Webinar: Big Data Analytics with Teradata and Revolution Analytics
WE LOVE FEEDBACK
> Rate this Session: PARTNERS Mobile App, InfoHub Kiosks, teradata-partners.com

Thank You!
www.revolutionanalytics.com
www.teradata.com