Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Transcription:

Up Your R Game James Taylor, Decision Management Solutions Bill Franks, Teradata

Today's Speakers
James Taylor, CEO, Decision Management Solutions
Bill Franks, Chief Analytics Officer, Teradata
7/28/14

Polling question 1 (at the beginning of the session)
> What best describes your company's use of R today?
No R plans in the near future
Exploring or experimenting with R
Plans to use R for analytics
Actively using R for model development only
Actively using R for model development and deployment

Introducing R

Introducing Open Source R
> The R Project for Statistical Computing
> Interpreted language for statistical computing
> Extensible
> Free
> Open source
> Since 1997
> 5,000 packages
2014 Decision Management Solutions

R has become significant in recent years
[Charts: data miners using R; packages for R. Sources: Rexer Analytics Survey, 2013; http://r4stats.com/articles/popularity/]

Enterprise Analytic Requirements

Enterprise analytic challenges (CRISP-DM)
> Clear sense of the analytic objective
> Powerful data exploration
> Scalable preparation, modeling, evaluation
> Seamless, scalable deployment

Clear sense of the analytic objective
> Link business action to analytics

Powerful data exploration
> Volume
> Velocity
> Variety
> Veracity

Scalable preparation, modeling, evaluation

Seamless, scalable deployment
> Knowing is not enough
[Diagram: operational systems, decision, analytic systems]

Seamless, scalable deployment
> Rapid deployment
> Batch and real-time deployment
> Scalable deployment

Challenges of Open Source R

Enterprise scale analytics and R

Complex data integration is a challenge
> New sources drive data volume in Big Data
> Data stores: columnar, Hadoop, NoSQL, relational, warehouse
> Successful analytic teams use an increasing variety of data

Scaling data understanding is a challenge
> More data, more time and effort
> Single-threaded
> Parallel execution?
> In-memory
> Forced sampling
> Limits iteration

Time to analyze is a challenge
> R users like their tools
> Lots of algorithms
> Easy to modify and fine tune
But
> Takes longer to do data analysis
> Tool limit challenges more likely
> Scaling up a challenge

Deployment is a challenge
"Knowing is not enough; we must apply. Willing is not enough; we must do." Johann Wolfgang von Goethe
1/3 of projects have serious deployment challenges
> More likely to not use results
> Unhappy with ease of deployment

Cottage industries don't scale

Industrialization is a challenge
From
> Local scripts or code
> Hand crafting
> A focus on model creation
> Individual creators
To
> Managed workflows
> Automated scale
> A focus on model management
> Broad participation and collaboration

R for Enterprise Analytics

Polling question 2 (at the transition from James to Bill)
> What are your biggest challenges with R? (select all that apply)
Complex data integration
Scaling data understanding
Time to analyze
Deployment
Industrialization
Others

LIFTING THE LIMITATIONS OF OPEN SOURCE R Bill Franks Chief Analytics Officer Teradata

The Case For A Discovery Platform
[Diagram: The Problem and The Solution; labels: Data Warehouse/Business Intelligence, Advanced Analytics, SQL Framework Access Layer, Integrated Discovery Platform (IDP)]
Copyright, Teradata 2014. All rights reserved.

Tackle R's Challenges With Aster R!
> Complex Data Integration
> Scaling Data Understanding
> Time To Analyze
> Deployment
> Industrialization

Teradata Aster R
Making open source R massively scalable & powerful

Open source R without limitations
> Run open source R across Aster's MPP architecture for high speed parallel processing
> Remove memory and data limitations with in-database processing for massive scalability
> Leverage all data vs. samples for deeper insights

Unmatched ease-of-use and productivity for R users
> Use familiar R client & R language with Aster through Aster R Library
> Expose Aster Discovery Portfolio functions as R functions
> Leverage open source R with no new tools or languages to learn

Powerful analytics combining Aster and R
> Combine 100+ Aster Discovery Portfolio functions and 5,000+ R packages for powerful analytics
> Integrate R into Aster's SNAP Framework for rapid discovery
> Empower users with a single comprehensive analytic platform

Teradata Aster R Components

Aster R Parallel library
> R functions running in full system level parallel mode
> R interface for Aster Discovery Portfolio (MapReduce) functions

Aster R Parallel Constructor
> Allows R users to parallelize any R code using split-apply-combine
> R engine runs in node independent fashion across all Aster nodes

R Engine in the SNAP Framework
> Access to any data store, including Hadoop and Teradata
> R script can invoke SQL, MapReduce, Graph and R engines
> Optimal processing with SNAP's integrated optimizer and executor

[Diagram: R Client; Teradata Aster R Library (Aster R Library, Aster R Parallel Constructor); Aster Discovery Platform]

R Implementation Options: where processing takes place
[Diagram: R client interface sending requests and receiving results; R server; data extract; database]

Client/Server
> Data access: data extracted from databases to the client
> Processing: in memory on the client
> Results: single result based on a limited data set

Distributed Computing
> Data access: data extracted from databases onto servers
> Processing: runs independently in memory in each server; coding required for system level parallel processing
> Results: results are returned from each server

In-database: Node Level
> Data access: rows and partitions accessible by the node
> Processing: runs independently within each node; coding required for system level parallel processing
> Results: results are returned from each node

In-database: System Level
> Data access: access to all the data within the system
> Processing: runs in a coordinated parallel manner, often iterating
> Results: single result is produced from all data

Node Versus System Parallelism
When R runs independently on each node/server, the onus is on the programmer to code correctly to handle node parallelism.

Node Level
1. Find mean per node
2. Return 1 answer per node, or
3. Calculate mean of means = 3.5 (incorrect)
Node1 mean 9; Node2 mean 1.33; Node3 mean 1.66; Node4 mean 2

System Level
1. Calculate count and total for each node
2. Aggregate counts (11) and totals (33)
3. Calculate mean = total/count = 3 (correct)
Node1 count 2, total 18; Node2 count 3, total 4; Node3 count 3, total 5; Node4 count 3, total 6

Data values (shown under both approaches): 9 9 1 2 1 2 1 2 1 2 3
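The arithmetic on this slide can be checked in plain open source R. The sketch below is only an illustration: the slide gives each node's count and total but not the exact values held by each node, so the per-node vectors here are an assumed assignment that matches those counts and totals.

```r
# Assumed per-node partitions, chosen to match the slide's counts and totals.
nodes <- list(
  node1 = c(9, 9),     # count 2, total 18
  node2 = c(1, 1, 2),  # count 3, total 4
  node3 = c(1, 2, 2),  # count 3, total 5
  node4 = c(1, 2, 3)   # count 3, total 6
)

# Node-level approach: each node reports its own mean, then the means are averaged.
node_means <- sapply(nodes, mean)
mean_of_means <- mean(node_means)

# System-level approach: each node reports a count and a total,
# which are aggregated before dividing.
counts <- sapply(nodes, length)
totals <- sapply(nodes, sum)
system_mean <- sum(totals) / sum(counts)

print(mean_of_means)  # 3.5, biased because the nodes hold different numbers of rows
print(system_mean)    # 3, the same as mean(unlist(nodes))
```

The mean of means is only correct when every node holds the same number of rows, which is why the system-level version carries counts alongside totals.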

Aster R Library: Prebuilt Parallel Functions

Aster R functions run in system parallel mode across all data
> Prebuilt parallel functions to hide the complexity of parallel programming
> Allows users to process all data without the need for sampling

Familiar R language syntax

Leverages virtual data frames
> Users operate on virtual data frames that point to tables or views in the database

Extend R capabilities with Aster Discovery Portfolio functions
> Allows R to invoke Aster MapReduce functions running in system parallel mode

[Diagram: R Client; Teradata Aster R Library (Aster R Library, R Parallel Constructor); Aster Discovery Portfolio; Aster Discovery Platform]

Teradata Aster R Prebuilt Parallel Functions

Data Access and Movement
> Connect, query, Teradata & Hadoop access via QueryGrid, bulk load & extract, import/export data into tables, read & write csv, and more

Data Management
> Create data frames, refresh, and more

Data Exploration
> Data characteristics, statistics, ranges & distributions, rank, and more

Data Transformation and Manipulation
> Pivot, log parser, unpack/pack, split, matrices, and more

R Operators
> [, [[, $, -, !, +, /, *, %%, ==, !=, and more

Path & Time Series Analysis
> npath, sessionization, and attribution

Statistical Analysis
> Regressions, Naïve Bayes, support vector machine, correlations, averages, histogram, principal component analysis, and more

Text Analysis
> Sentiment analysis, text processing, ngram, text classifier, and more

Machine Learning
> Kmeans, basket, collaborative filter, random forest, and more

Aster R Parallel Constructor
Allows users to run open source R scripts in Aster in parallel

Users can run any open source R code in parallel with ta.apply()
> Accepts a data frame as input
> Runs an R script across a single or multiple nodes concurrently
> Runs R script using the split-apply-combine strategy (similar to Map-Reduce constructs)

How It Works
> ta.apply(): apply the first R script to each node
> Apply the second R combiner script to the results of the first

[Diagram: R Client; Teradata Aster R Library (Aster R Parallel Constructor); Aster Discovery Platform]

Aster R Parallel Constructor Example: Calculating Average with all the data
> Data is distributed across the nodes
> Ex. Calculate average sales to date
> Apply part_sumcnt to get partial results on each node
> Combine results and run comb_avg

    # Create a view in Aster for parallel data ingest
    banking_df = ta.data.frame("ich_banking")
    part_sumcnt <- function(x) list(sum = sum(x), count = length(x))
    comb_avg <- function(x) sum(x$sum) / sum(x$count)
    # Runs the function on each vworker node, reduces the results from each partition
    ta.aggregateapply(banking_df, FUN = part_sumcnt, FUN.result = "list", COMBINER.FUN = comb_avg)

User defined R functions executed in parallel with partitioned data in-place
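For readers without an Aster system, the partial-then-combine flow can be tried out in plain open source R. This is a minimal sketch, not the Aster R API: local_aggregate_apply, the sales values, and the partition ids are all hypothetical stand-ins, and the function only mimics the idea of running a partial function per partition and then a combiner. part_sumcnt and comb_avg are taken from the slide.

```r
# Partial function from the slide: per-partition sum and count.
part_sumcnt <- function(x) list(sum = sum(x), count = length(x))

# Combiner from the slide: x$sum and x$count hold one entry per partition.
comb_avg <- function(x) sum(x$sum) / sum(x$count)

# Hypothetical local stand-in for the partial-then-combine flow (not the Aster R API).
local_aggregate_apply <- function(values, partition_id, FUN, COMBINER.FUN) {
  parts <- split(values, partition_id)                        # split by partition
  partials <- lapply(parts, FUN)                              # apply FUN to each partition
  stacked <- do.call(rbind, lapply(partials, as.data.frame))  # one row per partition
  COMBINER.FUN(stacked)                                       # combine the partial results
}

# Hypothetical sales amounts spread over three partitions of unequal size.
sales <- c(120, 80, 45, 300, 15, 60, 90)
partition_id <- c(1, 1, 2, 2, 2, 3, 3)

avg_sales <- local_aggregate_apply(sales, partition_id,
                                   FUN = part_sumcnt,
                                   COMBINER.FUN = comb_avg)
print(avg_sales)
```

Because the partial function returns both a sum and a count, the combiner reproduces the overall mean no matter how unevenly the rows are partitioned.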

Deploying Analytics with Aster R

R interface for Aster Discovery Platform
> R users now have access to the powerful Aster Discovery Platform

Multi-faceted analytics
> A single program can call SQL, MapReduce, Graph, or R engines

Access to any data across the Teradata UDA
> Data from Teradata and Hadoop are accessible through Teradata QueryGrid

[Diagram: Business Analyst (SQL Client); Data Scientist (R Client); R User]

A DAY IN THE LIFE OF AN R ANALYST Up Your R Game!

Complex Data Integration Across Multiple Data Sources

Traditional R
> Create data frames for big data? Error: cannot allocate vector of size 10 GB
> Forced to sample data
[Figure: a pile of separate sample files: customer data samples, customer transaction samples, customer weblog samples]
"How do I manage all these files?"

Aster R
> Easy access to Hadoop and DW

Create views
    CREATE VIEW hadoop_view AS (SELECT * FROM load_from_hcatalog(TABLENAME('hcat_table_name')))
    CREATE VIEW Customer_view AS (SELECT * FROM load_from_teradata(TABLENAME('customer_table')))

Create data frames
    weblog_df <- ta.data.frame("hadoop_view")
    customer_df <- ta.data.frame("customer_view")

"Easy! Self-service access to all my data!"

Scaling Data Understanding

Traditional R
> Summary function to understand the statistical characteristics of the data
> Have a large data set? Memory error

    cust1_df <- data.frame(customer_sample_1)   # create data frame for first customer sample file
    summary(cust1_df)                           # summary function
    rm(cust1_df)                                # free up memory for the next command

> Repeat to validate sample
> Repeat to understand transaction samples
> Repeat to understand customer web log samples
"It's so slow and cumbersome. How can I see a summary of ALL my data?"

Teradata Aster R

    df <- ta.data.frame("customer")
    ta.summary(df)

Runs the summary function across all your data!
"WOW! All my data in one command! It's fast too!"

Time To Analyze

Traditional R
> Can't build a cluster on a small sample without losing the most interesting segments
> Partitioning data doesn't give me a global view of purchases

    ca_df <- subset(customer_tbl, state == "CA")   # build a cluster for customers in California
    kmeans(ca_df, centers = k)

> Writing parallel kmeans will take time
[Chart: segments based on California purchases; axes: low cost to high cost, low value to high value]
"Oops! I'm missing all my customers who purchase in multiple states"

Teradata Aster R: Complete view of your data

    global_df <- ta.data.frame("customer_tbl")
    ta.kmeans(global_df, k)

> Clusters based on global purchases
> Prebuilt parallel kmeans algorithm!
"Now I really understand my customer segments' characteristics!"
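The point about losing the global view can be illustrated with base R's kmeans() on simulated data. Everything in this sketch is an assumption made for illustration: the customers data frame, its spend and visits columns, and the two artificial segments are invented, and it uses open source kmeans() rather than the Aster R function.

```r
set.seed(42)  # make the simulated data and the cluster starts repeatable

# Simulated customers: a low-spend segment in CA and a high-spend segment elsewhere.
customers <- data.frame(
  state  = rep(c("CA", "NY"), each = 50),
  spend  = c(rnorm(50, mean = 10, sd = 1), rnorm(50, mean = 100, sd = 1)),
  visits = c(rnorm(50, mean = 2, sd = 0.5), rnorm(50, mean = 20, sd = 0.5))
)
features <- c("spend", "visits")

# Clustering only the California subset versus clustering all customers.
ca_fit     <- kmeans(customers[customers$state == "CA", features], centers = 2, nstart = 10)
global_fit <- kmeans(customers[, features], centers = 2, nstart = 10)

# Distance between the two cluster centers along the spend dimension.
ca_spend_gap     <- diff(range(ca_fit$centers[, "spend"]))
global_spend_gap <- diff(range(global_fit$centers[, "spend"]))

print(ca_spend_gap)      # small: the subset only ever sees one real segment
print(global_spend_gap)  # large: the global fit separates the two segments
```

With the subset, kmeans() still returns two clusters, but they split a single segment; the second segment only becomes visible when all the data is clustered together.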

Deployment

Traditional R
> How do you deploy R models against large volumes of data?
> Option 1: Get a super computer with LOTS of memory ("No budget!")
> Option 2: Extract, score, write back into the database ("Labor intensive. No time or resources!")
> Option 3: Give to IT to recode my model ("Error prone and takes time!")
> Option 4: Use PMML & in-database scoring, only if IT lets me ("Need to add a PMML test in my process")

Teradata Aster R
> Run any open source R package or script concurrently across all nodes, or create your own parallel script with the R Parallel Constructors

    target_df <- data.frame(customer_tbl)
    ta.apply(target_df, FUN = myRcode)

"That was easy and fast!"
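Scoring is a natural fit for running the same R function on every partition, because each row is scored independently. The sketch below emulates that in plain open source R with split() and lapply(); it is not the Aster R API. The model, the score_partition function (playing the role of myRcode on the slide), and the data are all hypothetical.

```r
set.seed(1)  # repeatable simulated training data

# Hypothetical model fitted on a small training set.
train <- data.frame(x = runif(200))
train$y <- 2 + 3 * train$x + rnorm(200, sd = 0.1)
model <- lm(y ~ x, data = train)

# Hypothetical scoring function, playing the role of myRcode on the slide.
score_partition <- function(part) {
  part$score <- as.numeric(predict(model, newdata = part))
  part
}

# New rows to score, assigned to three partitions as a stand-in for nodes.
new_data <- data.frame(id = 1:12, x = seq(0, 1, length.out = 12))
partition_id <- rep(1:3, each = 4)

# Score each partition independently, then stitch the results back together.
scored_parts <- lapply(split(new_data, partition_id), score_partition)
scored <- do.call(rbind, scored_parts)
scored <- scored[order(scored$id), ]

print(head(scored))
```

Since no row depends on any other, scoring partition by partition gives the same result as scoring everything at once, which is what makes this step straightforward to parallelize.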

Industrialization

Traditional R
> Extract samples: inefficient and time consuming
> Slow processing: single threaded analytics are slow
> Data limitations: bound by memory
> Complex programming: parallel programming is hard
> Deployment challenges: often must recode models into SAS or SQL to deploy

Aster R
> Self-serve: immediate access to data via Teradata QueryGrid
> Fast: leverage Aster MPP architecture
> Scalable: designed to run against all your data
> Easy: no parallel programming required with prebuilt functions
> Flexible: run any open source function in parallel

Teradata Aster R
High performance analytic platform for R, with prebuilt parallel analytic functions to process all your data and the flexibility to run any open source R package at scale.
To learn more, visit: http://www.teradata.com/teradata-aster-r

Polling question 3 (at the end of the session)
> What would you like to learn more about? (select all that apply)
How to easily access and integrate data from multiple sources using R
How to run R analytics in parallel
How to create models leveraging R, SQL and SQL-MapReduce
Understanding model deployment considerations for business analytics
Integrating R into production applications

Thank You. Questions?
James Taylor, Chief Executive Officer, Decision Management Solutions
james@decisionmanagementsolutions.com
www.decisionmanagementsolutions.com
Bill Franks, Chief Analytics Officer, Teradata
bill.franks@teradata.com
www.teradata.com
7/28/14