SEIZE THE DATA. 2015 SEIZE THE DATA. 2015

Similar documents
SEIZE THE DATA SEIZE THE DATA. 2015

Distributed R for Big Data

Distributed R for Big Data

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

The Data Mining Process

HP Vertica. Echtzeit-Analyse extremer Datenmengen und Einbindung von Hadoop. Helmut Schmitt Sales Manager DACH

Advanced In-Database Analytics

EMC Greenplum Driving the Future of Data Warehousing and Analytics. Tools and Technologies for Big Data

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features

Package HPdcluster. R topics documented: May 29, 2015

Oracle Advanced Analytics Oracle R Enterprise & Oracle Data Mining

Fast Analytics on Big Data with H20

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics

High Performance Predictive Analytics in R and Hadoop:

Hurwitz ValuePoint: Predixion

BIG DATA What it is and how to use?

Technical white paper. R you ready? Turning big data into big value with the HP Vertica Analytics Platform and R

Big Data Analytics: Today's Gold Rush November 20, 2013

Advanced Big Data Analytics with R and Hadoop

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Hadoop s Advantages for! Machine! Learning and. Predictive! Analytics. Webinar will begin shortly. Presented by Hortonworks & Zementis

Technical Paper. Performance of SAS In-Memory Statistics for Hadoop. A Benchmark Study. Allison Jennifer Ames Xiangxiang Meng Wayne Thompson

High-Performance Analytics

Actian Vortex Express 3.0

IBM's Fraud and Abuse, Analytics and Management Solution

How To Use Hp Vertica Ondemand

Building In-Database Predictive Scoring Model: Check Fraud Detection Case Study

Azure Machine Learning, SQL Data Mining and R

Hands-On Data Science with R Dealing with Big Data. Graham.Williams@togaware.com. 27th November 2014 DRAFT

Oracle Data Miner (Extension of SQL Developer 4.0)

Predictive Analytics Powered by SAP HANA. Cary Bourgeois Principal Solution Advisor Platform and Analytics

RevoScaleR Speed and Scalability

Data Mining + Business Intelligence. Integration, Design and Implementation

Journée Thématique Big Data 13/03/2015

HPE Vertica & Hadoop. Tapping Innovation to Turbocharge Your Big Data. #SeizeTheData

Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Greenplum Database. Getting Started with Big Data Analytics. Ofir Manor Pre Sales Technical Architect, EMC Greenplum

Microsoft Azure Machine learning Algorithms

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

April 2016 JPoint Moscow, Russia. How to Apply Big Data Analytics and Machine Learning to Real Time Processing. Kai Wähner.

Vectorwise 3.0 Fast Answers from Hadoop. Technical white paper

HP Converged Systems Update. Tom Joyce Senior Vice President, Converged Systems August, 2013

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Harnessing Big Data with KNIME

IBM SPSS Modeler 14.2 In-Database Mining Guide

Universal PMML Plug-in for EMC Greenplum Database

Predictive Analytics

Data Mining Algorithms Part 1. Dejan Sarka

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

Integrating VoltDB with Hadoop

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Make Better Decisions Through Predictive Intelligence

Customized Report- Big Data

Focus on the business, not the business of data warehousing!

Table of Contents. June 2010

Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Expedia

A Performance Analysis of Distributed Indexing using Terrier

Understanding the Benefits of IBM SPSS Statistics Server

An In-Depth Look at In-Memory Predictive Analytics for Developers

Oracle Database 12c Plug In. Switch On. Get SMART.

Understanding Your Customer Journey by Extending Adobe Analytics with Big Data

The R pmmltransformations Package

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

IBM SPSS Modeler 15 In-Database Mining Guide

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

IBM SPSS Modeler Professional

Starting Smart with Oracle Advanced Analytics

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

The Use of Open Source Is Growing. So Why Do Organizations Still Turn to SAS?

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Tax Fraud in Increasing

What s Cooking in KNIME

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

Data processing goes big

How To Write A Data Processing Pipeline In R

How To Understand Data Mining In R And Rattle

Grow Revenues and Reduce Risk with Powerful Analytics Software

Harnessing the power of advanced analytics with IBM Netezza

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

PivotalR: A Package for Machine Learning on Big Data

How To Test The Performance Of An Ass 9.4 And Sas 7.4 On A Test On A Powerpoint Powerpoint 9.2 (Powerpoint) On A Microsoft Powerpoint 8.4 (Powerprobe) (

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

HIGH PERFORMANCE ANALYTICS FOR TERADATA

HADOOP IN ENTERPRISE FUTURE-PROOF YOUR BIG DATA INVESTMENTS WITH CASCADING. Supreet Oberoi Nov. 4-6, 2014 Big Data Expo Santa Clara

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

How to make BIG DATA work for you. Faster results with Microsoft SQL Server PDW

Parallel Data Warehouse

Big Data on Microsoft Platform

R Tools Evaluation. A review by Global BI / Local & Regional Capabilities. Telefónica CCDO May 2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Streaming items through a cluster with Spark Streaming

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.

Spark: Cluster Computing with Working Sets

Product Deck. Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Transcription:

1 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Deep dive into Haven Predictive Analytics Powered by HP Distributed R and HP Vertica Arash Fard, Vishrut Gupta / August 12, 2015

Outline Overview Architecture of Distributed R Out-of-box algorithms vertica.dplyr Package An end-to-end example 3 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Overview 4 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

R is the language of data scientists R is a programming language for statistical computing Distributed R extends R language to provide distributed computing Popular Open source Flexible Extensible Not scalable Slow (single process) Inefficient Data Transfer Scalable Growing number of distributed algorithms Very efficient Data Transfer from Vertica 5 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

HP Vertica and HP Distributed R Delivering scale and high performance for predictive analytics 1 2 1 Ingest and prepare data by leveraging HP Vertica Build Models 2 Build and evaluate predictive models on large data sets using Distributed R 3 BI Integration Deploy Models (In-database scoring) Evaluate Models 3 Deploy models to Vertica and use in-database scoring to produce prediction results for BI and applications 6 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Architecture of Distributed R 7 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

The architecture of Distributed R Master Worker Worker Worker Executors Executors Executors R R R R R R R R R By default: Number of workers = number of physical nodes Number of executors = total number of logical cores 8 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Distributed data structures R - object matrix The same type elements: numerical, character, or logical data.frame Columns with different types list Vectors of different types and lengths Distributed R - dobject darray Each partition is a matrix, no support for character dframe Each partition is a data.frame dlist Each partition is a list 9 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Distributed partitions of data Partitions of a darray for example 1 2 darray 3 4 10 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Distributed computing on partitions foreach function: The main tool for running distributed jobs Express computations over partitions Execute across the cluster 1 foreach f (x) 2 3 4 11 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

An example library(distributedr) # Loads the package into R distributedr_start() # Starts up the cluster X <- darray(dim=c(4,4), blocks=c(2,2)) # Creates a 4 4 darray, each partition 2 2 foreach (index, 1:npartitions(X), function ( Xi= splits(x, index), i=index) { Xi <- matrix(i, nrow(xi), ncol(xi)) update(xi) } ) # Set each partition to a matrix # containing integers== artition_id 12 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Running the example 1 1 1 1 1 foreach f (x) 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 13 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

HP distributed R: deployment options Multiple ways to achieve performant predict analytics using HP big data platform Bring Scale and Performance to Standard R Distributed R Bring Scale and Performance to R and Operationalize predictive analytics Vertica Ingest and prepare large data volumes and various data types Hadoop (HDFS Storage) Standard R Distributed R Standard R JSON, CSV, AVRO etc. 14 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Out-of-box algorithms 15 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Available algorithms Similar signature, accuracy as R packages Scalable and high performance E.g., regression on billions of rows in a couple of minutes Algorithm Linear Regression (GLM) Logistic Regression (GLM) Random Forest K-Means Clustering Page Rank Use cases Risk Analysis, Trend Analysis, etc. Customer Response modeling, Healthcare analytics (Disease analysis) Customer churn, Market campaign analysis Customer segmentation, Fraud detection, Anomaly detection Identify influencers 16 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Using Distributed R is easy 1. distributedr_start() Distributed R Vertica DB 17 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Using Distributed R is easy 1. distributedr_start() 2. In database data preparation (SQL) Distributed R Vertica DB 18 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Using Distributed R is easy 1. distributedr_start() 2. In database data preparation (SQL) 3. db2dframe( ) P1 P2 P3 P4 Distributed R Vertica DB 19 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Using Distributed R is easy Ship code to data 1. distributedr_start() 2. In database data preparation (SQL) P1 P2 P3 P4 3. db2dframe( ) 4. hpdrandomforest( ) Distributed R Vertica DB 20 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Using Distributed R is easy 1. distributedr_start() 2. In database data preparation (SQL) 3. db2dframe( ) 4. hpdrandomforest( ) Model P1 P2 P3 P4 Distributed R 5. Deploy.model( ) Vertica DB 21 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Using Distributed R is easy 1. distributedr_start() 2. In database data preparation (SQL) 3. db2dframe( ) 4. hpdrandomforest( ) P1 P2 P3 P4 Distributed R 5. Deploy.model( ) 6. In database scoring (SQL) Model Vertica DB 22 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Performance benefits 23 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Time (minutes) Distributed R Vertica connector Data access at Big Data scale - fast data transfer 200 180 160 140 120 100 80 60 40 20 0 Loading time of Single R vs Distributed R from Vertica database 50 GB 100 GB 150 GB Size of data >30x 5 cluster nodes, each 24 logical cores Enabling predictive analytics in Vertica: Fast data transfer, distributed model creation and in-database prediction. Shreya Prasad, Arash Fard, Vishrut Gupta, Jorge Martinez, Jeff LeFevre, Vincent Xu, Meichun Hsu, Indrajit Roy. SIGMOD 2015, Melbourne, Australia. 24 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. R VFT

Speed up Distributed model creation in Distributed R Algorithm: K-Means Setup: SL 390 servers, 12*2 HT cores/server, 196GB RAM 40 35 30 25 20 15 10 5 0 Speedup in comparison to a single process 1 2 3 4 5 6 7 8 Number of machines (12 cores per machine) Dataset: 10Mx100 (~8GB), 1000 clusters 25 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Predict time (seconds) Vertica In-database SQL Prediction Functions Near linear scalability and predictions on billions of records Setup: 5*SL 390 servers, 12*2 HT cores/server, 196GB RAM 350 K-Means predict 300 250 200 150 100 50 0 10 100 500 1000 No. of rows (million) 26 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

vertica.dplyr Package An easy way to work with Vertica from R console 27 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

vertica.dplyr Package An easy way to work with Vertica from R console An open source package based on dplyr project by Hadley Wickham https://github.com/vertica/vertica.dplyr Working with database tables from R-console with simple R commands Lazy execution Extensions for loading distributed objects of Distributed R 28 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

An end-to-end example 29 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

A classification problem Use case for Marketing Data source: Publicly available marketing data from a banking institution. (https://archive.ics.uci.edu/ml/datasets/bank+marketing) Goal: To determine if a phone-based marketing campaign will be successful Steps (assuming that the data is already in Vertica database) Exploring and preparing the data in Vertica database Loading the Training set to Distributed R Training a logistic regression model in Distributed R Deploying the model into Vertica database In-database scoring of Test set 30 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

The data in Vertica Demo screenshot - vsql console Bank Marketing Data Set Source: https://archive.ics.uci.edu/ml/datasets/bank+marketing Number of records: 41188 Number of columns: 21 31 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Z-score normalization of numerical features Comparing vsql and vertica.dplyr vsql CREATE VIEW bank_normalized AS SELECT (age m_avg) / std_age AS age_z, FROM bank_original, (SELECT avg(age) AS m_age, stddev(age) AS std_age FROM bank_original) AS tbl_mean_std; vertica.dplyr vertica_odbc <- src_vertica( VerticaDSN ) orig <- tbl(vertica_odbc, bank_original ) m_sd <- summarise(orig,m_age=mean(age), std_age=sd(age), ) z <- collect (m_sd) normalized <- mutate(orig, age_z= (age -z[["m_age"]])/z[["std_age"]], ) norm_table <- select(normalized, age_z, ) 32 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Top-N view of categorical features Comparing vsql and vertica.dplyr other vsql CREATE VIEW bank_top_n AS SELECT DECODE( marital married, married, single, single, other ) AS marital, FROM bank_normalized; vertica.dplyr decoded <- mutate(norm_table, marital2=decode(marital, 'married', married, 'single', single, other ), ) decoded <- select(decoded, marital=marital2, ) db_save_view(decoded,"bank_top_n") 33 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Categorical columns to binary representation Comparing vsql and vertica.dplyr hpdglm(), like many other algorithms, only accepts numerical features We have provided an R function for this purpose: cat2num For example: poutcome column has 3 categories: nonexistent, failure, success. poutcome Code It will be converted to 2 columns: poutcome_1 and poutcome_2 where: poutcome_1=0, poutcome_2=0 means nonexistent poutcome_1=1, poutcome_2=0 means failure poutcome_1=0, poutcome_2=1 means success nonexistent 0 failure 1 success 2 34 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Splitting data to Training and Testing sets Comparing vsql and vertica.dplyr Sampling 20% Testing Set 80% Training Set vsql CREATE TABLE testing_set AS (SELECT * FROM bank_top_n_num WHERE RANDOM() < 0.2) ; CREATE TABLE training_set AS (SELECT * FROM bank_top_n_num EXCEPT SELECT * FROM testing_set); vertica.dplyr top_tbl <- tbl(vertica,"bank_top_n_num") testing_set <- filter(top_tbl, random() < 0.2) testing_set <- compute(testing_set, name="testing_set") training_set <- compute(setdiff(top_tbl, testing_set), "training_set") 35 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Model training and deployment Demo screenshot R console Loading data from Vertica database Training Logistic Regression Deploying the trained model into Vertica 36 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

In-database scoring and checking accuracy Demo screenshot vsql console 37 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Conclusion 38 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Summary Distributed R is a powerful platform for developing distributed algorithms There are growing number of out-of-box Machine Learning and Data Mining algorithms in Distributed R product Integrating Distributed R and Vertica provides a very powerful system to harness the power of Big Data Analytics 39 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Follow-up Distributed R is now a product. (www.distributedr.org) It is open source. (github.com/vertica/distributedr) We are looking for people to use it and contribute to it. 40 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.