Enabling R for Big Data with PL/R and PivotalR Real World Examples on Hadoop & MPP Databases



Similar documents
PivotalR: A Package for Machine Learning on Big Data

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

How Eastern Bank Uses Big Data to Better Serve & Protect its Customers!

Native Connectivity to Big Data Sources in MSTR 10

Advanced In-Database Analytics

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

Databricks. A Primer

Interactive data analytics drive insights

Mike Maxey. Senior Director Product Marketing Greenplum A Division of EMC. Copyright 2011 EMC Corporation. All rights reserved.

From Spark to Ignition:

Databricks. A Primer

Harnessing the Power of the Microsoft Cloud for Deep Data Analytics

The 4 Pillars of Technosoft s Big Data Practice

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop

Big Data Scenario mit Power BI vs. SAP HANA Gerhard Brückl

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Detecting Anomalous Behavior with the Business Data Lake. Reference Architecture and Enterprise Approaches.

EMC Greenplum Driving the Future of Data Warehousing and Analytics. Tools and Technologies for Big Data

Azure Data Lake Analytics

High-Performance Analytics

Information Architecture

Big Data Analytics. Copyright 2011 EMC Corporation. All rights reserved.

Demonstration of SAP Predictive Analysis 1.0, consumption from SAP BI clients and best practices

Harnessing the power of advanced analytics with IBM Netezza

Bringing Big Data to People

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Microsoft SQL Server Business Intelligence and Teradata Database

How to Drive the Mobile App Evolution

WHITE PAPER. Harnessing the Power of Advanced Analytics How an appliance approach simplifies the use of advanced analytics

SAP BusinessObjects BI Clients

Big Data and the Data Lake. February 2015

Massive Predictive Modeling using Oracle R Technologies Mark Hornick, Director, Oracle Advanced Analytics

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

SAP and Hortonworks Reference Architecture

Building Your Big Data Team

How To Handle Big Data With A Data Scientist

Cisco IT Hadoop Journey

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Step by Step: Big Data Technology. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 25 August 2015

Data Discovery, Analytics, and the Enterprise Data Hub

Building Data-Driven Internet of Things (IoT) Applications

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

How Companies are! Using Spark

Next-Generation Cloud Analytics with Amazon Redshift

ORACLE BUSINESS INTELLIGENCE FOUNDATION SUITE 11g WHAT S NEW

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

The 3 questions to ask yourself about BIG DATA

LEARNING SOLUTIONS website milner.com/learning phone

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Modern Data Architecture for Predictive Analytics

The power of IBM SPSS Statistics and R together

Oracle Big Data Discovery Unlock Potential in Big Data Reservoir

Business Intelligence Competency Partners Untangling the Confusion What SAP BW powered by HANA and HANA LIVE mean to your organization

IBM Spectrum Scale vs EMC Isilon for IBM Spectrum Protect Workloads

Big Data Analytics Best Practices

Building Dashboards for Real Business Results. Cindi Howson BIScorecard December 11, 2012

IBM: An Early Leader across the Big Data Security Analytics Continuum Date: June 2013 Author: Jon Oltsik, Senior Principal Analyst

PARC and SAP Co-innovation: High-performance Graph Analytics for Big Data Powered by SAP HANA

Microsoft Research Windows Azure for Research Training

YOUR APP. OUR CLOUD.

Microsoft Research Microsoft Azure for Research Training

SEIZE THE DATA SEIZE THE DATA. 2015

Data Analytics for a Secure Smart Grid

An In-Depth Look at In-Memory Predictive Analytics for Developers

Automating Big Data Benchmarking for Different Architectures with ALOJA

Distributed R for Big Data

Big Data Challenges and Success Factors. Deloitte Analytics Your data, inside out

Safe Harbor Statement

SAP Solution Brief SAP HANA. Transform Your Future with Better Business Insight Using Predictive Analytics

Bayesian networks - Time-series models - Apache Spark & Scala

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

BIG DATA-AS-A-SERVICE

Booz Allen Cloud Solutions. Our Capability-Based Approach

Delivering secure, real-time business insights for the Industrial world

Cisco Data Preparation

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Advanced Big Data Analytics with R and Hadoop

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

BIG DATA TRENDS AND TECHNOLOGIES

Greenplum Database. Getting Started with Big Data Analytics. Ofir Manor Pre Sales Technical Architect, EMC Greenplum

CA Technologies Big Data Infrastructure Management Unified Management and Visibility of Big Data

Ø Teaching Evaluations. q Open March 3 through 16. Ø Final Exam. q Thursday, March 19, 4-7PM. Ø 2 flavors: q Public Cloud, available to public

Predictive Analytics. Noam Zeigerson, CTO

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Solving Big Data Problems in Computer Vision with MATLAB Loren Shure

Integrating a Big Data Platform into Government:

A Vision for Operational Analytics as the Enabler for Business Focused Hybrid Cloud Operations

COMP9321 Web Application Engineering

The Machine: The future of technology Hyperscale Division EMEA

SURVEY REPORT DATA SCIENCE SOCIETY 2014

SAP BusinessObjects BI Clients. January 2016

Transcription:

Enabling R for Big Data with PL/R and PivotalR Real World Examples on Hadoop & MPP Databases Woo J. Jung Principal Data Scientist Pivotal Labs 1

All In On Open Source Still can t believe we did this. Truly exciting.

Data Science Toolkit KEY TOOLS KEY LANGUAGES SQL P L A T F O R M Pivotal Big Data Suite 3

How Pivotal Data Scientists Select Which Tool to Use Prototype in R or directly in MADlib/PivotalR Optimized for both algorithm efficiency and code overhead Is the algorithm of choice available in MADlib/PivotalR? Yes Build final set of models in MADlib/PivotalR No Do opportunities for explicit parallelization exist? Yes Build final set of models in PL/R No Connect to Pivotal via ODBC 4

How Pivotal Data Scientists Select Which Tool to Use Prototype in R or directly in MADlib/PivotalR Optimized for both algorithm efficiency and code overhead Is the algorithm of choice available in MADlib/PivotalR? Yes Build final set of models in MADlib/PivotalR No Do opportunities for explicit parallelization exist? Yes Build final set of models in PL/R No Connect to Pivotal via ODBC 5

MADlib: Toolkit for Advanced Big Data Analytics madlib.net Better Parallelism Algorithms designed to leverage MPP or Hadoop architecture Better Scalability Algorithms scale as your data set scales No data movement Better Predictive Accuracy Using all data, not a sample, may improve accuracy Open Source Available for customization and optimization by user 6

http://doc.madlib.net/latest/ 7

PivotalR: Bringing MADlib and HAWQ to a Familiar R Interface Challenge Want to harness the familiarity of R s interface and the performance & scalability benefits of in-db/in-hadoop analytics Simple solution: Translate R code into SQL SQL Code SELECT madlib.linregr_train( 'houses,! 'houses_linregr,! 'price,! 'ARRAY[1, tax, bath, size] );! PivotalR d <- db.data.frame( houses")! houses_linregr <- madlib.lm(price ~ tax!!!!+ bath!!!!+ size!!!!, data=d)! http://cran.r-project.org/web/packages/pivotalr/index.html https://pivotalsoftware.github.io/gp-r/ https://github.com/pivotalsoftware/pivotalr 8

PivotalR Design Overview Call MADlib s in-db machine learning functions directly from R Syntax is analogous to native R function RPostgreSQL PivotalR 2. SQL to execute 1. R à SQL 3. Computation results No data here Data doesn t need to leave the database All heavy lifting, including model estimation & computation, are done in the database Database/Hadoop w/ MADlib Data lives here 9

More Piggybacking 10

How Pivotal Data Scientists Select Which Tool to Use Prototype in R or directly in MADlib/PivotalR Optimized for both algorithm efficiency and code overhead Is the algorithm of choice available in MADlib/PivotalR? Yes Build final set of models in MADlib/PivotalR No Do opportunities for explicit parallelization exist? Yes Build final set of models in PL/R No Connect to Pivotal via ODBC 11

What is Data Parallelism? Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks Also known as explicit parallelism Examples: Have each person in this room weigh themselves: Measure each person s weight in parallel Count a deck of cards by dividing it up between people in this room: Count in parallel MapReduce apply() family of functions in R 12

Procedural Language R (PL/R) Parallelized model building in the R language Originally developed by Joe Conway for PostgreSQL Parallelized by virtue of piggybacking on distributed architectures SQL & R http://pivotalsoftware.github.io/gp-r/ 13

Parallelized Analytics in Pivotal via PL/R: An Example SQL & R TN Data CA Data NY Data PA Data TX Data CT Data NJ Data IL Data MA Data WA Data TN Model CA Model NY Model PA Model TX Model CT Model NJ Model IL Model MA Model WA Model Parsimonious R piggy-backs on Pivotal s parallel architecture Minimize data movement Build predictive model for each state in parallel 14

Parallelized R via PL/R: One Example of Its Use With placeholders in SQL, write functions in the native R language Accessible, powerful modeling framework

Parallelized R via PL/R: One Example of Its Use Execute PL/R function Plain and simple table is returned 16

Examples of Usage 17

Pivotal Data Science: Areas of Expertise 18

Pivotal Data Science: Packaged Services LAB PRIMER (2-Week Roadmapping) DATA JAM (Internal DS Contest) LAB 100 (Analytics Bundle) LAB 600 (6-Week Lab) LAB 1200 (12-Week Lab) Analytics Roadmap Prioritized Opportunities Architectural Recommendati ons Hands-on training Hosted data on Pivotal Data stack Results review & assessment On-site MPP analytics training Analytics toolkit Quick insight (2 weeks) Prof. services Data science model building Ready-todeploy model(s) Prof. services Data science model building Ready-todeploy model(s) 19

The Internet of Things: Smart Meter Analytics

Engagement Summary Objective Build key foundations of a data-driven framework for anomaly detection to leverage in revenue protection initiatives Results With limited access to limited data, our models (FFT and Time Series Analysis) identified 191K potentially anomalous meters (7% of all meters). High Performance Pivotal Big Data Suite including MADlib and PL/R 90 seconds to compute FFT for over 3.1 million meters (~3.5 billion readings) à 0.0288 ms/meter ~36 minutes to compute time series models for over 3.1 million meters (~3.5 billion readings) à 0.697 ms/meter 21

Anomaly Detection Methodology & Results All Data (4.5 million meters / ~20 billion meter readings) Step 1: Select Data for Advanced Modeling (3.1 million meters / ~3.5 billion meter readings) Step 2: Detect Anomalies From Frequency Domain Analysis (547K) Step 4: Detect Anomalies From Combined Analysis (191K) Step 3: Detect Anomalies From Time Domain Analysis (485K) 22

-- create type to store frequency, spec, and max freq! create type fourier_type AS (! freq text, spec text, freq_with_maxspec float8);!! -- create plr function to compute periodogram and return frequency with maximum spectral density! create or replace function pgram_concise(tsval float8[])! RETURNS float8 AS! $$! rpgram <- spec.pgram(tsval,fast=false,plot=false,detrend=true)! freq_with_maxspec <- rpgram$freq[which(rpgram$spec==max(rpgram$spec))]! return(freq_with_maxspec)! $$! LANGUAGE 'plr ;!! -- execute function! create table pg_gram_results! as select geo_id, meter_id, pgram_concise(load_ts) FROM! meter_data distributed by (geo_id,meter_id);!

Number of Meters 0 500000 1000000 1500000 2000000 Most Households Use Energy in Daily or Half-Daily Cycles 12 24 Estimated Periodicity of Meters Meters with Anomalous Usage Patterns (~20% of all meters) 546 1092 0 200 400 600 800 1000 546 Periodicity in Hours 1092 Dominant periodicity (i.e. maximum frequency) of each meter is computed ~80% of all households show daily or half-daily patterns of energy usage ~20% of all households show anomalous patterns of energy usage Flag meters falling into the 20% as potentially anomalous meters Follow-up Items: Event type of the anomalies w.r.t. Revenue Protection to be determined with additional data & models 24

Irregular Patterns of Energy Consumption Displayed by Detected Anomalous Meters FFT Analysis: Time Series of an Ordinary Meter Load 0 1 2 3 4 0 200 400 600 800 1000 Elaspsed Hours FFT Analysis: Time Series of an Anomalous Meter Load 0 1 2 3 4 5 6 0 200 400 600 800 1000 Elaspsed Hours 25

Parallelize the Generation of Visualizations

Parallelize Visualization Generation

Parallelize Visualization Generation 28

Demand Modeling & What-If Scenario Analysis Scalable Algorithm Development Using R Prototyping Dashboards on RShiny

Engagement Overview Customer s Business Goal Make data-driven decisions about how to allocate resources for planning & inventory management Compose rich set of reusable data assets from disparate LOBs and make available for ongoing analysis & reporting Build parallelized demand models for 100+ products & locations Develop scalable Hierarchical/Multilevel Bayesian Modeling algorithm (Gibbs Sampling) Construct framework & prototype app for what-if scenario analysis in RShiny 30

Overview of Hierarchical Linear Model Likelihood Priors for parameters Priors for hyperparameters Posterior Likelihood x Priors for parameters x Priors for hyperparameters This joint posterior distribution does not take the form of a known probability density, thus it is a challenge to draw samples from it directly à However, the full conditional posterior distributions follow known probability densities (Gibbs Sampling) 31

Game Plan 1. Figure out which components of the Gibbs Sampler can be embarrassingly parallelized, i.e. the key building blocks Mostly matrix algebra calculations & draws from full conditional distributions, parallelized by Product-Location 2. Build functions (i.e. in PL/R) for each of the building blocks 3. Build a meta-function that ties together each of the functions in (2) to run a Gibbs Sampler 4. Run functions for K iterations, monitor convergence, summarize results 32

Examples of Building Block Functions

Meta-Function & Execution 34

PivotalR & RShiny RShiny RPostgreSQL PivotalR SQL to execute R à SQL Computation results No data here Data doesn t need to leave the database All heavy lifting, including model estimation & computation, are done in the database Database/Hadoop w/ MADlib RShiny Server Data lives here 35

36

37

Next Steps Continue to build even more PivotalR wrapper functions Identify more areas where core R functions can be releveraged and made scalable via PivotalR Explore, learn, and share notes with other packages like PivotalR Explore closer integration with Spark, MLlib, H20 PL/R wrappers directly from R 38

Thank You Have Any Questions? 39

Check out the Pivotal Data Science Blog! http://blog.pivotal.io/data-science-pivotal 40

Additional References PivotalR http://cran.r-project.org/web/packages/pivotalr/pivotalr.pdf https://github.com/pivotalsoftware/pivotalr Video Demo PL/R & General Pivotal+R Interoperability http://pivotalsoftware.github.io/gp-r/ MADlib http://madlib.net/ http://doc.madlib.net/latest/ 41

BUILT FOR THE SPEED OF BUSINESS