PivotalR: A Package for Machine Learning on Big Data



Similar documents
Enabling R for Big Data with PL/R and PivotalR Real World Examples on Hadoop & MPP Databases

Advanced In-Database Analytics

Big Data and Data Science: Behind the Buzz Words

Accelerating GeoSpatial Data Analytics With Pivotal Greenplum Database

Greenplum Database. Getting Started with Big Data Analytics. Ofir Manor Pre Sales Technical Architect, EMC Greenplum

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

I/O Considerations in Big Data Analytics

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Hadoop & SAS Data Loader for Hadoop

EMC GREENPLUM DATABASE

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

SEIZE THE DATA SEIZE THE DATA. 2015

EMC Greenplum Driving the Future of Data Warehousing and Analytics. Tools and Technologies for Big Data

Architectures for Big Data Analytics A database perspective

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

HAWQ Architecture. Alexey Grishchenko

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian

Big Data and Big Analytics

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop

Massive scale analytics with Stratosphere using R

In-Database Analytics

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics

Bringing Big Data to People

Information Architecture

Tax Fraud in Increasing

Universal PMML Plug-in for EMC Greenplum Database

Comprehensive Analytics on the Hortonworks Data Platform

Pivotal HAWQ Release Notes

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Fundamentals Curriculum HAWQ

BIG DATA-AS-A-SERVICE

Big Data and the Data Lake. February 2015

Mike Maxey. Senior Director Product Marketing Greenplum A Division of EMC. Copyright 2011 EMC Corporation. All rights reserved.

Data processing goes big

Fast Analytics on Big Data with H20

Massive Cloud Auditing using Data Mining on Hadoop

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Parallel Time Series Modeling - A Case Study of In-Database Big Data Analytics

Customized Report- Big Data

6 Steps to Faster Data Blending Using Your Data Warehouse

2015 The MathWorks, Inc. 1

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

The Inside Scoop on Hadoop

Advanced Big Data Analytics with R and Hadoop

What s Cooking in KNIME

SEIZE THE DATA SEIZE THE DATA. 2015

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Choosing The Right Big Data Tools For The Job A Polyglot Approach

BIG DATA: FIVE TACTICS TO MODERNIZE YOUR DATA WAREHOUSE

BIG DATA What it is and how to use?

Search and Real-Time Analytics on Big Data

Internet of Things. Opportunity Challenges Solutions

Bringing Big Data Modelling into the Hands of Domain Experts

ANALYTICS CENTER LEARNING PROGRAM

Oracle Big Data Discovery Unlock Potential in Big Data Reservoir

STATISTICA Solutions for Financial Risk Management Management and Validated Compliance Solutions for the Banking Industry (Basel II)

Big Data and Market Surveillance. April 28, 2014

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Big Data on Microsoft Platform

What s New in MATLAB and Simulink

Harnessing the power of advanced analytics with IBM Netezza

Prerequisites. Course Outline

File S1: Supplementary Information of CloudDOE

High Performance Predictive Analytics in R and Hadoop:

Modern Data Architecture for Predictive Analytics

Big Data and Data Science. The globally recognised training program

Big Data Technologies Compared June 2014

Luncheon Webinar Series May 13, 2013

Safe Harbor Statement

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

White Paper. Unified Data Integration Across Big Data Platforms

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features

Unified Data Integration Across Big Data Platforms

6.0, 6.5 and Beyond. The Future of Spotfire. Tobias Lehtipalo Sr. Director of Product Management

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances

Predictive Analytics

Discovering Business Insights in Big Data Using SQL-MapReduce

Data Science And Big Data Analytics Course

Navigating Big Data business analytics

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Oracle Big Data Strategy Simplified Infrastrcuture

Big Data Are You Ready? Jorge Plascencia Solution Architect Manager

How To Handle Big Data With A Data Scientist

How To Use Hp Vertica Ondemand

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Big Data and Analytics 21 A Technical Perspective Abhishek Bhattacharya, Aditya Gandhi and Pankaj Jain November 2012

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud

Bringing Intergalactic Data Speak (a.k.a.: SQL) to Hadoop Martin Willcox Director Big Data Centre of Excellence (Teradata

Ganzheitliches Datenmanagement

Transcription:

PivotalR: A Package for Machine Learning on Big Data Hai Qian Predictive Analytics Team, Pivotal Inc. madlib@gopivotal.com Copyright 2013 Pivotal. All rights reserved.

What Can Small Data Scientists Bring on Their Big Data Journey? Small Data Ra nd p use ython r Databases In-me m Flat files ory m buildin odel g Command-line tools Big Data nd Ra n o pyth r use MapRe duce Many tools and approaches are being adapted to big data technologies S HDF Cloud computing P MP ses aba t a d puting m o c uted b i r t s i D Command-line tools Copyright 2014 Pivotal. All rights reserved.

Tools for Data Scientists COMMERCIAL OPEN SOURCE (OR FREE) PL/R, PL/Python PL/Java Copyright 2014 Pivotal. All rights reserved.

PivotalR Design Overview Database/Hadoop w/ MADlib RPostgreSQL PivotalR 2. SQL to execute 1. R SQL 3. Computation results No data here Copyright 2014 Pivotal. All rights reserved. Data lives here

Data Operators crossprod, scale, sample Arith, Logical, Cast Extraction: x[,-2], x[,1:3], x[,c( rings, sex )], x$arr[, 1] Replacement: x[x$sex == I,] <- NA Support array columns Easy to construct complicated SQL queries. For example, filter NULL values for (i in 1:ncol(w)) w <- w[!is.na(w[i]),] Copyright 2013 Pivotal. All rights reserved.

Machine learning Current MADlib wrapper functions: madlib.lm, madlib.glm, madlib.summary, madlib.arima, madlib.elnet Related functions: generic.cv, generic.bagging, margins, predict Support for formula y Support for categorical variables as.factor, predict Copyright 2013 Pivotal. All rights reserved. ~. - x[1:2] - z + factor(w) relevel,

MADlib In-database Machine Learning Library Pivotal Confidential Internal Use Only

How Pivotal Data Scientists Select Which Pivotal Tool to Use Copyright 2014 Pivotal. All rights reserved.

MADlib: Toolkit for Advanced Big Data Analytics Better Parallelism Algorithms designed to leverage MPP or Hadoop architecture Better Scalability Algorithms scale as your data set scales No data movement Better Predictive Accuracy Using all data, not a sample, may improve accuracy Open Source Available for customization and optimization by user Copyright 2014 Pivotal. All rights reserved.

Which platforms does it run on? (Partly ported) Impala HAWQ HDFS Pivotal Confidential Internal Use Only GPDB PostgreSQL

Shared-Nothing Database Architecture MPP (Massively Parallel Processing) Master Servers... SQL MapReduce... Query planning & dispatch Network Interconnect Segment Servers Query processing & data storage External Sources Loading, streaming, etc. Pivotal Confidential Internal Use Only......

Pivotal Confidential Internal Use Only

Example usage Train a model Predict for new data Pivotal Confidential Internal Use Only

But not all Data Scientists speak SQL Accessing Scalability through R Pivotal Confidential Internal Use Only

PivotalR: A familiar R interface Current version 0.1.16.12 Pivotal R d <- db.data.frame( houses") houses_linregr <- madlib.lm( price ~ tax + bath + size, data=d) Copyright 2014 Pivotal. All rights reserved. SQL Code SELECT madlib.linregr_train( 'houses, 'houses_linregr, 'price, 'ARRAY[1, tax, bath, size] );

Machine learning Something that MADlib cannot do generic.cv margins Not easy to do in MADlib on the server side as.factor and relevel step Copyright 2013 Pivotal. All rights reserved.

Quick Prototype Examples (see the script): Linear regression PCA Poisson regression Left inverse of a matrix AdaBoost Portable Same code on all supported platforms Copyright 2013 Pivotal. All rights reserved.

Sending R code into Databases Write R scripts that be sent into the database No translation to SQL Any R functions Copyright 2013 Pivotal. All rights reserved.

Testing Framework R CMD INSTALL --install-tests PivotalR_0.1.16.2.tar.gz Copyright 2013 Pivotal. All rights reserved. Based on testthat

Future Work Better support of PL/R Better graphics support Support more platforms Copyright 2013 Pivotal. All rights reserved.

Additional References MADlib http://madlib.net/ http://doc.madlib.net/latest/ PivotalR http://cran.r-project.org/web/packages/pivotalr/pivotalr.pdf https://github.com/gopivotal/pivotalr Video Demo Copyright 2014 Pivotal. All rights reserved.