1 PivotalR: A Package for Machine Learning on Big Data Hai Qian Predictive Analytics Team, Pivotal Inc. Copyright 2013 Pivotal. All rights reserved.
2 What Can Small Data Scientists Bring on Their Big Data Journey? Small Data Ra nd p use ython r Databases In-me m Flat files ory m buildin odel g Command-line tools Big Data nd Ra n o pyth r use MapRe duce Many tools and approaches are being adapted to big data technologies S HDF Cloud computing P MP ses aba t a d puting m o c uted b i r t s i D Command-line tools Copyright 2014 Pivotal. All rights reserved.
3 Tools for Data Scientists COMMERCIAL OPEN SOURCE (OR FREE) PL/R, PL/Python PL/Java Copyright 2014 Pivotal. All rights reserved.
4 PivotalR Design Overview Database/Hadoop w/ MADlib RPostgreSQL PivotalR 2. SQL to execute 1. R SQL 3. Computation results No data here Copyright 2014 Pivotal. All rights reserved. Data lives here
5 Data Operators crossprod, scale, sample Arith, Logical, Cast Extraction: x[,-2], x[,1:3], x[,c( rings, sex )], x$arr[, 1] Replacement: x[x$sex == I,] <- NA Support array columns Easy to construct complicated SQL queries. For example, filter NULL values for (i in 1:ncol(w)) w <- w[!is.na(w[i]),] Copyright 2013 Pivotal. All rights reserved.
6 Machine learning Current MADlib wrapper functions: madlib.lm, madlib.glm, madlib.summary, madlib.arima, madlib.elnet Related functions: generic.cv, generic.bagging, margins, predict Support for formula y Support for categorical variables as.factor, predict Copyright 2013 Pivotal. All rights reserved. ~. - x[1:2] - z + factor(w) relevel,
7 MADlib In-database Machine Learning Library Pivotal Confidential Internal Use Only
8 How Pivotal Data Scientists Select Which Pivotal Tool to Use Copyright 2014 Pivotal. All rights reserved.
9 MADlib: Toolkit for Advanced Big Data Analytics Better Parallelism Algorithms designed to leverage MPP or Hadoop architecture Better Scalability Algorithms scale as your data set scales No data movement Better Predictive Accuracy Using all data, not a sample, may improve accuracy Open Source Available for customization and optimization by user Copyright 2014 Pivotal. All rights reserved.
10 Which platforms does it run on? (Partly ported) Impala HAWQ HDFS Pivotal Confidential Internal Use Only GPDB PostgreSQL
13 Example usage Train a model Predict for new data Pivotal Confidential Internal Use Only
14 But not all Data Scientists speak SQL Accessing Scalability through R Pivotal Confidential Internal Use Only
15 PivotalR: A familiar R interface Current version Pivotal R d <- db.data.frame( houses") houses_linregr <- madlib.lm( price ~ tax + bath + size, data=d) Copyright 2014 Pivotal. All rights reserved. SQL Code SELECT madlib.linregr_train( 'houses, 'houses_linregr, 'price, 'ARRAY[1, tax, bath, size] );
16 Machine learning Something that MADlib cannot do generic.cv margins Not easy to do in MADlib on the server side as.factor and relevel step Copyright 2013 Pivotal. All rights reserved.
17 Quick Prototype Examples (see the script): Linear regression PCA Poisson regression Left inverse of a matrix AdaBoost Portable Same code on all supported platforms Copyright 2013 Pivotal. All rights reserved.
18 Sending R code into Databases Write R scripts that be sent into the database No translation to SQL Any R functions Copyright 2013 Pivotal. All rights reserved.
19 Testing Framework R CMD INSTALL --install-tests PivotalR_ tar.gz Copyright 2013 Pivotal. All rights reserved. Based on testthat
20 Future Work Better support of PL/R Better graphics support Support more platforms Copyright 2013 Pivotal. All rights reserved.
21 Additional References MADlib PivotalR https://github.com/gopivotal/pivotalr Video Demo Copyright 2014 Pivotal. All rights reserved.
For: Application Development & Delivery Professionals The Forrester Wave : Big Data Hadoop Solutions, Q1 2014 by Mike Gualtieri and Noel Yuhanna, February 27, 2014 Key Takeaways Hadoop s Momentum Is Unstoppable
White Paper BIG DATA-AS-A-SERVICE What Big Data is about What service providers can do with Big Data What EMC can do to help EMC Solutions Group Abstract This white paper looks at what service providers
An Oracle White Paper June 2013 Oracle: Big Data for the Enterprise Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing
Emergence and Taxonomy of Big Data as a Service Benoy Bhagattjee Working Paper CISL# 2014-06 May 2014 Composite Information Systems Laboratory (CISL) Sloan School of Management, Room E62-422 Massachusetts
IBM SPSS Modeler 15 User s Guide Note: Before using this information and the product it supports, read the general information under Notices on p. 249. This edition applies to IBM SPSS Modeler 15 and to
white paper Boosting Retail Revenue and Efficiency with Big Data Analytics A Simplified, Automated Approach to Big Data Applications: StackIQ Enterprise Data Management and Monitoring Abstract Contents
32 Big Data: present and future Big Data: present and future Mircea Răducu TRIFU, Mihaela Laura IVAN University of Economic Studies, Bucharest, Romania email@example.com, firstname.lastname@example.org
Oracle Data Integrator 12c New Features Overview Advancing Big Data Integration O R A C L E W H I T E P A P E R M A R C H 2 0 1 5 Table of Contents Executive Overview 1 Oracle Data Integrator 184.108.40.206.1
Web Application Hosting Cloud Architecture Executive Overview This paper describes vendor neutral best practices for hosting web applications using cloud computing. The architectural elements described
INTELLIGENT BUSINESS STRATEGIES W H I T E P A P E R Architecting A Big Data Platform for Analytics By Mike Ferguson Intelligent Business Strategies October 2012 Prepared for: Table of Contents Introduction...
BigBench: Towards an Industry Standard Benchmark for Big Data Analytics Ahmad Ghazal 1,5, Tilmann Rabl 2,6, Minqing Hu 1,5, Francois Raab 4,8, Meikel Poess 3,7, Alain Crolotte 1,5, Hans-Arno Jacobsen 2,9
David Chappell December 2011 WHAT IS AN APPLICATION PLATFORM? Sponsored by Microsoft Corporation Copyright 2011 Chappell & Associates Just about every application today relies on other software: operating
ascent Thought leadership from Atos white paper Data Analytics as a Service: unleashing the power of Cloud and Big Data Your business technologists. Powering progress Big Data and Cloud, two of the trends
PRODUCTS BUSINESSOBJECTS DATA INTEGRATOR IT Benefits Correlate and integrate data from any source Efficiently design a bulletproof data integration process Improve data quality Move data in real time and
Hur hanterar vi utmaningar inom området - Big Data Jan Östling Enterprise Technologies Intel Corporation, NER Legal Disclaimers All products, computer systems, dates, and figures specified are preliminary
Oracle University Contact Us: Local: 1800 103 4775 Intl: +91 80 40291196 Oracle Big Data Essentials Duration: 3 Days What you will learn This Oracle Big Data Essentials training deep dives into using the
Plug Into The Cloud with Oracle Database 12c ORACLE WHITE PAPER DECEMBER 2014 Disclaimer The following is intended to outline our general product direction. It is intended for information purposes only,
Data Virtualization Overview Take Big Advantage of Your Data "Using a data virtualization technique is: number one, much quicker time to market; number two, much more cost effective; and three, gives us
Quantum Q-Cloud Backup-as-a-Service Reference Architecture NOTICE This Technology Brief may contain proprietary information protected by copyright. Information in this Technology Brief is subject to change
Log Management and SIEM Evaluation Checklist Authors: Frank Bijkersma ( email@example.com ) Vinod Shankar (firstname.lastname@example.org) Published on www.infosecnirvana.com, www.frankbijkersma.com Date: