Scalable Data Science with Hadoop, Spark and R. Mario Inchiosa, PhD Principal Software Engineer Microsoft Data Group DSC 2016 July 2, 2016

Similar documents
High Performance Predictive Analytics in R and Hadoop:

Using Microsoft R Server to Address Scalability Issues

R and Hadoop: Architectural Options. Bill Jacobs VP Product Marketing & Field CTO, Revolution

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

R Tools Evaluation. A review by Global BI / Local & Regional Capabilities. Telefónica CCDO May 2015

In-Database Analytics Deep Dive with Teradata and Revolution R

Revolution R Enterprise: Efficient Predictive Analytics for Big Data

Decision Trees built in Hadoop plus more Big Data Analytics with Revolution R Enterprise

Revolution R Enterprise

Fast Analytics on Big Data with H20

SEIZE THE DATA SEIZE THE DATA. 2015

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Technical Paper. Performance of SAS In-Memory Statistics for Hadoop. A Benchmark Study. Allison Jennifer Ames Xiangxiang Meng Wayne Thompson

Big Data on Microsoft Platform

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How To Test The Performance Of An Ass 9.4 And Sas 7.4 On A Test On A Powerpoint Powerpoint 9.2 (Powerpoint) On A Microsoft Powerpoint 8.4 (Powerprobe) (

Advanced Big Data Analytics with R and Hadoop

Bayesian networks - Time-series models - Apache Spark & Scala

Microsoft Research Microsoft Azure for Research Training

MSCA Introduction to Statistical Concepts

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Predicting Flight Delays

STATISTICA Formula Guide: Logistic Regression. Table of Contents

HIGH PERFORMANCE ANALYTICS FOR TERADATA

PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012

Microsoft Research Windows Azure for Research Training

Advanced In-Database Analytics

Predictive Modeling Techniques in Insurance

Map-Reduce for Machine Learning on Multicore

L3: Statistical Modeling with Hadoop

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

CSCI6900 Assignment 2: Naïve Bayes on Hadoop

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

extreme Datamining mit Oracle R Enterprise

Azure Data Lake Analytics

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

Assignment # 1 (Cloud Computing Security)

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Apache Hama Design Document v0.6

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Modern Data Architecture for Predictive Analytics

RevoScaleR Speed and Scalability

Hadoop & Spark Using Amazon EMR

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Comprehensive Analytics on the Hortonworks Data Platform

BIG DATA What it is and how to use?

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.

Journée Thématique Big Data 13/03/2015

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Microsoft Azure Machine learning Algorithms

New Work Item for ISO Predictive Analytics (Initial Notes and Thoughts) Introduction

Data analysis process

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Chase Wu New Jersey Ins0tute of Technology

RevoScaleR 7.3 HPC Server Getting Started Guide

Understanding the Benefits of IBM SPSS Statistics Server

Data Mining in the Swamp

Bringing Big Data Modelling into the Hands of Domain Experts

April 2016 JPoint Moscow, Russia. How to Apply Big Data Analytics and Machine Learning to Real Time Processing. Kai Wähner.

Package dsmodellingclient

MSCA Introduction to Statistical Concepts

Big Data and Data Science: Behind the Buzz Words

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Azure Machine Learning, SQL Data Mining and R

Predictive Analytics Powered by SAP HANA. Cary Bourgeois Principal Solution Advisor Platform and Analytics

Workshop on Hadoop with Big Data

Data Algorithms. Mahmoud Parsian. Tokyo O'REILLY. Beijing. Boston Farnham Sebastopol

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Hadoop Ecosystem B Y R A H I M A.

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Register on projectbotticelli.com. Introduction to BI & Big Data DAX MDX Data Mining

Using In-Memory Computing to Simplify Big Data Analytics

Big Data With Hadoop

SEIZE THE DATA SEIZE THE DATA. 2015

Parallel Data Warehouse

What s Cooking in KNIME

International Journal of Innovative Research in Computer and Communication Engineering

Distributed Computing and Big Data: Hadoop and MapReduce

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

PROPOSAL To Develop an Enterprise Scale Disease Modeling Web Portal For Ascel Bio Updated March 2015

Testing Big data is one of the biggest

The Stratosphere Big Data Analytics Platform

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Automating Big Data Benchmarking for Different Architectures with ALOJA

SQL Server Everything built-in. Csom Gergely Microsoft Adat platform szakértő

RapidMiner Radoop Documentation

Laurence Liew General Manager, APAC. Economics Is Driving Big Data Analytics to the Cloud

Transcription:

Scalable Data Science with Hadoop, Spark and R Mario Inchiosa, PhD Principal Software Engineer Microsoft Data Group DSC 2016 July 2, 2016

Microsoft R Server Cloud Hadoop & Spark R Server portfolio R Server Technology EDW RDBMS Desktops & Servers

R Server Parallel External Memory Algorithms (PEMAs) The initialize() method of the master Pema object is executed The master Pema object is serialized and sent to each worker process The worker processes call processdata() once for each chunk of data The fields of the worker s Pema object are updated from the data In addition, a data frame may be returned from processdata(), and will be written to an output data source When a worker has processed all of its data, it sends its reserialized Pema object back to the master (or an intermediate combiner) The master process loops over all of the Pema objects returned to it, calling updateresults() to update its Pema object processresults() is then called on the master Pema object to convert intermediate results to final results hasconverged(), whose default returns TRUE, is called, and either the results are returned to the user or another iteration is started 3

R Script for Execution in MapReduce Sample R Script: Define Compute Context Define Data Source rxsetcomputecontext( RxHadoopMR( ) ) indata <- RxTextData( /ds/airontime.csv, filesystem = hdfsfs) model <- rxlogit(arr_del15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = indata) Train Predictive Model

Easy to Switch From MapReduce to Spark Sample R Script: Change the Compute Context Keep other code unchanged rxsetcomputecontext( RxSpark( ) )

R Server: scale-out R 100% compatible with open source R Any code/package that works today with R will work in R Server Wide range of scalable and distributed R functions Examples: rxdatastep(), rxsummary(), rxglm(), rxdforest(), rxpredict() Ability to parallelize any R function Ideal for parameter sweeps, simulation, scoring

Parallelized & Distributed Algorithms ETL Data import Delimited, Fixed, SAS, SPSS, OBDC Variable creation & transformation Recode variables Factor variables Missing value handling Sort, Merge, Split Aggregate by category (means, sums) Descriptive Statistics Min / Max, Mean, Median (approx.) Quantiles (approx.) Standard Deviation Variance Correlation Covariance Sum of Squares (cross product matrix for set variables) Pairwise Cross tabs Risk Ratio & Odds Ratio Cross-Tabulation of Data (standard tables & long form) Marginal Summaries of Cross Tabulations Statistical Tests Chi Square Test Kendall Rank Correlation Fisher s Exact Test Student s t-test Predictive Statistics Sum of Squares (cross product matrix for set variables) Multiple Linear Regression Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions. Covariance & Correlation Matrices Logistic Regression Predictions/scoring for models Residuals for all models Variable Selection Stepwise Regression Machine Learning Decision Trees Decision Forests Gradient Boosted Decision Trees Naïve Bayes Clustering K-Means Sampling Subsample (observations & variables) Random Sampling Simulation Simulation (e.g. Monte Carlo) Parallel Random Number Generation Custom Parallelization rxdatastep rxexec PEMA-R API

R Server Hadoop Architecture Data in Distributed Storage R process on Edge Node R R R R R R R R R R Master R process on Edge Node Apache YARN and Spark R Server Worker R processes on Data Nodes

R Server for Hadoop - Connectivity Remote Execution: ssh Edge Node Worker Task ssh or R Tools for Visual Studio https:// R Server Master Task Initiator or Worker Task Thin Client IDEs Finalizer MapReduce https:// Worker Task Jupyter Notebooks Web Services DeployR BI Tools & Applications

HDInsight + R Server: Managed Hadoop for Advanced Analytics in the Cloud SparkR functions R Spark and Hadoop Blob Storage Data Lake Storage RevoScaleR functions Easy setup, elastic, SLA Spark Integrated notebooks experience Upgraded to latest Version 1.6.1 R Server Leverage R skills with massively scalable algorithms and statistical functions Reuse existing R functions over multiple machines

R Server on Hadoop/HDInsight scales to hundreds of nodes, billions of rows and terabytes of data Logistic Regression on NYC Taxi Dataset 2.2 TB Elapsed Time 0 1 2 3 4 5 6 7 8 9 10 11 12 13 Billions of rows

Typical advanced analytics lifecycle Prepare Model Operationalize

Airline Arrival Delay Prediction Demo Clean/Join Using SparkR from R Server Train/Score/Evaluate Scalable R Server functions Deploy/Consume Using AzureML from R Server

Airline data set Passenger flight on-time performance data from the US Department of Transportation s TranStats data collection >20 years of data 300+ Airports Every carrier, every commercial flight http://www.transtats.bts.gov

Weather data set Hourly land-based weather observations from NOAA > 2,000 weather stations http://www.ncdc.noaa.gov/orders/qclcd/

Provisioning a cluster with R Server

Scaling a cluster

Clean and Join using SparkR in R Server

Train, Score, and Evaluate using R Server

Publish Web Service from R

Demo Technologies HDInsight Premium Hadoop cluster Spark on YARN distributed computing R Server R interpreter SparkR data manipulation functions RevoScaleR Statistical & Machine Learning functions AzureML R package and Azure ML web service

Building a genetic disease risk application with R BAM BAM BAM BAM BAM Data Public genome data from 1000 Genomes About 2TB of raw data VariantTools GWAS Platform HDInsight Hadoop (8 clusters) 1500 cores, 4 data centers Microsoft R Server Processing VariantTools R package (Bioconductor) Match against NHGRI GWAS catalog Analytics Disease Risk Ancestry Presentation Expose as Web Service APIs Phone app, Web page, Enterprise applications

microsoft.com/r-server microsoft.com/hdinsight