L3: Statistical Modeling with Hadoop



Similar documents
The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Expedia

Generalized Linear Models

Machine Learning Logistic Regression

Big Data Analytics. Lucas Rego Drumond

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

Local classification and local likelihoods

Scalable Machine Learning - or what to do with all that Big Data infrastructure

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Journée Thématique Big Data 13/03/2015

CSCI567 Machine Learning (Fall 2014)

Parallelization Strategies for Multicore Data Analysis

Distributed Computing and Big Data: Hadoop and MapReduce

Applied Multivariate Analysis - Big data analytics

Map-Reduce for Machine Learning on Multicore

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

Introduction to Logistic Regression

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Section 6: Model Selection, Logistic Regression and more...

The Stratosphere Big Data Analytics Platform

Advanced In-Database Analytics

Statistical Machine Learning

In-Database Analytics

SAS Software to Fit the Generalized Linear Model

Machine Learning Big Data using Map Reduce

Hadoop Ecosystem B Y R A H I M A.

VI. Introduction to Logistic Regression

Predictive Dynamix Inc

Logistic regression (with R)

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Lecture 6: Logistic Regression

Mammoth Scale Machine Learning!

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

L1: Introduction to Hadoop

A Performance Analysis of Distributed Indexing using Terrier

Lecture 8 February 4

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Developing a MapReduce Application

Logistic Regression (1/24/13)

Fast Analytics on Big Data with H20

R at the front end and

INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS

Hadoop Parallel Data Processing

Lecture 3: Linear methods for classification

High Performance Predictive Analytics in R and Hadoop:

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Predict Influencers in the Social Network

Shroudbase Technical Overview

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Advanced Big Data Analytics with R and Hadoop

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

MapReduce/Bigtable for Distributed Optimization

Data Mining Practical Machine Learning Tools and Techniques

Spark: Cluster Computing with Working Sets

Examining a Fitted Logistic Model

A Basic Guide to Modeling Techniques for All Direct Marketing Challenges

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

R and Hadoop: Architectural Options. Bill Jacobs VP Product Marketing & Field CTO, Revolution

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc.

Multivariate Logistic Regression

Spark and the Big Data Library

Some Essential Statistics The Lure of Statistics

Régression logistique : introduction

Big Data and Data Science: Behind the Buzz Words

Nominal and ordinal logistic regression

Classification On The Clouds Using MapReduce

Logistic Regression for Spam Filtering

BIG DATA What it is and how to use?


Advanced analytics at your hands

Distributed Machine Learning and Big Data

Lecture 8: Gamma regression

not possible or was possible at a high cost for collecting the data.

Analytics on Big Data

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

High Productivity Data Processing Analytics Methods with Applications

A very short Intro to Hadoop

The Artificial Prediction Market

How To Write A Data Processing Pipeline In R

Logistic Regression.

Predictive Modeling Techniques in Insurance

Linear Threshold Units

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.

RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

Logistic Regression. BUS 735: Business Decision Making and Research

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Open Access Research of Fast Search Algorithm Based on Hadoop Cloud Platform

Introduction to Online Learning Theory

MapReduce Evaluator: User Guide

Transcription:

L3: Statistical Modeling with Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 10, 2014

Today we are going to learn... 1 Linear regression with Hadoop 2 Logistic regression with Hadoop Feng Li (SAM.CUFE.EDU.CN) Parallel Computing for Big Data 2 / 10

Linear Regression I Assume we have a large dataset. How will we perform regression data analysis now? In such cases, we can use R and Hadoop integration to perform parallel linear regression by implementing Mapper and Reducer. It will divide the dataset into chunks among the available nodes and then they will process the distributed data in parallel. It will not fire memory issues when we run with an R and Hadoop cluster because the large dataset is going to be distributed and processed with R among Hadoop computation nodes. Also, keep in mind that this implemented method does not provide higher prediction accuracy than the lm() model. Assume we have data set contains both y nˆ1 and X nˆp. The linear model yields the following solution to ˆβ y = Xβ + ɛ ˆβ = (X 1 X) 1 X 1 y Feng Li (SAM.CUFE.EDU.CN) Parallel Computing for Big Data 3 / 10

Linear Regression II In OLS, to find coefficients is equal to find the solution of the linear system The Big Data problem: n ąą p (X 1 X)β = X 1 y The calculations of X 1 X and X 1 y is very computational demanding. But notice that the final output of (X 1 X) pˆp and (X 1 y) pˆ1 are fairly small. The solution: Let Hadoop calculate X 1 X and X 1 y. The final results can be obtained afterwards. The outline of the linear regression algorithm is as follows: 1 Calculating the X 1 X value with MapReduce job1. 2 Calculating the X 1 y value with MapReduce job2. 3 Deriving the coefficient values with solve(x 1 X, X 1 y). Feng Li (SAM.CUFE.EDU.CN) Parallel Computing for Big Data 4 / 10

Logistic regression I In statistics, logistic regression or logit regression is a type of probabilistic classification model. Logistic regression is used extensively in numerous disciplines, including the medical and social science fields. It can be binomial or multinomial. Binary logistic regression deals with situations in which the outcome for a dependent variable can have two possible types. Multinomial logistic regression deals with situations where the outcome can have three or more possible types. Logistic regression can be implemented using logistic functions. The logit model connects the explanatory variables in this way P i = 1 1 + exp( (β 1 + β 2 X i )) Feng Li (SAM.CUFE.EDU.CN) Parallel Computing for Big Data 5 / 10

Logistic regression II Alternatively we can write the model in this way log P i 1 P i = β 1 + β 2 X i where P i /(1 P i ) is called odds ratio: the ratio of probability of a family will own a house to the probability of not owing a house. This model can be easily estimated with R > glm(formula, data=inputdata, family="binomial") Bad news: The above estimation requires sequential iterative method. Will the following hypothetical Hadoop workflow work? Defining the Mapper function Defining the Reducer function Defining the Logistic Regression MapReduce function Feng Li (SAM.CUFE.EDU.CN) Parallel Computing for Big Data 6 / 10

Logistic regression III Logistic regression is the standard industry workhorse that underlies many production fraud detection and advertising quality and targeting products. The most common implementations use Stochastic Gradient Descent (SGD) to all large training sets to be used. The good news is that it is blazingly fast and thus it is not a problem for Hadoop implementation to handle training sets of tens of millions of examples. With the down-sampling typical in many data-sets, this is equivalent to a dataset with billions of raw training examples. The ready to use solutions: Apache Mahout (see next lecture) https://mahout.apache.org/users/classification/logistic-regression.html RHadoop: The RHadoop package rmr2 allow you to code your own gradient descent method with Hadoop. https://github.com/revolutionanalytics/rmr2 Feng Li (SAM.CUFE.EDU.CN) Parallel Computing for Big Data 7 / 10

Logistic Regression: The divide and conquer approach I Consider the following logistic regression model p(y i ) = exptx1 i βu 1 + exptx 1 i βu where x i = (x i1,..., x ip ) 1, i = 1,..., n. The p β for β coefficients can be estimated via maximum likelihood method. The Logistic Regression with subsets Divide n sample into k blocks that each block consists of m observationsãăć Do logistic regression with each block on a single node. mÿ pβ l = arg max ty li xliβ 1 log(1 + exptxiβu)u. 1 i=1 The Full Logistic Regression model with coefficients β p can be approximated by weighted average of β p l pβ = 1 kÿ pβ l. k l=1 Feng Li (SAM.CUFE.EDU.CN) Parallel Computing for Big Data 8 / 10

Logistic Regression: The divide and conquer approach II Some properties: When m Ñ, n Ñ, it can be shown that Consistency pβ p Ñ β. Asymptotic Normal? n( p β β) N(0, I(β)) where 1 I(β) = lim mñ k kÿ t 1 mÿ var(s(β ))u 1 l. m l=1 i=1 Feng Li (SAM.CUFE.EDU.CN) Parallel Computing for Big Data 9 / 10

Assignment (III) Use the dataset given on Hadoop server to implement at least one of statistical models you have learned before. Send your report to Feng no later than Dec 31. Feng Li (SAM.CUFE.EDU.CN) Parallel Computing for Big Data 10 / 10