Bayesian Machine Learning (ML): Modeling and Inference in Big Data. Zhuhua Cai, Google / Rice University, caizhua@gmail.com




Syllabus: Bayesian ML concepts (today); Bayesian ML on MapReduce (next morning); Bayesian ML on Spark, SimSQL and Giraph (next afternoon); discussions (Thursday morning).

ML is Everywhere

Problem 1. Figure. An example of supervised regression.

Training set:
Age:    18   25    26    26    27    28    28    29    29    30    31    31    32    33    33    35    37    37    43
Income: 9127 10192 12302 11467 12540 12597 13136 10343 11578 12828 13748 14548 13160 13595 13915 15695 15906 16926 17428

Test set (income unknown):
Age:    20 21 30 25
Income: ?  ?  ?  ?

Problem 2. Figure. An example of unsupervised classification.

Bayesian ML. To apply Bayesian ML:
1. Use a statistical process to explain how the data D is generated, i.e., a generative process with parameters Θ.
2. Learn the parameters/variables Θ in the process given the data D.
3. Use Bayes' rule to learn P(Θ|D):
   P(Θ|D) = P(D|Θ) P(Θ) / P(D) ∝ P(D|Θ) (likelihood) × P(Θ) (prior).

Solution for Problem 1. Model: Bayesian linear regression (LR).
1. Generate a ~ Normal(0, σ0²).
2. Generate b ~ Normal(0, σ0²).
3. Generate σ² ~ InvGamma(1, 1).
4. Given each age x_i: income y_i ~ Normal(a·x_i + b, σ²).
Figure. Training set and test set (the Age/Income table from Problem 1).

Bayesian Inference. Using Bayes' theorem, let Θ = (a, b, σ²). Then
P(Θ|D) ∝ P(D|Θ) P(Θ) = [∏_i Normal(y_i | a·x_i + b, σ²)] · Normal(a | 0, σ0²) · Normal(b | 0, σ0²) · InvGamma(σ² | 1, 1).

Inference Methods. Optimization-based methods: the Newton-Raphson algorithm, the gradient descent algorithm, etc. The expectation-maximization (EM) algorithm. Approximate inference methods. Sampling methods such as Markov chain Monte Carlo (MCMC).

Markov Chain Monte Carlo. A Markov chain is a collection of random variables {X_t} (t = 0, 1, ...) having the property that P(X_t | X_0, X_1, ..., X_{t-1}) = P(X_t | X_{t-1}). Markov chain Monte Carlo (MCMC) is a class of sampling algorithms that sample from a distribution by constructing a Markov chain whose equilibrium states approximate the desired distribution.

Bivariate Normal Distribution. Input: mean (0, 0); covariance [2 1; 1 2]. Output: a set of sampled data points. Steps: select an initial point (x0, y0); sample (x_{i+1}, y_{i+1}) based on (x_i, y_i) repeatedly; after a burn-in number of steps, collect samples periodically.
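The steps above can be sketched as a Gibbs sampler. This is a minimal sketch, assuming we alternate the exact full conditionals of this bivariate normal: with mean (0, 0) and covariance [2 1; 1 2], x | y ~ N(y/2, 3/2) and, by symmetry, y | x ~ N(x/2, 3/2).

```python
import random

def gibbs_bivariate_normal(n_samples, burn_in=1000, thin=10, seed=0):
    # Full conditionals for mean (0, 0), covariance [[2, 1], [1, 2]]:
    #   x | y ~ N(y/2, 3/2),   y | x ~ N(x/2, 3/2)
    rng = random.Random(seed)
    x, y = 0.0, 0.0                      # initial point (x0, y0)
    samples = []
    for t in range(burn_in + n_samples * thin):
        x = rng.gauss(y / 2.0, 1.5 ** 0.5)
        y = rng.gauss(x / 2.0, 1.5 ** 0.5)
        if t >= burn_in and (t - burn_in) % thin == 0:
            samples.append((x, y))       # collect periodically after burn-in
    return samples
```

The burn-in discards the chain's dependence on the arbitrary starting point, and the thinning interval reduces autocorrelation between the collected samples.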

Markov Chain Monte Carlo: Gibbs Sampling.
  Initialize Θ;
  while (true) {
    choose a variable θ in Θ;
    sample θ ~ P(θ | Θ \ θ, D);
  }
Learning the parameters in example 1:
1) Initialize a and b.
2) Sample σ² | a, b, D.
3) Sample a | b, σ², D.
4) Sample b | a, σ², D.
5) Repeat steps 2) to 4).

Markov Chain Monte Carlo. Each Gibbs step samples from a standard distribution. For example, the conditional of a given b, σ², and the data is
a | b, σ², D ~ Normal( (Σ_i x_i (y_i − b) / σ²) / (Σ_i x_i² / σ² + 1/σ0²), 1 / (Σ_i x_i² / σ² + 1/σ0²) )
--- it is a normal distribution.

Code for Problem 1: example1.m
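The referenced MATLAB script (example1.m) is not included in the transcript. A minimal Python sketch of the same Gibbs sampler for Problem 1 might look as follows, assuming the Normal(0, σ0²) priors on a and b with a hypothetical vague σ0² = 1e8, and the InvGamma(1, 1) prior on σ²:

```python
import random

# Training data from the Problem 1 slide (age -> income)
AGES = [18, 25, 26, 26, 27, 28, 28, 29, 29, 30,
        31, 31, 32, 33, 33, 35, 37, 37, 43]
INCOMES = [9127, 10192, 12302, 11467, 12540, 12597, 13136, 10343, 11578, 12828,
           13748, 14548, 13160, 13595, 13915, 15695, 15906, 16926, 17428]

def gibbs_bayes_lr(xs, ys, iters=20000, burn_in=2000, sigma0_sq=1e8, seed=1):
    """Gibbs sampler for y_i ~ N(a*x_i + b, sigma^2) with
    a, b ~ N(0, sigma0_sq) and sigma^2 ~ InvGamma(1, 1)."""
    rng = random.Random(seed)
    n = len(xs)
    a, b = 0.0, 0.0
    draws = []
    for t in range(iters):
        # sigma^2 | a, b, D ~ InvGamma(1 + n/2, 1 + 0.5 * residual sum of squares)
        rss = sum((y - a * x - b) ** 2 for x, y in zip(xs, ys))
        sigma_sq = 1.0 / rng.gammavariate(1.0 + n / 2.0, 1.0 / (1.0 + 0.5 * rss))
        # a | b, sigma^2, D: conjugate normal update
        prec_a = sum(x * x for x in xs) / sigma_sq + 1.0 / sigma0_sq
        mean_a = sum(x * (y - b) for x, y in zip(xs, ys)) / sigma_sq / prec_a
        a = rng.gauss(mean_a, (1.0 / prec_a) ** 0.5)
        # b | a, sigma^2, D: conjugate normal update
        prec_b = n / sigma_sq + 1.0 / sigma0_sq
        mean_b = sum(y - a * x for x, y in zip(xs, ys)) / sigma_sq / prec_b
        b = rng.gauss(mean_b, (1.0 / prec_b) ** 0.5)
        if t >= burn_in:
            draws.append((a, b, sigma_sq))
    return draws

draws = gibbs_bayes_lr(AGES, INCOMES)
a_hat = sum(d[0] for d in draws) / len(draws)
b_hat = sum(d[1] for d in draws) / len(draws)
pred_age_20 = a_hat * 20 + b_hat  # posterior-mean income prediction for age 20
```

Inverse-gamma draws are obtained as the reciprocal of a gamma draw with the matching shape and inverted scale, since Python's standard library has no inverse-gamma sampler.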

Rejection Sampling. We want to sample from a target distribution p(x).
1) Find a simple proposal distribution q(x).
2) Find a constant k such that p(x) ≤ k·q(x) for all x.
3) Generate x ~ q(x) and u ~ Uniform(0, k·q(x)).
4) If u < p(x), use x as a sample; otherwise reject it and repeat 3) and 4).
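The four steps above can be sketched directly. This is a minimal sketch using a hypothetical target p(x) = 2x on [0, 1] with a uniform proposal and envelope constant k = 2 (none of these specifics come from the slides):

```python
import random

def rejection_sample(p, q_sample, q_pdf, k, rng):
    """Draw one sample from density p using proposal q, where p(x) <= k * q(x)."""
    while True:
        x = q_sample(rng)                    # step 3: draw x from the proposal
        u = rng.uniform(0.0, k * q_pdf(x))   # step 3: uniform under the envelope
        if u < p(x):                         # step 4: accept, else retry
            return x

# Example: target p(x) = 2x on [0, 1], uniform proposal q, envelope k = 2
rng = random.Random(0)
samples = [rejection_sample(lambda x: 2 * x,       # target density
                            lambda r: r.random(),  # proposal sampler: Uniform(0, 1)
                            lambda x: 1.0,         # proposal density
                            2.0, rng)
           for _ in range(20000)]
```

The acceptance rate is 1/k when q is normalized, so a tighter envelope constant wastes fewer proposals.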

ML with Big Data. Data is big: the number of rows is large, and the number of columns is large. Problems: it is harder to learn, and it is harder to model.

Problem 3. Problem: given the 20-newsgroups dataset, we want to create a binary classifier to separate religion from non-religion groups. Dataset: http://qwone.com/~jason/20newsgroups/20news19997.tar.gz (~18,000 non-religion documents, ~2,000 religion documents). Figure. Data set.

Problem Modeling. Each document is represented as a sparse feature vector of word counts; there are ~18,000 non-religion documents and ~2,000 religion documents.

Bayesian LR. To create a classifier one could use an SVM, KNN, etc.; here I still use Bayesian LR. Generative model:
1. Generate σ² ~ InvGamma(1, 1).
2. Generate w ~ Normal(0, σ0² I); // w is a column vector.
3. Given each document x_i (a column vector): generate the outcome y_i ~ Normal(w^T x_i, σ²).

Sampling all of w at once:
w | σ², D ~ Normal(μ, S), where S = (X^T X / σ² + I / σ0²)^{-1} and μ = S X^T Y / σ²
(X is the design matrix and Y is the outcome vector).
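This blocked draw can be sketched in a few lines, assuming NumPy is available; the symbols X, Y, σ², and σ0² follow the notation above, and the explicit matrix inversion is the expensive O(m³) step:

```python
import numpy as np

def sample_w_at_once(X, Y, sigma_sq, sigma0_sq, rng):
    """Draw w | sigma^2, D ~ N(mu, S) with
    S  = (X^T X / sigma^2 + I / sigma0^2)^(-1)
    mu = S X^T Y / sigma^2."""
    m = X.shape[1]
    precision = X.T @ X / sigma_sq + np.eye(m) / sigma0_sq
    S = np.linalg.inv(precision)          # the O(m^3) inversion
    mu = S @ (X.T @ Y) / sigma_sq
    return rng.multivariate_normal(mu, S)
```

With a vague prior (large σ0²) and small noise, a draw concentrates near the least-squares solution, which makes the sketch easy to sanity-check on synthetic data.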

Problems with this approach. Computing X^T X has complexity O(n·m²), and inverting the resulting m×m matrix has complexity O(m³), which is also hard to parallelize. For large n (documents) and m (regressors) both costs blow up, and storing the regressors as doubles makes the data itself huge. It can take up to 5 hours per iteration with 100 Amazon m2.4xlarge nodes.

One dimension at a time. Each conditional w_j | w_{-j}, σ², D is a normal distribution. Problems: it is too slow; it requires too many scans over the data, and each scan must follow the previous one sequentially. For large n and m this takes too much time.

A Block-based Sampler. Sample a block of dimensions of w at once: the conditional density of a block w_B is proportional to exp(−0.5 w_B^T A w_B + c^T w_B + const), which is again a multivariate normal whose parameters A and c combine the data terms with the Normal(0, σ0²) prior.

A Block-based Sampler. Given a block size B, for each MCMC iteration the number of scans over the data is m / B, while the number of aggregates and the computation cost per scan grow with B; the block size trades the two off against each other.

Time Cost. Figure. The time cost per MCMC iteration for linear regression on the 20-newsgroups dataset, where 10,000 distinct words and 2M documents are used.

Problem 4. Problem: given the 20-newsgroups dataset with a large fraction of missing values, the task is to recover those values. Dataset: http://qwone.com/~jason/20newsgroups/20news19997.tar.gz. Each document becomes a sparse feature vector in which many entries are missing (shown as '?').

Problem Modeling. It is modeled as an imputation problem. Imputation is a really old topic, and there are more than twenty methods. One of the most widely used is the multivariate Gaussian distribution.

Challenges with Normal(μ, Σ). Computing Σ^{-1} from Σ takes Θ(m³), where m is the number of dimensions. Estimating Σ from data takes Θ(n·m²), where n is the number of data points. Σ must stay positive-definite. A GMRF (Gaussian Markov random field) has similar problems.

Solution: PGRF (Pairwise Gaussian Random Field). It sidesteps the covariance matrix and achieves high performance: the algorithm's complexity is linear in the scale of the data.

Simple Case 1. Figure. Three dimensions are strongly correlated. Multi-Gaussian model: p(x) = (2π)^{−m/2} |Σ|^{−1/2} exp{−0.5 (x − μ)^T Σ^{−1} (x − μ)}. PGRF model for complete correlations: replace the joint covariance with a product of pairwise potentials, one for each correlated pair of dimensions.

Simple Case 2. Figure. Two dimensions are correlated. PGRF model with two correlations: the joint density is the product of pairwise potentials over the correlated pairs.

Simple Case 3. Figure. Four dimensions with two correlations. PGRF model with two correlations: with pairwise potentials Ψ over the correlated pairs, the joint density is the product of those potentials.

Simple Case 4. Input data: x. Correlation set Ψ = {(1, 3), (1, 5), (3, 5), (4, 5)}. Model: the joint density is the product of the pairwise potentials over the pairs in Ψ, divided by a normalizing constant Z. Figure. A PGRF model for 5-dimensional variables.

Generative Model. The correlation set Ψ is known. The parameters of the PGRF are generated from their priors, and each data point is then generated from the PGRF defined by Ψ. Figure. Graphical model for the Markov random field. Figure 6. Generative process.

Inference of PGRF Variables. Parallelizable: yes; an independent subgraph of variables can be sampled in parallel. Complexity: linear in m (the number of dimensions), n (the number of data points), and p (the number of input correlations).

Sampling the variables over the correlation graph. Figure. The correlation graph (vertices 1 to 7).

1. Find the maximum independent set (MIS). Figure. The correlation graph.

2. Sample the variables for the selected vertices. Figure. The correlation graph.

3. Find the MIS in the remaining graph. Figure. The correlation graph.

4. Sample the variables for the selected vertices. Figure. The correlation graph.

5. Sample the variables for the remaining vertices. Figure. The correlation graph.
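The steps above can be sketched as a scheduling loop. This is a minimal sketch assuming a greedy maximal-independent-set heuristic (the slides do not specify how the MIS is found); an actual sampler would draw each vertex's variable within each round, in parallel:

```python
def greedy_independent_set(vertices, edges):
    """Greedily pick a maximal independent set: no two picked vertices share an edge."""
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    picked, blocked = [], set()
    for v in sorted(vertices):
        if v not in blocked:
            picked.append(v)
            blocked.add(v)
            blocked.update(neighbors[v])   # neighbors can no longer be picked
    return picked

def sampling_rounds(vertices, edges):
    """Partition the correlation graph into rounds; vertices within one round
    are mutually non-adjacent and can be sampled in parallel."""
    remaining = set(vertices)
    rounds = []
    while remaining:
        active_edges = [(u, v) for u, v in edges if u in remaining and v in remaining]
        mis = greedy_independent_set(remaining, active_edges)
        rounds.append(mis)
        remaining -= set(mis)
    return rounds
```

For example, with the correlation set Ψ = {(1, 3), (1, 5), (3, 5), (4, 5)} from Simple Case 4 over vertices 1 to 5, the loop first samples {1, 2, 4} in parallel, then {3}, then {5}.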

Evaluation. Impact of the number of correlations. Linear regression with PGRF.

Evaluation. Scalability of our approach (PGRF). Dataset: 20newsgroups; dimensions: 10,000; each machine has a simulated copy of the dataset.

Conclusion. With Big Data, computational efficiency should be considered in both modeling and inference.