Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Size: px

Start display at page:

Download "Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University [email protected]"

Tobias Austin
10 years ago
Views:

1 Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University 1

2 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian ML on Spark SimSQL and Giraph (Next afternoon) Discussions (Thursday morning) 2

3 ML is Everywhere 3

4 Problem 1 Age Income Age Income???? Figure. An example for supervised regression. 4

12828 13748 14548 13160 13595 13915 15695 15906 16926 17428 Age 20

5 Problem 2 Figure. An example for unsupervised classification. 5

6 Bayesian ML To apply Bayesian ML: Use statistical process to explain how data is generated i.e. generative process Θ. Learn parameters/variables in the process given data Θ. Use Bayes' Rule to learn Θ D : ΘD = Θ (Θ) ~ ( ) likelihood Θ (Θ) prior 6

Learn parameters/variables in the process given data Θ.

7 Solution for Problem 1 Model: Bayesian LR Generate ~ (0 ) Generate ~ 0 Generate ~ Given each age : income ~ ( + σ02 a b y σ2 1 1 ) Age Income Age Income ???? Figure. Training set and test set 7

25 26 26 27 28 28 29 29 30 31 31 32 33 33 35 37 37 43 Income 9127 10192 12302 11467 12540 12597

8 Bayesian Inference Using Bayes theorem σ02 Let Θ = Θ = 0 Θ Θ = Θ (Θ) a ) b σ2 y 8

9 Inference Methods + ) Newton-Raphson algorithm gradient descent algorithm etc. Expectation maximization (EM) algorithm. Approximate inference methods. Sampling methods like Monte Carlo Markov chain (MCMC). 9

10 Markov Chain Monte Carlo Markov Chain A collection of random variables {Xt} (r = 0 1 ) having the property that P(Xt X0 X1..Xt-1) = P(Xt Xt-1) Markov Chain Monte Carlo (MCMC) A class of sampling algorithms to sample a distribution by constructing a Markov chain whose equilibrium states approximate the desired distribution. 10

.Xt-1) = P(Xt Xt-1) Markov Chain Monte Carlo (MCMC) A class of sampling

11 Bivariate Normal Distribution Input: mean: (0 0). covariance: [2 1; 1 2]. Output: A set of sampled data points. Steps: Select an initial point (x0 y0). Sample (xi+1 yi+1) based on (xi yi) repeatedly. After a burn-in number of steps collect samples periodically. 11

12 Monte Carlo Markov chain Gibbs sampling Initialize Θ while(true){ choose Θ; sample ~ Θ } ; Learning parameters in example 1 1) Initialize a and b; 2). 3) 4) 5).. 2) ) 0 0 ; ; 1 1 ;. 12

13 Monte Carlo Markov chain Each step is a standard distribution ( ) --- It is a normal distribution. 13

14 Code for Problem 1 example1.m 14

15 Rejection sampling We want to sample from 1) Find a simple distribution. 2) Find a constant such that 3) Generate 4) < ~ use and generate ~ as a sample; 1) 4). ( )

16 ML with Big Data Data is big. Row number. Column number. Problems It is harder to learn. It is harder to model. 16

17 Problem 3 Problem. Given the 20newsgroup dataset we want to create a binary classifier to classify religion and non-religion groups. Dataset: ~18000 non-religion documents ~2000 religion documents Figure. Data set. 17

to classify religion and non-religion groups. Dataset: http://qwone.

18 Problem Modeling ~18000 non-religion documents ~2000 religion documents [ [ : : : : [ : : : : : : : : ] ] ] 18

19 Bayesian LR Create a classifier SVM KNN etc. ( I still use Bayesian LR. ). Generative model Generate ~ 0 Generate ~ Given each doc : generate outcome ; // w is a column vector 1 1 ; // is a column vector ~ ; 19

3. Generate ~ 0 Generate ~ Given each doc : generate

20 Sample all.. where = ( at once = 0 ) σ2 + (X is a matrix and Y is a vector). 20

21 Problems with this approach. + + The computational complexity of is. The computational complexity of is and it is hard to parallelize. Given = 10 and = 10 both complexity can be up to 10 If we use double for repressors the complexity can be 10. Up to 5 hours per iteration with Amazon 100 m2.4xlarge nodes.. 21

22 One dimension at a time. It is a normal distribution. ( +!! ) ) 0 0 Problems: It is too slow: too many scans and scans follow each other. For large and it takes too much time. 22

23 A Block-based Sampler Sample a block of dimensions (. ) exp = ~ 2 ( + ) const 0 23

24 A Block-based Sampler (. ) exp ~ Given 2 = ( ) const for each MCMC iteration The number of scans over data: The number of aggregates: Computation cost: 0 = /

25 Time Cost Figure. The time cost per MCMC iteration for linear regression on 20newsgroup dataset where distinct words and 2M documents are used. 25

26 Problem 4 Problem. Given the 20-newsgroup dataset with a large fraction of missing values the task is to recover such values. Dataset: [ [ : :? [ :? : : :? : :? :? : : :?] ] ] 26

27 Problem Modeling It is modelled as an imputation problem. Really old topic and there are more than twenty of methods. One of the most widely used methods is Multi-Gaussian distribution. 27

28 Challenges Σ Ν( Σ) computed from Σ takes Θ m is the size of dimensions. Σ or Σ from data takes Θ n is the number of data points. Σ should be positive-definite. GMRF(Gaussian Markov Random Field) has similar problems. 28

29 Solution PGRF (Pairwise Gaussian Random Field) Sidestep the covariance. High performance. Algorithm complexity is linear with the scale of data. 29

30 Simple Case 1 L = ( and Σ = = Multi-Gaussian model 1 =.. Σ exp{ 0.5 Σ PGRF model for complete correlations Let definition for = = and ( ) ) ( } Figure. Three dimensions are strongly correlated. and similar : 30 )

31 Simple Case 2 PGRF model with two correlations Let = similar definition for = and ( ) and : ( ) ( ) Figure. Two dimensions are correlated. 31

32 Simple Case 3 ( ( ) ( ) Ψ ) Ψ Ψ Figure. Four dimensions with two correlations. PGRF model with two correlations Let = = = then: and 32

33 Simple Case 4 Input Data: x. Ψ = {(1 3) (1 5) (3 5) (4 5)}. Model Z= = ( ) ( ) Figure. A PGRF model for 5 dimensional variables. 33

34 Generative Model ( ) Figure. Graphical model for Markov random field. Ψ is known. 1 ~ ~ ~ : Ψ 1 Ω = : 1 1 ; Σ ; ( ); Figure 6. Generative process 34

35 Inference of PGRF variables parallel yes An independent subgraph of variables can be parallelized. complexity Θ( Θ( Θ(( m : the number of dimensions n: the number of data points p: the number of input correlations. Θ( ) ) + ) ) ) 35

36 Sample and 5 7 Figure. The correlation graph. 36

37 1. Find the maximum independent set (MIS) Figure. The correlation graph. 37

38 2. Sample and for selected vertices Figure. The correlation graph. 38

39 3. Find the MIS in the remaining graph Figure. The correlation graph. 39

40 4. Sample and for selected vertices Figure. The correlation graph. 40

41 5. Sample and for remaining vertices Figure. The correlation graph. 41

42 Evaluation Impact of the number of correlations. Linear regression with PGRF. 42

43 Evaluation Scalability of our approach (PGRF). Dataset: 20newsgroups Dimensions: Each machine has a simulated copy of dataset. 43

44 Conclusion In Big Data computation efficiency should be considered in both modelling and inference. 44

Bayesian networks - Time-series models - Apache Spark & Scala

Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly