# Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Save this PDF as:

Size: px
Start display at page:

Download "Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014"

## Transcription

1 Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014

2 The State of Big Data

3 Why probabilistic models for Big Data? 1. If you don t have to worry about overfitting, your model is likely too small. (Welling) 2. For scientific problems, prior knowledge is usually available. 3. Not all data points inform all latent variables. 4. We can have big data with small N (e.g. 20 experiments, 10 TB per experiment.) 5. How do you regularize models with complex dependencies with billions of parameters.

4 Trueskill Xbox Live ranking system. Predict player skill 2,000,000 matches per day 10,000,000 players

5 Bing advertising Predict expected click revenue for ads. Requires ~10ms response times. \$832 million in annual revenue. Every percent accuracy matters.

6 Main considerations for the Big Data revolution Trading off Accuracy/Time/Data size. Large scale infrastructure.

7 Trading off Accuracy-Time-Data Size For a given computational budget, can we guarantee a level of inferential accuracy as data grows in size? Two main approaches to achieve this: Divide and conquer: divide in subproblems and piece solutions together. Algorithmic weakening: hierarchy of algorithms ordered by computational complexity. (Jordan, 2013, Bernoulli 19(4), )

8 STOCHASTIC VARIATIONAL INFERENCE Bringing the power of stochastic gradient descent to variational inference For a complete understanding of SVI, three components are required Stochastic Gradient Descent Natural Gradients Variational Inference Alex Davies and Roger Frigola 1 / 43

9 STOCHASTIC VARIATIONAL INFERENCE SVI applies to models of the following form: Local and global latent parameters Conjugate distributions Alex Davies and Roger Frigola 2 / 43

10 STOCHASTIC GRADIENT DESCENT The poster child of large scale optimization. Alex Davies and Roger Frigola 3 / 43

11 STOCHASTIC GRADIENT DESCENT The trade-offs of large scale learning: ɛ = E [ E( f n ) E(f n ) ] + E [E(f n ) E(f F)] + E [E(f F) E(f )] (1) f is the optimal function f F is the best function in our function class f n is the empirical risk minimizing function f n is the approximation to f n from early stopping of optimization Alex Davies and Roger Frigola 4 / 43

12 STOCHASTIC GRADIENT DESCENT The trade-offs of large scale learning: ɛ = E [ E( f n ) E(f n ) ] + E [E(f n ) E(f F)] + E [E(f F) E(f )] (2) ɛ = ɛ optimization + ɛ estimation + ɛ approximation f is the optimal function f F is the best function in our function class f n is the empirical risk minimizing function f n is the approximation to f n from early stopping of optimization Alex Davies and Roger Frigola 5 / 43

13 STOCHASTIC GRADIENT DESCENT GD SGD Time per iteration n 1 Iterations to accuracy ρ log 1 1 ρ ρ Time to accuracy ρ n log 1 1 ρ ρ 1 Time to excess error ɛ log ɛ α ɛ ɛ Line 4 comes from mystical frequentist bounds with α [1, 2]. It assumes strong convexity or certain assumptions and also that we have optimally traded off between the three forms of error. Alex Davies and Roger Frigola 6 / 43

14 STOCHASTIC GRADIENT DESCENT But convergence is easy to assure. All that is required is the following constraints on the step size schedule: ρt = ρ 2 t < Alex Davies and Roger Frigola 7 / 43

15 STOCHASTIC VARIATIONAL INFERENCE Bringing the power of stochastic gradient descent to variational inference For a complete understanding of SVI, three components are required Stochastic Gradient Descent Natural Gradients Variational Inference Alex Davies and Roger Frigola 8 / 43

16 NATURAL GRADIENTS Natural gradients are a sensible choice of gradients that can be used when optimizing probability distributions. Naturally. Alex Davies and Roger Frigola 9 / 43

17 NATURAL GRADIENTS With gradient descent, we want to optimize through some nice space with respect to the function we are optimizing. It would be nice if it was pretty smooth and well behaved, and the length-scale over which the function tends to vary stays relatively stable. Alex Davies and Roger Frigola 10 / 43

18 NATURAL GRADIENTS When optimizing a probability distribution, our loss function is a function of the distribution. But we represent our distributions in terms of their parameters. Hopefully our loss function is nice wrt the probability distribution, but unless we have very specifically chosen the parameterization for the loss function we are optimizing, optimizing in parameter space will probably make things worse. θ p(θ) f (p(θ)) Alex Davies and Roger Frigola 11 / 43

19 NATURAL GRADIENTS When optimizing a probability distribution, our loss function is a function of the distribution. But we represent our distributions in terms of their parameters. Hopefully our loss function is nice wrt the probability distribution, but unless we have very specifically chosen the parameterization for the loss function we are optimizing, optimizing in parameter space will probably make things worse. Alex Davies and Roger Frigola 12 / 43

20 NATURAL GRADIENTS Example: Normal distribution with standard parameterization Alex Davies and Roger Frigola 13 / 43

21 NATURAL GRADIENTS A step size of 10 in parameter space takes us from N (0, 10000) N (10, 10000) while a step size of 1 in parameter space can take us from N (0, 1) N (1, 1) Alex Davies and Roger Frigola 14 / 43

22 NATURAL GRADIENTS When we are optimizing a probability distribution, we want to take nice consistent steps in probability space. ie from one step to another, we want to have moved a consistent distance. A sensible measure of distance for probability distributions is the symmetrized KL divergence. Alex Davies and Roger Frigola 15 / 43

23 NATURAL GRADIENTS So how about... Instead of finding the direction we go for arg max f (θ + dθ) where dθ < ɛ dθ arg max f (θ + dθ) where D sym KL (θ, θ + dθ) < ɛ dθ Alex Davies and Roger Frigola 16 / 43

24 NATURAL GRADIENTS So how about... Instead of finding the direction we go for arg max f (θ + dθ) where dθ < ɛ dθ arg max f (θ + dθ) where D sym KL (θ, θ + dθ) < ɛ dθ To do this, we can make a linear transformation of our space so that d θ = D sym KL (θ, θ + dθ) Alex Davies and Roger Frigola 17 / 43

25 NATURAL GRADIENTS It turns out that you can do this by using the inverse of the Fisher information matrix G as the linear transformation. ˆ f = G 1 f (3) Now we have a sensible optimization step that s independent of the parameterization of our distribution. Alex Davies and Roger Frigola 18 / 43

26 STOCHASTIC VARIATIONAL INFERENCE Bringing the power of stochastic gradient descent to variational inference For a complete understanding of SVI, three components are required Stochastic Gradient Descent Natural Gradients Variational Inference Alex Davies and Roger Frigola 19 / 43

27 VARIATIONAL INFERENCE Approximate the full distribution with one in a nice class, that we can calculate with easily. Q(θ φ) P(θ X) Alex Davies and Roger Frigola 20 / 43

28 VARIATIONAL INFERENCE Approximate the full distribution with one in a nice class, that we can calculate with easily. Q(θ φ) P(θ X) We need to find the value of φ that makes Q(θ φ) as close as possible to P(X θ). Alex Davies and Roger Frigola 21 / 43

29 VARIATIONAL INFERENCE We already know the go-to measure for distance between distributions, KL divergence. So we want to minimize: D KL (Q(θ φ), P(θ X)) Alex Davies and Roger Frigola 22 / 43

30 VARIATIONAL INFERENCE : KL-DIVERGENCE We already know the go-to measure for distance between distributions, KL divergence. So we want to minimize: D KL (Q(θ φ), P(θ X)) = E Q [log Q(θ φ)] E Q [log P(θ X)] Alex Davies and Roger Frigola 23 / 43

31 VARIATIONAL INFERENCE : KL-DIVERGENCE We already know the go-to measure for distance between distributions, KL divergence. So we want to minimize: D KL (Q(θ φ), P(θ X)) = E Q [log Q(θ φ)] E Q [log P(θ X)] D KL (Q, P) = E Q [log Q(θ φ)] E Q [log P(θ, X)] + } {{ } log P(X) } {{ } -ELBO Marginal likelihood Alex Davies and Roger Frigola 24 / 43

32 VARIATIONAL INFERENCE : KL-DIVERGENCE D KL (Q, P) = E Q [log Q(θ φ)] E Q [log P(θ, X)] + } {{ } log P(X) } {{ } -ELBO Log marginal likelihood 1. Since log P(X) is independent of φ, we can minimize the KL by maximizing the ELBO 2. Since the KL is always positive, the ELBO is a lower-bound on the log marginal likelihood Alex Davies and Roger Frigola 25 / 43

33 VARIATIONAL INFERENCE : KL-DIVERGENCE To optimize the parameters of the variational distribution φ, optimize the ELBO using your favourite gradient method. L(φ) = E Q [log P(θ, X)] E Q [log Q(θ φ)] Alex Davies and Roger Frigola 26 / 43

34 STOCHASTIC VARIATIONAL INFERENCE Bringing the power of stochastic gradient descent to variational inference For a complete understanding of SVI, three components are required Stochastic Gradient Descent Natural Gradients Variational Inference Alex Davies and Roger Frigola 27 / 43

35 STOCHASTIC VARIATIONAL INFERENCE The derivative of the ELBO in a conjugate, global and local variable model: λ L(φ) = G(E Q [η(x, z, α)] λ) (4) Alex Davies and Roger Frigola 28 / 43

36 STOCHASTIC VARIATIONAL INFERENCE The derivative of the ELBO in a conjugate, global and local variable model: λ L(φ) = G(E [η(x, z, α)] λ) (5) That G looks like a total hassle to calculate. (Quadratic in number of variational parameters). Alex Davies and Roger Frigola 29 / 43

37 STOCHASTIC VARIATIONAL INFERENCE The derivative of the ELBO in a conjugate, global and local variable model: λ L(φ) = G(E [η(x, z, α)] λ) (6) That G looks like a total hassle to calculate. (Quadratic in number of variational parameters). λ L(φ) = G 1 G(E [η(x, z, α)] λ) (7) = E [η(x, z, α)] λ (8) Alex Davies and Roger Frigola 30 / 43

38 STOCHASTIC VARIATIONAL INFERENCE The natural derivative of the ELBO in a conjugate, global and local variable model: λ L(φ) = E [η(x, z, α)] λ (9) (t are the sufficient statistics) = α + N(E φi [t(x i, z i )], 1) λ (10) Alex Davies and Roger Frigola 31 / 43

39 STOCHASTIC VARIATIONAL INFERENCE Breather slider. Alex Davies and Roger Frigola 32 / 43

40 GAUSSIAN PROCESSES WITH INDUCING POINTS Alex Davies and Roger Frigola 33 / 43

41 GAUSSIAN PROCESSES WITH INDUCING POINTS SVI applies to models of the following form: Local and global parameters Conjugate distributions Alex Davies and Roger Frigola 34 / 43

42 STOCHASTIC VARIATIONAL INFERENCE FOR GPS In order to apply stochastic variational inference to GPs, the inducing points from sparse GPs become our global variational parameters. Demo time! Alex Davies and Roger Frigola 35 / 43

43 LARGE SCALE GRAPHICAL MODELS Many of the hardest things about scaling a system to truly big data are actually engineering challenges that come from distributed computation. 1. Shared memory distribution 2. Communciation 3. Synchronization 4. Fault tolerance This shouldn t be our job. Alex Davies and Roger Frigola 36 / 43

44 LARGE SCALE GRAPHICAL MODELS Hadoop was the first excellent programming paradigm for big data. Alex Davies and Roger Frigola 37 / 43

45 LARGE SCALE GRAPHICAL MODELS You use a restricted programming structure, Map-Reduce, and everything else just magically works. Map-Reduce turns out to be very useful for Data processing tasks but not so much for Machine learning. Alex Davies and Roger Frigola 38 / 43

46 LARGE SCALE GRAPHICAL MODELS You use a restricted programming structure, Map-Reduce, and everything else just magically works. Map-Reduce turns out to be very useful for Data processing tasks but not so much for Machine learning. Alex Davies and Roger Frigola 39 / 43

47 LARGE SCALE GRAPHICAL MODELS An alternative to Map-Reduce is Bulk Synchronous Processing (BSP), which is implemented in Giraph and GraphLab Alex Davies and Roger Frigola 40 / 43

48 LARGE SCALE GRAPHICAL MODELS The power of these frameworks are in their simplicity. Rather than specifying a map and a reduce function you simply specify a node function. At each iteration a node can perform local computations, recieve messages along it s edges from the previous round of computation and send message out along it s edges. Alex Davies and Roger Frigola 41 / 43

49 Page rank code Alex Davies and Roger Frigola 42 / 43

50 LARGE SCALE GRAPHICAL MODELS Summary: If you can write an algorithm that iteratively computes locally on nodes in a graph, you can use Giraph and GraphLab and scale to billions of nodes without every knowing a thing about parallel architecture. Alex Davies and Roger Frigola 43 / 43

51 MCMC for Big Data Conventional wisdom: MCMC not suited for Big Data. Is MCMC the optimal way to sample a posterior for large data sets and limited computation time?

52 MCMC Data y, unknown parameter θ GOAL: sample posterior where p(y θ) = N i=1 p(y i θ) p(θ y) p(y θ) p(θ) by obtaining a set {θ 1, θ 2,..., θ T } distributed according to p(θ y). I = E p(θ y) [f (θ)] 1 T T f (θ t ), t=1 with θ t p(θ y)

53 Metropolis-Hastings Simple algorithm to generate a Markov chain with a given stationary distribution. Start with θ 0 1. Draw candidate θ q(θ θ t ). 2. Compute acceptance probability P a = min(1, p(θ y) q(θ t θ ) p(θ t y) q(θ θ t ) ) 3. Draw u Uniform[0, 1]. If u < P a set θ t+1 θ otherwise set θ t+1 θ t. PROBLEM: evaluating p(θ y) is O(N).

54 Bias and Variance Samples can be used to compute expectations wrt the posterior I = E p(θ y) [f (θ)] I Î = 1 T f (θ t ) T t=1 After burn-in, Î is an unbiased estimator of I E chains [Î] = I Var chains [Î] = σ2 f τ T

55 New Paradigm: Accept Bias - Reduce Variance If we can sample faster, T grows and the variance reduces. Risk is useful to study bias/variance trade-off R = E chains [(I Î)2 ] = B 2 ɛ + V ɛ The setting of ɛ that minimises risk depends on available computation time. Traditional MCMC setting assumes T, best strategy is to have no bias since variance will go down to zero.

56 Existing Methods Use mini-batches of data and omit MH acceptance test Stochastic Gradient Langevin Dynamics (Welling and Teh, 2011) and Stochastic Gradient Fisher Scoring (Ahn, Koratikkara and Welling, 2012) Approximate MH Acceptance test (Koratikkara, Chen and Welling, 2014) and (Bardenet, Doucet and Holmes, 2014)

57 Approximate MH Acceptance Test Acceptance probability P a = min(1, p(θ y) q(θ t θ ) p(θ t y) q(θ θ t ) ) O(N) computation to obtain one bit of information. Options: Use unbiased estimators of the likelihood that use only a subset of data high variance. Reformulate MH acceptance test as a statistical decision problem.

58 Approximate MH Acceptance Test Equivalent reformulation of MH acceptance test 1. Draw u Uniform[0, 1]. 2. Compute µ 0 = 1 N log(u p(θ t) q(θ θ t ) p(θ ) q(θ t θ ) ) µ = 1 N N i=1 l i, l i = log p(y i θ ) p(y i θ t ) 3. If µ > µ 0 set θ t+1 θ otherwise set θ t+1 θ t. This looks like a hypothesis test!

59 Approximate MH Acceptance Test Approximate µ with a random subset of the data µ l = Test statistic ( n j=1 l j, l j = log p(y j θ ) p(y j θ t ) s l = l 2 ( l) 2 n ), Standard deviation of l (2) n 1 s l = s l 1 n 1 n N 1, Standard deviation of l (3) t = l µ 0 s l If n is large enough for Central Limit Theorem to hold, and µ = µ 0 t follows a standard Student-t dist. with n-1 DOF. (1)

60 Approximate MH Acceptance Test If 1 Φ n 1 ( t ) < ɛ, µ is statistically significantly different from µ 0. ( where Φ n 1 is a Student-t cdf) Then, if 1 Φ n 1 ( t ) < ɛ, the approximate MH test becomes: if l > µ0 set θ t+1 θ otherwise set θ t+1 θ t. If the test is not significant at a level ɛ we draw more samples and recompute t.

61 Recap: Approximate MH Acceptance Test Can often make confident decisions even when n < N. Saves time that can be used to draw longer chains and hence reduce variance. High ɛ High bias, fast Low ɛ Low bias, slow (high n required) ɛ is a knob to trade off bias and variance.

62 Optimal Sequential Test Design How to choose the parameters of the algorithm? Choose initial mini-batch size 500 for Central Limit Theorem to hold. Keep ɛ as small as possible while maintaining low average data use.

63 Experiments: Independent Component Analysis Posterior over unmixing matrix. Risk in mean Amari distance to ground truth.

64 Experiments: Stochastic Gradient Langevin Dynamics (SGLD) + Approximate MH Test Linear regression with Laplace prior. SGLD can be thrown off track in areas of large gradients and low density.

### Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### Section 5. Stan for Big Data. Bob Carpenter. Columbia University

Section 5. Stan for Big Data Bob Carpenter Columbia University Part I Overview Scaling and Evaluation data size (bytes) 1e18 1e15 1e12 1e9 1e6 Big Model and Big Data approach state of the art big model

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget

Anoop Korattikara AKORATTI@UCI.EDU School of Information & Computer Sciences, University of California, Irvine, CA 92617, USA Yutian Chen YUTIAN.CHEN@ENG.CAM.EDU Department of Engineering, University of

### Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Towards running complex models on big data

Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation

### Bayesian Statistics in One Hour. Patrick Lam

Bayesian Statistics in One Hour Patrick Lam Outline Introduction Bayesian Models Applications Missing Data Hierarchical Models Outline Introduction Bayesian Models Applications Missing Data Hierarchical

### Statistical Machine Learning from Data

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

### L3: Statistical Modeling with Hadoop

L3: Statistical Modeling with Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 10, 2014 Today we are going to learn...

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### Lab 8: Introduction to WinBUGS

40.656 Lab 8 008 Lab 8: Introduction to WinBUGS Goals:. Introduce the concepts of Bayesian data analysis.. Learn the basic syntax of WinBUGS. 3. Learn the basics of using WinBUGS in a simple example. Next

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

### Inference of Probability Distributions for Trust and Security applications

Inference of Probability Distributions for Trust and Security applications Vladimiro Sassone Based on joint work with Mogens Nielsen & Catuscia Palamidessi Outline 2 Outline Motivations 2 Outline Motivations

### Exploiting the Statistics of Learning and Inference

Exploiting the Statistics of Learning and Inference Max Welling Institute for Informatics University of Amsterdam Science Park 904, Amsterdam, Netherlands m.welling@uva.nl Abstract. When dealing with datasets

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

### Variational Mean Field for Graphical Models

Variational Mean Field for Graphical Models CS/CNS/EE 155 Baback Moghaddam Machine Learning Group baback @ jpl.nasa.gov Approximate Inference Consider general UGs (i.e., not tree-structured) All basic

### Imperfect Debugging in Software Reliability

Imperfect Debugging in Software Reliability Tevfik Aktekin and Toros Caglar University of New Hampshire Peter T. Paul College of Business and Economics Department of Decision Sciences and United Health

### Computational Statistics for Big Data

Lancaster University Computational Statistics for Big Data Author: 1 Supervisors: Paul Fearnhead 1 Emily Fox 2 1 Lancaster University 2 The University of Washington September 1, 2015 Abstract The amount

### The Exponential Family

The Exponential Family David M. Blei Columbia University November 3, 2015 Definition A probability density in the exponential family has this form where p.x j / D h.x/ expf > t.x/ a./g; (1) is the natural

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Parallelization Strategies for Multicore Data Analysis

Parallelization Strategies for Multicore Data Analysis Wei-Chen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management

### Introduction to Hypothesis Testing. Point estimation and confidence intervals are useful statistical inference procedures.

Introduction to Hypothesis Testing Point estimation and confidence intervals are useful statistical inference procedures. Another type of inference is used frequently used concerns tests of hypotheses.

### Reject Inference in Credit Scoring. Jie-Men Mok

Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

### Designing a learning system

Lecture Designing a learning system Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x4-8845 http://.cs.pitt.edu/~milos/courses/cs750/ Design of a learning system (first vie) Application or Testing

### Gaussian Processes to Speed up Hamiltonian Monte Carlo

Gaussian Processes to Speed up Hamiltonian Monte Carlo Matthieu Lê Murray, Iain http://videolectures.net/mlss09uk_murray_mcmc/ Rasmussen, Carl Edward. "Gaussian processes to speed up hybrid Monte Carlo

### (Quasi-)Newton methods

(Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

### Tutorial on Markov Chain Monte Carlo

Tutorial on Markov Chain Monte Carlo Kenneth M. Hanson Los Alamos National Laboratory Presented at the 29 th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Technology,

### Sampling and Hypothesis Testing

Population and sample Sampling and Hypothesis Testing Allin Cottrell Population : an entire set of objects or units of observation of one sort or another. Sample : subset of a population. Parameter versus

### Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

### Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Introduction to Markov Chain Monte Carlo

Introduction to Markov Chain Monte Carlo Monte Carlo: sample from a distribution to estimate the distribution to compute max, mean Markov Chain Monte Carlo: sampling using local information Generic problem

### P (x) 0. Discrete random variables Expected value. The expected value, mean or average of a random variable x is: xp (x) = v i P (v i )

Discrete random variables Probability mass function Given a discrete random variable X taking values in X = {v 1,..., v m }, its probability mass function P : X [0, 1] is defined as: P (v i ) = Pr[X =

### Nonlinear Optimization: Algorithms 3: Interior-point methods

Nonlinear Optimization: Algorithms 3: Interior-point methods INSEAD, Spring 2006 Jean-Philippe Vert Ecole des Mines de Paris Jean-Philippe.Vert@mines.org Nonlinear optimization c 2006 Jean-Philippe Vert,

### Part 2: Analysis of Relationship Between Two Variables

Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable

### Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

### A Latent Variable Approach to Validate Credit Rating Systems using R

A Latent Variable Approach to Validate Credit Rating Systems using R Chicago, April 24, 2009 Bettina Grün a, Paul Hofmarcher a, Kurt Hornik a, Christoph Leitner a, Stefan Pichler a a WU Wien Grün/Hofmarcher/Hornik/Leitner/Pichler

### Christfried Webers. Canberra February June 2015

c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

### Gaussian Processes for Big Data Problems

Gaussian Processes for Big Data Problems Marc Deisenroth Department of Computing Imperial College London http://wp.doc.ic.ac.uk/sml/marc-deisenroth Machine Learning Summer School, Chalmers University 14

### 11. Time series and dynamic linear models

11. Time series and dynamic linear models Objective To introduce the Bayesian approach to the modeling and forecasting of time series. Recommended reading West, M. and Harrison, J. (1997). models, (2 nd

### Bayesian Image Super-Resolution

Bayesian Image Super-Resolution Michael E. Tipping and Christopher M. Bishop Microsoft Research, Cambridge, U.K..................................................................... Published as: Bayesian

### Linear Models for Classification

Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci

### Markov Chain Monte Carlo Simulation Made Simple

Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical

### CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### Distributed Bayesian Posterior Sampling via Moment Sharing

Distributed Bayesian Posterior Sampling via Moment Sharing Minjie Xu 1, Balaji Lakshminarayanan 2, Yee Whye Teh 3, Jun Zhu 1, and Bo Zhang 1 1 State Key Lab of Intelligent Technology and Systems; Tsinghua

### Cheng Soon Ong & Christfried Webers. Canberra February June 2016

c Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 31 c Part I

### THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE Alexer Barvinok Papers are available at http://www.math.lsa.umich.edu/ barvinok/papers.html This is a joint work with J.A. Hartigan

### large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

### Inference on Phase-type Models via MCMC

Inference on Phase-type Models via MCMC with application to networks of repairable redundant systems Louis JM Aslett and Simon P Wilson Trinity College Dublin 28 th June 202 Toy Example : Redundant Repairable

### Tutorial on variational approximation methods. Tommi S. Jaakkola MIT AI Lab

Tutorial on variational approximation methods Tommi S. Jaakkola MIT AI Lab tommi@ai.mit.edu Tutorial topics A bit of history Examples of variational methods A brief intro to graphical models Variational

### Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

### Hypothesis Testing. 1 Introduction. 2 Hypotheses. 2.1 Null and Alternative Hypotheses. 2.2 Simple vs. Composite. 2.3 One-Sided and Two-Sided Tests

Hypothesis Testing 1 Introduction This document is a simple tutorial on hypothesis testing. It presents the basic concepts and definitions as well as some frequently asked questions associated with hypothesis

### These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### 4. Introduction to Statistics

Statistics for Engineers 4-1 4. Introduction to Statistics Descriptive Statistics Types of data A variate or random variable is a quantity or attribute whose value may vary from one unit of investigation

### Message-passing sequential detection of multiple change points in networks

Message-passing sequential detection of multiple change points in networks Long Nguyen, Arash Amini Ram Rajagopal University of Michigan Stanford University ISIT, Boston, July 2012 Nguyen/Amini/Rajagopal

### 171:290 Model Selection Lecture II: The Akaike Information Criterion

171:290 Model Selection Lecture II: The Akaike Information Criterion Department of Biostatistics Department of Statistics and Actuarial Science August 28, 2012 Introduction AIC, the Akaike Information

### A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data

A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data Faming Liang University of Florida August 9, 2015 Abstract MCMC methods have proven to be a very powerful tool for analyzing

### Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July 4 2012 Data A lot of

### Recent Developments of Statistical Application in. Finance. Ruey S. Tsay. Graduate School of Business. The University of Chicago

Recent Developments of Statistical Application in Finance Ruey S. Tsay Graduate School of Business The University of Chicago Guanghua Conference, June 2004 Summary Focus on two parts: Applications in Finance:

### Simple Linear Regression Inference

Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

### Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits

Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis Pelle Jakovits Outline Problem statement State of the art Approach Solutions and contributions Current work Conclusions

### Cell Phone based Activity Detection using Markov Logic Network

Cell Phone based Activity Detection using Markov Logic Network Somdeb Sarkhel sxs104721@utdallas.edu 1 Introduction Mobile devices are becoming increasingly sophisticated and the latest generation of smart

### Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

### Improving Generalization

Improving Generalization Introduction to Neural Networks : Lecture 10 John A. Bullinaria, 2004 1. Improving Generalization 2. Training, Validation and Testing Data Sets 3. Cross-Validation 4. Weight Restriction

### Master s Theory Exam Spring 2006

Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem

### Semi-Supervised Support Vector Machines and Application to Spam Filtering

Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery

### 38. STATISTICS. 38. Statistics 1. 38.1. Fundamental concepts

38. Statistics 1 38. STATISTICS Revised September 2013 by G. Cowan (RHUL). This chapter gives an overview of statistical methods used in high-energy physics. In statistics, we are interested in using a

### Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

### Generating Random Numbers Variance Reduction Quasi-Monte Carlo. Simulation Methods. Leonid Kogan. MIT, Sloan. 15.450, Fall 2010

Simulation Methods Leonid Kogan MIT, Sloan 15.450, Fall 2010 c Leonid Kogan ( MIT, Sloan ) Simulation Methods 15.450, Fall 2010 1 / 35 Outline 1 Generating Random Numbers 2 Variance Reduction 3 Quasi-Monte

### Machine Learning over Big Data

Machine Learning over Big Presented by Fuhao Zou fuhao@hust.edu.cn Jue 16, 2014 Huazhong University of Science and Technology Contents 1 2 3 4 Role of Machine learning Challenge of Big Analysis Distributed

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### Model-based Synthesis. Tony O Hagan

Model-based Synthesis Tony O Hagan Stochastic models Synthesising evidence through a statistical model 2 Evidence Synthesis (Session 3), Helsinki, 28/10/11 Graphical modelling The kinds of models that

### Introduction to Convex Optimization for Machine Learning

Introduction to Convex Optimization for Machine Learning John Duchi University of California, Berkeley Practical Machine Learning, Fall 2009 Duchi (UC Berkeley) Convex Optimization for Machine Learning

### Regression Using Support Vector Machines: Basic Foundations

Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering

### Generalized Linear Models. Today: definition of GLM, maximum likelihood estimation. Involves choice of a link function (systematic component)

Generalized Linear Models Last time: definition of exponential family, derivation of mean and variance (memorize) Today: definition of GLM, maximum likelihood estimation Include predictors x i through

### MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

### Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem. Lecture 12 04/08/2008. Sven Zenker

Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem Lecture 12 04/08/2008 Sven Zenker Assignment no. 8 Correct setup of likelihood function One fixed set of observation

### Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091)

Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091) Magnus Wiktorsson Centre for Mathematical Sciences Lund University, Sweden Lecture 5 Sequential Monte Carlo methods I February

### ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANI-KALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### Part 2: One-parameter models

Part 2: One-parameter models Bernoilli/binomial models Return to iid Y 1,...,Y n Bin(1, θ). The sampling model/likelihood is p(y 1,...,y n θ) =θ P y i (1 θ) n P y i When combined with a prior p(θ), Bayes

### Invited Applications Paper

Invited Applications Paper - - Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK THOREG@MICROSOFT.COM JOAQUINC@MICROSOFT.COM

### How to assess the risk of a large portfolio? How to estimate a large covariance matrix?

Chapter 3 Sparse Portfolio Allocation This chapter touches some practical aspects of portfolio allocation and risk assessment from a large pool of financial assets (e.g. stocks) How to assess the risk

### BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION

BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION Ş. İlker Birbil Sabancı University Ali Taylan Cemgil 1, Hazal Koptagel 1, Figen Öztoprak 2, Umut Şimşekli

### Detecting Corporate Fraud: An Application of Machine Learning

Detecting Corporate Fraud: An Application of Machine Learning Ophir Gottlieb, Curt Salisbury, Howard Shek, Vishal Vaidyanathan December 15, 2006 ABSTRACT This paper explores the application of several

### Statistical machine learning, high dimension and big data

Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,

### INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition)

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) Abstract Indirect inference is a simulation-based method for estimating the parameters of economic models. Its

### Exponential Random Graph Models for Social Network Analysis. Danny Wyatt 590AI March 6, 2009

Exponential Random Graph Models for Social Network Analysis Danny Wyatt 590AI March 6, 2009 Traditional Social Network Analysis Covered by Eytan Traditional SNA uses descriptive statistics Path lengths