# Christfried Webers, Canberra, February–June 2015


## Transcription

1 Statistical Machine Learning. Research Group and College of Engineering and Computer Science, Canberra, February–June 2015. (Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning".)

2 Part VIII: Linear Classification 2

3 Three Models for Decision Problems

In increasing order of complexity:

Discriminant functions: Find a discriminant function $f(x)$ which maps each input directly onto a class label.

Discriminative models:
1. Solve the inference problem of determining the posterior class probabilities $p(\mathcal{C}_k \mid x)$.
2. Use decision theory to assign each new $x$ to one of the classes.

Generative models:
1. Solve the inference problem of determining the class-conditional probabilities $p(x \mid \mathcal{C}_k)$.
2. Also, infer the prior class probabilities $p(\mathcal{C}_k)$.
3. Use Bayes' theorem to find the posterior $p(\mathcal{C}_k \mid x)$.
4. Alternatively, model the joint distribution $p(x, \mathcal{C}_k)$ directly.
5. Use decision theory to assign each new $x$ to one of the classes.

4 Generative approach: model the class-conditional densities $p(x \mid \mathcal{C}_k)$ and priors $p(\mathcal{C}_k)$ to calculate the posterior probability for class $\mathcal{C}_1$:

$$ p(\mathcal{C}_1 \mid x) = \frac{p(x \mid \mathcal{C}_1)\, p(\mathcal{C}_1)}{p(x \mid \mathcal{C}_1)\, p(\mathcal{C}_1) + p(x \mid \mathcal{C}_2)\, p(\mathcal{C}_2)} = \frac{1}{1 + \exp(-a(x))} = \sigma(a(x)) $$

where $a(x)$ and the logistic sigmoid function $\sigma(a)$ are given by

$$ a(x) = \ln \frac{p(x \mid \mathcal{C}_1)\, p(\mathcal{C}_1)}{p(x \mid \mathcal{C}_2)\, p(\mathcal{C}_2)} = \ln \frac{p(x, \mathcal{C}_1)}{p(x, \mathcal{C}_2)}, \qquad \sigma(a) = \frac{1}{1 + \exp(-a)}. $$

5 Logistic Sigmoid

The logistic sigmoid function $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is sometimes called a "squashing function" because it maps the whole real axis into the finite interval $(0, 1)$.

Symmetry: $\sigma(-a) = 1 - \sigma(a)$.

Derivative: $\frac{d}{da}\sigma(a) = \sigma(a)\,\sigma(-a) = \sigma(a)\,(1 - \sigma(a))$.

The inverse is called the logit function: $a(\sigma) = \ln\left(\frac{\sigma}{1 - \sigma}\right)$.

(Figures: the logistic sigmoid $\sigma(a)$ and the logit $a(\sigma)$.)
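A minimal numerical check of these three properties (a sketch assuming NumPy; all names are illustrative):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    """Inverse of the sigmoid: a(sigma) = ln(sigma / (1 - sigma))."""
    return np.log(s / (1.0 - s))

a = np.linspace(-5.0, 5.0, 11)

# Symmetry: sigma(-a) = 1 - sigma(a)
assert np.allclose(sigmoid(-a), 1.0 - sigmoid(a))

# Derivative: d/da sigma(a) = sigma(a) (1 - sigma(a)), checked by central differences
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2.0 * eps)
assert np.allclose(numeric, sigmoid(a) * (1.0 - sigmoid(a)), atol=1e-8)

# The logit inverts the sigmoid
assert np.allclose(logit(sigmoid(a)), a)
```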

6 Multiclass

The normalised exponential is given by

$$ p(\mathcal{C}_k \mid x) = \frac{p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{\sum_j p(x \mid \mathcal{C}_j)\, p(\mathcal{C}_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)} $$

where $a_k = \ln(p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k))$.

Also called the softmax function, as it is a smoothed version of the max function. Example: if $a_k \gg a_j$ for all $j \neq k$, then $p(\mathcal{C}_k \mid x) \approx 1$ and $p(\mathcal{C}_j \mid x) \approx 0$.
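A short sketch of the softmax, written in the numerically stable form that subtracts $\max_j a_j$ before exponentiating (which leaves the result unchanged); the second call illustrates the "smoothed max" behaviour when one $a_k$ dominates:

```python
import numpy as np

def softmax(a):
    """Normalised exponential: p(C_k | x) = exp(a_k) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())   # shift by the maximum to avoid overflow
    return e / e.sum()

print(softmax([2.0, 1.0, 0.5]))    # comparable a_k: probability mass is spread
print(softmax([20.0, 1.0, 0.5]))   # a_1 >> a_j: p(C_1 | x) ~ 1, the rest ~ 0
```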

7 Probabilistic Generative Model

Assume the class-conditional probabilities are Gaussian and all classes share the same covariance. What can we say about the posterior probabilities?

$$ p(x \mid \mathcal{C}_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\} = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} x^T \Sigma^{-1} x \right\} \exp\left\{ \mu_k^T \Sigma^{-1} x - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k \right\} $$

where we have separated the term quadratic in $x$ from the term linear in $x$.

8 Probabilistic Generative Model

For two classes, $p(\mathcal{C}_1 \mid x) = \sigma(a(x))$, and $a(x)$ is

$$ a(x) = \ln \frac{p(x \mid \mathcal{C}_1)\, p(\mathcal{C}_1)}{p(x \mid \mathcal{C}_2)\, p(\mathcal{C}_2)} = \ln \frac{\exp\left\{ \mu_1^T \Sigma^{-1} x - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 \right\}}{\exp\left\{ \mu_2^T \Sigma^{-1} x - \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 \right\}} + \ln \frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}. $$

Therefore

$$ p(\mathcal{C}_1 \mid x) = \sigma(w^T x + w_0) $$

where

$$ w = \Sigma^{-1} (\mu_1 - \mu_2), \qquad w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}. $$
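A sketch of this result with made-up means, covariance, and priors (all values below are illustrative, not from the slides): compute $w$ and $w_0$ from the formulas above and evaluate $p(\mathcal{C}_1 \mid x) = \sigma(w^T x + w_0)$.

```python
import numpy as np

mu1 = np.array([1.0, 1.0])          # illustrative class means
mu2 = np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])       # shared covariance
p1, p2 = 0.6, 0.4                    # illustrative class priors

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(p1 / p2))

def posterior_c1(x):
    """p(C_1 | x) = sigma(w^T x + w_0)."""
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))

print(posterior_c1(np.array([1.0, 0.5])))    # near mu1: close to 1
print(posterior_c1(np.array([-1.0, -0.5])))  # near mu2: close to 0
```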

9 Probabilistic Generative Model

(Figure: class-conditional densities for two classes (left); posterior probability $p(\mathcal{C}_1 \mid x)$ (right). Note that the posterior is a logistic sigmoid of a linear function of $x$.)

10 General Case - K Classes, Shared Covariance

Use the normalised exponential

$$ p(\mathcal{C}_k \mid x) = \frac{p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{\sum_j p(x \mid \mathcal{C}_j)\, p(\mathcal{C}_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = \ln\left( p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k) \right), $$

to get a linear function of $x$:

$$ a_k(x) = w_k^T x + w_{k0} $$

where

$$ w_k = \Sigma^{-1} \mu_k, \qquad w_{k0} = -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln p(\mathcal{C}_k). $$
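The K-class construction as a sketch (illustrative means, covariance, and priors): build $w_k$ and $w_{k0}$ for every class at once, then push the $a_k(x)$ through the softmax from above.

```python
import numpy as np

mus = np.array([[2.0, 0.0],
                [0.0, 2.0],
                [-2.0, -2.0]])                 # K = 3 illustrative class means
priors = np.array([0.5, 0.3, 0.2])
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.2],
                                    [0.2, 1.0]]))

W = mus @ Sigma_inv                            # row k is w_k^T = (Sigma^{-1} mu_k)^T
w0 = (-0.5 * np.einsum('ki,ij,kj->k', mus, Sigma_inv, mus)
      + np.log(priors))                        # w_k0 = -mu_k^T Sigma^{-1} mu_k / 2 + ln p(C_k)

def posteriors(x):
    """Softmax over a_k(x) = w_k^T x + w_k0."""
    a = W @ x + w0
    e = np.exp(a - a.max())
    return e / e.sum()

print(posteriors(np.array([1.5, 0.5])))        # highest posterior for the closest class
```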

11 General Case - K Classes, Different Covariance

If each class-conditional probability has a different covariance $\Sigma_k$, the quadratic terms $-\frac{1}{2} x^T \Sigma_k^{-1} x$ no longer cancel each other out. We get a quadratic discriminant.

12 Maximum Likelihood Solution

Given the functional form of the class-conditional densities $p(x \mid \mathcal{C}_k)$, can we determine the parameters $\mu$ and $\Sigma$?

13 Maximum Likelihood Solution

Given the functional form of the class-conditional densities $p(x \mid \mathcal{C}_k)$, can we determine the parameters $\mu$ and $\Sigma$? Not without data ;-)

Given also a data set $(x_n, t_n)$ for $n = 1, \dots, N$. (We use the coding scheme where $t_n = 1$ corresponds to class $\mathcal{C}_1$ and $t_n = 0$ denotes class $\mathcal{C}_2$.)

Assume the class-conditional densities to be Gaussian with the same covariance, but different means. Denote the prior probability $p(\mathcal{C}_1) = \pi$, so that $p(\mathcal{C}_2) = 1 - \pi$. Then

$$ p(x_n, \mathcal{C}_1) = p(\mathcal{C}_1)\, p(x_n \mid \mathcal{C}_1) = \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma) $$
$$ p(x_n, \mathcal{C}_2) = p(\mathcal{C}_2)\, p(x_n \mid \mathcal{C}_2) = (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma) $$

14 Maximum Likelihood Solution

The likelihood for the whole data set $X$ and $\mathbf{t}$ is given by

$$ p(\mathbf{t}, X \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^N \left[ \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma) \right]^{t_n} \left[ (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma) \right]^{1 - t_n}. $$

Maximise the log likelihood. The term depending on $\pi$ is

$$ \sum_{n=1}^N \left( t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \right), $$

which is maximal for

$$ \pi = \frac{1}{N} \sum_{n=1}^N t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2} $$

where $N_1$ is the number of data points in class $\mathcal{C}_1$ (and $N_2$ the number in class $\mathcal{C}_2$).
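The maximising step, for completeness: setting the derivative with respect to $\pi$ to zero gives

$$ \frac{\partial}{\partial \pi} \sum_{n=1}^N \left( t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \right) = \frac{N_1}{\pi} - \frac{N_2}{1 - \pi} = 0 \quad \Longrightarrow \quad \pi = \frac{N_1}{N_1 + N_2}. $$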

15 Maximum Likelihood Solution

Similarly, we can maximise the log likelihood (and thereby the likelihood $p(\mathbf{t}, X \mid \pi, \mu_1, \mu_2, \Sigma)$) with respect to the means $\mu_1$ and $\mu_2$, and get

$$ \mu_1 = \frac{1}{N_1} \sum_{n=1}^N t_n x_n, \qquad \mu_2 = \frac{1}{N_2} \sum_{n=1}^N (1 - t_n)\, x_n. $$

For each class, these are the means of all input vectors assigned to that class.

16 Maximum Likelihood Solution

Finally, the log likelihood $\ln p(\mathbf{t}, X \mid \pi, \mu_1, \mu_2, \Sigma)$ can be maximised with respect to the covariance $\Sigma$, resulting in

$$ \Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2, \qquad S_k = \frac{1}{N_k} \sum_{n \in \mathcal{C}_k} (x_n - \mu_k)(x_n - \mu_k)^T. $$
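All four maximum-likelihood estimates in one sketch, on synthetic data (the generating parameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N1, N2 = 300, 200
cov_true = [[1.0, 0.5], [0.5, 1.0]]
X = np.vstack([rng.multivariate_normal([2.0, 2.0], cov_true, N1),    # class C_1
               rng.multivariate_normal([-2.0, 0.0], cov_true, N2)])  # class C_2
t = np.concatenate([np.ones(N1), np.zeros(N2)])                      # t_n = 1 for C_1
N = N1 + N2

pi_hat = t.mean()                                   # pi = N_1 / N
mu1 = (t[:, None] * X).sum(axis=0) / N1             # mean of inputs assigned to C_1
mu2 = ((1.0 - t)[:, None] * X).sum(axis=0) / N2     # mean of inputs assigned to C_2

d1 = X[t == 1] - mu1
d2 = X[t == 0] - mu2
S1 = d1.T @ d1 / N1                                 # class scatter matrices
S2 = d2.T @ d2 / N2
Sigma_hat = (N1 / N) * S1 + (N2 / N) * S2           # weighted average of S_1 and S_2

print(pi_hat, mu1, mu2)
print(Sigma_hat)
```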

17 Naive Bayes

Assume the input space consists of discrete features; in the simplest case, $x_i \in \{0, 1\}$. For a $D$-dimensional input space, a general distribution would be represented by a table with $2^D$ entries. Together with the normalisation constraint, these are $2^D - 1$ independent variables, growing exponentially with the number of features.

The naive Bayes assumption is that all features, conditioned on the class $\mathcal{C}_k$, are independent of each other:

$$ p(x \mid \mathcal{C}_k) = \prod_{i=1}^D \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}. $$

18 Naive Bayes

With the naive Bayes model

$$ p(x \mid \mathcal{C}_k) = \prod_{i=1}^D \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i} $$

we can then again find the factors $a_k$ in the normalised exponential

$$ p(\mathcal{C}_k \mid x) = \frac{p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{\sum_j p(x \mid \mathcal{C}_j)\, p(\mathcal{C}_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)} $$

as a linear function of the $x_i$:

$$ a_k(x) = \sum_{i=1}^D \left\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \right\} + \ln p(\mathcal{C}_k). $$
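A sketch of the Bernoulli naive Bayes discriminant with assumed per-class feature probabilities $\mu_{ki}$ and priors (all values illustrative); a small clip keeps the logarithms finite when some $\mu_{ki}$ is exactly 0 or 1:

```python
import numpy as np

mu = np.array([[0.8, 0.1, 0.6],     # mu_ki = p(x_i = 1 | C_k), illustrative
               [0.2, 0.7, 0.5]])    # K = 2 classes, D = 3 binary features
priors = np.array([0.5, 0.5])
x = np.array([1, 0, 1])             # one binary feature vector

eps = 1e-12
m = np.clip(mu, eps, 1.0 - eps)
# a_k(x) = sum_i { x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki) } + ln p(C_k)
a = (x * np.log(m) + (1 - x) * np.log(1.0 - m)).sum(axis=1) + np.log(priors)

e = np.exp(a - a.max())
print(e / e.sum())                  # posterior p(C_k | x) via the softmax
```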

19 Three Models for Decision Problems

In increasing order of complexity:

Discriminant functions: Find a discriminant function $f(x)$ which maps each input directly onto a class label.

Discriminative models:
1. Solve the inference problem of determining the posterior class probabilities $p(\mathcal{C}_k \mid x)$.
2. Use decision theory to assign each new $x$ to one of the classes.

Generative models:
1. Solve the inference problem of determining the class-conditional probabilities $p(x \mid \mathcal{C}_k)$.
2. Also, infer the prior class probabilities $p(\mathcal{C}_k)$.
3. Use Bayes' theorem to find the posterior $p(\mathcal{C}_k \mid x)$.
4. Alternatively, model the joint distribution $p(x, \mathcal{C}_k)$ directly.
5. Use decision theory to assign each new $x$ to one of the classes.

20 Probabilistic Discriminative Models

Maximise a likelihood function defined through the conditional distribution $p(\mathcal{C}_k \mid x)$ directly: discriminative training.

Typically fewer parameters need to be determined than in a generative model. As we learn the posterior $p(\mathcal{C}_k \mid x)$ directly, prediction may be better than with a generative model whose class-conditional density assumptions $p(x \mid \mathcal{C}_k)$ poorly approximate the true distributions.

But: discriminative models cannot create synthetic data, as $p(x)$ is not modelled.

21 Original Input versus Feature Space

We have used the direct input $x$ until now. All classification algorithms discussed also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions $\phi(x)$.

Example: use two Gaussian basis functions centred at the green crosses in the input space.

(Figure: input space $(x_1, x_2)$ with the two Gaussian basis function centres, and the corresponding feature space $(\phi_1, \phi_2)$.)

22 Original Input versus Feature Space

Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space. Classes which are NOT linearly separable in the input space can become linearly separable in the feature space.

BUT: if classes overlap in the input space, they will also overlap in the feature space. Nonlinear features $\phi(x)$ cannot remove the overlap; they may even increase it!

(Figure: a nonlinear decision boundary in the input space $(x_1, x_2)$ and the corresponding linear boundary in the feature space $(\phi_1, \phi_2)$.)

23 Original Input versus Feature Space

Fixed basis functions do not adapt to the data and therefore have important limitations (see the discussion in Linear Regression).

Understanding of more advanced algorithms becomes easier if we introduce the feature space now and use it instead of the original input space. Some applications use fixed features successfully by avoiding the limitations.

We will therefore use $\phi$ instead of $x$ from now on.

24 Logistic Regression is Classification

Two classes, where the posterior of class $\mathcal{C}_1$ is a logistic sigmoid $\sigma(\cdot)$ acting on a linear function of the feature vector $\phi$:

$$ p(\mathcal{C}_1 \mid \phi) = y(\phi) = \sigma(w^T \phi), \qquad p(\mathcal{C}_2 \mid \phi) = 1 - p(\mathcal{C}_1 \mid \phi). $$

The model dimension is equal to the dimension $M$ of the feature space. Compare this to fitting two Gaussians:

$$ \underbrace{2M}_{\text{means}} + \underbrace{M(M+1)/2}_{\text{shared covariance}} = M(M+5)/2. $$

For larger $M$, the logistic regression model has a clear advantage.
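As a concrete count: for $M = 100$ features, logistic regression has $100$ adjustable parameters, while the generative Gaussian model needs $2 \cdot 100 = 200$ for the means plus $100 \cdot 101 / 2 = 5050$ for the shared covariance, i.e. $5250 = 100 \cdot 105 / 2$ parameters, growing quadratically with $M$.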

25 Logistic Regression is Classification

Determine the parameters $w$ via maximum likelihood for data $(\phi_n, t_n)$, $n = 1, \dots, N$, where $\phi_n = \phi(x_n)$. The class membership is coded as $t_n \in \{0, 1\}$.

Likelihood function:

$$ p(\mathbf{t} \mid w) = \prod_{n=1}^N y_n^{t_n} (1 - y_n)^{1 - t_n}, \qquad y_n = p(\mathcal{C}_1 \mid \phi_n). $$

Error function: the negative log likelihood, resulting in the cross-entropy error function

$$ E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^N \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}. $$

26 Logistic Regression is Classification

Error function (cross-entropy error):

$$ E(w) = -\sum_{n=1}^N \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}, \qquad y_n = p(\mathcal{C}_1 \mid \phi_n) = \sigma(w^T \phi_n). $$

Gradient of the error function (using $\frac{d\sigma}{da} = \sigma(1 - \sigma)$):

$$ \nabla E(w) = \sum_{n=1}^N (y_n - t_n)\, \phi_n. $$

The gradient does not contain any sigmoid function; for each data point, the contribution is the product of the deviation $y_n - t_n$ and the basis function $\phi_n$.

BUT: the maximum likelihood solution can exhibit severe over-fitting, even for many data points; one should then use a regularised error function or a MAP solution.
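A sketch of maximum-likelihood training by plain batch gradient descent on the cross-entropy error, using exactly this gradient (the learning rate, iteration count, and toy data are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(Phi, t, lr=0.5, iters=5000):
    """Gradient descent on E(w); grad E(w) = sum_n (y_n - t_n) phi_n."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t)
        w -= lr * grad / len(t)     # average the gradient for a stable step size
    return w

# Toy data: 1D input with a bias feature, phi(x) = (1, x)
rng = np.random.default_rng(1)
x = rng.normal(size=200)
t = (x + 0.3 * rng.normal(size=200) > 0.0).astype(float)
Phi = np.column_stack([np.ones_like(x), x])
print(fit_logistic(Phi, t))
```

Note that on (nearly) separable data this maximum-likelihood fit drives $\|w\|$ towards infinity, which is exactly the over-fitting the slide warns about; a regulariser or a MAP estimate keeps $w$ finite.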

27 Laplace Approximation

Given a continuous distribution $p(x)$ which is not Gaussian, can we approximate it by a Gaussian $q(x)$? We need to find a mode of $p(x)$ and try to fit a Gaussian with the same mode.

(Figures: a non-Gaussian distribution (yellow) and its Gaussian approximation (red); the negative logs of the non-Gaussian (yellow) and of the Gaussian approximation (red).)

28 Laplace Approximation

Assume $p(z)$ can be written as

$$ p(z) = \frac{1}{Z} f(z) \qquad \text{with normalisation} \quad Z = \int f(z)\, dz. $$

Furthermore, assume $Z$ is unknown! A mode of $p(z)$ is at a point $z_0$ where $p'(z_0) = 0$. A Taylor expansion of $\ln f(z)$ at $z_0$ gives

$$ \ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2 $$

where

$$ A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}. $$

29 Laplace Approximation

Exponentiating

$$ \ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2 $$

we get

$$ f(z) \simeq f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}. $$

After normalisation we get the Laplace approximation

$$ q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}. $$

This is only defined for precision $A > 0$, as only then does $p(z)$ have a maximum at $z_0$.
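A 1D sketch of the whole recipe for an assumed unnormalised $f(z)$ (the particular $f$ below, a Gaussian times a sigmoid, is just an illustrative choice): locate the mode on a grid, estimate $A = -\frac{d^2}{dz^2} \ln f(z_0)$ by central finite differences, and read off $q(z)$.

```python
import numpy as np

def f(z):
    """Unnormalised, non-Gaussian density (illustrative choice)."""
    return np.exp(-z**2 / 2.0) / (1.0 + np.exp(-4.0 * z))

# Locate the mode z0 of ln f on a fine grid
z = np.linspace(-3.0, 3.0, 200001)
z0 = z[np.argmax(np.log(f(z)))]

# A = -(d^2/dz^2) ln f(z) at z0, by central finite differences
h = 1e-4
A = -(np.log(f(z0 + h)) - 2.0 * np.log(f(z0)) + np.log(f(z0 - h))) / h**2

def q(z):
    """Laplace approximation: q(z) = sqrt(A / (2 pi)) exp(-A (z - z0)^2 / 2)."""
    return np.sqrt(A / (2.0 * np.pi)) * np.exp(-0.5 * A * (z - z0)**2)

print(z0, A, q(z0))   # A > 0 at a genuine maximum
```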

30 Laplace Approximation - Vector Space

Approximate $p(z)$ for $z \in \mathbb{R}^M$, where $p(z) = \frac{1}{Z} f(z)$. We get the Taylor expansion

$$ \ln f(z) \simeq \ln f(z_0) - \frac{1}{2} (z - z_0)^T A (z - z_0) $$

where the Hessian $A$ is defined as

$$ A = -\left. \nabla \nabla \ln f(z) \right|_{z = z_0}. $$

The Laplace approximation of $p(z)$ is then

$$ q(z) = \frac{|A|^{1/2}}{(2\pi)^{M/2}} \exp\left\{ -\frac{1}{2} (z - z_0)^T A (z - z_0) \right\} = \mathcal{N}(z \mid z_0, A^{-1}). $$

31 Bayesian Logistic Regression

Exact Bayesian inference for logistic regression is intractable. Why? We would need to normalise a product of a prior probability and a likelihood which itself is a product of logistic sigmoid functions, one for each data point. Evaluating the predictive distribution is also intractable. Therefore we will use the Laplace approximation.

32 Bayesian Logistic Regression

Assume a Gaussian prior (because we want a Gaussian posterior):

$$ p(w) = \mathcal{N}(w \mid m_0, S_0) $$

for fixed hyperparameters $m_0$ and $S_0$. Hyperparameters are parameters of a prior distribution; in contrast to the model parameters $w$, they are not learned.

For a set of training data $(x_n, t_n)$, $n = 1, \dots, N$, the posterior is given by

$$ p(w \mid \mathbf{t}) \propto p(w)\, p(\mathbf{t} \mid w) $$

where $\mathbf{t} = (t_1, \dots, t_N)^T$.

33 Bayesian Logistic Regression

Using our previous result for the cross-entropy error function

$$ E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^N \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} $$

we can now calculate the log of the posterior $p(w \mid \mathbf{t}) \propto p(w)\, p(\mathbf{t} \mid w)$, using the notation $y_n = \sigma(w^T \phi_n)$, as

$$ \ln p(w \mid \mathbf{t}) = -\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^N \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} + \text{const}. $$

34 Bayesian Logistic Regression

To obtain a Gaussian approximation to

$$ \ln p(w \mid \mathbf{t}) = -\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^N \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} + \text{const}: $$

1. Find $w_{\text{MAP}}$ which maximises $\ln p(w \mid \mathbf{t})$. This defines the mean of the Gaussian approximation. (Note: this is a nonlinear function of $w$ because $y_n = \sigma(w^T \phi_n)$.)
2. Calculate the negative Hessian of the log posterior to get the inverse covariance of the Laplace approximation:

$$ S_N^{-1} = -\nabla \nabla \ln p(w \mid \mathbf{t}) = S_0^{-1} + \sum_{n=1}^N y_n (1 - y_n)\, \phi_n \phi_n^T. $$
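A sketch of both steps together: Newton's method finds $w_{\text{MAP}}$, and since the Newton Hessian is exactly $S_0^{-1} + \sum_n y_n(1 - y_n)\phi_n\phi_n^T$, the inverse covariance $S_N^{-1}$ comes for free at convergence (the toy data and the hyperparameters $m_0 = 0$, $S_0 = I$ are assumptions for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, iters=25):
    """Gaussian approximation N(w | w_MAP, S_N) to the logistic regression posterior."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)             # -grad ln p(w | t)
        H = S0_inv + Phi.T @ (Phi * (y * (1.0 - y))[:, None])  # -Hessian ln p(w | t)
        w = w - np.linalg.solve(H, grad)                       # Newton step
    y = sigmoid(Phi @ w)
    SN_inv = S0_inv + Phi.T @ (Phi * (y * (1.0 - y))[:, None])  # S_N^{-1} at w_MAP
    return w, np.linalg.inv(SN_inv)

rng = np.random.default_rng(2)
x = rng.normal(size=100)
t = (x > 0.0).astype(float)                 # separable toy data; the prior keeps w finite
Phi = np.column_stack([np.ones_like(x), x])
w_map, S_N = laplace_posterior(Phi, t, m0=np.zeros(2), S0=np.eye(2))
print(w_map)
print(S_N)
```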

35 Bayesian Logistic Regression

The Gaussian approximation (via the Laplace approximation) of the posterior distribution is now

$$ q(w) = \mathcal{N}(w \mid w_{\text{MAP}}, S_N) $$

where

$$ S_N^{-1} = -\nabla \nabla \ln p(w \mid \mathbf{t}) = S_0^{-1} + \sum_{n=1}^N y_n (1 - y_n)\, \phi_n \phi_n^T. $$
