Christfried Webers. Canberra February June 2015


 Sherilyn Hoover
 2 years ago
 Views:
Transcription
1 c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829
2 c Part VIII Linear Classification 2 Logistic 277of 829
3 Three for Decision Problems In increasing order of complexity Find a discriminant function f (x) which maps each input directly onto a class label. 1 Solve the inference problem of determining the posterior class probabilities p(c k x). 2 Use decision theory to assign each new x to one of the classes. Generative 1 Solve the inference problem of determining the classconditional probabilities p(x C k). 2 Also, infer the prior class probabilities p(c k). 3 Use Bayes theorem to find the posterior p(c k x). 4 Alternatively, model the joint distribution p(x, C k) directly. 5 Use decision theory to assign each new x to one of the classes. c Logistic 278of 829
4 Generative approach: model classconditional densities p(x C k ) and priors p(c k ) to calculate the posterior probability for class C 1 p(x C 1 )p(c 1 ) p(c 1 x) = p(x C 1 )p(c 1 ) + p(x C 2 )p(c 2 ) 1 = 1 + exp( a(x)) = σ(a(x)) where a and the logistic sigmoid function σ(a) are given by a(x) = ln p(x C 1) p(c 1 ) p(x C 2 ) p(c 2 ) = ln p(x, C 1) p(x, C 2 ) 1 σ(a) = 1 + exp( a). c Logistic 279of 829
5 Logistic Sigmoid The logistic sigmoid function σ(a) = 1 1+exp( a) "squashing function because it maps the real axis into a finite interval (0, 1) σ( a) = 1 σ(a) Derivative d daσ(a) = σ(a) σ( a) = σ(a) (1 σ(a)) ( ) Inverse is called logit function a(σ) = ln σ 1 σ c Σ a a Σ 4 Logistic Σ Logistic Sigmoid σ(a) a 6 Logit a(σ) 280of 829
6  Multiclass c The normalised exponential is given by where p(c k x) = p(x C k) p(c k ) j p(x C j) p(c j ) = exp(a k) j exp(a j) a k = ln(p(x C k ) p(c k )). Also called softmax function as it is a smoothed version of the max function. Example: If a k a j for all j k, then p(c k x) 1, and p(c j x) 0. Logistic 281of 829
7 Probabil. Generative Model  c Assume classconditional probabilities are Gaussian, all classes share the same covariance. What can we say about the posterior probabilities? { 1 1 p(x C k ) = exp 1 } (2π) D/2 Σ 1/2 2 (x µ k) T Σ 1 (x µ k ) { 1 1 = exp 1 } (2π) D/2 Σ 1/2 2 xt Σ 1 x exp {µ Tk Σ 1 x 12 } µtk Σ 1 µ k where we separated the quadratic term in x and the linear term. Logistic 282of 829
8 Probabil. Generative Model  For two classes and a(x) is Therefore where p(c 1 x) = σ(a(x)) a(x) = ln p(x C 1) p(c 1 ) p(x C 2 ) p(c 2 ) = ln exp { } µ T 1 Σ 1 x 1 2 µt 1 Σ 1 µ 1 exp { } + ln p(c 1) µ T 2 Σ 1 x 1 2 µt 2 Σ 1 µ 2 p(c 2 ) w = Σ 1 (µ 1 µ 2 ) p(c 1 x) = σ(w T x + w 0 ) w 0 = 1 2 µt 1 Σ 1 µ µt 2 Σ 1 µ 2 + ln p(c 1) p(c 2 ) c Logistic 283of 829
9 Probabil. Generative Model  c Classconditional densities for two classes (left). Posterior probability p(c 1 x) (right). Note the logistic sigmoid of a linear function of x. Logistic 284of 829
10 General Case  K Classes, Shared Covariance Use the normalised exponential where p(c k x) = p(x C k)p(c k ) j p(x C j)p(c j ) = exp(a k) j exp(a j) to get a linear function of x a k = ln (p(x C k )p(c k )). c where a k (x) = w T k x + w k0. w k = Σ 1 µ k w k0 = 1 2 µt k Σ 1 µ k + p(c k ). Logistic 285of 829
11 General Case  K Classes, Different Covariance If each classconditional probability has a different covariance, the quadratic terms 1 2 xt Σ 1 x do not longer cancel each other out. We get a quadratic discriminant. c Logistic 286of 829
12 Maximum Likelihood Solution Given the functional form of the classconditional densities p(x C k ), can we determine the parameters µ and Σ? c Logistic 287of 829
13 Maximum Likelihood Solution Given the functional form of the classconditional densities p(x C k ), can we determine the parameters µ and Σ? Not without data ;) Given also a data set (x n, t n ) for n = 1,..., N. (Using the coding scheme where t n = 1 corresponds to class C 1 and t n = 0 denotes class C 2. Assume the classconditional densities to be Gaussian with the same covariance, but different mean. Denote the prior probability p(c 1 ) = π, and therefore p(c 2 ) = 1 π. Then p(x n, C 1 ) = p(c 1 )p(x n C 1 ) = π N (x n µ 1, Σ) p(x n, C 2 ) = p(c 2 )p(x n C 2 ) = (1 π) N (x n µ 2, Σ) c Logistic 288of 829
14 Maximum Likelihood Solution Thus the likelihood for the whole data set X and t is given by p(t, X π, µ 1, µ 2, Σ) = N [π N (x n µ 1, Σ)] tn n=1 Maximise the log likelihood The term depending on π is [(1 π) N (x n µ 2, Σ)] 1 tn N (t n ln π + (1 t n ) ln(1 π) n=1 which is maximal for π = 1 N N n=1 t n = N 1 N = N 1 N 1 + N 2 where N 1 is the number of data points in class C 1. c Logistic 289of 829
15 Maximum Likelihood Solution c Similarly, we can maximise the log likelihood (and thereby the likelihood p(t, X π, µ 1, µ 2, Σ) ) depending on the mean µ 1 or µ 2, and get µ 1 = 1 N t n x n N 1 n=1 µ 2 = 1 N (1 t n ) x n N 2 n=1 For each class, this are the means of all input vectors assigned to this class. Logistic 290of 829
16 Maximum Likelihood Solution c Finally, the log likelihood ln p(t, X π, µ 1, µ 2, Σ) can be maximised for the covariance Σ resulting in Σ = N 1 N S 1 + N 2 N S 2 S k = 1 N k n C k (x n µ k )(x n µ k ) T Logistic 291of 829
17  Naive Bayes c Assume the input space consists of discrete features, in the simplest case x i {0, 1}. For a Ddimensional input space, a general distribution would be represented by a table with 2 D entries. Together with the normalisation constraint, this are 2 D 1 independent variables. Grows exponentially with the number of features. The Naive Bayes assumption is that all features conditioned on the class C k are independent of each other. p(x C k ) = D i=1 µ xi k i (1 µ ki ) 1 xi Logistic 292of 829
18  Naive Bayes With the naive Bayes p(x C k ) = D i=1 µ xi k i (1 µ ki ) 1 xi we can then again find the factors a k in the normalised exponential p(c k x) = p(x C k)p(c k ) j p(x C j)p(c j ) = exp(a k) j exp(a j) as a linear function of the x i a k (x) = D {x i ln µ ki + (1 x i ) ln(1 µ ki )} + ln p(c k ). i=1 c Logistic 293of 829
19 Three for Decision Problems In increasing order of complexity Find a discriminant function f (x) which maps each input directly onto a class label. 1 Solve the inference problem of determining the posterior class probabilities p(c k x). 2 Use decision theory to assign each new x to one of the classes. Generative 1 Solve the inference problem of determining the classconditional probabilities p(x C k). 2 Also, infer the prior class probabilities p(c k). 3 Use Bayes theorem to find the posterior p(c k x). 4 Alternatively, model the joint distribution p(x, C k) directly. 5 Use decision theory to assign each new x to one of the classes. c Logistic 294of 829
20 c Maximise a likelihood function defined through the conditional distribution p(c k x) directly. Discriminative training Typically fewer parameters to be determined. As we learn the posteriror p(c k x) directly, prediction may be better than with a generative model where the classconditional density assumptions p(x C k ) poorly approximate the true distributions. But: discriminative models can not create synthetic data, as p(x) is not modelled. Logistic 295of 829
21 Original Input versus Feature Space Used direct input x until now. All classification algorithms work also if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x). Example: Use two Gaussian basis functions centered at the green crosses in the input space. c x2 1 0 φ Logistic x φ1 296of 829
22 Original Input versus Feature Space Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space. Classes which are NOT linearly separable in the input space can become linearly separable in the feature space. BUT: If classes overlap in input space, they will also overlap in feature space. Nonlinear features φ(x) can not remove the overlap; but they may increase it! c Logistic x2 1 0 φ x φ1 297of 829
23 Original Input versus Feature Space c Fixed basis functions do not adapt to the data and therefore have important limitations (see discussion in Linear ). Understanding of more advanced algorithms becomes easier if we introduce the feature space now and use it instead of the original input space. Some applications use fixed features successfully by avoiding the limitations. We will therefore use φ instead of x from now on. Logistic 298of 829
24 Logistic is Classification Two classes where the posterior of class C 1 is a logistic sigmoid σ() acting on a linear function of the feature vector φ p(c 1 φ) = y(φ) = σ(w T φ) p(c 2 φ) = 1 p(c 1 φ) Model dimension is equal to dimension of the feature space M. Compare this to fitting two Gaussians }{{} 2M + M(M + 1)/2 = M(M + 5)/2 } {{ } means shared covariance For larger M, the logistic regression model has a clear advantage. c Logistic 299of 829
25 Logistic is Classification Determine the parameter via maximum likelihood for data (φ n, t n ), n = 1,..., N, where φ n = φ(x n ). The class membership is coded as t n {0, 1}. Likelihood function p(t w) = where y n = p(c 1 φ n ). N n=1 y tn n (1 y n ) 1 tn Error function : negative log likelihood resulting in the crossentropy error function E(w) = ln p(t w) = N {t n ln y n + (1 t n ) ln(1 y n )} n=1 c Logistic 300of 829
26 Logistic is Classification Error function (crossentropy error ) E(w) = N {t n ln y n + (1 t n ) ln(1 y n )} n=1 y n = p(c 1 φ n ) = σ(w T φ n ) Gradient of the error function (using dσ da E(w) = N (y n t n )φ n n=1 = σ(1 σ) ) gradient does not contain any sigmoid function for each data point error is product of deviation y n t n and basis function φ n. BUT : maximum likelihood solution can exhibit overfitting even for many data points; should use regularised error or MAP then. c Logistic 301of 829
27 Given a continous distribution p(x) which is not Gaussian, can we approximate it by a Gaussian q(x)? Need to find a mode of p(x). Try to find a Gaussian with the same mode. c Logistic NonGaussian (yellow) and Gaussian approximation (red). Negative log of the NonGaussian (yellow) and Gaussian approx. (red). 302of 829
28 Assume p(x) can be written as c p(z) = 1 Z f (z) with normalisation Z = f (z) dz. Furthermore, assume Z is unknown! A mode of p(z) is at a point z 0 where p (z 0 ) = 0. Taylor expansion of ln f (z) at z 0 where ln f (z) ln f (z 0 ) 1 2 A(z z 0) 2 A = d2 dz 2 ln f (z) z=z 0 Logistic 303of 829
29 c Exponentiating we get ln f (z) ln f (z 0 ) 1 2 A(z z 0) 2 f (z) f (z 0 ) exp{ A 2 (z z 0) 2 }. And after normalisation we get the Laplace approximation q(z) = ( ) 1/2 A exp{ A 2π 2 (z z 0) 2 }. Only defined for precision A > 0 as only then p(z) has a maximum. Logistic 304of 829
30  Vector Space Approximate p(z) for z R M c we get the Taylor expansion p(z) = 1 Z f (z). ln f (z) ln f (z 0 ) 1 2 (z z 0) T A(z z 0 ) where the Hessian A is defined as A = ln f (z) z=z0. The Laplace approximation of p(z) is then q(z) = { A 1/2 exp 1 } (2π) M/2 2 (z z 0) T A(z z 0 ) = N (z z 0, A 1 ) Logistic 305of 829
31 c Exact Bayesian inference for the logistic regression is intractable. Why? Need to normalise a product of prior probabilities and likelihoods which itself are a product of logistic sigmoid functions, one for each data point. Evaluation of the predictive distribution also intractable. Therefore we will use the Laplace approximation. Logistic 306of 829
32 c Assume a Gaussian prior because we want a Gaussian posterior. p(w) = N (w m 0, S 0 ) for fixed hyperparameter m 0 and S 0. Hyperparameters are parameters of a prior distribution. In contrast to the model parameters w, they are not learned. For a set of training data (x n, t n ), where n = 1,..., N, the posterior is given by where t = (t 1,..., t N ) T. p(w t) p(w)p(t w) Logistic 307of 829
33 Using our previous result for the crossentropy function E(w) = ln p(t w) = N {t n ln y n + (1 t n ) ln(1 y n )} n=1 we can now calculate the log of the posterior p(w t) p(w)p(t w) using the notation y n = σ(w T φ n ) as ln p(w t) = 1 2 (w m 0) T S 1 0 (w m 0) + N {t n ln y n + (1 t n ) ln(1 y n )} n=1 c Logistic 308of 829
34 To obtain a Gaussian approximation to ln p(w t) = 1 2 (w m 0) T S 1 0 (w m 0) + N {t n ln y n + (1 t n ) ln(1 y n )} n=1 1 Find w MAP which maximises ln p(w t). This defines the mean of the Gaussian approximation. (Note: This is a nonlinear function in w because y n = σ(w T φ n ).) 2 Calculate the second derivative of the negative log likelihood to get the inverse covariance of the Laplace approximation S N = ln p(w t) = S N y n (1 y n )φ n φ T n. n=1 c Logistic 309of 829
35 c The approximated Gaussian (via Laplace approximation) of the posterior distribution is now where q(w φ) = N (w w MAP, S N ) S N = ln p(w t) = S N y n (1 y n )φ n φ T n. n=1 Logistic 310of 829
These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher
More informationLinear Models for Classification
Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationCS 688 Pattern Recognition Lecture 4. Linear Models for Classification
CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationLinear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil.
Steven J Zeil Old Dominion Univ. Fall 200 DiscriminantBased Classification Linearly Separable Systems Pairwise Separation 2 Posteriors 3 Logistic Discrimination 2 DiscriminantBased Classification Likelihoodbased:
More informationClassification Problems
Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems
More informationPattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University
Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision
More informationCSCI567 Machine Learning (Fall 2014)
CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /
More informationLinear Classification. Volker Tresp Summer 2015
Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationPa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on
Pa8ern Recogni6on and Machine Learning Chapter 4: Linear Models for Classiﬁca6on Represen'ng the target values for classifica'on If there are only two classes, we typically use a single real valued output
More informationMachine Learning and Pattern Recognition Logistic Regression
Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationProbabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur
Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationLinear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
More informationCCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York
BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal  the stuff biology is not
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation:  Feature vector X,  qualitative response Y, taking values in C
More informationA crash course in probability and Naïve Bayes classification
Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s
More informationParametric Models Part I: Maximum Likelihood and Bayesian Density Estimation
Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015
More informationProbability Theory. Elementary rules of probability Sum rule. Product rule. p. 23
Probability Theory Uncertainty is key concept in machine learning. Probability provides consistent framework for the quantification and manipulation of uncertainty. Probability of an event is the fraction
More informationLecture 8 February 4
ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt
More informationCSC 411: Lecture 07: Multiclass Classification
CSC 411: Lecture 07: Multiclass Classification Class based on Raquel Urtasun & Rich Zemel s lectures Sanja Fidler University of Toronto Feb 1, 2016 Urtasun, Zemel, Fidler (UofT) CSC 411: 07Multiclass
More informationBayes and Naïve Bayes. cs534machine Learning
Bayes and aïve Bayes cs534machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule
More informationSection 5. Stan for Big Data. Bob Carpenter. Columbia University
Section 5. Stan for Big Data Bob Carpenter Columbia University Part I Overview Scaling and Evaluation data size (bytes) 1e18 1e15 1e12 1e9 1e6 Big Model and Big Data approach state of the art big model
More informationBasics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar
More informationUniversity of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 8: MultiLayer Perceptrons
University of Cambridge Engineering Part IIB Module 4F0: Statistical Pattern Processing Handout 8: MultiLayer Perceptrons x y (x) Inputs x 2 y (x) 2 Outputs x d First layer Second Output layer layer y
More informationMachine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More informationLogistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.
Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features
More informationCheng Soon Ong & Christfried Webers. Canberra February June 2016
c Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 31 c Part I
More informationClassification. Chapter 3
Chapter 3 Classification In chapter we have considered regression problems, where the targets are real valued. Another important class of problems is classification problems, where we wish to assign an
More informationReject Inference in Credit Scoring. JieMen Mok
Reject Inference in Credit Scoring JieMen Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business
More informationMACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
More informationPrincipled Hybrids of Generative and Discriminative Models
Principled Hybrids of Generative and Discriminative Models Julia A. Lasserre University of Cambridge Cambridge, UK jal62@cam.ac.uk Christopher M. Bishop Microsoft Research Cambridge, UK cmbishop@microsoft.com
More informationResponse variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.
Prof. Dr. J. Franke All of Statistics 1.52 Binary response variables  logistic regression Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit
More informationLecture 9: Introduction to Pattern Analysis
Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns
More informationNotes for STA 437/1005 Methods for Multivariate Data
Notes for STA 437/1005 Methods for Multivariate Data Radford M. Neal, 26 November 2010 Random Vectors Notation: Let X be a random vector with p elements, so that X = [X 1,..., X p ], where denotes transpose.
More informationAn Introduction to Machine Learning
An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,
More informationGaussian Processes in Machine Learning
Gaussian Processes in Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany carl@tuebingen.mpg.de WWW home page: http://www.tuebingen.mpg.de/ carl
More informationEfficient Streaming Classification Methods
1/44 Efficient Streaming Classification Methods Niall M. Adams 1, Nicos G. Pavlidis 2, Christoforos Anagnostopoulos 3, Dimitris K. Tasoulis 1 1 Department of Mathematics 2 Institute for Mathematical Sciences
More informationNominal and ordinal logistic regression
Nominal and ordinal logistic regression April 26 Nominal and ordinal logistic regression Our goal for today is to briefly go over ways to extend the logistic regression model to the case where the outcome
More information1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
More informationP (x) 0. Discrete random variables Expected value. The expected value, mean or average of a random variable x is: xp (x) = v i P (v i )
Discrete random variables Probability mass function Given a discrete random variable X taking values in X = {v 1,..., v m }, its probability mass function P : X [0, 1] is defined as: P (v i ) = Pr[X =
More informationSemiSupervised Support Vector Machines and Application to Spam Filtering
SemiSupervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery
More informationClass #6: Nonlinear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Nonlinear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Nonlinear classification Linear Support Vector Machines
More informationA LogRobust Optimization Approach to Portfolio Management
A LogRobust Optimization Approach to Portfolio Management Dr. Aurélie Thiele Lehigh University Joint work with Ban Kawas Research partially supported by the National Science Foundation Grant CMMI0757983
More informationSummary of Probability
Summary of Probability Mathematical Physics I Rules of Probability The probability of an event is called P(A), which is a positive number less than or equal to 1. The total probability for all possible
More information11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression
Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c11 2013/9/9 page 221 letex 221 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial
More informationSupport Vector Machine (SVM)
Support Vector Machine (SVM) CE725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept HardMargin SVM SoftMargin SVM Dual Problems of HardMargin
More informationGaussian Classifiers CS498
Gaussian Classifiers CS498 Today s lecture The Gaussian Gaussian classifiers A slightly more sophisticated classifier Nearest Neighbors We can classify with nearest neighbors x m 1 m 2 Decision boundary
More informationInterpretation of Somers D under four simple models
Interpretation of Somers D under four simple models Roger B. Newson 03 September, 04 Introduction Somers D is an ordinal measure of association introduced by Somers (96)[9]. It can be defined in terms
More informationCSI:FLORIDA. Section 4.4: Logistic Regression
SI:FLORIDA Section 4.4: Logistic Regression SI:FLORIDA Reisit Masked lass Problem.5.5 2 .5  .5 .5  .5.5.5 We can generalize this roblem to two class roblem as well! SI:FLORIDA Reisit Masked lass
More informationSYSM 6304: Risk and Decision Analysis Lecture 3 Monte Carlo Simulation
SYSM 6304: Risk and Decision Analysis Lecture 3 Monte Carlo Simulation M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu September 19, 2015 Outline
More informationCell Phone based Activity Detection using Markov Logic Network
Cell Phone based Activity Detection using Markov Logic Network Somdeb Sarkhel sxs104721@utdallas.edu 1 Introduction Mobile devices are becoming increasingly sophisticated and the latest generation of smart
More informationRobert Collins CSE598G. More on Meanshift. R.Collins, CSE, PSU CSE598G Spring 2006
More on Meanshift R.Collins, CSE, PSU Spring 2006 Recall: Kernel Density Estimation Given a set of data samples x i ; i=1...n Convolve with a kernel function H to generate a smooth function f(x) Equivalent
More informationIntroduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models General Linear Models  part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK2800 Kgs. Lyngby
More information15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
More informationMultivariate Normal Distribution
Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #47/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues
More informationProbabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014
Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about
More informationGeneralized Linear Models. Today: definition of GLM, maximum likelihood estimation. Involves choice of a link function (systematic component)
Generalized Linear Models Last time: definition of exponential family, derivation of mean and variance (memorize) Today: definition of GLM, maximum likelihood estimation Include predictors x i through
More informationCS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott
More informationLogistic Regression (1/24/13)
STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used
More informationMATH 10: Elementary Statistics and Probability Chapter 5: Continuous Random Variables
MATH 10: Elementary Statistics and Probability Chapter 5: Continuous Random Variables Tony Pourmohamad Department of Mathematics De Anza College Spring 2015 Objectives By the end of this set of slides,
More informationPenalized Logistic Regression and Classification of Microarray Data
Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAGLMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification
More informationThe Optimality of Naive Bayes
The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most
More informationMessagepassing sequential detection of multiple change points in networks
Messagepassing sequential detection of multiple change points in networks Long Nguyen, Arash Amini Ram Rajagopal University of Michigan Stanford University ISIT, Boston, July 2012 Nguyen/Amini/Rajagopal
More informationChapter 2: Binomial Methods and the BlackScholes Formula
Chapter 2: Binomial Methods and the BlackScholes Formula 2.1 Binomial Trees We consider a financial market consisting of a bond B t = B(t), a stock S t = S(t), and a calloption C t = C(t), where the
More informationThe equivalence of logistic regression and maximum entropy models
The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.winvector.com/blog/20/09/thesimplerderivationoflogisticregression/
More informationIntroduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multiclass classification.
More informationMonotonicity Hints. Abstract
Monotonicity Hints Joseph Sill Computation and Neural Systems program California Institute of Technology email: joe@cs.caltech.edu Yaser S. AbuMostafa EE and CS Deptartments California Institute of Technology
More informationJUST THE MATHS UNIT NUMBER 1.8. ALGEBRA 8 (Polynomials) A.J.Hobson
JUST THE MATHS UNIT NUMBER 1.8 ALGEBRA 8 (Polynomials) by A.J.Hobson 1.8.1 The factor theorem 1.8.2 Application to quadratic and cubic expressions 1.8.3 Cubic equations 1.8.4 Long division of polynomials
More informationExploratory Data Analysis Using Radial Basis Function Latent Variable Models
Exploratory Data Analysis Using Radial Basis Function Latent Variable Models Alan D. Marrs and Andrew R. Webb DERA St Andrews Road, Malvern Worcestershire U.K. WR14 3PS {marrs,webb}@signal.dera.gov.uk
More informationProbabilistic user behavior models in online stores for recommender systems
Probabilistic user behavior models in online stores for recommender systems Tomoharu Iwata Abstract Recommender systems are widely used in online stores because they are expected to improve both user
More informationBasic Utility Theory for Portfolio Selection
Basic Utility Theory for Portfolio Selection In economics and finance, the most popular approach to the problem of choice under uncertainty is the expected utility (EU) hypothesis. The reason for this
More informationBayesX  Software for Bayesian Inference in Structured Additive Regression
BayesX  Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, LudwigMaximiliansUniversity Munich
More informationInstitute of Actuaries of India Subject CT3 Probability and Mathematical Statistics
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in
More informationData Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a
More information1 The Brownian bridge construction
The Brownian bridge construction The Brownian bridge construction is a way to build a Brownian motion path by successively adding finer scale detail. This construction leads to a relatively easy proof
More informationSpatial Statistics Chapter 3 Basics of areal data and areal data modeling
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data
More informationGaussian Conjugate Prior Cheat Sheet
Gaussian Conjugate Prior Cheat Sheet Tom SF Haines 1 Purpose This document contains notes on how to handle the multivariate Gaussian 1 in a Bayesian setting. It focuses on the conjugate prior, its Bayesian
More informationThe Exponential Family
The Exponential Family David M. Blei Columbia University November 3, 2015 Definition A probability density in the exponential family has this form where p.x j / D h.x/ expf > t.x/ a./g; (1) is the natural
More informationAn Introduction to Basic Statistics and Probability
An Introduction to Basic Statistics and Probability Shenek Heyward NCSU An Introduction to Basic Statistics and Probability p. 1/4 Outline Basic probability concepts Conditional probability Discrete Random
More informationGaussian Process Training with Input Noise
Gaussian Process Training with Input Noise Andrew McHutchon Department of Engineering Cambridge University Cambridge, CB PZ ajm57@cam.ac.uk Carl Edward Rasmussen Department of Engineering Cambridge University
More informationDetection of changes in variance using binary segmentation and optimal partitioning
Detection of changes in variance using binary segmentation and optimal partitioning Christian Rohrbeck Abstract This work explores the performance of binary segmentation and optimal partitioning in the
More informationSupport Vector Machines for Classification and Regression
UNIVERSITY OF SOUTHAMPTON Support Vector Machines for Classification and Regression by Steve R. Gunn Technical Report Faculty of Engineering, Science and Mathematics School of Electronics and Computer
More informationSection 5.1 Continuous Random Variables: Introduction
Section 5. Continuous Random Variables: Introduction Not all random variables are discrete. For example:. Waiting times for anything (train, arrival of customer, production of mrna molecule from gene,
More informationMaster s Theory Exam Spring 2006
Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem
More informationLOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as
LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values
More informationMachine Learning in Spam Filtering
Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.
More informationMVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr
Machine Learning for Computer Vision 1 MVA ENS Cachan Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Department of Applied Mathematics Ecole Centrale Paris Galen
More informationSampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data
Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian
More informationComputational Statistics for Big Data
Lancaster University Computational Statistics for Big Data Author: 1 Supervisors: Paul Fearnhead 1 Emily Fox 2 1 Lancaster University 2 The University of Washington September 1, 2015 Abstract The amount
More informationFeature Selection for HighDimensional Genomic Microarray Data
Feature Selection for HighDimensional Genomic Microarray Data Eric P. Xing Michael I. Jordan Richard M. Karp Division of Computer Science, University of California, Berkeley, CA 9472 Department of Statistics,
More informationOptimal order placement in a limit order book. Adrien de Larrard and Xin Guo. Laboratoire de Probabilités, Univ Paris VI & UC Berkeley
Optimal order placement in a limit order book Laboratoire de Probabilités, Univ Paris VI & UC Berkeley Outline 1 Background: Algorithm trading in different time scales 2 Some note on optimal execution
More informationData Modeling & Analysis Techniques. Probability & Statistics. Manfred Huber 2011 1
Data Modeling & Analysis Techniques Probability & Statistics Manfred Huber 2011 1 Probability and Statistics Probability and statistics are often used interchangeably but are different, related fields
More informationLearning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu
Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July 4 2012 Data A lot of
More informationL4: Bayesian Decision Theory
L4: Bayesian Decision Theory Likelihood ratio test Probability of error Bayes risk Bayes, MAP and ML criteria Multiclass problems Discriminant functions CSCE 666 Pattern Analysis Ricardo GutierrezOsuna
More information