# Linear Classification. Volker Tresp Summer 2015

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Linear Classification Volker Tresp Summer

2 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong (dog, cat,...)? 2

3 Linear Classifiers Linear classifiers separate classes by a linear hyperplane In high dimensions a linear classifier often can separate the classes Linear classifiers cannot solve the exclusive-or problem In combination with basis functions, kernels or a neural network, linear classifiers can form nonlinear class boundaries 3

4 Binary Classification Problems We will focus first on binary classification where the task is to assign binary class labels y i = 1 and y i = 0 (or y i = 1 and y i = 1 ) We already know the Perceptron. Now we learn about additional approaches I. Generative models for classification II. Logistic regression III. Classification via regression 4

5 Two Linearly Separable Classes 5

6 Two Classes that Cannot be Separated Linearly 6

7 The Classical Example not two Classes that cannot be Separated Linearly: XOR 7

8 Separability is not a Goal in Itself. With Overlapping Classes the Goal is the Best Possible Hyperplane 8

9 I. Generative Model for Classification In a generative model one assumes a probabilistic data generating process (likelihood model). Often generative models are complex and contain unobserved (latent, hidden) variables Here we consider a simple example: how data are generated in a binary classification problem First we have a model how classes are generated P (y). y = 1 could stand for a good customer and y = 0 could stand for a bad customer. Then we have a model how attributes are generated, given the classes P (x y). This could stand for Income, age, occupation (x) given a customer is a good customer (y = 1) Income, age, occupation (x) given a customer is not a good customer (y = 0) Using Bayes formula, we then derive P (y x): the probability that a given customer is a good customer y = 1 or bad customer y = 0, given that we know the customer s income, age and occupation 9

10 How is Data Generated? We assume that the observed classes y i are generated with probability with 0 κ 1 1. P (y i = 1) = κ 1 P (y i = 0) = κ 0 = 1 κ 1 In a next step, a data point x i has been generated from P (x i y i ) (Note, that x i = (x i,1,..., x i,m ) T, which means that x i does not contain the bias x i,0 ) We now have a complete model: P (y i )P (x i y i ) 10

11 Bayes Theorem To classify a data point x i, i.e. to determine the y i, we apply Bayes theorem and get P (y i x i ) = P (x i y i )P (y i ) P (x i ) P (x i ) = P (x i y i = 1)P (y i = 1) + P (x i y i = 0)P (y i = 0) 11

12 ML Estimate for Class Probabilities Maximum-likelihood estimator for the prior class probabilities are and ˆP (y i = 1) = ˆκ 1 = N 1 /N ˆP (y i = 0) = ˆκ 0 = N 0 /N = 1 ˆκ 1 where N 1 and N 0 is the number of training data points for class 1, respectively class 0 12

13 Class-specific Distributions To model P (x i y i ) one can chose an application specific distribution A popular choice is a Gaussian distribution P (x i y i = l) = N (x i ; µ (l), Σ) with ( ) N x i ; µ (l), Σ = ( 1 (2π) M/2 Σ exp 1 2 (x i µ (l)) T Σ 1 ( x i µ (l))) Note, that both Gaussian distributions have different modes (centers) but the same covariance matrices. This has been shown to often work well 13

14 Maximum-likelihood Estimators for Modes and Covariances One obtains a maximum likelihood estimators for the modes ˆµ (l) = 1 N l i:y i =l One obtains as unbiased estimators for the covariance matrix x i ˆΣ = 1 N M 1 (x i ˆµ (l) )(x i ˆµ (l) ) T l=0 i:y i =l 14

15 Expanding the Quadratic Terms in the Exponent Note that (x i µ (l)) T Σ 1 ( x i µ (l)) = x T i Σ 1 x i 2µ (l)t Σ 1 x i + µ (l)t Σ 1 µ (l)... and... ( x i µ (0)) T Σ 1 ( x i µ (0)) = 2 (x i µ (1)) T Σ 1 ( x i µ (1)) ( µ (0) µ (l)) T Σ 1 x i + µ (0)T Σ 1 µ (0) µ (1)T Σ 1 µ (1) 15

16 A Posteriori Distribution It follows that P (y i = 1 x i ) = P (x i y i = 1)P (y i = 1) P (x i y i = 1)P (y i = 1) + P (x i y i = 0)P (y i = 0) = P (x i y i =0)P (y i =0) P (x i y i =1)P (y i =1) = κ 0 κ exp ((µ (0) µ (1) ) T Σ 1 x 1 i µ(0)t Σ 1 µ (0) 1 2 µ(1)t Σ 1 µ (1)) ) = sig (w 0 + x T i w = sig w 0 + M x i,j w j j w = Σ 1 ( µ (1) µ (0)) 16

17 w 0 = log κ 1 /κ µ(0)t Σ 1 µ (0) µ(1)t Σ 1 µ (1) Recall: sig(arg) = 1/(1 + exp( arg))

18 Comments This specific generative model leads to linear class boundaries The solution is analogue to Fisher s linear discriminant analysis, where one projects the data into a space in which data from the same class have small variance and where the distance between class modes are maximized In other words, one gets the same results from an optimization criterion without assuming Gaussian distributions Although we have used Bayes formula, the analysis was frequentist. A Bayesian analysis with a prior distribution on the parameters is also possible If the two class-specific Gaussians have different covariance matrices the approach is still feasible but one would need to estimate two covariance matrices and the decision boundaries are not linear anymore. 17

19 Special Case: Naive Bayes If one assumes that the covariance matrices are diagonal, one obtains a Naive-Bayes classifier P (x i y i = l) = M j=1 N (x i,j ; µ (l) j, σ 2 j ) The naive Bayes classifier has considerable fewer parameters but completely ignores class-specific correlations between features; this is sometimes considered to be naive Naive Bayes classifiers are popular in text analysis with often more than features (key words). For example, the classes might be SPAM and no-spam and the features a keywords in the texts. Instead of a Gaussian distribution, a Bernoulli distribution is employed, with P (word j SPAM) = γ j,s and P (word j no-spam) = γ j,n 18

20 II. Logistic Regression The generative model motivates ) P (y i = 1 x i ) = sig (x T i w (now we include the bias x T i = (x i,0 = i, 1, x i,1,..., x i,m 1 ) T ). sig() as defined before (logistic funktion). One now optimizes the likelihood of the conditional model L(w) = N i=1 sig ) (x T i w yi ( ( )) 1 sig x T i w 1 yi 19

21 Log-Likelihood Log-likelihood function l = N i=1 y i log ( )) sig (x T i w + (1 y i ) log ( )) 1 sig (x T i w l = ( ) N 1 y i log 1 + exp( x T i w) i=1 ( ) 1 + (1 y i ) log 1 + exp(x T i w) N = y i log(1 + exp( x T i w)) + (1 y i) log(1 + exp(x T i w)) i=1 20

22 Adaption The derivatives of the log-likelihood with respect to the parameters l N w = i=1 y i x i exp( x T i w) 1 + exp( x T i w) (1 y i) x i exp(x T i w) 1 + exp(x T i w) = N i=1 y i x i (1 sig(x T i w)) (1 y i)x i sig(x T i w) = N (y i sig(x T i w))x i i=1 A gradient-based optimization of the parameters to maximize the log-likelihood w w + η l w Typically one uses a Newton-Raphson optimization procedure 21

23

24 Cross-entropy Cost Function In information theory, the cross entropy between a true distribution p and an approximative distribution q is defined as H(p, q) = x p(x = x) log q(x = x) The likelihood cost function is identical to the cross entropy cost function and is written for y i {0, 1} cost = N i=1 y i log(sig(x T i w)) (1 y i) log(1 sig(x T i w))) = N i=1 y i log(1 + exp( x T i w)) + (1 y i) log(1 + exp(x T i w)) = N i=1 log ( )) 1 + exp ((1 2y i )x T i w 22

25 ... and for y i { 1, 1} cost = N i=1 log ( )) 1 + exp ( y i x T i w

26 Logistic Regression in Medical Statistics Logistic regression has become one of the the most important tools in medicine to analyse the outcome of treatments y = 1 means that the patient has the disease. x 1 = 1 might represent the fact that the patient was exposed (e.g., by a genetic variant) and x 1 = 0 might mean that the patient was not exposed. The other inputs are often typical confounders (age, sex,...) Logistic regression then permits the prediction of the outcome for any patient Of course, of great interest is if w 1 is significantly positive (i.e., the exposure was harmful) 23

27 Log-Odds The logarithm of the odds is defined as For logistic regression, log(odds(x i )) = log P (y i = 1 x i ) P (y i = 0 x i ) log(odds(x i )) = log P (y i = 1 x i ) P (y i = 0 x i ) = log exp( x T i w) 1 + exp( x T i w) exp( x T i w) = log 1 exp( x T i w) = xt i w 24

28 Log Odds Ratio The log odds ratio evaluates the effect of the treatment (valid for any patient) w 1 = log(odds(x 1 = 1)) log(odds(x 1 = 0)) = log Odds(x 1 = 1) Odds(x 1 = 0) This is the logarithm of the so called odds ratio (OR). If w 1 0, then the exposure does not have an effect: the odds ratio is commonly used in case-control studies! What we have calculated is the log-odds for a patient with properties x i to get the disease We get a nice interpretation of the parameters in logistic regression: w 0 is the log odds of getting the disease, when the patient does not get the treatment (and all other inputs are zero). This is the only term that changes, when the class mix is changed w 1 is the log odds ratio associated with the treatment. This is independent of the class mix or of other input factors (thus we cannot model interactions) w 2,..., w M 1 models the log odds ratio associated with confounders (age, sex,...) 25

29 Logistic Regression as a Generalized Linear Models (GLM) Consider a Bernoulli distribution with P (y = 1) = θ and P (y = 1) = 1 θ, with 0 θ 1 In the theory of the exponential family of distributions, one sets θ = sig(η). Now we get valid probabilities for any η R! η is called the natural parameter and sig( ) the inverse parameter mapping for the Bernoulli distribution This is convenient if we make η a linear function of the inputs and one obtains a Generalized Linear Model (GLM) P (y i = 1 x i, w) = sig(x T i w) Thus logistic regression is the GLM for the Bernoulli likelihood model 26

30 III. Classification via Regression Linear Regression: f(x i, w) = w 0 + M 1 j=1 w j x i,j = x T i w We define as target y i = 1 if the pattern x i belongs to class 1 and y i = 0 if pattern x i belongs to class 0 We calculate weights w LS = (X T X) 1 X T y as LS solution, exactly as in linear regression For a new pattern z we calculate f(z) = z T w LS and assign the pattern to class 1 if f(z) > 1/2; otherwise we assign the pattern to class 0 27

31 Bias Asymptotically, a LS-solution converges to the posterior class probabilities, although a linear functions is typically not able to represent P (c = 1 x). The resulting class boundary can still be sensible One can expect good class boundaries in high dimensions and/or in combination with basis functions, kernels and neural networks; in high dimensions sometimes consistency can be achieved 28

32 Classification via Regression with Linear Functions 29

33 Classification via Regression with Radial Basis Functions 30

34 Performance Although the approach might seem simplistic, the performance can be excellent (in particular in high dimensions and/or in combination with basis functions, kernels and neural networks). The calculation of the optimal parameters can be very fast! 31

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Linear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil.

Steven J Zeil Old Dominion Univ. Fall 200 Discriminant-Based Classification Linearly Separable Systems Pairwise Separation 2 Posteriors 3 Logistic Discrimination 2 Discriminant-Based Classification Likelihood-based:

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

### Christfried Webers. Canberra February June 2015

c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher

### Classification Problems

Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### Linear Models for Classification

Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Generalized Linear Models. Today: definition of GLM, maximum likelihood estimation. Involves choice of a link function (systematic component)

Generalized Linear Models Last time: definition of exponential family, derivation of mean and variance (memorize) Today: definition of GLM, maximum likelihood estimation Include predictors x i through

### Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c11 2013/9/9 page 221 le-tex 221 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial

### Statistical Machine Learning from Data

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### Wes, Delaram, and Emily MA751. Exercise 4.5. 1 p(x; β) = [1 p(xi ; β)] = 1 p(x. y i [βx i ] log [1 + exp {βx i }].

Wes, Delaram, and Emily MA75 Exercise 4.5 Consider a two-class logistic regression problem with x R. Characterize the maximum-likelihood estimates of the slope and intercept parameter if the sample for

### Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

### CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

### MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

### Lecture 8 February 4

ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt

### Machine Learning in Spam Filtering

Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

### Bayesian Classifier for a Gaussian Distribution, Decision Surface Equation, with Application

Iraqi Journal of Statistical Science (18) 2010 p.p. [35-58] Bayesian Classifier for a Gaussian Distribution, Decision Surface Equation, with Application ABSTRACT Nawzad. M. Ahmad * Bayesian decision theory

### Detecting Corporate Fraud: An Application of Machine Learning

Detecting Corporate Fraud: An Application of Machine Learning Ophir Gottlieb, Curt Salisbury, Howard Shek, Vishal Vaidyanathan December 15, 2006 ABSTRACT This paper explores the application of several

### Logistic Regression (1/24/13)

STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

### Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on

Pa8ern Recogni6on and Machine Learning Chapter 4: Linear Models for Classiﬁca6on Represen'ng the target values for classifica'on If there are only two classes, we typically use a single real valued output

### LCs for Binary Classification

Linear Classifiers A linear classifier is a classifier such that classification is performed by a dot product beteen the to vectors representing the document and the category, respectively. Therefore it

### Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

### Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### An Introduction to Data Mining

An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

### Reject Inference in Credit Scoring. Jie-Men Mok

Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

### Estimating an ARMA Process

Statistics 910, #12 1 Overview Estimating an ARMA Process 1. Main ideas 2. Fitting autoregressions 3. Fitting with moving average components 4. Standard errors 5. Examples 6. Appendix: Simple estimators

### Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

### Multivariate Normal Distribution

Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

### 1 Maximum likelihood estimation

COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

### Lecture 6: Logistic Regression

Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

### 3F3: Signal and Pattern Processing

3F3: Signal and Pattern Processing Lecture 3: Classification Zoubin Ghahramani zoubin@eng.cam.ac.uk Department of Engineering University of Cambridge Lent Term Classification We will represent data by

### The Exponential Family

The Exponential Family David M. Blei Columbia University November 3, 2015 Definition A probability density in the exponential family has this form where p.x j / D h.x/ expf > t.x/ a./g; (1) is the natural

### Bayesian probability theory

Bayesian probability theory Bruno A. Olshausen arch 1, 2004 Abstract Bayesian probability theory provides a mathematical framework for peforming inference, or reasoning, using probability. The foundations

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.

Prof. Dr. J. Franke All of Statistics 1.52 Binary response variables - logistic regression Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit

### Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

### MATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators...

MATH4427 Notebook 2 Spring 2016 prepared by Professor Jenny Baglivo c Copyright 2009-2016 by Jenny A. Baglivo. All Rights Reserved. Contents 2 MATH4427 Notebook 2 3 2.1 Definitions and Examples...................................

### Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

### Efficient Streaming Classification Methods

1/44 Efficient Streaming Classification Methods Niall M. Adams 1, Nicos G. Pavlidis 2, Christoforos Anagnostopoulos 3, Dimitris K. Tasoulis 1 1 Department of Mathematics 2 Institute for Mathematical Sciences

### THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE Alexer Barvinok Papers are available at http://www.math.lsa.umich.edu/ barvinok/papers.html This is a joint work with J.A. Hartigan

### Practice problems for Homework 11 - Point Estimation

Practice problems for Homework 11 - Point Estimation 1. (10 marks) Suppose we want to select a random sample of size 5 from the current CS 3341 students. Which of the following strategies is the best:

### Logit Models for Binary Data

Chapter 3 Logit Models for Binary Data We now turn our attention to regression models for dichotomous data, including logistic regression and probit analysis. These models are appropriate when the response

### P (x) 0. Discrete random variables Expected value. The expected value, mean or average of a random variable x is: xp (x) = v i P (v i )

Discrete random variables Probability mass function Given a discrete random variable X taking values in X = {v 1,..., v m }, its probability mass function P : X [0, 1] is defined as: P (v i ) = Pr[X =

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### CSI:FLORIDA. Section 4.4: Logistic Regression

SI:FLORIDA Section 4.4: Logistic Regression SI:FLORIDA Reisit Masked lass Problem.5.5 2 -.5 - -.5 -.5 - -.5.5.5 We can generalize this roblem to two class roblem as well! SI:FLORIDA Reisit Masked lass

### GLM, insurance pricing & big data: paying attention to convergence issues.

GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK - michael.noack@addactis.com Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.

### Gaussian Classifiers CS498

Gaussian Classifiers CS498 Today s lecture The Gaussian Gaussian classifiers A slightly more sophisticated classifier Nearest Neighbors We can classify with nearest neighbors x m 1 m 2 Decision boundary

### Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

### Factorial experimental designs and generalized linear models

Statistics & Operations Research Transactions SORT 29 (2) July-December 2005, 249-268 ISSN: 1696-2281 www.idescat.net/sort Statistics & Operations Research c Institut d Estadística de Transactions Catalunya

### Statistics in Retail Finance. Chapter 2: Statistical models of default

Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision

### Introduction to Convex Optimization for Machine Learning

Introduction to Convex Optimization for Machine Learning John Duchi University of California, Berkeley Practical Machine Learning, Fall 2009 Duchi (UC Berkeley) Convex Optimization for Machine Learning

### Classifiers & Classification

Classifiers & Classification Forsyth & Ponce Computer Vision A Modern Approach chapter 22 Pattern Classification Duda, Hart and Stork School of Computer Science & Statistics Trinity College Dublin Dublin

### Classification with Hybrid Generative/Discriminative Models

Classification with Hybrid Generative/Discriminative Models Rajat Raina, Yirong Shen, Andrew Y. Ng Computer Science Department Stanford University Stanford, CA 94305 Andrew McCallum Department of Computer

### Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

### i=1 In practice, the natural logarithm of the likelihood function, called the log-likelihood function and denoted by

Statistics 580 Maximum Likelihood Estimation Introduction Let y (y 1, y 2,..., y n be a vector of iid, random variables from one of a family of distributions on R n and indexed by a p-dimensional parameter

### Probabilistic Discriminative Kernel Classifiers for Multi-class Problems

c Springer-Verlag Probabilistic Discriminative Kernel Classifiers for Multi-class Problems Volker Roth University of Bonn Department of Computer Science III Roemerstr. 164 D-53117 Bonn Germany roth@cs.uni-bonn.de

### Notes for STA 437/1005 Methods for Multivariate Data

Notes for STA 437/1005 Methods for Multivariate Data Radford M. Neal, 26 November 2010 Random Vectors Notation: Let X be a random vector with p elements, so that X = [X 1,..., X p ], where denotes transpose.

### Big Data Analytics CSCI 4030

High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

### Introduction to Online Learning Theory

Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

### University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 8: Multi-Layer Perceptrons

University of Cambridge Engineering Part IIB Module 4F0: Statistical Pattern Processing Handout 8: Multi-Layer Perceptrons x y (x) Inputs x 2 y (x) 2 Outputs x d First layer Second Output layer layer y

### Bayesian logistic betting strategy against probability forecasting. Akimichi Takemura, Univ. Tokyo. November 12, 2012

Bayesian logistic betting strategy against probability forecasting Akimichi Takemura, Univ. Tokyo (joint with Masayuki Kumon, Jing Li and Kei Takeuchi) November 12, 2012 arxiv:1204.3496. To appear in Stochastic

### D-optimal plans in observational studies

D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

### A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

### Markov Chain Monte Carlo Simulation Made Simple

Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical

### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

### Natural Conjugate Gradient in Variational Inference

Natural Conjugate Gradient in Variational Inference Antti Honkela, Matti Tornio, Tapani Raiko, and Juha Karhunen Adaptive Informatics Research Centre, Helsinki University of Technology P.O. Box 5400, FI-02015

### Robotics 2 Clustering & EM. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard

Robotics 2 Clustering & EM Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard 1 Clustering (1) Common technique for statistical data analysis to detect structure (machine learning,

### Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples

### Note on the EM Algorithm in Linear Regression Model

International Mathematical Forum 4 2009 no. 38 1883-1889 Note on the M Algorithm in Linear Regression Model Ji-Xia Wang and Yu Miao College of Mathematics and Information Science Henan Normal University

### 1. χ 2 minimization 2. Fits in case of of systematic errors

Data fitting Volker Blobel University of Hamburg March 2005 1. χ 2 minimization 2. Fits in case of of systematic errors Keys during display: enter = next page; = next page; = previous page; home = first

### Cell Phone based Activity Detection using Markov Logic Network

Cell Phone based Activity Detection using Markov Logic Network Somdeb Sarkhel sxs104721@utdallas.edu 1 Introduction Mobile devices are becoming increasingly sophisticated and the latest generation of smart

### Probability Theory. Elementary rules of probability Sum rule. Product rule. p. 23

Probability Theory Uncertainty is key concept in machine learning. Probability provides consistent framework for the quantification and manipulation of uncertainty. Probability of an event is the fraction

### Monte Carlo Simulation

1 Monte Carlo Simulation Stefan Weber Leibniz Universität Hannover email: sweber@stochastik.uni-hannover.de web: www.stochastik.uni-hannover.de/ sweber Monte Carlo Simulation 2 Quantifying and Hedging

### Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and

### A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA

REVSTAT Statistical Journal Volume 4, Number 2, June 2006, 131 142 A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA Authors: Daiane Aparecida Zuanetti Departamento de Estatística, Universidade Federal de São

### Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a

### 1. The maximum likelihood principle 2. Properties of maximum-likelihood estimates

The maximum-likelihood method Volker Blobel University of Hamburg March 2005 1. The maximum likelihood principle 2. Properties of maximum-likelihood estimates Keys during display: enter = next page; =

### VI. Introduction to Logistic Regression

VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models