# Pattern Analysis. Logistic Regression. 12 May 2009. Joachim Hornegger. Chair of Pattern Recognition, Erlangen University


## Transcription

Pattern Analysis: Logistic Regression. 12 May 2009. Joachim Hornegger, Chair of Pattern Recognition, Erlangen University.

Outline: Logistic Regression

- Posteriors and the Logistic Function
- Decision Boundary
- Learning in Logistic Regression
- Log-Likelihood Function
- Gradient
- Perceptron and Logistic Regression
- Lessons Learned
- Further Readings

Logistic regression is a discriminative model, because it models the posterior probabilities directly.

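### Posteriors and the Logistic Function

For two classes $y \in \{0, 1\}$ we get:

$$
\begin{aligned}
p(y = 0 \mid x) &= \frac{p(y = 0)\, p(x \mid y = 0)}{p(x)} \\
&= \frac{p(y = 0)\, p(x \mid y = 0)}{p(y = 0)\, p(x \mid y = 0) + p(y = 1)\, p(x \mid y = 1)} \\
&= \frac{1}{1 + \dfrac{p(y = 1)\, p(x \mid y = 1)}{p(y = 0)\, p(x \mid y = 0)}}
\end{aligned}
$$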

$$
p(y = 0 \mid x)
= \frac{1}{1 + e^{\log \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 0)\, p(x \mid y = 0)}}}
= \frac{1}{1 + e^{-\left(\log \frac{p(y = 0)}{p(y = 1)} + \log \frac{p(x \mid y = 0)}{p(x \mid y = 1)}\right)}}
$$

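We see that the posterior can be written in terms of a logistic function. With $F(x) = \log \frac{p(y = 0)}{p(y = 1)} + \log \frac{p(x \mid y = 0)}{p(x \mid y = 1)}$ we have

$$
p(y = 0 \mid x) = \frac{1}{1 + e^{-F(x)}}
$$

and thus for the other posterior

$$
p(y = 1 \mid x) = 1 - p(y = 0 \mid x) = \frac{e^{-F(x)}}{1 + e^{-F(x)}} = \frac{1}{1 + e^{F(x)}}
$$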
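The equivalence between Bayes' rule and the logistic form can be checked numerically. The following is a minimal sketch, assuming one-dimensional Gaussian class-conditionals with a shared standard deviation; the function names and the parameter values are illustrative and not part of the lecture.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D normal distribution N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_bayes(x, prior0, mu0, mu1, sigma):
    """p(y=0 | x) computed directly from Bayes' rule."""
    joint0 = prior0 * gaussian_pdf(x, mu0, sigma)
    joint1 = (1.0 - prior0) * gaussian_pdf(x, mu1, sigma)
    return joint0 / (joint0 + joint1)


def log_odds(x, prior0, mu0, mu1, sigma):
    """F(x): log prior ratio plus log likelihood ratio."""
    prior_term = math.log(prior0 / (1.0 - prior0))
    likelihood_term = math.log(
        gaussian_pdf(x, mu0, sigma) / gaussian_pdf(x, mu1, sigma)
    )
    return prior_term + likelihood_term


def posterior_logistic(x, prior0, mu0, mu1, sigma):
    """p(y=0 | x) written as 1 / (1 + exp(-F(x)))."""
    return 1.0 / (1.0 + math.exp(-log_odds(x, prior0, mu0, mu1, sigma)))


# Hypothetical parameters chosen only for illustration.
for x in (-1.0, 0.0, 0.5, 2.0):
    a = posterior_bayes(x, 0.3, 0.0, 1.0, 1.0)
    b = posterior_logistic(x, 0.3, 0.0, 1.0, 1.0)
    print(f"x={x:+.1f}  bayes={a:.6f}  logistic={b:.6f}")
```

Both columns should agree up to floating-point rounding, since the two expressions are algebraically identical.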

Definition: The logistic function (also called sigmoid function) is defined by

$$
g(x) = \frac{1}{1 + e^{-x}}
$$

where $x \in \mathbb{R}$.

The derivative of the sigmoid function fulfills the nice property:

$$
\begin{aligned}
g'(x) &= \frac{-1}{(1 + e^{-x})^2} \cdot \left(-e^{-x}\right) \\
&= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\
&= \frac{1}{1 + e^{-x}} \cdot \frac{1}{1 + e^{x}} \\
&= g(x)\, g(-x) = g(x)\,(1 - g(x)).
\end{aligned}
$$
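The derivative identity is easy to sanity-check against a finite-difference approximation. This is a small sketch; the helper `central_difference` and the step size are assumptions made for the illustration.

```python
import math


def g(x):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-x))


def g_prime(x):
    """Closed-form derivative g(x) * (1 - g(x))."""
    return g(x) * (1.0 - g(x))


def central_difference(f, x, h=1e-5):
    """Numerical derivative of f at x using a symmetric difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    analytic = g_prime(x)
    numeric = central_difference(g, x)
    symmetric = g(x) * g(-x)  # the g(x) g(-x) form of the same derivative
    print(f"x={x:+.1f}  analytic={analytic:.8f}  numeric={numeric:.8f}  g(x)g(-x)={symmetric:.8f}")
```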

Figure: Sigmoid function $g(ax) = 1/(1 + e^{-ax})$ for $a = 1, 2, 3, 4$.

### Decision Boundary

The decision boundary $\delta(x) = 0$ (zero level set) in feature space separates the two classes. Points $x$ on the decision boundary satisfy

$$
p(y = 0 \mid x) = p(y = 1 \mid x)
$$

and thus

$$
\log \frac{p(y = 0 \mid x)}{p(y = 1 \mid x)} = \log 1 = 0.
$$

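Lemma: The decision boundary is given by $F(x) = 0$.

Proof: Let $F(x)$ denote the log-ratio of the posteriors, which is zero on the decision boundary:

$$
\begin{aligned}
\log \frac{p(y = 0 \mid x)}{p(y = 1 \mid x)} &= F(x) \\
\frac{p(y = 0 \mid x)}{p(y = 1 \mid x)} &= e^{F(x)} \\
p(y = 0 \mid x) &= e^{F(x)}\, p(y = 1 \mid x) \\
p(y = 0 \mid x) &= e^{F(x)}\, (1 - p(y = 0 \mid x))
\end{aligned}
$$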

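In the last step we used that the posteriors sum up to one. Solving for $p(y = 0 \mid x)$:

$$
\begin{aligned}
p(y = 0 \mid x) &= e^{F(x)}\, (1 - p(y = 0 \mid x)) \\
p(y = 0 \mid x) &= \frac{e^{F(x)}}{1 + e^{F(x)}} \\
p(y = 0 \mid x) &= \frac{1}{1 + e^{-F(x)}}
\end{aligned}
$$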

Figure: Two Gaussians and their posteriors: $\sigma_0 = \sigma_1 = 0.2$, $\mu_0 = 2$, $\mu_1 = 1$.

Example: Let us assume both classes have normally distributed $d$-dimensional feature vectors:

$$
p(x \mid y) = \frac{1}{\sqrt{\det 2\pi\Sigma_y}}\, e^{-\frac{1}{2} (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y)}
$$

Then we can write the posterior of $y = 0$ in terms of a logistic function:

$$
p(y = 0 \mid x) = \frac{1}{1 + e^{-(x^T A x + \alpha^T x + \alpha_0)}}
$$

$$
\log \frac{p(y = 0 \mid x)}{p(y = 1 \mid x)}
= \log \frac{p(y = 0)}{p(y = 1)}
+ \log \frac{\frac{1}{\sqrt{\det 2\pi\Sigma_0}}\, e^{-\frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)}}{\frac{1}{\sqrt{\det 2\pi\Sigma_1}}\, e^{-\frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}}
$$

This function has the constant component:

$$
c = \log \frac{p(y = 0)}{p(y = 1)} + \frac{1}{2} \log \frac{\det 2\pi\Sigma_1}{\det 2\pi\Sigma_0}
$$

We observe:

- Priors imply a constant offset of the decision boundary.
- If priors and covariance matrices of both classes are identical, this offset is $c = 0$.

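Furthermore we have:

$$
\begin{aligned}
\log \frac{e^{-\frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)}}{e^{-\frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}}
&= \frac{1}{2} \left( (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) \right) \\
&= \frac{1}{2} \left( x^T (\Sigma_1^{-1} - \Sigma_0^{-1})\, x - 2\, (\mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1})\, x + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right)
\end{aligned}
$$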

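Now we have:

$$
\begin{aligned}
A &= \frac{1}{2} \left( \Sigma_1^{-1} - \Sigma_0^{-1} \right) \\
\alpha^T &= \mu_0^T \Sigma_0^{-1} - \mu_1^T \Sigma_1^{-1} \\
\alpha_0 &= \log \frac{p(y = 0)}{p(y = 1)} + \frac{1}{2} \left( \log \frac{\det 2\pi\Sigma_1}{\det 2\pi\Sigma_0} + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right)
\end{aligned}
$$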
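To make the coefficients concrete, here is a hedged sketch for the one-dimensional special case, where $\Sigma_y$ reduces to a scalar variance $\sigma_y^2$ and $\det 2\pi\Sigma_y = 2\pi\sigma_y^2$. The function names and the numeric parameters are assumptions for illustration only. The sketch compares $A x^2 + \alpha x + \alpha_0$ with the log posterior ratio computed directly from the densities.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a 1-D normal distribution with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def quadratic_coefficients(prior0, mu0, var0, mu1, var1):
    """Scalar versions of A, alpha and alpha_0 from the slide."""
    A = 0.5 * (1.0 / var1 - 1.0 / var0)
    alpha = mu0 / var0 - mu1 / var1
    alpha0 = math.log(prior0 / (1.0 - prior0)) + 0.5 * (
        math.log(var1 / var0) + mu1 ** 2 / var1 - mu0 ** 2 / var0
    )
    return A, alpha, alpha0


def direct_log_ratio(x, prior0, mu0, var0, mu1, var1):
    """log p(y=0|x) / p(y=1|x) computed from priors and densities."""
    joint0 = prior0 * gaussian_pdf(x, mu0, var0)
    joint1 = (1.0 - prior0) * gaussian_pdf(x, mu1, var1)
    return math.log(joint0 / joint1)


# Hypothetical class parameters.
params = dict(prior0=0.4, mu0=-1.0, var0=1.0, mu1=1.5, var1=4.0)
A, alpha, alpha0 = quadratic_coefficients(**params)
for x in (-2.0, 0.0, 1.0, 3.0):
    quadratic = A * x * x + alpha * x + alpha0
    print(f"x={x:+.1f}  quadratic={quadratic:.6f}  direct={direct_log_ratio(x, **params):.6f}")
```

With unequal variances $A \neq 0$, so the boundary $F(x) = 0$ is quadratic in $x$.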

Figure: Two sample sets and the Gaussian decision boundary.

Figure: Shift of the decision boundary by setting identical priors: $p(y) = 1/2$.

Example (cont.): If both classes share the same covariance, i.e. $\Sigma = \Sigma_0 = \Sigma_1$, then the argument of the sigmoid function is linear in the components of $x$:

$$
\begin{aligned}
A &= 0 \\
\alpha^T &= (\mu_0 - \mu_1)^T \Sigma^{-1} \\
\alpha_0 &= \log \frac{p(y = 0)}{p(y = 1)} + \frac{1}{2} (\mu_0 + \mu_1)^T \Sigma^{-1} (\mu_1 - \mu_0)
\end{aligned}
$$
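In one dimension the shared-covariance case gives a single boundary point $x^* = -\alpha_0 / \alpha$. The sketch below assumes scalar features and uses $\alpha_0$ with the factor $\tfrac{1}{2}$ that follows from the general coefficients; function names and values are illustrative assumptions.

```python
import math


def linear_coefficients(prior0, mu0, mu1, var):
    """Scalar alpha and alpha_0 for two Gaussians sharing one variance."""
    alpha = (mu0 - mu1) / var
    alpha0 = math.log(prior0 / (1.0 - prior0)) + 0.5 * (mu0 + mu1) * (mu1 - mu0) / var
    return alpha, alpha0


def boundary(prior0, mu0, mu1, var):
    """Point x* where alpha * x + alpha_0 = 0 (requires mu0 != mu1)."""
    alpha, alpha0 = linear_coefficients(prior0, mu0, mu1, var)
    return -alpha0 / alpha


# With identical priors the boundary lies midway between the means.
print(boundary(0.5, 0.0, 2.0, 1.0))
# A larger prior for class 0 moves the boundary towards the mean of class 1.
print(boundary(0.8, 0.0, 2.0, 1.0))
```

This mirrors the observation about priors: changing $p(y)$ only changes the offset, not the direction of the boundary.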

Figure: Identical covariances lead to a linear decision boundary.

Figure: Quadratic and linear decision boundary in comparison.

Note: If the class conditionals are Gaussians and share the same covariance, the argument of the exponential function is affine in $x$. This result holds for a more general family of pdfs and is not limited to Gaussians.

Definition: The exponential family is a class of pdfs that can be written in the following canonical form

$$
p(x; \theta, \phi) = \exp\!\left( \frac{\theta^T x - b(\theta)}{a(\phi)} + c(x, \phi) \right)
$$

where $\theta \in \mathbb{R}^d$ is the location parameter vector and $\phi$ the dispersion parameter.

Example: Binomial, Poisson, hypergeometric, exponential distributions or Gaussians belong to the exponential family.
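As one concrete instance, the Poisson distribution with rate $\lambda$ fits the canonical form with $\theta = \log \lambda$, $b(\theta) = e^{\theta}$, $a(\phi) = 1$ and $c(x, \phi) = -\log x!$. The sketch below checks this numerically; the function names and the rate value are assumptions for illustration.

```python
import math


def poisson_pmf(x, lam):
    """Textbook Poisson probability mass function."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


def poisson_canonical(x, lam):
    """The same pmf written as exp((theta * x - b(theta)) / a + c(x))."""
    theta = math.log(lam)        # natural (location) parameter
    b = math.exp(theta)          # b(theta) = e^theta = lambda
    a = 1.0                      # no free dispersion for the Poisson
    c = -math.lgamma(x + 1)      # c(x) = -log(x!)
    return math.exp((theta * x - b) / a + c)


for x in range(6):
    print(x, poisson_pmf(x, 3.5), poisson_canonical(x, 3.5))
```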

Lemma: If all class-conditional densities are members of the same exponential family distribution with equal dispersion $\phi$, the decision boundary $F(x) = 0$ is linear in the components of $x$.
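A short derivation sketch shows why the lemma holds. The class-specific parameters $\theta_0$ and $\theta_1$ are notation introduced here for illustration; the slides do not name them. Inserting the canonical form into the log-ratio of the class-conditionals, the term $c(x, \phi)$ cancels because both classes share the same dispersion:

```latex
\begin{aligned}
\log \frac{p(x; \theta_0, \phi)}{p(x; \theta_1, \phi)}
&= \frac{\theta_0^T x - b(\theta_0)}{a(\phi)} + c(x, \phi)
 - \frac{\theta_1^T x - b(\theta_1)}{a(\phi)} - c(x, \phi) \\
&= \frac{(\theta_0 - \theta_1)^T x - b(\theta_0) + b(\theta_1)}{a(\phi)} \\[4pt]
F(x) &= \log \frac{p(y = 0)}{p(y = 1)}
 + \frac{(\theta_0 - \theta_1)^T x - b(\theta_0) + b(\theta_1)}{a(\phi)}
\end{aligned}
```

The only dependence on $x$ is through $(\theta_0 - \theta_1)^T x$, so $F(x) = 0$ describes a hyperplane.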

### Log-Likelihood Function

Let us assume the posteriors are given by

$$
\begin{aligned}
p(y = 0 \mid x) &= 1 - g(\theta^T x) \\
p(y = 1 \mid x) &= g(\theta^T x)
\end{aligned}
$$

where $g(\theta^T x)$ is the sigmoid function parameterized in $\theta$. The parameter vector $\theta$ has to be estimated from a set $S$ of $m$ training samples:

$$
S = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_m, y_m)\}.
$$

Method of choice: maximum likelihood estimation.

Before we work on the formulas of the ML estimator, we rewrite the posteriors using the Bernoulli probability:

$$
p(y \mid x) = g(\theta^T x)^{y}\, \left(1 - g(\theta^T x)\right)^{1 - y}
$$

which shows the great benefit of the chosen notation for class numbers.

Now we can compute the log-likelihood function (assuming that the training samples are mutually independent):

$$
\begin{aligned}
l(\theta) &= \sum_{i=1}^{m} \log p(y_i \mid x_i) \\
&= \sum_{i=1}^{m} \log \left[ g(\theta^T x_i)^{y_i}\, \left(1 - g(\theta^T x_i)\right)^{1 - y_i} \right] \\
&= \sum_{i=1}^{m} y_i \log g(\theta^T x_i) + (1 - y_i) \log\left(1 - g(\theta^T x_i)\right)
\end{aligned}
$$
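The log-likelihood translates almost literally into code. This is a minimal sketch using plain Python lists; the toy training set, which prepends a constant 1 to each feature vector as a bias term, is a hypothetical example and not data from the lecture.

```python
import math


def sigmoid(z):
    """Logistic function g(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(theta, samples):
    """l(theta) = sum_i y_i log g(theta^T x_i) + (1 - y_i) log(1 - g(theta^T x_i))."""
    total = 0.0
    for x, y in samples:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Hypothetical samples (x, y) with x = [1, feature]; the classes overlap.
samples = [([1.0, 0.5], 0), ([1.0, 1.0], 0), ([1.0, 1.5], 1),
           ([1.0, 2.0], 0), ([1.0, 2.5], 1), ([1.0, 3.0], 1)]

print(log_likelihood([0.0, 0.0], samples))   # every posterior equals 0.5
print(log_likelihood([-2.8, 1.6], samples))  # a better-fitting parameter vector
```

For very large $|\theta^T x|$ the naive `math.log(p)` can hit `log(0)`; a production implementation would use a numerically stable formulation.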

Notes for the expert:

- The negative of the log-likelihood function is the cross entropy of $y$ and $g(\theta^T x)$.
- The negative of the log-likelihood function is a convex function.

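### Gradient of the Log-Likelihood Function

The gradient:

$$
\frac{\partial}{\partial \theta_j} l(\theta) = \sum_{i=1}^{m} \left( \frac{y_i}{g(\theta^T x_i)} - \frac{1 - y_i}{1 - g(\theta^T x_i)} \right) \frac{\partial}{\partial \theta_j} g(\theta^T x_i)
$$

Now we use the derivative of the sigmoid function and get

$$
\begin{aligned}
\frac{\partial}{\partial \theta_j} l(\theta)
&= \sum_{i=1}^{m} \left( \frac{y_i}{g(\theta^T x_i)} - \frac{1 - y_i}{1 - g(\theta^T x_i)} \right) g(\theta^T x_i) \left(1 - g(\theta^T x_i)\right) x_{i,j} \\
&= \sum_{i=1}^{m} \left( y_i \left(1 - g(\theta^T x_i)\right) - (1 - y_i)\, g(\theta^T x_i) \right) x_{i,j}
\end{aligned}
$$

where $x_{i,j}$ is the $j$-th component of the $i$-th training feature vector.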

Finally we have a quite simple gradient:

$$
\frac{\partial}{\partial \theta_j} l(\theta) = \sum_{i=1}^{m} \left( y_i - g(\theta^T x_i) \right) x_{i,j}
$$

where $x_{i,j}$ is the $j$-th component of the $i$-th training feature vector. Or in vector notation:

$$
\frac{\partial}{\partial \theta} l(\theta) = \sum_{i=1}^{m} \left( y_i - g(\theta^T x_i) \right) x_i
$$
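The closed-form gradient can be validated against finite differences of the log-likelihood. The sketch below is self-contained; the helper `numerical_gradient`, the toy samples and the test point are assumptions made for the illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(theta, samples):
    total = 0.0
    for x, y in samples:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def gradient(theta, samples):
    """Vector form: sum_i (y_i - g(theta^T x_i)) x_i."""
    grad = [0.0] * len(theta)
    for x, y in samples:
        residual = y - sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for j, xj in enumerate(x):
            grad[j] += residual * xj
    return grad


def numerical_gradient(theta, samples, h=1e-6):
    """Central differences of l(theta), one coordinate at a time."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += h
        down[j] -= h
        grad.append((log_likelihood(up, samples) - log_likelihood(down, samples)) / (2.0 * h))
    return grad


# Hypothetical samples with a constant bias feature.
samples = [([1.0, 0.5], 0), ([1.0, 1.0], 0), ([1.0, 1.5], 1),
           ([1.0, 2.0], 0), ([1.0, 2.5], 1), ([1.0, 3.0], 1)]
theta = [0.3, -0.2]
print(gradient(theta, samples))
print(numerical_gradient(theta, samples))
```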

### Hessian of the Log-Likelihood Function

The log-likelihood function is concave. We use the Newton-Raphson algorithm to solve the unconstrained optimization problem. For that purpose the Hessian is required (remember the derivative of the sigmoid function!):

$$
\frac{\partial^2}{\partial \theta\, \partial \theta^T} l(\theta) = -\sum_{i=1}^{m} g(\theta^T x_i) \left(1 - g(\theta^T x_i)\right) x_i x_i^T
$$

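### Newton-Raphson Iteration

For the $(k + 1)$-st iteration step, we get:

$$
\theta^{(k+1)} = \theta^{(k)} - \left( \frac{\partial^2}{\partial \theta\, \partial \theta^T} l(\theta) \right)^{-1} \frac{\partial}{\partial \theta} l(\theta)
$$

where gradient and Hessian are evaluated at $\theta^{(k)}$.

Note: If you write the Newton-Raphson iteration in matrix form, you will end up with a weighted least squares iteration scheme.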
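Putting gradient, Hessian and update rule together gives a compact fitting routine. The following is a sketch under stated assumptions: it uses plain Python lists, a small Gaussian-elimination helper `solve` in place of an explicit matrix inverse, a hypothetical non-separable toy data set, and no step-size control or regularization, which a robust implementation would likely need.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(theta, samples):
    """sum_i (y_i - g(theta^T x_i)) x_i"""
    grad = [0.0] * len(theta)
    for x, y in samples:
        r = y - sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for j, xj in enumerate(x):
            grad[j] += r * xj
    return grad


def hessian(theta, samples):
    """-sum_i g (1 - g) x_i x_i^T"""
    d = len(theta)
    H = [[0.0] * d for _ in range(d)]
    for x, _ in samples:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        w = p * (1.0 - p)
        for j in range(d):
            for k in range(d):
                H[j][k] -= w * x[j] * x[k]
    return H


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def newton_raphson(samples, iterations=25, tol=1e-10):
    """Iterate theta <- theta - H^{-1} grad, starting from theta = 0."""
    theta = [0.0] * len(samples[0][0])
    for _ in range(iterations):
        grad = gradient(theta, samples)
        if max(abs(gj) for gj in grad) < tol:
            break
        step = solve(hessian(theta, samples), grad)
        theta = [t - s for t, s in zip(theta, step)]
    return theta


# Hypothetical overlapping classes, so a finite maximum exists.
samples = [([1.0, 0.5], 0), ([1.0, 1.0], 0), ([1.0, 1.5], 1),
           ([1.0, 2.0], 0), ([1.0, 2.5], 1), ([1.0, 3.0], 1)]
theta_hat = newton_raphson(samples)
print(theta_hat, gradient(theta_hat, samples))
```

Solving the linear system $H\,\Delta = \nabla l$ instead of inverting $H$ is a common design choice; for linearly separable data the maximum-likelihood estimate does not exist and the iteration would diverge.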

### Perceptron and Logistic Regression

### Lessons Learned

- Posteriors can be rewritten in terms of a logistic function.
- Given the decision boundary $F(x) = 0$, we can write down the posterior $p(y \mid x)$ right away.
- The decision boundary for normally distributed feature vectors for each class is a quadratic function.
- If the Gaussians share the same covariance, the decision boundary is a linear function.

### Further Readings

- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
- D. W. Hosmer and S. Lemeshow. Applied Logistic Regression, 2nd edition. John Wiley & Sons, Hoboken, 2000.

### CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

### Christfried Webers. Canberra February June 2015

c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Linear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil.

Steven J Zeil Old Dominion Univ. Fall 200 Discriminant-Based Classification Linearly Separable Systems Pairwise Separation 2 Posteriors 3 Logistic Discrimination 2 Discriminant-Based Classification Likelihood-based:

### Generalized Linear Models. Today: definition of GLM, maximum likelihood estimation. Involves choice of a link function (systematic component)

Generalized Linear Models Last time: definition of exponential family, derivation of mean and variance (memorize) Today: definition of GLM, maximum likelihood estimation Include predictors x i through

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on

Pa8ern Recogni6on and Machine Learning Chapter 4: Linear Models for Classiﬁca6on Represen'ng the target values for classifica'on If there are only two classes, we typically use a single real valued output

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher

### Linear Models for Classification

Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### CS229 Lecture notes. Andrew Ng

CS229 Lecture notes Andrew Ng Supervised learning Let s start by talking about a few examples of supervised learning problems Suppose we have a dataset giving the living areas and prices of 47 houses from

### P (x) 0. Discrete random variables Expected value. The expected value, mean or average of a random variable x is: xp (x) = v i P (v i )

Discrete random variables Probability mass function Given a discrete random variable X taking values in X = {v 1,..., v m }, its probability mass function P : X [0, 1] is defined as: P (v i ) = Pr[X =

### Logistic Regression (1/24/13)

STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### Lecture 8 February 4

ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### The Exponential Family

The Exponential Family David M. Blei Columbia University November 3, 2015 Definition A probability density in the exponential family has this form where p.x j / D h.x/ expf > t.x/ a./g; (1) is the natural

### Wes, Delaram, and Emily MA751. Exercise 4.5. 1 p(x; β) = [1 p(xi ; β)] = 1 p(x. y i [βx i ] log [1 + exp {βx i }].

Wes, Delaram, and Emily MA75 Exercise 4.5 Consider a two-class logistic regression problem with x R. Characterize the maximum-likelihood estimates of the slope and intercept parameter if the sample for

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

### University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 8: Multi-Layer Perceptrons

University of Cambridge Engineering Part IIB Module 4F0: Statistical Pattern Processing Handout 8: Multi-Layer Perceptrons x y (x) Inputs x 2 y (x) 2 Outputs x d First layer Second Output layer layer y

### Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, and Discrete Changes

Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, Discrete Changes JunXuJ.ScottLong Indiana University August 22, 2005 The paper provides technical details on

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### i=1 In practice, the natural logarithm of the likelihood function, called the log-likelihood function and denoted by

Statistics 580 Maximum Likelihood Estimation Introduction Let y (y 1, y 2,..., y n be a vector of iid, random variables from one of a family of distributions on R n and indexed by a p-dimensional parameter

### The Probit Link Function in Generalized Linear Models for Data Mining Applications

Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/\$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications

### Maximum Likelihood Estimation

Math 541: Statistical Theory II Lecturer: Songfeng Zheng Maximum Likelihood Estimation 1 Maximum Likelihood Estimation Maximum likelihood is a relatively simple method of constructing an estimator for

### Introduction to Logistic Regression

OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction

### Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure

Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure Belyaev Mikhail 1,2,3, Burnaev Evgeny 1,2,3, Kapushev Yermek 1,2 1 Institute for Information Transmission

### Probability Theory. Elementary rules of probability Sum rule. Product rule. p. 23

Probability Theory Uncertainty is key concept in machine learning. Probability provides consistent framework for the quantification and manipulation of uncertainty. Probability of an event is the fraction

### Principle of Data Reduction

Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

### Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

### Linear Algebra Methods for Data Mining

Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 Lecture 3: QR, least squares, linear regression Linear Algebra Methods for Data Mining, Spring 2007, University

### 3. Convex functions. basic properties and examples. operations that preserve convexity. the conjugate function. quasiconvex functions

3. Convex functions Convex Optimization Boyd & Vandenberghe basic properties and examples operations that preserve convexity the conjugate function quasiconvex functions log-concave and log-convex functions

### 2.3 Convex Constrained Optimization Problems

42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions

### SYSM 6304: Risk and Decision Analysis Lecture 3 Monte Carlo Simulation

SYSM 6304: Risk and Decision Analysis Lecture 3 Monte Carlo Simulation M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu September 19, 2015 Outline

### Classification Problems

Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

### Logistic Regression for Data Mining and High-Dimensional Classification

Logistic Regression for Data Mining and High-Dimensional Classification Paul Komarek Dept. of Math Sciences Carnegie Mellon University komarek@cmu.edu Advised by Andrew Moore School of Computer Science

### Mathematical Background

Appendix A Mathematical Background A.1 Joint, Marginal and Conditional Probability Let the n (discrete or continuous) random variables y 1,..., y n have a joint joint probability probability p(y 1,...,

### Reject Inference in Credit Scoring. Jie-Men Mok

Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

### Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

### Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### Introduction to Convex Optimization for Machine Learning

Introduction to Convex Optimization for Machine Learning John Duchi University of California, Berkeley Practical Machine Learning, Fall 2009 Duchi (UC Berkeley) Convex Optimization for Machine Learning

### LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

### Gaussian Conjugate Prior Cheat Sheet

Gaussian Conjugate Prior Cheat Sheet Tom SF Haines 1 Purpose This document contains notes on how to handle the multivariate Gaussian 1 in a Bayesian setting. It focuses on the conjugate prior, its Bayesian

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### Master s Theory Exam Spring 2006

Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem

### Classification by Pairwise Coupling

Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating

### 1 Prior Probability and Posterior Probability

Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which

### The Method of Least Squares

The Method of Least Squares Steven J. Miller Mathematics Department Brown University Providence, RI 0292 Abstract The Method of Least Squares is a procedure to determine the best fit line to data; the

### Cheng Soon Ong & Christfried Webers. Canberra February June 2016

c Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 31 c Part I

### SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2015 Timo Koski Matematisk statistik 24.09.2015 1 / 1 Learning outcomes Random vectors, mean vector, covariance matrix,

### NON-LIFE INSURANCE PRICING USING THE GENERALIZED ADDITIVE MODEL, SMOOTHING SPLINES AND L-CURVES

NON-LIFE INSURANCE PRICING USING THE GENERALIZED ADDITIVE MODEL, SMOOTHING SPLINES AND L-CURVES Kivan Kaivanipour A thesis submitted for the degree of Master of Science in Engineering Physics Department

### Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

### CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Lecture 1 Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x-5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

### Statistical Machine Learning from Data

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

### GLM, insurance pricing & big data: paying attention to convergence issues.

GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK - michael.noack@addactis.com Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.

### Factorial experimental designs and generalized linear models

Statistics & Operations Research Transactions SORT 29 (2) July-December 2005, 249-268 ISSN: 1696-2281 www.idescat.net/sort Statistics & Operations Research c Institut d Estadística de Transactions Catalunya

### Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

### Models for Count Data With Overdispersion

Models for Count Data With Overdispersion Germán Rodríguez November 6, 2013 Abstract This addendum to the WWS 509 notes covers extra-poisson variation and the negative binomial model, with brief appearances

### A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

### Definition of a Linear Program

Definition of a Linear Program Definition: A function f(x 1, x,..., x n ) of x 1, x,..., x n is a linear function if and only if for some set of constants c 1, c,..., c n, f(x 1, x,..., x n ) = c 1 x 1

### Time Series Analysis

Time Series Analysis hm@imm.dtu.dk Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby 1 Outline of the lecture Identification of univariate time series models, cont.:

### Introduction to Generalized Linear Models

to Generalized Linear Models Heather Turner ESRC National Centre for Research Methods, UK and Department of Statistics University of Warwick, UK WU, 2008 04 22-24 Copyright c Heather Turner, 2008 to Generalized

### (Quasi-)Newton methods

(Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

### Nonlinear Optimization: Algorithms 3: Interior-point methods

Nonlinear Optimization: Algorithms 3: Interior-point methods INSEAD, Spring 2006 Jean-Philippe Vert Ecole des Mines de Paris Jean-Philippe.Vert@mines.org Nonlinear optimization c 2006 Jean-Philippe Vert,

### Notes for STA 437/1005 Methods for Multivariate Data

Notes for STA 437/1005 Methods for Multivariate Data Radford M. Neal, 26 November 2010 Random Vectors Notation: Let X be a random vector with p elements, so that X = [X 1,..., X p ], where denotes transpose.

### MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

### THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE Alexer Barvinok Papers are available at http://www.math.lsa.umich.edu/ barvinok/papers.html This is a joint work with J.A. Hartigan

### CSI:FLORIDA. Section 4.4: Logistic Regression

CSI:FLORIDA Section 4.4: Logistic Regression CSI:FLORIDA Revisit Masked Class Problem [Figure: plot for the masked class problem] We can generalize this problem to a two-class problem as well! CSI:FLORIDA Revisit Masked Class

### MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr

Machine Learning for Computer Vision 1 MVA ENS Cachan Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Department of Applied Mathematics Ecole Centrale Paris Galen

### DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

### Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance 1 Overview So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics: Behavioural

### Programming Exercise 3: Multi-class Classification and Neural Networks

Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks

### Classification. Chapter 3

Chapter 3 Classification In chapter we have considered regression problems, where the targets are real valued. Another important class of problems is classification problems, where we wish to assign an

### 1. χ 2 minimization 2. Fits in case of of systematic errors

Data fitting Volker Blobel University of Hamburg March 2005 1. χ 2 minimization 2. Fits in case of of systematic errors Keys during display: enter = next page; = next page; = previous page; home = first

### What you CANNOT ignore about Probs and Stats

What you CANNOT ignore about Probs and Stats by Your Teacher Version 1.0.3 November 5, 2009 Introduction The Finance master is conceived as a postgraduate course and contains a sizable quantitative section.

### Algebra Unpacked Content For the new Common Core standards that will be effective in all North Carolina schools in the 2012-13 school year.

This document is designed to help North Carolina educators teach the Common Core (Standard Course of Study). NCDPI staff are continually updating and improving these tools to better serve teachers. Algebra

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### Probabilistic Discriminative Kernel Classifiers for Multi-class Problems

© Springer-Verlag Probabilistic Discriminative Kernel Classifiers for Multi-class Problems Volker Roth University of Bonn Department of Computer Science III Roemerstr. 164 D-53117 Bonn Germany roth@cs.uni-bonn.de

### Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

### MATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators...

MATH4427 Notebook 2 Spring 2016 prepared by Professor Jenny Baglivo © Copyright 2009-2016 by Jenny A. Baglivo. All Rights Reserved. Contents 2 MATH4427 Notebook 2 2.1 Definitions and Examples ...

### Quadratic forms Cochran's theorem, degrees of freedom, and all that

Quadratic forms Cochran's theorem, degrees of freedom, and all that Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 1, Slide 1 Why We Care Cochran's theorem tells us

### Computer exercise 4 Poisson Regression

Chalmers-University of Gothenburg Department of Mathematical Sciences Probability, Statistics and Risk MVE300 Computer exercise 4 Poisson Regression When dealing with two or more variables, the functional

### Efficient Streaming Classification Methods

1/44 Efficient Streaming Classification Methods Niall M. Adams 1, Nicos G. Pavlidis 2, Christoforos Anagnostopoulos 3, Dimitris K. Tasoulis 1 1 Department of Mathematics 2 Institute for Mathematical Sciences

### GI01/M055 Supervised Learning Proximal Methods

GI01/M055 Supervised Learning Proximal Methods Massimiliano Pontil (based on notes by Luca Baldassarre) (UCL) Proximal Methods 1 / 20 Today's Plan Problem setting Convex analysis concepts Proximal operators