# Statistical Machine Learning

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25

2 Linear separation For two classes in R d : simple idea: separate the classes using a hyperplane H w,b = {X : X w + b 0} ; Simplest extension for several classes: consider a family of linear scores s y (x) = w y x b y and the rule f (x) = Arg Max s y (x). y Y Then the separation between any two classes is linear. Using linear regression 2 / 25

3 Classification via linear regression Simplest idea for two classes: perform a standard linear regression of Y (coded e.g. in {0, 1}) by X, ŵ = (X T X) 1 X T Y where X is the (n, d + 1) extended data matrix and Y the (n, 1) vector of training classes; For a new point x, the linear regression function predicts x ŵ, and the decision function would be 1{x ŵ 1 2 }. Using linear regression 3 / 25

4 We can extend this idea to K classes by performing regression on each of the class indicator variables 1{Y = y}, y Y. In matrix form: same as above, replacing Y by the matrix of indicator responses. This is equivalent to solving globally the least squares problem: min W n Y i X i W 2, i=1 where W is a coefficient matrix and Y i is the class indicator vector. Using linear regression 4 / 25

5 Problems using linear regression For multiple classes, a masking problem is likely to occur. A possible fix is to extend the data vectors with quadratic components. Two disadvantages however: masking can still occur when there are many classes. increasing the degree of the components lead to too many parameters and overfitting. Using linear regression 5 / 25

6 Separating two Gaussians We can adopt a simple parametric generative approach and model the classes by simple Gaussians. Assume we take a Gaussian generative model for the classes distribution: 1 p(x Y = i) = ( (2π) d Σ i exp 1 ) 2 (x m i) T Σ 1 i (x m i ) What is the Bayes classifier for this model? Remember that if the generating densities for classes 0 and 1 are f 0, f 1 and the marginal probability p = P(Y = 1) then the Bayes decision is given by F(x) = Arg Max(pf 1 (x), (1 p)f 2 (x)). Gaussian case LDA and QDA 6 / 25

7 Hence, denoting d i the Mahalanobis distance corresponding to Σ i, d 2 i (x, y) = (x y) T Σ 1 i (x y), Then the decision rule is class 1 or 2 depending whether d 2 1 (x, m 1) d 2 2 (x, m 2) t(p, Σ 1, Σ 2 ) ; it is a quadratic decision rule (QDA). Case Σ 1 = Σ 2 : becomes a linear decision rule (LDA). We can use a pooled estimate for the two covariance matrices: Σ = 1 n 2 (S 1 + S 2 ), where S l = i:y i =l (X i m l )(X i m l ) T. Gaussian case LDA and QDA 7 / 25

8 Multiclass: the previous analysis suggests to look at the criterion where or Arg Min δ y (x), y Y δ y (x) = 1 2 (x m y) T Σ 1 y (x m y ) + t y (p y, Σ y ), in the general QDA case (then the decision regions are intersections of quadratic regions), δ y (x) = x T Σ 1 m y + t y (p y ), in the common variance (LDA) case (then the decision regions are intersections of half-planes) Gaussian case LDA and QDA 8 / 25

9 Relation to linear regression Theorem In the two-class case the direction of w found by the Gaussian model coincides with the one found by classical linear regression.... but the constants b differ. In practice it is recommended not to trust either but to consider this as a separate parameter to optimize to reduce the empirical classification error. Regression using quadratic terms does not give the same result as QDA. Gaussian case LDA and QDA 9 / 25

10 Fisher s linear discriminant Yet another approach to the problem: find the projection maximizing the ratio of inter-class to intra-class variance, for two classes: J(w) = (w m 1 w m 2 ) 2 w T (S 1 + S 2 )w, where m l are the empirical class means and S l = i:y i =l (X i m l )(X i m l ) T Finding dj dw = 0 leads to the solution (again, the scaling is arbitrary). w = λ(s 1 + S 2 ) 1 ( m 1 m 2 ) The projection direction coincides with the previous methods; Fisher s criterion only provides the projection direction (again, optimize the constant separately) Fisher s discriminant analysis 10 / 25

11 Fisher s discriminant in multi-class Fisher s criterion can be extended to the multi-class case by maximizing the ratio (Rayleigh coefficient) J(w) = w T Mw w T Sw where S = y S y is the pooled intraclass covariance and M = y (m y m)(m y m) T is the interclass covariance (covariance of the class centroids). (Note that normalization of the matrices is unimportant) Leads to the generalized eigenvalue problem Mw = λsw Can be iterated to find Y 1 dimensions by constraining orthogonality (for the scalar product w, w = w T Sw ) with previously found directions. Equivalent to the following: whiten the data by applying S 1 2 ; perform PCA on the transformed class centroids; apply S 1 2 to the found directions. Fisher s discriminant analysis 11 / 25

12 Properties of Fisher s canonical projections This is a linear dimension reduction method aimed at separating the classes (using 1st and 2nd moment information only). Invariant by any linear transform of the input space. When we take L = min(d, Y 1) canonical coordinates, this commutes with LDA. When we take L < min(d, Y 1) canonical coordinates, this is equivalent to a reduced rank LDA, i.e. where we require the mean of the Gaussians in the model to belong to a space of dimension L (and perform ML fitting). It can also be seen as a CCA of X wrt. the class indicator function Y. Fisher s discriminant analysis 12 / 25

13 Regularized linear and quadratic discriminant When the dimension d is too large, overfitting and instability can occur. Looking back at standard linear regression, a possible is ridge regression finding 2 N d p β λ = Arg Min Y i β 0 x ij β j + λ βj 2 ; β i=1 The solution is given by regularization by shrinkage. j=1 β 1 i d = (X T X + λi) 1 X T Y i=1 Fisher s discriminant analysis 13 / 25

14 By a (weak) analogy with ridge regression we can consider the following regularized version for the covariance estimation in LDA: another possibility is Σ γ = γ Σ + (1 γ) σ 2 I Σ γ = γ Σ + (1 γ)d, where D is the diagonal matrix formed with entries σ i 2. We can also regularize QDA using the following scheme for the estimator of the covariance matrix for class k: Σ k (α) = α Σ k + (1 α) Σ... we can even combine the two. In practice, as usual it is recommended to use cross-validation to tune the parameters. Fisher s discriminant analysis 14 / 25

15 Linear Logistic regression Recall that in the 2-class case logistic regression aims at finding the log-odds ratio function log P(Y = 1 X = x) P(Y = 0 X = x) = log η(x) 1 η(x) ; In the multiclass case, this can be generalized to log-odds ratio wrt. some (arbitrary) reference class: s i (x) = log P(Y = i X = x) P(Y = 0 X = x) ; again the resulting (plug-in) classifier outputs the max of the score functions. If we model scores by linear functions, we get again a linear classifier. Using logistic regression 15 / 25

16 The model s i (x) = β i0 + β i x gives rise to the conditional class probabilities P(Y = i X = x) = we can fit this using Maximum Likelihood. exp (β i0 + β i x) 1 + K 1 l=1 exp (β l0 + β l x), Using logistic regression 16 / 25

17 Algorithm for linear logistic regression We consider the 2-class case (Y = {0, 1}). The log-likelihood function is l(β) = n (Y i β X i log(1 + exp(β X i ))) ; i=1 (where the data points X i are augmented with a contant coordinate) and dl dβ = n X i (Y i η(x i, β)) ( = 0 ) ; i=1 we can solve this using a Newton-Raphson algorithm with step ( β new = β d old 2 ) 1 l dl dβdβ T dβ. Using logistic regression 17 / 25

18 We have ( d 2 ) l dβdβ T = n i=1 X i X T i η(x i, β)(1 η(x i, β)) ; If we denote W the diagonal matrix of weights η(x i, β)(1 η(x i, β)), we can rewrite the NR step in matrix form as β new = (X T WX) 1 X T WZ, where Z = X β old + W 1 (Y η). This can be seen as an iterated modified least squares fitting. Using logistic regression 18 / 25

19 LDA vs. logistic regression LDA and logistic regression both fit a linear model to the log-odds ratio. They do not result in the same output however. Why? The answer is that logistic regression only fits the conditional densities P(Y = i X) and remains agnostic as to the distribution of covariate X. LDA on the other hand implicitly fits a distribution for the joint distribution P(Y X) (mixture of Gaussians). In practice, logistic regression is therefore considered more adaptive, but also less robust. Using logistic regression 19 / 25

20 The linear perceptron Assume that the point clouds for two the classes in the training set turn out to be perfectly separable by some hyperplane. Then LDA will not necessarily return a hyperplane having zero training error. On the other hand, logistic regression will return infinite parameters (why?) Other approach: consider minimizing a criterion based on the distance of misclassified examples to the hyperplane: D(w, b) = Y i (X i w + b). i:y i (X i w+b)<0 (here we assume Y i { 1, 1}!). (Note that actually the average distance to the hyperplane would be w 1 D(w, b).) Principle of perceptron training (more or less): minimize the above by a kind of stochastic gradient descent. Linear Methods in high dimension 20 / 25

21 Convergence of perceptron training Very simple iterative rule: first, put R = max i X i. If all points are correctly classified, stop. If there are misclassified points, choose such a point (Xi, Y i ) arbitrarily. Theorem Put [ ] [ w new w old = Repeat. b new b old ] + [ ] Yi X i Y i R 2. If there exists (w, b ) a separating hyperplane, such that for all i Y i (w X i + b ) γ, then above algorithm will eventually find a separating hyperplane in a ( ) 2R 2 finite number of steps bounded by. γ Linear Methods in high dimension 21 / 25

22 Some problems with the perceptron algorithm: The number of steps required to converge can be large! If the classes are not separable, there is no guarantee of convergence. In fact, cycles can occur. There is no regularization and so no protection against overfitting (the number of steps can be used as regularization though) Linear Methods in high dimension 22 / 25

23 The naive Bayes classifier Assume that X is a vector of binary features (possibly in very high dimension, X {0, 1} d. We don t want to model very complicated dependencies. Naive assumption: given Y, the coordinates of X are independent! Can be seen as an elementary graphical model: Y X (1) X (2) (d)... X Linear Methods in high dimension 23 / 25

24 Assume we know the probability distribution of each coordinate (a Bernoulli variable) given Y, p k, 1 = P(X (k) = 1 Y = 1) ; p i,1 = P(X (k) = 1 Y = 1). In this case the Bayes classifier is given by the sign of log P(Y = 1 X) P(Y = 1 X) = k X (k) α k + a where α k = log p k,1(1 p k, 1 ) (1 p k,1 )p k, 1 ; a = k log 1 p k,1 P(Y = 1) + log 1 p k, 1 P(Y = 1) ; In practice: like for LDA, it is recommended to optimize the constant a separately (to minimize the training error). Linear Methods in high dimension 24 / 25

25 Naive Bayes classifier generalized We can generalize this idea to continuous-valued coordinates with the same conditional independence hypothesis. We estimate the conditional density of each coordinate (e.g. with a Gaussian) The decision function becomes an additive model: ) f (k) (x (k) ) f (x) = sign ( k Advantage of naive Bayes: robust also in high dimension, simple and surprisingly good in a number of situations even when the assumption obviously does not hold. Disadvantage: generally does not match the performance of more flexible methods. Linear Methods in high dimension 25 / 25

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

### These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher

### Wes, Delaram, and Emily MA751. Exercise 4.5. 1 p(x; β) = [1 p(xi ; β)] = 1 p(x. y i [βx i ] log [1 + exp {βx i }].

Wes, Delaram, and Emily MA75 Exercise 4.5 Consider a two-class logistic regression problem with x R. Characterize the maximum-likelihood estimates of the slope and intercept parameter if the sample for

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

### Linear Models for Classification

Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci

### Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

### Efficient Streaming Classification Methods

1/44 Efficient Streaming Classification Methods Niall M. Adams 1, Nicos G. Pavlidis 2, Christoforos Anagnostopoulos 3, Dimitris K. Tasoulis 1 1 Department of Mathematics 2 Institute for Mathematical Sciences

### Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on

Pa8ern Recogni6on and Machine Learning Chapter 4: Linear Models for Classiﬁca6on Represen'ng the target values for classifica'on If there are only two classes, we typically use a single real valued output

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Statistical Machine Learning from Data

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

### Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

### Lecture 2: The SVM classifier

Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function

### MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

### Logistic Regression (1/24/13)

STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

### LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

### Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

### Introduction to Online Learning Theory

Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

### Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

### Christfried Webers. Canberra February June 2015

c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

### Lecture 8 February 4

ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt

### Notes for STA 437/1005 Methods for Multivariate Data

Notes for STA 437/1005 Methods for Multivariate Data Radford M. Neal, 26 November 2010 Random Vectors Notation: Let X be a random vector with p elements, so that X = [X 1,..., X p ], where denotes transpose.

### Introduction to Convex Optimization for Machine Learning

Introduction to Convex Optimization for Machine Learning John Duchi University of California, Berkeley Practical Machine Learning, Fall 2009 Duchi (UC Berkeley) Convex Optimization for Machine Learning

### Reject Inference in Credit Scoring. Jie-Men Mok

Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANI-KALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of

### LCs for Binary Classification

Linear Classifiers A linear classifier is a classifier such that classification is performed by a dot product beteen the to vectors representing the document and the category, respectively. Therefore it

### CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

### Lecture 6: Logistic Regression

Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

### Big Data Analytics CSCI 4030

High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

### Local classification and local likelihoods

Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

### Introduction to Machine Learning

Introduction to Machine Learning Felix Brockherde 12 Kristof Schütt 1 1 Technische Universität Berlin 2 Max Planck Institute of Microstructure Physics IPAM Tutorial 2013 Felix Brockherde, Kristof Schütt

### ELEC-E8104 Stochastics models and estimation, Lecture 3b: Linear Estimation in Static Systems

Stochastics models and estimation, Lecture 3b: Linear Estimation in Static Systems Minimum Mean Square Error (MMSE) MMSE estimation of Gaussian random vectors Linear MMSE estimator for arbitrarily distributed

### Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.

Prof. Dr. J. Franke All of Statistics 1.52 Binary response variables - logistic regression Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit

### Effective Linear Discriminant Analysis for High Dimensional, Low Sample Size Data

Effective Linear Discriant Analysis for High Dimensional, Low Sample Size Data Zhihua Qiao, Lan Zhou and Jianhua Z. Huang Abstract In the so-called high dimensional, low sample size (HDLSS) settings, LDA

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

### Generalized Linear Models. Today: definition of GLM, maximum likelihood estimation. Involves choice of a link function (systematic component)

Generalized Linear Models Last time: definition of exponential family, derivation of mean and variance (memorize) Today: definition of GLM, maximum likelihood estimation Include predictors x i through

### 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c11 2013/9/9 page 221 le-tex 221 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### Linear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil.

Steven J Zeil Old Dominion Univ. Fall 200 Discriminant-Based Classification Linearly Separable Systems Pairwise Separation 2 Posteriors 3 Logistic Discrimination 2 Discriminant-Based Classification Likelihood-based:

### Machine Learning in Spam Filtering

Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

### Bayes and Naïve Bayes. cs534-machine Learning

Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

### AdaBoost. Jiri Matas and Jan Šochman. Centre for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz

AdaBoost Jiri Matas and Jan Šochman Centre for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz Presentation Outline: AdaBoost algorithm Why is of interest? How it works? Why

### CSI:FLORIDA. Section 4.4: Logistic Regression

SI:FLORIDA Section 4.4: Logistic Regression SI:FLORIDA Reisit Masked lass Problem.5.5 2 -.5 - -.5 -.5 - -.5.5.5 We can generalize this roblem to two class roblem as well! SI:FLORIDA Reisit Masked lass

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors. Jordan canonical form (continued).

MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors Jordan canonical form (continued) Jordan canonical form A Jordan block is a square matrix of the form λ 1 0 0 0 0 λ 1 0 0 0 0 λ 0 0 J = 0

### Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### Several Views of Support Vector Machines

Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min

### Gaussian Classifiers CS498

Gaussian Classifiers CS498 Today s lecture The Gaussian Gaussian classifiers A slightly more sophisticated classifier Nearest Neighbors We can classify with nearest neighbors x m 1 m 2 Decision boundary

### Classification Problems

Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

### Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

### MACHINE LEARNING. Introduction. Alessandro Moschitti

MACHINE LEARNING Introduction Alessandro Moschitti Department of Computer Science and Information Engineering University of Trento Email: moschitti@disi.unitn.it Course Schedule Lectures Tuesday, 14:00-16:00

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### Machine Learning Logistic Regression

Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

### P (x) 0. Discrete random variables Expected value. The expected value, mean or average of a random variable x is: xp (x) = v i P (v i )

Discrete random variables Probability mass function Given a discrete random variable X taking values in X = {v 1,..., v m }, its probability mass function P : X [0, 1] is defined as: P (v i ) = Pr[X =

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### Cheng Soon Ong & Christfried Webers. Canberra February June 2016

c Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 31 c Part I

### Classification by Pairwise Coupling

Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating

### Multivariate Normal Distribution

Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

### Poisson Models for Count Data

Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

### 171:290 Model Selection Lecture II: The Akaike Information Criterion

171:290 Model Selection Lecture II: The Akaike Information Criterion Department of Biostatistics Department of Statistics and Actuarial Science August 28, 2012 Introduction AIC, the Akaike Information

### Introduction: Overview of Kernel Methods

Introduction: Overview of Kernel Methods Statistical Data Analysis with Positive Definite Kernels Kenji Fukumizu Institute of Statistical Mathematics, ROIS Department of Statistical Science, Graduate University

### D-optimal plans in observational studies

D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

### Improving Generalization

Improving Generalization Introduction to Neural Networks : Lecture 10 John A. Bullinaria, 2004 1. Improving Generalization 2. Training, Validation and Testing Data Sets 3. Cross-Validation 4. Weight Restriction

### Econometrics Simple Linear Regression

Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight

### Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups

Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups Achim Zeileis, Torsten Hothorn, Kurt Hornik http://eeecon.uibk.ac.at/~zeileis/ Overview Motivation: Trees, leaves, and

### Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### Duality in General Programs. Ryan Tibshirani Convex Optimization 10-725/36-725

Duality in General Programs Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: duality in linear programs Given c R n, A R m n, b R m, G R r n, h R r : min x R n c T x max u R m, v R r b T

### Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

### HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

### Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems

### Didacticiel - Études de cas

1 Topic Linear Discriminant Analysis Data Mining Tools Comparison (Tanagra, R, SAS and SPSS). Linear discriminant analysis is a popular method in domains of statistics, machine learning and pattern recognition.

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

### Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure

Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure Belyaev Mikhail 1,2,3, Burnaev Evgeny 1,2,3, Kapushev Yermek 1,2 1 Institute for Information Transmission

### 2.3 Convex Constrained Optimization Problems

42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions

### Factorial experimental designs and generalized linear models

Statistics & Operations Research Transactions SORT 29 (2) July-December 2005, 249-268 ISSN: 1696-2281 www.idescat.net/sort Statistics & Operations Research c Institut d Estadística de Transactions Catalunya

### Classifiers & Classification

Classifiers & Classification Forsyth & Ponce Computer Vision A Modern Approach chapter 22 Pattern Classification Duda, Hart and Stork School of Computer Science & Statistics Trinity College Dublin Dublin

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### A Survey of Kernel Clustering Methods

A Survey of Kernel Clustering Methods Maurizio Filippone, Francesco Camastra, Francesco Masulli and Stefano Rovetta Presented by: Kedar Grama Outline Unsupervised Learning and Clustering Types of clustering

### Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories