Lecture 6: Logistic Regression


Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011

Outline

Classification task. Data: $X = [x_1, \ldots, x_m]$, an $n \times m$ matrix of data points in $\mathbb{R}^n$; $y \in \{-1, 1\}^m$, the $m$-vector of corresponding labels. Classification task: design a linear classification rule of the form $\hat{y} = \mathrm{sign}(w^T x + b)$, where $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ are to be found. Main solution idea: formulate the task of finding $w, b$ as a loss function minimization problem.
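As a concrete illustration (my own sketch, not part of the original lecture), the rule above might be coded as follows in NumPy; the data matrix, weight vector, and offset below are made-up placeholders.

```python
import numpy as np

def predict(X, w, b):
    """Linear classification rule: y_hat_i = sign(w^T x_i + b) for each column x_i of X."""
    scores = w @ X + b                 # length-m vector of w^T x_i + b
    return np.where(scores >= 0, 1, -1)

# Made-up example with n = 2 features and m = 4 points (one point per column).
X = np.array([[1.0, 2.0, -1.0, -2.0],
              [0.5, 1.0, -0.5, -1.0]])
w = np.array([1.0, 1.0])
b = 0.0
print(predict(X, w, b))                # [ 1  1 -1 -1]
```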

Separable data. Separability condition: $y_i(w^T x_i + b) \ge 0$, $i = 1, \ldots, m$. This ensures that the negative (resp. positive) class is contained in the half-space $w^T x + b \le 0$ (resp. $w^T x + b \ge 0$).

0/1 loss function minimization. When the data is not strictly separable, we seek to minimize the number of errors, which is the number of indices $i$ for which $y_i(w^T x_i + b) < 0$:
$$\min_{w,b} \; \sum_{i=1}^m L_{0/1}\big(y_i(w^T x_i + b)\big),$$
where $L_{0/1}$ is the 0/1 loss function
$$L_{0/1}(z) = \begin{cases} 1 & \text{if } z < 0, \\ 0 & \text{otherwise.} \end{cases}$$
The above problem is very hard to solve exactly.
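For concreteness, here is a small sketch (not from the slides) of the quantity being minimized; X, w, b follow the conventions above. Note that this objective is piecewise constant in (w, b), which is one way to see why it is difficult to optimize directly.

```python
import numpy as np

def zero_one_loss(w, b, X, y):
    """Number of misclassified points: #{ i : y_i (w^T x_i + b) < 0 }."""
    margins = y * (w @ X + b)          # y_i (w^T x_i + b) for each i
    return int(np.sum(margins < 0))
```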

From 0/1 loss to hinge loss. We approximate the 0/1 loss from above by the hinge loss: $H(z) = \max(0, 1 - z)$. This function is convex (its slope is non-decreasing).
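The following small check (an illustrative addition, not from the lecture) tabulates the 0/1 loss next to the hinge loss over a few margin values and confirms that the hinge value is never below the 0/1 value.

```python
import numpy as np

def hinge(z):
    """Hinge loss H(z) = max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

z = np.linspace(-2.0, 2.0, 9)
zero_one = (z < 0).astype(float)       # 0/1 loss as a function of the margin z
print(np.column_stack([z, zero_one, hinge(z)]))
print(np.all(hinge(z) >= zero_one))    # True: the hinge loss upper-bounds the 0/1 loss
```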

Support vector machine (SVM). SVMs are based on hinge loss minimization:
$$\min_{w,b} \; \sum_{i=1}^m \max\big(0, 1 - y_i(w^T x_i + b)\big) + \lambda \|w\|_2^2.$$
The above problem is much easier to solve than with the 0/1 loss (we will see why later). In lecture 5 we saw the geometry of this approximation.
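A minimal sketch of the SVM objective above (my own illustration; lam stands for the regularization parameter $\lambda$, and X is $n \times m$ as before). Unlike the 0/1 objective, this function is convex in $(w, b)$, which is what the algorithms discussed later exploit.

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """Regularized hinge loss: sum_i max(0, 1 - y_i (w^T x_i + b)) + lam * ||w||_2^2."""
    margins = y * (w @ X + b)
    return float(np.sum(np.maximum(0.0, 1.0 - margins)) + lam * np.dot(w, w))
```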

Outline

The idea is to work with a smooth (differentiable) approximation to the 0/1 loss function. There are many possible choices; this particular one has a nice probabilistic interpretation.

We model the probability of the label $Y$ being equal to $y \in \{-1, 1\}$, given a data point $x \in \mathbb{R}^n$, as:
$$P(Y = y \mid x) = \frac{1}{1 + \exp(-y(w^T x + b))}.$$
This amounts to modeling the log-odds ratio as a linear function of $x$:
$$\log \frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)} = w^T x + b.$$
The decision boundary $P(Y = 1 \mid x) = P(Y = -1 \mid x)$ is the hyperplane with equation $w^T x + b = 0$. The region $P(Y = 1 \mid x) \ge P(Y = -1 \mid x)$ (i.e., $w^T x + b \ge 0$) corresponds to points with predicted label $\hat{y} = +1$.
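A small sketch of this probability model (illustrative, not from the slides); the vectors x, w and the offset b are placeholders.

```python
import numpy as np

def prob(y, x, w, b):
    """P(Y = y | x) = 1 / (1 + exp(-y (w^T x + b))), with y in {-1, +1}."""
    return 1.0 / (1.0 + np.exp(-y * (w @ x + b)))

x = np.array([0.3, -1.2]); w = np.array([2.0, 0.5]); b = 0.1
print(prob(+1, x, w, b) + prob(-1, x, w, b))   # the two probabilities sum to 1
```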

The likelihood function is
$$l(w, b) = \prod_{i=1}^m \frac{1}{1 + e^{-y_i(w^T x_i + b)}}.$$
Now maximize the log-likelihood:
$$\max_{w,b} \; L(w, b) := -\sum_{i=1}^m \log\big(1 + e^{-y_i(w^T x_i + b)}\big).$$
In practice, we may consider adding a regularization term:
$$\max_{w,b} \; L(w, b) - \lambda r(w),$$
with $r(w) = \|w\|_2^2$ or $r(w) = \|w\|_1$.
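In code, the negative log-likelihood and its gradient might look as follows; this is a hedged sketch assuming the $n \times m$ data layout used earlier, and np.logaddexp is used only for numerical stability.

```python
import numpy as np

def neg_log_likelihood(w, b, X, y):
    """-L(w, b) = sum_i log(1 + exp(-y_i (w^T x_i + b))), the quantity to be minimized."""
    margins = y * (w @ X + b)
    return float(np.sum(np.logaddexp(0.0, -margins)))   # log(1 + exp(-margin)), computed stably

def neg_log_likelihood_grad(w, b, X, y):
    """Gradient of -L(w, b) with respect to w and b."""
    margins = y * (w @ X + b)
    s = -y / (1.0 + np.exp(margins))   # s_i = -y_i / (1 + exp(m_i)): chain rule through the margin m_i
    return X @ s, float(np.sum(s))
```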

Outline

Learning problems seen so far. Least-squares linear regression, SVM, and logistic regression problems can all be expressed as minimization problems $\min_{w,b} f(w, b)$, with:
$f(w, b) = \|X^T w - y\|_2^2$ (square loss);
$f(w, b) = \sum_{i=1}^m \max(0, 1 - y_i(w^T x_i + b))$ (hinge loss);
$f(w, b) = \sum_{i=1}^m \log(1 + \exp(-y_i(w^T x_i + b)))$ (logistic loss).
The regularized version involves an additional penalty term. How can we solve these problems?
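For reference, the three losses can be written side by side as below (an illustrative sketch; following the slide, the square loss is written without the offset b).

```python
import numpy as np

def square_loss(w, b, X, y):
    return float(np.sum((X.T @ w - y) ** 2))                      # ||X^T w - y||_2^2

def hinge_loss(w, b, X, y):
    return float(np.sum(np.maximum(0.0, 1.0 - y * (w @ X + b))))

def logistic_loss(w, b, X, y):
    return float(np.sum(np.logaddexp(0.0, -y * (w @ X + b))))
```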

The square, hinge, and logistic loss functions share the property of being convex; these functions have bowl-shaped graphs. Formal definition: $f$ is convex if the chord joining any two points of its graph always lies above the graph. If $f$ is differentiable, this is equivalent to the derivative function being non-decreasing. If $f$ is twice differentiable, this is the same as $f''(t) \ge 0$ for every $t$.
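As a quick sanity check (my own addition, not from the lecture), the chord condition can be verified numerically for the logistic loss, here taken as $z \mapsto \log(1 + e^{-z})$:

```python
import numpy as np

f = lambda z: np.log1p(np.exp(-z))     # logistic loss as a function of one variable

rng = np.random.default_rng(0)
for _ in range(1000):
    a, c, t = rng.uniform(-5, 5), rng.uniform(-5, 5), rng.uniform(0, 1)
    # Chord condition: f(t*a + (1-t)*c) <= t*f(a) + (1-t)*f(c).
    assert f(t * a + (1 - t) * c) <= t * f(a) + (1 - t) * f(c) + 1e-12
print("chord condition holds on all sampled triples")
```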

How to prove convexity. A function is convex if it can be written as a maximum of linear functions (you may need an infinite number of them). If $f$ is a convex function of one variable, then for every $x \in \mathbb{R}^n$ the function $(w, b) \mapsto f(w^T x + b)$ is also convex. The sum of convex functions is convex. Example: the logistic loss,
$$\ell(z) = \log(1 + e^z) = \max_{0 \le v \le 1} \; \big[ zv - v \log v - (1 - v)\log(1 - v) \big].$$
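This representation of the logistic loss as a maximum of functions that are linear in $z$ can be checked numerically, as in this illustrative snippet (the fixed point $z$ and the grid over $v$ are arbitrary choices):

```python
import numpy as np

z = 1.3                                     # any fixed point
v = np.linspace(1e-6, 1 - 1e-6, 200000)     # grid over the interval (0, 1)
linear_in_z = z * v - v * np.log(v) - (1 - v) * np.log(1 - v)
print(np.log1p(np.exp(z)))                  # logistic loss  l(z) = log(1 + e^z)
print(linear_in_z.max())                    # maximum over the grid: matches to high accuracy
```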

Non-convex minimization: hard. Optimizing non-convex functions (such as in 0/1 loss minimization) is usually very hard: algorithms may get trapped in so-called local minima, which do not correspond to the true minimal value of the objective function.

Convex minimization: easy. Convexity ensures that there are no local minima. It is generally easy to minimize convex functions numerically via specialized algorithms. These algorithms can be adapted to cases where the function is convex but not differentiable (such as the hinge loss). This is why the convexity of the square, hinge, and logistic loss functions is computationally attractive.

One-dimensional case. To minimize a one-dimensional convex function, we can use bisection. We start with an interval that is guaranteed to contain a minimizer. At each step, depending on the sign of the slope of the function at the midpoint of the interval, we shrink the interval by keeping either the left or the right half. Convexity ensures that this procedure converges (very fast).
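Here is a minimal sketch of this bisection procedure, assuming we can evaluate the derivative of the function and start from a bracket [lo, hi] whose endpoints have slopes of opposite signs (all names and the example function are illustrative, not from the lecture):

```python
import math

def minimize_1d_convex(slope, lo, hi, tol=1e-8):
    """Bisection on the sign of the slope of a differentiable convex function.

    Requires slope(lo) <= 0 <= slope(hi), so that [lo, hi] contains a minimizer.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slope(mid) > 0:      # slope positive at the midpoint: a minimizer lies to the left
            hi = mid
        else:                   # slope non-positive: a minimizer lies to the right
            lo = mid
    return 0.5 * (lo + hi)

# Example: minimize f(t) = log(1 + exp(-t)) + t^2, whose derivative is given below.
fprime = lambda t: -1.0 / (1.0 + math.exp(t)) + 2.0 * t
print(minimize_1d_convex(fprime, -10.0, 10.0))   # roughly 0.22
```

Each step halves the interval, so the minimizer is located to accuracy tol in on the order of log((hi - lo)/tol) slope evaluations, which is the fast convergence alluded to above.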

Coordinate descent. Many specialized algorithms exist for the regularized problems
$$\min_{w,b} \; L(w, b) + \lambda r(w),$$
where $L$ is a loss function such as those above and $r(w) = \|w\|_2^2$ or $r(w) = \|w\|_1$. One of the most efficient involves coordinate descent: fix all the variables except for one; minimize the resulting one-dimensional convex function by bisection; then proceed to minimizing with respect to the next variable. For SVMs, the actual procedure involves taking two variables at a time. Problems with millions of features and data points can be solved that way.
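A hedged, illustrative sketch of this idea for $\ell_2$-regularized logistic regression (not the actual solver used in practice; the bracket [-10, 10] and the sweep count are arbitrary choices):

```python
import numpy as np

def coord_descent_logreg(X, y, lam, n_sweeps=50):
    """Cyclic coordinate descent for min_{w,b} sum_i log(1 + exp(-y_i(w^T x_i + b))) + lam * ||w||_2^2.

    X is n x m (one point per column), y has entries in {-1, +1}.
    Each coordinate is minimized by bisection on its partial derivative, the others held fixed.
    """
    n, m = X.shape
    w, b = np.zeros(n), 0.0

    def partial(j, t):
        """Partial derivative of the objective w.r.t. coordinate j set to value t (j == n is the offset b)."""
        wj, bj = w.copy(), b
        if j < n:
            wj[j] = t
        else:
            bj = t
        margins = y * (wj @ X + bj)
        s = -y / (1.0 + np.exp(margins))            # per-sample derivative w.r.t. the margin
        return float(X[j] @ s) + 2.0 * lam * t if j < n else float(np.sum(s))

    for _ in range(n_sweeps):
        for j in range(n + 1):
            lo, hi = -10.0, 10.0                     # crude bracket, assumed to contain the minimizer
            for _ in range(60):                      # bisection on the sign of the partial derivative
                mid = 0.5 * (lo + hi)
                if partial(j, mid) > 0:
                    hi = mid
                else:
                    lo = mid
            if j < n:
                w[j] = 0.5 * (lo + hi)
            else:
                b = 0.5 * (lo + hi)
    return w, b
```

Efficient implementations keep the margins $y_i(w^T x_i + b)$ up to date incrementally instead of recomputing them for every trial value, which is what makes the approach viable at the scale mentioned above.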

In case you need to try. For moderate-size convex problems, try the free MATLAB toolbox CVX (not Chevron!), at http://cvxr.com/cvx/. For convex learning problems, look at libraries such as WEKA (http://www.cs.waikato.ac.nz/ml/weka/).