L1 vs. L2 Regularization and feature selection.


L1 vs. L2 Regularization and feature selection. Paper by Andrew Ng (2004). Presentation by Afshin Rostami.

Main Topics Covering numbers: definition and convergence bounds. L1-regularized logistic regression: convergence upper bound. Rotational invariance: definition and convergence lower bound.

Main Theorems L1-regularized logistic regression requires a sample size that grows only logarithmically in the number of irrelevant features. Rotationally invariant algorithms (e.g. L2-regularized logistic regression) require a sample size that grows linearly in the number of irrelevant features.

Covering Numbers Definition: Let $\bar{x} = (x_1, x_2, \ldots, x_m) \in X^m$. For a function class $\mathcal{F}$ with domain $X$ and range $[-M, M]$, let $\mathcal{F}(\bar{x}) = \{(f(x_1), f(x_2), \ldots, f(x_m)) : f \in \mathcal{F}\}$. The covering number $N_p(\mathcal{F}, \epsilon, \bar{x})$ is then defined as the size of the smallest set of points such that for any $u \in \mathcal{F}(\bar{x})$ there is a point in the set no further than $\epsilon$ away, as measured by the $p$-norm.
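To make the definition concrete, here is a minimal Python sketch (an illustration of the definition, not anything from the paper): it greedily builds an $\epsilon$-cover of a finite set of vectors playing the role of $\mathcal{F}(\bar{x})$, and the size of that greedy cover upper-bounds $N_p(\mathcal{F}, \epsilon, \bar{x})$.

```python
import numpy as np

def greedy_cover_size(points, eps, p=1):
    # Greedily build an eps-cover of the rows of `points` under the p-norm.
    # The rows play the role of the elements of F(x_bar); the size of the
    # greedy cover is an upper bound on N_p(F, eps, x_bar).
    cover = []
    for u in points:
        if not any(np.linalg.norm(u - c, ord=p) <= eps for c in cover):
            cover.append(u)
    return len(cover)

# Toy usage: 200 random "function evaluation vectors" on m = 5 points.
rng = np.random.default_rng(0)
print(greedy_cover_size(rng.uniform(-1, 1, size=(200, 5)), eps=0.5, p=1))
```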

Covering Numbers [figure: illustration of an $\epsilon$-cover of $\mathcal{F}(\bar{x})$]

Covering Numbers Then, define $N_p(\mathcal{F}, \epsilon, m) = \max_{\bar{x}} N_p(\mathcal{F}, \epsilon, \bar{x})$. Notice the similarity to the shattering coefficient: both quantities measure the number of distinct labelings of a set of size $m$ realizable by a given function class. In particular, notice that if a function class $\mathcal{H}$ takes values in $\{0, 1\}$ and $0 < \epsilon < 1$, then $N_p(\mathcal{H}, \epsilon, m) = \Pi_{\mathcal{H}}(m)$, the shattering coefficient.

Convergence with Covering Numbers Convergence result by Pollard (1984): for i.i.d. $\bar{x} = (x_1, x_2, \ldots, x_m)$,
$$P\left[\exists f \in \mathcal{F} : \left|\frac{1}{m} \sum_{i=1}^{m} f(x_i) - E_{x \sim D}[f(x)]\right| \ge \epsilon\right] \le 8\, E\!\left[N_1(\mathcal{F}, \epsilon/8, \bar{x})\right] \exp\!\left(\frac{-m\epsilon^2}{512 M^2}\right).$$
Again, notice the similarity to the sample bound based on the shattering coefficient. The proof is similar too: use symmetrization and analyze permutations.
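As a rough arithmetic illustration (my own, with made-up numbers), setting the right-hand side equal to a target failure probability and solving for $m$ shows that the required sample size grows only with the logarithm of the covering number:

```python
import numpy as np

# Pollard's bound: P[...] <= 8 * E[N1] * exp(-m * eps^2 / (512 * M^2)).
# Set the right-hand side equal to delta and solve for m.
eps, M, delta = 0.1, 1.0, 0.01
for cover in [1e3, 1e6, 1e12]:                     # hypothetical covering-number bounds
    m = 512 * M**2 / eps**2 * np.log(8 * cover / delta)
    print(f"cover={cover:.0e}  m needed ~ {int(np.ceil(m))}")
```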

Logistic Regression When using logistic regression, we model the conditional probability that the label is 1 as
$$p(y = 1 \mid x, \theta) = \frac{1}{1 + \exp(-\theta^T x)},$$
where $\theta$ is the parameter vector we wish to learn. Thus we wish to maximize the penalized log-likelihood:
$$\hat{\theta} = \arg\max_\theta \sum_{i=1}^{m} \log p(y_i \mid x_i, \theta) - \lambda R(\theta).$$

Regularization Term In the previous slide, $R(\theta)$ is the regularization term, which forces the parameters to be small (for $\lambda > 0$). This paper compares the performance of L1 and L2 regularization. L1: $R(\theta) = \|\theta\|_1 = \sum_{i=1}^{n} |\theta_i|$. L2: $R(\theta) = \|\theta\|_2^2 = \sum_{i=1}^{n} \theta_i^2$.
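A minimal NumPy sketch of the penalized objective from these two slides (the function and variable names are mine, not the paper's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # p(y=1 | x, theta)

def l1_penalty(theta):
    return np.sum(np.abs(theta))               # R(theta) = ||theta||_1

def l2_penalty(theta):
    return np.sum(theta ** 2)                  # R(theta) = ||theta||_2^2

def penalized_log_likelihood(theta, X, y, lam, penalty):
    # Objective to maximize: sum_i log p(y_i | x_i, theta) - lam * R(theta).
    p = sigmoid(X @ theta)
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return log_lik - lam * penalty(theta)
```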

An Equivalent Optimization Another way to state the optimization problem (and the form our algorithm will actually use) is the following:
$$\max_\theta \sum_{i=1}^{m} \log p(y_i \mid x_i, \theta) \quad \text{subject to } R(\theta) \le B,$$
where for every $\lambda$ there exists a corresponding $B$. In fact, notice that our previous optimization is the Lagrangian of this one.
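One way to solve the constrained form directly is projected gradient ascent. The sketch below is my own illustration, not the paper's implementation; the projection step follows Duchi et al. (2008). For the L2 constraint, the projection would instead simply rescale $\theta$ onto the ball of radius $\sqrt{B}$.

```python
import numpy as np

def project_onto_l1_ball(v, B):
    # Euclidean projection of v onto {w : ||w||_1 <= B} (Duchi et al., 2008).
    if np.abs(v).sum() <= B:
        return v
    u = np.sort(np.abs(v))[::-1]
    cum = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u - (cum - B) / k > 0)[0][-1]
    tau = (cum[rho] - B) / (rho + 1)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def constrained_logistic_fit(X, y, B, lr=0.1, iters=1000):
    # Projected gradient ascent on the log-likelihood subject to ||theta||_1 <= B.
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (y - p)                   # gradient of sum_i log p(y_i | x_i, theta)
        theta = project_onto_l1_ball(theta + lr * grad, B)
    return theta
```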

Algorithm Implementation
1. Scale the data so that for all $x$, $\|x\|_\infty \le 1$.
2. Split the data $S$ into a training set $S_1$ and a hold-out set $S_2$ of sizes $(1 - \gamma)m$ and $\gamma m$, respectively.
3. For $B = 0, 1, 2, 4, \ldots, C$: solve the constrained optimization problem and store the solution $\hat{\theta}_B$.
4. Among the solutions from step 3, choose the $\hat{\theta}_B$ with the lowest hold-out error on $S_2$. (Notice that searching over values of $B$ tunes the regularization parameter, in effect performing structural risk minimization.)
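A sketch of the hold-out tuning loop using scikit-learn. Since the constrained and penalized forms are equivalent (previous slide), this version sweeps the Lagrangian parameter (scikit-learn's C) on a doubling grid rather than the bound $B$; the procedure is otherwise in the spirit of the slide, and the names below are mine.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

def tune_l1_logistic(X, y, gamma=0.3, grid_size=8, seed=0):
    # 1. Scale the data so every feature value lies in [-1, 1].
    X = X / np.max(np.abs(X))
    # 2. Split S into training S1 and hold-out S2 of sizes (1-gamma)m and gamma*m.
    X1, X2, y1, y2 = train_test_split(X, y, test_size=gamma, random_state=seed)
    # 3. Sweep a doubling grid (analogue of B = 1, 2, 4, ...) and fit each model.
    best_err, best_clf = np.inf, None
    for C in [2.0 ** k for k in range(-grid_size, grid_size)]:
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X1, y1)
        err = log_loss(y2, clf.predict_proba(X2))
        # 4. Keep the model with the lowest hold-out error.
        if err < best_err:
            best_err, best_clf = err, clf
    return best_clf
```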

Measuring Error The upper bound proof is based on the log-loss:
$$\epsilon_l(\theta) = E_{(x, y) \sim D}\left[-\log p(y \mid x, \theta)\right], \qquad \hat{\epsilon}_l(\theta) = \frac{1}{|S|} \sum_{(x, y) \in S} -\log p(y \mid x, \theta).$$
If we denote the usual 0/1 misclassification error by $\epsilon_{0/1}(\theta)$, then it can be shown that $(\log 2)\,\epsilon_{0/1}(\theta) \le \epsilon_l(\theta)$. So an upper bound on the log-loss is also (up to the constant $\log 2$) an upper bound on the 0/1 misclassification error. Similarly, a lower bound on the misclassification error is also a lower bound on the log-loss.
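A short sketch of the two empirical error measures and the $\log 2$ relation between them (illustrative only; the names are mine):

```python
import numpy as np

def empirical_errors(p_hat, y):
    # p_hat = predicted p(y=1 | x, theta); y in {0, 1}.
    p_of_label = np.where(y == 1, p_hat, 1 - p_hat)      # probability of the observed label
    logloss = -np.mean(np.log(p_of_label))               # empirical log-loss
    err01 = np.mean((p_hat >= 0.5) != (y == 1))          # empirical 0/1 error
    # Every misclassified example has p_of_label <= 1/2, so it contributes at
    # least log 2 to the log-loss; hence (log 2) * err01 <= logloss.
    assert np.log(2) * err01 <= logloss + 1e-12
    return logloss, err01
```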

L1 Logistic Regression Upper Bound Theorem: Suppose we are in a situation where only $r$ of our features (with $r$ assumed to be much smaller than $n$) are necessary to learn a good parameter vector $\theta^*_r$, and let $K \ge |\theta^*_{r,i}|$ bound the magnitudes of its components. Then for the $\hat{\theta}$ produced by the previously described algorithm we have $\epsilon_l(\hat{\theta}) \le \epsilon_l(\theta^*_r) + \epsilon$, with sample complexity $m = O\big((\log n) \cdot \mathrm{poly}(r, K, \log(1/\delta), 1/\epsilon, C)\big)$.

L1 Logistic Regression Upper Bound Proof: The proof makes use of several supporting lemmas regarding covering numbers. Lemma 1: if $G = \{g_\theta : g_\theta(x) = \theta^T x,\ x \in \mathbb{R}^n,\ \|\theta\|_q \le a\}$ and $\|x\|_p \le b$, where $\frac{1}{p} + \frac{1}{q} = 1$ and $2 \le p \le \infty$, then
$$\log_2 N_2(G, \epsilon, m) \le \left\lceil \frac{a^2 b^2}{\epsilon^2} \right\rceil \log_2(2n + 1).$$

A Few More Lemmas Proof cont.: Two more useful lemmas. Lemma 2: for any parameters, $N_1(\mathcal{F}, \epsilon, m) \le N_2(\mathcal{F}, \epsilon, m)$. Lemma 3: for $G$ as defined before and $F$ defined as $F = \{f_g(x, y) = l(g(x), y) : g \in G,\ y \in \{0, 1\}\}$, where the function $l(\cdot, y)$ for fixed $y$ is Lipschitz with constant $L$, we have $N_1(F, \epsilon, m) \le N_1(G, \epsilon/L, m)$.

Upper Bound Proof Proof cont.: Let $B'$ be the first value in $\{0, 1, 2, 4, \ldots\}$ larger than $rK$, so $rK < B' \le 2rK$. Then, using Lemmas 1 and 2 (with $q = 1$, $a = B'$, $p = \infty$, $b = \|x\|_\infty \le 1$), we have
$$\log_2 N_1(G, \epsilon, m) \le \log_2 N_2(G, \epsilon, m) \le \left\lceil \frac{B'^2 \|x\|_\infty^2}{\epsilon^2} \right\rceil \log_2(2n+1) = \left\lceil \frac{B'^2}{\epsilon^2} \right\rceil \log_2(2n+1).$$

Upper Bound Proof Proof cont.: Notice that if we define $l(g(x), y)$ to be the log-loss suffered by logistic regression, then $l(\cdot, y)$ for fixed $y$ is Lipschitz with $L = 1$. Thus, using Lemma 3, we have
$$\log_2 N_1(F, \epsilon, m) \le \left\lceil \frac{B'^2}{\epsilon^2} \right\rceil \log_2(2n+1).$$
Now to bound $M$:
$$|f(x, y)| = |l(g(x), y)| = |l(\theta^T x, y)| \le |\theta^T x| + 1 \le \|\theta\|_1 \|x\|_\infty + 1 \le B' + 1.$$
The first inequality is due to the fact that $l(\cdot, y)$ is the log-loss, and the second inequality is due to Hölder's inequality ($|a^T b| \le \|a\|_p \|b\|_q$ for conjugate $p, q$).

Putting it all Together Now, putting together all the steps of the proof, we have shown
$$P\left[\exists f \in F : |\hat{\epsilon}_l(\theta) - \epsilon_l(\theta)| \ge \epsilon\right] \le 8\,(2n+1)^{\lceil 64 B'^2/\epsilon^2 \rceil} \exp\!\left(\frac{-m\epsilon^2}{512\,(B'+1)^2}\right).$$
Thus, for confidence $1 - \delta$ it suffices that $m = O\big((\log n) \cdot \mathrm{poly}(r, K, 1/\epsilon, \log(1/\delta))\big)$. Then, using the standard convergence result (Vapnik, 1982), we get
$$\epsilon_l(\hat{\theta}) \le \min_{\theta' : \|\theta'\|_1 \le B'} \epsilon_l(\theta') + 2\epsilon \le \epsilon_l(\theta^*_r) + 2\epsilon.$$

Rotationally Invariant Algorithms $M$ is a rotation matrix if $M^T M = M M^T = I$ and $|M| = 1$. An algorithm $L$ is rotationally invariant if for every training set $S$, test point $x$, and rotation matrix $M$, $L_S(x) = L_{MS}(Mx)$, where $MS$ denotes the training set with every input rotated by $M$. Examples: logistic regression with L2 regularization, SVMs, the perceptron, unregularized logistic regression, neural networks trained with backpropagation, and any algorithm that uses PCA preprocessing.
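A quick empirical check of the definition (my own sketch, using scikit-learn and a random orthogonal matrix in place of a general rotation): train on the original and on the rotated training set, then compare predictions at $x$ and $Mx$. L2-regularized logistic regression should agree up to solver tolerance; L1-regularized logistic regression need not.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, m = 20, 100
X = rng.normal(size=(m, n))
y = (X[:, 0] > 0).astype(int)                   # labels depend on the first feature only

M, _ = np.linalg.qr(rng.normal(size=(n, n)))    # random orthogonal matrix
X_rot = X @ M.T                                 # each row x becomes M @ x

for penalty in ["l2", "l1"]:
    a = LogisticRegression(penalty=penalty, solver="liblinear", C=1.0).fit(X, y)
    b = LogisticRegression(penalty=penalty, solver="liblinear", C=1.0).fit(X_rot, y)
    agreement = np.mean(a.predict(X) == b.predict(X_rot))   # L_S(x) vs. L_MS(Mx)
    print(penalty, "agreement under rotation:", agreement)
```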

Rotationally Invariant Algorithm Lower Bound For any $0 < \epsilon < 1/8$ and $0 < \delta < 1/100$, if $L$ is a rotationally invariant algorithm and the labels are determined by a single feature (i.e. $y = 1$ if $x_1 > t$ and $y = 0$ otherwise), then in order for $L$ to attain 0/1 misclassification error lower than $\epsilon$ with probability at least $1 - \delta$, it is necessary that the training set size be at least $m = \Omega(n/\epsilon)$.

Lower Bound Proof Proof: Let $C$ be the concept class of linear separators: $C = \{h_{\theta, t} : h_{\theta, t}(x) = 1\{\theta^T x \ge t\},\ \theta \ne 0\}$. Recall that this concept class has VC dimension $n + 1$. Also recall that the standard VC lower bound states there exists a distribution $D_X$ that induces a sample complexity of $\Omega(d/\epsilon)$, which, in the case of linear separators, becomes $\Omega(n/\epsilon)$. Let $h_{\theta, t}(x) = 1\{\theta^T x \ge t\}$ be the target concept and, without loss of generality, let $\|\theta\|_2 = 1$. Then there exists an orthogonal matrix $M$ whose first row is $\theta^T$.

Lower Bound Proof Proof cont.: Notice that $M\theta = [1, 0, \ldots, 0]^T = e_1$. Also, if we need to, we can flip the sign of any row other than the first to ensure $|M| = 1$, so that $M$ is a rotation matrix. Now examine the problem with data points $x = (x_1, x_2, \ldots, x_n) \sim D_X$, let $x' = Mx$, and let the labels be determined by the first feature: $y' = 1\{x'_1 \ge t\}$.

Lower Bound Proof Proof cont.: $y' = 1\{x'_1 \ge t\} = 1\{e_1^T x' \ge t\} = 1\{(M\theta)^T M x \ge t\} = 1\{\theta^T M^T M x \ge t\} = 1\{\theta^T x \ge t\} = y$. So the problem on $(x', y')$ is just a rotated version of the problem on $(x, y)$, and since our learning algorithm is rotationally invariant, it performs identically on both. In other words, we are really learning a linear separator in $n$-dimensional space, giving us a sample complexity lower bound of $m = \Omega(n/\epsilon)$.

Experiments [figure panels: 1 relevant feature, 100 training examples; 3 relevant features, 100 training examples; exponential decay in relevance, 100 training examples]
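A rough sketch in the spirit of these experiments (not the paper's exact data or protocol): one relevant feature, 100 training examples, and a growing number of irrelevant features, comparing the test error of L1- and L2-regularized logistic regression at a fixed regularization strength.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
m_train, m_test = 100, 2000

for n in [10, 100, 300]:                         # total number of features
    X = rng.normal(size=(m_train + m_test, n))
    y = (X[:, 0] > 0).astype(int)                # only the first feature is relevant
    Xtr, ytr = X[:m_train], y[:m_train]
    Xte, yte = X[m_train:], y[m_train:]
    errs = {}
    for penalty in ["l1", "l2"]:
        clf = LogisticRegression(penalty=penalty, solver="liblinear", C=1.0).fit(Xtr, ytr)
        errs[penalty] = np.mean(clf.predict(Xte) != yte)
    print(n, errs)    # L1 typically degrades far more slowly as n grows
```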