A note on the group lasso and a sparse group lasso


Jerome Friedman, Trevor Hastie and Robert Tibshirani

February 11, 2010 (with corrections; original version Jan 5, 2010)

Abstract

We consider the group lasso penalty for the linear model. We note that the standard algorithm for solving the problem assumes that the model matrices in each group are orthonormal. Here we consider a more general penalty that blends the lasso ($\ell_1$) with the group lasso ("two-norm"). This penalty yields solutions that are sparse at both the group and individual feature levels. We derive an efficient algorithm for the resulting convex problem based on coordinate descent. This algorithm can also be used to solve the general form of the group lasso, with non-orthonormal model matrices.

1 Introduction

In this note, we consider the problem of prediction using a linear model. Our data consist of $y$, a vector of $N$ observations, and $X$, an $N \times p$ matrix of features. Suppose that the $p$ predictors are divided into $L$ groups, with $p_l$ the number in group $l$. For ease of notation, we use a matrix $X_l$ to represent the predictors corresponding to the $l$th group, with corresponding coefficient vector $\beta_l$. Assume that $y$ and $X$ have been centered, that is, all variables have mean zero.

Dept. of Statistics, Stanford Univ., CA 94305, jhf@stanford.edu
Depts. of Statistics, and Health, Research & Policy, Stanford Univ., CA 94305, hastie@stanford.edu
Depts. of Health, Research & Policy, and Statistics, Stanford Univ., tibs@stanford.edu
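To fix ideas, the data setup above (centered $y$ and $X$, with the columns of $X$ partitioned into $L$ groups) might look as follows in code; this is a hypothetical sketch, with the dimensions and group sizes invented:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 6                      # invented dimensions
X = rng.normal(size=(N, p))
y = rng.normal(size=N)

# Center y and each column of X so that all variables have mean zero.
X = X - X.mean(axis=0)
y = y - y.mean()

# Partition the p predictors into L groups of sizes p_l (here L = 3).
group_sizes = [2, 2, 2]           # invented p_l
ends = np.cumsum(group_sizes)
starts = ends - group_sizes
X_groups = [X[:, s:e] for s, e in zip(starts, ends)]
```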

In an elegant paper, Yuan & Lin (2007) proposed the group lasso, which solves the convex optimization problem

$$\min_{\beta \in \mathbb{R}^p} \; \Big\| y - \sum_{l=1}^L X_l \beta_l \Big\|_2^2 + \lambda \sum_{l=1}^L \sqrt{p_l}\, \|\beta_l\|_2, \qquad (1)$$

where the $\sqrt{p_l}$ terms account for the varying group sizes, and $\|\cdot\|_2$ is the Euclidean norm (not squared). This procedure acts like the lasso at the group level: depending on $\lambda$, an entire group of predictors may drop out of the model. In fact, if the group sizes are all one, it reduces to the lasso. Meier et al. (2008) extend the group lasso to logistic regression.

The group lasso does not, however, yield sparsity within a group. That is, if a group of parameters is non-zero, they will all be non-zero. In this note we propose a more general penalty that yields sparsity at both the group and individual feature levels, in order to select groups and predictors within a group. We also point out that the algorithm proposed by Yuan & Lin (2007) for fitting the group lasso assumes that the model matrices in each group are orthonormal. The algorithm that we provide for our more general criterion also works for the standard group lasso with non-orthonormal model matrices.

We consider the sparse group lasso criterion

$$\min_{\beta \in \mathbb{R}^p} \; \Big\| y - \sum_{l=1}^L X_l \beta_l \Big\|_2^2 + \lambda_1 \sum_{l=1}^L \|\beta_l\|_2 + \lambda_2 \|\beta\|_1, \qquad (2)$$

where $\beta = (\beta_1, \beta_2, \ldots, \beta_L)$ is the entire parameter vector. For notational simplicity we omit the weights $\sqrt{p_l}$. Expression (2) is the sum of convex functions and is therefore convex. Figure 1 shows the constraint region for the group lasso, lasso and sparse group lasso. A similar penalty involving both group lasso and lasso terms is discussed in Peng et al. (2009). When $\lambda_2 = 0$, criterion (2) reduces to the group lasso, whose computation we discuss next.

2 Computation for the group lasso

Here we briefly review the computation for the group lasso of Yuan & Lin (2007). In the process we clarify a confusing issue regarding orthonormality of predictors within a group. The subgradient equations (see e.g. Bertsekas (1999)) for the group lasso are

$$-X_l^T \Big( y - \sum_{l} X_l \beta_l \Big) + \lambda\, s_l = 0; \quad l = 1, 2, \ldots, L, \qquad (3)$$
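As a check on the notation, criterion (2) is easy to evaluate directly. Here is a minimal numpy sketch; the function name and toy inputs are my own, and the $\sqrt{p_l}$ weights are omitted as in (2):

```python
import numpy as np

def sparse_group_lasso_objective(y, X_groups, beta_groups, lam1, lam2):
    """Criterion (2): ||y - sum_l X_l beta_l||_2^2
    + lam1 * sum_l ||beta_l||_2 + lam2 * ||beta||_1."""
    resid = y - sum(Xl @ bl for Xl, bl in zip(X_groups, beta_groups))
    group_pen = sum(np.linalg.norm(bl) for bl in beta_groups)
    l1_pen = sum(np.abs(bl).sum() for bl in beta_groups)
    return resid @ resid + lam1 * group_pen + lam2 * l1_pen
```

Setting `lam2 = 0` recovers the (unweighted) group lasso criterion (1).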

Figure 1: Contour lines for the penalty for the group lasso (dotted), lasso (dashed) and sparse group lasso penalty (solid), for a single group with two predictors.

where $s_l = \beta_l / \|\beta_l\|_2$ if $\beta_l \neq 0$, and $s_l$ is a vector with $\|s_l\|_2 < 1$ otherwise. Let the solutions be $\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_L$. If

$$\Big\| X_l^T \Big( y - \sum_{k \neq l} X_k \hat\beta_k \Big) \Big\|_2 < \lambda \qquad (4)$$

then $\hat\beta_l$ is zero; otherwise it satisfies

$$\hat\beta_l = \big( X_l^T X_l + \lambda / \|\hat\beta_l\|_2 \cdot I \big)^{-1} X_l^T r_l, \qquad (5)$$

where $r_l = y - \sum_{k \neq l} X_k \hat\beta_k$. Now if we assume that $X_l^T X_l = I$, and let $v_l = X_l^T r_l$, then (5) simplifies to $\hat\beta_l = (1 - \lambda / \|v_l\|_2)\, v_l$. This leads to an algorithm that cycles through the groups, and is a blockwise coordinate descent procedure. It is given in Yuan & Lin (2007).

If, however, the predictors are not orthonormal, one approach is to orthonormalize them before applying the group lasso. However, this will not generally provide a solution to the original problem. In detail, if $X_l = UDV^T$, then the columns of $\tilde X_l = X_l V D^{-1} = U$ are orthonormal. Then

$$X_l \beta_l = U D V^T \beta_l = U [D V^T \beta_l] = U \tilde\beta_l,$$

with $\tilde\beta_l = D V^T \beta_l$. But $\|\tilde\beta_l\|_2 = \|\beta_l\|_2$ only if $D = I$. This will not be true in general; e.g. if $X_l$ is a set of dummy variables for a factor, it is true only if the number of observations in each category is equal. Hence an alternative approach is needed.

In the non-orthonormal case, we can think of equation (5) as a ridge regression, with the ridge parameter depending on $\|\hat\beta_l\|_2$. A complicated scalar equation can be derived for $\|\hat\beta_l\|_2$ from (5); substituting it into the right-hand side of (5) then gives the solution. However, this is not a good approach numerically, as it can involve dividing by the norm of a vector that is very close to zero. It is also not guaranteed to converge. In the next section we provide a better solution to this problem, and to the sparse group lasso.

3 Computation for the sparse group lasso

The criterion (2) is separable, so that block coordinate descent can be used for its optimization. Therefore we focus on just one group $l$, and denote the predictors by $X_l = Z = (Z_1, Z_2, \ldots, Z_k)$, the coefficients by $\beta_l = \theta = (\theta_1, \theta_2, \ldots, \theta_k)$ and the residual by $r = y - \sum_{k \neq l} X_k \beta_k$. The subgradient equations are

$$-Z_j^T \Big( r - \sum_j Z_j \theta_j \Big) + \lambda_1 s_j + \lambda_2 t_j = 0, \quad j = 1, 2, \ldots, k, \qquad (6)$$

where $s_j = \theta_j / \|\theta\|_2$ if $\theta \neq 0$, and $s$ is a vector satisfying $\|s\|_2 \leq 1$ otherwise, and $t_j \in \operatorname{sign}(\theta_j)$, that is, $t_j = \operatorname{sign}(\theta_j)$ if $\theta_j \neq 0$ and $t_j \in [-1, 1]$ if $\theta_j = 0$. Letting $a = X_l^T r$, a necessary and sufficient condition for $\theta$ to be zero is that the system of equations

$$a_j = \lambda_1 s_j + \lambda_2 t_j \qquad (7)$$

have a solution with $\|s\|_2 \leq 1$ and $t_j \in [-1, 1]$. We can determine this by minimizing

$$J(t) = \frac{1}{\lambda_1^2} \sum_j (a_j - \lambda_2 t_j)^2 = \sum_j s_j^2 \qquad (8)$$

with respect to $t_j \in [-1, 1]$ and then checking whether $J(\hat t) \leq 1$. The minimizer is easily seen to be

$$\hat t_j = \begin{cases} a_j / \lambda_2 & \text{if } |a_j / \lambda_2| \leq 1, \\ \operatorname{sign}(a_j / \lambda_2) & \text{if } |a_j / \lambda_2| > 1. \end{cases}$$

Now if $J(\hat t) > 1$, we must minimize the criterion

$$\frac{1}{2} \sum_{i=1}^N \Big( r_i - \sum_j Z_{ij} \theta_j \Big)^2 + \lambda_1 \|\theta\|_2 + \lambda_2 \sum_j |\theta_j|. \qquad (9)$$

This is the sum of a convex differentiable function (the first two terms) and a separable penalty, and hence we can use coordinate descent to obtain the global minimum. Here are the details of the coordinate descent procedure. For each $j$, let $w_j = r - \sum_{k \neq j} Z_k \hat\theta_k$. Then $\hat\theta_j = 0$ if $|Z_j^T w_j| < \lambda_2$; this follows easily by examining the subgradient equation corresponding to (9). Otherwise, if $|Z_j^T w_j| \geq \lambda_2$, we minimize (9) by a one-dimensional search over $\theta_j$. We use the optimize function in R, which is a combination of golden section search and successive parabolic interpolation. This leads to the following algorithm:

Algorithm for the sparse group lasso

1. Start with $\hat\beta = \beta_0$.

2. In group $l$, define $r = y - \sum_{k \neq l} X_k \beta_k$, $X_l = (Z_1, Z_2, \ldots, Z_k)$, $\beta_l = (\theta_1, \theta_2, \ldots, \theta_k)$ and $w_j = (w_1, w_2, \ldots, w_N) = r - \sum_{k \neq j} Z_k \theta_k$. Check whether $J(\hat t) \leq 1$ according to (8), and if so set $\hat\beta_l = 0$. Otherwise, for $j = 1, 2, 3, \ldots, k, 1, 2, 3, \ldots$: if $|Z_j^T w_j| < \lambda_2$ then $\hat\theta_j = 0$; if instead $|Z_j^T w_j| \geq \lambda_2$ then minimize

$$\frac{1}{2} \sum_{i=1}^N (w_i - Z_{ij} \theta_j)^2 + \lambda_1 \|\theta\|_2 + \lambda_2 |\theta_j| \qquad (10)$$

over $\theta_j$ by a one-dimensional optimization. This cyclic optimization for $j = 1, 2, 3, \ldots, k, 1, 2, 3, \ldots$ is iterated until convergence.

3. Iterate the entire step 2 over groups $l = 1, 2, \ldots, L$ until convergence.

If $\lambda_2$ is zero, we instead use condition (4) for the group-level test, and we do not need to check the condition $|Z_j^T w_j| < \lambda_2$. With these modifications, this algorithm also gives an effective method for solving the group lasso with non-orthonormal model matrices.

Note that in the special case where $X_l^T X_l = I$, with $X_l = (Z_1, Z_2, \ldots, Z_k)$, it is easy to show that

$$\hat\theta_j = \Big( \| S(Z^T y, \lambda_2) \|_2 - \lambda_1 \Big)_+ \frac{S(Z_j^T y, \lambda_2)}{\| S(Z^T y, \lambda_2) \|_2}, \qquad (11)$$

where $S(z, \lambda) = \operatorname{sign}(z)(|z| - \lambda)_+$ is the soft-threshold operator, and this reduces to the algorithm of Yuan & Lin (2007).
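The two key computations in step 2, the group-zero test via (8) and the closed-form update (11) for the orthonormal special case, can be sketched as follows. These are hypothetical helper functions; `a` stands for $X_l^T r$:

```python
import numpy as np

def group_is_zero(a, lam1, lam2):
    """Test whether the whole group can be set to zero: minimize
    J(t) = (1/lam1^2) * sum_j (a_j - lam2 * t_j)^2 over t_j in [-1, 1]
    (criterion (8)) and check J(t_hat) <= 1."""
    t_hat = np.clip(a / lam2, -1.0, 1.0)  # componentwise minimizer
    return np.sum((a - lam2 * t_hat) ** 2) / lam1 ** 2 <= 1.0

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * (|z| - lam)_+, applied componentwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def group_update_orthonormal(Z, y, lam1, lam2):
    """Closed-form group update (11); valid only when Z^T Z = I."""
    s = soft_threshold(Z.T @ y, lam2)
    norm_s = np.linalg.norm(s)
    if norm_s <= lam1:
        return np.zeros_like(s)  # (11) sets the whole group to zero
    return (norm_s - lam1) / norm_s * s
```

In the general (non-orthonormal) case one would instead run the cyclic one-dimensional searches of step 2.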

4 An example

We generated $n = 200$ observations with $p = 100$ predictors, in ten blocks of ten. The second fifty predictors all have coefficients of zero. The numbers of non-zero coefficients in the first five blocks of 10 are (10, 8, 6, 4, 2, 1) respectively, with coefficients equal to $\pm 1$, the sign chosen at random. The predictors are standard Gaussian with correlation 0.2 within a group and zero otherwise. Finally, Gaussian noise with standard deviation 4.0 was added to each observation. Figure 2 shows the signs of the estimated coefficients from the lasso, group lasso and sparse group lasso, using a well-chosen tuning parameter for each method (we set $\lambda_1 = \lambda_2$ for the sparse group lasso). The corresponding misclassification rates for the groups and individual features are shown in Figure 3. We see that the sparse group lasso strikes an effective compromise between the lasso and group lasso, yielding sparseness at the group and individual predictor levels.

Authors' note: After completion of this note, we became aware of the related recent work of Puig et al. (2009) on multidimensional shrinkage thresholding operators.

References

Bertsekas, D. (1999), Nonlinear Programming, Athena Scientific.

Meier, L., van de Geer, S. & Bühlmann, P. (2008), The group lasso for logistic regression, Journal of the Royal Statistical Society, Series B 70, 53-71.

Peng, J., Zhu, J., Bergamaschi, A., Han, W., Noh, D.-Y., Pollack, J. R. & Wang, P. (2009), Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer, Annals of Applied Statistics (to appear).

Puig, A., Wiesel, A. & Hero, A. (2009), A multidimensional shrinkage-thresholding operator, in SSP '09: IEEE/SP 15th Workshop on Statistical Signal Processing, pp. 113-116.

Yuan, M. & Lin, Y. (2007), Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society, Series B 68(1), 49-67.

[Figure 2 appears here: three panels (Group lasso, Sparse Group lasso, Lasso) plotting Coefficient (-3 to 3) against Predictor (0 to 100).]

Figure 2: Results for the simulated example. True coefficients are indicated by the open triangles, while the filled green circles indicate the sign of the estimated coefficients from each method.

[Figure 3 appears here: two panels plotting the number of groups misclassified (0 to 5) and the number of features misclassified (10 to 70) against the regularization parameter (2 to 12), for the Lasso, Group lasso and Sparse Group lasso.]

Figure 3: Results for the simulated example. The top panel shows the number of groups that are misclassified as the regularization parameter is varied. A misclassified group is one with at least one nonzero coefficient whose estimated coefficients are all set to zero, or vice versa. The bottom panel shows the number of individual coefficients that are misclassified, that is, estimated to be zero when the true coefficient is nonzero, or vice versa.