Sparse Prediction with the k-support Norm


Andreas Argyriou (École Centrale Paris), Rina Foygel (Department of Statistics, Stanford University), Nathan Srebro (Toyota Technological Institute at Chicago)

Abstract

We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an $\ell_2$ penalty. We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use.

1 Introduction

Regularizing with the $\ell_1$ norm, when we expect a sparse solution to a regression problem, is often justified by $\|w\|_1$ being the "convex envelope" of $\|w\|_0$ (the number of nonzero coordinates of a vector $w \in \mathbb{R}^d$). That is, $\|w\|_1$ is the tightest convex lower bound on $\|w\|_0$. But we must be careful with this statement: for sparse vectors with large entries, $\|w\|_0$ can be small while $\|w\|_1$ is large. In order to discuss convex lower bounds on $\|w\|_0$, we must impose some scale constraint. A more accurate statement is that $\|w\|_1 \le \|w\|_\infty \|w\|_0$, and so, when the magnitudes of entries in $w$ are bounded by 1, then $\|w\|_1 \le \|w\|_0$, and indeed it is the largest such convex lower bound. Viewed as a convex outer relaxation,

$$S_k^{(\infty)} := \{ w : \|w\|_0 \le k,\ \|w\|_\infty \le 1 \} \subseteq \{ w : \|w\|_1 \le k \}.$$

Intersecting the right-hand side with the $\ell_\infty$ unit ball, we get the tightest convex outer bound (convex hull) of $S_k^{(\infty)}$:

$$\{ w : \|w\|_1 \le k,\ \|w\|_\infty \le 1 \} = \mathrm{conv}(S_k^{(\infty)}).$$

However, in our view, this relationship between $\|w\|_1$ and $\|w\|_0$ yields disappointing learning guarantees, and does not appropriately capture the success of the $\ell_1$ norm as a surrogate for sparsity.
In particular, the sample complexity (footnote 1) of learning a linear predictor with $k$ nonzero entries by empirical risk minimization inside this class (an NP-hard optimization problem) scales as $O(k \log d)$, but relaxing to the constraint $\|w\|_1 \le k$ yields a sample complexity which scales as $O(k^2 \log d)$, because the sample complexity of $\ell_1$-regularized learning scales quadratically with the $\ell_1$ norm [11, 20].

Footnote 1: We define this as the number of observations needed in order to ensure expected prediction error no more than $\epsilon$ worse than that of the best $k$-sparse predictor, for an arbitrary constant $\epsilon$ (that is, we suppress the dependence on $\epsilon$ and focus on the dependence on the sparsity $k$ and dimensionality $d$).

Perhaps a better reason for the $\ell_1$ norm being a good surrogate for sparsity is that, not only do we expect the magnitude of each entry of $w$ to be bounded, but we further expect $\|w\|_2$ to be small. In a regression setting, with a vector of features $x$, this can be justified when $E[(x^\top w)^2]$ is bounded (a reasonable assumption) and the features are not too correlated; see, e.g., [15]. More broadly, especially in the presence of correlations, we might require this as a modeling assumption to aid in robustness and generalization. In any case, we have $\|w\|_1 \le \|w\|_2 \sqrt{\|w\|_0}$ (by the Cauchy-Schwarz inequality applied to the nonzero entries), and so if we are interested in predictors with bounded $\ell_2$ norm, we can motivate the $\ell_1$ norm through the following relaxation of sparsity, where the scale is now set by the $\ell_2$ norm:

$$\{ w : \|w\|_0 \le k,\ \|w\|_2 \le B \} \subseteq \{ w : \|w\|_1 \le B\sqrt{k} \}.$$

The sample complexity when using the relaxation now scales as $O(k \log d)$ (footnote 2).

Sparse + $\ell_2$ constraint. Our starting point is then that of combining sparsity and $\ell_2$ regularization, and learning a sparse predictor with small $\ell_2$ norm. We are thus interested in classes of the form

$$S_k^{(2)} := \{ w : \|w\|_0 \le k,\ \|w\|_2 \le 1 \}.$$

As discussed above, the class $\{ w : \|w\|_1 \le \sqrt{k} \}$ (corresponding to the standard Lasso) provides a convex relaxation of $S_k^{(2)}$. But clearly we can get a tighter relaxation by keeping the $\ell_2$ constraint:

$$\mathrm{conv}(S_k^{(2)}) \subseteq \{ w : \|w\|_1 \le \sqrt{k},\ \|w\|_2 \le 1 \} \subseteq \{ w : \|w\|_1 \le \sqrt{k} \}. \qquad (1)$$

Constraining (or equivalently, penalizing) both the $\ell_1$ and $\ell_2$ norms, as in (1), is known as the "elastic net" [5, 21] and has indeed been advocated as a better alternative to the Lasso. In this paper, we ask whether the elastic net is the tightest convex relaxation to sparsity plus $\ell_2$ (that is, to $S_k^{(2)}$) or whether a tighter, and better, convex relaxation is possible.

A new norm. We consider the convex hull (tightest convex outer bound) of $S_k^{(2)}$,

$$C_k := \mathrm{conv}(S_k^{(2)}) = \mathrm{conv}\{ w : \|w\|_0 \le k,\ \|w\|_2 \le 1 \}. \qquad (2)$$

We study the gauge function associated with this convex set, that is, the norm whose unit ball is given by (2), which we call the k-support norm. We show that, for $k > 1$, this is indeed a tighter convex relaxation than the elastic net (that is, both inclusions in (1) are in fact strict), and is therefore a better convex constraint than the elastic net when seeking a sparse, low $\ell_2$-norm linear predictor. We thus advocate using it as a replacement for the elastic net.
However, we also show that the gap between the elastic net and the k-support norm is at most a factor of $\sqrt{2}$, corresponding to a factor of two difference in the sample complexity. Thus, our work can also be interpreted as justifying the use of the elastic net, viewing it as a fairly good approximation to the tightest possible convex relaxation of sparsity intersected with an $\ell_2$ constraint. Still, even a factor of two should not necessarily be ignored and, as we show in our experiments, using the tighter k-support norm can indeed be beneficial.

To better understand the k-support norm, we show in Section 2 that it can also be described as the group lasso with overlaps norm [10] corresponding to all $\binom{d}{k}$ subsets of $k$ features. Despite the exponential number of groups in this description, we show that the k-support norm can be calculated efficiently in time $O(d \log d)$ and that its dual is given simply by the $\ell_2$ norm of the $k$ largest entries. We also provide efficient first-order optimization algorithms for learning with the k-support norm.

Footnote 2: More precisely, the sample complexity is $O(B^2 k \log d)$, where the dependence on $B^2$ is to be expected. Note that if feature vectors are $\ell_\infty$-bounded (i.e. individual features are bounded), the sample complexity when using only $\|w\|_2 \le B$ (without a sparsity or $\ell_1$ constraint) scales as $O(B^2 d)$. That is, even after identifying the correct support, we still need a sample complexity that scales with $B^2$.

Related Work

In many learning problems of interest, Lasso has been observed to shrink too many of the variables of $w$ to zero. In particular, in many applications, when a group of variables is highly correlated, the Lasso may prefer a sparse solution, but we might gain more predictive accuracy by including all the correlated variables in our model. These drawbacks have recently motivated the use of various other regularization methods, such as the elastic net [21], which penalizes the regression coefficients $w$ with a combination of $\ell_1$ and $\ell_2$ norms:

$$\min\big\{ \|Xw - y\|^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2 : w \in \mathbb{R}^d \big\}, \qquad (3)$$

where for a sample of size $n$, $y \in \mathbb{R}^n$ is the vector of response values, and $X \in \mathbb{R}^{n \times d}$ is a matrix with column $j$ containing the values of feature $j$. The elastic net can be viewed as a trade-off between $\ell_1$ regularization (the Lasso) and $\ell_2$ regularization (Ridge regression [9]), depending on the relative values of $\lambda_1$ and $\lambda_2$. In particular, when $\lambda_2 = 0$, (3) is equivalent to the Lasso. This method, and the other methods discussed below, have been observed to significantly outperform Lasso in many real applications.

The pairwise elastic net (PEN) [13] is a penalty function that accounts for similarity among features:

$$\|w\|_R^{PEN} = \|w\|_2^2 + \|w\|_1^2 - |w|^\top R\, |w|,$$

where $R \in [0,1]^{p \times p}$ is a matrix with $R_{jk}$ measuring similarity between features $X_j$ and $X_k$. The trace Lasso [6] is a second method proposed to handle correlations within $X$, defined by

$$\|w\|_X^{trace} = \|X\,\mathrm{diag}(w)\|_*,$$

where $\|\cdot\|_*$ denotes the matrix trace norm (the sum of the singular values) and promotes a low-rank solution. If the features are orthogonal, then both the PEN and the trace Lasso are equivalent to the Lasso. If the features are all identical, then both penalties are equivalent to Ridge regression (penalizing $\|w\|_2$). Another existing penalty is OSCAR [3], given by

$$\|w\|_c^{OSCAR} = \|w\|_1 + c \sum_{j<k} \max\{|w_j|, |w_k|\}.$$

Like the elastic net, each one of these three methods also prefers averaging similar features over selecting a single feature.

2 The k-Support Norm

One argument for the elastic net has been the flexibility of tuning the cardinality $k$ of the regression vector $w$. Thus, when groups of correlated variables are present, a larger $k$ may be learned, which corresponds to a higher $\lambda_2$ in (3). A more natural way to obtain such an effect of tuning the cardinality is to consider the convex hull of cardinality $k$ vectors,

$$C_k = \mathrm{conv}(S_k^{(2)}) = \mathrm{conv}\{ w \in \mathbb{R}^d : \|w\|_0 \le k,\ \|w\|_2 \le 1 \}.$$

Clearly the sets $C_k$ are nested, and $C_1$ and $C_d$ are the unit balls for the $\ell_1$ and $\ell_2$ norms, respectively. Consequently we define the k-support norm as the norm whose unit ball equals $C_k$ (the gauge function associated with the $C_k$ ball; see footnote 3).
Footnote 3: The gauge function $\gamma_{C_k} : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is defined as $\gamma_{C_k}(x) = \inf\{\lambda \in \mathbb{R}_+ : x \in \lambda C_k\}$.

An equivalent definition is the following variational formula:

Definition 2.1. Let $k \in \{1,\dots,d\}$. The k-support norm $\|\cdot\|_k^{sp}$ is defined, for every $w \in \mathbb{R}^d$, as

$$\|w\|_k^{sp} := \min\Big\{ \sum_{I \in G_k} \|v_I\|_2 : \mathrm{supp}(v_I) \subseteq I,\ \sum_{I \in G_k} v_I = w \Big\},$$

where $G_k$ denotes the set of all subsets of $\{1,\dots,d\}$ of cardinality at most $k$.

The equivalence is immediate by rewriting $v_I = \mu_I z_I$ in the above definition, where $\mu_I \ge 0$, $z_I \in C_k$ for all $I \in G_k$, and $\sum_{I \in G_k} \mu_I = 1$. In addition, this immediately implies that $\|\cdot\|_k^{sp}$ is indeed a norm. In fact, the k-support norm is equivalent to the norm used by the group lasso with overlaps [10], when the set of overlapping groups is chosen to be $G_k$ (however, the group lasso has traditionally been used for applications with some specific known group structure, unlike the case considered here).

Although the variational Definition 2.1 is not amenable to computation because of the exponential growth of the set of groups $G_k$, the k-support norm is computationally very tractable, with an $O(d \log d)$ algorithm described in Section 2.2.

As already mentioned, $\|\cdot\|_1^{sp} = \|\cdot\|_1$ and $\|\cdot\|_d^{sp} = \|\cdot\|_2$. The unit ball of this new norm in $\mathbb{R}^3$ for $k = 2$ is depicted in Figure 1. We immediately notice several differences between this unit ball and the elastic net unit ball. For example, at points with cardinality $k$ and $\ell_2$ norm equal to 1, the k-support norm is not differentiable, but unlike the $\ell_1$ or elastic-net norm, it is differentiable at points with cardinality less than $k$. Thus, the k-support norm is less biased towards sparse vectors than the elastic net and the $\ell_1$ norm.
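The two special cases $k = 1$ and $k = d$ can be checked directly from the variational definition. The short derivation below is a reader's sketch added for illustration, using only the definition of the groups $G_k$ and the triangle inequality; $e_i$ denotes the $i$-th standard basis vector.

```latex
% Case k = 1: G_1 contains only singletons (and the empty set),
% so the decomposition is forced: v_{\{i\}} = w_i e_i.
\[
  \|w\|_1^{sp} \;=\; \sum_{i=1}^{d} \|w_i e_i\|_2 \;=\; \sum_{i=1}^{d} |w_i| \;=\; \|w\|_1 .
\]
% Case k = d: the single group I = \{1,\dots,d\} with v_I = w gives the upper bound,
% and the triangle inequality gives the matching lower bound for every decomposition.
\[
  \|w\|_d^{sp} \le \|w\|_2 ,
  \qquad
  \sum_{I \in G_d} \|v_I\|_2 \;\ge\; \Big\| \sum_{I \in G_d} v_I \Big\|_2 = \|w\|_2
  \quad\Longrightarrow\quad
  \|w\|_d^{sp} = \|w\|_2 .
\]
```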
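Figure 1: Unit ball of the 2-support norm (left) and of the elastic net (right) on $\mathbb{R}^3$.

2.1 The Dual Norm

It is interesting and useful to compute the dual of the k-support norm. For $w \in \mathbb{R}^d$, denote by $|w|$ the vector of absolute values, and by $w^{\downarrow}_i$ the $i$-th largest element of $w$ [2]. We have

$$\|u\|_k^{sp*} = \max\{\langle w, u\rangle : \|w\|_k^{sp} \le 1\} = \max\Big\{ \Big(\sum_{i \in I} u_i^2\Big)^{1/2} : I \in G_k \Big\} = \Big(\sum_{i=1}^{k} (|u|^{\downarrow}_i)^2\Big)^{1/2} =: \|u\|_{(k)}^{(2)}.$$

This is the $\ell_2$ norm of the $k$ largest entries in $u$, and is known as the 2-k symmetric gauge norm [2]. Not surprisingly, this dual norm interpolates between the $\ell_2$ norm (when $k = d$ and all entries are taken) and the $\ell_\infty$ norm (when $k = 1$ and only the largest entry is taken). This parallels the interpolation of the k-support norm between the $\ell_1$ and $\ell_2$ norms.

2.2 Computation of the Norm

In this section, we derive an alternative formula for the k-support norm, which leads to computation of the value of the norm in $O(d \log d)$ steps.

Proposition 2.1. For every $w \in \mathbb{R}^d$,

$$\|w\|_k^{sp} = \Big( \sum_{i=1}^{k-r-1} (|w|^{\downarrow}_i)^2 + \frac{1}{r+1} \Big( \sum_{i=k-r}^{d} |w|^{\downarrow}_i \Big)^2 \Big)^{1/2},$$

where, letting $|w|^{\downarrow}_0$ denote $+\infty$, $r$ is the unique integer in $\{0,\dots,k-1\}$ satisfying

$$|w|^{\downarrow}_{k-r-1} > \frac{1}{r+1} \sum_{i=k-r}^{d} |w|^{\downarrow}_i \ge |w|^{\downarrow}_{k-r}. \qquad (4)$$

This result shows that $\|\cdot\|_k^{sp}$ trades off between the $\ell_1$ and $\ell_2$ norms in a way that favors sparse vectors but allows for cardinality larger than $k$. It combines the uniform shrinkage of an $\ell_2$ penalty for the largest components, with the sparse shrinkage of an $\ell_1$ penalty for the smallest components.

Proof of Proposition 2.1. We will use the inequality $\langle w, u\rangle \le \langle w^{\downarrow}, u^{\downarrow}\rangle$ [7]. We have

$$\tfrac12 (\|w\|_k^{sp})^2 = \max\Big\{ \langle u, w\rangle - \tfrac12 (\|u\|_{(k)}^{(2)})^2 : u \in \mathbb{R}^d \Big\} = \max\Big\{ \sum_{i=1}^{d} \alpha_i |w|^{\downarrow}_i - \tfrac12 \sum_{i=1}^{k} \alpha_i^2 : \alpha_1 \ge \dots \ge \alpha_d \ge 0 \Big\}$$

$$= \max\Big\{ \sum_{i=1}^{k-1} \alpha_i |w|^{\downarrow}_i + \alpha_k \sum_{i=k}^{d} |w|^{\downarrow}_i - \tfrac12 \sum_{i=1}^{k} \alpha_i^2 : \alpha_1 \ge \dots \ge \alpha_k \ge 0 \Big\}.$$

Let $A_r := \sum_{i=k-r}^{d} |w|^{\downarrow}_i$ for $r \in \{0,\dots,k-1\}$. If $A_0 < |w|^{\downarrow}_{k-1}$ then the solution $\alpha$ is given by $\alpha_i = |w|^{\downarrow}_i$ for $i = 1,\dots,k-1$, and $\alpha_i = A_0$ for $i = k,\dots,d$. If $A_0 \ge |w|^{\downarrow}_{k-1}$ then the optimal $\alpha_k$, $\alpha_{k-1}$ lie between $|w|^{\downarrow}_{k-1}$ and $A_0$, and have to be equal. So, the maximization becomes

$$\max\Big\{ \sum_{i=1}^{k-2} \alpha_i |w|^{\downarrow}_i - \tfrac12 \sum_{i=1}^{k-2} \alpha_i^2 + A_1 \alpha_{k-1} - \alpha_{k-1}^2 : \alpha_1 \ge \dots \ge \alpha_{k-1} \ge 0 \Big\}.$$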
If $A_0 \ge |w|^{\downarrow}_{k-1}$ and $|w|^{\downarrow}_{k-2} > A_1/2$ then the solution is $\alpha_i = |w|^{\downarrow}_i$ for $i = 1,\dots,k-2$, and $\alpha_i = A_1/2$ for $i = k-1,\dots,d$. Otherwise we proceed as before and continue this process. At stage $r$ the process terminates if $A_0 \ge |w|^{\downarrow}_{k-1}, \dots, \frac{A_{r-1}}{r} \ge |w|^{\downarrow}_{k-r}$, $\frac{A_r}{r+1} < |w|^{\downarrow}_{k-r-1}$, and all but the last two inequalities are redundant. Hence the condition can be rewritten as (4). One optimal solution is $\alpha_i = |w|^{\downarrow}_i$ for $i = 1,\dots,k-r-1$, and $\alpha_i = \frac{A_r}{r+1}$ for $i = k-r,\dots,d$. This proves the claim.

2.3 Learning with the k-support norm

We thus propose using learning rules with k-support norm regularization. These are appropriate when we would like to learn a sparse predictor that also has low $\ell_2$ norm, and are especially relevant when features might be correlated (that is, in almost all learning tasks) but the correlation structure is not known in advance. E.g., for squared error regression problems we have:

$$\min\Big\{ \tfrac12 \|Xw - y\|^2 + \tfrac{\lambda}{2} (\|w\|_k^{sp})^2 : w \in \mathbb{R}^d \Big\} \qquad (5)$$

with $\lambda > 0$ a regularization parameter and $k \in \{1,\dots,d\}$ also a parameter to be tuned. As typical in regularization-based methods, both $\lambda$ and $k$ can be selected by cross validation [8]. Despite the relationship to $S_k^{(2)}$, the parameter $k$ does not necessarily correspond to the sparsity of the actual minimizer of (5), and should be chosen via cross-validation rather than set to the desired sparsity.

3 Relation to the Elastic Net

Recall that the elastic net with penalty parameters $\lambda_1$ and $\lambda_2$ selects a vector of coefficients given by

$$\arg\min\ \|Xw - y\|^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2. \qquad (6)$$

For ease of comparison with the k-support norm, we first show that the set of optimal solutions for the elastic net, when the parameters are varied, is the same as for the norm

$$\|w\|_k^{el} := \max\Big\{ \|w\|_2,\ \frac{\|w\|_1}{\sqrt{k}} \Big\},$$

when $k \in [1, d]$, corresponding to the unit ball in (1) (note that $k$ is not necessarily an integer). To see this, let $\hat{w}$ be a solution to (6), and let $k := (\|\hat{w}\|_1 / \|\hat{w}\|_2)^2 \in [1, d]$. Now for any $w \ne \hat{w}$, if $\|w\|_k^{el} \le \|\hat{w}\|_k^{el}$, then $\|w\|_p \le \|\hat{w}\|_p$ for $p = 1, 2$. Since $\hat{w}$ is a solution to (6), therefore, $\|Xw - y\|^2 \ge \|X\hat{w} - y\|^2$. This proves that, for some constraint parameter $B$,

$$\hat{w} = \arg\min\Big\{ \tfrac{1}{n} \|Xw - y\|^2 : \|w\|_k^{el} \le B \Big\}.$$
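The closed form of Proposition 2.1 and the dual norm of Section 2.1 translate into a few lines of NumPy. The sketch below is an illustration, not the authors' code: the function names `k_support_norm` and `k_support_dual_norm` are hypothetical, and the search for $r$ stops at the first $r$ that satisfies the left inequality in (4), relying on the argument in the proof that the right inequality then holds automatically.

```python
import numpy as np


def k_support_norm(w, k):
    """k-support norm of w, using the closed form of Proposition 2.1."""
    z = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]  # |w| in decreasing order
    d = z.size
    if not 1 <= k <= d:
        raise ValueError("k must satisfy 1 <= k <= d")
    for r in range(k):
        # A_r = sum of the entries with 1-based index k-r, ..., d
        tail_sum = z[k - r - 1:].sum()
        # |w|_{k-r-1}, with the convention |w|_0 = +inf
        upper = np.inf if k - r - 1 == 0 else z[k - r - 2]
        if upper > tail_sum / (r + 1):          # left inequality of (4)
            head = z[:k - r - 1]                # the k-r-1 largest entries
            return float(np.sqrt(np.sum(head ** 2) + tail_sum ** 2 / (r + 1)))
    raise AssertionError("unreachable: r = k - 1 always passes the test")


def k_support_dual_norm(u, k):
    """Dual norm: the l2 norm of the k largest entries of |u|."""
    z = np.sort(np.abs(np.asarray(u, dtype=float)))[::-1]
    return float(np.sqrt(np.sum(z[:k] ** 2)))
```

For $k = 1$ and $k = d$ the first function reduces to the $\ell_1$ and $\ell_2$ norms, which gives a quick check of any implementation; the sort dominates the cost, in line with the $O(d \log d)$ bound.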
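Like the k-support norm, the elastic net interpolates between the $\ell_1$ and $\ell_2$ norms. In fact, when $k$ is an integer, any $k$-sparse unit vector $w \in \mathbb{R}^d$ must lie in the unit ball of $\|\cdot\|_k^{el}$. Since the k-support norm gives the convex hull of all $k$-sparse unit vectors, this immediately implies that

$$\|w\|_k^{el} \le \|w\|_k^{sp} \quad \text{for all } w \in \mathbb{R}^d.$$

The two norms are not equal, however. The difference between the two is illustrated in Figure 1, where we see that the k-support norm is more "rounded".

To see an example where the two norms are not equal, we set $d = 1 + k^2$ for some large $k$, and let $w = (k^{1.5}, 1, 1, \dots, 1) \in \mathbb{R}^d$. Then

$$\|w\|_k^{el} = \max\Big\{ \sqrt{k^3 + k^2},\ \frac{k^{1.5} + k^2}{\sqrt{k}} \Big\} = k^{1.5}\Big(1 + \frac{1}{\sqrt{k}}\Big).$$

Taking $u = \big(\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2k}}, \tfrac{1}{\sqrt{2k}}, \dots, \tfrac{1}{\sqrt{2k}}\big)$, we have $\|u\|_{(k)}^{(2)} < 1$, and recalling this norm is dual to the k-support norm:

$$\|w\|_k^{sp} > \langle w, u\rangle = \frac{k^{1.5}}{\sqrt{2}} + \frac{k^2}{\sqrt{2k}} = \sqrt{2}\, k^{1.5}.$$

In this example, we see that the two norms can differ by as much as a factor of $\sqrt{2}$. We now show that this is actually the most by which they can differ.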
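The example can be checked numerically. The sketch below (function names are hypothetical, not from the paper) takes $d = 1 + k^2$, $w = (k^{1.5}, 1, \dots, 1)$ and $u = (1/\sqrt{2}, 1/\sqrt{2k}, \dots, 1/\sqrt{2k})$, confirms that the $\ell_2$ norm of the $k$ largest entries of $u$ is below 1, and returns the ratio $\langle w, u\rangle / \|w\|_k^{el}$. That ratio is only a lower bound on $\|w\|_k^{sp} / \|w\|_k^{el}$, and it should approach $\sqrt{2}$ from below as $k$ grows.

```python
import numpy as np


def elastic_net_norm(w, k):
    """Elastic net norm max(||w||_2, ||w||_1 / sqrt(k))."""
    w = np.asarray(w, dtype=float)
    return max(np.linalg.norm(w, 2), np.linalg.norm(w, 1) / np.sqrt(k))


def top_k_l2_norm(u, k):
    """l2 norm of the k largest entries of |u| (dual of the k-support norm)."""
    z = np.sort(np.abs(np.asarray(u, dtype=float)))[::-1]
    return float(np.sqrt(np.sum(z[:k] ** 2)))


def gap_lower_bound(k):
    """Lower bound <w, u> / ||w||_el on the ratio of the two norms."""
    d = 1 + k ** 2
    w = np.ones(d)
    w[0] = k ** 1.5
    u = np.full(d, 1.0 / np.sqrt(2.0 * k))
    u[0] = 1.0 / np.sqrt(2.0)
    # u must lie strictly inside the dual unit ball for the bound to apply
    assert top_k_l2_norm(u, k) < 1.0
    return float(w @ u) / elastic_net_norm(w, k)


if __name__ == "__main__":
    for k in (4, 25, 100, 400):
        print(k, gap_lower_bound(k), np.sqrt(2.0))
```

For small $k$ the bound is uninformative (it can fall below 1); the point of the construction is the limit of large $k$.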
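Proposition 3.1. $\|\cdot\|_k^{el} \le \|\cdot\|_k^{sp} < \sqrt{2}\, \|\cdot\|_k^{el}$.

Proof. We show that these bounds hold in the duals of the two norms. First, since $\|\cdot\|_k^{el}$ is a maximum over the $\ell_1$ and $\ell_2$ norms, its dual is given by

$$\|u\|_k^{(el)*} := \inf_{a \in \mathbb{R}^d} \big\{ \|a\|_2 + \sqrt{k}\, \|u - a\|_\infty \big\}.$$

Now take any $u \in \mathbb{R}^d$. First we show $\|u\|_{(k)}^{(2)} \le \|u\|_k^{(el)*}$. Without loss of generality, we take $u_1 \ge \dots \ge u_d \ge 0$. For any $a \in \mathbb{R}^d$,

$$\|u\|_{(k)}^{(2)} = \|u_{1:k}\|_2 \le \|a_{1:k}\|_2 + \|u_{1:k} - a_{1:k}\|_2 \le \|a\|_2 + \sqrt{k}\, \|u - a\|_\infty.$$

Finally, we show that $\|u\|_k^{(el)*} < \sqrt{2}\, \|u\|_{(k)}^{(2)}$. Let $a = (u_1 - u_{k+1}, \dots, u_k - u_{k+1}, 0, \dots, 0)$. Then

$$\|u\|_k^{(el)*} \le \|a\|_2 + \sqrt{k}\, \|u - a\|_\infty = \sqrt{\sum_{i=1}^{k} (u_i - u_{k+1})^2} + \sqrt{k}\, u_{k+1}$$

$$\le \sqrt{\sum_{i=1}^{k} (u_i^2 - u_{k+1}^2)} + \sqrt{k\, u_{k+1}^2} \le \sqrt{2}\, \sqrt{\sum_{i=1}^{k} (u_i^2 - u_{k+1}^2) + k\, u_{k+1}^2} = \sqrt{2}\, \|u\|_{(k)}^{(2)}.$$

Furthermore, this yields a strict inequality, because if $u_1 > u_{k+1}$, the next-to-last inequality is strict, while if $u_1 = \dots = u_{k+1}$, then the last inequality is strict.

4 Optimization

Solving the optimization problem (5) efficiently can be done with a first-order proximal algorithm. Proximal methods (see [1, 4, 14, 18, 19] and references therein) are used to solve composite problems of the form $\min\{ f(x) + \omega(x) : x \in \mathbb{R}^d \}$, where the loss function $f(x)$ and the regularizer $\omega(x)$ are convex functions, and $f$ is smooth with an $L$-Lipschitz gradient. These methods require fast computation of the gradient $\nabla f$ and the proximity operator

$$\mathrm{prox}_\omega(x) := \arg\min\Big\{ \tfrac12 \|u - x\|^2 + \omega(u) : u \in \mathbb{R}^d \Big\}.$$

To obtain a proximal method for k-support regularization, it suffices to compute the proximity map of $g = \frac{1}{2\beta} (\|\cdot\|_k^{sp})^2$, for any $\beta > 0$ (in particular, for problem (5) $\beta$ corresponds to $L/\lambda$). This computation can be done in $O(d(k + \log d))$ steps with Algorithm 1.

Algorithm 1: Computation of the proximity operator.

Input: $v \in \mathbb{R}^d$
Output: $q = \mathrm{prox}_{\frac{1}{2\beta}(\|\cdot\|_k^{sp})^2}(v)$

Find $r \in \{0,\dots,k-1\}$, $l \in \{k,\dots,d\}$ such that

$$\frac{1}{\beta+1}\, z_{k-r-1} > \frac{T_{r,l}}{l - k + (\beta+1) r + \beta + 1} \ge \frac{1}{\beta+1}\, z_{k-r} \qquad (7)$$

$$z_l > \frac{T_{r,l}}{l - k + (\beta+1) r + \beta + 1} \ge z_{l+1} \qquad (8)$$

where $z := |v|^{\downarrow}$, $z_0 := +\infty$, $z_{d+1} := -\infty$, $T_{r,l} := \sum_{i=k-r}^{l} z_i$.

Set
$q_i \leftarrow \frac{\beta}{\beta+1}\, z_i$ if $i = 1,\dots,k-r-1$;
$q_i \leftarrow z_i - \frac{T_{r,l}}{l - k + (\beta+1) r + \beta + 1}$ if $i = k-r,\dots,l$;
$q_i \leftarrow 0$ if $i = l+1,\dots,d$.

Reorder and change signs of $q$ to conform with $v$.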
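Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the names `prox_k_support` and `proximal_gradient` are hypothetical, the pair $(r, l)$ is found by a plain double loop over the $O(kd)$ candidates, the non-strict comparisons use a small tolerance `tol`, and the strict inequality on $z_l$ in (8) is relaxed to a non-strict one so that inputs with zero or tied entries still yield a pair. The second function is plain (unaccelerated) proximal gradient descent, assuming a squared loss $\tfrac12\|Xw - y\|^2$ and $\beta = L/\lambda$.

```python
import numpy as np


def prox_k_support(v, k, beta, tol=1e-12):
    """Prox of (1/(2*beta)) * (k-support norm)^2 at v (sketch of Algorithm 1)."""
    v = np.asarray(v, dtype=float)
    d = v.size
    order = np.argsort(-np.abs(v))            # positions of |v| in decreasing order
    z = np.abs(v)[order]
    # z_pad[i] is the 1-based z_i, with z_0 = +inf and z_{d+1} = -inf
    z_pad = np.concatenate(([np.inf], z, [-np.inf]))
    for r in range(k):
        T = z[k - r - 1:k - 1].sum()          # z_{k-r} + ... + z_{k-1}
        for l in range(k, d + 1):
            T += z[l - 1]                     # now T equals T_{r,l}
            t = T / (l - k + (beta + 1) * r + beta + 1)
            cond7 = (z_pad[k - r - 1] / (beta + 1) > t
                     and t >= z_pad[k - r] / (beta + 1) - tol)
            # assumption of this sketch: z_l >= t instead of the strict z_l > t
            cond8 = z_pad[l] >= t - tol and t >= z_pad[l + 1] - tol
            if cond7 and cond8:
                q_sorted = np.zeros(d)
                q_sorted[:k - r - 1] = beta / (beta + 1) * z[:k - r - 1]
                q_sorted[k - r - 1:l] = np.maximum(z[k - r - 1:l] - t, 0.0)
                q = np.zeros(d)
                q[order] = q_sorted           # undo the sorting
                return np.sign(v) * q         # restore the signs of v
    raise RuntimeError("no pair (r, l) satisfied conditions (7) and (8)")


def proximal_gradient(X, y, k, lam, n_iter=500):
    """Proximal gradient for 0.5*||Xw - y||^2 + (lam/2)*(k-support norm of w)^2."""
    L = np.linalg.norm(X, 2) ** 2             # Lipschitz constant of the loss gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = prox_k_support(w - grad / L, k, beta=L / lam)
    return w
```

With $k = d$ the regularizer is the squared $\ell_2$ norm, so the prox reduces to scaling by $\beta/(\beta+1)$ and the loop should converge to the ridge regression solution, which is a convenient sanity check.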
Figure 2: Solutions learned for the synthetic data. Left to right: k-support, Lasso and elastic net.

Proof of Correctness of Algorithm 1. Since the k-support norm is sign and permutation invariant, $\mathrm{prox}_g(v)$ has the same ordering and signs as $v$. Hence, without loss of generality, we may assume that $v_1 \ge \dots \ge v_d \ge 0$ and require that $q_1 \ge \dots \ge q_d \ge 0$, which follows from inequality (7) and the fact that $z$ is ordered. Now, $q = \mathrm{prox}_g(v)$ is equivalent to $\beta z - \beta q = \beta v - \beta q \in \partial\, \tfrac12 (\|\cdot\|_k^{sp})^2 (q)$. It suffices to show that, for $w = q$, $\beta z - \beta q$ is an optimal $\alpha$ in the proof of Proposition 2.1. Indeed, $A_r$ corresponds to

$$\sum_{i=k-r}^{d} q_i = \sum_{i=k-r}^{l} \Big( z_i - \frac{T_{r,l}}{l - k + (\beta+1) r + \beta + 1} \Big) = T_{r,l} - \frac{(l - k + r + 1)\, T_{r,l}}{l - k + (\beta+1) r + \beta + 1} = (r+1)\, \frac{\beta\, T_{r,l}}{l - k + (\beta+1) r + \beta + 1}$$

and (4) is equivalent to condition (7). For $i \le k-r-1$, we have $\beta z_i - \beta q_i = q_i$. For $k-r \le i \le l$, we have $\beta z_i - \beta q_i = \frac{1}{r+1} A_r$. For $i \ge l+1$, since $q_i = 0$, we only need $\beta z_i - \beta q_i \le \frac{1}{r+1} A_r$, which is true by (8).

We can now apply a standard accelerated proximal method, such as FISTA [1], to (5), at each iteration using the gradient of the loss and performing a prox step using Algorithm 1. The FISTA guarantee ensures us that, with appropriate step sizes, after $T$ such iterations, we have:

$$\tfrac12 \|Xw_T - y\|^2 + \tfrac{\lambda}{2} (\|w_T\|_k^{sp})^2 \le \Big( \tfrac12 \|Xw^* - y\|^2 + \tfrac{\lambda}{2} (\|w^*\|_k^{sp})^2 \Big) + \frac{2L\, \|w^* - w_1\|^2}{(T+1)^2}.$$

5 Empirical Comparisons

Our theoretical analysis indicates that the k-support norm and the elastic net differ by at most a factor of $\sqrt{2}$, corresponding to at most a factor of two difference in their sample complexities and generalization guarantees. We thus do not expect huge differences between their actual performances, but would still like to see whether the tighter relaxation of the k-support norm does yield some gains.

Synthetic Data. For the first simulation we follow [21, Sec. 5, example 4]. In this experimental protocol, the target (oracle) vector equals

$$w^* = (\underbrace{3, \dots, 3}_{15}, \underbrace{0, \dots, 0}_{25}),$$

with $y = (w^*)^\top x + N(0, 1)$.
The input data $X$ were generated from a normal distribution such that components $1,\dots,5$ have the same random mean $Z_1 \sim N(0,1)$, components $6,\dots,10$ have mean $Z_2 \sim N(0,1)$ and components $11,\dots,15$ have mean $Z_3 \sim N(0,1)$. A total of 50 data sets were created in this way, each containing 50 training points, 50 validation points and 350 test points. The goal is to achieve good prediction performance on the test data.

We compared the k-support norm with Lasso and the elastic net. We considered the ranges $k = \{1,\dots,d\}$ for k-support norm regularization, $\lambda = 10^i$, $i = \{-15,\dots,5\}$, for the regularization parameter of Lasso and k-support regularization, and the same range for the $\lambda_1$, $\lambda_2$ of the elastic net. For each method, the optimal set of parameters was selected based on mean squared error on the validation set. The error reported in Table 1 is the mean squared error with respect to the oracle $w^*$, namely $MSE = (\hat{w} - w^*)^\top V (\hat{w} - w^*)$, where $V$ is the population covariance matrix of $X_{test}$.

To further illustrate the effect of the k-support norm, in Figure 2 we show the coefficients learned by each method, in absolute value. For each image, one row corresponds to the $w$ learned for one of the 50 data sets. Whereas all three methods distinguish the 15 relevant variables, the elastic net result varies less within these variables.

South African Heart Data. This is a classification task which has been used in [8]. There are 9 variables and 462 examples, and the response is presence/absence of coronary heart disease. We normalized the data so that each predictor variable has zero mean and unit variance. We then split the data 50 times randomly into training, validation, and test sets of sizes 400, 30, and 32 respectively. For each method, parameters were selected using the validation data. In Table 1, we report the MSE and accuracy of each method on the test data. We observe that all three methods have identical performance.

Table 1: Mean squared errors and classification accuracy for the synthetic data (median over 50 repetitions), SA heart data (median over 50 replications) and for the 20 newsgroups data set. (SE = standard error)

Method      | Synthetic MSE (SE) | Heart MSE (SE) | Heart Accuracy (SE) | Newsgroups MSE | Newsgroups Accuracy
Lasso       | (0.0)              | 0.8 (0.005)    | 66.4 (0.53)         |                |
Elastic net | 0.74 (0.0)         | 0.8 (0.005)    | 66.4 (0.53)         |                |
k-support   | 0.43 (0.0)         | 0.8 (0.005)    | 66.4 (0.53)         |                |

20 Newsgroups. This is a binary classification version of 20 newsgroups created in [12] which can be found in the LIBSVM data repository (footnote 4). The positive class consists of the 10 groups with names of form sci.*, comp.*, or misc.forsale and the negative class consists of the other 10 groups. To reduce the number of features, we removed the words which appear in less than 3 documents. We randomly split the data into a training, a validation and a test set of sizes 14000, 1000 and 4996, respectively. We report MSE and accuracy on the test data in Table 1. We found that k-support regularization gave improved prediction accuracy over both other methods (footnote 5).

6 Summary

We introduced the k-support norm as the tightest convex relaxation of sparsity plus $\ell_2$ regularization, and showed that it is tighter than the elastic net by a factor of at most $\sqrt{2}$, a bound that cannot be improved. In our view, this sheds light on the elastic net as a close approximation to this tightest possible convex relaxation, and motivates using the k-support norm when a tighter relaxation is sought. This is also demonstrated in our empirical results.
We note that the k-support norm has better prediction properties, but not necessarily better sparsity-inducing properties, as evident from its more rounded unit ball. It is well understood that there is often a trade-off between sparsity and good prediction, and that even if the population optimal predictor is sparse, a denser predictor often yields better predictive performance [3, 10, 21]. For example, in the presence of correlated features, it is often beneficial to include several highly correlated features rather than a single representative feature. This is exactly the behavior encouraged by $\ell_2$ norm regularization, and the elastic net is already known to yield less sparse (but more predictive) solutions. The k-support norm goes a step further in this direction, often yielding solutions that are even less sparse (but more predictive) compared to the elastic net.

Nevertheless, it is interesting to consider whether compressed sensing results, where $\ell_1$ regularization is of course central, can be refined by using the k-support norm, which might be able to handle more correlation structure within the set of features.

Acknowledgements

The construction showing that the gap between the elastic net and the k-overlap norm can be as large as $\sqrt{2}$ is due to joint work with Ohad Shamir. Rina Foygel was supported by NSF grant DMS

Footnote 4: cjlin/libsvmtools/datasets/

Footnote 5: Regarding other sparse prediction methods, we did not manage to compare with OSCAR, due to memory limitations, or to PEN or trace Lasso, which do not have code available online.

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal of Imaging Sciences, 2(1):183-202, 2009.
[2] R. Bhatia. Matrix Analysis. Graduate Texts in Mathematics. Springer, 1997.
[3] H.D. Bondell and B.J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115-123, 2008.
[4] P.L. Combettes and V.R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168-1200, 2006.
[5] C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201-230, 2009.
[6] E. Grave, G. R. Obozinski, and F. Bach. Trace lasso: a trace norm regularization for correlated designs. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, 2011.
[7] G. H. Hardy, J. E. Littlewood, and G. Pólya. Inequalities. Cambridge University Press, 1934.
[8] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag Series in Statistics, 2001.
[9] A.E. Hoerl and R.W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, pages 55-67, 1970.
[10] L. Jacob, G. Obozinski, and J.P. Vert. Group Lasso with overlap and graph Lasso. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
[11] S.M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, 2008.
[12] S. S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341-361, 2005.
[13] A. Lorbert, D. Eis, V. Kostina, D.M. Blei, and P.J. Ramadge. Exploiting covariate similarity in sparse regression via the pairwise elastic net. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
[14] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE, 2007.
[15] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low-noise and fast rates. In Advances in Neural Information Processing Systems 23, 2010.
[16] T. Suzuki and R. Tomioka. SpicyMKL: a fast algorithm for multiple kernel learning with thousands of kernels. Machine Learning, pages 1-32, 2011.
[17] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 58(1):267-288, 1996.
[18] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Preprint, 2008.
[19] P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 125(2):263-295, 2010.
[20] T. Zhang. Covering number bounds of certain regularized linear function classes. The Journal of Machine Learning Research, 2:527-550, 2002.
[21] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301-320, 2005.