Efficient variable selection in support vector machines via the alternating direction method of multipliers
Gui-Bo Ye, Yifei Chen, Xiaohui Xie
University of California, Irvine

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

Abstract

The support vector machine (SVM) is a widely used tool for classification. Although commonly understood as a method of finding the maximum-margin hyperplane, it can also be formulated as a regularized function estimation problem, corresponding to a hinge loss function plus an $\ell_2$-norm regularization term. The doubly regularized support vector machine (DrSVM) is a variant of the standard SVM, which introduces an additional $\ell_1$-norm regularization term on the fitted coefficients. The combined $\ell_1$ and $\ell_2$ regularization, termed the elastic net penalty, has the property of achieving simultaneous variable selection and margin maximization within a single framework. However, because of the nondifferentiability of both the loss function and the regularization term, there is no efficient method available to solve DrSVM for large-scale problems. Here we develop an efficient algorithm based on the alternating direction method of multipliers (ADMM) to solve the optimization problem in DrSVM. The utility of the method is illustrated using both simulated and real-world data.

1 INTRODUCTION

Datasets with tens of thousands of variables have become increasingly common in many real-world applications. For example, in the biomedical domain a microarray dataset typically contains about 20,000 genes, while a genotype dataset commonly includes half a million SNPs. Regularization terms that encourage sparsity in coefficients are increasingly being used for simultaneous variable selection and prediction (Tibshirani, 1996; Zou and Hastie, 2005).

A widely used strategy for imposing sparsity on regression or classification coefficients is to use $\ell_1$-norm regularization. Perhaps the most well-known example is the least absolute shrinkage and selection operator (lasso) for linear regression. The method minimizes the usual sum of squared errors while penalizing the $\ell_1$ norm of the regression coefficients (Tibshirani, 1996). Due to the nondifferentiability of the $\ell_1$ norm, the lasso is able to perform continuous shrinkage and automatic variable selection simultaneously. Although the lasso has shown success in many situations and has been generalized to different settings (Zhu et al., 2003; Lin and Zhang, 2006), it has several limitations. First, when the dimension of the data ($p$) is larger than the number of training samples ($n$), the lasso selects at most $n$ variables before it saturates (Efron et al., 2004). Second, if there is a group of variables among which the pairwise correlations are very high, the lasso tends to select only one variable from the group and does not care which one is selected. The elastic net penalty proposed in Zou and Hastie (2005) is a convex combination of the lasso and ridge penalties, and has the characteristics of both the lasso and ridge regression in the regression setting. More specifically, the elastic net penalty simultaneously performs automatic variable selection and continuous shrinkage, and it can select groups of correlated variables.
It is especially useful for large $p$, small $n$ problems, where the grouped-variables situation is a particularly important concern and has been addressed many times in the literature (Hastie et al., 2000, 2003).

The idea of using $\ell_1$-norm constraints to automatically select variables has also been extended to classification problems. Zhu et al. (2003) proposed an $\ell_1$-norm support vector machine, whereas Wang et al. (2006) proposed an SVM with the elastic net penalty, which they named the doubly regularized support vector machine (DrSVM). By using a mixture of the $\ell_1$-norm and $\ell_2$-norm penalties, DrSVM is able to perform automatic variable selection, as does the $\ell_1$-norm SVM.
Additionally, it encourages highly correlated variables to be selected (or removed) together, and thus achieves the grouping effect. Although DrSVM has a number of desirable features, solving DrSVM is non-trivial because of the nondifferentiability of both the loss function and the regularization term. This is especially problematic for large-scale problems. To circumvent this difficulty, Wang et al. (2008) proposed a hybrid huberized support vector machine (HHSVM), which uses a huberized hinge loss function to approximate the hinge loss in DrSVM. Because the huberized hinge loss function is differentiable, HHSVM is easier to solve than DrSVM. Wang et al. (2008) proposed a path algorithm to solve the HHSVM problem. However, because the path algorithm requires tracking the disappearance of variables along a regularization path, it is not easy to implement and still does not handle large-scale data well.

Our main contribution in this paper is to introduce a new algorithm to directly solve DrSVM without resorting to approximation as in HHSVM. Our method is based on the alternating direction method of multipliers (ADMM) (Gabay and Mercier, 1976; Glowinski and Marroco). We demonstrate that the method is efficient even for large-scale problems with tens of thousands of variables.

The rest of the paper is organized as follows. In Section 2, we provide a description of the SVM model with the elastic net penalty. In Section 3, we derive an iterative algorithm based on ADMM to solve the optimization problem in DrSVM and prove its convergence. In Section 4, we benchmark the performance of the algorithm on both simulated and real-world data.

2 SUPPORT VECTOR MACHINES WITH ELASTIC NET PENALTY

2.1 SVM as regularized function estimation

Consider the classification of the training data $\{(x_i, y_i)\}_{i=1}^n$, where $x_i = (x_{i1}, \ldots, x_{ip})^T$ are the predictor variables and $y_i \in \{1, -1\}$ is the corresponding class label. The support vector machine (SVM) was originally proposed to find the optimal separating hyperplane that separates the two classes of data points with the largest margin (Vapnik, 1998). It can be equivalently reformulated as an $\ell_2$-norm penalized optimization problem:

$$\min_{\beta_0, \beta} \; \frac{1}{n}\sum_{i=1}^n \big(1 - y_i(\beta_0 + x_i^T\beta)\big)_+ + \frac{\lambda}{2}\|\beta\|_2^2, \qquad (1)$$

where the loss function $(t)_+ := \max(t, 0)$ is called the hinge loss, and $\lambda \ge 0$ is a regularization parameter, which controls the balance between the loss and the penalty. By shrinking the magnitude of the coefficients, the $\ell_2$-norm penalty in (1) reduces the variance of the estimated coefficients and thus can achieve better prediction accuracy. However, the $\ell_2$-norm penalty cannot produce sparse coefficients and hence cannot automatically perform variable selection. This is a major limitation for applying the SVM to classification of some high-dimensional data, such as gene expression data from microarrays (Guyon et al., 2002; Mukherjee et al., 1999), where variable selection is essential both for achieving better prediction accuracy and for providing reasonable interpretations. To include variable selection, Zhu et al. (2003) proposed an $\ell_1$-norm support vector machine,

$$\min_{\beta_0, \beta} \; \frac{1}{n}\sum_{i=1}^n \big(1 - y_i(\beta_0 + x_i^T\beta)\big)_+ + \lambda\|\beta\|_1, \qquad (2)$$

which does variable selection automatically via the $\ell_1$ penalty. However, it shares similar disadvantages with the lasso method for large $p$, small $n$ problems, such as selecting at most $n$ relevant variables and disregarding group effects.
This is not satisfying for some applications. In microarray analysis, we almost always have $p \gg n$. Furthermore, the genes in the same biological pathway frequently show highly correlated expression; it is desirable to identify all of them, instead of a subset, both for providing biological interpretations and for building prediction models. One natural way to overcome the limitations outlined above is to apply the elastic net penalty to the SVM:

$$\min_{\beta_0, \beta} \; \frac{1}{n}\sum_{i=1}^n \big(1 - y_i(\beta_0 + x_i^T\beta)\big)_+ + \lambda_1\|\beta\|_1 + \frac{\lambda_2}{2}\|\beta\|_2^2, \qquad (3)$$

where $\lambda_1, \lambda_2 \ge 0$ are regularization parameters. The model was originally proposed by Wang et al. (2006), and was named the doubly regularized SVM (DrSVM). However, to emphasize the role of the elastic net penalty, we refer to the model (3) as the elastic net SVM, or simply ENSVM, in the rest of the paper. Due to the properties of the elastic net penalty, the optimal solution of (3) enjoys both the sparsity and the grouping effect, the same as the elastic net method in regression.
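To make the objective in (3) concrete, here is a minimal NumPy sketch that evaluates the ENSVM objective at a given $(\beta, \beta_0)$; the function name and the toy data are illustrative only, not code from the paper.

```python
import numpy as np

def ensvm_objective(beta, beta0, X, y, lam1, lam2):
    """Elastic-net SVM objective (3):
    (1/n) * sum_i (1 - y_i (beta0 + x_i^T beta))_+ + lam1*||beta||_1 + (lam2/2)*||beta||_2^2."""
    margins = y * (X @ beta + beta0)           # y_i (beta0 + x_i^T beta)
    hinge = np.maximum(0.0, 1.0 - margins)     # hinge loss per sample
    return hinge.mean() + lam1 * np.abs(beta).sum() + 0.5 * lam2 * beta @ beta

# Tiny usage example with random data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.where(rng.normal(size=20) >= 0.0, 1.0, -1.0)
beta, beta0 = rng.normal(size=5), 0.0
print(ensvm_objective(beta, beta0, X, y, lam1=0.1, lam2=1.0))
```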
2.2 RELATED WORK

A similar model has been proposed by Wang et al. (2008), who applied the elastic net penalty to the huberized hinge loss function and proposed the HHSVM:

$$\min_{\beta_0, \beta} \; \frac{1}{n}\sum_{i=1}^n \phi\big(y_i(\beta_0 + x_i^T\beta)\big) + \lambda_1\|\beta\|_1 + \frac{\lambda_2}{2}\|\beta\|_2^2, \qquad (4)$$

where $\phi$ is the huberized hinge loss function

$$\phi(t) = \begin{cases} 0, & \text{for } t > 1, \\ (1-t)^2/(2\delta), & \text{for } 1-\delta < t \le 1, \\ 1 - t - \delta/2, & \text{for } t \le 1-\delta, \end{cases} \qquad (5)$$

with $\delta > 0$ being a pre-specified constant. The main motivation for Wang et al. (2008) to use the huberized hinge loss function (5) is that it approximates the hinge loss and is differentiable everywhere, thereby making the optimization problem easier to solve while at the same time preserving the variable selection feature.

The minimizer of (4) is piecewise linear with respect to $\lambda_1$ for a fixed $\lambda_2$. Based on this observation, Wang et al. (2008) proposed a path algorithm to solve the HHSVM problem. The path algorithm keeps track of four sets as $\lambda_1$ decreases, and calls an event happening if any one of the four sets changes. Between any two consecutive events, the solutions are linear in $\lambda_1$; after an event occurs, the derivative of the solution with respect to $\lambda_1$ changes. When each event happens, the algorithm solves a linear system. If the dimension of the data $p$ is large, solving many large-scale linear systems is required to obtain the solution path. Furthermore, those linear systems are quite different from each other, and there are no special structures involved. As a result, the path algorithm is computationally very expensive for large $p$ problems.

3 ALGORITHM FOR ELASTIC NET SVM

The alternating direction method of multipliers (ADMM), developed in the 1970s (Gabay and Mercier, 1976; Glowinski and Marroco), has recently become a method of choice for solving many large-scale problems (Candes et al., 2009; Cai et al., 2009; Goldstein and Osher, 2009). It is equivalent or closely related to many other algorithms, such as Douglas-Rachford splitting (Wu and Tai, 2010), the split Bregman method (Goldstein and Osher, 2009) and the method of multipliers (Rockafellar, 1973). In this section, we propose an efficient algorithm based on ADMM to solve the ENSVM problem (3) by introducing auxiliary variables and reformulating the original problem.

3.1 DERIVING ADMM FOR ELASTIC NET SVM

Because of the two nondifferentiable terms in (3), it is hard to solve the ENSVM problem directly. In order to derive an ADMM algorithm, we introduce auxiliary variables to handle the nondifferentiability of the hinge loss and of the $\ell_1$-norm term. Let $X = (x_{ij})_{i=1,j=1}^{n,p}$ and let $Y$ be a diagonal matrix whose diagonal elements are the entries of the vector $y = (y_1, \ldots, y_n)^T$. The unconstrained problem (3) can be reformulated into an equivalent constrained problem

$$\arg\min_{\beta, \beta_0, a, c} \; \frac{1}{n}\sum_{i=1}^n (a_i)_+ + \lambda_1\|c\|_1 + \frac{\lambda_2}{2}\|\beta\|_2^2 \quad \text{s.t.} \quad a = \mathbf{1} - Y(X\beta + \beta_0\mathbf{1}), \;\; c = \beta, \qquad (6)$$

where $a = (a_1, \ldots, a_n)^T$ and $\mathbf{1}$ is an $n$-column vector of 1s. Note that the Lagrangian function of (6) is

$$L(\beta, \beta_0, a, c, u, v) = \frac{1}{n}\sum_{i=1}^n (a_i)_+ + \lambda_1\|c\|_1 + \frac{\lambda_2}{2}\|\beta\|_2^2 + \langle u, \mathbf{1} - Y(X\beta + \beta_0\mathbf{1}) - a\rangle + \langle v, \beta - c\rangle, \qquad (7)$$

where $u \in \mathbb{R}^n$ is a dual variable corresponding to the linear constraint $a = \mathbf{1} - Y(X\beta + \beta_0\mathbf{1})$, $v \in \mathbb{R}^p$ is a dual variable corresponding to the linear constraint $c = \beta$, and $\langle\cdot,\cdot\rangle$ denotes the standard inner product in Euclidean space. The augmented Lagrangian function of (6) is similar to (7) except for adding the two terms $\frac{\mu_1}{2}\|\mathbf{1} - Y(X\beta + \beta_0\mathbf{1}) - a\|_2^2$ and $\frac{\mu_2}{2}\|\beta - c\|_2^2$, which penalize violations of the linear constraints $a = \mathbf{1} - Y(X\beta + \beta_0\mathbf{1})$ and $c = \beta$, thereby making the function strictly convex. That is,

$$L_{\mu}(\beta, \beta_0, a, c, u, v) = L(\beta, \beta_0, a, c, u, v) + \frac{\mu_1}{2}\|\mathbf{1} - Y(X\beta + \beta_0\mathbf{1}) - a\|_2^2 + \frac{\mu_2}{2}\|\beta - c\|_2^2, \qquad (8)$$

where $\mu_1 > 0$ and $\mu_2 > 0$ are two parameters.
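As a quick sanity check on the splitting in (6)-(8), the following NumPy sketch evaluates the augmented Lagrangian for given primal and dual variables. The vectorized form, function name and interface are our own choices, not code from the paper.

```python
import numpy as np

def aug_lagrangian(beta, beta0, a, c, u, v, X, y, lam1, lam2, mu1, mu2):
    """Augmented Lagrangian (8) for the constrained problem (6),
    with constraints a = 1 - Y(X beta + beta0*1) and c = beta."""
    r_a = 1.0 - y * (X @ beta + beta0) - a      # residual of the first constraint
    r_c = beta - c                              # residual of the second constraint
    obj = np.maximum(0.0, a).mean() + lam1 * np.abs(c).sum() + 0.5 * lam2 * beta @ beta
    return (obj + u @ r_a + v @ r_c
            + 0.5 * mu1 * r_a @ r_a + 0.5 * mu2 * r_c @ r_c)
```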
It is easy to see that solving (6) is equivalent to finding a saddle point $(\beta^*, \beta_0^*, a^*, c^*, u^*, v^*)$ of $L_{\mu}$, i.e., a point such that

$$L_{\mu}(\beta^*, \beta_0^*, a^*, c^*, u, v) \le L_{\mu}(\beta^*, \beta_0^*, a^*, c^*, u^*, v^*) \le L_{\mu}(\beta, \beta_0, a, c, u^*, v^*)$$

for all $\beta$, $\beta_0$, $a$, $c$, $u$ and $v$. We solve the saddle point problem through gradient ascent on the dual problem

$$\max_{u, v} \; E(u, v), \qquad (9)$$

where $E(u, v) = \min_{\beta, \beta_0, a, c} L_{\mu}(\beta, \beta_0, a, c, u, v)$. Note that the gradient $\nabla E(u, v)$ can be calculated as follows (Bertsekas, 1982):

$$\nabla E(u, v) = \begin{pmatrix} \mathbf{1} - Y\big(X\beta(u, v) + \beta_0(u, v)\mathbf{1}\big) - a(u, v) \\ \beta(u, v) - c(u, v) \end{pmatrix}, \qquad (10)$$
with

$$\big(\beta(u, v), \beta_0(u, v), a(u, v), c(u, v)\big) = \arg\min_{\beta, \beta_0, a, c} L_{\mu}(\beta, \beta_0, a, c, u, v). \qquad (11)$$

Using gradient ascent on the dual problem (9) together with Eq. (10) and Eq. (11), we get the method of multipliers (Rockafellar, 1973) for solving (6):

$$\begin{aligned} (\beta^{k+1}, \beta_0^{k+1}, a^{k+1}, c^{k+1}) &= \arg\min_{\beta, \beta_0, a, c} L_{\mu}(\beta, \beta_0, a, c, u^k, v^k), \\ u^{k+1} &= u^k + \mu_1\big(\mathbf{1} - Y(X\beta^{k+1} + \beta_0^{k+1}\mathbf{1}) - a^{k+1}\big), \\ v^{k+1} &= v^k + \mu_2(\beta^{k+1} - c^{k+1}). \end{aligned} \qquad (12)$$

The efficiency of the iterative algorithm (12) depends on whether the first equation of (12) can be solved quickly. The augmented Lagrangian $L_{\mu}$ still contains nondifferentiable terms. But in contrast to the original objective function (3), the nondifferentiability induced by the hinge loss has now been transferred from terms involving $y_i(x_i^T\beta + \beta_0)$ to terms involving $a_i$, and the nondifferentiability induced by the $\ell_1$ norm has been transferred from terms involving $\beta$ to terms involving $c$. Moreover, the nondifferentiable terms involving $a$ and $c$ are now completely decoupled, and thus we can solve the first equation of (12) by alternating minimization over $(\beta, \beta_0)$, $a$ and $c$:

$$\begin{aligned} (\beta^{k+1}, \beta_0^{k+1}) &= \arg\min_{\beta, \beta_0} L_{\mu}(\beta, \beta_0, a^k, c^k, u^k, v^k), \\ a^{k+1} &= \arg\min_{a} L_{\mu}(\beta^{k+1}, \beta_0^{k+1}, a, c^k, u^k, v^k), \\ c^{k+1} &= \arg\min_{c} L_{\mu}(\beta^{k+1}, \beta_0^{k+1}, a^{k+1}, c, u^k, v^k). \end{aligned} \qquad (13)$$

For the method of multipliers, the alternating minimization (13) needs to run multiple times until convergence. However, we do not have to solve the first equation of (12) exactly, since it is only one step of the overall iterative algorithm. If we use only one alternation, the resulting scheme is called the alternating direction method of multipliers (Gabay and Mercier, 1976). That is, we use the following iterations to solve (6):

$$\begin{aligned} (\beta^{k+1}, \beta_0^{k+1}) &= \arg\min_{\beta, \beta_0} L_{\mu}(\beta, \beta_0, a^k, c^k, u^k, v^k), \\ a^{k+1} &= \arg\min_{a} L_{\mu}(\beta^{k+1}, \beta_0^{k+1}, a, c^k, u^k, v^k), \\ c^{k+1} &= \arg\min_{c} L_{\mu}(\beta^{k+1}, \beta_0^{k+1}, a^{k+1}, c, u^k, v^k), \\ u^{k+1} &= u^k + \mu_1\big(\mathbf{1} - Y(X\beta^{k+1} + \beta_0^{k+1}\mathbf{1}) - a^{k+1}\big), \\ v^{k+1} &= v^k + \mu_2(\beta^{k+1} - c^{k+1}). \end{aligned} \qquad (14)$$

The first equation in (14) is equivalent to

$$(\beta^{k+1}, \beta_0^{k+1}) = \arg\min_{\beta, \beta_0} \; \frac{\lambda_2}{2}\|\beta\|_2^2 + \langle v^k, \beta - c^k\rangle + \langle u^k, \mathbf{1} - Y(X\beta + \beta_0\mathbf{1}) - a^k\rangle + \frac{\mu_1}{2}\|\mathbf{1} - Y(X\beta + \beta_0\mathbf{1}) - a^k\|_2^2 + \frac{\mu_2}{2}\|\beta - c^k\|_2^2.$$

The objective function in this minimization problem is quadratic and differentiable, and thus the optimal solution can be found by solving a set of linear equations:

$$\begin{pmatrix} (\lambda_2 + \mu_2)I + \mu_1 X^TX & \mu_1 X^T\mathbf{1} \\ \mu_1 \mathbf{1}^TX & \mu_1 n \end{pmatrix} \begin{pmatrix} \beta^{k+1} \\ \beta_0^{k+1} \end{pmatrix} = \begin{pmatrix} X^TYu^k + \mu_1 X^TY(\mathbf{1} - a^k) - v^k + \mu_2 c^k \\ \mathbf{1}^TYu^k + \mu_1 \mathbf{1}^TY(\mathbf{1} - a^k) \end{pmatrix}. \qquad (15)$$

Note that the coefficient matrix in (15) is a $(p+1)\times(p+1)$ matrix, independent of the optimization variables. For small $p$, we can store its inverse in memory, so the linear equations can be solved with minimal cost. For large $p$, we use the conjugate gradient algorithm (CG) to solve the system efficiently at each iteration. The linear system (15) is very special for large $p$, small $n$ problems in that $X^TX$ is a positive semidefinite matrix with rank at most $n$. Thus the coefficient matrix in (15) is a linear combination of the identity matrix and a positive semidefinite matrix with rank at most $n+1$. If we use CG to solve the linear system (15), it converges in fewer than $n+1$ steps (Saad, 2003). In our numerical implementation, we found that CG converges in a few steps, much fewer than $n+1$.

The second equation in (14) is equivalent to

$$a^{k+1} = \arg\min_{a} \; \frac{1}{n}\sum_{i=1}^n (a_i)_+ + \frac{\mu_1}{2}\|\mathbf{1} - Y(X\beta^{k+1} + \beta_0^{k+1}\mathbf{1}) - a\|_2^2 + \langle u^k, \mathbf{1} - Y(X\beta^{k+1} + \beta_0^{k+1}\mathbf{1}) - a\rangle. \qquad (16)$$

In order to solve (16), we need the following proposition (Ye and Xie, 2010).

Proposition 1. Let $s_{\lambda}(\omega) = \arg\min_{x \in \mathbb{R}} \; \lambda (x)_+ + \frac{1}{2}(x - \omega)^2$. Then

$$s_{\lambda}(\omega) = \begin{cases} \omega - \lambda, & \omega > \lambda, \\ 0, & 0 \le \omega \le \lambda, \\ \omega, & \omega < 0. \end{cases}$$
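Proposition 1 is simply the proximal map of the scaled positive-part function; a minimal vectorized NumPy sketch (our own helper name) with a small numeric check:

```python
import numpy as np

def shrink_plus(omega, lam):
    """Elementwise s_lam(omega) = argmin_x lam*(x)_+ + 0.5*(x - omega)^2:
    omega - lam if omega > lam, 0 if 0 <= omega <= lam, omega if omega < 0."""
    omega = np.asarray(omega, dtype=float)
    return np.where(omega > lam, omega - lam, np.where(omega < 0.0, omega, 0.0))

print(shrink_plus([-0.3, 0.2, 0.8], lam=0.5))   # -> [-0.3  0.   0.3]
```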
Note that each $a_i$ in (16) is independent of the others, and

$$\frac{1}{2\mu_1}\|u^k\|_2^2 + \frac{\mu_1}{2}\|\mathbf{1} - Y(X\beta^{k+1} + \beta_0^{k+1}\mathbf{1}) - a\|_2^2 + \langle u^k, \mathbf{1} - Y(X\beta^{k+1} + \beta_0^{k+1}\mathbf{1}) - a\rangle = \frac{\mu_1}{2}\Big\|a - \Big(\mathbf{1} + \frac{u^k}{\mu_1} - Y(X\beta^{k+1} + \beta_0^{k+1}\mathbf{1})\Big)\Big\|_2^2.$$

Together with Proposition 1, we can then update $a^{k+1}$ in (16) according to the following corollary.

Corollary 1. The update of $a^{k+1}$ in (16) is equivalent to

$$a^{k+1} = S_{\frac{1}{n\mu_1}}\Big(\mathbf{1} + \frac{u^k}{\mu_1} - Y(X\beta^{k+1} + \beta_0^{k+1}\mathbf{1})\Big), \qquad (17)$$

where $S_{\lambda}(\omega) = (s_{\lambda}(\omega_1), s_{\lambda}(\omega_2), \ldots, s_{\lambda}(\omega_n))^T$ for $\omega \in \mathbb{R}^n$.

The third equation in (14) is equivalent to

$$c^{k+1} = \arg\min_{c} \; \lambda_1\|c\|_1 + \langle v^k, \beta^{k+1} - c\rangle + \frac{\mu_2}{2}\|\beta^{k+1} - c\|_2^2. \qquad (18)$$

The minimization over $c$ in (18) can be done efficiently using soft thresholding, because the objective function is quadratic and the nondifferentiable terms are completely separable. Let $T_{\lambda}$ be the soft-thresholding operator defined on a vector space by

$$T_{\lambda}(\omega) = (t_{\lambda}(\omega_1), \ldots, t_{\lambda}(\omega_p)), \quad \omega \in \mathbb{R}^p, \qquad (19)$$

where $t_{\lambda}(\omega_i) = \mathrm{sgn}(\omega_i)\max\{0, |\omega_i| - \lambda\}$. Using the soft-thresholding operator (19), the optimal solution of (18) can be written as

$$c^{k+1} = T_{\frac{\lambda_1}{\mu_2}}\Big(\frac{v^k}{\mu_2} + \beta^{k+1}\Big). \qquad (20)$$

Finally, by combining (14), (15), (17) and (20), we obtain the ADMM algorithm for ENSVM (3) (Algorithm 1). It is a practical algorithm for large $p$, small $n$ problems and is very easy to code (a NumPy sketch of these updates appears at the end of Section 3.2).

Algorithm 1: ADMM for ENSVM (3)
Initialize $\beta^0$, $\beta_0^0$, $a^0$, $c^0$, $u^0$, and $v^0$.
repeat
  1) Update $(\beta^{k+1}, \beta_0^{k+1})$ by solving the linear system (15).
  2) $a^{k+1} = S_{\frac{1}{n\mu_1}}\big(\mathbf{1} + \frac{u^k}{\mu_1} - Y(X\beta^{k+1} + \beta_0^{k+1}\mathbf{1})\big)$
  3) $c^{k+1} = T_{\frac{\lambda_1}{\mu_2}}\big(\frac{v^k}{\mu_2} + \beta^{k+1}\big)$
  4) $u^{k+1} = u^k + \mu_1\big(\mathbf{1} - Y(X\beta^{k+1} + \beta_0^{k+1}\mathbf{1}) - a^{k+1}\big)$
  5) $v^{k+1} = v^k + \mu_2(\beta^{k+1} - c^{k+1})$
until convergence

3.2 Convergence analysis

The convergence of Algorithm 1 follows from the standard convergence theory of the alternating direction method of multipliers (Gabay and Mercier, 1976; Eckstein and Bertsekas, 1992).

Theorem 1. Suppose there exists at least one solution $(\beta^*, \beta_0^*)$ of (3), and assume $\lambda_1 > 0$, $\lambda_2 > 0$. Then the iterates of Algorithm 1 satisfy

$$\lim_{k \to \infty} \frac{1}{n}\sum_{i=1}^n \big(1 - y_i(x_i^T\beta^k + \beta_0^k)\big)_+ + \lambda_1\|\beta^k\|_1 + \frac{\lambda_2}{2}\|\beta^k\|_2^2 = \frac{1}{n}\sum_{i=1}^n \big(1 - y_i(x_i^T\beta^* + \beta_0^*)\big)_+ + \lambda_1\|\beta^*\|_1 + \frac{\lambda_2}{2}\|\beta^*\|_2^2.$$

Furthermore,

$$\lim_{k \to \infty} \|(\beta^k, \beta_0^k) - (\beta^*, \beta_0^*)\| = 0$$

whenever (3) has a unique solution.
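Following Algorithm 1, here is a compact NumPy sketch of the full iteration. For clarity it solves the $(p+1)\times(p+1)$ system (15) with a dense direct solve (the paper uses conjugate gradients for large $p$) and runs a fixed number of iterations instead of the stopping rule described in Section 4; the function names and default parameter values are our own choices, so treat it as an illustration of the method, not the authors' implementation.

```python
import numpy as np

def soft_threshold(omega, lam):
    # t_lam(w) = sign(w) * max(|w| - lam, 0), applied elementwise (Eq. 19)
    return np.sign(omega) * np.maximum(np.abs(omega) - lam, 0.0)

def shrink_plus(omega, lam):
    # s_lam(w) from Proposition 1, applied elementwise
    return np.where(omega > lam, omega - lam, np.where(omega < 0.0, omega, 0.0))

def admm_ensvm(X, y, lam1, lam2, mu1=None, mu2=50.0, n_iter=500):
    """ADMM sketch for the elastic-net SVM (3). X is n x p, y has entries +/-1.
    Returns (beta, beta0)."""
    n, p = X.shape
    mu1 = 100.0 / n if mu1 is None else mu1    # illustrative default, not prescribed here
    YX = y[:, None] * X                        # rows are y_i * x_i^T, i.e. Y X
    # Coefficient matrix of the linear system (15); it is fixed across iterations.
    M = np.zeros((p + 1, p + 1))
    M[:p, :p] = (lam2 + mu2) * np.eye(p) + mu1 * X.T @ X
    M[:p, p] = mu1 * X.sum(axis=0)             # mu1 * X^T 1
    M[p, :p] = mu1 * X.sum(axis=0)             # mu1 * 1^T X
    M[p, p] = mu1 * n
    a, c = np.zeros(n), np.zeros(p)
    u, v = np.zeros(n), np.zeros(p)
    for _ in range(n_iter):
        # 1) (beta, beta0) update: solve the linear system (15)
        rhs = np.concatenate([
            YX.T @ u + mu1 * YX.T @ (1.0 - a) - v + mu2 * c,
            [y @ u + mu1 * y @ (1.0 - a)],
        ])
        sol = np.linalg.solve(M, rhs)
        beta, beta0 = sol[:p], sol[p]
        margins = y * (X @ beta + beta0)       # y_i (x_i^T beta + beta0)
        # 2) a update (Corollary 1)
        a = shrink_plus(1.0 + u / mu1 - margins, 1.0 / (n * mu1))
        # 3) c update (Eq. 20)
        c = soft_threshold(v / mu2 + beta, lam1 / mu2)
        # 4)-5) dual updates
        u = u + mu1 * (1.0 - margins - a)
        v = v + mu2 * (beta - c)
    return beta, beta0
```

A call like `beta, beta0 = admm_ensvm(X, y, lam1=0.1, lam2=1.0)` returns the fitted coefficients; a new point `x` is then classified by the sign of `beta0 + x @ beta`.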
3.3 Computational cost

The efficiency of Algorithm 1 lies mainly in whether we can quickly solve the linear system (15). As described in Section 3.1, the coefficient matrix of the linear system (15) has a special structure and thus the system can be solved efficiently by the conjugate gradient method for large $p$, small $n$ problems. More specifically, the computational cost of solving (15) is $O(n^2 p)$. The number of iterations of Algorithm 1 is hard to predict, and it depends on the choice of $\mu_1$ and $\mu_2$. In our experience, only a few hundred iterations are needed to get a reasonable result when $\mu_1$ and $\mu_2$ are chosen appropriately.

Similar to our algorithm for (3), the major computational cost in each iteration of HHSVM also comes from solving a linear system. However, the linear system in HHSVM has no special structure. It takes at least $O(|A|^2)$ operations, with $|A|$ being the number of unknown variables. Moreover, $|A|$ can increase at each iteration. Furthermore, for large-scale problems it usually takes a few thousand steps for the algorithm to converge. That is why our algorithm for (3) is much faster than the path algorithm for HHSVM on large-scale problems.

4 NUMERICAL RESULTS

In this section, we use time trials on both simulated data and real microarray data to illustrate the efficiency of the ADMM algorithm for solving the elastic net SVM (ENSVM). To evaluate the performance of ADMM for ENSVM, we also compare it with the stochastic sub-gradient method and the path algorithm for HHSVM. Our algorithm and the stochastic sub-gradient method were implemented in Matlab, while HHSVM was implemented in R using the R code provided by the authors of Wang et al. (2008). All algorithms were run on a Windows platform, and time trials were generated on an Intel Core 2 Duo desktop PC (E7500, 2.93GHz).

The stopping criterion of Algorithm 1 for ENSVM is specified as follows. Let

$$\Phi(\beta^k, \beta_0^k) = \frac{1}{n}\sum_{i=1}^n \big(1 - y_i(x_i^T\beta^k + \beta_0^k)\big)_+ + \lambda_1\|\beta^k\|_1 + \frac{\lambda_2}{2}\|\beta^k\|_2^2.$$

According to Theorem 1, $\lim_{k\to\infty} \Phi(\beta^k, \beta_0^k) = \Phi(\beta^*, \beta_0^*)$. It is therefore reasonable to terminate the algorithm when the relative change of the energy functional $\Phi(\beta, \beta_0)$ falls below a certain threshold $\delta$. Furthermore, since Algorithm 1 solves (6), the linear constraints are satisfied at convergence; we therefore also expect $\frac{1}{\sqrt{n}}\|\mathbf{1} - Y(X\beta^k + \beta_0^k\mathbf{1}) - a^k\|_2 \le \delta$ and $\frac{1}{\sqrt{p}}\|\beta^k - c^k\|_2 \le \delta$ when we terminate the algorithm. We used $\delta = 10^{-5}$ in our simulations, i.e., we stop Algorithm 1 whenever

$$\mathrm{RelE} := \frac{|\Phi(\beta^k, \beta_0^k) - \Phi(\beta^{k-1}, \beta_0^{k-1})|}{\max\{1, \Phi(\beta^k, \beta_0^k)\}} \le 10^{-5},$$

$$\frac{1}{\sqrt{n}}\|\mathbf{1} - Y(X\beta^k + \beta_0^k\mathbf{1}) - a^k\|_2 \le 10^{-5} \quad \text{and} \quad \frac{1}{\sqrt{p}}\|\beta^k - c^k\|_2 \le 10^{-5}.$$

Note that the convergence of Algorithm 1 is guaranteed no matter what values of $\mu_1$ and $\mu_2$ are used, as shown in Theorem 1. However, the speed of the algorithm is influenced by the choice of $\mu_1$ and $\mu_2$, as it affects the number of iterations involved. In our implementation, we found empirically that choosing $\mu_1 = 100/n$ and $\mu_2 \in [25, 100]$ works well for all the problems we tested, though the parameter selection procedure can certainly be further improved.

4.1 SIMULATION

We consider a binary classification problem in which the samples lie in a $p$-dimensional space, with only the first 10 dimensions being relevant for classification and the remaining variables being noise. More specifically, we generate $n$ samples, half from class $+1$ and the other half from class $-1$. The samples from class $+1$ are i.i.d. draws from a normal distribution with mean

$$\mu_+ = (\underbrace{1, \ldots, 1}_{10}, \underbrace{0, \ldots, 0}_{p-10})^T$$

and covariance

$$\Sigma = \begin{pmatrix} \Sigma^* & \mathbf{0}_{10\times(p-10)} \\ \mathbf{0}_{(p-10)\times 10} & I_{(p-10)\times(p-10)} \end{pmatrix},$$

where the diagonal elements of the $10\times 10$ block $\Sigma^*$ are 1 and the off-diagonal elements are all equal to $\rho$. The class $-1$ has a similar distribution except that

$$\mu_- = (\underbrace{-1, \ldots, -1}_{10}, \underbrace{0, \ldots, 0}_{p-10})^T.$$
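To make the simulation setup concrete, here is a short NumPy sketch that generates one such dataset; the helper name and the use of NumPy's multivariate normal sampler are our own choices.

```python
import numpy as np

def simulate_data(n, p, rho, rng=None):
    """n samples in R^p: half from class +1 (mean +1 on the first 10 coordinates),
    half from class -1 (mean -1 there); the first 10 features have pairwise
    correlation rho, the remaining p-10 features are independent noise."""
    rng = np.random.default_rng() if rng is None else rng
    cov = np.eye(p)
    cov[:10, :10] = rho * np.ones((10, 10)) + (1.0 - rho) * np.eye(10)
    mu = np.zeros(p)
    mu[:10] = 1.0
    n_pos = n // 2
    X_pos = rng.multivariate_normal(mu, cov, size=n_pos)
    X_neg = rng.multivariate_normal(-mu, cov, size=n - n_pos)
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(n_pos), -np.ones(n - n_pos)])
    return X, y

X, y = simulate_data(n=50, p=300, rho=0.8, rng=np.random.default_rng(0))
```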
Table 1: Run times (CPU seconds) for various sizes $p$ and $n$ and different correlations $\rho$ between the features. The methods are the ADMM algorithm for the elastic-net SVM (ENSVM), the path algorithm for HHSVM, and the stochastic sub-gradient method (SSG). The results for ENSVM and the stochastic sub-gradient method are averaged over 25 runs (using 25 different values of $(\lambda_1, \lambda_2)$), and those for HHSVM are averaged over 5 runs (using 5 different values of $\lambda_2$).

n, p             Method    ρ = 0         ρ = 0.8
n=50,  p=300     ENSVM
                 HHSVM
                 SSG
n=100, p=500     ENSVM
                 HHSVM
                 SSG
n=200, p=1000    ENSVM
                 HHSVM
                 SSG
n=300, p=2000    ENSVM
                 HHSVM     2.07 hours    2.03 hours
                 SSG
n=400, p=5000    ENSVM
                 HHSVM     > 6 hours     > 6 hours
                 SSG
n=500, p=10000   ENSVM
                 HHSVM     -             -
                 SSG
So the Bayes optimal classification rule depends on $x_1, \ldots, x_{10}$, which are highly correlated if $\rho$ is large. The Bayes error is independent of the dimension $p$. This simulated data set was also used in Wang et al. (2008).

Table 1 shows the average CPU times (seconds) used by the ADMM algorithm, the path algorithm for HHSVM, and the stochastic sub-gradient method. Our algorithm consistently outperforms both the stochastic sub-gradient method and the path algorithm in all cases we tested. For the data with $n = 300$, $p = 2000$, the ADMM algorithm achieves a 20-fold speedup over the path algorithm. The ADMM algorithm is also significantly faster than the stochastic sub-gradient method, achieving about a 5-10 fold speedup in all cases. We should also note that, unlike the ADMM method, the objective function in the stochastic sub-gradient method can go up and down, which makes it difficult to design a stopping criterion for the sub-gradient method.

To evaluate how the performance of our algorithm scales with the problem size, we plotted the CPU time that Algorithm 1 took to solve (3) for the data described above as a function of $p$ and $n$. Figure 1 shows such curves, where the CPU times are averaged over 10 runs with different data. We note that the CPU times are roughly linear in both $n$ and $p$.

Figure 1: CPU times of the ADMM method for ENSVM for the same problem as in Table 1, for different values of $n$ and $p$. In each case the times are averaged over 10 runs. (a) CPU seconds versus the number of predictors $p$, with $n$ fixed at 300; (b) CPU seconds versus the number of samples $n$, with $p$ fixed.

We also compared the prediction accuracy and variable selection performance of three different models: the $\ell_1$-norm SVM (L1 SVM), HHSVM and ENSVM. The optimal $(\lambda_1, \lambda_2)$ pair is chosen from a large grid using 10-fold cross-validation. As shown in Table 2 and Table 3, HHSVM and ENSVM are similar in prediction and variable selection accuracy, but both are significantly better than the $\ell_1$-norm SVM.

Table 2: Comparison of test errors. The number of training samples is 50. The total number of input variables is 300, with only 10 being relevant for classification. The results are averages of test errors over 100 repetitions on a test set of 10,000 samples, and the numbers in parentheses are the corresponding standard errors. $\rho = 0$ corresponds to the case where the input variables are independent, while $\rho = 0.8$ corresponds to a pairwise correlation of 0.8 between relevant variables.

          ρ = 0           ρ = 0.8
SVM       0.24 (0.004)    0.60 (0.003)
L1 SVM    0.43 (0.007)    0.60 (0.002)
HHSVM     0.33 (0.005)    0.43 (0.001)
ENSVM     0.  (0.002)     0.44 (0.001)

Table 3: Comparison of variable selection. The setup is the same as described in Table 2. $q_{\mathrm{signal}}$ is the number of selected relevant variables, and $q_{\mathrm{noise}}$ is the number of selected noise variables.

                    ρ = 0                        ρ = 0.8
          q_signal       q_noise        q_signal       q_noise
L1 SVM    7.2 (0.3)      6.5 (1.4)      2.5 (0.2)      2.9 (1.2)
HHSVM     7.6 (0.3)      7.1 (1.3)      7.9 (0.4)      3.3 (2.5)
ENSVM     8.6 (0.1)      6.4 (0.4)      6.6 (0.2)      2.0 (0.2)
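For completeness, here is a hedged sketch of the kind of 10-fold cross-validation grid search described above for selecting $(\lambda_1, \lambda_2)$; it reuses the illustrative `admm_ensvm` and `simulate_data` helpers from the earlier sketches and is not the authors' code.

```python
import numpy as np

def cv_error(X, y, lam1, lam2, n_folds=10, rng=None):
    """Cross-validated misclassification rate of ENSVM at (lam1, lam2),
    using the admm_ensvm sketch from Section 3."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(X.shape[0]), n_folds)
    errs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        beta, beta0 = admm_ensvm(X[train], y[train], lam1, lam2)
        pred = np.where(X[test] @ beta + beta0 >= 0.0, 1.0, -1.0)
        errs.append(np.mean(pred != y[test]))
    return float(np.mean(errs))

# Illustrative grid search on one simulated dataset (small grid for brevity).
X, y = simulate_data(n=50, p=300, rho=0.8, rng=np.random.default_rng(0))
grid = [(l1, l2) for l1 in (0.01, 0.1, 1.0) for l2 in (0.1, 1.0, 10.0)]
best_lam1, best_lam2 = min(grid, key=lambda g: cv_error(X, y, *g, rng=np.random.default_rng(1)))
```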
4.2 GENE EXPRESSION DATA

A microarray gene expression dataset typically contains the expression values of tens of thousands of mRNAs collected from a relatively small number of samples. The genes sharing the same biological pathways are often highly correlated in gene expression (Segal et al.). Because of these two features, it is more desirable to apply the elastic net SVM to variable selection and classification on microarray data than the standard SVM or the $\ell_1$-SVM (Zou and Hastie, 2005; Wang et al., 2008).

The data we use are taken from the paper published by Alon et al. (1999). They contain microarray gene expression values collected from 62 samples (40 from colon tumor tissues and 22 from normal tissues). Each sample consists of the expression values of $p = 2000$ genes. We applied the elastic net SVM (ENSVM) to select variables (i.e., genes) that can be used to predict sample labels, and compared its performance to the path algorithm developed for HHSVM. The results are summarized in Table 4, which shows the computational times spent by the different solvers in a ten-fold cross-validation procedure for different parameters $\lambda_1$ and $\lambda_2$. The ADMM algorithm for ENSVM is consistently many times faster than the path algorithm for HHSVM, with an approximately ten-fold speedup in almost all cases.

Table 4: Run times (CPU seconds) for different values of the regularization parameters $\lambda_1$ and $\lambda_2$. The methods are the ADMM algorithm for ENSVM and the path algorithm for HHSVM.

λ1    λ2    10-CV error    ENSVM    HHSVM

We also tested the prediction and variable selection performance of ENSVM with Algorithm 1. Following the method in Wang et al. (2008), we randomly split the samples into a training set (27 cancer samples and 15 normal tissues) and a test set (13 cancer samples and 7 normal tissues). In the training phase, we adopt 10-fold cross-validation to tune the parameters $\lambda_1, \lambda_2$. This experiment is repeated 100 times. Table 5 shows the statistics of the test error and the number of selected genes, in comparison to the statistics of SVM and HHSVM. We note that in terms of test error, ENSVM is slightly better than HHSVM, which in turn is better than the standard SVM. In terms of variable selection, ENSVM tends to select a smaller number of genes than HHSVM.

Table 5: Comparison of test error and variable selection on the gene expression data. Shown are the averages from 100 repetitions; the numbers in parentheses are the standard deviations.

          Test error        Number of genes selected
SVM       7.9% (0.69%)      All
HHSVM     5.45% (0.59%)     138.37 (8.67)
ENSVM     4.95% (0.53%)     87.7 (7.9)

5 CONCLUSION

In this paper, we have derived an efficient algorithm based on the alternating direction method of multipliers to solve the optimization problem in the elastic net SVM (ENSVM). We show that the algorithm is substantially faster than both the sub-gradient method and the path algorithm used in HHSVM, an approximation of the ENSVM problem (Wang et al., 2008). We also illustrate the advantage of ENSVM in both variable selection and prediction accuracy using simulated and real-world data.

6 ACKNOWLEDGEMENT

The research is supported by a grant from the University of California, Irvine.

References

U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A.J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA, 96:6745-6750, 1999.
D.P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, 1982.

J.-F. Cai, S. Osher, and Z. Shen. Split Bregman methods and frame based image restoration. Multiscale Model. Simul., 8(2), 2009.

E.J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? arXiv preprint, 2009.

J. Eckstein and D.P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1):293-318, 1992.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Statist., 32(2):407-499, 2004. With discussion, and a rejoinder by the authors.
D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers and Mathematics with Applications, 2(1):17-40, 1976.

R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. Rev. Franc. Automat. Inform. Rech. Operat., 1975.

T. Goldstein and S. Osher. The split Bregman method for L1-regularized problems. SIAM J. Imaging Sci., 2(2):323-343, 2009.

I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389-422, 2002.

T. Hastie, R. Tibshirani, M.B. Eisen, A. Alizadeh, R. Levy, L. Staudt, W.C. Chan, D. Botstein, and P. Brown. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology, 1(2):research0003, 2000.

T. Hastie, R. Tibshirani, D. Botstein, and P. Brown. Supervised harvesting of expression trees. Genome Biology, 2(1), 2001.

Y. Lin and H.H. Zhang. Component selection and smoothing in multivariate nonparametric regression. Ann. Statist., 34(5):2272-2297, 2006.

S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J. Mesirov, and T. Poggio. Support vector machine classification of microarray data. CBCL Paper 182, 1999.

R.T. Rockafellar. A dual approach to solving nonlinear programming problems by unconstrained optimization. Math. Programming, 5:354-373, 1973.

Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2003.

M.R. Segal, K.D. Dahlquist, and B.R. Conklin. Regression approaches for microarray data analysis. J. Comput. Biol., 2003.

R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267-288, 1996.

V.N. Vapnik. Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons, New York, 1998.

L. Wang, J. Zhu, and H. Zou. The doubly regularized support vector machine. Statistica Sinica, 16(2):589-615, 2006.

L. Wang, J. Zhu, and H. Zou. Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics, 24(3):412-419, 2008.

C. Wu and X.C. Tai. Augmented Lagrangian method, dual methods, and split Bregman iteration for ROF, vectorial TV, and high order models. SIAM Journal on Imaging Sciences, 3(3):300-339, 2010.

G.-B. Ye and X. Xie. Split Bregman method for large scale fused Lasso. arXiv preprint, 2010.

J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference, 2004.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol., 67(2):301-320, 2005.