Multiclass sparse logistic regression for classification of multiple cancer types using gene expression data


Computational Statistics & Data Analysis 51 (2006)

Yongdai Kim a,*, Sunghoon Kwon a, Seuck Heun Song b
a Seoul National University, Korea
b Korea University, Korea

Received 22 March 2006; received in revised form 23 May 2006; accepted 5 June 2006
Available online 30 June 2006

* Corresponding author. E-mail address: ydkim@stats.snu.ac.kr (Y. Kim). doi:10.1016/j.csda

Abstract

Monitoring gene expression profiles is a novel approach to cancer diagnosis. Several studies have shown that sparse logistic regression is a useful classification method for gene expression data. Not only does it give a sparse solution with high accuracy, it also provides the user with explicit probabilities of classification apart from the class information. However, its optimal extension to more than two classes is not obvious. In this paper, we propose a multiclass extension of sparse logistic regression. Analysis of five publicly available gene expression data sets shows that the proposed method outperforms the standard multinomial logistic model in prediction accuracy as well as gene selectivity.

2006 Elsevier B.V. All rights reserved.

Keywords: Classification; Gene expression data; Multinomial logit model; One-against-all; Sparse logistic regression

1. Introduction

Constructing a classification rule for tissue samples based on gene expression profiles has received much attention recently due to emerging microarray technology. A new challenge is that the number of genes (i.e. the dimension of inputs) is much larger than the number of tissue samples, in which case standard classification methods either are not applicable or perform badly. Also, identifying a small subset of informative genes, called marker genes, which discriminate types of tumors or tumor versus normal tissues, has become an important subject. Hence, good learning algorithms for gene expression data should provide a classification rule which not only yields high accuracy but also has the ability to identify marker genes. In related literature, Guyon et al. (2002) proposed a recursive feature elimination technique with support vector machines, Li et al. (2002) introduced two Bayesian approaches with the technique of automatic relevance determination, and Shevade and Keerthi (2003) and Roth (2002) applied sparse logistic regression, to name just a few.

Among these tools, sparse logistic regression is a useful classification method for gene expression data. It gives a sparse solution with high accuracy and also provides the user with explicit probabilities of classification apart from the class information. However, its optimal extension to more than two classes is not obvious. A standard multiclass extension of sparse logistic regression might be sparse multinomial logistic (SML) regression (Krishnapuram et al., 2004), which is a sparse version of the multinomial logit model, a popular multiclass formulation in statistics (see, for example, Agresti, 1990). SML, however, has a problem in gene selection: the estimates of the regression coefficients depend on the choice of the baseline class (see Section 2 for the definition), and so do the selected genes. Hence, some important genes may be dropped from the final model, which in turn degrades the prediction accuracy. Empirical results in Section 4 confirm this observation.

In this paper, we propose a new multiclass extension of sparse logistic regression called sparse one-against-all logistic (SOVAL) regression, whose main idea is to reduce a multiclass problem to multiple binary problems and to construct a classifier using the reduced binary problems simultaneously. By analyzing five real gene expression data sets, we show that SOVAL outperforms SML in prediction accuracy as well as gene selectivity.

The paper is organized as follows. In Section 2, SOVAL as well as SML are presented. A computational algorithm based on the gradient LASSO algorithm of Kim et al. (2005) is given in Section 3. Results of numerical experiments are presented in Section 4 and concluding remarks follow in Section 5.

2. Models

Let $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ be input-output pairs of a given data set, where $x_i \in R^p$ is the vector of gene expression levels and $y_i \in \{1, 2, \ldots, J\}$ is the type of cancer of the $i$th tissue sample. Here, $n$ is the number of tissues, $p$ the number of genes and $J$ the number of classes (i.e. tumor types).
We first present SML and then propose SOVAL.

2.1. SML regression

SML starts with the multinomial logit model

$$\Pr(y_i = j \mid x_i) = \frac{\exp(f_j(x_i))}{\sum_{m=1}^{J} \exp(f_m(x_i))}$$

for $j = 1, \ldots, J$, where $f_j(x_i) = \beta_0^{(j)} + \beta_1^{(j)} x_{i1} + \cdots + \beta_p^{(j)} x_{ip}$. For identifiability, we let $\beta_k^{(J)} = 0$ for $k = 0, 1, \ldots, p$. Let $\beta_0 = (\beta_0^{(1)}, \ldots, \beta_0^{(J-1)})$, $\beta_j = (\beta_1^{(j)}, \ldots, \beta_p^{(j)})$ and $\beta = (\beta_1, \ldots, \beta_{J-1})$. For the sparse model, we estimate $\beta_0$ and $\beta$ by maximizing the log-likelihood

$$L_1(\beta_0, \beta) = \sum_{i=1}^{n} \left( \sum_{j=1}^{J} I(y_i = j) f_j(x_i) - \log\left( \sum_{m=1}^{J} \exp(f_m(x_i)) \right) \right) \qquad (1)$$

with the constraint $\sum_{j=1}^{J-1} \sum_{k=1}^{p} |\beta_k^{(j)}| \le \lambda$. Here, $\lambda > 0$ is a regularization parameter, which should be selected in advance using cross validation or any other method.

Once the regression coefficients $\beta_0$ and $\beta$ are estimated, the classifier is constructed as follows. Let $c(i \mid j)$ be the cost of classifying an observation to the $i$th class when the true class is $j$. Then, a new tissue sample with gene expression $x$ is classified into class $C(x)$, where

$$C(x) = \arg\min_{i} \sum_{j=1}^{J} c(i \mid j) \Pr(y = j \mid x).$$

If the misclassification costs $c(i \mid j)$, $i \ne j$, are all equal, which is the most frequent case in practice, $C(x)$ becomes $\arg\max_j \Pr(y = j \mid x)$.
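The model, the constrained log-likelihood (1) and the cost-based classifier of Section 2.1 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the 0-based labels (with the last label playing the role of the baseline class $J$) and the array layout are assumptions made here for the example.

```python
import numpy as np

def sml_probabilities(X, beta0, beta):
    """Class probabilities of the multinomial logit model.

    X: (n, p) expression matrix; beta0: (J-1,) intercepts;
    beta: (J-1, p) coefficients. The baseline class (last column)
    has all coefficients fixed at zero for identifiability.
    """
    n = X.shape[0]
    f = np.hstack([X @ beta.T + beta0, np.zeros((n, 1))])  # f_J(x) = 0
    f -= f.max(axis=1, keepdims=True)  # stabilizes exp without changing ratios
    expf = np.exp(f)
    return expf / expf.sum(axis=1, keepdims=True)

def sml_log_likelihood(X, y, beta0, beta):
    """L_1 of Eq. (1): sum over samples of log Pr(y_i | x_i).

    y holds integer labels 0..J-1 (assumed layout); J-1 is the baseline.
    """
    probs = sml_probabilities(X, beta0, beta)
    return float(np.log(probs[np.arange(len(y)), y]).sum())

def satisfies_constraint(beta, lam):
    """Check the L1 constraint sum_j sum_k |beta_k^(j)| <= lambda."""
    return bool(np.abs(beta).sum() <= lam)

def classify(probs, cost=None):
    """Minimum expected cost rule; cost[i, j] = cost of predicting i when truth is j."""
    if cost is None:
        return probs.argmax(axis=1)      # equal costs: most probable class
    expected = probs @ cost.T            # expected[n, i] = sum_j cost[i, j] * probs[n, j]
    return expected.argmin(axis=1)
```

With a 0-1 cost matrix (`1 - np.eye(J)`), `classify` reduces to the arg max rule, which matches the equal-cost case described above.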

The importance of the $k$th gene for classification of tumor types is measured by $\rho_k$, where

$$\rho_k = \sum_{j=1}^{J-1} |\beta_k^{(j)}|.$$

The larger $\rho_k$ is, the more important the $k$th gene is for classifying the tumor type, and so genes with sufficiently large $\rho_k$ can be considered as marker genes. Using $\rho_k$, we can reformulate SML as

$$f_j(x_i) = \theta_0^{(j)} + \rho_1 \theta_1^{(j)} x_{i1} + \cdots + \rho_p \theta_p^{(j)} x_{ip}$$

with $\sum_{j=1}^{J-1} |\theta_k^{(j)}| = 1$ and $\rho_k \ge 0$ for $k = 1, \ldots, p$, and $\sum_{k=1}^{p} \rho_k \le \lambda$. Hence, SML can be considered as a garrote type estimate (Breiman, 1995) for $\rho_k$, and so we expect that the solution of $\rho_k$ is sparse.

In SML, we set $\beta_k^{(J)} = 0$ for $k = 1, \ldots, p$ for identifiability of the model, and the regression coefficient $\beta_k^{(j)}$, $j \ne J$, can be interpreted as the log odds ratio of the $j$th group versus the $J$th group for the $k$th gene. In this sense, we call the $J$th class the baseline class. This convention has the problem that the estimates depend on the choice of the baseline class. As an example, consider the following simple situation. Let $p = 1$, $J = 3$ and $\lambda = 1$. Suppose $x_1$ is binary (i.e. $x_1 \in \{0, 1\}$). Let $\mathrm{Odd}(k, j)$ be the odds ratio of the $k$th group versus the $j$th group. That is,

$$\mathrm{Odd}(k, j) = \frac{\sum_{i=1}^{n} I(y_i = k, x_{i1} = 1) \, \sum_{i=1}^{n} I(y_i = j, x_{i1} = 0)}{\sum_{i=1}^{n} I(y_i = k, x_{i1} = 0) \, \sum_{i=1}^{n} I(y_i = j, x_{i1} = 1)}.$$

Suppose $\log \mathrm{Odd}(1, 3) = 0.5$ and $\log \mathrm{Odd}(2, 3) = -0.5$. Then, the estimates of the regression coefficients from SML become $\beta_1^{(1)} = 0.5$ and $\beta_1^{(2)} = -0.5$ if we choose the third class as the baseline class. Now, suppose we change the baseline class to the second class. Then, since $\log \mathrm{Odd}(1, 2) = 1.0$ and $\log \mathrm{Odd}(3, 2) = 0.5$, in order for the class probabilities to remain the same, the estimates of $\beta_1^{(1)}$ and $\beta_1^{(3)}$ have to be 1.0 and 0.5, respectively, which is impossible since it violates the constraint (i.e. $|\beta_1^{(1)}| + |\beta_1^{(3)}| > 1$). Hence, there is a danger that some important genes may be dropped from the final model due to the choice of the baseline class, which results in poor prediction accuracy. Empirical results in Section 4 confirm this observation.
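The arithmetic of the baseline-class example can be checked directly. The short sketch below takes log Odd(1, 3) = 0.5 and log Odd(2, 3) = -0.5, the signs that reproduce log Odd(1, 2) = 1.0 as stated in the example, and compares the L1 norm of the coefficients under the two baseline choices with $\lambda = 1$. The variable names are illustrative only.

```python
import numpy as np

lam = 1.0  # regularization parameter from the example

# Log odds ratios relative to class 3 (values of the worked example).
log_odd_1_vs_3 = 0.5
log_odd_2_vs_3 = -0.5

# Baseline = class 3: coefficients are the log odds ratios versus class 3.
beta_base3 = np.array([log_odd_1_vs_3, log_odd_2_vs_3])

# Baseline = class 2: log Odd(1, 2) = log Odd(1, 3) - log Odd(2, 3),
# and log Odd(3, 2) = -log Odd(2, 3).
beta_base2 = np.array([log_odd_1_vs_3 - log_odd_2_vs_3, -log_odd_2_vs_3])

l1_base3 = np.abs(beta_base3).sum()  # 1.0, on the boundary of the constraint
l1_base2 = np.abs(beta_base2).sum()  # 1.5, outside the constraint

print(l1_base3 <= lam, l1_base2 <= lam)
```

The same class probabilities therefore need an L1 budget of 1.0 under one baseline and 1.5 under the other, which is why the constrained solution changes with the baseline.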
Instead of choosing the baseline class, there are other ways to resolve the identification problem. An example is to let

$$\sum_{j=1}^{J} \beta_k^{(j)} = 0 \qquad (2)$$

for all $k$. This constraint, however, makes the computation harder. A main technical difficulty of sparse logistic regression is that the computation is relatively demanding. This is mainly because the objective function to be optimized is not differentiable due to the $L_1$ constraint, and hence special optimization techniques are required. To the authors' knowledge, there is no special optimization algorithm for sparse logistic regression which can deal with the constraint (2), in particular for a large number of genes.

2.2. Sparse one-against-all logistic regression

For given $y_i$, the standard one-against-all (OVA) approach makes $J$ binary outputs $y_i^{(1)}, \ldots, y_i^{(J)}$ via $y_i^{(j)} = I(y_i = j)$, and assumes

$$\Pr(y_i^{(j)} = 1 \mid x_i) = \frac{\exp(f_j(x_i))}{1 + \exp(f_j(x_i))}$$

for $j = 1, \ldots, J$, where $f_j(x) = \beta_0^{(j)} + \beta_1^{(j)} x_1 + \cdots + \beta_p^{(j)} x_p$. Let $\beta_0 = (\beta_0^{(1)}, \ldots, \beta_0^{(J)})$, $\beta_j = (\beta_1^{(j)}, \ldots, \beta_p^{(j)})$ and $\beta = (\beta_1, \ldots, \beta_J)$. Then, it estimates $\beta_0$ and $\beta$ by estimating $\beta_0^{(j)}$ and $\beta_j$ for $j = 1, \ldots, J$ via maximizing the log-likelihood of $y^{(j)}$ given by

$$\sum_{i=1}^{n} \left[ y_i^{(j)} f_j(x_i) - \log\left( \exp(f_j(x_i)) + 1 \right) \right]$$

subject to $\sum_{k=1}^{p} |\beta_k^{(j)}| \le \lambda_j$. There are multiple regularization parameters $\lambda_1, \ldots, \lambda_J$, which should be selected simultaneously in advance using cross validation or any other method. Note that selecting multiple regularization parameters is computationally very hard, since the computational complexity grows exponentially with the number of regularization parameters.

To resolve this problem, SOVAL estimates $\beta_0$ and $\beta$ by maximizing the following (pseudo) log-likelihood

$$L_2(\beta_0, \beta) = \sum_{i=1}^{n} \sum_{j=1}^{J} \left[ y_i^{(j)} f_j(x_i) - \log\left( \exp(f_j(x_i)) + 1 \right) \right] \qquad (3)$$

subject to $\sum_{k=1}^{p} \sum_{j=1}^{J} |\beta_k^{(j)}| \le \lambda$. Note that there is a single regularization parameter $\lambda$. Moreover, SOVAL is as flexible as the standard OVA approach, in the sense that if the optimal model is constructed using the standard OVA approach with the regularization parameters $\lambda_1, \ldots, \lambda_J$, the same model can be constructed using SOVAL with the regularization parameter $\lambda = \sum_{j=1}^{J} \lambda_j$.

Once the regression coefficients are estimated, the class probabilities are estimated by

$$\Pr(y = j \mid x) = \frac{1}{C(x)} \Pr(y^{(j)} = 1 \mid x),$$

where $C(x) = \sum_{m=1}^{J} \Pr(y^{(m)} = 1 \mid x)$. The corresponding classifier can be constructed similarly to the SML case. Also, the gene importance measure is defined similarly (that is, $\rho_k = \sum_{j=1}^{J} |\beta_k^{(j)}|$).

3. A computational algorithm

We first present a general version of the gradient LASSO algorithm developed by Kim et al. (2005), and explain how to modify it for SOVAL as well as SML. Let $z \in R^q$ and let $L(z)$ be a convex function defined on $R^q$. The objective of the gradient LASSO is to find the minimizer of $L(z)$ over $z \in D$, where $D$ is the subset of $R^q$ defined by

$$D = \left\{ z \in R^q : \sum_{k=1}^{q} |z_k| \le 1 \right\}.$$
Let $e_k$ be the vector in $R^q$ with the $k$th component equal to 1 and the others equal to 0. Fig. 1 shows the gradient LASSO algorithm for this problem.

Fig. 1. Gradient LASSO algorithm.

The hardest parts of the gradient LASSO are the steps in (a) and (b) of Fig. 1 that obtain $\hat{\alpha}$ and $\hat{\delta}$, but they can be done using standard optimization techniques such as the Newton-Raphson algorithm. That is, the gradient LASSO algorithm does not require any special non-linear optimization algorithms. Also, Kim et al. (2005) proved that, under some regularity conditions, the convergence rate of the gradient LASSO is $1/m$, where $m$ is the number of iterations. A surprising result is that this convergence rate does not depend on the dimension of inputs, which is very large for gene expression data. This feature makes the gradient LASSO algorithm well suited for analyzing gene expression data.

In SOVAL as well as SML, the intercept term $\beta_0$ is not constrained, and hence the gradient LASSO algorithm cannot be applied directly. For this, we propose to estimate the intercept term $\beta_0$ by letting $\beta = 0$ and maximizing the log-likelihood functions $L_1$ and $L_2$ with respect to $\beta_0$ only. For SML, $\beta_0$ becomes

$$\beta_0^{(j)} = \log \frac{\bar{y}^{(j)}}{\bar{y}^{(J)}}$$

for $j = 1, \ldots, J - 1$, where $\bar{y}^{(j)} = \sum_{i=1}^{n} I(y_i = j) / n$. Similarly, for SOVAL, we have

$$\beta_0^{(j)} = \log \frac{\bar{y}^{(j)}}{1 - \bar{y}^{(j)}}$$

for $j = 1, \ldots, J$. The gradient LASSO algorithm can be modified for the two multiclass sparse logistic regressions by letting $z = \beta / \lambda$ and replacing $L$ by the negative of either $L_1$ or $L_2$, so that maximizing the log-likelihood becomes minimizing a convex function.

Remark. The gradient LASSO algorithm presented here is a simpler version of the original gradient LASSO algorithm of Kim et al. (2005). In fact, using a more complicated version of the gradient LASSO algorithm, we can estimate $\beta_0$ and $\beta$ simultaneously. However, the algorithm for this is much more involved, and the results from estimating $\beta_0$ and $\beta$ sequentially, as is done here, are not much different from those that result from estimating $\beta_0$ and $\beta$ simultaneously.
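To make the ingredients of this section concrete, the following is a rough sketch of how SOVAL could be fitted over the $L_1$ ball. It is not the algorithm of Fig. 1, whose steps are not reproduced in this text: it swaps in a plain conditional-gradient (Frank-Wolfe) update with the fixed step size $2/(m+2)$ instead of the line searches for $\hat{\alpha}$ and $\hat{\delta}$, and it has no deletion step. What it shares with the description above is the feasible set (vertices $\pm \lambda e_k$), the intercepts fixed at $\log(\bar{y}^{(j)} / (1 - \bar{y}^{(j)}))$, and the objective $-L_2$ from Eq. (3). All function names and the 0-based labels are assumptions of this example.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def soval_intercepts(Y):
    """Intercepts maximizing L_2 when beta = 0: log(ybar / (1 - ybar)).

    Assumes every class occurs in the data and no class covers all samples.
    """
    ybar = Y.mean(axis=0)
    return np.log(ybar / (1.0 - ybar))

def neg_pseudo_loglik(X, Y, beta0, beta):
    """-L_2 from Eq. (3); Y is the (n, J) matrix of indicators I(y_i = j)."""
    F = X @ beta + beta0                               # F[i, j] = f_j(x_i)
    return float(np.sum(np.logaddexp(0.0, F) - Y * F))

def fit_soval(X, y, n_classes, lam, n_iter=200):
    """Conditional-gradient sketch: minimize -L_2 subject to sum |beta| <= lam."""
    Y = np.eye(n_classes)[y]                           # one-against-all outputs
    beta0 = soval_intercepts(Y)
    beta = np.zeros((X.shape[1], n_classes))           # z = 0 lies inside D
    for m in range(n_iter):
        grad = X.T @ (sigmoid(X @ beta + beta0) - Y)   # gradient of -L_2 in beta
        k, j = np.unravel_index(np.abs(grad).argmax(), grad.shape)
        vertex = np.zeros_like(beta)                   # best vertex, +/- lam * e_k
        vertex[k, j] = -lam * np.sign(grad[k, j])
        alpha = 2.0 / (m + 2.0)                        # fixed schedule, no line search
        beta = (1.0 - alpha) * beta + alpha * vertex   # convex combination stays feasible
    return beta0, beta

def soval_probabilities(X, beta0, beta):
    """Binary probabilities renormalized to sum to one over the classes."""
    P = sigmoid(X @ beta + beta0)
    return P / P.sum(axis=1, keepdims=True)

def gene_importance(beta):
    """rho_k = sum_j |beta_k^(j)| for each gene k."""
    return np.abs(beta).sum(axis=1)
```

Because every iterate is a convex combination of points in the $L_1$ ball of radius $\lambda$, the constraint holds throughout without a projection step; the price of the fixed step size is that the objective need not decrease monotonically from one iteration to the next.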

4. Numerical experiments

We compare the two multiclass extensions of sparse logistic regression on five publicly available data sets.

4.1. Data description

Leukemia: This data set is the gene expression data from leukemia patients used in Golub et al. (1999). It comes from a study of gene expression in two types of acute leukemias, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). There are two key subclasses of ALL, those arising from T-cells and those arising from B-cells. The data set is composed of a training set of 38 samples classified as ALL T-cell, ALL B-cell or AML, and an independent test set of 34 samples. The training set contains 8 ALL T-cell, 19 ALL B-cell and 11 AML samples. The independent test set consists of 1 ALL T-cell, 19 ALL B-cell and 14 AML samples. Each sample contains 7129 gene expression values obtained from Affymetrix oligonucleotide microarrays. In this paper, we combine the training and test samples and analyze them together. This data set is available for download.

Lymphoma: This data set is publicly available and contains gene expression levels of the 3 most prevalent adult lymphoid malignancies: 42 samples of diffuse large B-cell lymphoma (DLBCL, class 0), 9 observations of follicular lymphoma (FL, class 1), and 11 cases of chronic lymphocytic leukemia (CLL, class 2). The total sample size is n = 62, and the expression levels of p = 4026 well-measured genes, preferentially expressed in lymphoid cells or with known immunological or oncological importance, are documented. More information on these data can be found in Alizadeh et al. (2000). We imputed missing values and standardized the data as described in Dudoit et al. (2002).

Small, round blue-cell tumors: This data set on the small, round blue-cell tumors (SRBCTs) of childhood includes 63 samples classified as neuroblastoma, rhabdomyosarcoma, non-Hodgkin lymphoma and the Ewing family of tumors. The gene expression data from the cDNA microarray experiment contain 6567 genes.
For data preprocessing, we followed the protocol detailed in the supplementary information to Khan et al. (2001). This data set is available for download.

Brain cancer: This data set, presented in Pomeroy et al. (2002), contains n = 42 microarray gene expression profiles from five different tumors of the central nervous system, that is, 10 medulloblastomas, 10 malignant gliomas, 10 atypical teratoid/rhabdoid tumors (AT/RTs), 8 primitive neuro-ectodermal tumors (PNETs) and 4 human cerebella. The raw data were generated using the Affymetrix technology and are publicly available. For data preprocessing, we followed the protocol described in the supplementary information to Pomeroy et al. (2002). After thresholding, filtering, applying a logarithmic transformation and standardizing each expression profile to zero mean and unit variance, a data set comprising p = 5597 genes remained.

NCI60: NCI60 is a data set of gene expression profiles of 60 National Cancer Institute (NCI) cell lines. These 60 human tumor cell lines are derived from patients with leukemia, melanoma, lung, colon, central nervous system, ovarian, renal, breast and prostate cancers. The data set comprises gene expression levels of p = 7129 genes for n = 60 human tumor cell lines, which can be divided into 8 classes: eight breast, six CNS, seven colon, six leukemia, eight melanoma, nine non-small-cell lung carcinoma, six ovarian and eight renal tumors. A more detailed description of the data can be found in Staunton et al. (2001). This data set is available for download.

4.2. Prediction accuracy

We evaluated the prediction accuracy of the two sparse multiclass logistic regression models using random partitions. That is, we divided each data set at random so that 70% of the data became training samples and the other 30% test samples. We repeated this procedure and report the averaged misclassification errors. For selecting $\lambda$, we used five-fold cross validation. Following Guyon et al. (2002), we applied a number of preprocessing steps: taking the logarithm of all values, normalizing sample vectors, normalizing feature vectors, and passing the results through a squashing function of the type $f(x) = c \arctan(x / c)$ to diminish the importance of outliers.

Along with the prediction errors, we investigated the effect of prescreening of genes on the prediction accuracy. One of the standard approaches for analyzing gene expression data is to pick out relevant genes using simple prescreening measures to reduce computational costs as well as to improve prediction accuracy (see, for example, Golub et al., 1999; Dudoit et al., 2002). Since multiclass problems are of concern in this paper, we used, for each gene, the F-ratio of the between-class sum of squares to the within-class sum of squares, following Dudoit et al. (2002). For gene $l$, the F-ratio is defined as

$$\frac{\mathrm{BSS}(l)}{\mathrm{WSS}(l)} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{J} I(y_i = j) \left( \bar{x}_l^{(j)} - \bar{x}_l \right)^2}{\sum_{i=1}^{n} \sum_{j=1}^{J} I(y_i = j) \left( x_{il} - \bar{x}_l^{(j)} \right)^2},$$

where $\bar{x}_l^{(j)}$ indicates the average expression level of gene $l$ for class $j$ samples, and $\bar{x}_l$ is the overall mean expression level of gene $l$ in the training set. We use this F-ratio for its simplicity; other variants of the F-ratio exist.

Table 1. Average test errors of SML and SOVAL, by data set (number of classes in parentheses) and number of covariates p retained by prescreening, up to the full gene set: Leukemia (3), Lymphoma (3), Small, round blue-cell (4), Brain (5), NCI60 (8).

Table 1 and Fig. 2 report the test errors with different gene subset sizes obtained by prescreening with the F-ratio. First, they show that SOVAL is more accurate than SML in most cases. In some cases, the improvements are larger than 5%. Second, we can see from Tables 1 and 2 that prescreening affects the accuracy significantly. For most data sets the optimum test errors are achieved at one of two prescreened subset sizes, and for the small, round blue-cell data set at a different subset size. From this finding, we may conclude that the purpose of prescreening is not to select relevant genes but to eliminate irrelevant genes. This result contrasts somewhat with that of Dudoit et al. (2002), where finding small numbers of relevant genes by prescreening affects prediction accuracy significantly in some cases. A reason for this difference would be that we use sparse methods while Dudoit et al. (2002) do not. For non-sparse methods, the classifier depends on all genes used as inputs, and so prescreening would be important.
However, sparse methods automatically select genes while they construct a classifier, and so prescreening is not necessary. Moreover, prescreening may drop some informative genes at an early stage, so that the resulting model would be suboptimal. In this view, for sparse methods, efficient computational algorithms for dealing with high-dimensional inputs without prescreening are necessary, and our algorithm is such an algorithm.

Fig. 2. Average test errors against the number of covariates, with one panel per data set (Leukemia, Lymphoma, Small, round blue-cell, Brain, NCI60).

4.3. Performance of gene selection

Table 2 presents the average number of genes selected by the two sparse methods. It shows that SML tends to yield sparser models than SOVAL, in particular when the number of classes is large. Together with the error rates in Table 1, we can conclude that SML fails to detect some important genes, which results in higher error rates.

To confirm this conclusion, we did the following experiment. The effectiveness of gene identification was tested on miniature data sets synthesized from the original data. The miniature data sets were constructed as follows. First, using the F-ratio as a measure of marginal association between each gene and the tumor type, we ranked the genes and selected the top-ranked genes as variables truly associated with the class. As irrelevant variables, we included bottom-ranked genes, in a ratio of 2 informative genes to 8 irrelevant genes, with the class labels corresponding to each covariate vector of the irrelevant genes randomly mixed together, so that they were genuinely unrelated to the class while the potential correlations between those genes were intact. Ten replicates of synthetic training data were obtained by 10-fold cross validation from these miniature data sets, keeping the class proportions in each sample the same as those in the original data. See Lin (2005) and Jung and Jang (2006) for similar experiments. We applied the two sparse multiclass logistic regression models to these 10 replicates, and the optimal regularization parameters were selected with the 10 test data sets constructed from the 10-fold cross validation. Fig. 3 shows boxplots of the number of selected genes and of the number of selected genes among the informative genes, over the 10 replicates of the miniature data sets. It shows that SOVAL includes more informative genes than SML when the number of classes is large.
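The F-ratio used for prescreening in Section 4.2 and for ranking genes in the synthetic experiment can be computed as in the sketch below, which also shows one way the miniature data sets might be assembled. The function names and the parameters `n_keep`, `n_top` and `n_bottom` are hypothetical and left to the caller; shuffling whole rows of the irrelevant block is an assumption about how "randomly mixed together" could be realized while keeping the correlations between those genes intact.

```python
import numpy as np

def f_ratio(X, y):
    """BSS(l) / WSS(l) for every gene l. X: (n, p); y: (n,) integer labels.

    A gene that is constant within every class has WSS = 0 and yields
    inf or nan; such genes should be filtered beforehand.
    """
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for j in np.unique(y):
        Xj = X[y == j]
        mean_j = Xj.mean(axis=0)
        bss += Xj.shape[0] * (mean_j - overall) ** 2   # between-class part
        wss += ((Xj - mean_j) ** 2).sum(axis=0)        # within-class part
    return bss / wss

def prescreen(X, y, n_keep):
    """Indices of the n_keep genes with the largest F-ratio."""
    order = np.argsort(-f_ratio(X, y), kind="stable")
    return order[:n_keep]

def miniature_data(X, y, n_top, n_bottom, rng):
    """Top-ranked genes plus bottom-ranked genes decoupled from the labels.

    Rows of the bottom block are permuted jointly, which breaks the link
    to y but preserves correlations among those genes. Requires n_bottom >= 1.
    """
    order = np.argsort(-f_ratio(X, y), kind="stable")
    top, bottom = order[:n_top], order[-n_bottom:]
    shuffled_rows = rng.permutation(X.shape[0])
    return np.hstack([X[:, top], X[shuffled_rows][:, bottom]])
```

A call such as `miniature_data(X, y, n_top, n_bottom, np.random.default_rng(0))` returns an array with `n_top + n_bottom` columns whose first `n_top` columns are the informative genes in rank order.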

Table 2. The averaged numbers of genes selected by SML and SOVAL, by data set (number of classes in parentheses) and number of covariates p, up to the full gene set: Leukemia (3), Lymphoma (3), Small, round blue-cell (4), Brain (5), NCI60 (8).

Fig. 3. The boxplots of: (a) the total number of genes selected and (b) the number of genes selected among the top informative genes, for the Leukemia, Lymphoma, Small, round blue-cell, Brain and NCI60 data sets.

Fig. 4. The plots of the importance versus gene rank by F-ratio, with one panel per data set (Leukemia, Lymphoma, Small, round blue-cell, Brain, NCI60); x-axis: gene rank by F-ratio, y-axis: importance of gene.

Fig. 5. The boxplots of the expression levels, according to the class labels, of the two genes having the highest F-ratio and the highest importance in the NCI60 data set.

Finally, we compared the genes selected by SOVAL with the genes selected by the marginal F-ratio. Fig. 4 shows plots where the x-axis displays the gene ranks obtained by the marginal F-ratio and the y-axis the gene importance measured by SOVAL. The results are striking, in particular when the number of classes is large (i.e. in the Brain and NCI60 data sets). There are many genes that rank low by the marginal F-ratio but have large importance. To understand why this happens, we selected two genes from the NCI60 data set, the one which has the largest F-ratio and the one which has the largest importance. The rank by F-ratio of the gene with the highest importance is 132, and the importance of the gene with the highest F-ratio is 0. That is, these two genes have markedly different F-ratio and gene importance values. Fig. 5 presents the boxplots of the gene expression levels of these two genes according to the class labels. First of all, the distributions of the expression levels of the two genes are similar. They have large positive expression levels in the seventh class and negative expression levels in the other classes. An exception is the third class, where the gene with the highest F-ratio has expression levels around 0 while the gene with the highest importance has negative expression levels. This difference partially explains why the ranks from the F-ratio and from gene importance are quite different. The F-ratio measures the variation of the mean expression levels across the classes, and so the gene with the highest F-ratio has additional variation due to the third class compared to the gene with the highest importance.
In contrast, SOVAL basically measures the difference of the means between one class and the other classes. For the seventh class, this difference is larger for the gene with the highest importance than for the gene with the highest F-ratio. So, we conclude that if we want to detect genes which affect all the classes, the F-ratio would be more appropriate. However, if we want to detect genes which affect a certain class, sparse logistic regression would be more appealing.

5. Concluding remarks

In this paper, we proposed a multiclass extension of sparse logistic regression, called SOVAL, compared it with SML, and developed an efficient computational algorithm suitable for gene expression data. The numerical experiments showed that SOVAL outperforms SML in many aspects. The former: (i) gives better accuracy; (ii) has higher power for detecting important genes; and (iii) does not require the choice of a baseline class.

The main idea of SOVAL is somewhat related to Scott's method of estimating a mixture model (Scott, 2001, 2004). Scott's method relaxes a constraint on the density function and focuses on a particular component rather than all components. SOVAL also relaxes a constraint, namely that the sum of the probabilities of the classes is 1, and implicitly finds genes important for a specific class rather than for all classes. This similarity would partially explain the good prediction performance of SOVAL. We leave this conjecture for future work.

We have seen that the genes selected by SOVAL are quite different from those selected by the marginal F-ratio. This is partly because SOVAL measures the classification power of genes for a specific class, while the marginal F-ratio measures the overall effect of genes on all classes. Hence, if one wants to detect genes which affect a specific class, SOVAL is more suitable. In this view, SOVAL can be considered as a new way of detecting relevant genes and can be used as a preprocessing procedure for more complicated non-linear classification methods such as the support vector machine or boosting.
For this purpose, however, efficient computational algorithms are required, since we must work with large numbers of genes without prescreening, and the algorithm proposed in this paper can serve this purpose.

Acknowledgments

The first and second authors were supported in part by KOSEF through the Statistical Research Center for Complex Systems at Seoul National University. The third author was supported in part by KOSEF (R ).

References

Agresti, A., 1990. Categorical Data Analysis. Wiley, New York.

Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I., Rosenwald, A., Boldrick, J., Sabet, H., Tran, T., Yu, X., et al., 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403.

Breiman, L., 1995. Better subset regression using the nonnegative garrote. Technometrics 37 (4).

Dudoit, S., Fridlyand, J., Speed, T., 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc. 97.

Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C., Lander, E., 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286.

Guyon, I., Weston, J., Barnhill, S., Vapnik, V., 2002. Gene selection for cancer classification using support vector machines. Mach. Learn. 46.

Jung, S.H., Jang, W., 2006. How accurately can we control the FDR in analyzing microarray data? Bioinformatics, to appear. oxfordjournals.org/cgi/reprint/btl161?

Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Atonescu, C., Peterson, C., Meltzer, P., 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Med. 7.

Kim, J., Kim, Y., Kim, Y., 2005. A gradient descent algorithm for generalized LASSO. Technical Report, Department of Statistics, Seoul National University, Korea.

Krishnapuram, B., Carlin, L., Figueiredo, M., Hartemink, A., 2004. Learning sparse classifier: multi-class formulation, fast algorithms and generalization bounds. IEEE Trans. Pattern Anal. Mach. Intell. 27.

Li, Y., Campbell, C., Tipping, M., 2002. Bayesian automatic relevance determination algorithms for classifying gene expression data. Bioinformatics 18.

Lin, D.Y., 2005. An efficient Monte Carlo approach to assessing statistical significance in genomic studies. Bioinformatics 43.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., et al., 2002. Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature 415.

Roth, V., 2002. The generalized LASSO: a wrapper approach to gene selection for microarray data. Technical Report, University of Bonn, Computer Science III.

Scott, D.W., 2001. Parametric statistical modeling by minimum integrated square error. Technometrics 43.

Scott, D.W., 2004. Partial mixture estimation and outlier detection in data and regression. In: Theory and Applications of Recent Robust Methods. Birkhäuser, Basel.

Shevade, K., Keerthi, S., 2003. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19.

Staunton, J., Slonim, D., Coller, H., Tamayo, P., Angelo, M., Park, J., Scherf, U., Lee, J., Reinhold, W., Weinstein, J., Mesirov, J., Lander, E., Golub, T., 2001. Chemosensitivity prediction by transcriptional profiling. Proc. Nat. Acad. Sci. 98 (19).