Modelling high-dimensional data by mixtures of factor analyzers


Computational Statistics & Data Analysis 41 (2003)

G.J. McLachlan, D. Peel, R.W. Bean

Department of Mathematics, University of Queensland, St. Lucia, Brisbane 4072, Australia

Corresponding author: G.J. McLachlan. E-mail address: gjm@maths.uq.edu.au

Received 1 March 2002

Abstract

We focus on mixtures of factor analyzers from the perspective of a method for model-based density estimation from high-dimensional data, and hence for the clustering of such data. This approach enables a normal mixture model to be fitted to a sample of $n$ data points of dimension $p$, where $p$ is large relative to $n$. The number of free parameters is controlled through the dimension of the latent factor space. By working in this reduced space, it allows a model for each component-covariance matrix with complexity lying between that of the isotropic and full covariance structure models. We shall illustrate the use of mixtures of factor analyzers in a practical example that considers the clustering of cell lines on the basis of gene expressions from microarray experiments. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Mixture modelling; Factor analyzers; EM algorithm

1. Introduction

Finite mixtures of distributions have provided a mathematical-based approach to the statistical modelling of a wide variety of random phenomena; see, for example, McLachlan and Peel (2000a). For multivariate data of a continuous nature, attention has focussed on the use of multivariate normal components because of their computational convenience. With the normal mixture model-based approach to density estimation and clustering, the density of the ($p$-dimensional) random variable $Y$ of interest is modelled as a mixture of a number ($g$) of multivariate normal densities in some unknown proportions $\pi_1, \ldots, \pi_g$. That is, each data point is taken to be a realization of the mixture probability density function (p.d.f.),

$$f(y; \Psi) = \sum_{i=1}^{g} \pi_i \, \phi(y; \mu_i, \Sigma_i), \tag{1}$$

where $\phi(y; \mu_i, \Sigma_i)$ denotes the $p$-variate normal density function with mean $\mu_i$ and covariance matrix $\Sigma_i$. Here the vector $\Psi$ of unknown parameters consists of the mixing proportions $\pi_i$, the elements of the component means $\mu_i$, and the distinct elements of the component-covariance matrices $\Sigma_i$.

The normal mixture model (1) can be fitted iteratively to an observed random sample $y_1, \ldots, y_n$ by maximum likelihood (ML) via the expectation-maximization (EM) algorithm of Dempster et al. (1977); see also McLachlan and Krishnan (1997). The number of components $g$ can be taken sufficiently large to provide an arbitrarily accurate estimate of the underlying density function; see, for example, Li and Barron (2000). For clustering purposes, a probabilistic clustering of the data into $g$ clusters can be obtained in terms of the fitted posterior probabilities of component membership for the data. An outright assignment of the data into $g$ clusters is achieved by assigning each data point to the component to which it has the highest estimated posterior probability of belonging.

The $g$-component normal mixture model (1) with unrestricted component-covariance matrices is a highly parameterized model with $\tfrac{1}{2}p(p+1)$ parameters for each component-covariance matrix $\Sigma_i$ ($i = 1, \ldots, g$). Banfield and Raftery (1993) introduced a parameterization of the component-covariance matrix $\Sigma_i$ based on a variant of the standard spectral decomposition of $\Sigma_i$ ($i = 1, \ldots, g$). A common approach to reducing the number of dimensions is to perform a principal component analysis (PCA). But, as is well known, projections of the feature data $y_j$ onto the first few principal axes are not always useful in portraying the group structure; see McLachlan and Peel (2000a, p. 239).
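As an illustration of the probabilistic clustering described in this section, the following minimal sketch evaluates the mixture components in (1) and the resulting posterior probabilities of component membership, then makes the outright assignment. It is not taken from the paper: the function names, the toy parameter values and the use of NumPy are assumptions made purely for the example.

```python
import numpy as np

def log_normal_pdf(Y, mean, cov):
    """Log of the p-variate normal density phi(y; mean, cov) for each row of Y."""
    p = Y.shape[1]
    diff = Y - mean
    chol = np.linalg.cholesky(cov)             # cov = chol @ chol.T
    z = np.linalg.solve(chol, diff.T)          # whitened residuals, shape (p, n)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (p * np.log(2.0 * np.pi) + log_det + np.sum(z**2, axis=0))

def posterior_probabilities(Y, pis, means, covs):
    """tau[j, i]: posterior probability that y_j belongs to component i."""
    log_w = np.column_stack([
        np.log(pi) + log_normal_pdf(Y, m, S) for pi, m, S in zip(pis, means, covs)
    ])
    log_w -= log_w.max(axis=1, keepdims=True)  # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

# Toy example: two well-separated bivariate components (illustrative values only).
rng = np.random.default_rng(0)
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
covs = [np.eye(2), np.eye(2)]
pis = [0.5, 0.5]
Y = np.vstack([rng.normal(means[0], 1.0, size=(20, 2)),
               rng.normal(means[1], 1.0, size=(20, 2))])

tau = posterior_probabilities(Y, pis, means, covs)
labels = tau.argmax(axis=1)    # outright assignment to the most probable component
```

Working on the log scale and subtracting the row maximum before exponentiating is a common safeguard against underflow when $p$ is large; it does not change the posterior probabilities.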
Chang (1983) also stressed this limitation of PCA: he showed, in the case of two groups, that the principal component of the feature vector that provides the best separation between groups in terms of Mahalanobis distance is not necessarily the first component.

Another approach for reducing the number of unknown parameters in the forms for the component-covariance matrices is to adopt the mixture of factor analyzers model, as considered in McLachlan and Peel (2000a, 2000b). This model was originally proposed by Ghahramani and Hinton (1997) and Hinton et al. (1997) for the purposes of visualizing high-dimensional data in a lower-dimensional space to explore for group structure; see also Tipping and Bishop (1997, 1999) and Bishop (1998), who considered the related model of mixtures of principal component analyzers for the same purpose. Further references may be found in McLachlan and Peel (2000a, Chapter 8).

In this paper, we investigate further the modelling of high-dimensional data through the use of mixtures of factor analyzers, focussing on computational issues not addressed in McLachlan and Peel (2000a, Chapter 8). We shall also demonstrate the usefulness of the methodology in its application to the clustering of microarray expression data, which is a very important but nonstandard problem in cluster analysis. Initial attempts on this problem used hierarchical clustering, but there is no reason why the clusters should be hierarchical for this problem. Also, a mixture model-based approach enables the clustering of microarray data to be approached on a sound mathematical basis. Indeed, as remarked by Aitkin et al. (1981), when clustering samples from a population, no cluster analysis method is a priori believable without a statistical model. For microarray data, the number of tissues $n$ is usually very small relative to the number of genes (the dimension $p$), and so the use of factor models to represent the component-covariance matrices allows the mixture model to be fitted by working in the lower-dimensional space implied by the factors.

2. Single-factor analysis model

Factor analysis is commonly used for explaining data, in particular, correlations between variables in multivariate observations. It can be used also for dimensionality reduction. In a typical factor analysis model, each observation $Y_j$ is modelled as

$$Y_j = \mu + B U_j + e_j \quad (j = 1, \ldots, n), \tag{2}$$

where $U_j$ is a $q$-dimensional ($q < p$) vector of latent or unobservable variables called factors and $B$ is a $p \times q$ matrix of factor loadings (parameters). The $U_j$ are assumed to be i.i.d. as $N(0, I_q)$, independently of the errors $e_j$, which are assumed to be i.i.d. as $N(0, D)$, where $D$ is a diagonal matrix,

$$D = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2),$$

and where $I_q$ denotes the $q \times q$ identity matrix. Thus, conditional on $U_j = u_j$, the $Y_j$ are independently distributed as $N(\mu + B u_j, D)$. Unconditionally, the $Y_j$ are i.i.d. according to a normal distribution with mean $\mu$ and covariance matrix

$$\Sigma = B B^T + D. \tag{3}$$

If $q$ is chosen sufficiently smaller than $p$, representation (3) imposes some constraints on the component-covariance matrix $\Sigma$ and thus reduces the number of free parameters to be estimated. Note that in the case of $q > 1$, there is an infinity of choices for $B$, since (3) is still satisfied if $B$ is replaced by $BC$, where $C$ is any orthogonal matrix of order $q$.
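The covariance structure (3) and its invariance under $B \to BC$ can be checked numerically. The sketch below is only an illustration under assumed toy dimensions and randomly drawn parameter values; none of the numbers come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 6, 2                                   # illustrative dimensions with q < p

B = rng.normal(size=(p, q))                   # factor loadings
D = np.diag(rng.uniform(0.5, 1.5, size=p))    # diagonal error covariance
Sigma = B @ B.T + D                           # covariance implied by (3)

# Random orthogonal matrix C of order q, taken from a QR decomposition.
C, _ = np.linalg.qr(rng.normal(size=(q, q)))
Sigma_rotated = (B @ C) @ (B @ C).T + D

same_covariance = np.allclose(Sigma, Sigma_rotated)   # BC gives the same Sigma
loadings_differ = not np.allclose(B, B @ C)           # although B itself changed

# Simulate from model (2) with mu = 0 and compare the sample covariance with Sigma.
n = 100_000
U = rng.normal(size=(n, q))                           # factors, N(0, I_q)
E = rng.normal(size=(n, p)) * np.sqrt(np.diag(D))     # errors, N(0, D)
Y = U @ B.T + E
max_abs_error = np.abs(np.cov(Y, rowvar=False) - Sigma).max()
```

Because $C C^T = I_q$, the product $(BC)(BC)^T$ collapses to $B B^T$, which is why the data alone cannot distinguish $B$ from $BC$.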
One (arbitrary) way of uniquely specifying $B$ is to choose the orthogonal matrix $C$ so that $B^T D^{-1} B$ is diagonal (with its diagonal elements arranged in decreasing order); see Lawley and Maxwell (1971, Chapter 1). Assuming that the eigenvalues of $B B^T$ are positive and distinct, the condition that $B^T D^{-1} B$ is diagonal as above imposes $\tfrac{1}{2}q(q-1)$ constraints on the parameters. Hence the number of free parameters is $pq + p - \tfrac{1}{2}q(q-1)$.

The factor analysis model (2) can be fitted by the EM algorithm and its variants, as discussed in the subsequent section for the more general case of mixtures of such models. Note that with the factor analysis model, we avoid having to compute the inverses of iterates of the estimated $p \times p$ covariance matrix $\Sigma$ that may be singular for large $p$ relative to $n$. This is because the inversion of the current value of the $p \times p$ matrix $(B B^T + D)$ on each iteration can be undertaken using the result that

$$(B B^T + D)^{-1} = D^{-1} - D^{-1} B (I_q + B^T D^{-1} B)^{-1} B^T D^{-1}, \tag{4}$$

where the right-hand side of (4) involves only the inverses of $q \times q$ matrices, since $D$ is a diagonal matrix. The determinant of $(B B^T + D)$ can then be calculated as

$$|B B^T + D| = |D| \,/\, |I_q - B^T (B B^T + D)^{-1} B|.$$

Unlike the PCA model, the factor analysis model (2) enjoys a powerful invariance property: changes in the scales of the feature variables in $y_j$ appear only as scale changes in the appropriate rows of the matrix $B$ of factor loadings.

3. Mixtures of factor analyzers

A global nonlinear approach can be obtained by postulating a finite mixture of linear submodels for the distribution of the full observation vector $Y_j$ given the (unobservable) factors $u_j$. That is, we can provide a local dimensionality reduction method by assuming that the distribution of the observation $Y_j$ can be modelled as

$$Y_j = \mu_i + B_i U_{ij} + e_{ij} \quad \text{with prob. } \pi_i \quad (i = 1, \ldots, g) \tag{5}$$

for $j = 1, \ldots, n$, where the factors $U_{i1}, \ldots, U_{in}$ are distributed independently $N(0, I_q)$, independently of the $e_{ij}$, which are distributed independently $N(0, D_i)$, where $D_i$ is a diagonal matrix ($i = 1, \ldots, g$).

Thus the mixture of factor analyzers model is given by (1), where the $i$th component-covariance matrix $\Sigma_i$ has the form

$$\Sigma_i = B_i B_i^T + D_i \quad (i = 1, \ldots, g), \tag{6}$$

where $B_i$ is a $p \times q$ matrix of factor loadings and $D_i$ is a diagonal matrix ($i = 1, \ldots, g$). The parameter vector $\Psi$ now consists of the elements of the $\mu_i$, the $B_i$, and the $D_i$, along with the mixing proportions $\pi_i$ ($i = 1, \ldots, g-1$), on putting $\pi_g = 1 - \sum_{i=1}^{g-1} \pi_i$.

4. Maximum likelihood estimation of mixture of factor analyzers models

The mixture of factor analyzers model can be fitted by using the alternating expectation-conditional maximization (AECM) algorithm (Meng and van Dyk, 1997). The expectation-conditional maximization (ECM) algorithm proposed by Meng and Rubin (1993) replaces the M-step of the EM algorithm by a number of computationally simpler conditional maximization (CM) steps. The AECM algorithm is an extension of the ECM algorithm, where the specification of the complete data is allowed to be different on each CM-step.
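The inversion result (4) and the determinant formula of Section 2 lend themselves to a quick numerical check. The following sketch is an illustration only: the helper name `factor_cov_inverse_and_logdet` and the toy dimensions are assumptions, and the log-determinant is computed through the equivalent form $|D|\,|I_q + B^T D^{-1} B|$.

```python
import numpy as np

def factor_cov_inverse_and_logdet(B, d):
    """Inverse and log-determinant of B B^T + diag(d) using only q x q solves."""
    q = B.shape[1]
    Dinv_B = B / d[:, None]                    # D^{-1} B, cheap because D is diagonal
    M = np.eye(q) + B.T @ Dinv_B               # I_q + B^T D^{-1} B, a q x q matrix
    inv = np.diag(1.0 / d) - Dinv_B @ np.linalg.solve(M, Dinv_B.T)   # identity (4)
    logdet = np.sum(np.log(d)) + np.linalg.slogdet(M)[1]
    return inv, logdet

rng = np.random.default_rng(2)
p, q = 50, 3                                   # illustrative sizes
B = rng.normal(size=(p, q))
d = rng.uniform(0.5, 2.0, size=p)              # diagonal elements of D
Sigma = B @ B.T + np.diag(d)

inv, logdet = factor_cov_inverse_and_logdet(B, d)
inverse_matches = np.allclose(inv, np.linalg.inv(Sigma))
logdet_direct = np.linalg.slogdet(Sigma)[1]

# Determinant formula as stated in the text: |D| / |I_q - B^T Sigma^{-1} B|.
logdet_text = np.sum(np.log(d)) - np.linalg.slogdet(np.eye(q) - B.T @ inv @ B)[1]
```

The two determinant expressions agree because $I_q - B^T (B B^T + D)^{-1} B = (I_q + B^T D^{-1} B)^{-1}$, which follows from (4).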
To apply the AECM algorithm to the fitting of the mixture of factor analyzers model, we partition the vector of unknown parameters $\Psi$ as $(\Psi_1^T, \Psi_2^T)^T$, where $\Psi_1$ contains the mixing proportions $\pi_i$ ($i = 1, \ldots, g-1$) and the elements of the component means $\mu_i$ ($i = 1, \ldots, g$). The subvector $\Psi_2$ contains the elements of the $B_i$ and the $D_i$ ($i = 1, \ldots, g$).

We let $\Psi^{(k)} = (\Psi_1^{(k)T}, \Psi_2^{(k)T})^T$ be the value of $\Psi$ after the $k$th iteration of the AECM algorithm. For this application of the AECM algorithm, one iteration consists of two cycles, and there is one E-step and one CM-step for each cycle. The two CM-steps correspond to the partition of $\Psi$ into the two subvectors $\Psi_1$ and $\Psi_2$.

For the first cycle of the AECM algorithm, we specify the missing data to be just the component-indicator vectors, $z_1, \ldots, z_n$, where $z_{ij} = (z_j)_i$ is one or zero, according to whether $y_j$ arose or did not arise from the $i$th component ($i = 1, \ldots, g$; $j = 1, \ldots, n$). The first conditional CM-step leads to $\pi_i^{(k)}$ and $\mu_i^{(k)}$ being updated to

$$\pi_i^{(k+1)} = \sum_{j=1}^{n} \tau_i(y_j; \Psi^{(k)}) \,/\, n \tag{7}$$

and

$$\mu_i^{(k+1)} = \sum_{j=1}^{n} \tau_i(y_j; \Psi^{(k)}) \, y_j \Big/ \sum_{j=1}^{n} \tau_i(y_j; \Psi^{(k)}) \tag{8}$$

for $i = 1, \ldots, g$, where

$$\tau_i(y_j; \Psi) = \pi_i \, \phi(y_j; \mu_i, \Sigma_i) \Big/ \sum_{h=1}^{g} \pi_h \, \phi(y_j; \mu_h, \Sigma_h) \tag{9}$$

is the $i$th component posterior probability of $y_j$.

For the second cycle for the updating of $\Psi_2$, we specify the missing data to be the factors $u_1, \ldots, u_n$, as well as the component-indicator vectors, $z_1, \ldots, z_n$. On setting $\Psi^{(k+1/2)}$ equal to $(\Psi_1^{(k+1)T}, \Psi_2^{(k)T})^T$, an E-step is performed to calculate $Q(\Psi; \Psi^{(k+1/2)})$, which is the conditional expectation of the complete-data log likelihood given the observed data, using $\Psi = \Psi^{(k+1/2)}$. The CM-step on this second cycle is implemented by the maximization of $Q(\Psi; \Psi^{(k+1/2)})$ over $\Psi$ with $\Psi_1$ set equal to $\Psi_1^{(k+1)}$. This yields the updated estimates $B_i^{(k+1)}$ and $D_i^{(k+1)}$. The former is given by

$$B_i^{(k+1)} = V_i^{(k+1/2)} \gamma_i^{(k)} \left( \gamma_i^{(k)T} V_i^{(k+1/2)} \gamma_i^{(k)} + \omega_i^{(k)} \right)^{-1}, \tag{10}$$

where

$$V_i^{(k+1/2)} = \frac{\sum_{j=1}^{n} \tau_i(y_j; \Psi^{(k+1/2)}) (y_j - \mu_i^{(k+1)})(y_j - \mu_i^{(k+1)})^T}{\sum_{j=1}^{n} \tau_i(y_j; \Psi^{(k+1/2)})}, \tag{11}$$

$$\gamma_i^{(k)} = (B_i^{(k)} B_i^{(k)T} + D_i^{(k)})^{-1} B_i^{(k)} \tag{12}$$

and

$$\omega_i^{(k)} = I_q - \gamma_i^{(k)T} B_i^{(k)} \tag{13}$$

for $i = 1, \ldots, g$. The updated estimate $D_i^{(k+1)}$ is given by

$$D_i^{(k+1)} = \mathrm{diag}\{ V_i^{(k+1/2)} - B_i^{(k+1)} H_i^{(k+1/2)} B_i^{(k+1)T} \} = \mathrm{diag}\{ V_i^{(k+1/2)} - V_i^{(k+1/2)} \gamma_i^{(k)} B_i^{(k+1)T} \}, \tag{14}$$

where

$$H_i^{(k+1/2)} = \frac{\sum_{j=1}^{n} \tau_i(y_j; \Psi^{(k+1/2)}) \, E_i^{(k+1/2)}(U_j U_j^T \mid y_j)}{\sum_{j=1}^{n} \tau_i(y_j; \Psi^{(k+1/2)})} = \gamma_i^{(k)T} V_i^{(k+1/2)} \gamma_i^{(k)} + \omega_i^{(k)} \tag{15}$$

and $E_i^{(k+1/2)}$ denotes conditional expectation given membership of the $i$th component, using $\Psi^{(k+1/2)}$ for $\Psi$.

Direct differentiation of the log-likelihood function shows that the ML estimate of the diagonal matrix $D_i$ satisfies

$$\hat{D}_i = \mathrm{diag}(\hat{V}_i - \hat{B}_i \hat{B}_i^T), \tag{16}$$

where

$$\hat{V}_i = \sum_{j=1}^{n} \tau_i(y_j; \hat{\Psi}) (y_j - \hat{\mu}_i)(y_j - \hat{\mu}_i)^T \Big/ \sum_{j=1}^{n} \tau_i(y_j; \hat{\Psi}). \tag{17}$$

As remarked by Lawley and Maxwell (1971, p. 30) in the context of direct computation of the ML estimate for a single-component factor analysis model, Eq. (16) looks temptingly simple to use to solve for $\hat{D}_i$, but was not recommended due to convergence problems.

On comparing (16) with (14), it can be seen that with the calculation of the ML estimate of $D_i$ directly from the (incomplete-data) log-likelihood function, the unconditional expectation of $U_j U_j^T$, which is the identity matrix, is used in place of the conditional expectation in (15) on the E-step of the AECM algorithm. Unlike the direct approach of calculating the ML estimate, the EM algorithm and its variants such as the AECM version have good convergence properties in that they ensure the likelihood is not decreased after each iteration regardless of the choice of starting point.

It can be seen from (16) that some of the estimates of the elements of the diagonal matrix $D_i$ (the uniquenesses) will be close to zero if effectively not more than $q$ observations are unequivocally assigned to the $i$th component of the mixture in terms of the fitted posterior probabilities of component membership. This will lead to spikes or near singularities in the likelihood. One way to avoid this is to impose the condition of a common value $D$ for the $D_i$,

$$D_i = D \quad (i = 1, \ldots, g). \tag{18}$$

An alternative way of proceeding is to adopt some prior distribution for the $D_i$ as in the Bayesian approaches of Fokoué and Titterington (2000), Ghahramani and Beal (2000) and Utsugi and Kumagai (2001).
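To make the two-cycle structure concrete, here is a minimal sketch of one AECM iteration built from updates (7)-(15), run on simulated toy data. It is an illustration under stated assumptions rather than the authors' implementation: the function names, starting values and data are hypothetical, and the $p \times p$ covariance matrices are inverted directly for brevity instead of through (4).

```python
import numpy as np

def log_component_densities(Y, pis, mus, Bs, ds):
    """log(pi_i * phi(y_j; mu_i, B_i B_i^T + D_i)) as an (n, g) array."""
    n, p = Y.shape
    out = np.empty((n, len(pis)))
    for i, (pi, mu, B, d) in enumerate(zip(pis, mus, Bs, ds)):
        Sigma = B @ B.T + np.diag(d)
        diff = Y - mu
        logdet = np.linalg.slogdet(Sigma)[1]
        maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
        out[:, i] = np.log(pi) - 0.5 * (p * np.log(2 * np.pi) + logdet + maha)
    return out

def posteriors_and_loglik(Y, pis, mus, Bs, ds):
    """Posterior probabilities (9) and the observed-data log-likelihood."""
    logw = log_component_densities(Y, pis, mus, Bs, ds)
    m = logw.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.exp(logw - m).sum(axis=1, keepdims=True))
    return np.exp(logw - log_norm), float(log_norm.sum())

def aecm_iteration(Y, pis, mus, Bs, ds):
    """One AECM iteration: cycle 1 updates (pi, mu), cycle 2 updates (B, D)."""
    n, p = Y.shape
    g, q = len(pis), Bs[0].shape[1]
    # Cycle 1: E-step at Psi^(k), then CM-step (7) and (8).
    tau, _ = posteriors_and_loglik(Y, pis, mus, Bs, ds)
    pis = tau.sum(axis=0) / n
    mus = [tau[:, i] @ Y / tau[:, i].sum() for i in range(g)]
    # Cycle 2: E-step at Psi^(k+1/2), then CM-step (10) and (14).
    tau, _ = posteriors_and_loglik(Y, pis, mus, Bs, ds)
    new_Bs, new_ds = [], []
    for i in range(g):
        diff = Y - mus[i]
        V = (tau[:, i, None] * diff).T @ diff / tau[:, i].sum()            # (11)
        gamma = np.linalg.solve(Bs[i] @ Bs[i].T + np.diag(ds[i]), Bs[i])   # (12)
        omega = np.eye(q) - gamma.T @ Bs[i]                                # (13)
        H = gamma.T @ V @ gamma + omega                                    # (15)
        B_new = V @ gamma @ np.linalg.inv(H)                               # (10)
        new_Bs.append(B_new)
        new_ds.append(np.diag(V - V @ gamma @ B_new.T).copy())             # (14)
    return pis, mus, new_Bs, new_ds

# Toy data from two factor-analytic components (all values are illustrative).
rng = np.random.default_rng(3)
n, p, q, g = 300, 5, 1, 2
true_mus = [np.zeros(p), np.full(p, 4.0)]
true_Bs = [rng.normal(size=(p, q)) for _ in range(g)]
z = rng.integers(0, g, size=n)
Y = np.stack([true_mus[i] + true_Bs[i] @ rng.normal(size=q)
              + rng.normal(scale=0.5, size=p) for i in z])

# Crude starting values, chosen only for this sketch.
pis = np.full(g, 1.0 / g)
mus = [Y.mean(axis=0) - 1.0, Y.mean(axis=0) + 1.0]
Bs = [rng.normal(scale=0.1, size=(p, q)) for _ in range(g)]
ds = [Y.var(axis=0) for _ in range(g)]

logliks = []
for _ in range(30):
    logliks.append(posteriors_and_loglik(Y, pis, mus, Bs, ds)[1])
    pis, mus, Bs, ds = aecm_iteration(Y, pis, mus, Bs, ds)
```

Recording the log-likelihood before each iteration gives a simple diagnostic: the sequence in `logliks` should never decrease, which is the convergence property noted above. For large $p$, the direct solves would be replaced by identity (4).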
The mixture of probabilistic principal component analyzers (PCAs) model, as proposed by Tipping and Bishop (1997), has form (6) with each $D_i$ now having the isotropic structure

$$D_i = \sigma_i^2 I_p \quad (i = 1, \ldots, g). \tag{19}$$

Under this isotropic restriction (19), the iterative updating of $B_i$ and $D_i$ is not necessary since, given the component membership of the mixture of PCAs, $B_i^{(k+1)}$ and $\sigma_i^{(k+1)2}$ are given explicitly by an eigenvalue decomposition of the current value of $V_i$.

5. Initialization of AECM algorithm

We can make use of the link of factor analysis with the probabilistic PCA model (19) to specify an initial value $\Psi^{(0)}$ for $\Psi$ in the ML fitting of the mixture of factor analyzers via the AECM algorithm. On noting that the transformed data $D_i^{-1/2} Y_j$ satisfies the probabilistic PCA model (19) with $\sigma_i^2 = 1$, it follows that for a given $D_i^{(0)}$ and $\Sigma_i^{(0)}$, we can specify $B_i^{(0)}$ as

$$B_i^{(0)} = D_i^{(0)1/2} A_i (\Lambda_i - \tilde{\sigma}_i^2 I_q)^{1/2} \quad (i = 1, \ldots, g), \tag{20}$$

where

$$\tilde{\sigma}_i^2 = \sum_{h=q+1}^{p} \lambda_{ih} \,/\, (p - q).$$

The $q$ columns of the matrix $A_i$ are the eigenvectors corresponding to the eigenvalues $\lambda_{i1} \geq \lambda_{i2} \geq \cdots \geq \lambda_{iq}$ of

$$D_i^{(0)-1/2} \, \Sigma_i^{(0)} \, D_i^{(0)-1/2} \tag{21}$$

and $\Lambda_i = \mathrm{diag}(\lambda_{i1}, \ldots, \lambda_{iq})$. The use of $\tilde{\sigma}_i^2$ instead of unity is proposed in (20) because it avoids the possibility of negative values for $(\Lambda_i - \tilde{\sigma}_i^2 I_q)$, which can occur since estimates are being used for the unknown values of $D_i$ and $\Sigma_i$ in (21).

To specify $\Sigma_i^{(0)}$ for use in (21), we can randomly assign the data into $g$ groups and take $\Sigma_i^{(0)}$ to be the sample covariance matrix of the $i$th group ($i = 1, \ldots, g$). Concerning the choice of $D_i^{(0)}$, we can take $D_i^{(0)}$ to be the diagonal matrix formed from the diagonal elements of $\Sigma_i^{(0)}$ ($i = 1, \ldots, g$). In this case, the matrix (21) has the form of a correlation matrix.

The eigenvalues and eigenvectors for use in (21) can be found by a singular value decomposition of each $p \times p$ sample component-covariance matrix $\Sigma_i^{(0)}$. But if the number of dimensions $p$ is appreciably greater than the sample size $n$, then it is much quicker to find them by a singular value decomposition of the corresponding $n \times n$ matrix, the sample matrix formed by taking the observations to be the rows rather than the columns of the $p \times n$ data matrix whose $n$ columns are the $p$-dimensional observations assigned initially to the $i$th component ($i = 1, \ldots, g$).
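A minimal sketch of the starting value (20) for a single component follows, together with a check that the smaller $n \times n$ matrix yields the same nonzero eigenvalues when $p > n$. The helper name `initial_loadings`, the random toy data and the row-per-observation layout are assumptions made for the example; the same divisor $n - 1$ is used for both matrices here, so their nonzero eigenvalues coincide exactly rather than up to a multiplier.

```python
import numpy as np

def initial_loadings(Y_group, q):
    """Starting values (B0, d0) for one component from the data assigned to it.

    Y_group holds one observation per row (n_i x p), i.e. the transpose of the
    p x n layout used in the text.
    """
    n_i, p = Y_group.shape
    S = np.cov(Y_group, rowvar=False)         # Sigma_i^(0), p x p
    d0 = np.diag(S).copy()                    # D_i^(0): diagonal elements of Sigma_i^(0)
    R = S / np.sqrt(np.outer(d0, d0))         # matrix (21), a correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    lam, A = eigvals[order], eigvecs[:, order[:q]]
    sigma2 = lam[q:].sum() / (p - q)          # mean of the p - q smallest eigenvalues
    scale = np.sqrt(np.clip(lam[:q] - sigma2, 0.0, None))   # guard against rounding
    B0 = np.sqrt(d0)[:, None] * A * scale     # (20)
    return B0, d0

rng = np.random.default_rng(4)
n, p, q = 15, 40, 2                           # illustrative sizes with p > n
Y_group = rng.normal(size=(n, p))
B0, d0 = initial_loadings(Y_group, q)

# When p > n, the nonzero eigenvalues can be obtained from an n x n matrix instead.
Z = (Y_group - Y_group.mean(axis=0)) / np.sqrt(d0)    # centred, scaled observations
big = np.linalg.eigvalsh(Z.T @ Z / (n - 1))[::-1]     # p x p, equal to matrix (21)
small = np.linalg.eigvalsh(Z @ Z.T / (n - 1))[::-1]   # n x n counterpart
```

Since the $q$ largest eigenvalues are never smaller than the average of the remaining ones, the quantity under the square root in (20) is nonnegative; the clipping only protects against floating-point rounding.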
The nonzero eigenvalues of the $n \times n$ matrix described above are equal to those of $\Sigma_i^{(0)}$, apart from a common multiplier due to the different divisors in their formation.

A formal test for the number of factors can be undertaken using the likelihood ratio $\lambda$, as regularity conditions hold for this test conducted at a given value for the number of components $g$. For the null hypothesis $H_0: q = q_0$ versus the alternative $H_1: q = q_0 + 1$, the statistic $-2 \log \lambda$ is asymptotically chi-squared with $d = g(p - q_0)$ degrees of freedom. However, in situations where $n$ is not large relative to the number of unknown parameters, we prefer the use of the BIC criterion of Schwarz (1978). Applied in this context, it means that twice the increase in the log-likelihood ($-2 \log \lambda$) has to be greater than $d \log n$ for the null hypothesis to be rejected.

6. Example: colon data

In this example, we consider the clustering of tissue samples on the basis of two thousand genes for the colon data of Alon et al. (1999). They used Affymetrix oligonucleotide arrays to monitor absolute measurements on the expression of over 6500 human genes in 40 tumour and 22 normal colon tissue samples. These samples were taken from 40 different patients, so that 22 patients supplied both a tumour and a normal tissue sample. Alon et al. (1999) focussed on the 2000 genes with highest minimal intensity across the samples, and it is these 2000 genes that comprised our data set. The matrix $A$ of microarray data for this data set thus has $p = 2000$ rows and $n = 62$ columns.

Before we considered the clustering of this set, we processed the data by taking the (natural) logarithm of each expression level in the matrix $A$. Then each column of this matrix was standardized to have mean zero and unit standard deviation. Finally, each row of the consequent matrix was standardized to have mean zero and unit standard deviation.

We are unable to proceed directly with the fitting of a normal mixture model to these data in this form. But even if we were able to do so, it is perhaps not the ideal way of proceeding, because with such a large number $p$ of feature variables there will be a lot of noise introduced into the problem, and this noise cannot be modelled adequately because of the very small number ($n = 62$) of observations available relative to the dimension $p = 2000$ of each observation.
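The three preprocessing steps described for the colon data (log transform, column standardization, then row standardization) can be sketched as below. The random matrix is only a stand-in for the real expression data, the function name `preprocess` is hypothetical, and the use of the $n - 1$ divisor for the standard deviation is an assumption, since the text does not state which divisor was used.

```python
import numpy as np

def preprocess(A):
    """Log-transform a p x n matrix, then standardise its columns, then its rows."""
    X = np.log(A)                                             # natural logarithm
    # Each column (tissue): mean zero and unit standard deviation.
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Each row (gene) of the resulting matrix: mean zero and unit standard deviation.
    X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)
    return X

rng = np.random.default_rng(5)
A = rng.lognormal(mean=5.0, sigma=1.0, size=(200, 62))   # stand-in for expression levels
X = preprocess(A)

row_means = X.mean(axis=1)
row_sds = X.std(axis=1, ddof=1)
```

Because the rows are standardized last, only the rows are exactly standardized in the output; the columns are in general no longer exactly mean zero with unit standard deviation.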
We therefore applied the screening procedure in the software EMMIX-GENE of McLachlan et al. (2001). With this screening procedure, the genes are ranked in decreasing size of $-2 \log \lambda$, where $\lambda$ is essentially the likelihood ratio statistic for the test of $g = 1$ versus $g = 2$ component $t$ distributions fitted to the 62 tissues with each gene considered individually. If the value of $-2 \log \lambda$ were greater than some threshold (here taken to be 8) but the minimum size of the implied clusters was less than some threshold (here taken to be 8 also), this value of $\lambda$ was replaced by its value for the test of $g = 2$ versus 3 components. This screening of the genes here resulted in 446 genes being retained.

We first clustered the $n = 62$ tissues on the basis of the retained set of 446 genes. We fitted mixtures of factor analyzers for various levels of the number $q$ of factors ranging from $q = 2$ to 8. Using 50 random and 50 k-means-based starts, the clustering corresponding to the largest of the local maxima obtained gave the following clustering for $q = 6$ factors,

$$C_1 = \{1\text{–}12, 20, 25, 41\text{–}52\} \cup \{13\text{–}19, 21\text{–}24, 26\text{–}40, 53\text{–}62\}. \tag{22}$$

Getz et al. (2000) and Getz (2001) reported that there was a change in the protocol during the conduct of the microarray experiments. The 11 tumour tissue samples (labelled 1–11 here) and 11 normal tissue samples (41–51) were taken from the first 11 patients using a poly detector, while the 29 tumour tissue samples (12–40) and normal tissue samples (52–62) were taken from the remaining 29 patients using total extraction of RNA. It can be seen from (22) that this clustering $C_1$ almost corresponds to the dichotomy between tissues obtained under the old and new protocols.

A more detailed account of mixture model-based clustering of this colon data set may be found in McLachlan et al. (2001).

References

Aitkin, M., Anderson, D., Hinde, J., 1981. Statistical modelling of data on teaching styles (with discussion). J. Roy. Statist. Soc. Ser. A 144.
Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J., 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Nat. Acad. Sci. 96.
Banfield, J.D., Raftery, A.E., 1993. Model-based Gaussian and non-Gaussian clustering. Biometrics 49.
Bishop, C.M., 1998. Latent variable models. In: Jordan, M.I. (Ed.), Learning in Graphical Models. Kluwer, Dordrecht.
Chang, W.C., 1983. On using principal components before separating a mixture of two multivariate normal distributions. Appl. Statist. 32.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39.
Fokoué, E., Titterington, D.M., 2000. Bayesian sampling for mixtures of factor analysers. Technical Report, Department of Statistics, University of Glasgow, Glasgow.
Getz, G., 2001. Private communication.
Getz, G., Levine, E., Domany, E., 2000. Coupled two-way clustering analysis of gene microarray data. Cell Biol. 97.
Ghahramani, Z., Beal, M.J., 2000. Variational inference for Bayesian mixtures of factor analyzers. In: Solla, S.A., Leen, T.K., Müller, K.-R. (Eds.), Neural Information Processing Systems 12. MIT Press, MA.
Ghahramani, Z., Hinton, G.E., 1997. The EM algorithm for factor analyzers. Technical Report No. CRG-TR-96-1, The University of Toronto, Toronto.
Hinton, G.E., Dayan, P., Revow, M., 1997. Modeling the manifolds of images of handwritten digits. IEEE Trans. Neural Networks 8.
Lawley, D.N., Maxwell, A.E., 1971. Factor Analysis as a Statistical Method, 2nd Edition. Butterworths, London.
Li, J.Q., Barron, A.R., 2000. Mixture density estimation. Technical Report, Department of Statistics, Yale University, New Haven, Connecticut.
McLachlan, G.J., Krishnan, T., 1997. The EM Algorithm and Extensions. Wiley, New York.
McLachlan, G.J., Peel, D., 2000a. Finite Mixture Models. Wiley, New York.
McLachlan, G.J., Peel, D., 2000b. Mixtures of factor analyzers. In: Langley, P. (Ed.), Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco.
McLachlan, G.J., Bean, R.W., Peel, D., 2001. EMMIX-GENE: a mixture model-based program for the clustering of microarray expression data. Technical Report, Centre for Statistics, University of Queensland.
Meng, X.L., Rubin, D.B., 1993. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80.
Meng, X.L., van Dyk, D., 1997. The EM algorithm – an old folk song sung to a fast new tune (with discussion). J. Roy. Statist. Soc. Ser. B 59.
Schwarz, G., 1978. Estimating the dimension of a model. Ann. Statist. 6.
Tipping, M.E., Bishop, C.M., 1997. Mixtures of probabilistic principal component analysers. Technical Report No. NCRG/97/003, Neural Computing Research Group, Aston University, Birmingham.
Tipping, M.E., Bishop, C.M., 1999. Mixtures of probabilistic principal component analysers. Neural Comput. 11.
Utsugi, A., Kumagai, T., 2001. Bayesian analysis of mixtures of factor analyzers. Neural Comput. 13.