Mxtures of Factor Analyzers wth Common Factor Loadngs for the Clusterng and Vsualsaton of Hgh-Dmensonal Data Jangsun Baek 1 and Geoffrey J. McLachlan 2 1 Department of Statstcs, Chonnam Natonal Unversty, Gwangju -77, South Korea and 2 Department of Mathematcs and Insttute for Molecular Boscence, Unversty of Queensland, Brsbane QLD 472, Australa March, 28 Abstract Mxtures of factor analyzers enable model-based densty estmaton and clusterng to be undertaken for hgh-dmensonal data, where the number of observatons n s very large relatve to ther dmenson p. In practce, there s often the need to reduce further the number of parameters n the specfcaton of the componentcovarance matrces. To ths end, we propose the use of common component-factor loadngs, whch consderably reduces further the number of parameters. Moreover, t allows the data to be dsplayed n low-dmensonal plots. Keywords Normal mxture models, mxtures of factor analyzers, common factor loadngs, model-based clusterng. 1
1 Introducton Fnte mxture models are beng ncreasngly used to model the dstrbutons of a wde varety of random phenomena and to cluster data sets; see, for example, [1]. Let Y = (Y 1,..., Y p ) T (1) be a p-dmensonal vector of feature varables. For contnuous features Y j, the densty of Y can be modelled by a mxture of a suffcently large enough number g of multvarate normal component dstrbutons, f(y; Ψ) = g π φ(y; µ, Σ ), (2) =1 where φ(y; µ, Σ) denotes the p-varate normal densty functon wth mean µ and covarance matrx Σ. Here the vector Ψ of unknown parameters conssts of the mxng proportons π, the elements of the component means µ, and the dstnct elements of the component-covarance matrces Σ ( = 1,..., g). The parameter vector Ψ can be estmated by maxmum lkelhood. For an observed random sample, y 1,..., y n, the log lkelhood functon for Ψ s gven by log L(Ψ) = n log f(y j ; Ψ). (3) j=1 The maxmum lkelhood estmate (MLE) of Ψ, ˆΨ, s gven by an approprate root of the lkelhood equaton, log L(Ψ)/ Ψ =. (4) Solutons of (4) correspondng to local maxmzers of log L(Ψ) can be obtaned va the expectaton-maxmzaton (EM) algorthm [2]; see also [3]. Besdes provdng an estmate of the densty functon of Y, the normal mxture model (2) provdes a probablstc clusterng of the observed data y 1,..., y n nto g clusters n terms of ther estmated posteror probabltes of component membershp of the mxture. The posteror probablty τ (y j ; Ψ) that the jth feature vector wth observed value y j belongs to the th component of the mxture can be expressed by Bayes theorem as τ (y j ; Ψ) = π φ(y j ; µ, Σ ) g h=1 π hφ(y j ; µ h, Σ h ) ( = 1,..., g; j = 1,..., n). () An outrght assgnment of the data s obtaned by assgnng each data pont y j to the component to whch t has the hghest estmated posteror probablty of belongng. The g-component normal mxture model (2) wth unrestrcted component-covarance matrces s a hghly parameterzed model wth d = 1 p(p + 1) parameters for each 2 component-covarance matrx Σ ( = 1,..., g). Banfeld and Raftery [4] ntroduced a parameterzaton of the component-covarance matrx Σ based on a varant of the standard spectral decomposton of Σ ( = 1,..., g). But f p s large relatve to the 2
sample sze n, t may not be possble to use ths decomposton to nfer an approprate model for the component-covarance matrces. Even f t s possble, the results may not be relable due to potental problems wth near-sngular estmates of the componentcovarance matrces when p s large relatve to n. In ths paper, we focus on the use of mxtures of factor analyzers to reduce the number of parameters n the specfcaton of the component-covarance matrces, as dscussed n [1,, 6]; see also [7]. Wth the factor-analytc representaton of the component-covarance matrces, we have that Σ = B B T + D ( = 1,..., g), (6) where B s a p q matrx and D s a dagonal matrx. As 1 q(q 1) constrants are 2 needed for B to be unquely defned, the number of free parameters n (6) s pq + p 1 q(q 1). (7) 2 Thus wth ths representaton (6), the reducton n the number of parameters for Σ s r = 1 p(p + 1) pq p + 1 q(q 1) 2 2 = 1{(p 2 q)2 (p + q)}, (8) assumng that q s chosen suffcently smaller than p so that ths dfference s postve. The total number of parameters s d 1 = (g 1) + 2gp + g{pq 1 q(q 1)}. (9) 2 We shall refer to ths approach as MFA (mxtures of factor analyzers). Even wth ths MFA approach, the number of parameters stll mght not be manageable, partcularly f the number of dmensons p s large and/or the number of components (clusters) g s not small. In ths paper, we therefore consder how ths factor-analytc approach can be modfed to provde a greater reducton n the number of parameters. We extend the model of [8, 9] to propose the normal mxture model (2) wth the restrctons and µ = Aξ ( = 1,..., g) () Σ = AΩ A T + D ( = 1,..., g), (11) where A s a p q matrx, ξ s a q-dmensonal vector, Ω s a q q postve defnte symmetrc matrx, and D s a dagonal p p matrx. As to be made more precse n the next secton, A s a matrx of loadngs on q unobservable factors and ts p columns are taken to be orthonormal; that s, A T A = I q, (12) where I q s the q q dentty matrx. Wth the restrctons () and 11) on the component mean µ and covarance matrx Σ, respectvely, the total number of parameters s reduced to d 2 = (g 1) + p + q(p + g) + 1 (g 1)q(q + 1). (13) 2 3
We shall refer to ths approach as MCFA (mxtures of common factor analyzers) snce t s formulated va the adopton of a common matrx (A) for the component-factor loadngs. We shall show for ths approach how the EM algorthm can be mplemented to ft ths normal mxture model under the constrants () and (11). We shall also llustrate how t can be used to provde lower-dmensonal plots of the data y j (j = 1,..., n). It provdes an alternatve to canoncal varates whch are calculated from the clusters under the assumpton of equal component-covarance matrces. 2 Mxtures of Common Factor Analyzers (MCFA) In ths secton, we examne the motvaton underlyng the MCFA approach wth ts constrants () and (11) on the g component means and covarance matrces µ and Σ ( = 1,..., g). We shall show that t can be vewed as a specal case of the MFA approach. To see ths we frst note that the MFA approach wth the factor-analytc representaton (6) on Σ s equvalent to assumng that the dstrbuton of the dfference Y j µ can be modelled as Y j µ = B U j + e j wth prob. π ( = 1,..., g) (14) for j = 1,..., n, where the (unobservable) factors U 1,..., U n are dstrbuted ndependently N(, I q ), ndependently of the e j, whch are dstrbuted ndependently N(, D ), where D s a dagonal matrx ( = 1,..., g). As noted n the ntroductory secton, ths model may not lead to a suffcently large enough reducton n the number of parameters, partcularly f g s not small. Hence f ths s the case, we propose the MCFA approach whereby the dstrbuton of Y j s modelled as Y j = AU j + e j wth prob. π ( = 1,..., g) (1) for j = 1,..., n, where the (unobservable) factors U 1,..., U n are dstrbuted ndependently N(ξ, Ω ), ndependently of the e j, whch are dstrbuted ndependently N(, D), where D s a dagonal matrx ( = 1,..., g). Here A s a p q matrx of factor loadngs, whch we take to satsfy the relatonshp (12); that s, A T A = I q. To see that the MCFA model as specfed by (1) s a specal case of the MFA approach as specfed by (14), we note that we can rewrte (1) as where Y j = AU j + e j = Aξ + A(U j ξ ) + e j = µ + AK K 1 (U j ξ ) + e j = µ + B U j + e j, (16) µ = Aξ, (17) B = AK, (18) U j = K 1 (U j ξ ), (19) 4
and where the U j are dstrbuted ndependently N(, I q ). The covarance matrx of U j s equal to I q, snce K can be been chosen so that K 1 Ω K 1T = I q ( = 1,..., g). (2) On comparng (16) wth (14), t can be seen that the MCFA model s a specal case of the MFA model wth the addtonal restrctons that and µ = Aξ ( = 1,..., g), (21) B = AK ( = 1,..., g), (22) D = D ( = 1,..., g). (23) The latter restrcton of equal dagonal covarance matrces for the component-specfc error terms (D = D) s sometmes mposed wth applcatons of the MFA approach to avod potental sngulartes wth small clusters (see []). Concernng the restrcton (22) that the matrx of factor of loadngs s equal to AK for each component, t can be vewed as adoptng common factor loadngs before the use of the transformaton K to transform the factors so that they have unt varances and zero covarances. Hence ths s why we call ths approach mxtures of common factor analyzers. It s also dfferent to the MFA approach n that t consders the factor-analytc representaton of the observatons Y j drectly, rather than the error terms Y j µ. As the MFA approach allows a more general representaton of the componentcovarance matrces and places no restrctons on the component means t s n ths sense preferable to the MCFA approach f ts applcaton s feasble gven the values of p and g. If the dmenson p and/or the number of components g s too large, then the MCFA provdes a more feasble approach at the expense of more dstrbutonal restrctons on the data. In emprcal results some of whch are to be reported n the sequel we have found the performance of the MCFA approach s usually at least comparable to the MFA approach for data sets to whch the latter s practcally feasble. The MCFA approach also has the advantage n that the latent factors n ts formulaton are allowed to have dfferent means and covarance matrces and are not whte nose as wth the formulaton of the MFA approach. Thus the (estmated) posteror means of the factors correspondng to the observed data can be used to portray the latter n low-dmensonal spaces. 3 Some Related Approaches The MCFA approach s smlar n form to the approach proposed by Yoshda et al. [8, 9], who also mposed the addtonal restrctons that the common dagonal covarance matrx D of the error terms s sphercal, D = σ 2 I p, (24)
Table 1: The number of parameters for three models p g q Number of parameters MFA 2 2 7999 4 2 1999 2 2 39999 4 2 79999 MCFA 2 2 38 4 2 32 2 2 18 4 2 12 MCUFSA 2 2 27 4 2 217 2 2 7 4 2 17 and that the component-covarance matrces of the factors are dagonal. We shall call ths approach MCUFSA (mxtures of common uncorrelated factor sphercal-error analyzers). The total number of parameters wth ths approach s d 3 = (g 1) + pq + 1 + 2gq 1 q(q + 1). (2) 2 In our experence, we have found that these restrctons of sphercty of the errors and of dagonal covarance matrces n the component dstrbutons of the factors can have an adverse effect on the clusterng of hgh-dmensonal data sets. The relaxaton of these restrctons does consderably ncrease the complexty of the problem of fttng the model. But we shall show how t can be effected va the EM algorthm wth the E- and M-steps beng able to be carred out n closed form. In Table 1, we have lsted the number of parameters to be estmated for the MFA, MCFA, and MCUFSA models when p =, ; q = 2; and g = 2, 4. For example, when we cluster p = dmensonal gene expresson data nto g = 2 groups usng q = 2 dmensonal factors, the MFA model requres 7999 parameters to be estmated, whle the MCUFSA needs only 27 parameters. Moreover, as the number of clusters grows from 2 to 4 the number of parameters for the MFA model grows twce as large as before, but that for MCUFSA remans almost the same. As MCUFSA needs less parameters to characterze the structure of the clusters, t does not always provde a good ft. It may fal to ft the data unless the drectons of the cluster-error dstrbutons are parallel to the axes of the orgnal feature space due the condton of sphercty on the cluster errors. Also, t s assumng that the component-covarance matrces of the factors are dagonal. Recently, Sangunett [] has consdered a method of dmensonalty reducton n a cluster analyss context. However, ts underlyng model assumes sphercty n the specfcaton of the varances/covarances of the factors n each cluster. Our proposed method allows for oblque factors, whch provdes the extra flexblty needed to cluster more effectvely hgh-dmensonal data sets n practce. 6
4 Fttng of Factor-Analytc Models The fttng of mxtures of factor analyzers as wth the MFA approach has been consdered n [], usng a varant of the EM algorthm known as the alternatng expectatoncondtonal maxmzaton algorthm (AECM). Wth the MCFA approach, we have ft to the same mxture model of factor analyzers but wth the addtonal restrctons ()and (11) on the component means µ and covarance matrces Σ. We also have to mpose the restrcton (23) of common dagonal covarance matrces D. The mplementaton of the EM algorthm for ths model s descrbed n the Appendx. In the EM framework, the component label z j assocated wth the observaton y j s ntroduced as mssng data, where z j = (z j ) s one or zero accordng as y j belongs or does not belong to the th component of the mxture ( = 1,..., g; j = 1,..., n). The unobservable factors u j are also ntroduced as mssng data n the EM framework. As part of the E-step, we requre the condtonal expectaton of the component labels z j ( = 1,..., g) gven the observed data pont y j (j = 1,..., n). It follows that E Ψ {Z j y j } = pr Ψ {Z j = 1 y j } = τ (y j ; Ψ) ( = 1,..., g; j = 1,..., n), (26) where τ (y j ; Ψ) s the posteror probablty that y j belongs to the th component of the mxture. From (), t can be expressed under the MCFA model as τ (y j ; Ψ) = π φ(y j ; Aξ, AΩ A T + D) g h=1 π hφ(y j ; Aξ h, AΩ h A T + D) (27) for = 1,..., g; j = 1,..., n. We also requre the condtonal dstrbuton of the unobservable (latent) factors U j gven the observed data y j (j = 1,..., n). The condtonal dstrbuton of U j gven y j and ts membershp of the th component of the mxture (that s, z j = 1) s multvarate normal, U j y j, z j = 1 N(ξ j, Ω y ), (28) where and and where ξ j = ξ + γ T (y j Aξ ) (29) Ω y = (I q γ T A)Ω, (3) γ = (AΩ A T + D) 1 AΩ. (31) We can portray the observed data y j n q-dmensonal space by plottng the correspondng values of the û j, whch are estmated condtonal expectatons of the factors U j, correspondng to the observed data ponts y j. From (28) and (29), E(U j y j, z j = 1) = ξ j = ξ + γ T (y j Aξ ). (32) 7
We let û j denote the value of the rght-hand sde of (32) evaluated at the maxmum lkelhood estmates of ξ, γ, and A. We can defne the estmated value û j of the jth factor correspondng to y j as û j = g τ (y j ; ˆΨ) û j (j = 1,..., n) (33) =1 where, from (27), τ (y j ; ˆΨ) s the estmated posteror probablty that y j belongs to the th component. An alternatve estmate of the posteror expectaton of the factor correspondng to the jth observaton y j s defned by replacng τ (y j ; ˆΨ) by ẑ j n (33), where ẑ j = arg max τ h (y j ; ˆΨ). (34) h Accuracy of Factor-Analytc Approxmatons To llustrate the accuracy of the three factor-analytc approxmatons as defned above, we performed a small smulaton experment. We generated random vectors from each of g = 2 dfferent three-dmensonal multvarate normal dstrbutons. The frst dstrbuton had the mean vector µ 1 = (,, ) T and covarance matrx Σ 1 = 4 1.8 1 1.8 2.9 1.9 2 whle the second dstrbuton had mean vector µ 2 = (2, 2, 6) T and covarance matrx 4 1.8.8 Σ 2 = 1.8 2...8. 2 We appled the MFA, MCFA, and the MCUFSA approaches wth q = 2 to cluster the data nto two groups. We used the ArrayCluster http://www.sm.ac.jp/ ~hguch/arraycluster.htm, whch was developed by Yoshda et al. [9] to mplement the MCUFSA approach. There were 2 msclassfcatons for MFA, 4 for MCFA, and 8 for MCUFSA. As we obtaned the parameter estmates for each model we can also predct each observaton based on the estmated factor scores and the parameter estmates. In Fgures 1, 2, and 3, we have plotted the predcted observatons ŷ j along wth the actual observatons y j by the MFA, MCFA, and the MCUFSA approaches. For the MFA approach, the predcted observaton s obtaned as where and ŷ j =, g τ (y j ; ˆΨ)(ˆµ + ˆB û j ), (3) =1 û j = ˆα T (y j ˆµ ) (36) ˆα = ( ˆB ˆBT + ˆD ) 1 ˆB. (37) 8
1 x3 x2 x1 Fgure 1: Orgnal observatons and the predcted observatons by MFA: Group 1; o Group 2; * predcted for Group 1; + predcted for Group 2 1 x3 x2 x1 Fgure 2: Orgnal observatons and the predcted observatons by MCFA: Group 1; o Group 2; * predcted for Group 1; + predcted for Group 2 For the MCFA approach, the predcted observaton s ŷ j = Âû j, (38) where  s the estmated projecton matrx A and where û j s the estmated factor score for the jth observaton, as defned by (33); smlarly, for the MCUFSA approach. The fgures show that the orgnal dstrbuton structure of two groups s recovered by the estmated factor scores for MFA and MCFA approaches. Ther assumed models are suffcently flexble to ft the data where the drectons of the two cluster-error dstrbutons are not parallel to the axes of the orgnal feature sapce. On the other hand the predcted observatons for the MCUFSA approach are not ftted well to the actual dstrbuton of two groups as shown n Fgure 3. Wth ths approach, the predcted observatons tend to be hgher than the actual observatons from the frst group and lower for those from the second group. Ths lack of ft s due to the strct assumpton of a sphercal covarance matrx for each component-error dstrbuton. We measured the dfference between the predcted and observed observatons by the mean squared error (MSE),where MSE= 2 j=1 (y j ŷ j ) T (y j ŷ j )/2. The value of the MSE for the smulated data s 2.3, 3.8, 17.34 for MFA, MCFA, and MCUFSA, respectvely. As to be expected, the MSE ncreases n gong from MFA to MCFA and then markedly to MCUFSA. 9
1 x3 x2 x1 Fgure 3: Orgnal observatons and the predcted observatons by MCUFSA: Group 1; o Group 2; * predcted for Group 1; + predcted for Group 2 6 Applcatons of MCFA Approach to Clusterng We mplemented the MFA, MCFA, and MCUFSA approaches to cluster tssues n two dfferent gene expresson data sets and ndvduals n one chemcal measurement data set. We compared the consstency between the mpled clusterng obtaned wth each approach wth the true group membershp of each data set. We adopted the clusterng correspondng to the local maxmzer that gave the largest value of the lkelhood as obtaned by mplementng the EM algorthm for tral starts, comprsng 2 k-means starts and 2 random starts. To measure the agreement between a clusterng of the data and ther true group membershp, we used the consstency measures of Jaccard Index [11] and the Adjusted Rand Index (ARI)[12]. Both ndces take the value 1 when there s perfect valdaton between the clusterng and the true groupng. The Jaccard Index takes any value between and 1, but the ARI can have negatve values. 6.1 Example 1: Leukaema Gene-Expresson Data The frst data set concerns the leukaema tssue samples of Golub et al. [13], n whch there are two types of acute leukaema: Acute Lymphoblastc Leukaema (ALL) and Acute Myelod Leukaema (AML). The data contan the expresson levels of 7129 genes on 72 tssues comprsng 47 cases of ALL and 2 cases of AML. We preprocessed the data by the followng steps: () thresholdng wth a floor of and a celng of 16; () flterng: excluson of genes wth max / mn and (max mn), where max and mn refer to the maxmum and mnmum expresson levels, respectvely, of a partcular gene across a tssue sample; () takng logs of the expresson levels and then standardzng them across genes to have mean zero and unt standard devaton. Fnally, standardzng them across samples to have mean zero and unt standard devaton. Ths preprocessng resulted n thousand genes beng retaned. We reduced ths set further to genes by selectng genes accordng to the smple two-sample t-test as used n [14]. As observed n prevous studes, the data on the ALL and AML cases are well separated, confrmng the bologcal dstncton between ALL and AML subtypes. Hence all three approaches were able to cluster the nto two clusters that almost corresponded
Table 2: Agreement between the clusterng result and the true membershp of leukaema data q Smlarty Indces Model 1 2 3 4 MFA.91.91.91.91 Jaccard MCFA.91.91.91.91 MCUFSA.91.91.91.91 MFA.94.94.94.94 ARI MCFA.94.94.94.94 MCUFSA.94.94.94.94 8 6 ALL AML 4 2 u 2 2 4 6 8 8 6 4 2 2 4 6 8 u 1 Fgure 4: Plot of the factor scores wth the MCFA approach perfectly wth the ALL and AML subtypes. There was only one msallocaton wth the three approaches usng ether q = 1, 2, 3, or 4 factors. Hence n Table 2 the values of the Jaccard Index are the same for all three approaches, as are the ARI values. We plotted the estmated factor scores û j = (û 1j, û 2j ) T of the tssues n twodmensonal space accordng to ther clustered membershp determned by the MCFA wth q = 2 factors (Fgure 4). It can be seen from Fgure 4 that the two classes ALL and AMl are well separated. The one AML tssue that s msallocated as ALL s crcled n Fgure 4. 6.2 Example 2: Paedatrc Leukaema Gene-Expresson Data In the second example, we consdered the clusterng of the Paedatrc Acute Lymphoblastc Leukaema (ALL) data of Yeoh et al. [1], who analyzed some gene expressons on 7 subtypes of 327 ALL tssues, consstng of BCR-ABL, E2A-PBXI, Hyperdplod (> ), MLL, Novel, T-ALL, and TEL-AML1 subtypes. They performed an herarchcal cluster analyss of the 327 dagnostc cases usng genes selected by several dfferent methods ncludng ch-squared and the t-statstc. We chose for our analyss the 2 genes wth hghest ch-squared statstcs n each subtype. (Ths resulted n a total 132 unque genes as some were chosen for more than one subtype. Gven that the 11
Table 3: Agreement between the clusterng result and the true membershp of paedatrc ALL data q Smlarty Indces Model 1 2 3 4 MFA.632.642.648..4 Jaccard MCFA NA.3.97.6.6 MCUFSA - -.93 -.84 MFA.72.73.741.6.632 ARI MCFA NA.392.68.7.74 MCUFSA - -.687 -.677 number of components (subtypes) here s not small wth g = 7, we mposed the constrant (23) of common dagonal matrces D n the formulaton of the MFA approach. Ths constrant s always mposed wth the MCFA approach. The values of the Jaccard Index and the ARI are lsted n Table 3 for the three approaches for each of fve levels of the number of factors q (q = 1, 2, 3, 4, ). In the case of a sngle factor (q = 1), the MFCA approach usng g = 7 components clustered the 327 tssues nto only clusters, and so the ndces were unable to be calculated. Ths s noted by NA (not avalable) for q = 1 n Table 3. It can be seen that the performance of the MCFA approach for q 3 factors s comparable to the MFA approach, although ther best results for the Jaccard and Adjusted Rand Indces occur for dfferent values of q. The ndces for the latter are greatest for q = 3 factors, whereas they are greatest for the MCFA approach for q = factors. The ArrayCluster program for the mplementaton of the MCUFSA approach gave results for only q = 3 and factors, whch are not as hgh as those for the MCFA approach. 6.3 Example 3: Chemcal Data wth Addtonal Nose Added The thrd example consders the so-called Vetnam data whch was consdered n Smyth et al. [16]. The Vetnam data set conssts of the log transformed and standardzed concentratons of 17 chemcal elements to whch four types of synthetc nose varables were added n [16] to study methods for clusterng hgh-dmensonal data. We used these data consstng of a total of 67 varables (p = 67; 17 chemcal concentraton varables plus normal nose varables). The concentratons were measured n har samples from sx classes (g = 6) of Vetnamese, and the total number of subjects were n = 224. The values of the ndces for the clusterng results for ths set are presented n Table 4. It can be seen that the hghest value (.81) of the ARI for the MFCA approach was obtaned for q = 4 factors, compared to a hghest value of.611 for ths ndex at q = 3 factors wth the MFA approach. A smlar comparson can be made on the bass of the Jaccard Index. Agan, t was not possble to obtan results for the MCUFSA approach for all values of q. 12
Table 4: Agreement between the clusterng result and the true membershp of Vetnamese data q Smlarty Indces Model 1 2 3 4 MFA.416.7.26.483.42 Jaccard MCFA.316.6.9.738.691 MCUFSA -.374.83.77.76 MFA.491.9.611.6. ARI MCFA.331.7.676.81.778 MCUFSA -.44.681.671.67 7 Low-Dmensonal Plots va MCFA Approach To llustrate the usefulness of the MFCA approach for portrayng the results of a clusterng n low-dmensonal space, we have plotted n Fgure the estmated mean posteror values of the factors q j as defned by (33). It can be seen that the clusters are represented n ths plot wth very lttle overlap. Ths s not the case n Fgure 6, where the frst two canoncal varates are plotted. They were calculated usng the mpled clusterng labels. It can be seen from Fgure 6 that one cluster s essentally on top of another. The canoncal varates are calculated on the bass of the assumpton of equal cluster-covarance matrces, whch does not apply here. The MCFA approach s not predcated on ths assumpton and so has more flexblty n representng the data n reduced dmensons. Fgure : Plot of the (estmated) posteror mean factor scores 8 Dscusson and Conclusons In practce, much attenton s beng gven to the use of normal mxture models n densty estmaton and clusterng. However, for hgh-dmensonal data sets, the componentcovarance matrces are hghly parameterzed and some form of reducton n the number of parameters s needed, partcularly when the number of observatons n s not large 13
Fgure 6: Plot of the frst two canoncal varates based on the mpled clusterng va MCFA approach relatve to the number of dmensons p. One way of proceedng s to work wth mxtures of factor analyzers (MFA) as studed n [1, Chapter 8]. Ths approach acheves a reducton n the number of parameters through ts factor-analytc representaton of the component-covarance matrces. But t may not provde a suffcent reducton n the number of parameters, partcularly when the the number g of clusters (components)to be mposed on the data s not small. In ths paper, we show how n such nstances the number of parameters can be reduced apprecably by usng a factor-analytc representaton of the component-covarance matrces wth common factor loadngs. The approach s called mxtures of common factor analyzers (MCFA). Ths sharng of the factor loadngs enables the model to be used to cluster hgh-dmensonal nto many clusters and to provde low-dmensonal plots of the clusters so obtaned. The latter plots are gven n terms of the (estmated) posteror means of the factors correspondng to the observed data. These projectons are not useful wth the MFA approach as n ts formulaton the factors are taken to be whte nose wth no cluster-specfc dscrmnatory features for the factors. The MFA approach does allow a more general representaton of the component varances/covarances and places no restrctons on the component means. Thus t s more flexble n ts modellng of the data. But n ths paper we demonstrate that MCFA provdes a comparable approach that can be appled n stuatons where the dmenson p and the number of clusters g can be qute large. We have presented analyses of both smulated and real data sets to demonstrate the usefulness of the MCFA approach. In practce, we can use the Bayesan Informaton Crteron (BIC) of Schwartz [18] to provde a gude to the choce of the number of factors q and the number of number of components g to be used. On the latter choce t s well known that regularty condtons do not hold for the usual ch-squared approxmaton to the asymptotc null dstrbuton of the lkelhood rato test statstc to be vald. However, they do hold for tests on the number of factors at a gven level of g, and so we can also use the lkelhood rato test statstc to choose q; see [1, Chapter 8]. 14
9 Appendx The model (1) underlyng the MCFA approach can be ftted va the EM algorthm to estmate the vector Ψ of unknown parameters. It conssts of the mxng proportons π, the factor component-mean vectors ξ, the dstnct elements of the factor componentcovarance matrces Ω, the projecton matrx A based on sharng of factor loadngs, and the common dagonal matrx D of the resduals gven the factor scores wthn a component of the mxture. In order to apply the EM algorthm to ths problem, we ntroduce the component-ndcator labels z j, where z j s one or zero accordng to whether y j belongs or does not belong to the th component of the model. We let z j be the component-label vector, z j = (z 1j,..., z gj ) T. The z j are treated as mssng data, along wth the (unobservable) latent factors u j wthn ths EM framework. The complete-data log lkelhood s then gven by log L c (Ψ) = g n z j {log π + log φ(y j ; Au j, D) + log φ(u j ; ξ, Ω )}. (39) =1 j=1 E-step On the E-step, we requre the condtonal expectaton of the complete-data log lkelhood, log L c (Ψ), gven the observed data y = (y T 1,..., y T n) T, usng the current ft for Ψ. Let Ψ (k) be the value of Ψ after the kth teraton of the EM algorthm. Then more specfcally, on the (k + 1)th teraton the E-step requres the computaton of the condtonal expectaton of log L c (Ψ) gven y, usng Ψ (k) for Ψ, whch s denoted by Q(Ψ; Ψ (k) ). We let τ (k) j = τ (y j ; Ψ (k) ), (4) where τ (y j ; Ψ) s defned by (27). Also, we let E Ψ (k) refer to the expectaton operator, usng Ψ (k) for Ψ. Then the so-called Q-functon, Q(Ψ; Ψ (k) ), can be wrtten as g n Q(Ψ; Ψ (k) ) = τ (k) j {log π + w (k) 1j + w(k) 1j }, (41) where and =1 j=1 w (k) 1j = E Ψ (k){log φ(y j; Au j, D) y j, z j = 1} (42) w (k) 2j = E Ψ (k){log φ(u j; ξ, Ω ) y j, z j = 1}. (43) M-step On the (k + 1)th teraton of the EM algorthm, the M-step conssts of calculatng the updated estmates π (k+1), ξ (k+1), Ω (k+1), A (k+1), and D (k+1) by solvng the equaton Q(Ψ; Ψ (k) )/ Ψ =. (44) 1
The updated estmates of the mxng proportons π are gven as n the case of the normal mxture model by π (k+1) = n j=1 τ (k) j /n ( = 1,..., g). (4) Concernng the other parameters, t can be shown usng vector and matrx dfferentaton that Q(Ψ; Ψ (k) )/ ξ = Ω 1 Q(Ψ; Ψ (k) )/ Ω 1 = Q(Ψ; Ψ (k) )/ D 1 = Q(Ψ; Ψ (k) )/ A = n j=1 g =1 g =1 n j=1 τ (k) j E Ψ (k){(u j ξ ) y j }, (46) τ (k) 1 j [Ω 2 E Ψ (k){(u j ξ )(u j ξ ) T y j }], (47) n j=1 n j=1 τ (k) 1 j [D E 2 Ψ (k){(y j Au j )(y j u j ) T y j }], τ (k) j [D 1 {y j E Ψ (k)(u T j y j ) (48) AE Ψ (k)(u j u T j y j )}]. (49) On equatng (46) to the zero vector, t follows that ξ (k+1) can be expressed as ξ (k+1) = ξ (k) + n j=1 τ (k) j γ(k)t (y j A (k) ξ (k) n j=1 τ (k) j ), () where γ (k) = (A (k) Ω (k) A (k)t + D (k) ) 1 A (k) Ω (k). (1) On equatng (47) to the null matrx, t follows that Ω (k+1) = n j=1 τ (k) j γ(k)t (y j A (k) ξ (k) )(y j A (k) ξ (k) ) T γ (k) n j=1 τ (k) j +(I q γ (k)t A (k) )Ω (k) (2) On equatng (48) to the zero vector, we obtan D (k+1) = dag(d (k) 1 + D (k) 2 ), (3) where D (k) 1 = g n =1 j=1 τ (k) j D(k) (I p β (k) ) g n =1 j=1 τ (k) j (4) 16
and and where D (k) 2 = g =1 n j=1 τ (k) j β (k) β(k)t (y j A (k) ξ (k) )(y j A (k) ξ (k) ) T β (k) g n =1 j=1 τ, () (k) j = (A (k) Ω (k) A (k)t + D (k) ) 1 D (k). (6) On equatng (49) to the null matrx, we obtan A (k+1) = ( g =1 A (k) 1 )( g =1 A (k) 2 ) 1, (7) where and A (k+1) 1 = A (k+1) 2 = n j=1 n j=1 r (k) τ (k) j {y jξ (k)t τ (k) j = ξ (k) + (y j A (k) ξ (k) ) T γ (k) }, (8) {(I q γ (k)t A (k) )Ω (k) + r (k) r (k)t }, (9) + γ (k)t (y j A (k) ξ (k) ) (6) The factor loadng matrx A (k+1) must be orthogonal n columns, that s, A (k+1)t A (k+1) = I q. (61) We can use the Cholesky decomposton to fnd the upper trangular matrx C (k+1) of order q so that A (k+1)t A (k+1) = C (k+1)t C (k+1). (62) Then t follows that f we replace A (k+1) by A (k+1) C (k+1) 1, (63) then t wll satsfy the requrement (61). Wth the adopton of the estmate (63) for A, we need to adjust the updated estmates ξ (k+1) and Ω (k+1) to be and where ξ (k+1) and Ω (k+1) C (k+1) ξ (k+1) (64) C (k+1) Ω (k+1) C (k)t, (6) are gven by () and (2), respectvely. We have to specfy an ntal value for the vector Ψ of unknown parameters n the applcaton of the EM algorthm. A random start s obtaned by frst randomly assgnng the data nto g groups. Let n, ȳ, and S be the number of observatons, the sample mean, and the sample covarance matrx, respectvely, of the th group of the data so obtaned ( = 1,..., g). We then proceed as follows: 17
Set π () = n /n. Generate random numbers from the standard normal dstrbuton N(, 1) to obtan values for the (j, k)th element of A (j = 1,..., p; k = 1,..., q). Implement the Cholesky decomposton so that A T A = C T C, where C s the upper trangle matrx of order q, and defne A () by A C 1. On notng that the transformed data D 1/2 Y j satsfes the probablstc PCA model of Tppng and Bshop [17] wth σ 2 = 1, t follows that for a gven D () and A (), we can specfy Σ () as Ω () = A ()T D ()1/2 H (Λ σ 2 I q)h T D()1/2 A (), where σ 2 = p h=q+1 λ h/(p q). The q columns of the matrx H are the egenvectors correspondng to the egenvalues λ 1 λ 2 λ q of D () 1/2 Ω () D () 1/2, (66) where Ω () s the covarance matrx of the y j n the th group, and Λ s the dagonal matrx wth dagonal elements equal to λ 1,..., λ q. Concernng the choce of D (), we can take D () to be the dagonal matrx formed from the dagonal elements of the (pooled) wthn-cluster sample covarance matrx of the y j. The ntal value for ξ s ξ () = A ()T ȳ. Some clusterng procedure such as k-means can be used to provde non-random parttons of the data, whch can be used to obtan another set of ntal values for the parameters. In our analyses we used both ntalzaton methods. Acknowledgement The work of J. Baek was supported by the Korea Research Foundaton Grant funded by the Korean Government (MOEHRD, Basc Research Promoton Fund, KRF-27-21-C48). The work of G. McLachlan was supported by the Australan Research Councl. REFERENCES [1] G.J. McLachlan and D. Peel, Fnte Mxture Models. New York: Wley, 2. [2] A.P. Dempster, N.M. Lard, and D.B. Rubn, Maxmum lkelhood from ncomplete data va the EM algorthm (wth dscusson), Journal of the Royal Statstcal Socety: Seres B, vol. 39, pp. 1 38, 1977. [3] G.J. McLachlan and T. Krshnan, The EM Algorthm and Extensons. Second Edton. New York: Wley, 28. 18
[4] J.D. Banfeld and A.E. Raftery, Model-based Gaussan and non-gaussan clusterng, Bometrcs, Vol. 49, pp. 83 821, 1993. [] G.J. McLachlan, D. Peel, and R.W. Bean, Modellng hgh-dmensonal data by mxtures of factor analyzers, Computatonal Statstcs & Data Analyss, vol. 41, pp. 379 388, 23. [6] G.J. McLachlan, R.W. Bean, and L. Ben-Tovm Jones, Extenson of the mxture of factor analyzers model to ncorporate the multvarate t dstrbuton, Computatonal Statstcs & Data Analyss, Vol. 1, 327 338, 27. [7] G.E. Hnton, P. Dayan, and M. Revow, M. Modelng the manfolds of mages of handwrtten dgts, IEEE Transactons on Neural Networks, vol. 8, pp. 6 73, 1997. [8] R. Yoshda, T. Hguch, and S. Imoto, A mxed factors model for dmenson reducton and extracton of a group structure n gene expresson data, Proceedngs of the 24 IEEE Computatonal Systems Bonformatcs Conference, pp. 161 172, 24. [9] R. Yoshda, T. Hguch, S. Imoto, and S. Myano, ArrayCluster: an analytc tool for clusterng, data vsualzaton and model fnder on gene expresson profles, Bonformatcs, vol. 22, pp. 138 139, 26. [] G. Sangunett, Dmensonalty redcton of clustered data sets, IEEE Transactons Pattern Analyss and Machne Intellgence, Vol. 3, pp. 3 4. [11] P. Jaccard, Dstrbuton de la florne alpne dans la Bassn de Dranses et dans quelques regones vosnes, Bulletn de la Socété Vaudose des Scences Naturel les, Vol. 37, pp. 241 272, 191. [12] L. Hubert and P. Arabe, Comparng parttons, Journal of Classfcaton, Vol. 2, 193 218, 198. [13] T.R. Golub, D.K. Slonm, P. Tamayo, C. Huard, M. Gaasenbeck, P. Mesrov, H. Coller, M.L. Loh, J.R. Downng, M.A. Calgur, et al., Molecular classfcaton of cancer: Class dscovery and class predcton by gene expresson montorng, Scence, 286, 31-37, 1999. [14] D. Nguyen, and D. Rocke, Tumor classfcaton by partal least squares usng mcroarray gene expresson data, Bonformatcs, Vol. 18, pp. 39, 22. [1] E. Yeoh, and Ross, M.E., et al., Classfcaton, subtype dscovery, and predcton of outcome n pedatrc acute lymphoblastc leukema by gene expresson proflng, Cancer Cell, Vol. 1, pp. 133 143, 22. [16] C. Smyth, D. Coomans, and Y. Everngham, Clusterng nosy data n a reduced dmenson space va multvarate regresson trees, Pattern Recognton, Vol. 39, pp. 424 431, 26. [17] M.E. Tppng, and C.M. Bshop, Mxtures of probablstc prncpal component analysers, Neural Computaton, Vol. 11, pp. 443 482, 1999. [18] G. Schwarz, Estmatng the dmenson of a model, Annals of Statstcs, Vol. 6, pp. 461 464, 1978. 19