Data Visualization by Pairwise Distortion Minimization


Communications in Statistics, Theory and Methods 34 (6), 2005

By Marc Sobel and Longin Jan Latecki*
Department of Statistics and Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122

We dedicate this paper to the memory of Milton Sobel, who provided inspiration to us and to the academic community as a whole.

ABSTRACT

Data visualization is achieved by minimizing the distortion that results from observing the relationships between data points. Typically, this is accomplished by estimating latent data points designed to accurately reflect the pairwise relationships between observed data points. The distortion masks the true pairwise relationships between data points, which are represented by the latent data. Distortion can be modeled as masking dissimilarity measures between data points or, alternatively, as masking their pairwise distances. The latter class of models is encompassed by metric scaling methodology (MDS); the former are introduced here as competitors. The former class of models includes Principal Components Analysis, which minimizes the global distortion between observed and latent data. We model distortion using mixtures of pairwise difference factor analysis statistical models. We employ an algorithm, which we call stepwise forward selection, to identify appropriate starting values and to determine the appropriate dimensionality of the latent data space. We show that the pairwise factor analysis models frequently fit the data better because they allow for direct modeling of pairwise dissimilarities between data points.

* Marc Sobel is an Associate Professor in the Department of Statistics, Fox School of Business and Management, 1810 N. 13th Street; Longin Jan Latecki is an Associate Professor in the Department of Computer and Information Sciences (CIS), 314 Wachman Hall, 1805 N. Broad Street; Temple University, Philadelphia, PA 19122.

1. INTRODUCTION

There has been considerable interest in both the machine learning and statistical modeling literature in comparing, registering, and classifying image data. From the practitioner's perspective, such algorithms offer a number of advantages when they are successful. First, they provide mechanisms for visualizing the data. Second, they provide a mechanism for learning the important features of the data. Because feature vectors typically live in very high-dimensional spaces, reducing their dimensionality is crucial to most data mining tasks. Many algorithms for reducing data dimensionality depend on estimating latent (projected) variables designed to minimize certain energy (or error) functions. Other algorithms achieve the same purpose by estimating latent variables using statistical models.

Generally, dimensionality reduction methods can be divided into metric and nonmetric methods. Metric methods start with data points (in a high-dimensional space) with observed pairwise distances between them. The goal of metric methods is to estimate latent data points, living in a lower-dimensional space, whose pairwise distances accurately reflect the pairwise distances between the observed data points. Methods of this sort include those proposed by J. Sammon [2]. Nonmetric methods start with data points whose pairwise relationships are given by dissimilarities, which need not correspond to a distance. In contrast, principal components analysis minimizes the global distortion between observed and latent data values. Nonmetric methods incorporate additional steps designed to provide constraints within which latent dissimilarities, having appropriate properties, can be optimally estimated. Methods of this sort include those of Kruskal [3].

In this paper we take as our starting point observed data points living in a high-dimensional space. The pairwise relationships between these data points are represented by the corresponding relationships between latent data points living in a low-dimensional space. The pairwise dissimilarities between observed data points are masked by noise.
This noise could arise in many different settings; examples include: (i) settings where partitioning data into groups is of paramount interest, and lack of

straightforward clusters can be modeled as the impact of noise on the pairwise relationships between data, and (ii) settings where the energy of data points is being modeled; in this case noise arises in evaluating the relationship between the energy of neighboring data points. Our approach differs from that of probabilistic principal components (see [4]), where noise masks the relationship between each individual data point and its latent counterpart. By contrast, in our approach noise masks pairwise dissimilarities between data points and analogous latent quantities; we will see below that this difference in approach allows us to build some extra flexibility into the interpretation and modeling of high-dimensional data. Our approach is similar in spirit to the approach employed in relational Markov models [5].

The main goal of multidimensional scaling (MDS) is to minimize the distortion between pairwise data distances and the corresponding pairwise distances between their latent projections. This ensures that the latent (or projected) data optimally reflect the internal structure of the data. MDS algorithms frequently involve constructing loss functions which prescribe (scaled) penalties for differences between observed and latent pairwise distances. See also [6] for a graphical analysis of MDS. MDS methods are used widely in the behavioral, econometric and social sciences [7]. The most commonly used nonlinear projection methods within MDS involve minimizing measures of distortion (like those of Kruskal and Sammon) (e.g., Section 1.3 in [8] and [9]). These measures of distortion frequently take the form of a loss or energy function. For purposes of comparison we focus on the loss (energy) function proposed by J. Sammon in [2].

In this paper we compare MDS methods (as represented by that of Sammon) with those using pairwise difference factor analysis methodology. We employ stepwise forward selection algorithms (see below) to provide good estimates of the dimension of the latent data space and

starting vectors appropriate for use with either of these two methodologies. Other starting value options which have been recommended include (i) random starting values (see [2], [8], and [13]) and (ii) starting at latent variable values arising from employing principal components analysis (PCA) [2]. The former option, still commonly used, fails to provide any useful training (information). The latter option fails because PCA does not provide optimal (or near-optimal) solutions to minimizing the (noise) distortion of the data. In fact, as will be seen below in the examples, the distortion reduction for PCA-generated latent variables is very small.

For multidimensional scaling models, after employing stepwise forward selection algorithms for the aforementioned purposes, we typically use gradient descent methods (see e.g., [9] and [10]) to minimize Sammon's cost function. For factor analysis mixture models, after employing stepwise forward selection algorithms, we use the EM algorithm (see [11]) to provide estimates of the parameters. We partition the pairwise differences between data into two groups by determining data membership using EM-supplied probabilities and an appropriate threshold. The first group consists of those pairs of data points with small pairwise differences; the second of those pairs of data points with large pairwise differences. The first group of pairs provides a mechanism for distinguishing data clusters; the second group provides a mechanism for distinguishing which pairs of points are different from one another.

In the next section we compare two different ways of projecting the relationships between pairs of data points into latent k-dimensional space, denoted by $R^k$; typically k will be taken to be 2 or 3. Using the notation $F_1, \dots, F_n$ for the observed feature vector data, multidimensional scaling is concerned with projecting a known real-valued dissimilarity

function, $\{D(i,j) = D(F_i, F_j)\}$, of the ordered pairs of features $\{F_i, F_j\}$ onto their latent k-dimensional functional counterparts $\{\mu_i - \mu_j\}$ ($1 \le i < j \le n$). Sammon's energy function provides a typical example of this. In this setting the latent µ-vectors are chosen to minimize a loss (or energy) function of the form

$$S_H(F, \mu) = \sum_{1 \le i < j \le n} \frac{\left( D(i,j) - \|\mu_i - \mu_j\| \right)^2}{D(i,j)} \tag{1.1}$$

in $\mu$. We have in mind the example $D(i,j) = l'(F_i - F_j)$. Many variations on this basic theme have been explored in the literature (see [3]). As a counterpoint to this approach, we introduce the next section.

2. FORMULATING THE PROBLEM USING STATISTICAL MODELS

In this section we assume (as above) that the feature vectors associated with each data object are themselves observed. We employ a variant of the probabilistic principal components models introduced in [4]. Our variant is designed to take account of the fact that we seek to model the noise distortion between pairs of feature vectors rather than the noise distortion associated with each individual feature vector. We follow the main principle of MDS, which is to map the data to a low-dimensional space in such a way that the distortion between data points is minimal. We introduce some necessary notation first. Let $D(i,j)$ denote the dissimilarity between feature vectors $F_i$ and $F_j$ ($1 \le i < j \le n$) (which is allowed to live in more than one dimension).
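As a concrete illustration of the loss in equation (1.1), the following is a minimal sketch of how Sammon's energy might be evaluated for a candidate set of latent vectors. It assumes Euclidean distances as the observed dissimilarities and uses small random toy data; the function names `pairwise_distances` and `sammon_energy` are hypothetical and are not taken from the paper.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X (shape n x d)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_energy(D, mu):
    """Sum over pairs i < j of (D(i,j) - ||mu_i - mu_j||)^2 / D(i,j).

    Assumes D(i,j) > 0 for every pair i < j (no duplicate points).
    """
    d_latent = pairwise_distances(mu)
    i, j = np.triu_indices(D.shape[0], k=1)   # index pairs with i < j
    return float(((D[i, j] - d_latent[i, j]) ** 2 / D[i, j]).sum())

# Toy data (an assumption for illustration): 5 feature vectors in R^4.
rng = np.random.default_rng(0)
F = rng.normal(size=(5, 4))
D = pairwise_distances(F)

energy_exact = sammon_energy(D, F)         # latent points equal to the data
energy_trunc = sammon_energy(D, F[:, :2])  # keep only the first 2 coordinates
```

Taking the latent points equal to the data gives zero energy, while discarding coordinates shortens every pairwise distance and so gives a strictly positive energy; this is the distortion that the projection methods discussed here try to keep small.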

Explicitly, we assume that this dissimilarity measure $D(i,j)$ lives in a Euclidean space with dimensionality p (assumed to be less than or equal to the dimensionality k of the feature space). In the example given below, we take $D(i,j) = F_i - F_j$ ($1 \le i < j \le n$), in which case p = k. Other examples include assuming that $D(i,j) = l'(F_i - F_j)$ (for a known p-vector l). The general linear statistical model assumed below takes the form:

$$D(i,j) = A^{(g)}\left(\mu_i^{(g)} - \mu_j^{(g)}\right) + \varepsilon_{i,j}; \quad 1 \le i < j \le n \tag{2.1}$$

where:

$g$ identifies the particular mixture model component (i.e., $g(i,j) = s$ means that the pair $(i,j)$ belongs to mixture component s);

$A^{(g)}$ are parametric $p \times q$ matrices indexed by the component g;

$\mu_i^{(g)}$ are parametric $q \times 1$ latent vectors for feature $F_i$, indexed by the component g and the observation index i ($1 \le i \le n$);

$\varepsilon_{i,j}$ is the pairwise noise distortion for features $F_i, F_j$ ($1 \le i < j \le n$).

It is assumed below that the errors $\varepsilon_{i,j}$ are normally distributed with common variance $(\sigma^{(g)})^2 I$. While the dimensionality p of the D's (defined above) may be quite high, the dimensionality q of the latent µ vectors will typically be assumed to be quite small. (In the composite movie example analyzed in section 4, below, L is taken to be 5.) The matrices $A^{(g)}$ are (latent) projection matrices projecting paired differences between parametric latent µ vectors onto their feature vector paired difference counterparts. We use the EM algorithm [11] to estimate the model parameters under the assumption that the observed dissimilarities are given by $D(i,j) = F_i - F_j$ ($1 \le i < j \le n$). The equations needed for this calculation are given in the appendix. In equation (2.2), below, we assume that the aforementioned mixture model, indexed by g, consists of exactly 2 components. The first component comprises pairs of

observations with small variance; the second comprises pairs of observations with large variance. The first component model is designed to characterize those pairs of feature vectors whose difference is well approximated by the corresponding difference between their latent variable counterparts; the second, those pairs of feature vectors whose difference is not well approximated by this difference. Specifically, we assume that the first component variance $[\sigma^{(1)}]^2$ is significantly smaller than the second, $[\sigma^{(2)}]^2$. First component model parameters were selected to minimize quantities of the form

$$SS[g=1] = \sum_{1 \le i < j \le n} P(D(i,j) \mid g=1)\, \frac{\left\| D(i,j) - \hat{A}^{(g=1)}\left(\hat{\mu}_i^{(g=1)} - \hat{\mu}_j^{(g=1)}\right) \right\|^2}{\left[\hat{\sigma}^{(g=1)}\right]^2} \tag{2.2}$$

where the hatted quantities are the EM algorithm estimates of the corresponding parameters and $P(D(i,j) \mid g=1)$ is the probability specified in the EM algorithm (see the appendix, below).

Model Fit and Assessment

We assess the fit of data visualization models using Bayesian p-values [12]. This can be formulated as the probability that the information obtained from the model is less than expected under an a posteriori update of the data. Information quantities like those derived below are discussed in [13]. This kind of calculation is not possible for typical MDS models because they are not formulated as statistical models. In the model introduced at the beginning of this section, the information contained in the observed dissimilarity measures, assuming an uninformative prior and ignoring marginal terms, is,

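$$INF(M \mid D) = E\{\log(L) \mid D\} = \sum_{1 \le i < j \le n} E\{\log(L_{i,j}) \mid D\} \tag{2.3}$$

where L denotes the likelihood of the data and

$$L_{i,j} = \frac{1}{(\sigma^{(g)})^{p}} \exp\left\{ -\frac{\left\| D(i,j) - A^{(g)}\left(\mu_i^{(g)} - \mu_j^{(g)}\right) \right\|^2}{2(\sigma^{(g)})^2} \right\}.$$

For the model introduced in section 2, the right hand side of equation (2.3) can be approximated, omitting terms which do not involve the observed dissimilarity measures, by

$$INF(M \mid D) \approx INF_D(M) = -\sum_{1 \le i < j \le n} \hat{P}(i,j; g=1)\, \frac{\left\| D(i,j) - \hat{A}^{(g=1)}\left(\hat{\mu}_i^{(g=1)} - \hat{\mu}_j^{(g=1)}\right) \right\|^2}{2(\hat{\sigma}^{(g=1)})^2} - \sum_{1 \le i < j \le n} \hat{P}(i,j; g=2)\, \frac{\left\| D(i,j) - \hat{A}^{(g=2)}\left(\hat{\mu}_i^{(g=2)} - \hat{\mu}_j^{(g=2)}\right) \right\|^2}{2(\hat{\sigma}^{(g=2)})^2} \tag{2.4}$$

where the hatted quantities are all EM algorithm estimates (see the appendix; see [13] for a more complete discussion of information quantities like that given in equation (2.4)). Posterior updates of the dissimilarities were simulated via:

$$D^*(i,j) \sim \begin{cases} N\left(\hat{A}^{(g=1)}\left(\hat{\mu}_i^{(g=1)} - \hat{\mu}_j^{(g=1)}\right),\ (\hat{\sigma}^{(g=1)})^2 I\right) & \text{w. prob. } \hat{P}(i,j; g=1) \\ N\left(\hat{A}^{(g=2)}\left(\hat{\mu}_i^{(g=2)} - \hat{\mu}_j^{(g=2)}\right),\ (\hat{\sigma}^{(g=2)})^2 I\right) & \text{w. prob. } \hat{P}(i,j; g=2) \end{cases} \quad (1 \le i < j \le n) \tag{2.5}$$

($N(m, v)$ refers to the normal distribution with mean m and variance v.) The posterior Bayes p-value is equal to:

$$\text{Bayes p-value} = P\left( INF_{D^*}(M) < INF_D(M) \mid D \right) \tag{2.6}$$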

(the probability in equation (2.6) being calculated over the distribution specified by equation (2.5)). For the models examined below, the Bayesian p-values were all between 80 and 90%, indicating good fits.

3. ALGORITHMS EMPLOYED FOR MDS DATA VISUALIZATION

We use online gradient descent algorithms to estimate parameters in the MDS approach to data visualization [6]. The gradient of Sammon's energy function with respect to the parametric vector $\mu_i$, restricted to terms involving $\mu_j$, is:

$$\nabla_i(E_{i,j}) = \frac{2\left( \|\mu_i - \mu_j\| - D_{i,j} \right)(\mu_i - \mu_j)}{D_{i,j}\, \|\mu_i - \mu_j\|} \tag{3.1}$$

The analogous quantity with i and j switched is $\nabla_j(E_{i,j}) = -\nabla_i(E_{i,j})$ ($1 \le i < j \le n$). An online gradient descent algorithm can, in theory, be based on an iterative calculation of the µ-vectors using the following iterative steps:

$$\mu_i^{(new)} = \mu_i^{(old)} - \varepsilon\, \nabla_i(E_{i,j}), \qquad \mu_j^{(new)} = \mu_j^{(old)} - \varepsilon\, \nabla_j(E_{i,j}) \tag{3.2}$$

We have already remarked on the problem of training a large number of µ vectors for the purpose of starting the gradient descent and EM algorithms. We show below how to select a small number l of vantage objects $v_1, \dots, v_l$ from among the observed input objects such that Sammon's energy function,

$$E_D(v_1, \dots, v_l) = \sum_{1 \le i < j \le n} \frac{\left( D(i,j) - \sqrt{\sum_{p=1}^{l} \left( D(v_p, F_i) - D(v_p, F_j) \right)^2} \right)^2}{D(i,j)} \tag{3.3}$$

is fairly small. This provides us with well-trained (i.e., well-fitted) starting µ-vectors given by

$$\mu_i^{(0)} = \left( D(v_1, F_i), \dots, D(v_l, F_i) \right); \quad i = 1, \dots, n$$
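The pieces of this section can be put together in a short sketch: build starting vectors from vantage objects, choose the vantage objects greedily by energy, and refine with the pairwise online updates. This is an illustration under stated assumptions rather than the authors' implementation. It assumes Euclidean distances on random toy data, uses a simplified greedy selection that omits the pruning step described later in the text, and carries the factor 2 that comes from differentiating one squared pairwise term of Sammon's energy. All function names, the step size and the number of sweeps are hypothetical choices.

```python
import numpy as np

def sammon_energy(D, mu):
    """Sammon's energy for dissimilarity matrix D and latent rows mu."""
    i, j = np.triu_indices(D.shape[0], k=1)
    d = np.linalg.norm(mu[i] - mu[j], axis=1)
    return float(((D[i, j] - d) ** 2 / D[i, j]).sum())

def vantage_start(D, vantage):
    """Starting vectors mu_i^(0) = (D(v_1, F_i), ..., D(v_l, F_i))."""
    return D[:, list(vantage)].astype(float)

def forward_select(D, l):
    """Greedily add the vantage index that minimizes the energy (no pruning)."""
    chosen = []
    for _ in range(l):
        candidates = [v for v in range(D.shape[0]) if v not in chosen]
        best = min(
            candidates,
            key=lambda v: sammon_energy(D, vantage_start(D, chosen + [v])),
        )
        chosen.append(best)
    return chosen

def online_descent(D, mu, step=0.01, sweeps=50, tiny=1e-12):
    """Visit each pair (i, j) in turn and move mu_i, mu_j along the
    negative gradient of that pair's term of the energy."""
    mu = mu.copy()
    n = D.shape[0]
    for _ in range(sweeps):
        for i in range(n - 1):
            for j in range(i + 1, n):
                diff = mu[i] - mu[j]
                d = max(float(np.linalg.norm(diff)), tiny)  # avoid 0/0
                grad_i = 2.0 * (d - D[i, j]) * diff / (D[i, j] * d)
                mu[i] -= step * grad_i
                mu[j] += step * grad_i  # gradient for j is the negative
    return mu

# Toy data (an assumption for illustration): 10 feature vectors in R^5.
rng = np.random.default_rng(1)
F = rng.normal(size=(10, 5))
D = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)

vantage = forward_select(D, l=2)
mu0 = vantage_start(D, vantage)
mu1 = online_descent(D, mu0)
energy_before = sammon_energy(D, mu0)
energy_after = sammon_energy(D, mu1)
```

Comparing `energy_before` with `energy_after` shows how much the online updates improve on the vantage-based starting configuration. A fixed small step is used here for simplicity; in practice the step size would need tuning to the scale of the dissimilarities.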

Since, for purposes of visualization, l=2 or l=3 is typically sufficient to ensure small values of $E_D(v_1, \dots, v_l)$ for a moderate sized data set, two or three vantage vectors usually suffice in this case [14]. For a large number n of observed data points, vantage objects can be obtained by the stepwise forward selection process described below. We note that this process improves on the ad hoc procedures used heretofore [13].

Stepwise Forward Selection

At each stage s=1,...,l, the stepwise forward selection algorithm selects one new vantage object $v_s$ that is added to the set of previously chosen objects $v_1, \dots, v_{s-1}$. The vantage object $v_s$ is chosen to satisfy:

$$v_s = \arg\min_{v \in A} E_D(v_1, \dots, v_{s-1}, v) \tag{3.4}$$

where $\arg\min_{v \in A}$ denotes the vector in A for which the minimum value of $E_D(v_1, \dots, v_{s-1}, v)$ is reached. At stage s, having chosen the vantage object $v_s$, we prune the objects by comparing the energies $E_D(v_1, \dots, v_{i-1}, v_{i+1}, \dots, v_s)$ (i=1,...,s) with the energy $E_D(v_1, \dots, v_{s-1}, v_s)$. If any of them is smaller than the latter energy, we remove the vantage object $v_i$ and return to the next step of the process.

4. Experimental Results

In this section we examine the performance of the proposed algorithms on various data sets. We begin with the classical Iris data set [7]. The Iris data is composed of 150 vectors, each having 4 components. It is known that there are 3 clusters, each having 50 points; these consist of one clear cluster, denoted by A below, and two clusters, B and C, that are hard to distinguish from one another. We first compare 3 dimensional projections obtained using Sammon's algorithm (cf. equation (1.1)) with input vantage vectors produced via stepwise

forward selection [15]. Figure 1a, below, shows a 3 dimensional projection of the Iris data obtained using the classical Principal Components algorithm for data visualization; the distortion measure for this estimate (computed using Sammon's energy function) is 3,55. Figure 1b, below, shows a 3 dimensional projection of the Iris data obtained using Sammon's data visualization algorithm; the distortion measure (computed using Sammon's energy function) for this estimate is 544. As can be seen, we cannot clearly distinguish between clusters B and C using PCA. By contrast, clusters B and C can be clearly distinguished using Sammon's data visualization algorithm.

Figure 1 (a on the left and b on the right). 3 dimensional projections of the Iris data: (a) obtained by classical PCA, and (b) obtained using Sammon's algorithm for data visualization (employing vantage vectors produced by stepwise forward selection).

We now turn to evaluating two dimensional projections for the data set referred to below as composite movie. Composite movie is composed of 10 shots (each having 10 frames) taken from 4 different movies; these consist of:
a) 4 shots taken from the movie "Mr. Bean's Christmas": frames 1 to 40.
b) 3 shots taken from the movie "House Tour": frames 41 to 70.

c) 2 shots taken from a movie we created (referred to below as "Mov1"): frames 71 to 90.
d) 1 shot from a movie in which Kylie Minogue is interviewed: frames 91 to 100.

The frames can be viewed at the site: Using image processing techniques described in [16], we assign a vector with 7 features to each of the 100 frames. We obtain a data set consisting of one hundred 7 component feature vectors. Composite movie has two hierarchical grouping levels; it can be grouped using shots and, separately, using movies. We expect to distinguish both between the shots and, on a higher level, between the movies.

The best data visualization (cf. figure 2a) for this data set was obtained using the pairwise difference factor analysis mixture model outlined in section 2; we used starting vantage vectors computed using stepwise forward selection. As can be seen in Figure 2a, below, there are 4 clear clusters that belong together in the upper left corner of the figure. They represent the 4 shots grouped to form excerpts from "Mr. Bean's Christmas". In the lower right corner, we see two clear clusters. These are the two shots from the movie referred to as "Mov1". The 3 shots from "House Tour" are represented by the 3 rightmost clusters in the middle of the figure. Figure 2b, below, employs Sammon's data visualization algorithm, using gradient descent (see section 3) with the same vantage vectors. Sammon's data visualization gave a significantly worse picture of the data. This is demonstrated by the fact that the movies are no longer grouped correctly. For example, the four clusters from "Mr. Bean's Christmas" are mixed with clusters from the other movies in the lower right hand quadrant.

Figure 2. The 2 dimensional projections of the composite movie data obtained by (a) the pairwise difference factor analysis mixture algorithm (on the left) and (b) Sammon's algorithm computed using gradient descent (on the right).

5. Conclusions and Future Research

We have introduced stepwise forward selection algorithms and demonstrated their value in providing starting values for factor mixture models and multidimensional scaling algorithms. It has been shown that pairwise difference factor mixture models provide good data visualization for a wide variety of data when vantage vectors, constructed using stepwise forward selection, are used to generate appropriate starting values. Our examples illustrate that factor mixture models frequently provide better data visualization than multidimensional scaling algorithms designed for the same purpose. Their superiority arises as a result of their flexibility in modeling data distortion. We have shown how to assess the fit of factor mixture models and used these results to assess fit in the examples presented above. We would like to extend our current work to include mixture factor models which incorporate intracomponent correlations.

Appendix

Calculations via the EM algorithm needed for the mixture factor difference model:

In this section we describe the Expectation-Maximization (EM) algorithm [11] used to estimate the latent variables and parameters introduced in section 2 above. For purposes of clarity we repeat the formulation of our model:

$$D(i,j) = A^{(\pi)}\left(\mu_i^{(\pi)} - \mu_j^{(\pi)}\right) + \varepsilon_{i,j}; \quad 1 \le i < j \le n \tag{A.1}$$

where:

$\pi$ identifies the particular mixture model component (i.e., $\pi(i,j) = s$ means that the pair $(i,j)$ belongs to mixture component s);

$A^{(\pi)}$ are parametric $p \times q$ matrices indexed by the component $\pi$;

$\mu_i^{(\pi)}$ are parametric $q \times 1$ latent vectors for feature $F_i$, indexed by the component $\pi$ and the observation index i ($1 \le i \le n$);

$\varepsilon_{i,j}$ is the pairwise noise distortion for features $F_i, F_j$ ($1 \le i < j \le n$).

It is assumed below that the errors $\varepsilon_{i,j}$ are normally distributed with common variance $(\sigma^{(\pi)})^2 I$.

In the notation below, $\mu_i^{(g;old)}$ (respectively, $\mu_i^{(g;new)}$) denotes the old or previous value (respectively, new or updated value) of the latent parameter $\mu_i$ for the g-th component (g=1,2; i=1,…,n). Analogous notation is used to characterize the projection matrix A. We also employ the notation $\bar{\mu}_{(-i)}^{(g;old)}$ for the average of the old (or previous) µ-parameters of the g-th component excluding the i-th; similar notation applies to the new (or updated) parameters (g=1,2; i=1,…,n). Then, employing the notation

$$P(i,j;g) = P(D(i,j) \mid g) = \frac{\exp\left\{ -\dfrac{1}{2[\sigma^{(g;old)}]^2} \left\| D(i,j) - A^{(g;old)}\left(\mu_i^{(g;old)} - \mu_j^{(g;old)}\right) \right\|^2 \right\}}{\sum_{\kappa=1}^{2} \exp\left\{ -\dfrac{1}{2[\sigma^{(\kappa;old)}]^2} \left\| D(i,j) - A^{(\kappa;old)}\left(\mu_i^{(\kappa;old)} - \mu_j^{(\kappa;old)}\right) \right\|^2 \right\}} \tag{A.2}$$

for the probability weight attached to the observed pair of dissimilarity measure $D(i,j)$, we update the latent mean vectors $\mu_i^{(g;new)}$ (g=1,2; i=1,…,n) via,

$$\hat{\mu}_i^{(g;new)} = \frac{\sum_{j \ne i} \left( A^{(g;new)\prime} A^{(g;new)} \right)^{-1} A^{(g;new)\prime} \left( D(i,j) + A^{(g;new)} \mu_j^{(g;new)} \right) P(i,j;g)}{\sum_{j \ne i} P(i,j;g)} \tag{A.3}$$

The back projection matrix $A^{(g;new)}$ is updated using the formula,

$$A^{(g;new)} = \left[ \sum_{i<j} D(i,j) \left( \mu_i^{(g;new)} - \mu_j^{(g;new)} \right)' P(i,j;g) \right] \left[ \sum_{i<j} \left( \mu_i^{(g;new)} - \mu_j^{(g;new)} \right)\left( \mu_i^{(g;new)} - \mu_j^{(g;new)} \right)' P(i,j;g) \right]^{-1} \tag{A.4}$$

for g=1,2. We update the variances $(\sigma^{(g;new)})^2$ via,

$$\left(\sigma^{(g;new)}\right)^2 = \frac{\sum_{i<j} \left\| D(i,j) - A^{(g;new)}\left( \mu_i^{(g;new)} - \mu_j^{(g;new)} \right) \right\|^2 P(i,j;g)}{\sum_{i<j} P(i,j;g)} \tag{A.5}$$

Bibliography

[1] Jolliffe, I.T. Principal Component Analysis, Springer-Verlag, 1986.
[2] Sammon, J.W., Jr., A nonlinear mapping for data structure analysis, IEEE Trans. Comput., 1969, 18.
[3] T.F. Cox and M.A. Cox. Multidimensional Scaling, Chapman and Hall, 2001.
[4] Bishop, M., and Tipping, M.E. A Hierarchical Latent Variable Model for Data Visualization, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1983, 20, 3.
[5] Koller, D. Probabilistic Relational Models, invited contribution to Inductive Logic Programming, 9th International Workshop (ILP-99), Saso Dzeroski and Peter Flach, Eds, Springer Verlag, 1999.
[6] McFarlane, M., and Young, F.W., Graphical Sensitivity Analysis for Multidimensional Scaling, Journal of Computational and Graphical Statistics, 1994, 3, 1.
[7] Kohonen, T. Self-organizing maps, Springer-Verlag, New York, 2001.

[8] Faloutsos, C., and Lin, K.-I. FastMap: A fast algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets, Proc. ACM SIGMOD International Conference on Management of Data, 1995.
[9] Mao, J., and Jain, A.K. Artificial Neural Networks for Feature Extraction and Multivariate Data Projection, IEEE Transactions on Neural Networks, 1995, 6.
[10] Lerner, Boaz, Guterman, Hugo, Aladjem, Mayer, Dinstein, Itshak, and Romem, Yitzhak, On pattern classification with Sammon's Nonlinear Mapping - An Experimental Study, Pattern Recognition, 1998, 31.
[11] Laird, N.M., and Rubin, D.B., Maximum likelihood for incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 1977, 39.
[12] Gelman, Carlin, Stern and Rubin, Bayesian Data Analysis, Chapman and Hall.
[13] McLachlan, G., and Peel, D., Finite Mixture Models, Wiley Series in Probability and Statistics, 2000.
[14] Fraley, C., and Raftery, A.E., How Many Clusters? Which clustering method? Answers via Model Based Cluster Analysis, Computer Journal, 1999, 41, pp. 297-306.
[15] Jolliffe, I.T., Principal Components Analysis, Springer series in statistics, 2nd edition, 2002.
[16] Latecki, L.J., and Wildt, D., Automatic Recognition of Unpredictable Events in Videos, Proceedings of the International Conference on Pattern Recognition (ICPR), 2002, 16.