Review of Hierarchical Models for Data Clustering and Visualization




Lola Vicente & Alfredo Vellido

Grup de Soft Computing, Secció d'Intel·ligència Artificial, Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya (UPC), Jordi Girona, 1-3, E-08034 Barcelona
{mvicente, avellido}@lsi.upc.es

Abstract. Real data often show some level of hierarchical structure, and their complexity is likely to be underrepresented by a single low-dimensional visualization plot. Hierarchical models organize the data visualization at different levels. Their ultimate goal is to display a representation of the entire data set at the top level, perhaps revealing the presence of clusters, while allowing the lower levels of the hierarchy to display representations of the internal structure within each of the clusters found, providing a definition of lower-level sets of sub-clusters which might not be apparent in the higher-level representation. Several unsupervised hierarchical models are reviewed, divided into two main categories: Heuristic Hierarchical Models, with a focus on Self-Organizing Maps, and Probabilistic Hierarchical Models, mainly based on Gaussian Mixture Models.

1 Introduction

The structural complexity of high-dimensional datasets can hardly be captured by means of single, low-dimensional representations, and information which is structured at different representation levels is likely to escape them. Often, real-world datasets involve some level of hierarchical structure. Hierarchical models organize the data visualization at different levels: their ultimate goal is to display a representation of the entire data set at the top level, perhaps revealing the presence of clusters, while allowing the lower levels of the hierarchy to display representations of the internal structure within each of the clusters found, providing a definition of lower-level sets of sub-clusters which might not be apparent in the higher-level representation. The definition of a hierarchy allows the analyst to drill down into the data in order to discover patterns that might be hidden from other, simpler models (see the example in figure 1). The construction of the hierarchy, in all the models reviewed here, is carried out in a top-down fashion, in a procedure known as divisive hierarchical clustering.

This review separates the unsupervised models under consideration into two categories: Heuristic Hierarchical Models and Probabilistic Hierarchical Models. Heuristic Hierarchical Models focus on variations on the well-known Self-Organizing Map (SOM, [1, 2]). This model has been widely used over the last twenty-odd years due to its powerful visualization properties, despite some of its intrinsic limitations. Probabilistic Hierarchical Models are based on density estimation: Hierarchical Mixtures of Latent Variable Models and the Hierarchical Generative Topographic Mapping will be reviewed.

Approaches to hierarchical data visualization incorporating the SOM entail a hard partition of the data, while probabilistic models allow a soft partitioning in which, at any level of the hierarchy, data points can effectively belong to more than one model.

Fig. 1. An example of a hierarchical model, where more details on the structure of the data are revealed at each level, from Bishop & Tipping [3]

2 Heuristic Hierarchical Models based on the SOM

The SOM is an unsupervised, neural-network-inspired model for clustering and data visualization, in which the prototypes are encouraged to reside in a one- or two-dimensional manifold in the feature space. (Prototype methods represent the training data by a set of points in the feature space; these prototypes are typically not examples from the training sample.) The resulting manifold is also referred to as a constrained topological map, since the original high-dimensional observations are mapped down onto a fixed, ordered, discrete grid in a coordinate system. Kohonen's [1, 2] SOM is an unsupervised neural network providing a mapping from a high-dimensional input space to a one- or two-dimensional output space while preserving topological relations as faithfully as possible. The SOM consists of a set of units, usually arranged in a one- or two-dimensional grid, with a weight vector $m_i \in \mathbb{R}^n$ attached to each unit. The basic training algorithm can be summarized as follows. Observations from the high-dimensional input space, referred to as input vectors $x \in \mathbb{R}^n$, are presented to the SOM, and the activation of each unit for the presented input vector is calculated, usually resorting to an activation function based on the distance between the weight vector of the unit and the input vector. The weight vector of the unit showing the highest activation (smallest distance) is selected as the winner and is modified so as to more closely resemble the presented input vector, by means of a reorientation of the weight vector towards the direction of the input vector, weighted by a time-decreasing learning rate $\alpha$. Furthermore, the weight vectors of units in the neighborhood of the winner are modified accordingly (to a lesser extent than the winner), as described by a time-decreasing neighborhood function. This learning procedure finally leads to a topologically ordered mapping of the presented input signals: similar input data are mapped onto neighboring regions of the map.
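To make the training loop concrete, the following minimal NumPy sketch implements the procedure just described. The grid shape, the decay schedules, and the Gaussian neighborhood function are illustrative choices rather than anything prescribed by the SOM itself:

```python
import numpy as np

def train_som(data, grid_shape=(10, 10), n_iters=5000,
              lr0=0.5, sigma0=3.0, seed=0):
    """Minimal online SOM training loop (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    n_units, dim = rows * cols, data.shape[1]
    # Grid coordinates of each unit, used by the neighborhood function.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    # Weight vectors m_i in R^n, randomly initialized.
    weights = rng.normal(size=(n_units, dim))
    for t in range(n_iters):
        x = data[rng.integers(len(data))]           # present one input vector
        # Activation = distance between weight vector and input;
        # the smallest distance identifies the winner (best matching unit).
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Time-decreasing learning rate and neighborhood radius.
        frac = t / n_iters
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 0.5
        # Gaussian neighborhood: the winner moves most, its neighbors less.
        grid_dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights
```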

Several developments of the basic algorithm have addressed the issue of adaptive SOM structures, amongst them Dynamic Self-Organizing Maps [4], Incremental Grid Growing [5], and Growing Grid [6], where new units are added in map areas where the data are not represented at a satisfying degree of granularity. As mentioned in the introduction, hierarchical models can extract more information from a data set. The SOM has been developed in several ways in order to set it within hierarchical frameworks, which are commonplace as part of more standard statistical clustering procedures.

2.1 Hierarchical Feature Map

The key idea of hierarchical feature maps, as proposed in [7], is to use a hierarchical setup of multiple layers, where each layer consists of a number of independent SOMs. One SOM is used at the root or first layer of the hierarchy. For every unit in this map, a SOM is created in the next layer of the hierarchy. This procedure is repeated in further layers of the hierarchical feature map. A three-layered example is provided in figure 2. The first-layer map consists of 2x2 units, thus we find four independent self-organizing maps on the second layer. Since each map on the second layer consists again of 2x2 units, there are 16 maps on the third layer. The training process of hierarchical feature maps starts with the root SOM on the first layer. This map undergoes standard training. When this first SOM becomes stable, i.e. only minor further adaptation of the weight vectors occurs, training proceeds with the maps in the second layer. Here, each map is trained with only that portion of the input data that is mapped onto the respective unit in the higher-layer map. This way, the amount of training data for a particular SOM is reduced on the way down the hierarchy. Additionally, the vectors representing the input patterns may be shortened on the transition from one layer to the next, due to the fact that some input vector components can be expected to be equal among those input data that are mapped onto the same unit. These equal components may be omitted when training the next-layer maps without loss of information.

Fig. 2. Architecture of a three-layer hierarchical feature map, from Merkl [8]

Hierarchical feature maps have two benefits over the SOM which make this model particularly attractive:

First, hierarchical feature maps entail substantially shorter training times than the standard SOM. The reason for this is twofold: on the one hand, there is the reduction of input vector dimension on the transition from one layer to the next, and shorter input vectors lead directly to reduced training times. On the other hand, SOM training is performed faster because the spatial relation of different areas of the input space is maintained by means of the network architecture rather than by means of the training process.

Second, hierarchical feature maps may be used to produce fairly isolated, i.e. disjoint, clusters of the input data that can be gradually refined when moving down along the hierarchy. In its basic form, the SOM struggles to produce isolated clusters: the separation of data items is a rather tricky task that requires some insight into the structure of the input data. Metaphorically speaking, the standard SOM can be used to produce general maps of the input data, whereas hierarchical feature maps produce an atlas of the input data. The standard SOM provides the user with a snapshot of the data; as long as the map is not too large, this may be sufficient. As maps grow larger, however, they have a tendency to provide too little orientation for the user. In such a case, hierarchical feature maps are advisable as models for data representation.

2.2 Hierarchical SOM

The Hierarchical SOM (HSOM) model usually refers to a tree of maps, the root of which acts as a preprocessor for subsequent layers. As the hierarchy is traversed upwards, the information becomes more and more abstract. Hierarchical self-organizing networks were first proposed by Luttrell [9], who pointed out that although the addition of extra layers might yield a higher distortion in data reconstruction, it might also effectively reduce the complexity of the task. A further advantage is that different kinds of representations become available at different levels of the hierarchy. A multilayer HSOM for clustering was introduced by Lampinen and Oja [10]. In the HSOM, the best matching unit (BMU) of an input vector x is sought from the first-layer map, and its index is given as input to the second-layer map. If more than one data vector concurs within the same unit of the first-layer map, the whole data histogram can be given to the second layer instead of a single index. This approach has been applied to document database management [11]. The HSOM consists of a number of maps organized in a pyramidal structure, such as that displayed in figure 3. Note that there is a strict hierarchy and neighborhood relation implied by this architecture. The size of the pyramid, i.e. the number of levels as well as the size of the maps at each level, has to be decided upon in advance, meaning there is no dynamic growth of new maps based on the training process itself. However, since the training of the pyramid is performed one level at a time, it is theoretically possible to add a further level if required. Furthermore, note that, usually, the number of nodes at the higher levels is small compared to other SOM models using multiple maps. During the training process, the input vectors that are passed down the hierarchy are compressed: if certain vector entries of all input signals that are mapped onto the same node show no or little variance, they are deemed not to contain any additional information for the subordinate map, and thus are not required for training the corresponding sub-tree of the hierarchy. This leads to the definition of different weight vectors for each map, created dynamically as the training proceeds.
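The layer-wise partitioning shared by the hierarchical feature map and the HSOM is easy to express in code. The sketch below reuses the hypothetical train_som helper from the earlier SOM sketch and trains one child map per root unit, each on exactly the data mapped onto that unit; the two-layer depth and map sizes are illustrative assumptions:

```python
import numpy as np

def bmu_indices(data, weights):
    """Index of the best matching unit for every input vector."""
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def train_two_layer_hierarchy(data, root_shape=(2, 2), child_shape=(2, 2)):
    """One root SOM; one child SOM per root unit, each trained only on
    the portion of the input data mapped onto its parent unit."""
    root = train_som(data, grid_shape=root_shape)
    assignments = bmu_indices(data, root)
    children = {}
    for unit in range(root.shape[0]):
        subset = data[assignments == unit]
        if len(subset) > 0:  # the training set shrinks down the hierarchy
            children[unit] = train_som(subset, grid_shape=child_shape)
    return root, children
```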

Fig. 3. Hierarchical SOM architecture: three layers, with 2x2 maps on layers 1 and 2 and a 3x3 map on layer 3, from Koikkalainen & Oja [12]

2.3 Growing Hierarchical SOM

The Growing Hierarchical Self-Organizing Map (GHSOM) [13, 14] is proposed as an extension to the SOM [1, 2] and the HSOM [9] with two issues in mind:

1. The SOM has a fixed network architecture, i.e. the number of units to use as well as the layout of the units has to be determined before training.
2. Input data that are hierarchical in nature should be represented in a hierarchical manner for clarity of representation.

The GHSOM uses a hierarchical structure of multiple layers, where each layer consists of a number of independent SOMs. Only one SOM is used at the first layer of the hierarchy. For every unit in this map, a SOM might be added to the next layer of the hierarchy. This principle is repeated in the third and any further layers of the GHSOM. In order to avoid the SOM's fixed size in terms of the number of units, an incrementally growing version of the SOM is used, similar to the Growing Grid.

Fig. 4. A GHSOM reflecting the hierarchical structure of the input data, from Dittenbach, Merkl & Rauber [13]

The GHSOM grows in two dimensions: in width (by increasing the size of each SOM) and in depth (by increasing the number of levels of the hierarchy). For growing in width, each SOM attempts to modify its layout and increase its total number of units systematically, so that no unit covers too large a portion of the input space. The training proceeds as follows (see the sketch after figure 5):

1. The weights of each unit are initialized with random values.
2. The standard SOM training algorithm is applied.
3. The unit with the largest deviation between its weight vector and the input vectors it represents is chosen as the error unit.
4. A row or a column is inserted between the error unit and its most dissimilar neighboring unit in terms of the input space.
5. Steps 2-4 are repeated until the mean quantization error (MQE) reaches a given threshold, defined as a fraction of the average quantization error of the corresponding unit in the preceding layer of the hierarchy.

Fig. 5. Inserting a row in a SOM, from Dittenbach, Merkl & Rauber [13]

The picture on the left of figure 5 shows the SOM layout before insertion; e is the error unit and d is its most dissimilar neighbor. The picture on the right shows the SOM layout after inserting a row between e and d.
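The following sketch implements a single width-growth step. Using the summed quantization error to pick the error unit and initializing the new units as averages of their neighbors are common choices, stated here as assumptions rather than the exact procedure of [13]:

```python
import numpy as np

def grow_step(W, data):
    """One GHSOM width-growth step on a map W of shape (rows, cols, dim):
    insert a row or column between the error unit e and its most
    dissimilar neighbor d."""
    rows, cols, dim = W.shape
    flat = W.reshape(-1, dim)
    # Map every input vector to its best matching unit.
    bmu = np.argmin(np.linalg.norm(data[:, None] - flat[None], axis=2), axis=1)
    # Quantization error per unit: summed distance of the inputs it represents.
    qe = np.zeros(rows * cols)
    for u in range(rows * cols):
        pts = data[bmu == u]
        if len(pts):
            qe[u] = np.linalg.norm(pts - flat[u], axis=1).sum()
    e = int(np.argmax(qe))                      # the error unit
    er, ec = divmod(e, cols)
    # Most dissimilar direct neighbor of e, measured in input space.
    nbrs = [(er + dr, ec + dc) for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= er + dr < rows and 0 <= ec + dc < cols]
    d = max(nbrs, key=lambda rc: np.linalg.norm(W[er, ec] - W[rc]))
    if d[0] != er:   # insert a row between e and d
        k = max(er, d[0])
        new = (W[k - 1:k] + W[k:k + 1]) / 2.0   # averaged initialization
        W = np.concatenate([W[:k], new, W[k:]], axis=0)
    else:            # insert a column between e and d
        k = max(ec, d[1])
        new = (W[:, k - 1:k] + W[:, k:k + 1]) / 2.0
        W = np.concatenate([W[:, :k], new, W[:, k:]], axis=1)
    return W
```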

As for deepening the hierarchy of the GHSOM, the general idea is to keep checking whether the lowest-level SOMs have achieved sufficient coverage of the underlying input data. The details are as follows:

1. Check the average quantization error of each unit to ensure it is above a certain given threshold; this threshold indicates the desired granularity level of the data representation, as a fraction of the initial quantization error at layer 0.
2. Assign a SOM layer to each unit with an average quantization error greater than the given threshold, and train that SOM with the input vectors mapped to this unit.

The GHSOM provides a convenient way to self-organize inherently hierarchical data into layers, and it gives users the ability to choose the granularity of the representation at the different levels of the hierarchy. Moreover, the GHSOM algorithm automatically determines the architecture of the SOMs at the different levels. This is an improvement over the Growing Grid as well as the HSOM. The drawbacks of this model include the strong dependency of the results on a number of parameters that are not automatically tuned: high thresholds usually result in a flat GHSOM with large individual SOMs, whereas low thresholds result in a deep hierarchy with small maps.

3 Probabilistic Hierarchical Models

Probabilistic models offer a consistent framework to deal with problems that entail uncertainty. When probability theory lies at the foundation of a learning algorithm, the risk that the reasoning performed by it be inconsistent in some cases is lessened ([15, 16]). Next, we present several hierarchical models developed within a probabilistic framework. The presentation of each model is preceded by a brief summary of the theory laying its foundations.

3.1 Gaussian Mixture Modeling

The Gaussian Mixture Model (GMM) is based on density estimation. It is a semi-parametric estimation method, since it defines a very general class of functional forms for the density model in which the number of adaptive parameters can be increased in a systematic way (by adding more components to the model), so that the model can be made arbitrarily flexible. In a mixture model, a probability density function is expressed as a linear combination of basis functions. A model with $M$ components is written in the form

$$p(x) = \sum_{i=1}^{M} P(i)\, p(x \mid i), \qquad (1)$$

where the parameters $P(i)$ are called the mixing coefficients, and the parameters of the component density functions $p(x \mid i)$ typically vary with $i$. To be a valid probability density, a function must be non-negative throughout its domain and integrate to 1 over the whole space. Constraining the mixing coefficients,

$$\sum_{i=1}^{M} P(i) = 1, \qquad 0 \le P(i) \le 1, \qquad (2)$$

and choosing normalized density functions,

$$\int p(x \mid i)\, dx = 1, \qquad (3)$$

guarantees that the model does represent a density function. The mixture model is a generative model, and it is useful to consider the process of generating samples from the density it represents, as they can be considered representatives of the observed data. First, one of the components is chosen at random with probability $P(i)$; thus we can view $P(i)$ as the prior probability of the $i$-th component. Then a data point is generated from the corresponding density $p(x \mid i)$. The corresponding posterior probabilities can be written, using Bayes' theorem, in the form

$$P(i \mid x) = \frac{p(x \mid i)\, P(i)}{p(x)}, \qquad (4)$$

where $p(x)$ is given by (1). These posterior probabilities satisfy the constraints

$$\sum_{i=1}^{M} P(i \mid x) = 1, \qquad 0 \le P(i \mid x) \le 1. \qquad (5)$$

It only remains to decide on the form of the component densities. These could be Gaussian distributions with different forms of covariance matrix:

Spherical. The covariance matrix is a scalar multiple of the identity matrix,

$$\Sigma_i = \sigma_i^2 I, \qquad (6)$$

so that

$$p(x \mid i) = \frac{1}{(2\pi\sigma_i^2)^{d/2}} \exp\!\left(-\frac{\lVert x - \mu_i \rVert^2}{2\sigma_i^2}\right). \qquad (7)$$

Diagonal. The covariance matrix is diagonal, $\Sigma_i = \mathrm{diag}(\sigma_{i1}^2, \ldots, \sigma_{id}^2)$, and the density function is

$$p(x \mid i) = \frac{1}{(2\pi)^{d/2} \prod_{j=1}^{d} \sigma_{ij}} \exp\!\left(-\sum_{j=1}^{d} \frac{(x_j - \mu_{ij})^2}{2\sigma_{ij}^2}\right). \qquad (8)$$

Full. The covariance matrix is allowed to be any positive definite $d \times d$ matrix, and the density function is

$$p(x \mid i) = \frac{1}{(2\pi)^{d/2} \lvert \Sigma_i \rvert^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right). \qquad (9)$$

Each of these models is a universal approximator, in that it can model any density function arbitrarily closely, provided that it contains enough components. Usually a mixture model with full covariance matrices will need fewer components to model a given density, but each component will have more adjustable parameters. The method for determining the parameters of a Gaussian mixture model from a data set is based on the maximization of a data likelihood function. It is usually convenient to recast the problem in the equivalent form of minimizing the negative log-likelihood of the data set,

$$E = -L = -\sum_{n=1}^{N} \log p(x^n), \qquad (10)$$

which is treated as an error function. Because the likelihood is a differentiable function of the parameters, it is possible to use a general-purpose non-linear optimizer or, alternatively, the expectation-maximization (EM) algorithm [17]. The latter is usually faster to converge than general-purpose algorithms, and it is particularly suitable for dealing with incomplete data.
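As an aside, the three covariance structures above map directly onto the covariance_type parameter of scikit-learn's GaussianMixture, which fits mixtures by EM. A minimal example on synthetic data, assuming scikit-learn is available (blob locations and sizes are arbitrary):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: two well-separated blobs.
data = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                  rng.normal(5.0, 0.5, size=(200, 2))])

# The spherical, diagonal, and full covariance forms discussed above.
for cov in ("spherical", "diag", "full"):
    gmm = GaussianMixture(n_components=2, covariance_type=cov).fit(data)
    # score() is the mean log-likelihood per sample, i.e. -E/N in the
    # notation of equation (10).
    print(cov, gmm.weights_.round(2), round(gmm.score(data), 3))

# Posterior probabilities P(i|x) of equation (4), the "responsibilities",
# here for the last fitted (full-covariance) model:
resp = gmm.predict_proba(data[:5])
```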

3.2 Hierarchical Mixture Models

Bishop and Tipping [3] introduced the concept of hierarchical visualization for probabilistic PCA. By considering a probabilistic mixture of latent variable models, we obtain a soft partition of the data set at the top level of the hierarchy into clusters, corresponding to the second level of the hierarchy. Subsequent levels, obtained using nested mixture representations, provide increasingly refined models of the data set. The construction of the hierarchical tree proceeds top-down and can be driven interactively by the user. At each stage of the algorithm, the relevant model parameters are determined using the expectation-maximization (EM) algorithm [17]. The density model for a mixture of latent variable models takes the form

$$p(t) = \sum_{i=1}^{M_0} \pi_i\, p(t \mid i), \qquad (11)$$

where $M_0$ is the number of components of the mixture, and the parameters $\pi_i$ are the mixing coefficients, or prior probabilities, corresponding to the mixture components $p(t \mid i)$. Each component is an independent latent variable model with parameters $\mu_i$, $W_i$ and $\sigma_i^2$. The hierarchical mixture model is a two-level structure consisting of a single latent variable model at the top level and a mixture of $M_0$ such models at the second level. The hierarchy can be extended to a third level by associating a group $G_i$ of latent variable models with each model $i$ in the second level. The corresponding probability density can be written in the form

$$p(t) = \sum_{i=1}^{M_0} \pi_i \sum_{j \in G_i} \pi_{j \mid i}\, p(t \mid i, j), \qquad (12)$$

where the $p(t \mid i, j)$ again represent independent latent variable models, and the $\pi_{j \mid i}$ correspond to sets of mixing coefficients, one set for each $i$, satisfying $\sum_j \pi_{j \mid i} = 1$.
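A direct transcription of equation (12) is straightforward. In the sketch below, plain Gaussians stand in for the latent variable models p(t | i, j); this is an illustrative simplification, since Bishop and Tipping use probabilistic PCA components:

```python
import numpy as np
from scipy.stats import multivariate_normal

def hierarchical_density(t, pi, pi_cond, components):
    """Evaluate p(t) = sum_i pi_i * sum_j pi_{j|i} * p(t | i, j).

    pi         : mixing coefficients of the M0 top-level groups
    pi_cond    : pi_cond[i][j] = pi_{j|i}, summing to 1 over j for each i
    components : components[i][j] = (mean, cov) of the (i, j) model
    """
    total = 0.0
    for i, pi_i in enumerate(pi):
        for j, pij in enumerate(pi_cond[i]):
            mean, cov = components[i][j]
            total += pi_i * pij * multivariate_normal.pdf(t, mean, cov)
    return total

# Toy usage: two top-level groups, each with two sub-models.
pi = [0.5, 0.5]
pi_cond = [[0.7, 0.3], [0.4, 0.6]]
components = [[(np.zeros(2), np.eye(2)), (np.ones(2), np.eye(2))],
              [(np.full(2, 4.0), np.eye(2)), (np.full(2, 5.0), np.eye(2))]]
print(hierarchical_density(np.array([0.5, 0.5]), pi, pi_cond, components))
```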

Thus, each level of the hierarchy corresponds to a generative model, with lower levels giving more refined and detailed representations. Determination of the parameters of the models at the third level can again be viewed as a missing-data problem, in which the missing information corresponds to labels specifying which model generated each data point.

3.3 Generative Topographic Mapping

The aim of the Generative Topographic Mapping (GTM, [18]), a probabilistic alternative to the SOM that also resorts to Bayesian statistics, is to allow a non-linear transformation from latent space to data space while keeping the model computationally tractable. In this approach, the data are modeled by a mixture of Gaussians (although alternative distributions can be used), in which the centers of the Gaussians are constrained to lie on a lower-dimensional manifold. The topographic nature of the mapping comes about because the kernel centers in the data space preserve the structure of the latent space. By adequate selection of the form of the non-linear mapping, it is possible to train the model using a generalization of the EM algorithm. The GTM provides a well-defined objective function (something that the SOM lacks), and its optimization, using either standard non-linear techniques or the EM algorithm, has been proved to converge. As part of this process, the calculation of the GTM learning parameters is grounded on a sound theoretical basis. Bayesian theory can be used in the GTM to calculate the posterior probability of each point in latent space being responsible for each point in data space, instead of the SOM's sharp map-unit membership attribution for each data point. The GTM belongs to a family of latent space models that model a probability distribution in the (observable) data space by means of latent, or hidden, variables. The latent space is used to visualize the data, and is usually a discrete square grid on the two-dimensional Euclidean space. The GTM creates a generative probabilistic model in the data space by placing radially symmetric Gaussian noise, with zero mean and inverse variance $\beta$, around the images of the latent points.
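The generative construction can be sketched compactly: a regular grid of latent points is pushed through an RBF mapping y = W·φ(x) into data space, and each image becomes the center of a spherical Gaussian. In the sketch below, W is random rather than fitted by EM, and the grid sizes and RBF width are illustrative assumptions:

```python
import numpy as np

def gtm_density(T, K_side=10, n_rbf=16, beta=1.0, dim=3, seed=0):
    """Evaluate an (untrained) GTM density at the points in T (N x dim)."""
    rng = np.random.default_rng(seed)
    # Regular grid of K latent points in [-1, 1]^2.
    g = np.linspace(-1, 1, K_side)
    X = np.array([(a, b) for a in g for b in g])            # (K, 2)
    # RBF centers on a coarser grid, plus a bias column.
    s = np.linspace(-1, 1, int(np.sqrt(n_rbf)))
    C = np.array([(a, b) for a in s for b in s])
    Phi = np.exp(-((X[:, None] - C[None]) ** 2).sum(-1) / 0.5)
    Phi = np.hstack([Phi, np.ones((len(X), 1))])
    W = rng.normal(scale=0.1, size=(Phi.shape[1], dim))     # would be fit by EM
    Y = Phi @ W                                             # Gaussian centers
    # p(t) = (1/K) * sum_k N(t | y_k, beta^{-1} I)
    d2 = ((T[:, None] - Y[None]) ** 2).sum(-1)              # (N, K)
    norm = (beta / (2 * np.pi)) ** (dim / 2)
    return norm * np.exp(-0.5 * beta * d2).mean(axis=1)
```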

3.4 Hierarchical GTM

The probabilistic definition of the GTM allows its extension to a hierarchical setting in a straightforward and principled way [19]. The Hierarchical GTM (HGTM) models the whole data set at the top level, and then breaks it down into clusters at deeper levels of the hierarchy. The hierarchy is defined as follows: the HGTM arranges a set of GTMs and their corresponding plots in a tree structure $T$. The Root is considered to be at level 1, i.e. Level(Root) = 1. Children of a model $N$ with Level($N$) = $\ell$ are at level $\ell + 1$, i.e. Level($M$) = $\ell + 1$ for all $M \in$ Children($N$). Each model $M$ in the hierarchy, except for the Root, has an associated parent-conditional mixture coefficient, i.e. a prior distribution $p(M \mid \mathrm{Parent}(M))$. The priors are non-negative and satisfy the consistency condition

$$\sum_{M \in \mathrm{Children}(N)} p(M \mid N) = 1. \qquad (13)$$

Unconditional priors for the models are recursively calculated as follows: $p(\mathrm{Root}) = 1$, and for other models

$$p(M) = \prod_{i=2}^{\mathrm{Level}(M)} p\big(\mathrm{Path}_i(M) \mid \mathrm{Path}_{i-1}(M)\big), \qquad (14)$$

where $\mathrm{Path}(M) = (\mathrm{Root}, \ldots, M)$ is the $N$-tuple ($N$ = Level($M$)) of nodes defining the path in $T$ from Root to $M$. The distribution associated with the hierarchical model is a mixture of leaf models,

$$p(t \mid T) = \sum_{M \in \mathrm{Leaves}(T)} p(M)\, p(t \mid M). \qquad (15)$$

The training of the HGTM is straightforward and proceeds in a recursive fashion (top-down):

1. A root GTM is trained and used to generate an overall visualization of the data set.
2. The user identifies regions of interest on the visualization map.
3. These regions of interest are transformed into the data space and form the basis for building a collection of new, child GTMs.
4. The EM algorithm works with responsibilities (posterior probabilities of unit membership given the data observations) moderated by the parent-conditional priors previously described.
5. After assessing the lower-level visualization maps, the user may decide to proceed further and model in greater detail some specific portions of these.

An automated initialization, resorting to Minimum Description Length (MDL) principles, can be implemented to choose the number and location of the sub-models.
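The prior bookkeeping of equations (13) and (14) can be illustrated with a toy tree; the node names and probabilities below are hypothetical:

```python
def unconditional_priors(tree, cond_prior, node="Root", p_parent=1.0, out=None):
    """Recursively compute p(M) as the product of parent-conditional
    priors along the path from Root to M, following equation (14)."""
    if out is None:
        out = {}
    out[node] = p_parent
    for child in tree.get(node, []):
        # p(child | node) over the children of node must sum to one (13).
        unconditional_priors(tree, cond_prior, child,
                             p_parent * cond_prior[child], out)
    return out

# Hypothetical three-level hierarchy: Root -> {A, B}, A -> {A1, A2}.
tree = {"Root": ["A", "B"], "A": ["A1", "A2"]}
cond = {"A": 0.6, "B": 0.4, "A1": 0.5, "A2": 0.5}
priors = unconditional_priors(tree, cond)
# The leaf priors weight the leaf GTM densities in equation (15),
# and they sum to one over the leaves of the tree.
leaves = ["A1", "A2", "B"]
assert abs(sum(priors[m] for m in leaves) - 1.0) < 1e-12
```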

4 Conclusions

In this brief paper, we have reviewed a number of recent advances in the development of unsupervised hierarchical models for data visualization and clustering. Setting data exploration elements, such as clustering and visualization, within a hierarchical framework augments the amount of information about a data set that models manage to convey. Most real-world problems entail complex data sets that seldom provide enough information in a single snapshot, and interactive hierarchical methods are more likely to provide adequate insight into the fine details of the structure of data patterns. Two sub-groups of models have been considered: Heuristic Hierarchical Models and Probabilistic Hierarchical Models. Many advantages can be expected from the definition of data analysis models according to principled probabilistic theory, amongst them the possibility of developing these models in a coherent way.

References

1. Kohonen, T.: Self-Organizing Maps. Berlin: Springer Verlag (1995)
2. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), (1982) 59-69
3. Bishop, C.M. & Tipping, M.E.: A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), (1998) 281-293
4. Alahakoon, D., Halgamuge, S.K. & Srinivasan, B.: Dynamic self-organizing maps with controlled growth for knowledge discovery. IEEE Transactions on Neural Networks, 11(3), (2000) 601-614
5. Blackmore, J. & Miikkulainen, R.: Incremental grid growing: Encoding high-dimensional structure into a two-dimensional feature map. In Proceedings of the International Conference on Neural Networks (ICANN'93), San Francisco, CA (1993) 450-455
6. Fritzke, B.: Growing Grid: a self-organizing network with constant neighborhood range and adaptation strength. Neural Processing Letters, 2(5), (1995) 9-13
7. Miikkulainen, R.: Script recognition with hierarchical feature maps. Connection Science, 2, (1990) 83-101
8. Merkl, D.: Exploration of text collections with hierarchical feature maps. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'97), Philadelphia, PA, USA (1997)
9. Luttrell, S.P.: Hierarchical self-organizing networks. In Proceedings of the International Conference on Neural Networks (ICANN'89), London, U.K. (1989) 2-6
10. Lampinen, J. & Oja, E.: Clustering properties of hierarchical self-organizing maps. Journal of Mathematical Imaging and Vision, 2, (1992) 261-272
11. Kohonen, T., Kaski, S., Lagus, K. & Honkela, T.: Very large two-level SOM for the browsing of newsgroups. In Proceedings of the International Conference on Neural Networks (ICANN'96), Bochum, Germany (1996) 269-274
12. Koikkalainen, P. & Oja, E.: Self-organizing hierarchical feature maps. In Proceedings of the International Joint Conference on Neural Networks (IJCNN'90), San Diego, California, USA (1990) 279-284
13. Dittenbach, M., Merkl, D. & Rauber, A.: The growing hierarchical self-organizing map. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2000), Como, Italy, vol. 6 (2000) 15-19
14. Dittenbach, M., Rauber, A. & Merkl, D.: Uncovering the hierarchical structure in data using the growing hierarchical self-organizing map. Neurocomputing, 48(1-4), (2002) 199-216
15. Jaynes, E.: Probability Theory: The Logic of Science. Cambridge University Press (2003)
16. Cerquides, J.: Improving Bayesian Network Classifiers. PhD Thesis, U.P.C., Barcelona, Spain (2003)
17. Dempster, A.P., Laird, N.M. & Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, (1977) 1-38
18. Bishop, C.M., Svensén, M. & Williams, C.K.I.: GTM: the generative topographic mapping. Neural Computation, 10(1), (1998) 215-234
19. Tiňo, P. & Nabney, I.: Hierarchical GTM: constructing localized non-linear projection manifolds in a principled way. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), (2002) 639-656