Dimensionality Reduction for Data Visualization

Samuel Kaski and Jaakko Peltonen

Dimensionality reduction is one of the basic operations in the toolbox of data analysts and designers of machine learning and pattern recognition systems. Given a large set of measured variables but few observations, an obvious idea is to reduce the degrees of freedom in the measurements by representing them with a smaller set of more condensed variables. Another reason for reducing the dimensionality is to reduce the computational load in further processing. A third reason is visualization. Looking at the data is a central ingredient of exploratory data analysis, the first stage of data analysis where the goal is to make sense of the data before proceeding with more goal-directed modeling and analyses. It has turned out that although these different tasks seem alike, their solution needs different tools. In this article we show that dimensionality reduction for data visualization can be represented as an information retrieval task, where the quality of visualization can be measured by precision and recall measures and their smoothed extensions, and that visualization can be optimized to directly maximize the quality for any desired tradeoff between precision and recall, yielding very well-performing visualization methods.

HISTORY

Each multivariate observation $x = [x_1, \ldots, x_n]^T$ is a point in an $n$-dimensional space. A key idea in dimensionality reduction is that if the data lies in a $d$-dimensional ($d < n$) subspace of the $n$-dimensional space, and if we can identify the subspace, then there exists a transformation which loses no information and allows the data to be represented in a $d$-dimensional space. If the data lies in a (linear) subspace then the transformation is linear; more generally the data may lie on a $d$-dimensional (curved) manifold and the transformation is nonlinear.

Among the earliest methods are the so-called Multidimensional Scaling (MDS) methods [1], which try to position data points in a $d$-dimensional space such that their pairwise distances are preserved as well as possible. If all pairwise distances are preserved, it can be argued that the data manifold has been identified (up to some transformations). In practice, data of course are noisy, and the solution is found by minimizing a cost function such as the squared loss between the pairwise distances,

$$E_{\mathrm{MDS}} = \sum_{i,j} \left( d(x_i, x_j) - \hat{d}(x_i, x_j) \right)^2,$$

where the $d(x_i, x_j)$ are the original distances between the points $x_i$ and $x_j$, and the $\hat{d}(x_i, x_j)$ are the distances between their representations $\hat{x}_i$ and $\hat{x}_j$ in the $d$-dimensional space. MDS comes in several flavors that differ in their specific form of cost function and additional constraints on the mapping, and some of the choices give familiar methods such as Principal Components Analysis or Sammon's mapping as special cases.
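As a minimal illustration of the cost above, metric MDS can be implemented by handing the squared stress to a generic gradient-based optimizer. The sketch below uses NumPy and SciPy; the function and variable names are our own, not those of any particular MDS package.

# Minimal metric-MDS sketch: minimize sum_ij (d_ij - dhat_ij)^2 over
# 2-D output coordinates. Initialization and optimizer choice are our
# assumptions; production MDS implementations are more sophisticated.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def mds(X, d_out=2, seed=0):
    D = squareform(pdist(X))            # original pairwise distances d(x_i, x_j)
    n = X.shape[0]

    def stress(y_flat):
        Y = y_flat.reshape(n, d_out)
        Dhat = squareform(pdist(Y))     # distances between the representations
        return np.sum((D - Dhat) ** 2)

    rng = np.random.default_rng(seed)
    y0 = rng.normal(size=n * d_out)     # random initial layout
    res = minimize(stress, y0, method="L-BFGS-B")
    return res.x.reshape(n, d_out)

The optimizer here approximates gradients numerically, which is slow but keeps the sketch short; an analytic gradient of the stress is straightforward to add.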
Neural computing methods are other widely used families of manifold embedding methods. So-called Autoencoder Networks (see, e.g., [2]) pass the data vector through a lower-dimensional bottleneck layer in a neural network which aims to reproduce the original vector. The activities of the neurons in the bottleneck layer give the coordinates on the data manifold. Self-Organizing Maps (see [3]), on the other hand, directly learn a discrete representation of a low-dimensional manifold by positioning weight vectors of neurons along the manifold; the result is a discrete approximation to principal curves or manifolds, a nonlinear generalization of principal components [4].

In 2000, a new manifold-learning boom began after the publication of two papers in Science showing how to learn nonlinear data manifolds. Locally Linear Embedding [5] made, as the name reveals, locally linear approximations to the nonlinear manifold. The other, called Isomap [6], is essentially MDS tuned to work along the data manifold. After the manifold has been learned, distances will be computed along the manifold. But plain MDS tries to approximate distances of the data space, which do not follow the manifold, and hence plain MDS will not work in general. That is why Isomap starts by computing distances along the data manifold, approximated by a graph connecting neighbor points. Since only neighbors are connected, the connections are likely to be on the same part of the manifold instead of jumping across gaps to different branches; distances along the neighborhood graph are thus decent approximations of distances along the data manifold, known as geodesic distances. A large number of other approaches have been introduced for learning manifolds during the past ten years, including methods based on spectral graph theory and on simultaneous variance maximization and distance preservation.

CONTROVERSY

Manifold learning research has been criticized for a lack of clear goals. Many papers introduce a new method and only demonstrate its performance with nice images of how it learns a toy manifold. A famous example is the Swiss roll, a two-dimensional data sheet curved in three dimensions into a Swiss roll shape. Many methods have been shown capable of unrolling the Swiss roll, but few have been shown to have real applications, success stories, or even to quantitatively outperform alternative methods. One reason why quantitative comparisons are rare is that the goal of manifold embedding has not always been clearly defined. In fact, manifold learning may have several alternative goals depending on how the learned manifold will be used. We focus on one specific goal, data visualization, intended to help analysts look at the data and find related observations during exploratory data analysis.

Data visualization is traditionally not a well-defined task either. But it is easy to observe empirically [7] that many of the manifold learning methods are not good for data visualization. The reason is that they have been designed to find a $d$-dimensional manifold if the inherent dimensionality of the data is $d$. For visualization, the display needs to have $d = 2$ or $d = 3$; that is, the dimensionality may need to be reduced beyond the inherent dimensionality of the data.

NEW PRINCIPLE

It is well known that a high-dimensional data set cannot in general be faithfully represented in a lower-dimensional space, such as the plane with $d = 2$. Hence a visualization method needs to choose what kinds of errors to make. The choice naturally should depend on the visualization goal; it turns out that under a specific but general goal the choice can be expressed as an interesting tradeoff, as we will describe below.

When the task is to visualize which data points are similar, the visualization can make two kinds of errors (Figure 1): it can miss some similarities (i.e., it can place similar points far apart as false negatives), or it can bring dissimilar data points close together as false positives.
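To make the two error types concrete, here is a small self-contained sketch (with names of our own choosing) that counts true retrievals between k-nearest-neighbor sets in the input space and on the display; precision and recall follow directly from the counts.

# Compare k-nearest-neighbor sets in the input space (the "relevant" points)
# and on the display (the "retrieved" points). Shared neighbors are true
# positives; the rest of the retrieved set are false positives, and the rest
# of the relevant set are misses.
import numpy as np
from scipy.spatial.distance import cdist

def knn_sets(Z, k):
    D = cdist(Z, Z)
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbor
    return [set(np.argsort(row)[:k]) for row in D]

def precision_recall(X, Y, k_in=10, k_out=10):
    relevant = knn_sets(X, k_in)         # neighborhoods P in the input space
    retrieved = knn_sets(Y, k_out)       # neighborhoods Q on the display
    tp = sum(len(p & q) for p, q in zip(relevant, retrieved))
    precision = tp / sum(len(q) for q in retrieved)
    recall = tp / sum(len(p) for p in relevant)
    return precision, recall

Note that the input and display neighborhood sizes may differ; when they are forced equal, precision and recall coincide, so distinct sizes are what make the tradeoff visible in this simplified setting.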
Figure 1: A visualization can have two kinds of errors (from [9]). When a neighborhood P in the high-dimensional input space is compared to a neighborhood Q in the visualization, false positives are points that appear to be neighbors in the visualization but are not in the original space; misses (which could also be called false negatives) are points that are neighbors in the original space but not in the visualization.

If we know the cost of each type of error, the visualization can be optimized to minimize the total cost. Hence, once the user gives the relative cost of misses and false positives, visualization becomes a well-defined optimization task. It turns out [8, 9] that under simplifying assumptions the two costs turn into precision and recall, standard measures between which a user-defined tradeoff is made in information retrieval. Hence, the task of visualizing which points are similar can be formalized as a task of visual information retrieval, that is, retrieval of similar points based on the visualization. The visualization can be optimized to maximize information retrieval performance, involving as an unavoidable element a tradeoff between precision and recall. In summary, visualization can be made into a rigorous modeling task, under the assumption that the goal is to visualize which data points are similar.

When the simplifying assumptions are removed, the neighborhoods are allowed to be continuous-valued probability distributions $p_{ij}$ of point $j$ being a neighbor of point $i$. Then it can be shown that suitable analogues of precision and recall are distances between the neighborhood distributions $p_i$ in the input space and $q_i$ on the display. More specifically, the Kullback-Leibler divergence $D(p_i, q_i)$ reduces under simplifying assumptions to recall, and $D(q_i, p_i)$ to precision. The total cost is then

$$E = \lambda \sum_i D(p_i, q_i) + (1 - \lambda) \sum_i D(q_i, p_i), \qquad (1)$$

where $\lambda$ sets the relative cost of misses and false positives. The display coordinates of all data points are then optimized to minimize this total cost; several nonlinear optimization approaches could be used, and we have simply used conjugate gradient descent. This method has been called NeRV, for Neighbor Retrieval Visualizer [8, 9]. When $\lambda = 1$ the method reduces to Stochastic Neighbor Embedding [10], an earlier method which we now see maximizes recall.
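A minimal sketch of evaluating the cost in (1), assuming the neighborhood distributions have already been collected into row-stochastic matrices P (input space) and Q (display) with zero diagonals; the names are ours.

# NeRV total cost:
#   E = lambda * sum_i D(p_i || q_i) + (1 - lambda) * sum_i D(q_i || p_i),
# where each row of P and Q is a neighborhood distribution over the other points.
import numpy as np

def kl_rows(A, B, eps=1e-12):
    # Sum over i of D(a_i || b_i) for row-stochastic matrices A and B.
    # The small eps keeps the logarithms finite; matching zero entries
    # (e.g., the diagonals) then contribute nothing to the sum.
    A = A + eps
    B = B + eps
    return np.sum(A * (np.log(A) - np.log(B)))

def nerv_cost(P, Q, lam=0.5):
    return lam * kl_rows(P, Q) + (1.0 - lam) * kl_rows(Q, P)

In the actual method the display coordinates that determine Q are optimized to minimize this cost, for example by conjugate gradients as mentioned above; the sketch only evaluates it.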
Figure 2: Tradeoff between precision and recall in visualizing a sphere (from [9]). Left: the three-dimensional location of points on the three-dimensional sphere is encoded into colors and glyph shapes. Center: a two-dimensional visualization that maximizes recall by squashing the sphere flat. All original neighbors remain close by, but false positives (false neighbors) from opposite sides of the sphere also become close by. Right: a visualization that maximizes precision by peeling the sphere surface open. No false positives are introduced, but some original neighbors are missed across the edges of the tear.

Visualization of a simple data distribution makes the meaning of the tradeoff between precision and recall more concrete. When visualizing the surface of a three-dimensional sphere in two dimensions, maximizing recall squashes the sphere flat (Figure 2), whereas maximizing precision peels the surface open. Both solutions are good, but they make different kinds of errors.

Both nonlinear and linear visualizations can be optimized by minimizing (1). The remaining problem is how to define the neighborhoods $p$; in the absence of more knowledge, symmetric Gaussians or more heavy-tailed distributions are justifiable choices. An even better alternative is to derive the neighborhood distributions from probabilistic models that encode our knowledge of the data, both prior knowledge and what was learned from the data. Deriving input similarities from a probabilistic model has recently been done in Fisher Information Nonparametric Embedding [11], where the similarities (distances) approximate Fisher information distances (geodesic distances where the local metric is defined by a Fisher information matrix) derived from nonparametric probabilistic models. In related earlier work [12, 13], approximate geodesic distances were computed in a learning metric derived using Fisher information matrices for a conditional class probability model. In all these works, though, the distances were given to standard visualization methods, which have not been designed for a clear task of visual information retrieval. In contrast, we will combine the model-based input similarities with the rigorous precision-recall approach to visualization. Then the whole procedure corresponds to a well-defined modeling task where the goal is to visualize which data points are similar. We will next discuss this in more detail in two concrete applications.
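Before turning to the applications, a minimal sketch of the symmetric-Gaussian choice for the input neighborhoods mentioned above: each $p_{j|i}$ is proportional to a Gaussian kernel of the distance between points $i$ and $j$. The names, and the single global bandwidth, are our simplifying assumptions.

# Gaussian input neighborhoods: p_{j|i} proportional to
# exp(-d(x_i, x_j)^2 / (2 sigma^2)), normalized over j.
# A single global sigma keeps the sketch short; in practice per-point
# bandwidths chosen to fix an effective neighborhood size are common.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gaussian_neighborhoods(X, sigma=1.0):
    D2 = squareform(pdist(X, "sqeuclidean"))
    W = np.exp(-D2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                  # a point is not its own neighbor
    return W / W.sum(axis=1, keepdims=True)   # each row sums to one

Heavier-tailed kernels can be substituted for the Gaussian without changing anything else in the pipeline.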
APPLICATION 1: VISUALIZATION OF GENE EXPRESSION COMPENDIA FOR RETRIEVING RELEVANT EXPERIMENTS

In the study of molecular biological systems, the behavior of the system can seldom be inferred from first principles, either because such principles are not yet known or because each system is different. The study needs to be data-driven. Moreover, in order to make research cumulative, new experiments need to be placed in the context of earlier knowledge. In the case of data-driven research, a key part of that is retrieval of relevant experiments. An earlier experiment, a set of measurements, is relevant if some of the same biological processes are active in it, either intentionally or as side effects.

In molecular biology it has become standard practice to store experimental data in repositories such as ArrayExpress of the European Bioinformatics Institute EBI. Traditionally, experiments are sought from the repository based on metadata annotations only, which works well when searching for experiments that involve well-annotated and well-known biological phenomena. In the interesting case of studying and modeling new findings, more data-driven approaches are needed, and information retrieval and visualization based on latent variable models are promising tools [14].

Let us assume that in experiment $i$ the data $g_i$ have been measured; in the concrete case below, $g_i$ will be a differential gene expression vector, where $g_{ij}$ is the expression level of gene or gene set $j$ compared to a control measurement. Now if we fit to the compendium a model that generates a probability distribution over the experiments, $p(g_i, z \mid \theta)$, where the $\theta$ are parameters of the model (which we will omit below) and the $z$ are latent variables, this model can be used for retrieval and visualization as explained below. This modeling approach makes sense in particular if the model is constructed such that the latent variables have an interpretation as activities of latent or underlying biological processes which are manifested indirectly as the differential gene expression.

Given the model, relevance can be defined in a natural way as follows: the likelihood of experiment $i$ being relevant for an earlier experiment $j$ is

$$p(g_i \mid g_j) = \int p(g_i \mid z)\, p(z \mid g_j)\, dz.$$

That is, the experiment is relevant if it is likely that the measurements have arisen as products of the same unknown biological processes $z$. This definition of relevance can now be used for retrieving the most relevant experiments and, moreover, the definition can be used as the natural probability distribution $p$ in (1) to construct a visual information retrieval interface (Figure 3); in this case the data are 105 microarray experiments from the ArrayExpress database, comparing pathological samples such as cancer tissues to healthy samples.

Above, the visual information retrieval idea was explained in abstract terms, applicable to many data sources. In the gene expression retrieval case of Figure 3, the data were expressions of a priori defined gene sets, quantized into counts, and the probabilistic model was the Discrete Principal Component Analysis model, also called Latent Dirichlet Allocation, and in the context of texts called a topic model. The resulting relevances can directly be given as inputs to the Neighbor Retrieval Visualizer (NeRV); in Figure 3 a slightly modified variant of the relevances was used; details are in [14]. In summary, fitting a probabilistic latent variable model to the data produces a natural relevance measure which can then be plugged in as a similarity measure in the visualization framework. Everything from start to finish is then based on rigorous choices.
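To make the relevance computation concrete: with a fitted topic model, the integral over latent processes becomes a sum over discrete components. The sketch below assumes a fitted component-profile matrix phi and component activities theta for the earlier experiment; all names are ours and the sketch is a simplification, not the exact variant of [14].

# Relevance of experiment i for an earlier experiment j under a fitted
# topic model: p(g_i | g_j) = integral p(g_i | z) p(z | g_j) dz becomes,
# for discrete components, a mixture sum_k theta_jk * phi_k evaluated on
# the quantized counts of experiment i.
import numpy as np

def log_relevance(counts_i, theta_j, phi):
    # counts_i: quantized differential-expression counts of experiment i (length V)
    # theta_j:  component activities p(z = k | g_j) of experiment j (length K)
    # phi:      component profiles p(gene set v | z = k), shape (K, V)
    mix = theta_j @ phi                       # p(gene set v | g_j), length V
    # Multinomial coefficient omitted: it depends only on counts_i, so it
    # does not affect which earlier experiments are retrieved.
    return np.sum(counts_i * np.log(mix + 1e-12))

Ranking earlier experiments by this score retrieves the most relevant ones, and suitably normalized scores can serve as the neighborhood distributions $p$ in (1).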
Figure 3: A visual information retrieval interface to a collection of microarray experiments visualized as glyphs on a plane (from [14]). A: Glyph locations have been optimized by the Neighbor Retrieval Visualizer so that relevant experiments are close by. For this experiment data, relevance is defined by the same data-driven biological processes being active, as modeled by a latent variable model (component model). B: Enlarged view with annotations; each color bar corresponds to a biological component or process, and the width tells the activity of the component. These experiments are retrieved as relevant for the melanoma experiment shown in the center. C: The biological components (nodes in the middle) link the experiments (left) to the sets of genes (right) activated in them.
APPLICATION 2: VISUALIZATION OF GRAPHS

Graphs are a natural representation of data in several fields where visualizations are helpful: social network analysis, interaction networks in molecular biology, citation networks, and so on. In a sense, graphs are high-dimensional structured data where nodes are points and all other nodes are dimensions; the value of the dimension is the type or strength of the link.
Figure 4: Visualizations of graphs. A: US college football teams (nodes) and who they played against (edges). The visual groups of teams match the 12 conferences arranged for yearly play (shown with different colors). B-C: Word adjacencies in the works of Jane Austen. The nodes are words, and edges mean the words appeared next to each other in the text. The NeRV visualization in B shows visual groups which reveal syntactic word categories: adjectives, nouns, and verbs shown in blue, red, and green. The edge bundles reveal disassortative structure which matches intuition; for example, verbs are adjacent in text to nouns or adjectives and not to other verbs. Earlier graph layout methods (Walshaw's algorithm shown in C) fail to reveal the structure. Figure from [17], © ACM, 2010.

There exist many graph drawing algorithms, including string analogy-based methods such as Walshaw's algorithm [15] and spectral methods [16]. Most of them focus explicitly or implicitly on local properties of graphs, drawing nodes linked by an edge close together while avoiding overlap. That works well for simple graphs, but for large and complicated ones additional principles are needed to avoid the famous hairball visualizations.

A promising direction forward is to learn a probabilistic latent variable model of the graph, in the hope of capturing its central properties, and then focus on visualizing those properties. In the case of graphs, the data to be modeled is which other nodes a node links to. But as the observed links in a network may be stochastic (noisy) measurements, such as gene interaction measurements, it makes sense to assume that the links are a sample from an underlying link distribution, and to learn a probabilistic latent variable model to model the distributions. The similarity of two nodes is then naturally evaluated as the similarity of their link distributions. The rest of the visualization can proceed as in the previous section, with experiments replaced by graph nodes. Figure 4 shows sample graphs visualized based on a variant of Discrete Principal Components Analysis, or Latent Dirichlet Allocation, suitable for graphs. With this link distribution-based approach, the Neighbor Retrieval Visualizer places nodes close by on the display if they link to similar other nodes, with similarity defined as similarity of link distributions. This has the nice side result that links form bundles where all start nodes are similar and all end nodes are similar. In summary, the idea is to use any prior knowledge in choosing a suitable model for the graph, and after that all steps of the visualization follow naturally and rigorously from start to finish. In the absence of prior knowledge, flexible machine learning models such as the Discrete Principal Components Analysis above can be learned from data.
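A minimal sketch of the link-distribution view: the system described above uses a latent variable model, but as a simple stand-in we can smooth the observed adjacency rows into distributions and compare them with a symmetrized divergence. All names below are ours.

# Node similarity as similarity of link distributions: smooth each adjacency
# row into a distribution over endpoints, then compare rows with the
# Jensen-Shannon divergence. Dirichlet smoothing is a crude stand-in for
# the latent variable model used in the text.
import numpy as np

def link_distributions(A, alpha=0.1):
    A = A.astype(float) + alpha                # Dirichlet-smoothed link counts
    return A / A.sum(axis=1, keepdims=True)    # each row sums to one

def js_divergence(p, q, eps=1e-12):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

The resulting pairwise divergences can be fed to a NeRV-style embedding in place of input-space distances, so that nodes linking to similar other nodes end up close by on the display.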
CONCLUSIONS

We have discussed dimensionality reduction for a specific goal, data visualization, which has so far been defined mostly only heuristically. Recently it has been suggested that a specific kind of data visualization task, namely visualization of similarities of data points, can be formulated as a visual information retrieval task, with a well-defined cost function to be optimized. The information retrieval connection further reveals that a tradeoff between misses and false positives needs to be made in visualization, as in all other information retrieval. Moreover, the visualization task can be turned into a well-defined modeling problem by inferring the similarities using probabilistic models that are learned to fit the data. A free software package that solves nonlinear dimensionality reduction as visual information retrieval, with a method called NeRV for Neighbor Retrieval Visualizer, is available at http://www.cs.hut.fi/projects/mi/software/dredviz/.

AUTHORS

Samuel Kaski (samuel.kaski@tkk.fi) is a Professor of Computer Science at Aalto University and Director of the Helsinki Institute for Information Technology HIIT, a joint research institute of Aalto University and the University of Helsinki. He studies machine learning, in particular multi-source machine learning, with applications in bioinformatics, neuroinformatics, and proactive interfaces.

Jaakko Peltonen (jaakko.peltonen@tkk.fi) is a postdoctoral researcher and docent at Aalto University, Department of Information and Computer Science. He received the D.Sc. degree from Helsinki University of Technology in 2004. He is an associate editor of Neural Processing Letters and has served on the program committees of eleven conferences. He studies generative and information theoretic machine learning, especially for exploratory data analysis, visualization, and multi-source learning.

References

[1] I. Borg and P. Groenen, Modern Multidimensional Scaling. New York: Springer, 1997.

[2] G. Hinton, "Connectionist learning procedures," Artificial Intelligence, vol. 40, pp. 185–234, 1989.

[3] T. Kohonen, Self-Organizing Maps. Berlin: Springer, 3rd ed., 2001.

[4] F. Mulier and V. Cherkassky, "Self-organization as an iterative kernel smoothing process," Neural Computation, vol. 7, pp. 1165–1177, 1995.

[5] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323–2326, 2000.

[6] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319–2323, 2000.

[7] J. Venna and S. Kaski, "Comparison of visualization methods for an atlas of gene expression data sets," Information Visualization, vol. 6, pp. 139–154, 2007.
[8] J. Venna and S. Kaski, "Nonlinear dimensionality reduction as information retrieval," in Proceedings of AISTATS*07, the 11th International Conference on Artificial Intelligence and Statistics (JMLR Workshop and Conference Proceedings Volume 2) (M. Meila and X. Shen, eds.), pp. 572–579, 2007.

[9] J. Venna, J. Peltonen, K. Nybo, H. Aidos, and S. Kaski, "Information retrieval perspective to nonlinear dimensionality reduction for data visualization," Journal of Machine Learning Research, vol. 11, pp. 451–490, 2010.

[10] G. Hinton and S. T. Roweis, "Stochastic neighbor embedding," in Advances in Neural Information Processing Systems 14 (T. Dietterich, S. Becker, and Z. Ghahramani, eds.), pp. 833–840, Cambridge, MA: MIT Press, 2002.

[11] K. M. Carter, R. Raich, W. G. Finn, and A. O. Hero III, "FINE: Fisher information nonparametric embedding," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 2093–2098, 2009.

[12] S. Kaski, J. Sinkkonen, and J. Peltonen, "Bankruptcy analysis with self-organizing maps in learning metrics," IEEE Transactions on Neural Networks, vol. 12, pp. 936–947, 2001.

[13] J. Peltonen, A. Klami, and S. Kaski, "Improved learning of Riemannian metrics for exploratory analysis," Neural Networks, vol. 17, pp. 1087–1100, 2004.

[14] J. Caldas, N. Gehlenborg, A. Faisal, A. Brazma, and S. Kaski, "Probabilistic retrieval and visualization of biologically relevant microarray experiments," Bioinformatics, vol. 25, no. 12, pp. i145–i153, 2009.

[15] C. Walshaw, "A multilevel algorithm for force-directed graph drawing," in GD '00: Proceedings of the 8th International Symposium on Graph Drawing, (London, UK), pp. 171–182, Springer-Verlag, 2001.

[16] K. M. Hall, "An r-dimensional quadratic placement algorithm," Management Science, vol. 17, no. 3, pp. 219–229, 1970.

[17] J. Parkkinen, K. Nybo, J. Peltonen, and S. Kaski, "Graph visualization with latent variable models," in Proceedings of MLG-2010, the Eighth Workshop on Mining and Learning with Graphs, (New York, NY, USA), pp. 94–101, ACM, 2010. DOI: http://doi.acm.org/10.1145/1830252.1830265.