Visualizing Similarity Data with a Mixture of Maps

James Cook, Ilya Sutskever, Andriy Mnih and Geoffrey Hinton
Department of Computer Science
University of Toronto
Toronto, Ontario M5S 3G4

Abstract

We show how to visualize a set of pairwise similarities between objects by using several different two-dimensional maps, each of which captures different aspects of the similarity structure. When the objects are ambiguous words, for example, different senses of a word occur in different maps, so river and loan can both be close to bank without being at all close to each other. Aspect maps resemble clustering because they model pairwise similarities as a mixture of different types of similarity, but they also resemble local multi-dimensional scaling because they model each type of similarity by a two-dimensional map. We demonstrate our method on a toy example, a database of human word-association data, a large set of images of handwritten digits, and a set of feature vectors that represent words.

1 Introduction

Given a large set of objects and the pairwise similarities between them, it is often useful to visualize the similarity structure by arranging the objects in a two-dimensional space in such a way that similar pairs lie close together. Methods like principal components analysis (PCA) or metric multi-dimensional scaling (MDS) [2] are simple and fast, but they minimize a cost function that is far more concerned with modeling the large dissimilarities than the small ones. Consequently, they do not provide good visualizations of data that lies on a curved low-dimensional manifold in a high-dimensional space, because they do not reflect the distances along the manifold [8]. Local MDS [7] and some more recent methods such as locally linear embedding (LLE) [6], maximum variance unfolding [9], or stochastic neighbour embedding (SNE) [3] attempt to model local distances (strong similarities) accurately in the two-dimensional visualization, at the expense of modeling larger distances (small similarities) inaccurately. The SNE objective function is difficult to optimize efficiently, but it leads to much better solutions than methods such as LLE, and because SNE is based on a probabilistic model, it suggests a new approach to producing better visualizations: instead of using just one two-dimensional map as a model of the similarities between objects, use many different two-dimensional maps and combine them into a single model of the similarity data by treating them as a mixture model. This is not at all the same as finding, say, a four-dimensional map and then displaying two orthogonal two-dimensional projections [6]. In that case, the four-dimensional map is the product of the two two-dimensional maps, and a projection can be very misleading because it can put points that are far apart in 4-D close together in 2-D. In a mixture of maps, being close together in any map means that two objects really are similar under the mixture model.

2 Stochastic Neighbor Embedding

SNE starts by converting high-dimensional distance or similarity data into a set of conditional probabilities of the form p_{j|i}, each of which is the probability that one object, i, would stochastically pick another object, j, as its neighbor if it was only allowed to pick one neighbor. These conditional probabilities can be produced in many ways. In the word association data we describe later, subjects are asked to pick an associated word, so p_{j|i} is simply the fraction of the subjects who pick word j when given word i.
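For concreteness, here is a minimal numpy sketch of that conversion. The counts matrix and the function name are our own illustrative choices, not something taken from the paper:

import numpy as np

def association_probs(counts):
    """p_{j|i}: the fraction of subjects who picked word j when given cue word i.
    counts[i, j] holds how often word j was produced in response to cue i."""
    counts = counts.astype(float)
    np.fill_diagonal(counts, 0.0)              # a cue never picks itself
    sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(sums, 1e-12)    # guard against empty rows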
If the data consists of the coordinates of objects in a high-dimensional Euclidean space, it can be converted into a set of conditional probabilities of the form p_{j|i} for each object i by using a spherical Gaussian distribution centered at the high-dimensional position of i, x_i, as shown in figure 1. We set p_{i|i} = 0, and for j ≠ i,

p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}    (1)
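As a concrete companion to Eq. 1, here is a minimal numpy sketch of the conditional probabilities, together with the entropy-matching binary search for σ_i that is described just below. All names are our own; this is a sketch, not the implementation used in the paper:

import numpy as np

def row_probs(X, i, sigma):
    """Eq. 1: the distribution P_i = p_{.|i} for a single object i."""
    d2 = np.square(X - X[i]).sum(axis=1)        # squared distances to x_i
    logits = -d2 / (2.0 * sigma ** 2)
    logits[i] = -np.inf                         # enforces p_{i|i} = 0
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def entropy_bits(p):
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

def calibrate_sigma(X, i, M=30, tol=1e-4, iters=50):
    """Binary search on sigma_i until the entropy of P_i is within
    tol of log2(M) bits."""
    target, lo, hi, s = np.log2(M), 0.0, np.inf, 1.0
    for _ in range(iters):
        h = entropy_bits(row_probs(X, i, s))
        if abs(h - target) < tol:
            break
        if h > target:                          # too flat: shrink the Gaussian
            hi, s = s, 0.5 * (lo + s)
        else:                                   # too peaked: widen it
            lo, s = s, (2.0 * s if np.isinf(hi) else 0.5 * (s + hi))
    return s

# Usage:
# P = np.stack([row_probs(X, i, calibrate_sigma(X, i)) for i in range(len(X))])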
Figure 1: A spherical Gaussian distribution centered at x_i defines a probability density at each of the other points. When these densities are normalized, we get a probability distribution, P_i, over all of the other points that represents their similarity to i.

Figure 2: A circular Gaussian distribution centered at y_i defines a probability density at each of the other points. When these densities are normalized, we get a probability distribution over all of the other points that is our low-dimensional model, Q_i, of the high-dimensional P_i.

The same equation can be used if we are only given the pairwise distances between objects, \|x_i - x_j\|. The variance of the Gaussian, σ_i^2, can be adjusted to vary the entropy of the distribution P_i, which has p_{j|i} as a typical term. If σ_i^2 is very small the entropy will be close to 0, and if it is very large the entropy will be close to log_2(N - 1), where N is the number of objects. We typically pick a number M ≪ N and adjust σ_i^2 by binary search until the entropy of P_i is within some small tolerance of log_2 M.

The goal of SNE is to model the p_{j|i} by using conditional probabilities, q_{j|i}, that are determined by the locations y_i of points in a low-dimensional space, as shown in figure 2:

q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}    (2)

For each object, i, we can associate a cost with a set of low-dimensional y locations by using the Kullback–Leibler divergence to measure how well the distribution Q_i models the distribution P_i:

C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}    (3)

To improve the model, we can move each y_i in the direction of steepest descent of C. It is shown in [3] that this gradient optimization has a very simple physical interpretation (see figure 3). y_i is attached to each y_j by a spring which exerts a force in the direction y_i - y_j. The magnitude of this force is proportional to the length of the spring, \|y_i - y_j\|, and it is also proportional to the spring stiffness, which equals the mismatch (p_{j|i} - q_{j|i}) + (p_{i|j} - q_{i|j}). Steepest descent in the cost function corresponds to following the dynamics defined by these springs, but notice that the spring stiffnesses keep changing.

Starting from small random y values, steepest descent finds a local minimum of C. Better local minima can be found by adding Gaussian noise to the y values after each update. Starting with a high noise level, we decay the noise fairly rapidly to find the approximate noise level at which structure starts to form in the low-dimensional map. A good indicator of the emergence of structure is that a small decrease in the noise level leads to a large decrease in the cost function. Then we repeat the process, starting the noise level just above the level at which structure emerges and annealing it much more gently. This allows us to find low-dimensional maps that are significantly better minima of C.

2.1 Symmetric SNE

The version of SNE introduced by [3] is based on minimizing the divergences between conditional distributions. An alternative is to define a single joint distribution over all non-identical ordered pairs:

p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k<l} \exp(-\|x_k - x_l\|^2 / 2\sigma^2)}    (4)

q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k<l} \exp(-\|y_k - y_l\|^2)}    (5)

C_{sym} = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}    (6)

This leads to simpler derivatives, but if one of the high-dimensional points, j, is far from all the others, all of the p_{ij} will be very small. To overcome this problem it is possible to replace Eq. 4 by p_{ij} = 0.5(p_{j|i} + p_{i|j}), where p_{j|i} and p_{i|j} are defined using Eq. 1.
When j is far from all the other points, all of the p_{j|i} will be very small, but the p_{i|j} will sum to 1. Even when p_{ij} is defined by averaging the conditional probabilities, we still get good low-dimensional maps using the derivatives given by Eqs. 5 and 6.
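A sketch of the symmetric quantities in Eqs. 4–6, using the averaged definition of p_{ij} just given. Two choices here are our own: we divide the averaged conditionals by N so that the joint sums to 1, and we normalize over ordered rather than unordered pairs, which leaves the KL divergence of Eq. 6 unchanged because the factor of 2 cancels:

import numpy as np

def joint_p(P_cond):
    """Symmetrized joint: average p_{j|i} and p_{i|j} (and divide by N
    so that the entries sum to 1 -- our normalization choice)."""
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * N)

def joint_q(Y):
    """Eq. 5: a single joint distribution over non-identical pairs of
    low-dimensional points, normalized over ordered pairs."""
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    W = np.exp(-D2)
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def cost_sym(P, Q, eps=1e-12):
    """Eq. 6: KL(P || Q) summed over all ordered pairs."""
    return float((P * (np.log(P + eps) - np.log(Q + eps))).sum())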
Figure 3: The gradient of the cost function in Eq. 3 with respect to y_i has a physical interpretation as the resultant force produced by springs attaching y_i to each of the other points. The spring between i and j exerts a force that is proportional to its length and is also proportional to (p_{j|i} - q_{j|i}) + (p_{i|j} - q_{i|j}).

3 Aspect Maps

Instead of using a single two-dimensional map to define q_{j|i}, we can allow i and j to occur in several different two-dimensional maps. Each object, i, has a mixing proportion π_i^m in each map, m, and the mixing proportions are constrained to add to 1 for each object: Σ_m π_i^m = 1. The different maps combine to define q_{j|i} as follows:

q_{j|i} = \frac{\sum_m \pi_i^m \pi_j^m e^{-d_{i,j}^m}}{z_i}, \quad \text{where} \quad d_{i,j}^m = \|y_i^m - y_j^m\|^2, \quad z_i = \sum_{h \neq i} \sum_{m'} \pi_i^{m'} \pi_h^{m'} e^{-d_{i,h}^{m'}}    (7)

Provided there is at least one map in which i is close to j, and provided the versions of i and j in that map have high mixing proportions, it is possible for q_{j|i} to be quite large even if i and j are far apart in all the other maps. In this respect, using a mixture model is very different from simply using a single space that has extra dimensions, because points that are far apart on one dimension cannot have a high q_{j|i} no matter how close together they are on the other dimensions.

To optimize the aspect maps model, we used Carl Rasmussen's minimize function, which is available at www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/. The gradients are derived below. Writing the cost of Eq. 3 as

C = \sum_k \log z_k - \sum_k \sum_{l \neq k} p_{l|k} \log \sum_m \pi_k^m \pi_l^m e^{-d_{k,l}^m} + \text{const}

and differentiating with respect to a mixing proportion gives

\frac{\partial C}{\partial \pi_i^m} = \sum_{j \neq i} \left[ \frac{q_{j|i} - p_{j|i}}{q_{j|i}\, z_i} + \frac{q_{i|j} - p_{i|j}}{q_{i|j}\, z_j} \right] \pi_j^m e^{-d_{i,j}^m}

Rather than using the mixing proportions π_i^m themselves as parameters of the model, we defined parameters w_i^m and set

\pi_i^m = \frac{e^{w_i^m}}{\sum_{m'} e^{w_i^{m'}}}

This gives us the gradient

\frac{\partial C}{\partial w_i^m} = \pi_i^m \left( \frac{\partial C}{\partial \pi_i^m} - \sum_{m'} \pi_i^{m'} \frac{\partial C}{\partial \pi_i^{m'}} \right)

The distance between points i and j in map m appears as both d_{i,j}^m and d_{j,i}^m. If y_{i,c}^m denotes the cth coordinate of y_i^m, we have

\frac{\partial C}{\partial y_{i,c}^m} = 2 \sum_j \left( \frac{\partial C}{\partial d_{i,j}^m} + \frac{\partial C}{\partial d_{j,i}^m} \right) (y_{i,c}^m - y_{j,c}^m)
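Before differentiating with respect to the distances, here is a minimal numpy sketch of Eq. 7 together with the softmax parameterization of the mixing proportions (our own naming; a sketch under the definitions above, not the authors' code):

import numpy as np

def mixture_q(Ys, W):
    """Eq. 7: q_{j|i} defined by M two-dimensional maps.
    Ys: (M, N, 2) coordinates of each of the N objects in each map.
    W:  (N, M) unconstrained weights; a softmax of each row gives pi_i^m."""
    M, N, _ = Ys.shape
    Pi = np.exp(W - W.max(axis=1, keepdims=True))
    Pi /= Pi.sum(axis=1, keepdims=True)                # rows sum to 1
    num = np.zeros((N, N))
    for m in range(M):
        D2 = np.square(Ys[m][:, None, :] - Ys[m][None, :, :]).sum(-1)  # d^m_{i,j}
        num += np.outer(Pi[:, m], Pi[:, m]) * np.exp(-D2)  # pi_i^m pi_j^m e^{-d}
    np.fill_diagonal(num, 0.0)                         # exclude j = i
    z = num.sum(axis=1, keepdims=True)                 # z_i
    return num / z                                     # q_{j|i}

A pair that is close together in any map where both points have large mixing proportions contributes a large term to the numerator, which is exactly the behaviour described above.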
Differentiating the same rewritten cost with respect to one of the distances gives

\frac{\partial C}{\partial d_{i,j}^m} = \pi_i^m \pi_j^m e^{-d_{i,j}^m} \left( \frac{p_{j|i}}{q_{j|i}\, z_i} - \frac{1}{z_i} \right) = \frac{(p_{j|i} - q_{j|i})\, \pi_i^m \pi_j^m e^{-d_{i,j}^m}}{q_{j|i}\, z_i}

4 Reconstructing two maps from one set of similarities

As a simple illustration of aspect maps, we constructed a toy problem in which the assumptions underlying the use of aspect maps are correct. For this toy problem, the low-dimensional space has as many dimensions as the high-dimensional space. Consider the two maps shown in figure 4. We gave each object a mixing proportion of 0.5 in each map and then used Eq. 7 to define a set of conditional probabilities p_{j|i} which can be modeled perfectly by the two maps. The question is whether our optimization procedure can reconstruct both maps from one set of conditional probabilities if the objects start with random coordinates in each map. Figure 4 shows that both maps can be recovered up to reflection, translation and rotation.

Figure 4: The two maps in the top row can be reconstructed correctly from a single set of pairwise similarities. Using a randomly chosen one-to-one mapping between points in the top two maps, the similarities are defined using Eq. 7 with all mixing proportions fixed at 0.5.

5 Modeling human word association data

The University of South Florida has made a database of human word associations available on the web. Participants were presented with a list of English words as cues, and asked to respond to each word with a word which was meaningfully related or strongly associated [5]. The database contains 5018 cue words, with an average of 122 responses to each. This data lends itself naturally to SNE: simply define the probability p_{j|i} as the fraction of times word j was picked in response to word i.

Ambiguous words in the dataset cause a problem. For example, SNE might want to put fire close to the words wood and job, even though wood and job should not be put close to one another. A solution is to use the aspect maps version, AMSNE, and consider the word fire as a mixture of two different meanings. In one map fire is a source of heat and should be put near wood, and in the other fire is something done to employees and should be close to job. Ambiguity is not the only reason a word might belong in two different places: as another example, death might be similar to words like sad and cancer but also to destruction and military, even though cancer is not usually seen as being similar to military.

When modelling the free association data, we found that AMSNE would put many unrelated clusters of words in the same map, far apart. To make the individual maps more coherent, we added a penalty that kept each map small, thus discouraging any one map from containing several unrelated clusters. The penalty term (λ/2) Σ_m Σ_i \|y_i^m\|^2 is simply added to the cost function in Eq. 3. We fitted the free association data with the aspect maps model using 50 maps with λ set to 0.48. In order to speed up the optimization, we only used the 1000 cue words that were most often given as responses. Four of the resulting maps are shown in figures 5 and 6. In figure 5 the two different maps model the very different similarities induced by two different meanings of the word can. In figure 6 we see two different contexts in which the word field is used.
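The penalty's contribution to the gradient is immediate. A small sketch, assuming the (λ/2) Σ \|y\|^2 form given above (the exact constant in the paper's penalty is our reading of a garbled formula):

import numpy as np

def size_penalty(Ys, lam=0.48):
    """Map-size penalty (lambda/2) * sum_m sum_i ||y_i^m||^2 and its
    gradient lambda * y_i^m; both are simply added to the AMSNE cost
    and gradient. Ys has shape (M, N, 2)."""
    penalty = 0.5 * lam * float(np.square(Ys).sum())
    grad = lam * Ys                  # d(penalty)/d(y_i^m) = lambda * y_i^m
    return penalty, grad

Pulling every map toward the origin in this way discourages a single map from holding several widely separated, unrelated clusters.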
Figure 5: Two of the 50 aspect maps for the word association data. Each map models a different sense of can. Each word is represented by a circle whose area is proportional to its mixing proportion.

Figure 6: Two of the 50 aspect maps for the word association data. Each map models a different sense of field.

Whether these should be called different meanings of the word field is an open question that can be answered by the linguistic intuitions of lexicographers or by looking at whether two meanings model the observed similarity judgements better than one.

6 UNI-SNE: A degenerate version of aspect maps

On some datasets, we found that fitting two aspect maps led to solutions that seemed strange. One of the aspect maps would keep all of the objects very close together, while the other aspect map would create widely separated clusters of objects. This behaviour can be understood as a sensible way of dealing with a problem that arises when using a 2-D space to model a set of high-dimensional distances that have an intrinsic dimensionality greater than 2. In the best 2-D model of the high-dimensional distances, the objects in the middle will be crushed together too closely and the objects around the periphery will be much too far from other peripheral objects.¹

¹ To flatten a hemispherical shell into a disk, for example, we need to compress the center of the hemisphere and stretch or tear its periphery.

Using the physical analogy of figure 3, there will be many weak but very stretched springs between objects on opposite sides of the 2-D space, and the net effect of all these springs will be to force objects in the middle together. A background map in which all of the objects are very close together gives all of the q_{j|i} a small positive contribution. This is sufficient to ensure that q_{j|i} is at least as great as p_{j|i} for objects that are significantly further apart than the average separation. When q_{j|i} > p_{j|i}, the very stretched springs actually repel distant objects, and this causes the foreground map to expand, thus providing enough space to allow clusters of similar objects to be separated from each other.

If we simply constrain all of the objects in the background map to have identical locations and mixing proportions, we get a degenerate version of aspect maps that is equivalent to combining SNE with a uniform background model. We chose to implement this idea for the simpler, symmetric version of SNE, so Eq. 5 becomes:

q_{ij} = (1 - \lambda) \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k<l} \exp(-\|y_k - y_l\|^2)} + \frac{2\lambda}{N(N-1)}    (9)

We call this robust version UNI-SNE, and it often gives much better visualizations than SNE.
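A minimal sketch of Eq. 9 (our naming; we normalize over ordered pairs, so each ordered pair receives λ/(N(N-1)), which matches the 2λ/(N(N-1)) of the unordered k<l normalization):

import numpy as np

def uni_sne_q(Y, lam=0.2):
    """Eq. 9: mix the symmetric-SNE joint with a uniform distribution
    over the N(N-1) ordered non-identical pairs."""
    N = Y.shape[0]
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    W = np.exp(-D2)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                        # Gaussian-based q_{ij} of Eq. 5
    U = np.full((N, N), 1.0 / (N * (N - 1)))
    np.fill_diagonal(U, 0.0)               # uniform background over pairs
    return (1.0 - lam) * Q + lam * U

With lam = 0.2, as used on MNIST below, a fifth of the probability mass is uniform, so pairs that are much further apart than average always have q_{ij} > p_{ij} and therefore repel each other slightly.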
Figure 7: The result of applying the symmetric version of SNE to 5000 digit images from the MNIST dataset. The 10 digit classes are not well separated.

Figure 8: If 0.2 of the total probability mass is used to provide a uniform background probability distribution, the slight attraction between dissimilar objects is replaced by slight repulsion. This causes expansion and rearrangement of the map which makes the class boundaries far more apparent.

We tested UNI-SNE on the MNIST dataset of handwritten digit images. It is very difficult to embed this data into a 2-D map in such a way that very similar images are close to one another and the class structure of the data is apparent. Using the first two principal components, for example, produces a map in which the classes are hopelessly scrambled [4]. A nonlinear version of PCA [4] does much better but still fails to separate the individual classes within the clusters 4,7,9 and 3,5,8.

We first used principal components analysis on all 60,000 MNIST training images to reduce each 28 × 28 pixel image to a 30-dimensional vector. Then we applied the symmetric version of SNE to 5000 of these 30-dimensional vectors, with an equal number from each class. To get the p_{ij} we averaged p_{i|j} and p_{j|i}, each of which was computed using a perplexity of 30 (see [3] for details). We ran SNE with exponentially decaying jitter, stopping after 1100 parameter updates when the KL divergence between the p_{ij} and the q_{ij} was changing by less than 0.0001 per iteration. Figure 7 shows that SNE is also unable to separate the clusters 4,7,9 and 3,5,8, and it does not cleanly separate the clusters for 0, 1, 2, and 6 from the rest of the data.

Starting with the solution produced by symmetric SNE, we ran UNI-SNE for a further 1500 parameter updates with no jitter but with 0.2 of the total probability mass uniformly distributed between all pairs. Figure 8 shows that this produced a dramatic improvement in revealing the true structure of the data. It also reduced the KL divergence in Eq. 6 from 2.47 to 1.48. UNI-SNE is better than any other visualization method we know of for separating the classes in this dataset, though we have not compared it with the recently developed method called maximum variance unfolding [9] which, like UNI-SNE, tries to push dissimilar objects far apart.

We have also tried applying UNI-SNE to a set of 100-dimensional real-valued feature vectors, each of which represents one of the 500 most common words or symbols in a dataset of AP newswire stories [1]. The corpus contains 16,000,000 words, and a feature vector was extracted for each of the 18,000 commonest words or symbols by fitting a model (to be described elsewhere) that tries to predict the features of the current word from the features of the two previous words. We used UNI-SNE to see whether the learning procedure was extracting sensible representations
of the words. Figure 9 shows that the feature vectors capture the strong similarities quite well.

Acknowledgments

We thank Sam Roweis for helpful discussions and Josh Tenenbaum for telling us about the free association dataset. This research was supported by NSERC, CFI and OTI. GEH is a fellow of CIAR and holds a CRC chair.

References

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(6):1137-1155, 2003.
[2] M.A.A. Cox and T.F. Cox. Multidimensional Scaling. Chapman & Hall/CRC, 2001.
[3] G. Hinton and S. Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15:833-840, 2003.
[4] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504-507, 2006.
[5] D. L. Nelson, C. L. McEvoy, and T. A. Schreiber. The University of South Florida word association, rhyme, and word fragment norms. http://www.usf.edu/freeassociation/, 1998.
[6] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323, 2000.
[7] J.W. Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, 18(5):401-409, 1969.
[8] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319, 2000.
[9] K.Q. Weinberger and L.K. Saul. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1):77-90, 2006.
Figure 9: A map produced by applying UNI-SNE to 100-dimensional feature vectors that were learned for the 500 commonest words in the AP news dataset. [The full word layout of the figure is omitted here.]