Overview and alpage, INRIA & Paris VII
Séminaire CENTAL, Friday 5 November 2010
Outline
- Similarity
- Context
- Dimensionality reduction
  - Latent semantic analysis
  - Non-negative matrix factorization
- Tensors
Semantic similarity
- Most work on semantic similarity relies on the distributional hypothesis (Harris 1954)
- Take a word and its contexts:
    tasty tnassiorc
    greasy tnassiorc
    tnassiorc with butter
    tnassiorc for breakfast
  → FOOD
- By looking at a word's context, one can infer its meaning
Matrix
- Capture co-occurrence frequencies of two entities
- Columns: rouge 'red', délicieux 'delicious', rapide 'fast', d'occasion 'second-hand'
- Rows: pomme 'apple', vin 'wine', voiture 'car', camion 'truck'
- Successive builds of the slide, with growing counts:

  (a)        rouge  délicieux  rapide  d'occasion
  pomme          2          1       0           0
  vin            2          2       0           0
  voiture        1          0       1           2
  camion         1          0       1           1

  (b)        rouge  délicieux  rapide  d'occasion
  pomme          7          9       0           0
  vin           12          6       0           0
  voiture        7          0       8           4
  camion         2          0       3           4

  (c)        rouge  délicieux  rapide  d'occasion
  pomme         56         98       0           0
  vin           44         34       0           0
  voiture       23          0      31          39
  camion         4          0      18          29

  (d)        rouge  délicieux  rapide  d'occasion
  pomme        728        592       1           0
  vin         1035        437       0           2
  voiture      392          0     487         370
  camion       104          0     393         293
Similarity calculation
- Cosine:
  $$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$
- Examples: cos(pomme, vin) = .96; cos(pomme, voiture) = .42
- Other possibilities:
  - set-theoretic measures: Dice, Jaccard
  - probabilistic measures: Kullback-Leibler divergence, Jensen-Shannon divergence
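The two cosine examples can be checked against the largest co-occurrence table from the Matrix slide. The following is a minimal sketch using NumPy; the dictionary `counts` and the helper `cosine` are names introduced here for illustration only.

```python
import numpy as np

# Co-occurrence counts from the last Matrix table
# (columns: rouge, délicieux, rapide, d'occasion)
counts = {
    "pomme":   np.array([728, 592, 1, 0], dtype=float),
    "vin":     np.array([1035, 437, 0, 2], dtype=float),
    "voiture": np.array([392, 0, 487, 370], dtype=float),
    "camion":  np.array([104, 0, 393, 293], dtype=float),
}

def cosine(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(round(cosine(counts["pomme"], counts["vin"]), 2))      # 0.96
print(round(cosine(counts["pomme"], counts["voiture"]), 2))  # 0.42
```

Because cosine normalises for vector length, the high-frequency noun vin and the lower-frequency pomme still come out as close neighbours.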
Different kinds of context
- Three different word space models based on context:
  - document-based model (nouns × documents)
  - window-based model (nouns × context words)
  - syntax-based model (nouns × dependency relations)
- Each model comes with a plethora of parameters:
  - document size, window size, type of dependency relations
  - weighting function
  - with or without dimensionality reduction
Different kinds of semantic similarity
- Tight, synonym-like similarity: (near-)synonymous or (co-)hyponymous
- Loosely related, topical similarity: looser relationships, such as association and meronymy
- Example:
  - médecin 'doctor' (tight): docteur 'doctor', médecin de famille 'family doctor', chirurgien 'surgeon', spécialiste 'specialist', dermatologue 'dermatologist', gynécologue 'gynaecologist'
  - médecin 'doctor' (topical): malade 'patient', maladie 'disease', diagnostic 'diagnosis', traitement 'treatment', hôpital 'hospital', stéthoscope 'stethoscope'
Relation between context and similarity
- A different context leads to a different kind of similarity
- Syntax, small window vs. large window, documents
  - The former models induce tight, synonym-like similarity
  - The latter models induce topical relatedness
- Evaluation
  - The syntax-based model scores best when evaluated with WordNet similarity measures (Cornetto)
  - Large-window and document-based models do not score well on WordNet similarity, but do score on the WordNet domain evaluation
Two reasons for performing dimensionality reduction
- Intractable computations
  - When the number of elements and the number of features is too large, similarity computations may become intractable
  - Reducing the number of features makes computation tractable again
- Generalization capacity
  - The dimensionality reduction is able to describe the data better, or is able to capture intrinsic semantic features
  - Dimensionality reduction is able to improve the results (counter data sparseness and noise)
Latent semantic analysis: introduction
- Application of a mathematical/statistical technique to simulate how humans learn the semantics of words
- LSA finds latent semantic dimensions according to which words and documents can be identified
- Goal: counter data sparseness and get rid of noise
- What is latent semantic analysis, technically speaking?
  - The application of singular value decomposition to a term-document matrix to improve similarity calculations
Singular value decomposition
- Find the dimensions that explain most variance by solving a number of eigenvector problems
- Only keep the n most important dimensions (n = 50 to 300)
SVD: three matrices (figure)
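A truncated SVD on a toy term-document matrix can be sketched as follows. The matrix values are randomly generated for illustration and `k = 2` is chosen only because the toy matrix is tiny; none of this reproduces the actual experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy term-document count matrix (6 terms x 8 documents); values are made up
A = rng.poisson(1.0, size=(6, 8)).astype(float)

# A = U diag(s) Vt, with singular values s sorted in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions kept
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k least-squares approximation
term_vectors = U[:, :k] * s[:k]              # reduced representation of each term
```

Similarities between terms can then be computed on the rows of `term_vectors` instead of on the full rows of `A`.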
LSA: example
- LSA in (part of) the Twente Nieuws Corpus: 10 years of Dutch newspaper texts (AD, NRC, TR, VK, PAR)
- terms = nouns
- documents = paragraphs
- 20,000 terms × 2,000,000 documents matrix
- reduced to 300 dimensions
LSA: criticism
- LSA has been criticized for a number of reasons:
  - The dimensionality reduction is the best fit in the least-squares sense, but this is only valid for normally distributed data; language data is not normally distributed
  - Dimensions may contain negative values; it is not clear what negativity on a semantic scale should designate
- These shortcomings are remedied by subsequent techniques (PLSA, LDA, NMF, ...)
Non-negative matrix factorization: technique
- Given a non-negative matrix V, find non-negative matrix factors W and H such that:
  $$V_{n \times m} \approx W_{n \times r} H_{r \times m} \qquad (1)$$
- Choosing r ≪ n, m reduces the data
- Constraint on the factorization: all values in the three matrices need to be non-negative (≥ 0)
- The constraint brings about a parts-based representation: only additive, no subtractive relations are allowed
Non-negative matrix factorization: technique
- Different kinds of NMF minimize different cost functions:
  - Square of the Euclidean distance
  - Kullback-Leibler divergence: better suited for language phenomena
- To find an NMF is to minimize D(V ‖ WH) with respect to W and H, subject to the constraints W, H ≥ 0
- This can be done with update rules:
  $$H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}}}{\sum_k W_{ka}} \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} \frac{V_{i\mu}}{(WH)_{i\mu}}}{\sum_v H_{av}} \qquad (2)$$
- These update rules converge to a local minimum in the minimization of the KL divergence
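The multiplicative update rules in (2) translate almost directly into matrix operations. The sketch below is an illustrative NumPy version applied to the small noun-by-adjective table from the Matrix slide; the function names, the random initialisation, the iteration count and the small constant `eps` (added to avoid division by zero) are assumptions made here, not details taken from the slides.

```python
import numpy as np

def kl_divergence(V, WH, eps=1e-12):
    """Generalised KL divergence D(V || WH)."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))

def nmf_kl(V, r, n_iter=200, seed=0, eps=1e-12):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + 0.1   # strictly positive start
    H = rng.random((r, m)) + 0.1
    history = []
    for _ in range(n_iter):
        WH = W @ H + eps
        # H_au <- H_au * sum_i W_ia V_iu / (WH)_iu / sum_k W_ka
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # W_ia <- W_ia * sum_u H_au V_iu / (WH)_iu / sum_v H_av
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
        history.append(kl_divergence(V, W @ H))
    return W, H, history

V = np.array([[728, 592, 1, 0],
              [1035, 437, 0, 2],
              [392, 0, 487, 370],
              [104, 0, 393, 293]], dtype=float)
W, H, history = nmf_kl(V, r=2)
```

Since the updates only multiply by non-negative ratios, W and H stay non-negative throughout, and the recorded divergence in `history` should not increase from one iteration to the next.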
NMF: graphical representation (figure)
NMF: results
- Context vectors (5k nouns × 2k co-occurring nouns) extracted from the CLEF corpus
- NMF is able to capture clear semantic dimensions
- Examples:
  - bus 'bus', taxi 'taxi', trein 'train', halte 'stop', reiziger 'traveler', perron 'platform', tram 'tram', station 'station', chauffeur 'driver', passagier 'passenger'
  - bouillon 'broth', slagroom 'cream', ui 'onion', eierdooier 'egg yolk', laurierblad 'bay leaf', zout 'salt', deciliter 'decilitre', boter 'butter', bleekselderij 'celery', saus 'sauce'
Two-way vs. three-way
- All methods so far use two-way co-occurrence frequencies → matrix
- Suitable for two-way problems:
  - words × documents
  - nouns × dependency relations
- Not suitable for n-way problems → tensor:
  - words × documents × authors
  - verbs × subjects × direct objects
Non-negative tensor factorization: technique
- Idea similar to non-negative matrix factorization
- Calculations are different:
  $$\min_{x_i \in \mathbb{R}^{D_1}_{\geq 0},\; y_i \in \mathbb{R}^{D_2}_{\geq 0},\; z_i \in \mathbb{R}^{D_3}_{\geq 0}} \left\| T - \sum_{i=1}^{k} x_i \circ y_i \circ z_i \right\|_F^2$$
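One way to approach this minimisation is with multiplicative updates analogous to those used for NMF with the Euclidean cost. The slides do not say which algorithm was actually used, so the sketch below is only an illustrative stand-in; the function names, initialisation and toy tensor are assumptions.

```python
import numpy as np

def reconstruct(X, Y, Z):
    """Sum over i of the outer products x_i o y_i o z_i (columns of X, Y, Z)."""
    return np.einsum('ak,bk,ck->abc', X, Y, Z)

def ntf(T, k, n_iter=200, seed=0, eps=1e-12):
    """Non-negative three-way factorisation by multiplicative updates (sketch)."""
    rng = np.random.default_rng(seed)
    X, Y, Z = (rng.random((d, k)) + 0.1 for d in T.shape)
    for _ in range(n_iter):
        # Each step is a Euclidean NMF-style update on one mode of the tensor
        X *= np.einsum('abc,bk,ck->ak', T, Y, Z) / (X @ ((Y.T @ Y) * (Z.T @ Z)) + eps)
        Y *= np.einsum('abc,ak,ck->bk', T, X, Z) / (Y @ ((X.T @ X) * (Z.T @ Z)) + eps)
        Z *= np.einsum('abc,ak,bk->ck', T, X, Y) / (Z @ ((X.T @ X) * (Y.T @ Y)) + eps)
    return X, Y, Z

# Toy non-negative tensor (verbs x subjects x objects) of exact rank 2
rng = np.random.default_rng(1)
T = reconstruct(rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2)))
X, Y, Z = ntf(T, k=2)
error = np.linalg.norm(T - reconstruct(X, Y, Z))
```

The columns of `X`, `Y` and `Z` play the role of the vectors x_i, y_i and z_i in the objective above.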
NTF: graphical representation (figure)
- Task: automatic extraction of multi-word expressions (MWEs) from large corpora
- Starting point: many MWEs are non-compositional, i.e. the meaning of the MWE is not the sum of the meanings of the individual words
- Intuition: a noun within an MWE cannot easily be replaced by a semantically similar noun
- Use of semantic clusters to determine whether an MWE candidate is compositional or not
Intuition

  break the ice    break the vase
  *snow            cup
  *hail            dish

In the first expression, it is not possible to replace ice with semantically related nouns such as snow or hail; in the second expression, it is possible to replace vase with semantically related words such as cup or dish.
Overview of the method
- Verb + prepositional complement instances are extracted from a 500M word corpus (focus on MWEs with a PP)
- A matrix of 5K verb-preposition combinations × 10K nouns is created
- The 10K nouns are automatically clustered using distributional similarity measures
- A number of statistical measures are applied to determine unique associations, given the cluster in which a noun appears
Measures
- Inspired by selectional preferences (Resnik 1993), entropy-based
- Kullback-Leibler divergence between P(n) and P(n | v):
  $$S_v = \sum_n p(n \mid v) \log \frac{p(n \mid v)}{p(n)} \qquad (3)$$
Measures
- The preference of the verb for the noun [0, 1]:
  $$A_{v \to n} = \frac{p(n \mid v) \log \frac{p(n \mid v)}{p(n)}}{S_v} \qquad (4)$$
- Ratio of verb preference for a particular noun, compared to the other nouns in the cluster C [0, 1]:
  $$R_{v \to n} = \frac{A_{v \to n}}{\sum_{n' \in C} A_{v \to n'}} \qquad (5)$$
Measures
- The preference of the noun for the verb [0, 1]:
  $$A_{n \to v} = \frac{p(v \mid n) \log \frac{p(v \mid n)}{p(v)}}{S_n} \qquad (6)$$
- Ratio of noun preference for a particular verb, compared to the other nouns in the cluster C [0, 1]:
  $$R_{n \to v} = \frac{A_{n \to v}}{\sum_{n' \in C} A_{n' \to v}} \qquad (7)$$
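As a rough illustration of measures (3) to (7), the sketch below computes them from a toy verb-by-noun count matrix. The counts, the cluster and the function name `preference` are invented for the example; probabilities are plain relative frequencies, which is an assumption about how they are estimated.

```python
import numpy as np

def preference(counts):
    """Rows condition (e.g. verbs), columns are preferred items (e.g. nouns).

    Returns the association matrix A (eq. 4 or 6) and the strength S (eq. 3).
    """
    cond = counts / counts.sum(axis=1, keepdims=True)   # p(column | row)
    prior = counts.sum(axis=0) / counts.sum()            # p(column)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(cond > 0, cond * np.log(cond / prior), 0.0)
    S = terms.sum(axis=1, keepdims=True)                 # KL divergence per row
    return terms / S, S.ravel()

# Toy counts: 2 verb-preposition combinations x 3 nouns (made-up numbers)
counts = np.array([[8.0, 1.0, 1.0],
                   [2.0, 5.0, 3.0]])
A_vn, S_v = preference(counts)      # verb -> noun, eqs. (3) and (4)
A_nv, S_n = preference(counts.T)    # noun -> verb, eq. (6)

cluster = [1, 2]                    # hypothetical cluster holding nouns 1 and 2
R_vn = A_vn[1, cluster] / A_vn[1, cluster].sum()   # eq. (5) for verb 1
R_nv = A_nv[cluster, 1] / A_nv[cluster, 1].sum()   # eq. (7) for verb 1
```

A candidate such as verb 1 with noun 1 would then be compared against thresholds on the four values `A_vn`, `R_vn`, `A_nv` and `R_nv`.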
An elaborated example

  in de smaak vallen      in de put vallen
  *geur                   kuil
  *voorkeur               krater
  *stijl                  greppel

  in the taste fall       in the well fall
  'to be appreciated'     'to fall down the well'
  *smell                  hole
  *preference             crater
  *style                  trench
An elaborated example
- smaak: idioom, karakter, persoonlijkheid, stijl, temperament, thematiek, uiterlijk, uitstraling, voorkomen

  MWE candidate      (1)    (2)    (3)    (4)    MWE?
  val#in smaak       .12   1.00    .07   1.00    yes
  val#in karakter    .00    .00    .00    .00    no
  val#in stijl       .00    .00    .00    .00    no
An elaborated example
- put: gaatje, gat, kloof, krater, kuil, lek, scheur, valkuil

  MWE candidate      (1)    (2)    (3)    (4)    MWE?
  val#in put         .00    .04    .10    .05    no
  val#in kuil        .00    .11    .10    .38    no
  val#in kloof       .00    .01    .10    .04    no
Quantitative evaluation
- Fully automated, compared to the Referentie Bestand Nederlands (RBN)
- The upper bound consists of all RBN MWEs present in the data

  Parameters                    N    Prec (%)   Rec (%)   F-measure (%)
  (1)   (2)   (3)   (4)
  .10   .80                  2916      24.66     16.64      19.87
  .10   .80   .01   .80      1694      30.99     12.15      17.45
  Fazly/Stevenson            3470      16.89     13.56      15.04
  Random baseline            4332       1.02      1.02       1.02
Conclusion
- The non-compositionality-based algorithm is able to rule out expressions that are coined MWEs by traditional algorithms, improving on the state of the art
- Using measures (1) and (2) gives the best results; using (3) and (4) increases precision but degrades recall
Ambiguity
- Problem: ambiguity (example: bar)
- Main research question: can topical similarity and tight, synonym-like similarity be combined to differentiate between the various senses of a word?
- Goal: classification of nouns according to both window-based context (with a large window) and syntactic context
- Construct three matrices capturing co-occurrence frequencies for each mode:
  - nouns cross-classified by dependency relations
  - nouns cross-classified by (bag-of-words) context words
  - dependency relations cross-classified by context words
- Apply NMF to the matrices, but interleave the process
- The result of the former factorization is used to initialize the factorization of the next one
Graphical representation
- A (nouns × dependency relations, 5k × 80k) = W (5k × 50) × H (50 × 80k)
- B (nouns × context words, 5k × 2k) = V (5k × 50) × G (50 × 2k)
- C (context words × dependency relations, 2k × 80k) = U (2k × 50) × F (50 × 80k)
Sense subtraction
- Switch off dimension(s) of an ambiguous word to reveal other possible senses
- From matrix W, we know which dimensions are the most important for a certain word
- Matrix H gives the importance of each dependency relation given a dimension
- Subtract the dependency relations that are responsible for a given dimension from the original noun vector:
  $$\vec{v}_{new} = \vec{v}_{orig} (\vec{1} - \vec{h}_{dim})$$
- Each dependency relation is multiplied by a scaling factor, according to the load of the feature on the subtracted dimension
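A minimal sketch of the subtraction step, read as an element-wise rescaling of the noun vector by one minus the loadings of the subtracted dimension. The toy values and the rescaling of the loadings to [0, 1] by their maximum are assumptions made here so that the scaling factors stay between 0 and 1; the slide does not specify how the loadings are normalised.

```python
import numpy as np

def subtract_sense(v_orig, H, dim):
    """Scale each feature of v_orig by (1 - load of that feature on `dim`)."""
    h = H[dim] / H[dim].max()   # assumption: loadings rescaled to [0, 1]
    return v_orig * (1.0 - h)

# Toy factor matrix H: 2 dimensions x 4 dependency relations (made-up values)
H = np.array([[0.0, 0.5, 1.0, 0.0],
              [0.8, 0.0, 0.0, 0.4]])
v_orig = np.array([4.0, 2.0, 6.0, 3.0])   # toy noun vector over the 4 relations

v_new = subtract_sense(v_orig, H, dim=0)
print(v_new)   # [4. 1. 0. 3.]
```

Features that load fully on the subtracted dimension are switched off, while features that do not load on it keep their original weight.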
Combination with clustering
- A simple clustering algorithm (k-means) assigns each ambiguous noun to its predominant sense
- The centroid of the cluster is folded into the NMF model
- The dimensions that define the centroid are subtracted from the ambiguous noun vector
- The adapted noun vector is fed to the clustering algorithm again
Experimental design
- Approach applied to Dutch, using the Twente Nieuws Corpus (± 500M words)
- Corpus parsed with the Dutch dependency parser Alpino
- Three matrices constructed with:
  - 5k nouns × 80k dependency relations
  - 5k nouns × 2k context words
  - 80k dependency relations × 2k context words
- Factorization to 50 dimensions
Example dimension: transport
1. nouns: auto 'car', wagen 'car', tram 'tram', motor 'motorbike', bus 'bus', metro 'subway', automobilist 'driver', trein 'train', stuur 'steering wheel', chauffeur 'driver'
2. context words: auto 'car', trein 'train', motor 'motorbike', bus 'bus', rij 'drive', chauffeur 'driver', fiets 'bike', reiziger 'traveler', passagier 'passenger', vervoer 'transport'
3. dependency relations: viertraps (adj) 'four pedal', verplaats met (obj) 'move with', toeter (adj) 'honk', tank in houd (obj) [parsing error], tank (subj) 'refuel', tank (obj) 'refuel', rij voorbij (subj) 'pass by', rij voorbij (adj) 'pass by', rij af (subj) 'drive off', peperduur (adj) 'very expensive'
Pop: most similar words
- Senses of pop: music, doll
1. pop, rock, jazz, meubilair 'furniture', popmuziek 'pop music', heks 'witch', speelgoed 'toy', kast 'cupboard', servies '[tea] service', vraagteken 'question mark'
2. pop, meubilair 'furniture', speelgoed 'toy', kast 'cupboard', servies '[tea] service', heks 'witch', vraagteken 'question mark', sieraad 'jewel', sculptuur 'sculpture', schoen 'shoe'
3. pop, rock, jazz, popmuziek 'pop music', heks 'witch', danseres 'dancer', servies '[tea] service', kopje 'cup', house 'house music', aap 'monkey'
Barcelona: most similar words
- Senses of Barcelona: Spanish city, Spanish football club
1. Barcelona, Arsenal, Inter, Juventus, Vitesse, Milaan 'Milan', Madrid, Parijs 'Paris', Wenen 'Vienna', München 'Munich'
2. Barcelona, Milaan 'Milan', München 'Munich', Wenen 'Vienna', Madrid, Parijs 'Paris', Bonn, Praag 'Prague', Berlijn 'Berlin', Londen 'London'
3. Barcelona, Arsenal, Inter, Juventus, Vitesse, Parma, Anderlecht, PSV, Feyenoord, Ajax
Clustering example: werk
1. werk 'work', beeld 'image', foto 'photo', schilderij 'painting', tekening 'drawing', doek 'canvas', installatie 'installation', afbeelding 'picture', sculptuur 'sculpture', prent 'picture', illustratie 'illustration', handschrift 'manuscript', grafiek 'print', aquarel 'aquarelle', maquette 'scale model', collage 'collage', ets 'etching'
2. werk 'work', boek 'book', titel 'title', roman 'novel', boekje 'booklet', debuut 'debut', biografie 'biography', bundel 'collection', toneelstuk 'play', bestseller 'bestseller', kinderboek 'children's book', autobiografie 'autobiography', novelle 'short story'
3. werk 'work', voorziening 'service', arbeid 'labour', opvoeding 'education', kinderopvang 'child care', scholing 'education', huisvesting 'housing', faciliteit 'facility', accommodatie 'accommodation', arbeidsomstandigheid 'working condition'
Evaluation: methodology
- Comparison to EuroWordNet senses using Wu & Palmer's WordNet similarity measure
- A sense is assigned to a correct cluster if:
  - for the top 4 words of the cluster, the average similarity between the sense and the top 4 words exceeds a similarity threshold in EuroWordNet
  - and the sense has not yet been considered correct
- When multiple senses exist in EuroWordNet, the one with maximum similarity is chosen
Evaluation: precision & recall
- Precision
  - of a word: percentage of correct clusters to which it is assigned
  - overall: average precision of all words in the test set
- Recall
  - of a word: percentage of senses in EuroWordNet that have a corresponding cluster
  - overall: average recall of all words in the test set
- Test set: words covered by the algorithm that are also present in EuroWordNet (3683 words)
Evaluation: results

  threshold θ             .40 (%)   .50 (%)   .60 (%)
  kmeans nmf    prec.      78.97     69.18     55.16
                rec.       63.90     55.95     44.77
  cbc           prec.      44.94     38.13     29.74
                rec.       69.61     60.00     48.00
  kmeans orig   prec.      86.13     74.99     58.97
                rec.       60.23     52.45     41.80
Evaluation: results
- kmeans nmf beats cbc with regard to precision
- cbc beats kmeans nmf with regard to recall
- kmeans nmf has higher recall than kmeans orig, so the algorithm is able to find multiple senses with good precision
Conclusion
- Combining bag-of-words data and syntactic data is useful:
  - bag-of-words data (factorized with NMF) puts its finger on topical dimensions
  - syntactic data is particularly good at finding similar words
- A clustering approach allows one to determine which topical dimension(s) are responsible for a certain sense and to adapt the (syntactic) feature vector of the noun accordingly
  - subtracting the more dominant sense to discover less dominant senses
- Compared with cbc, the algorithm scores better with regard to precision and lower with regard to recall
- Standard selectional preference models: two-way co-occurrences
- Keeping track of single relationships
- But: two-way selectional preference models are not sufficiently rich
- Compare:
    The skyscraper is playing coffee.
    The turntable is playing the piano.
The skyscraper is playing coffee.
  (play, su, skyscraper)
  (play, obj, coffee)
The turntable is playing the piano.
  (play, su, turntable)
  (play, obj, piano)
  (play, su, turntable, obj, piano)
Three-way extraction of selectional preferences
- Approach applied to Dutch, using the Twente Nieuws Corpus (500M words of newspaper texts)
- Parsed with the Dutch dependency parser Alpino
- Three-way co-occurrences of verbs with subjects and direct objects extracted
- Adapted with an extension of pointwise mutual information
- Resulting tensor: 1k verbs × 10k subjects × 10k direct objects
- Non-negative tensor factorization with k dimensions (k = 50, 100, 300)
Graphical representation
Example dimension: police action

  subjects (score)                      | verbs (score)              | objects (score)
  politie 'police' .99                  | houd aan 'arrest' .64      | verdachte 'suspect' .16
  agent 'policeman' .07                 | arresteer 'arrest' .63     | man 'man' .16
  autoriteit 'authority' .05            | pak op 'run in' .41        | betoger 'demonstrator' .14
  Justitie 'Justice' .05                | schiet dood 'shoot' .08    | relschopper 'rioter' .13
  recherche 'detective force' .04       | verdenk 'suspect' .07      | raddraaier 'instigator' .13
  marechaussee 'military police' .04    | tref aan 'find' .06        | overvaller 'raider' .13
  justitie 'justice' .04                | achterhaal 'overtake' .05  | Roemeen 'Romanian' .13
  arrestatieteam 'special squad' .03    | verwijder 'remove' .05     | actievoerder 'campaigner' .13
  leger 'army' .03                      | zoek 'search' .04          | hooligan 'hooligan' .13
  douane 'customs' .02                  | spoor op 'track' .03       | Algerijn 'Algerian' .13
Example dimension: legislation

  subjects (score)              | verbs (score)            | objects (score)
  meerderheid 'majority' .33    | steun 'support' .83      | motie 'motion' .63
  VVD .28                       | dien in 'submit' .44     | voorstel 'proposal' .53
  D66 .25                       | neem aan 'pass' .23      | plan 'plan' .28
  Kamermeerderheid .25          | wijs af 'reject' .17     | wetsvoorstel 'bill' .19
  fractie 'party' .24           | verwerp 'reject' .14     | hem 'him' .18
  PvdA .23                      | vind 'think' .08         | kabinet 'cabinet' .16
  CDA .23                       | aanvaard 'accept' .05    | minister 'minister' .16
  Tweede Kamer .21              | behandel 'treat' .05     | beleid 'policy' .13
  partij 'party' .20            | doe 'do' .04             | kandidatuur 'candidature' .11
  Kamer 'Chamber' .20           | keur goed 'pass' .03     | amendement 'amendment' .09
Example dimension: exhibition

  subjects (score)                  | verbs (score)             | objects (score)
  tentoonstelling 'exhibition' .50  | toon 'display' .72        | schilderij 'painting' .47
  expositie 'exposition' .49        | omvat 'cover' .63         | werk 'work' .46
  galerie 'gallery' .36             | bevat 'contain' .18       | tekening 'drawing' .36
  collectie 'collection' .29        | presenteer 'present' .17  | foto 'picture' .33
  museum 'museum' .27               | laat 'let' .07            | sculptuur 'sculpture' .25
  oeuvre 'oeuvre' .22               | koop 'buy' .07            | aquarel 'aquarelle' .20
  Kunsthal .19                      | bezit 'own' .06           | object 'object' .19
  kunstenaar 'artist' .15           | zie 'see' .05             | beeld 'statue' .12
  dat 'that' .12                    | koop aan 'acquire' .05    | overzicht 'overview' .12
  hij 'he' .10                      | in huis heb 'own' .04     | portret 'portrait' .11
Quality count
- 44 dimensions contain similar, frame-like semantics
- 43 dimensions contain less clear-cut semantics
  - single verbs account for one dimension
  - verb senses are mixed up
- 13 dimensions are based on syntax rather than semantics
  - fixed expressions
  - pronouns
Evaluation: methodology
- Pseudo-disambiguation task to test generalization capacity (standard automatic evaluation for selectional preferences)

  s           v         o          s'            o'
  jongere     drink     bier       coalitie      aandeel
  'youngster' 'drink'   'beer'     'coalition'   'share'
  werkgever   riskeer   boete      doel          kopzorg
  'employer'  'risk'    'fine'     'goal'        'worry'
  directeur   zwaai     scepter    informateur   wodka
  'manager'   'sway'    'sceptre'  'informer'    'vodka'

- 10-fold cross-validation (± 300,000 co-occurrences)
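The pseudo-disambiguation task can be sketched as follows: for each attested (s, v, o) triple, the model should assign a higher score than for the same verb combined with the random pair (s', o'). The scoring function below, which reads the score off a three-way factorisation, and the tiny hand-made factor matrices are assumptions for illustration only.

```python
import numpy as np

def score(X, Y, Z, v, s, o):
    """Reconstructed tensor value for (verb v, subject s, object o)."""
    return float(np.sum(X[v] * Y[s] * Z[o]))

def pseudo_disambiguation_accuracy(X, Y, Z, items):
    """items: tuples (v, s, o, s_rand, o_rand) of integer indices."""
    hits = [score(X, Y, Z, v, s, o) > score(X, Y, Z, v, s2, o2)
            for v, s, o, s2, o2 in items]
    return sum(hits) / len(hits)

# Hand-made factors: 2 verbs, 2 subjects, 2 objects, 2 dimensions
X = np.eye(2)   # verb factors
Y = np.eye(2)   # subject factors
Z = np.eye(2)   # object factors

items = [(0, 0, 0, 1, 1),   # attested triple scores 1, random pair scores 0
         (1, 1, 1, 0, 0),   # attested triple scores 1, random pair scores 0
         (0, 1, 1, 0, 0)]   # attested triple scores 0, random pair scores 1
accuracy = pseudo_disambiguation_accuracy(X, Y, Z, items)
```

With these toy factors the model prefers the attested triple in two of the three items, so `accuracy` is 2/3.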
Evaluation: models
- Evaluation of 4 different models
- 2 matrix models: 1k verbs × (10k subjects + 10k direct objects)
  - singular value decomposition (ℝ)
  - non-negative matrix factorization (ℝ≥0)
- 2 tensor models: 1k verbs × 10k subjects × 10k direct objects
  - parallel factor analysis (ℝ)
  - non-negative tensor factorization (ℝ≥0)
Evaluation: results

  dimensions    50 (%)          100 (%)         300 (%)
  svd           69.60 ± 0.41    62.84 ± 1.30    45.22 ± 1.01
  nmf           81.79 ± 0.15    78.83 ± 0.40    75.74 ± 0.63
  parafac       85.57 ± 0.25    83.58 ± 0.59    80.12 ± 0.76
  ntf           89.52 ± 0.18    90.43 ± 0.14    90.89 ± 0.16
Conclusion
- Novel method, able to investigate three-way co-occurrences
- Capable of automatically inducing selectional preferences
- Three-way methods improve on two-way methods
- The non-negativity constraint improves on unconstrained models
- Non-negative tensor factorization outperforms the other models
Similarity evaluation: WordNet-based similarity
- Compare results to the Dutch Cornetto database
- Two similarity measures:
  - path length: Wu & Palmer's similarity measure
  - information-theoretic: Lin's similarity measure
- Nouns close in the hierarchy are tightly similar
- Pairwise similarity for the k most similar words
- Test set of ± 5000 nouns
Similarity evaluation: WordNet-based similarity

                     Wu & Palmer's similarity    Lin's similarity
  model              k = 1   k = 3   k = 5       k = 1   k = 3   k = 5
  document            .379    .331    .309        .354    .320    .304
  window (w=par)      .442    .377    .349        .404    .357    .336
  window (w=2)        .633    .561    .526        .541    .485    .456
  syntax              .648    .584    .554        .555    .504    .480
  baseline            .128    .126    .126        .164    .163    .163

- syntax > window (w=2) > window (w=par) > document
Similarity evaluation: WordNet-based similarity
- Syntax (with PMI) scores best
- Closely followed by the small window (with PMI)
- Large window and document perform much worse
- Dimensionality reduction only helps to improve the document-based model (a little)
Similarity evaluation: cluster quality
- Compare the output of a clustering algorithm to a gold standard classification
- Two clustering tasks (ESSLLI 2008 workshop's shared task):
  - concrete noun categorization (44 nouns)
    - 2-way: natural, artefact
    - 3-way: animal, vegetable, artefact
    - 6-way: bird, groundanimal, fruittree, green, tool, vehicle
  - abstract/concrete noun discrimination (30 nouns)
    - 2-way: hi, lo
- Evaluation measures:
  - Entropy: distribution of classes within a cluster (small = good)
  - Purity: ratio of the largest class present in a cluster (large = good)
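Entropy and purity can be computed along the following lines. The sketch assumes the commonly used size-weighted definitions, with entropy normalised by the logarithm of the number of gold classes; the slide does not spell out the exact formulas, and the function name is introduced here for illustration.

```python
import numpy as np
from collections import Counter

def entropy_purity(clusters, gold):
    """clusters, gold: equal-length sequences of cluster ids and gold classes."""
    n = len(gold)
    q = len(set(gold))          # number of gold classes (assumed >= 2)
    ent = pur = 0.0
    for c in set(clusters):
        members = [g for k, g in zip(clusters, gold) if k == c]
        freqs = np.array(list(Counter(members).values()), dtype=float)
        p = freqs / freqs.sum()                     # class distribution in cluster c
        ent += len(members) / n * float(-(p * np.log(p)).sum() / np.log(q))
        pur += freqs.max() / n                      # share of the largest class
    return ent, pur

perfect = entropy_purity([0, 0, 1, 1], ["animal", "animal", "tool", "tool"])
mixed = entropy_purity([0, 0, 1, 1], ["animal", "tool", "animal", "tool"])
```

A perfect clustering gives entropy 0 and purity 1, whereas clusters that mix both classes evenly give entropy 1 and purity 0.5.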
Similarity evaluation: cluster quality

                     2-way            3-way            6-way
  model              ent     pur      ent     pur      ent     pur
  document           .930    .614     .179    .932     .292    .682
  window (w=par)     .911    .659     .541    .705     .377    .591
  window (w=2)       .000   1.000     .213    .909     .206    .773
  syntax-based       .000   1.000     .000   1.000     .153    .864

- syntax > window (w=2) > document > window (w=par)
Similarity evaluation: cluster quality
- Same tendencies as for WordNet-based similarity
- The model with a large window seems to extract topically related clusters:
  - aardappel 'potato', ananas 'pineapple', banaan 'banana', champignon 'mushroom', fles 'bottle', kers 'cherry', ketel 'kettle', kip 'chicken', kom 'bowl', lepel 'spoon', peer 'pear', sla 'lettuce', ui 'onion'
- Similar result for abstract/concrete noun discrimination
Similarity evaluation: domain coherence
- Coherence of semantic domain tags (available in Cornetto)
  - particular areas of human knowledge (politics, medicine, sports)
  - topical similarity
- Ratio of the most frequent domain tag (also in the tagset of the target word) over the top 10 similar words
- Same test set of ± 5000 nouns
Similarity evaluation: domain coherence

  model              sim topic
  document             .394
  window (w=par)       .399
  window (w=2)         .414
  syntax               .441
  baseline             .048

- syntax > window (w=2) = window (w=par) = document
Similarity evaluation: domain coherence
- Syntax still scores best
- The other models do not perform much worse
- No real difference between small window and large window
- Large window and document do not extract tight similarity, but they do grasp topical similarity