Semantic Clustering in Dutch

1 Alfa-informatica, Rijksuniversiteit Groningen Computational Linguistics in the Netherlands December 16, 2005

2 Outline 1 2 Clustering Additional remarks 3 Examples 4

3 Research carried out during an internship at the Centrum voor Nederlandse Taal & Spraak, University of Antwerp. Goal: automatically cluster nouns (and adjectives) by applying machine learning techniques. Basic approach: induce semantic classes of nouns from the adjectives those nouns collocate with (and vice versa). Hypothesis: syntactic context (e.g. adjectival modifiers of nouns) is a sufficient cue for semantic clustering.

11 Semantic similarity. Finding semantically similar words by looking at syntactic context (Distributional Hypothesis, Harris). Take a word and its contexts: verse sneup (fresh sneup), gezouten sneup (salted sneup), lekkere sneup (tasty sneup), zoete sneup (sweet sneup), taaie sneup (chewy sneup). Inferred class: FOOD. A speaker of Dutch can infer meaning from context. In the same way, a computer might be able to discover similar words from similar contexts.

13 Vector space measures (1/2). How to determine semantic similarity? Create vectors: one row per noun (appel 'apple', wijn 'wine', auto 'car', vrachtwagen 'truck') and one column per adjective (rood 'red', lekker 'tasty', snel 'fast', tweedehands 'second-hand').

15 Vector space measures (2/2). Apply the cosine similarity measure:
$$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$
Examples: cos(auto, vrachtwagen) = 4/√18 ≈ 0.94; cos(appel, vrachtwagen) = 2/√15 ≈ 0.52.
Problem: ambiguity. Compare "een steengoed nummer" (a great song) and "een oneven nummer" (an odd number). Different meanings, but they end up in the same cluster.
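The cosine measure from the vector space slides can be sketched in a few lines of Python. The counts below are hypothetical: the cell values of the original noun-by-adjective table are not available here, so the numbers were chosen only so that the results agree with the two examples on the slide (4/√18 and 2/√15). Returning 0.0 for an all-zero vector is a convention assumed for this sketch.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_x * norm_y)


# Hypothetical counts over the adjectives (rood, lekker, snel, tweedehands).
appel = [2, 1, 0, 0]
auto = [1, 0, 1, 2]
vrachtwagen = [1, 0, 1, 1]

sim_auto_truck = cosine(auto, vrachtwagen)    # 4 / sqrt(18)
sim_apple_truck = cosine(appel, vrachtwagen)  # 2 / sqrt(15)
print(sim_auto_truck, sim_apple_truck)
```

With vectors like these, the two vehicles come out far more similar to each other than the apple is to the truck, which is the behaviour the slide illustrates.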

23 Clustering. Clustering = the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters). Two kinds of clustering: partitional clustering (stand-alone clusters, not embedded in a structure) and hierarchical clustering (a complete branching structure is assigned, up to the root node).
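The slides do not say which clustering algorithm was used. As an illustration of the two kinds of clustering, the sketch below uses average-link agglomerative clustering with cosine similarity, which is an assumption of this example. Stopping at k clusters gives stand-alone clusters; running down to k=1 and keeping the merge history gives the full branching structure. The function name and the toy vectors are hypothetical.

```python
import math


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def agglomerative(vectors, k):
    """Merge the two most similar clusters until k clusters remain.

    vectors: dict mapping each word to its count vector.
    Returns (clusters, merges): the remaining clusters as sorted word lists,
    and the merge history (pairs of merged clusters, in merge order).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[word] for word in vectors]
    merges = []

    def average_similarity(c1, c2):
        total = sum(cosine(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = average_similarity(clusters[i], clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters], merges


# Hypothetical counts over the adjectives (rood, lekker, snel, tweedehands).
toy_vectors = {
    "appel": [2, 1, 0, 0],
    "wijn": [1, 2, 0, 0],
    "auto": [1, 0, 1, 2],
    "vrachtwagen": [1, 0, 1, 1],
}
flat_clusters, _ = agglomerative(toy_vectors, k=2)
_, full_history = agglomerative(toy_vectors, k=1)
print(flat_clusters)
```

This brute-force version recomputes all pairwise similarities at every step, so it only suits small examples; it is meant to show the idea, not to scale to thousands of nouns.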

27 Additional remarks. Adjective-noun collocations have been extracted from the Twente News Corpus (>300M words). Lemmas have been used to get a better generalization. Frequencies have been logarithmically smoothed. For the n most frequent nouns, vectors have been created that contain the frequency of the m most frequent adjectives (and vice versa). In most experiments, n = 5,000 and m = 20,000.
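The vector construction described above might look roughly like the following sketch. The slide does not give the exact smoothing formula, so log(1 + f) is an assumption here, as are the function name `build_noun_vectors` and the toy collocation pairs.

```python
import math
from collections import Counter


def build_noun_vectors(pairs, n, m):
    """Build log-smoothed adjective-count vectors for the n most frequent nouns.

    pairs: iterable of (adjective_lemma, noun_lemma) collocations.
    Returns (adjectives, vectors): the m most frequent adjectives, and a dict
    mapping each selected noun to a vector aligned with that adjective list.
    """
    pairs = list(pairs)
    noun_freq = Counter(noun for _, noun in pairs)
    adj_freq = Counter(adj for adj, _ in pairs)
    nouns = [noun for noun, _ in noun_freq.most_common(n)]
    adjectives = [adj for adj, _ in adj_freq.most_common(m)]
    counts = Counter(pairs)
    vectors = {
        # log(1 + f) is an assumed smoothing; the original formula is not stated
        noun: [math.log(1 + counts[(adj, noun)]) for adj in adjectives]
        for noun in nouns
    }
    return adjectives, vectors


# Hypothetical lemmatised collocations.
toy_pairs = [
    ("rood", "appel"), ("rood", "appel"), ("lekker", "appel"),
    ("rood", "auto"), ("snel", "auto"),
]
adjectives, vectors = build_noun_vectors(toy_pairs, n=2, m=3)
print(adjectives, vectors)
```

Swapping the roles of adjectives and nouns in the same function would give the adjective vectors for the "vice versa" case.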

32 Examples of noun clustering
Cluster (months): mei, februari, september, maart, december, augustus, oktober, januari, juli, april, november, juni
Cluster (football players and positions): aanvaller, speler, middenvelder, verdediger, linksbuiten, international, invaller, keeper, voetballer, doelman, spits
Cluster (rebels and guerrilla movements): guerrillabeweging, opstandeling, rebellenleider, guerrillastrijder, guerrilla, verzetsbeweging, rebel, bevrijdingsleger
Cluster (units of measure): minuut, millimeter, seconde, cent, ton, meter, centimeter, graad, kilo, kilometer

36 Examples of adjective clustering
Cluster (colours): bruin, groen, rood, oranje, grijs, wit, geel, roze, zwart, paars, blauw
Cluster (unrestrained, boundless): ongebreideld, mateloos, tomeloos, grenzeloos, ongeremd
Cluster (informal evaluative adjectives): brutaal, cool, lelijk, dom, tof, stom

39 Example of hierarchical clustering. [Figure: dendrogram over the following words.]
Months: januari, september, augustus, november, februari, juni, december, oktober, maart, april, juli, mei
Days of the week: donderdag, maandag, zaterdag, woensdag, dinsdag, zondag, vrijdag
Other, mostly time-related nouns: nacht, zondagmiddag, weekend, herfst, middag, zomeravond, handelsdag, winter, avond, zomerdag, voorjaar, werkdag, weer, ochtend, zomer, najaar, morgen, dag, weekeinde

40 WordNet comparison evaluation (1/3). Automatic evaluation by comparing clusters to WordNet relations. The WordNet relations used for the evaluation are: hyponyms; hypernyms; hyponyms of the hypernyms (co-hyponyms, synonyms).

45 WordNet comparison evaluation (2/3). For each cluster: take the word with the most WordNet relations to other words of the cluster (= most central word). Get its hyponyms, hypernyms, co-hyponyms and synonyms in WordNet. Calculate precision: the share of words in the cluster that have an equivalent WordNet relation. Calculate recall: the share of these WordNet relations that have an equivalent in the found cluster. General precision (recall): the average of the precision (recall) over all clusters.
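The evaluation steps above can be sketched as follows. The `related` dictionary is a hypothetical stand-in for a real WordNet query, and two details are assumptions of this sketch rather than something the slides specify: the central word is left out of both numerator and denominator, and recall is taken as the share of related words that were found in the cluster.

```python
def evaluate_cluster(cluster, related):
    """Return (central_word, precision, recall) for one cluster.

    related: dict mapping a word to the set of its hyponyms, hypernyms,
    co-hyponyms and synonyms (hypothetical stand-in for a WordNet lookup).
    """
    members = set(cluster)

    def links_inside(word):
        return len(related.get(word, set()) & (members - {word}))

    # Most central word: the member related to the most other members.
    central = max(sorted(members), key=links_inside)
    gold = related.get(central, set()) - {central}
    others = members - {central}
    hits = others & gold
    precision = len(hits) / len(others) if others else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return central, precision, recall


def evaluate(clusters, related):
    """Average precision and recall over all clusters."""
    scores = [evaluate_cluster(c, related)[1:] for c in clusters]
    if not scores:
        return 0.0, 0.0
    precision = sum(p for p, _ in scores) / len(scores)
    recall = sum(r for _, r in scores) / len(scores)
    return precision, recall


# Hypothetical relation sets; "maand" (month) plays the role of a hypernym.
toy_related = {
    "januari": {"februari", "maart", "maand"},
    "februari": {"januari", "maart", "maand"},
    "maart": {"januari", "februari", "maand"},
}
central, precision, recall = evaluate_cluster(
    ["januari", "februari", "auto"], toy_related
)
print(central, precision, recall)
```

In the toy cluster, "auto" counts against precision because it has no relation to the central word, while the related words missing from the cluster ("maart", "maand") lower recall.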

51 WordNet comparison evaluation (3/3). [Figure: precision, recall, random precision and random recall, as a percentage (%), plotted against the number of clusters.]

52 Share of each relation. [Figure: precision, as a percentage (%), broken down by synonyms, hyponyms, hypernyms and co-hyponyms, plotted against the number of clusters.]

53 Wu & Palmer. Calculate the similarity between two words according to their distance in the WordNet hierarchy. Instead of comparing the clusters to a fixed group of words, the cluster quality is calculated from the similarity of the clustered words in WordNet.
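For reference, the Wu & Palmer measure scores two words by the depth of their lowest common ancestor relative to their own depths: sim(a, b) = 2 · depth(lcs) / (depth(a) + depth(b)). The sketch below applies it to a tiny hand-made taxonomy. The taxonomy, the single-parent simplification and the convention that the root has depth 1 are assumptions of this example; the slides do not say how word-level scores were combined into a cluster-level score.

```python
def path_to_root(word, parent):
    """Return the chain of nodes from word up to the root (word first)."""
    path = [word]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def depth(word, parent):
    """Depth of a node, counting the root as depth 1 (assumed convention)."""
    return len(path_to_root(word, parent))


def wu_palmer(a, b, parent):
    """Wu & Palmer similarity in a single-parent taxonomy."""
    ancestors_b = set(path_to_root(b, parent))
    # path_to_root(a) runs upward, so the first shared node is the deepest one.
    lcs = next(node for node in path_to_root(a, parent) if node in ancestors_b)
    return 2 * depth(lcs, parent) / (depth(a, parent) + depth(b, parent))


# Hypothetical mini-taxonomy: child -> parent, root has parent None.
toy_parent = {
    "entiteit": None,
    "voertuig": "entiteit",
    "auto": "voertuig",
    "vrachtwagen": "voertuig",
    "voedsel": "entiteit",
    "appel": "voedsel",
}
sim_vehicles = wu_palmer("auto", "vrachtwagen", toy_parent)
sim_mixed = wu_palmer("auto", "appel", toy_parent)
print(sim_vehicles, sim_mixed)
```

Words that share a deep ancestor score close to 1, and words that only meet at the root score low, so no fixed reference group is needed to judge a cluster.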

55 Wu & Palmer. [Figure: similarity and random baseline, as a percentage (%), plotted against the number of clusters.]

56 Significant similarity percentages when comparing the clusters to WordNet. Syntactic context is a good cue for the automatic extraction of semantic classes. Ambiguity remains a difficult problem for a computer to tackle.

59 Future work. Develop algorithms that disambiguate the different senses of a word. Develop algorithms that extract hierarchical wordnets instead of stand-alone clusters (algorithms that might discover is-a relations). Investigate verbs: improve noun clustering by taking subject-verb and verb-object relations into account; cluster verbs using subject-verb and verb-object relations. Deal with data sparseness and the curse of dimensionality by applying statistical analysis (LSA, principal component analysis).