DanNet: From Dictionary to Wordnet. Jörg Asmussen (Society for Danish Language and Literature, DSL, Copenhagen); Bolette Sandford Pedersen (Centre for Language Technology, CST, University of Copenhagen); Lars Trap-Jensen (Society for Danish Language and Literature, DSL, Copenhagen)
Outline 1. Introduction (LTJ, 2 min) 2. Characteristics of the DDO (LTJ, 5 min) 3. Building DanNet (BSP, 8 min) 4. Extraction of differentia info (JA, 7 min) 5. Conclusions (JA, 2 min)
DanNet: Lexical-semantic wordnet for Danish. Joint project: Society for Danish Language and Literature; Centre for Language Technology, University of Copenhagen. 4 years (2005-2008), ~ 400,000
Limited resources: Adapt an existing wordnet? Or reuse other lexical-semantic resources: SIMPLE-DK; Den Danske Ordbog, DDO
Outline 1. Introduction 2. Characteristics of the DDO 3. Building DanNet 4. Extraction of differentia info from definitions 5. Conclusions
Den Danske Ordbog: Published by DSL 2003-5. Corpus-based, DDOC. 60,000 entries. Spelling, morphology, pronunciation, meaning, collocations, fixed phrases, syntax, usage, word formation, etymology
Den Danske Ordbog: Words edited in related groups. Machine readable. Fine-grained microstructure. 100,000 definitions
Semantic description
Semantic description Systematic domain info concerns relation
Semantic description Sense definition relevant info manually extracted
Semantic description Hyperonym
Semantic description Sense relations, i.e. synonyms
Semantic description Collocational information
Semantic description Authentic example
Definitions in the DDO. Definition scheme: Genus proximum (closest hyperonym): apparat 'technical device'. Differentia specifica (distinctive feature): remaining part of the definition
Outline 1. Introduction 2. Characteristics of the DDO 3. Building DanNet 4. Extraction of differentia info from definitions 5. Conclusions
Building DanNet: Extract definitions and genus specifications. Include them in the DanNet tool. Use it for domain-wise development of data: 1. Homonymy and polysemy 2. Establishing synsets 3. Adjusting the hierarchical structure
Homonymy & polysemy: celle 'cell' is genus proximum of gærcelle 'yeast cell' and fængselscelle 'prison cell'. Convert lexical expressions into concepts: celle-1 'part of living organism', celle-2 'small room'
Establishing synsets: lære 'studies', fag 'subject', videnskab 'science', informatik 'informatics', bromatologi 'nutrition science', samfundsfag 'social studies', datalogi 'computer science'
Establishing synsets: One synset. lære 'studies', fag 'subject', videnskab 'science', informatik 'informatics', bromatologi 'nutrition science', samfundsfag 'social studies', datalogi 'computer science'
Building the hierarchy: Hyponymy is generally defined as "X is a Y". Taxonymy is a subtype of this: "X is a kind/type of Y". Cf. Cruse, 1991 and 2002
Example: Hyponymy? træ 'tree': kirsebærtræ 'cherry tree', birketræ 'birch', vejtræ 'roadside tree'
Example: Hyponymy? træ 'tree': vejtræ 'roadside tree' (Orthogonal); kirsebærtræ 'cherry tree', birketræ 'birch' (Hyponymy)
Building the hierarchy: TOP, genstand 'object', møbel 'furniture', siddemøbel 'sitting furniture', stol 'chair'
Building the hierarchy: TOP, genstand 'object', møbel 'furniture', indbo/bohave 'household effects', siddemøbel 'sitting furniture', stol 'chair'
Definition composition. Genus selection: a conscious process. Differentia: No editorial specifications, i.e. neither a fixed definition vocabulary nor a fixed syntax. Consequences for DanNet: Complicates computational exploitation. Semantic relations are coded manually
Coding relations. What is done manually: No semantic info other than that of DDO. Reduction of semantic info. What is done automatically: Inheritance of relations from hyperonyms
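The automatic step, inheritance of relations from hyperonyms, can be illustrated with a minimal sketch. The hyperonym chain is the furniture example from the hierarchy slides; the relation value coded on siddemøbel and the names HYPERONYM, MANUAL_RELATIONS and inherited_relations are hypothetical and only serve the illustration. This is not the DanNet tool's actual implementation.

```python
from typing import Dict, Optional

# Hyperonym links from the furniture example (stol -> siddemøbel -> møbel -> genstand).
HYPERONYM: Dict[str, Optional[str]] = {
    "stol": "siddemøbel",
    "siddemøbel": "møbel",
    "møbel": "genstand",
    "genstand": None,
}

# Manually coded relations. The value below is a hypothetical example,
# not taken from DanNet.
MANUAL_RELATIONS: Dict[str, Dict[str, str]] = {
    "siddemøbel": {"for_purpose_of": "sidde"},
}


def inherited_relations(concept: str) -> Dict[str, str]:
    """Collect relations from the concept and all of its hyperonyms."""
    chain = []
    node: Optional[str] = concept
    while node is not None:
        chain.append(node)
        node = HYPERONYM.get(node)
    relations: Dict[str, str] = {}
    # Apply the most general concept first so that a more specific
    # concept can override what it inherits.
    for node_name in reversed(chain):
        relations.update(MANUAL_RELATIONS.get(node_name, {}))
    return relations


print(inherited_relations("stol"))
```

With this layout the editor codes a relation once on the hyperonym, and every hyponym below it picks the relation up without further manual work.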
Outline 1. Introduction 2. Characteristics of the DDO 3. Building DanNet 4. Extraction of differentia info from definitions 5. Conclusions
Extraction of telic role: fjernsyn 'tv set' (callout: genus expression): box-shaped device that can receive tv signals and transform them into animated pictures on a screen and accompanying sound in the speakers of the device. Telic role: VPs headed by can
Hypothesis VPs in a relative clause which are headed by kan can specify the telic role (i.e. the for_purpose_of relation) of the definiendum
Hypothesis: VPs in a relative clause which are headed by kan can specify the telic role (i.e. the for_purpose_of relation) of the definiendum. Corpus query: Find all definitions with genus apparat followed by der or som, followed by kan, followed by a word ending in e
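The corpus query can be approximated with a regular expression. This is a minimal sketch, assuming the definitions are available as plain strings; the sample definitions in the test are invented for illustration and are not quoted from the DDO, and the names TELIC_QUERY and telic_head are hypothetical.

```python
import re
from typing import Optional

# genus "apparat", then "der" or "som", then "kan", then a word ending in "e"
# (a rough stand-in for a Danish infinitive).
TELIC_QUERY = re.compile(r"\bapparat\s+(?:der|som)\s+kan\s+(\w+e)\b")


def telic_head(definition: str) -> Optional[str]:
    """Return the candidate VP head if the definition matches the query."""
    match = TELIC_QUERY.search(definition.lower())
    return match.group(1) if match else None


# Hypothetical definitions, written only to exercise the pattern.
print(telic_head("apparat der kan måle temperatur"))
print(telic_head("apparat der automatisk kan måle tryk"))
```

The second call returns None: any word between the relative pronoun and kan breaks the match, because the query requires the elements to be strictly adjacent.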
Results of corpus query (callouts: query; VP heads denoting telic role; dictionary entries). Only 26 occurrences of this pattern, but 203 apparat definitions
Why this bad coverage? 1. Definitions where the pattern contains interposed material are not captured 2. Structural patterns other than the one given in our hypothesis can also indicate a for_purpose_of relation
Further patterns (GE = genus expression; callouts: head, for_purpose_of): 1. GE that can VP-inf 2. GE that is used for to VP-inf with 3. GE for to VP-inf with/on/in 4. GE that VP-fin 5. GE for NP 6. GE that is specially designed for to VP-inf. These patterns capture 70% of the apparat definitions
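The six patterns can be tried in a fixed order, most specific first, so that "that can" is not swallowed by the more general "that VP-fin". The sketch below is an assumption-laden illustration: it works on the English glosses as they appear on the slide, whereas a real query would run on the Danish definitions; it uses `\w+` as a crude stand-in for a VP or NP; and the names PATTERNS, classify and coverage are hypothetical.

```python
import re
from typing import List, Optional

# Ordered from most to least specific. Labels repeat the slide's numbering.
PATTERNS = [
    ("6. GE that is specially designed for to VP-inf",
     re.compile(r"^that is specially designed for to \w+")),
    ("2. GE that is used for to VP-inf with",
     re.compile(r"^that is used for to \w+")),
    ("1. GE that can VP-inf", re.compile(r"^that can \w+")),
    ("3. GE for to VP-inf with/on/in", re.compile(r"^for to \w+")),
    ("4. GE that VP-fin", re.compile(r"^that \w+")),
    ("5. GE for NP", re.compile(r"^for \w+")),
]


def classify(differentia: str) -> Optional[str]:
    """Return the first matching pattern label for the text after the GE."""
    text = differentia.strip().lower()
    for label, regex in PATTERNS:
        if regex.match(text):
            return label
    return None


def coverage(differentiae: List[str]) -> float:
    """Share of differentiae captured by at least one pattern."""
    if not differentiae:
        return 0.0
    matched = sum(1 for d in differentiae if classify(d) is not None)
    return matched / len(differentiae)


# The first gloss is from the tv set slide; the others are invented.
samples = [
    "that can receive tv signals",
    "for to measure speed with",
    "with a rotating disc",
]
for sample in samples:
    print(sample, "->", classify(sample))
print(coverage(samples))
```

A coverage figure computed this way only says that a pattern matched, not that the extracted head really denotes the purpose, so the matches still need an editor's check.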
A statistical approach: Frequency list of types in definitions with genus apparat, compared with frequency list of types in all definitions, using a statistical test (e.g. log likelihood). Salient types are listed for investigation and may give hints on semantic relations
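The comparison of the two frequency lists can be sketched with the log-likelihood statistic commonly used for keyword extraction. This is an illustration under assumptions: the slides name the test only as an example and do not give the formula used, the token lists are invented, and the reference list here contains the focus list (as "all definitions" would), which departs slightly from the textbook setup with two disjoint corpora. The names log_likelihood and salient_types are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """a, b: frequency of a type in focus/reference; c, d: corpus sizes."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the focus corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def salient_types(focus: Sequence[str], reference: Sequence[str],
                  top_n: int = 10) -> List[Tuple[str, float]]:
    """Types over-represented in the focus corpus, most salient first."""
    if not focus or not reference:
        return []
    focus_counts, ref_counts = Counter(focus), Counter(reference)
    c, d = len(focus), len(reference)
    scored = []
    for word, a in focus_counts.items():
        b = ref_counts[word]
        if a / c > b / d:  # keep only over-represented types
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Invented toy data: tokens of "apparat" definitions vs. all definitions.
focus = ["måle", "måle", "apparat", "der"]
reference = focus + ["der"] * 5 + ["og"] * 5 + ["en"] * 6
print(salient_types(focus, reference))
```

Function words such as der drop out because they are no more frequent in the focus list than in the reference list, which leaves content words like måle at the top for the editor to inspect.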
Some salient types: afspille 'to play back', afspilning 'play back', måle 'measure', måling 'gauging', måler 'measuring tool', målinger 'measurements'. Dictionary entries: grammofon, cd-afspiller, afspiller, sequencer, diktafon, kassettespiller, hjemmevideo, kassettebåndoptager, båndoptager; stroboskop, måler, timer, løgnedetektor, ekkolod, gasmåler, speedometer, omdrejningstæller, benzinmåler, fotofælde, elmåler, trykmåler, luxmeter, spirometer, gyrometer, alkometer, newtonmeter, magnetometer, instrument, måleinstrument, kalorimeter, radiosonde, satellit, fartskriver
Automatic extraction? Basically NO... Developing reliable methods is too expensive!
Automatic extraction? Structural and lexical properties of definitions differ considerably. Difficult to automatically extract semantic relations from definitions. Concordances and lists of salient definition types may help the editor. But the DanNet editor still has to do the core job of analysing dictionary definitions
Outline 1. Introduction 2. Characteristics of the DDO 3. Building DanNet 4. Extraction of differentia info from definitions 5. Conclusions
Conclusion: Reusing the DDO. Cheap: Semi-automatic exploitation of the dictionary structure (hyponymy structure, synonym/antonym info). Expensive: Automatic exploitation of definitions proper to find other semantic relations
Conclusion: The DanNet approach (scale: Cheap to Expensive). Translation/expansion of existing WNs? Cheap. Better coherence with other WNs. Linguistic bias. Reusing/merging language resources? More loyal to the specific language. Expensive, unless based on an existing resource, i.e. a dictionary