DanNet From Dictionary to Wordnet




DanNet: From Dictionary to Wordnet
Jörg Asmussen, Society for Danish Language and Literature (DSL), Copenhagen
Bolette Sandford Pedersen, Centre for Language Technology (CST), University of Copenhagen
Lars Trap-Jensen, Society for Danish Language and Literature (DSL), Copenhagen

Outline
1. Introduction (LTJ, 2 min.)
2. Characteristics of the DDO (LTJ, 5 min.)
3. Building DanNet (BSP, 8 min.)
4. Extraction of differentia info (JA, 7 min.)
5. Conclusions (JA, 2 min.)

DanNet
Lexical-semantic wordnet for Danish
Joint project: Society for Danish Language and Literature and Centre for Language Technology, University of Copenhagen
4 years (2005–2008), ~ 400,000

Limited resources
Adapt an existing wordnet?
Or reuse other lexical-semantic resources: SIMPLE-DK; Den Danske Ordbog (DDO)

Den Danske Ordbog
Published by DSL, 2003–5
Corpus-based (DDOC)
60,000 entries
Spelling, morphology, pronunciation, meaning, collocations, fixed phrases, syntax, usage, word formation, etymology
Words edited in related groups
Machine readable
Fine-grained microstructure
100,000 definitions

Semantic description
Systematic domain info: concerns relation
Sense definition: relevant info manually extracted
Hyperonym
Sense relations, i.e. synonyms
Collocational information
Authentic example

Definitions in the DDO
Definition scheme:
Genus proximum, the closest hyperonym: apparat 'technical device'
Differentia specifica, the distinctive feature: the remaining part of the definition

Building DanNet
Extract definitions and genus specifications
Include them in the DanNet tool
Use it for domain-wise development of data:
1. Homonymy and polysemy
2. Establishing synsets
3. Adjusting the hierarchical structure

Homonymy & polysemy
celle 'cell' is genus proximum of gærcelle 'yeast cell' and fængselscelle 'prison cell'
Convert lexical expressions into concepts: celle-1 'part of living organism', celle-2 'small room'

Establishing synsets
One synset
lære 'studies', fag 'subject', videnskab 'science', informatik 'informatics', bromatologi 'nutrition science', samfundsfag 'social studies', datalogi 'computer science'

Building the hierarchy
Hyponymy is generally defined as: X is a Y
Taxonymy is a subtype of this: X is a kind/type of Y
Cf. Cruse, 1991 and 2002

Example: Hyponymy?
træ 'tree': vejtræ 'roadside tree'; kirsebærtræ 'cherry tree', birketræ 'birch'
Labels: Orthogonal; Hyponymy

Building the hierarchy
TOP, genstand 'object', møbel 'furniture', siddemøbel 'sitting furniture', stol 'chair'

Building the hierarchy
TOP, genstand 'object', møbel 'furniture', indbo/bohave 'household effects', siddemøbel 'sitting furniture', stol 'chair'


Definition composition
Genus selection: a conscious process
Differentia: no editorial specifications, i.e. neither a fixed definition vocabulary nor a fixed syntax
Consequences for DanNet:
Complicates computational exploitation
Semantic relations are coded manually

Coding relations
What is done manually: no semantic info other than that of the DDO; reduction of semantic info
What is done automatically: inheritance of relations from hyperonyms
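The automatic step, inheritance of relations from hyperonyms, can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Synset` class, the `effective_relations` helper and the relation values attached to the furniture example are invented here and do not reflect the actual DanNet tool or its data.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Synset:
    """Hypothetical, minimal stand-in for a wordnet concept."""
    name: str
    hyperonym: Optional["Synset"] = None
    relations: Dict[str, str] = field(default_factory=dict)


def effective_relations(synset: Synset) -> Dict[str, str]:
    """Collect relations along the hyperonym chain.

    Relations coded on a more specific synset override those
    inherited from a more general one.
    """
    chain = []
    node = synset
    while node is not None:
        chain.append(node)
        node = node.hyperonym
    merged: Dict[str, str] = {}
    for node in reversed(chain):  # most general first
        merged.update(node.relations)
    return merged


# Illustrative data only: the relation values are assumptions.
mobel = Synset("møbel")
siddemobel = Synset("siddemøbel", mobel, {"for_purpose_of": "sidde"})
stol = Synset("stol", siddemobel, {"has_part": "ryglæn"})

print(effective_relations(stol))
```

In this sketch stol only needs its own distinguishing relation coded by hand; the purpose relation comes from siddemøbel.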

Extraction of telic role
fjernsyn 'tv set': box-shaped device that can receive tv signals and transform them into animated pictures on a screen and accompanying sound in the speakers of the device
Labels on the definition: genus expression; telic role (VPs headed by can)

Hypothesis
VPs in a relative clause which are headed by kan 'can' specify the telic role (i.e. the for_purpose_of relation) of the definiendum

Corpus query
Find all definitions with genus apparat followed by der or som, followed by kan, followed by a word ending in e
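The corpus query can be approximated with a regular expression. The following is a minimal sketch, assuming the definitions are available as plain strings; the function name and the sample definitions are invented for illustration and are not taken from the DDO or from the query tool actually used.

```python
import re

# apparat + (der | som) + kan + a word ending in -e (candidate infinitive)
TELIC_PATTERN = re.compile(
    r"\bapparat\s+(?:der|som)\s+kan\s+(\w+e)\b",
    re.IGNORECASE,
)


def telic_candidates(definition: str) -> list:
    """Return the words ending in -e that directly follow 'kan'."""
    return TELIC_PATTERN.findall(definition)


# Invented sample definitions, for illustration only.
samples = [
    "kasseformet apparat der kan modtage tv-signaler",
    "apparat som kan måle tryk",
    "apparat som normalt kan optage lyd",  # interposed word: no match
]

for text in samples:
    print(text, "->", telic_candidates(text))
```

Because the pattern requires the words to be strictly adjacent, a definition with any material between som and kan is missed, as in the third sample.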

Results of corpus query
[Concordance screenshot; labels: query, VP heads denoting telic role, dictionary entries]
Only 26 occurrences of this pattern, but 203 apparat definitions

Why this bad coverage?
1. Definitions where the pattern contains interposed material are not captured
2. Structural patterns other than the one given in our hypothesis also indicate a for_purpose_of relation

Further patterns (head: for_purpose_of)
1. GE that can VP-inf
2. GE that is used for to VP-inf with
3. GE for to VP-inf with/on/in
4. GE that VP-fin
5. GE for NP
6. GE that is specially designed for to VP-inf
These patterns capture 70% of the apparat definitions

A statistical approach
Frequency list of types in definitions with genus apparat
compared with
frequency list of types in all definitions
using a statistical test (e.g. log likelihood)
Salient types are listed for investigation and may give hints on semantic relations
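The comparison of the two frequency lists can be sketched as below. This is an assumption-laden illustration, not the procedure actually used: it applies the common two-corpus log-likelihood (G2) formula, treats the two lists as independent samples even though the apparat definitions are part of all definitions, and uses invented function names and toy tokens.

```python
import math
from collections import Counter


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """G2 for one type: a, b = its frequency in the sub-list and the
    reference list; c, d = total token counts of the two lists."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the sub-list
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def salient_types(sub_tokens, ref_tokens, top=10):
    """Types over-represented in sub_tokens, ranked by G2."""
    sub, ref = Counter(sub_tokens), Counter(ref_tokens)
    n_sub, n_ref = sum(sub.values()), sum(ref.values())
    scored = []
    for word, a in sub.items():
        b = ref[word]
        if a / n_sub > b / n_ref:  # keep over-represented types only
            scored.append((log_likelihood(a, b, n_sub, n_ref), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top]]


# Toy data, invented for illustration.
apparat_tokens = ["måle"] * 5 + ["der"] * 5
all_tokens = apparat_tokens + ["der"] * 45 + ["hus"] * 45
print(salient_types(apparat_tokens, all_tokens))
```

A function word such as der, which is equally frequent in both lists, drops out, while a content word concentrated in the apparat definitions is kept for the editor to inspect.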

Some salient types
afspille 'to play back', afspilning 'playback'
måle 'measure', måling 'gauging', måler 'measuring tool', målinger 'measurements'
grammofon, cd-afspiller, afspiller, sequencer, diktafon, kassettespiller, hjemmevideo, kassettebåndoptager, båndoptager
stroboskop, måler, timer, løgnedetektor, ekkolod, gasmåler, speedometer, omdrejningstæller, benzinmåler, fotofælde, elmåler, trykmåler, luxmeter, spirometer, gyrometer, alkometer, newtonmeter, magnetometer, instrument, måleinstrument, kalorimeter, radiosonde, satellit, fartskriver

Automatic extraction?
Basically NO... Developing reliable methods is too expensive!
Structural and lexical properties of definitions differ considerably
Difficult to automatically extract semantic relations from definitions
Concordances and lists of salient definition types may help the editor
But the DanNet editor still has to do the core job of analysing dictionary definitions

Conclusion: Reusing the DDO
Cheap: semi-automatic exploitation of the dictionary structure (hyponymy structure, synonym/antonym info)
Expensive: automatic exploitation of definitions proper to find other semantic relations

Conclusion: The DanNet approach
Cheap or expensive?
Translation/expansion of existing WNs? Better coherence with other WNs; linguistic bias
Reusing/merging language resources? More loyal to the specific language; expensive, unless based on an existing resource, i.e. a dictionary