Building a Large Scale Lexical Ontology for Portuguese




Nuno Seco
Linguateca, Node of Coimbra
http://linguateca.dei.uc.pt

Agenda
  Motivations
  Goals
  Ontology Extraction
  Ontology Evaluation
  Study the systematicity of polysemy in the lexicon using the ontology
  What has been done so far

Motivation
  Communication (in natural language) is a knowledge-hungry task:
    Grammatical knowledge (e.g., SVO, VSO, ...)
    Cultural knowledge
    Common sense knowledge
  If computers are to do NLP, they need knowledge.

Motivation
  Some properties of language complicate automatic processing:
    Metaphorical nature
    Context dependence
    Vagueness
    Creativity
    Diachronic change
  ... but these properties are the result of human usage, and they make language easy for humans to use!

Motivation
  So what we need is a resource* that can be used by a machine and that makes the effect of these properties explicit:
    A Lexical Ontology for Portuguese
  * Be aware that this is only a snapshot of the language at a particular point in time.

Motivation
  Two strategies are usually followed:
    Manual construction
      WordNet
      Cyc
      HowNet
    (Semi-)automatic construction
      MindNet
      KnowItAll
      PAPEL (Palavras Associadas Porto Editora Linguateca)

Motivation
  So what can be done with a lexical ontology?
    Information Retrieval
    Machine Translation
    Question Answering
    Semantic Similarity Judgments
    Concept Creation / Explanation

Goals
  Extract the semantic organization of the Portuguese lexicon (Ontology Learning, Information Extraction).
  Define a methodology for evaluating the extracted knowledge.
  Study the specific issue of systematic polysemy in Portuguese.
  Compare our model to other models of the Portuguese language (WordNet.PT and WordNet.BR).
  Make the resource publicly available.

Extracting the Structure of the Lexicon
  Can be thought of as a reverse engineering process.

What relations?
  Hyponymy; Hyperonymy
    saxofone - instrumento musical de sopro, feito de metal, recurvo, com chaves e embocadura de palheta
    is_a(saxofone, instrumento musical)
  Meronymy; Holonymy
    rim - órgão que tem a função de ...
    órgão - cada uma das partes do corpo ...
    is_a(rim, órgão) & part_of(órgão, body) -> part_of(rim, body)
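The inference rule on this slide, is_a(x, y) & part_of(y, z) -> part_of(x, z), can be sketched as a small fixed-point computation. This is only an illustration under simple assumptions: relations are stored as sets of pairs, and the function name `infer_part_of` is hypothetical, not something taken from the talk.

```python
def infer_part_of(is_a, part_of):
    """Apply is_a(x, y) & part_of(y, z) -> part_of(x, z) until nothing new appears."""
    inferred = set(part_of)
    changed = True
    while changed:
        changed = False
        for x, y in is_a:
            # Snapshot the set so we can add to it while iterating.
            for whole_part, whole in list(inferred):
                if whole_part == y and (x, whole) not in inferred:
                    inferred.add((x, whole))
                    changed = True
    return inferred


# Facts from the slide: a kidney is an organ, an organ is part of the body.
is_a = {("rim", "órgão")}
part_of = {("órgão", "body")}
facts = infer_part_of(is_a, part_of)
print(sorted(facts))
```

The loop repeats until no new pair is added, so longer is_a chains are also covered.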

What relations (cont'd)?
  Synonymy
    permutar - trocar
    syn(permutar, trocar)
  Antonymy
    infeliz - o que não é feliz
    ant(infeliz, feliz)
    irracional - não racional
    ant(irracional, racional)
  Morphological processing:
    infeliz = in + feliz
    descontente = des + contente
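As a rough sketch of the two antonymy cues shown here (a negated definition and a negative prefix), the following assumes that definitions are plain strings and that a set of known headwords is available. The regular expression, the prefix list and both function names are illustrative assumptions, not the extraction rules actually used in the project.

```python
import re

# Matches definitions such as "não racional" or "o que não é feliz".
NEGATION = re.compile(r"^(?:o que )?não (?:é )?(\w+)$")


def antonym_from_definition(word, definition):
    """Return ant(word, X) when the whole definition is a negation of X."""
    match = NEGATION.match(definition.strip())
    return ("ant", word, match.group(1)) if match else None


def antonym_from_prefix(word, lexicon, prefixes=("in", "des")):
    """Return ant(word, base) when word = prefix + base and base is a known headword."""
    for prefix in prefixes:
        base = word[len(prefix):]
        if word.startswith(prefix) and base in lexicon:
            return ("ant", word, base)
    return None


lexicon = {"feliz", "contente", "racional"}
print(antonym_from_definition("infeliz", "o que não é feliz"))
print(antonym_from_prefix("descontente", lexicon))
```

Checking the stripped base against the lexicon is what keeps words such as "inverno" from being split into in + verno.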

What relations (cont'd)?
  Causation
    matar - causar a morte a
    causa(matar, morte)
  Entailment
    ressonar - respirar com ruído durante o sono
    sono - estado de quem dorme
    entails(ressonar, dormir)
  Cross part-of-speech relations
    informatização - acto ou efeito de informatizar
    nominalization(informatizar, informatização)
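The causation and nominalization examples can be read as definition-initial patterns that map onto relations. The sketch below assumes exactly that; the rule table `RULES`, the regular expressions and the function `extract_relation` are hypothetical and cover only the two examples on the slide.

```python
import re

# (relation name, pattern, whether the headword is the second argument)
RULES = [
    ("nominalization", re.compile(r"^acto ou efeito de (\w+)"), True),
    ("causa", re.compile(r"^causar (?:a |o )?(\w+)"), False),
]


def extract_relation(headword, definition):
    """Return (relation, arg1, arg2) for the first rule that matches, else None."""
    for name, pattern, headword_second in RULES:
        match = pattern.match(definition.strip())
        if match:
            target = match.group(1)
            if headword_second:
                return (name, target, headword)
            return (name, headword, target)
    return None


print(extract_relation("matar", "causar a morte a"))
print(extract_relation("informatização", "acto ou efeito de informatizar"))
```

Entailment is left out on purpose: the slide's example needs two definitions to be combined (ressonar and sono), which a single-pattern rule cannot express.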

Extracting the Structure of the Lexicon
  árvore -- planta lenhosa que pode atingir grandes alturas e cujo tronco se ramifica na parte superior
  árvore (tree)
    => planta lenhosa (woody plant)
    => organismo (organism)
    => ser vivo (living thing)
    => ente (entity)
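Once each headword has been linked to the genus term of its definition, a chain like the one above can be recovered by following is_a links upward. A minimal sketch, assuming one hypernym per term stored in a dictionary (the name `hypernym_chain` is illustrative):

```python
def hypernym_chain(term, is_a):
    """Follow is_a links from term to the root, stopping if a term repeats."""
    chain = [term]
    seen = {term}
    while chain[-1] in is_a:
        parent = is_a[chain[-1]]
        if parent in seen:  # guard against circular definitions
            break
        chain.append(parent)
        seen.add(parent)
    return chain


# Links taken from the slide's example.
is_a = {
    "árvore": "planta lenhosa",
    "planta lenhosa": "organismo",
    "organismo": "ser vivo",
    "ser vivo": "ente",
}
print(" => ".join(hypernym_chain("árvore", is_a)))
```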

Structure the Lexicon (Simple English example)
  tree -- a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms.
  tree
    => woody plant
    => vascular plant
    => plant
    => organism
    => living thing
    => physical object
    => entity
  Taken from WordNet 2.1

Ontology Evaluation
  Evaluation has received very little attention!
  Still, we can identify 4 core kinds:
    The use of a golden collection
    Evaluating the output of some ontology-driven process
    Comparing the ontology with clusters generated from corpora
    Human evaluation

Using a Golden Collection
  [Figure: candidate ontologies A, B and C are matched against a golden collection by lexical and relational alignment. Where is the best output?]

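Using a Golden Collection (cont'd)
  At the lexical level (terms in common): Precision, Recall, F-Measure, ...
  Reading O_1 as the term set of the ontology under evaluation and O_2 as that of the golden collection, precision (Pr) and recall (Abr, abrangência) are:

  $$\mathrm{Pr} = \frac{|O_1 \cap O_2|}{|O_1|}, \qquad \mathrm{Abr} = \frac{|O_1 \cap O_2|}{|O_2|}$$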
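The lexical-level comparison reduces to set operations over the two term inventories. The sketch below uses the standard definitions of precision, recall and F-measure and assumes the first set is the ontology under evaluation and the second the golden collection; the term sets are toy data, not figures from the talk.

```python
def lexical_overlap(o1_terms, o2_terms):
    """Precision, recall and F-measure of o1_terms against the gold o2_terms."""
    common = o1_terms & o2_terms
    precision = len(common) / len(o1_terms) if o1_terms else 0.0
    recall = len(common) / len(o2_terms) if o2_terms else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy term sets for illustration only.
extracted = {"árvore", "planta", "rim", "banco"}
gold = {"árvore", "planta", "órgão"}
precision, recall, f_measure = lexical_overlap(extracted, gold)
print(precision, recall, f_measure)
```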

Using a Golden Collection (cont'd)
  At the relational (hyperonymy/hyponymy) level (Maedche et al., 2002)
  [Figure: two example taxonomies. O1 contains Animal, Mamífero, Réptil, Ruminante, Carnívoro, Gato and Cão; O2 contains Animal, Mamífero, Réptil, Cão, Gato and Cocker.]
  TO(cão, O1, O2) = 3/5
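One common formulation of taxonomic overlap compares, for a given concept, the set made of the concept itself plus all its super- and sub-concepts in each ontology, and divides the size of the intersection by the size of the union. The sketch below follows that formulation. The edges of the two taxonomies are an assumed reconstruction from the node labels on the slide (the figure itself did not survive), chosen because they reproduce the stated value of 3/5.

```python
from fractions import Fraction


def related_concepts(concept, parent_of):
    """The concept plus all of its ancestors and descendants in one taxonomy."""
    related = {concept}
    node = concept
    while node in parent_of:  # walk up to the root
        node = parent_of[node]
        related.add(node)
    frontier = [concept]
    while frontier:  # walk down through all children
        current = frontier.pop()
        for child, parent in parent_of.items():
            if parent == current and child not in related:
                related.add(child)
                frontier.append(child)
    return related


def taxonomic_overlap(concept, parent_of_1, parent_of_2):
    """Size of the intersection over size of the union of the two concept sets."""
    set_1 = related_concepts(concept, parent_of_1)
    set_2 = related_concepts(concept, parent_of_2)
    return Fraction(len(set_1 & set_2), len(set_1 | set_2))


# Assumed edges (child -> parent); only the node labels come from the slide.
O1 = {"mamífero": "animal", "réptil": "animal", "ruminante": "mamífero",
      "carnívoro": "mamífero", "gato": "carnívoro", "cão": "carnívoro"}
O2 = {"mamífero": "animal", "réptil": "animal", "cão": "mamífero",
      "gato": "mamífero", "cocker": "cão"}
overlap = taxonomic_overlap("cão", O1, O2)
print(overlap)
```

Under these assumed edges, the shared concepts are cão, mamífero and animal, while carnívoro appears only in O1 and cocker only in O2.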

Evaluate the Output of an Ontology Dependent Application
  [Figure: candidate ontologies A, B and C are each fed to an ontology dependent application. Where is the best output?]

Evaluate the Output of an Ontology Dependent Application (cont'd)
  Semantic similarity computations using ontologies, correlated with human judgments.
  Query expansion in information retrieval systems.
  Knowledge Discovery and Management Group

Use clustering strategies (coarse evaluation)
  [Figure: candidate ontologies A, B and C are compared using well known (and acknowledged) clustering algorithms. Where is the best output?]

Use clustering strategies (coarse evaluation)
  Brewster et al., 2004
  [Figure: Domain A with Topic 1 and Topic 2; Domain A with Topic 3 and Topic 4.]

Human evaluation
  [Figure: candidate ontologies A, B and C are judged by human evaluators.]

Human Evaluation (cont'd)
  To ease the evaluators' task, one could show the definitions for each (new) concept in the ontology (Navigli et al.):
    festival - a day or period of time set aside for feasting and celebration
    jazz - a style of dance music popular in the 1920s; similar to New Orleans jazz but played by large bands
    jazz festival - a kind of festival, a day or period of time set aside for feasting and celebration, related to jazz, a style of dance music popular in the 1920s

How can I evaluate my work?
  Manual inspection!
  Compare to other resources being constructed:
    Luís Sarmento (Linguateca, Porto), extracting relations from corpora.
    Marcírio Chaves (Linguateca, Lisboa), creating a geographical ontology.
  Feed the ontology to ongoing projects:
    AI Lab - ReBuilder
    Linguateca, Oslo - Esfinge

Word senses: Polysemy vs. Homonymy
  An individual word or phrase can be used (in different contexts) to express two or more different meanings.
  Polysemy - senses are related in some way (complementary).
    School starts at 8:30.
    The School was founded in 1910.
  Homonymy - senses are unrelated (contrastive).
    The bank has several offices.
    We walked along the bank of the river.

Systematic Polysemy
  "Polysemy of word A with meanings a_i and a_j is regular [systematic] if there exists at least one other word B with meanings b_i and b_j which are semantically distinguished from each other in exactly the same way as a_i and a_j, and if a_i and b_i, and a_j and b_j, are nonsynonymous."
  Ju. Apresjan (1974)

Some examples
  Habitante/Língua (Inhabitant/Language): norueguês, português, escocês, ... (68)
  Fabricante/Vendedor (Producer/Seller): pasteleiro, ourives, queijeiro, ... (57)
  Abertura/Acto (Opening/Act): vista, entrada, perfuração, ... (11)
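Patterns like these can be mined by counting how many words share the same pair of sense categories. The sketch below is a simplification under stated assumptions: each word is already mapped to a set of sense categories, a pattern counts as systematic when at least two words share it, and the nonsynonymy condition of Apresjan's definition is not checked. The function name and the categories given for "banco" are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def polysemy_patterns(word_senses, min_words=2):
    """Map each pair of sense categories to the words exhibiting both senses."""
    pattern_words = defaultdict(set)
    for word, categories in word_senses.items():
        for pair in combinations(sorted(categories), 2):
            pattern_words[pair].add(word)
    # Keep only the pairs shared by enough distinct words.
    return {pair: words for pair, words in pattern_words.items()
            if len(words) >= min_words}


word_senses = {
    "norueguês": {"habitante", "língua"},
    "português": {"habitante", "língua"},
    "pasteleiro": {"fabricante", "vendedor"},
    "ourives": {"fabricante", "vendedor"},
    "banco": {"assento", "instituição"},  # hypothetical categories, a one-off pair
}
patterns = polysemy_patterns(word_senses)
for pair, words in sorted(patterns.items()):
    print(pair, len(words))
```

The count printed per pair plays the same role as the numbers in parentheses on the slide.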

Role of Systematic Polysemy
  "Acknowledging the systematic nature of polysemy and its relationship to underspecified representations allows one to structure ontologies for semantic processing more efficiently, generating more appropriate interpretations within context."
  Paul Buitelaar (1998)

Progress so far
  Studying the physical format of the Porto Editora dictionary, Dicionário da Língua Portuguesa.
  Looking for frequent patterns that indicate interesting relations.
  Parsing the definitions with some of these patterns to obtain a taxonomic structure for the lexicon.
  Preliminary mining of systematic polysemy patterns.


The Dictionary in Numbers
  Porto Editora's Dictionary (open class words)

  Part of speech   Entries   Senses
  Nouns              61980   110451
  Verbs              12378    35439
  Adjectives         26524    44281
  Adverbs             1280     2299

The Dictionary in Numbers
  Frequent patterns in noun definitions:
    acto ou efeito de (3851)
    pessoa que (1386)
    indivíduo (1235)
    aquele que (1148)
    parte (1052)
    conjunto de (1004)
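Counts like these could be gathered by testing each definition against a list of candidate patterns. The sketch below assumes that a pattern must open the definition and end on a word boundary; whether the reported figures were counted that way is not stated in the talk. Apart from the first definition, the sample definitions are made up for illustration.

```python
import re
from collections import Counter

NOUN_PATTERNS = ["acto ou efeito de", "pessoa que", "indivíduo",
                 "aquele que", "parte", "conjunto de"]


def count_patterns(definitions, patterns=NOUN_PATTERNS):
    """Count how many definitions begin with each pattern."""
    compiled = [(p, re.compile(re.escape(p) + r"\b")) for p in patterns]
    counts = Counter()
    for definition in definitions:
        text = definition.strip().lower()
        for pattern, regex in compiled:
            if regex.match(text):
                counts[pattern] += 1
                break  # at most one pattern per definition
    return counts


definitions = [
    "acto ou efeito de informatizar",
    "acto ou efeito de trocar",
    "pessoa que vende queijo",
    "conjunto de árvores",
    "parte superior do tronco",
    "planta lenhosa",
]
counts = count_patterns(definitions)
print(counts.most_common())
```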

The Dictionary in Numbers
  Frequent patterns in verb definitions:
    fazer (1680)
    tornar (1359)
    tirar (744)
    pôr (674)
    causar (299)
    estar (284)

The Dictionary in Numbers
  Frequent patterns in adjective definitions:
    que tem (2698)
    que ou aquele que (1393)
    relativo a/ao/à (1236+725+1162)
    relativo ou pertencente (647)
    que ou o que (527)
    que diz respeito (494)

The Dictionary in Numbers
  Frequent patterns in adverb definitions:
    de modo (393)
    de maneira (48)
    do ponto de vista (28)
    por meio de (14)

Some difficult issues
  Finding the right sense of a word in the definition:
    arquibancada - banco grande cujo assento ...
    Which sense of banco?
  Circularity:
    passagem - transição de um ...
    transição - passagem que comporta ...
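Circular definitions such as passagem/transição show up as cycles when each headword is linked to the genus term of its definition. A minimal detection sketch, assuming one genus term per headword stored in a dictionary (the name `find_cycle` is illustrative):

```python
def find_cycle(genus_of, start):
    """Follow genus links from start; return the cycle as a list, or None."""
    path = [start]
    position = {start: 0}
    current = start
    while current in genus_of:
        current = genus_of[current]
        if current in position:
            # Close the loop by repeating the first term of the cycle.
            return path[position[current]:] + [current]
        position[current] = len(path)
        path.append(current)
    return None


genus_of = {
    "passagem": "transição",
    "transição": "passagem",
    "árvore": "planta lenhosa",
}
print(find_cycle(genus_of, "passagem"))
print(find_cycle(genus_of, "árvore"))
```

A cycle found this way only flags the problem; deciding which link to drop still requires picking the right sense of the genus term.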

Complementary Studies
  Extracted from the Portuguese dictionary:
    árvore (tree) => planta lenhosa (woody plant) => organismo (organism) => ser vivo (living thing) => ente (entity)
  Taken from WordNet 2.1:
    tree => woody plant => vascular plant => plant => organism => living thing => physical object => entity