The PALAVRAS parser and its Linguateca applications - a mutually productive relationship




Eckhard Bick, University of Southern Denmark, eckhard.bick@mail.dk

Outline
  Flow chart: Linguateca <-> PALAVRAS
  History and milestones, pre- and co-Linguateca
  Parallel development strands
  A recent example of resource integration: QuickDict
    Público + Folha + Dependency Parsing + Statistics = Lexicography

[Flow chart: Linguateca <-> PALAVRAS. Boxes and labels recoverable from the slide: raw corpus, PALAVRAS parser, annotated corpus, AC/DC, Floresta, COMPARA, HAREM, VISL, Linguateca; teaching, research, form/function category standardisation, QA, NER, tool integration, debugging, annotation (PoS, syntax, sem, ...), gold revision, format filters (xml, tokenization), data challenges, evaluation.]

History of the parser (pre-Linguateca)
  1990-93: Portuguese lexicography (master's thesis)
  1992-93: Morphological analyzer for Portuguese
  1994-96: Development of the first version of PALAVRAS' Constraint Grammar modules (morphology and syntax)
  1996: Presentation at the 2nd Propor in Curitiba, Brazil
  1997: Focus on tree structures and corpus work (Borba-Ramsey, VEJA, NILC)
  1998-99: Experiments with heuristics, transcribed speech (NURC), dialectal variation (Cordial-syn, Moçambique) and historical data (Tycho Brahe)
  1997-?: PALAVRAS and derived systems for other languages used for syntax teaching in Denmark (VISL)
  2000: Dr.phil. thesis centering on PALAVRAS

History of the parser (co-Linguateca) - corpus annotation
  1999-?: AC/DC project (e.g. Santos & Bick 2000, LREC)
    both Portuguese (esp. Público) and Brazilian (Folha) data
    quantitatively mostly newspaper genre
    mostly unrevised annotation, but feedback and multiple reannotations
  2000-?: Floresta Sintá(c)tica project (e.g. Afonso et al. 2002, LREC)
    3 phases: (a) Odense-Oslo, (b) multipolar international, (c) Portugal
    automatic annotation with human linguistic revision
    (a-b) PALAVRAS + PSG (Bick 2003, CL), two-pass annotation and revision
    (c) PALAVRAS + DG (Bick 2005, TLT), one-pass revision
  2004: PALAVRAS-Linguateca license at SINTEF, Oslo
    raw and edited annotation (e.g. COMPARA)

History of the parser (co-Linguateca) - joint evaluation
  2003: Morfolimpíadas/Avalon (Santos 2007)
    PALMORF, an adapted morphology module from PALAVRAS, achieves top results
  2005: 1st HAREM (evaluation meeting 2006, Porto)
    PALAVRAS-NER (Bick 2006, Propor Itatiaia), with separate modules for name recognition and classification, participates with good results (best F-scores)
  2008: 2nd HAREM
    the SeReLep system (by PUCRS) integrates a licensed PALAVRAS

Other development strands
  Question answering
    A prototype using PALAVRAS syntactic analysis (Bick 2003, EPIA)
    PALAVRAS used as a syntactic module in other systems:
      University of Évora (Quaresma et al. 2004, CLEF)
      Esfinge (Costa 2006, CLEF)
  Historical Portuguese
    Parser and lexicon adaptations for 18th and 19th century Brazilian Portuguese (Bick & Módolo 2005)
  Semantic annotation beyond NER
    Semantic prototype annotation (used for MT and Floresta)
    Semantic role annotation (Bick 2007, TIL)
  CG3, a new Constraint Grammar formalism and compiler
    direct creation and referencing of dependency and anaphora relations, context windows larger than a sentence, regular expressions, integration of statistical data, ...

DeepDict
  A lexicographic tool to provide contextual dictionary information on the fly
  useful for dictionary publishers, students, linguists, ...
  example of a product integrating different types of resources into a real-life tool: corpus data, tagger/parser, statistical tools
  example of interactive, corpus-driven lexicography
    (1) results can be fed back into the parsing lexicon (valency tags, semantic lumping etc.)
    (2) the improved parsing lexicon allows better corpus annotation and, in turn, more DeepDict data
  the Portuguese version is freely available on the Internet

DeepDict (www.gramtrans.com)
  syntactically analyzed corpus (dependency links and functions)
    Linguateca's Público and Folha corpora (CETEMPúblico, CETENFolha), ca. 180+30 M words
    Portuguese Wikipedia (Nov. 2005), ca. 8 M words
    Portuguese section of Europarl, ca. 27 M words
  lemmatization, normalization (passives, numbers, names)
  extraction of mother-daughter relations: depgrams, not ngrams (cf. Adam Kilgarriff's Sketch Engine)
    N + @N< (vaca louca), @>A + ADJ (gravemente doente)
    V + @ACC (ganhar terreno), @SUBJ ktp, V + PRP (pensar em)
  co-occurrence measure: p(ab) / (p(a) * p(b))
  graphical interface
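
The co-occurrence measure on the DeepDict slide can be illustrated with a minimal Python sketch. This is not DeepDict's actual code: the function name `cooccurrence_scores`, the triple layout (head lemma, relation tag, dependent lemma), and the choice to estimate all probabilities as relative frequencies over the list of depgrams are assumptions made for illustration, and the input is hand-made toy data rather than corpus counts.

```python
from collections import Counter


def cooccurrence_scores(depgrams):
    """Score each (head, relation, dependent) triple by p(ab) / (p(a) * p(b)).

    Assumption (not from the slides): a is the head lemma, b is the
    (relation, dependent lemma) pair, and probabilities are relative
    frequencies over the full list of depgrams.
    """
    depgrams = list(depgrams)
    total = len(depgrams)
    pair_counts = Counter(depgrams)
    head_counts = Counter(head for head, _, _ in depgrams)
    dep_counts = Counter((rel, dep) for _, rel, dep in depgrams)

    scores = {}
    for (head, rel, dep), n_ab in pair_counts.items():
        n_a = head_counts[head]
        n_b = dep_counts[(rel, dep)]
        # p(ab) / (p(a) * p(b)) simplifies to n_ab * total / (n_a * n_b)
        scores[(head, rel, dep)] = n_ab * total / (n_a * n_b)
    return scores


# Toy, hand-made depgrams (illustrative only, not real corpus data)
toy_depgrams = [
    ("vaca", "@N<", "louca"),
    ("vaca", "@N<", "louca"),
    ("vaca", "@N<", "leiteira"),
    ("ganhar", "@ACC", "terreno"),
]

for triple, score in sorted(cooccurrence_scores(toy_depgrams).items(),
                            key=lambda item: -item[1]):
    print(triple, round(score, 2))
```

A score above 1 means the head and its dependent co-occur more often than they would if they were independent. On a real corpus, rare pairs tend to get inflated scores under this kind of ratio, so a frequency threshold or weighting would probably be needed; the slides do not say how DeepDict handles this.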

http://beta.visl.sdu.dk/
  visl/pt/parsing/automatic/ (live parsing, file upload etc.)
  constraint_grammar.html (formalism)
  visl/pt/info/ (categories and annotation docs)