Detecting semantic overlap: Announcing a Parallel Monolingual Treebank for Dutch


Detecting semantic overlap: Announcing a Parallel Monolingual Treebank for Dutch Erwin Marsi & Emiel Krahmer CLIN 2007, Nijmegen

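What is semantic overlap?
Zangeres Christina Aguilera heeft eindelijk verteld waarom haar buik zo dik is. [NOS]
(Singer Christina Aguilera has finally told why her belly is so big.)
Christina Aguilera heeft in het Amerikaanse tijdschrift Glamour bevestigd dat zij zwanger is. [AD]
(Christina Aguilera has confirmed in the American magazine Glamour that she is pregnant.)
Christina Aguilera heeft eindelijk bevestigd wat de hele wereld al wist: ze is zwanger. [NOVUM]
(Christina Aguilera has finally confirmed what the whole world already knew: she is pregnant.)
Iedereen wist het al, maar nu zou Christina Aguilera het zelf voor het eerst hebben toegegeven: ze is zwanger. [The Agenda]
(Everyone already knew, but now Christina Aguilera is said to have admitted it herself for the first time: she is pregnant.)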

Why bother?
Similar information can be expressed ("paraphrased") in many different ways.
This is a major stumbling block for robust NLP applications such as IE, IR or QA.
Resources exist on the word level (e.g., WordNet), but are mostly lacking for more complex phrases.
The Stevin Daeso (Detecting and Exploiting Semantic Overlap) project intends to fill this gap.

The Daeso corpus
Building a 1M word parallel monolingual treebank.
Basic idea: look for pairs of sentences where there is an independent criterion that there will be some amount of overlap.
The corpus should contain different text genres and different amounts of overlap.
500K manually aligned and corrected [now]; 500K automatic [2009].

Corpus collection
Sources, ordered by degree of overlap from high to low:
Autocue - Subtitling (NOS, TwNC)
Parallel translations into Dutch:
    Le Petit Prince by de Saint-Exupéry (1960, 2000)
    The Origin of Species by Darwin (2001, 2002)
    Essais by Montaigne (2001, 2004)
Google Headlines (mined by Wauter Bosma)
Different press releases (ANP, Novum) about the same (Dutch) event
Potential sets of answers to different questions (from the IMIX project)

Corpus data (words)
                        Manual    Available
Autocue-subtitles       125k      192k
Book translations
    Darwin 1            25k       154k
    Darwin 2            25k       191k
    Montaigne 1         25k       462k
    Montaigne 2         25k       ~500k
    Saint-Exupéry 1     15k       15k
    Saint-Exupéry 2     15k       15k
News headlines          24k       >900k
Press releases
    ANP                 125k      197k
    Novum               125k      136k
QA system output        1k        1k

Pre-processing and annotation steps
1. XML TEI format (Text Encoding Initiative).
2. Sentence splitting and tokenization with the DCOI tokenizer for Dutch (Reynaert 2007).
3. Dependency parsing with the Alpino parser (van Noord et al.).
4. Alignment at text and sentence level.
5. Alignment of dependency trees.

Sentence alignment
Standard alignment methods (e.g., Gale and Church 1993) assume alignment is mostly 1-to-1 and that crossing alignments and unaligned sentences are rare.
These assumptions are often violated:
    obviously in comparable texts,
    but also in, e.g., translations of Darwin's Origin of Species.
Developed:
    a new alignment method to boost manual annotation;
    a new annotation tool to check sentence alignments.

Automatic sentence alignment
Tricky for comparable texts.
As a first approximation: low level, multiple pass, shallow features.
Experiments with:
    types vs. tokens;
    different overlap metrics (MaxSim, Cosine, Jaccard, Dice, Tanimoto, ...);
    tf-idf weighting (Nelken & Shieber 2006).
Ongoing...
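
To make the overlap metrics concrete, here is a minimal sketch of three of them (Jaccard, Dice, Cosine) over whitespace-tokenized sentences. It is an illustration only, not the Daeso implementation: the function names, the lowercasing tokenizer and the choice to compute Jaccard and Dice on word types and Cosine on token counts are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Assumption: naive lowercased whitespace tokenization, for illustration only.
    return sentence.lower().split()


def jaccard(a, b):
    # Overlap of word types: |A & B| / |A | B|.
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    # Overlap of word types: 2 * |A & B| / (|A| + |B|).
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    # Cosine of the two token-count vectors (tokens, not types).
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    s1 = tokenize("Christina Aguilera heeft bevestigd dat zij zwanger is")
    s2 = tokenize("Christina Aguilera heeft eindelijk bevestigd : ze is zwanger")
    for metric in (jaccard, dice, cosine):
        print(metric.__name__, round(metric(s1, s2), 3))
```

A sentence aligner along these lines would score every candidate sentence pair and keep those above a threshold. The tf-idf weighting mentioned on the slide would replace the raw counts in `cosine` with weighted ones.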

Hitaext: Tool for text and sentence alignment
First public release (October 2007): http://daeso.uvt.nl/hitaext/

Alignments of words and phrases
Given two dependency trees for two aligned sentences: align nodes and label the alignment relation.
    Christina Aguilera            equals        Christina Aguilera
    zwanger                       restates      in verwachting
    de zangeres Aguilera          specifies     Aguilera
    Aguilera                      generalizes   de zangeres Aguilera
    Christina Aguilera en Beyoncé intersects    Beyoncé en Pink
Marsi & Krahmer (2005): for the first five chapters of Le Petit Prince, two annotators reached an F-score of .98 on relations and .95 on labels.
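
The two agreement figures can be illustrated with a small sketch. It assumes each annotator's output is a set of (source node, target node, label) triples, and that agreement "on relations" compares only the aligned node pairs while agreement "on labels" compares the full triples. The slides do not spell out how Marsi & Krahmer (2005) computed the scores, so this reading, the function names and the toy data are assumptions.

```python
def f_score(first, second):
    # Harmonic mean of precision and recall, treating `first` as reference.
    first, second = set(first), set(second)
    shared = len(first & second)
    if shared == 0:
        return 0.0
    precision = shared / len(second)
    recall = shared / len(first)
    return 2 * precision * recall / (precision + recall)


def relation_agreement(ann_a, ann_b):
    # Compare aligned node pairs only, ignoring the relation label.
    return f_score({(s, t) for s, t, _ in ann_a}, {(s, t) for s, t, _ in ann_b})


def label_agreement(ann_a, ann_b):
    # Compare full (source, target, label) triples.
    return f_score(ann_a, ann_b)


# Hypothetical toy annotations, reusing the example phrases from the slide.
annotator_a = {
    ("Christina Aguilera", "Christina Aguilera", "equals"),
    ("zwanger", "in verwachting", "restates"),
    ("de zangeres Aguilera", "Aguilera", "specifies"),
}
annotator_b = {
    ("Christina Aguilera", "Christina Aguilera", "equals"),
    ("zwanger", "in verwachting", "restates"),
    ("de zangeres Aguilera", "Aguilera", "generalizes"),
}

if __name__ == "__main__":
    print(relation_agreement(annotator_a, annotator_b))
    print(label_agreement(annotator_a, annotator_b))
```

In this toy case the annotators align the same node pairs but disagree on one label, so agreement on relations is higher than agreement on labels, the same pattern as in the reported scores.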

Algraeph: Tool for aligning nodes and labeling alignments

State of affairs
Work on manual alignment of words and phrases is currently ongoing. Other work on the corpus is now finished.
Further activities:
    Sentence fusion: combine two related sentences into a single grammatical sentence. New results on question-driven fusion just in.
    Multi-document summarization: currently building a baseline multi-document summarization system for Dutch, to be extended with Daeso tools later on.

About the Daeso Stevin project
People involved: Paul van Pelt, Jurry de Vos, Iris Hendrickx, Walter Daelemans, Jakub Zavrel, Maarten de Rijke, Erwin Marsi, Emiel Krahmer
More info: http://daeso.uvt.nl/