Dutch Parallel Corpus

Similar documents
Post-doctoral researcher, Faculty of Translation Studies, University College Ghent

Department of Translation, Interpreting and Communication. Post-doctoral researcher, Department of Translation, Interpreting and Communication,

Assistant professor, Department of Translation, Interpreting and Communication, Faculty of Arts and Philosophy, Ghent University

An Online Service for SUbtitling by MAchine Translation

Annotation Guidelines for Dutch-English Word Alignment

CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test

Your boldest wishes concerning online corpora: OpenSoNaR and you

Collecting Polish German Parallel Corpora in the Internet

Understanding Interpretation and Translation

Question template for interviews

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

The University of Amsterdam s Question Answering System at QA@CLEF 2007

Schema documentation for types1.2.xsd

Language and Computation

Master in Digital Humanities

Statistical Machine Translation

The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems

Dutch-Flemish Research Programme for Dutch Language and Speech Technology. stevin programme. project results

Bridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project

Extraction and Visualization of Protein-Protein Interactions from PubMed

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

TS3: an Improved Version of the Bilingual Concordancer TransSearch

WebLicht: Web-based LRT services for German

TRANSREAD LIVRABLE 3.1 QUALITY CONTROL IN HUMAN TRANSLATIONS: USE CASES AND SPECIFICATIONS. Projet ANR CORD 01 5

Markus Dickinson. Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013

SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Search and Information Retrieval

Report Builder Easily create exportable reports

Fixed issue that could hang a domain controller. It can occur when the filter has difficulty resolving a user's SID and the first 2 methods fail.

Customizing an English-Korean Machine Translation System for Patent Translation *

SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统

Database Design For Corpus Storage: The ET10-63 Data Model

Example-based Translation without Parallel Corpora: First experiments on a prototype

Appendix C Comparison of the portals with regard to the interface requirements

Motivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1

Central and South-East European Resources in META-SHARE

The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge

FoLiA: Format for Linguistic Annotation

The Language Archive at the Max Planck Institute for Psycholinguistics. Alexander König (with thanks to J. Ringersma)

Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection

Convergence of Translation Memory and Statistical Machine Translation

Learning Translations of Named-Entity Phrases from Parallel Corpora

a Chinese-to-Spanish rule-based machine translation

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Mining a Corpus of Job Ads

Berlin-Brandenburg Academy of sciences and humanities (BBAW) resources / services

Multilingual Term Extraction as a Service from Acrolinx. Ben Gottesman Michael Klemme Acrolinx CHAT2013

Kybots, knowledge yielding robots German Rigau IXA group, UPV/EHU

Getting Off to a Good Start: Best Practices for Terminology

TRANSLATIONS FOR A WORKING WORLD. 2. Translate files in their source format. 1. Localize thoroughly

LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH

Multi language e Discovery Three Critical Steps for Litigating in a Global Economy

PoS-tagging Italian texts with CORISTagger

Interactive Dynamic Information Extraction

Collaborative Machine Translation Service for Scientific texts

INPUTLOG 6.0 a research tool for logging and analyzing writing process data. Linguistic analysis. Linguistic analysis

Language Translation Services RFP Issued: January 1, 2015

Why are Organizations Interested?

PyCantonese: Cantonese linguistic research in the age of big data

SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR

Die Vielfalt vereinen: Die CLARIN-Eingangsformate CMDI und TCF

SYSTRAN v6 Quick Start Guide

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

Brill s rule-based PoS tagger

An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System

Natural Language Database Interface for the Community Based Monitoring System *

A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks

A Joint Sequence Translation Model with Integrated Reordering

Key words related to the foci of the paper: master s degree, essay, admission exam, graders

A Framework-based Online Question Answering System. Oliver Scheuer, Dan Shen, Dietrich Klakow

Optimizing Multilingual Search With Solr

Analysis of Social Media Streams

How To Write The English Language Learner Can Do Booklet

BIRMINGHAM CITY UNIVERSITY. Marketing and Communications JOB DESCRIPTION

ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking

NLP Lab Session Week 3 Bigram Frequencies and Mutual Information Scores in NLTK September 16, 2015

Introduction to the Translation Workspace

Transcription:

Dutch Parallel Corpus Lieve Macken lieve.macken@hogent.be LT 3, Language and Translation Technology Team Faculty of Applied Language Studies University College Ghent November 29th 2011 Lieve Macken (LT 3 ) DPC November 29th 2011 1 / 28

Dutch Parallel Corpus Characteristics Parallel Corpus of 10 million words 2 language pairs, 4 language directions Dutch French / Dutch English Bidirectional Also comparable corpus (translated vs. non-translated language) Balanced 5 text types: administrative texts, instructive texts, literature, journalistic texts, texts for external communication Text types further divided into subtypes ± 500,000 words per translation direction/text type Metadata files info on translation direction and text types other relevant info such as intended audience, text provider,... Lieve Macken (LT 3 ) DPC November 29th 2011 2 / 28

Dutch Parallel Corpus Characteristics (ctd.) Balanced (ctd.) Condition on translation direction relaxed for instructive texts difficult to find information on translation direction Literary texts balanced according to language pairs difficult to obtain copyright clearance Copyright-cleared Difficult and time-consuming task Commercial publishers (journalistic texts and literature) longer and more difficult negotiation cycles Different type of IPR-agreements commercial vs. non-commercial use recognizable vs. partially recognizable Lieve Macken (LT 3 ) DPC November 29th 2011 3 / 28

Dutch Parallel Corpus Characteristics (ctd.) Automaticaly aligned at sentence level Combined three alignment tools reduce manual verification effort No word alignment Linguistically enriched Preprocessing Character conversion (UTF-8) Text split in sentences and tokens Part-of-speech tagging Info on tag sets: see user doc Lemmatization Lieve Macken (LT 3 ) DPC November 29th 2011 4 / 28

Dutch Parallel Corpus Characteristics (ctd.) Obtained accuracy scores on a manually validated DPC sample Sample size Lemmata PoS PoS (tokens) (full tag) (main category) Dutch 211,000 96.5% 94.8% 97.4% English 300,000 98.1% 96.2% N/A French 330,000 98.1% 94.6% 97.4% Lieve Macken (LT 3 ) DPC November 29th 2011 5 / 28

Dutch Parallel Corpus Distribution Distributed and maintained by HLT-agency (TST-centrale) http://www.inl.nl/tstcentrale/nl/producten/corpora/dpc-corpus-dutchparallel-corpus/6-65 Lieve Macken (LT 3 ) DPC November 29th 2011 6 / 28

XML data 4 files per text 2 monolingual files Indications of paragraphs and sentences Linguistic annotations File containing metadata Author, date, publication,... Source and target language Text type and subtype Domain, keywords Type of IPR-agreement File with sentence alignments Lieve Macken (LT 3 ) DPC November 29th 2011 7 / 28

Monolingual XML file Lieve Macken (LT 3 ) DPC November 29th 2011 8 / 28

Metadata XML file Lieve Macken (LT 3 ) DPC November 29th 2011 9 / 28

Alignment XML file Lieve Macken (LT 3 ) DPC November 29th 2011 10 / 28

Webinterface Demo version http://dpc.inl.nl/indexd.php Lieve Macken (LT 3 ) DPC November 29th 2011 11 / 28

Webinterface ctd. Three steps 1 Monolingual vs. bilingual search 2 Full corpus vs. subcorpus 3 Search query Lieve Macken (LT 3 ) DPC November 29th 2011 12 / 28

Webinterface ctd. Putting together a subcorpus Multifunctional resource for diverse group of users Researchers in translation studies vs. contrastive linguistics Language teachers and language learners Human translators Create a subcorpus according to your needs! E.g. filter on translation direction E.g. filter on text type But: the more restrictions... the less data Lieve Macken (LT 3 ) DPC November 29th 2011 13 / 28

Webinterface ctd. Lieve Macken (LT 3 ) DPC November 29th 2011 14 / 28

Webinterface ctd. Search query Define search words max. 2 per language Define search type Word form or lemma Part-of-speech Position Lieve Macken (LT 3 ) DPC November 29th 2011 15 / 28

Example 2 consecutive words Lieve Macken (LT 3 ) DPC November 29th 2011 16 / 28

Example 2 consecutive words (results) Lieve Macken (LT 3 ) DPC November 29th 2011 17 / 28

Webinterface ctd. Metadata Context Lieve Macken (LT 3 ) DPC November 29th 2011 18 / 28

Webinterface ctd. Export to Excel Source and target sentences and accompanying metadata Lieve Macken (LT 3 ) DPC November 29th 2011 19 / 28

Puisque aangezien Lieve Macken (LT 3 ) DPC November 29th 2011 20 / 28

Results Puisque aangezien Lieve Macken (LT 3 ) DPC November 29th 2011 21 / 28

Dutch compounds Lieve Macken (LT 3 ) DPC November 29th 2011 22 / 28

Results Dutch compounds Lieve Macken (LT 3 ) DPC November 29th 2011 23 / 28

Passé composé (French) Simple past (Dutch) Lieve Macken (LT 3 ) DPC November 29th 2011 24 / 28

Results passé composé (French) Simple past (Dutch) Lieve Macken (LT 3 ) DPC November 29th 2011 25 / 28

Final verbal group in Dutch Lieve Macken (LT 3 ) DPC November 29th 2011 26 / 28

Results final verbal group in Dutch Lieve Macken (LT 3 ) DPC November 29th 2011 27 / 28

Acknowledgements Funded by the STEVIN programme DPC Team K.U. Leuven campus Kortrijk Faculty of Applied Language Studies University College Ghent Contributors Piet Desmet, Willy Vandeweghe, Hans Paulussen, Lieve Macken, Maribel Montero Perez, Orphée De Clercq, Lidia Rura, Julia Trushkina and Antoine Besnehard http://www.kuleuven-kortrijk.be/dpc User Manual Publications Lieve Macken (LT 3 ) DPC November 29th 2011 28 / 28