DAM-LR at the INL Archive Formation and Local INL. Remco van Veenendaal veenendaal@inl.nl http://imdi.inl.nl 01/03/2007 DAM-LR

Similar documents

LAMUS & LAT Archiving software

How To Create A Clarin Metadata Infrastructure

DAM-LR Distributed Access Management

The Language Archive at the Max Planck Institute for Psycholinguistics. Alexander König (with thanks to J. Ringersma)

LEXUS: a web based lexicon tool

Technology in language documentation

Your boldest wishes concerning online corpora: OpenSoNaR and you

Dutch-Flemish Research Programme for Dutch Language and Speech Technology. stevin programme. project results

Overview The Corpus Nederlandse Gebarentaal (NGT; Sign Language of the Netherlands)

Dutch Parallel Corpus

Central and South-East European Resources in META-SHARE

Towards Web Services for Speech Recording and Annotation

Deliverable 12.1 Training Plan

DAM-LR Distributed Solution. - ideas -

Annotation in Language Documentation

How To Identify And Represent Multiword Expressions (Mwe) In A Multiword Expression (Irme)

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast

Enterprise Edition. Hardware Requirements

Master of Arts in Linguistics Syllabus

MEDAR Mediterranean Arabic Language and Speech Technology An intermediate report on the MEDAR Survey of actors, projects, products

FoLiA: Format for Linguistic Annotation

Processing: current projects and research at the IXA Group

ECCAIRS 5 Technical Course. Minimal System Requirements. Minimal System Requirements. Server Installation and Configuration

Linguistic Research with CLARIN. Jan Odijk MA Rotation Utrecht,

CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test

Minimum System Requirements

Hybrid Strategies. for better products and shorter time-to-market

CLARIN-NL Second Open Call. Jan Odijk CLARIN-NL Call 2 Info-session Amsterdam, 26 Aug 2010

FEATURES FOR AN INTERNET ACCESSIBLE CORPUS OF SPOKEN TURKISH DISCOURSE

Professional and Enterprise Edition. Hardware Requirements

The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge

Linguistic Resources for OpenHaRT-13

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

CLARIN: Common Language Resources and Technology Infrastructure

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

Intera Intera Deliverable D2.3

SYSTEM SETUP FOR SPE PLATFORMS

From D-Coi to SoNaR: A reference corpus for Dutch

Sage Grant Management System Requirements

Computerized Language Analysis (CLAN) from The CHILDES Project

Working on the future of the language

Localization Profile 2014

What Does Interoperability Mean, Anyway? Toward an Operational Definition of Interoperability for Language Technology

STEPS IN LANGUAGE DOCUMENTATION AND REVITALIZATION JACK MARTIN NICK THIEBERGER

Table of Contents. Server Virtualization Peer Review cameron : modified, cameron

Fast, Easy to use and On-demand A Content Platform from the 21st Century

Sustainable Solutions for Endangered Languages Data: The Language Archive

Turkish Radiology Dictation System

IBM Lotus Instant Messaging and Web Conferencing 6.5.1

CloudFTP: A free Storage Cloud

Hardware and Software Requirements for Installing California.pro

Minimum System Requirements

The Knowledge Sharing Infrastructure KSI. Steven Krauwer

The Rise of Documentary Linguistics and a New Kind of Corpus

How to use and archive data at the Archive of the Indigenous Languages in Latin America (AILLA): A workshop-presentation

NEDERBOOMS Treebank Mining for Data- based Linguistics. Liesbeth Augustinus Vincent Vandeghinste Ineke Schuurman Frank Van Eynde

MoversSuite by EWS. System Requirements

Collecting Polish German Parallel Corpora in the Internet

Adonis Technical Requirements

Microsoft Office Outlook 2013: Part 1

Special Topics in Computer Science

Overview of MT techniques. Malek Boualem (FT)

Language and Computation

Berlin-Brandenburg Academy of sciences and humanities (BBAW) resources / services

D2.4: Two trained semantic decoders for the Appointment Scheduling task

Survey Results: Requirements and Use Cases for Linguistic Linked Data

Online Student Attendance Management System using Android

Exploiting Sign Language Corpora in Deaf Studies

AP ENPS ANYWHERE. Hardware and software requirements

The Lund Lab language by ear, eye and touch

From Fieldwork to Annotated Corpora: The CorpAfroAs project

Microsoft Office Outlook 2010: Level 1

services. Anders Wiehe IT department Gjøvik University College

Dedicated Hosting. The best of all worlds. Build your server to deliver just what you want. For more information visit: imcloudservices.com.

THE EXPAND PARALLEL FILE SYSTEM A FILE SYSTEM FOR CLUSTER AND GRID COMPUTING. José Daniel García Sánchez ARCOS Group University Carlos III of Madrid

Annotation Guidelines for Dutch-English Word Alignment

Post-doctoral researcher, Faculty of Translation Studies, University College Ghent

Queen s Open Journal System (OJS) Business Case

How To Help The European Single Market With Data And Information Technology

3M Occupational Health and Environmental Safety 3M E-A-Rfit Validation System. Version 4.2 Software Installation Guide (Upgrade) 1 P age

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

FREE VOICE CALLING IN WIFI CAMPUS NETWORK USING ANDROID

Version 2010 System Requirements Revised 8/9/2010 1

Transcribing and annotating audio and video: Jeff Good MPI EVA and the Rosetta Project

XTM Web 2.0 Enterprise Architecture Hardware Implementation Guidelines. A.Zydroń 18 April Page 1 of 12

An Online Service for SUbtitling by MAchine Translation

Integrated Linguistic Resources for Language Exploitation Technologies

Web Server (IIS) Requirements

MASTER OF PHILOSOPHY IN ENGLISH AND APPLIED LINGUISTICS

Data Management Plans and Data Centers

SMART Bridgit software

Hardware Configuration Guide

The SweDat Project and Swedia Database for Phonetic and Acoustic Research

The construction of a 500-million-word reference corpus of contemporary written Dutch

DGT-TM: A freely Available Translation Memory in 22 Languages

A sustainable archiving software solution for The Language Archive

Introduction to Cisco Inventory and Reporting

RSA Digital Certificate Solution

Video Transcription in MediaMosa

SOFTWARE TECHNOLOGIES

Transcription:

DAM-LR at the INL Archive Formation and Local INL Remco van Veenendaal veenendaal@inl.nl http://imdi.inl.nl

Introducing Remco van Veenendaal Project manager DAM-LR Acting project manager Dutch HLT Agency (TST-centrale) Computational Linguist Vincent Wagelaar Software engineer and system administrator DAM-LR

Overview The INL Archive Formation at the INL Local Solution at the INL Summary

The INL

The INL (1/2) INL studies the Dutch language Focus on lexicology: Dutch language database Word lists, dictionaries, corpora and software tools Jan 07: http://wnt.inl.nl The world s largest dictionary free access Track record of 40 yrs (1967-2007), about 40 employees Previous international projects Parole (creation of electronic LRs) Telri (network with Eastern Eu. LR creators) Elan (infrastructure for Parole LRs) Simple (semantics for Parole LRs) Enabler (infrastructure for LRs)

The INL (2/2) Home of the Dutch HLT Agency (TST-centrale) Initiative of the Dutch Language Union (NTU) Single point of access for Dutch language resources (STEVIN) Strengthen position of Dutch (in language technology) Stimulation of (re)use of Dutch LRs Acquisition, Maintenance, Support, Distribution (ITIL) IPR and service IPR, linguistic and technical consultancy (free) DAM-LR logical next step in international projects and DAM-LR and TST-centrale s services mutually beneficial Management buy-in Co-operation between DAM-LR, TST-centrale and IT department

Archive Formation at the INL

Archive Formation at the INL (1/2) 2004/5 Training at MPI: IMDI, IMDI corpora (Spoken Dutch Corpus, CGN 1.0) and IMDI software (Corex, LAMUS) 2005 Inventory of LRs (INL/TST-centrale) Inventory of access rights for LRs IPR! Pilot version of archive with CGN 1.0 Hardware (dedicated DAM-LR server) 3.4 GHz single core Intel Pentium 4 1 Gb RAM and redundant 250 Gb disk storage OS: FreeBSD 5.4 (Java 1.4.2) Start of conversion of IMDI LRs to IMDI 3.0 75 person months (Archive + Local INL)

Archive Formation at the INL (2/2) 2006 Production Line system forlrs One system for storage, back-up, versioning, conversion/integration and distribution/exploration of LRs Boekestein, M. et al. (2006). The functioning of the Centre for Dutch Language and Speech Technology. Proceedings of the 5th International Conference of Language Resources. Genoa : pp. 2303-2306. CGN 2.0 and IFA corpus in archive Many LRs in pipeline (next) Handles generated and integrated in IMDI 52 person months (Archive + Local INL)

CGN corpus 2.0 Spoken Dutch Corpus 900 hours spoken Dutch (Netherlands and Flanders) 12,780 audio fragments * annotation types orthography, phonetics, word alignment, part-of-speech, 12,780 IMDI session metadata files 127 IMDI (sub)corpus metadata files Frequency lists, tools, documentation About 120 GB Version 2.0 contains a lot of (meta)data bugfixes, annotations for 13 Flemish audio files, updated software and documentation

IFA corpus IFA Corpus of Spoken Dutch Open source (GPL) database of hand-segmented Dutch speech 8 speakers and 8 speaking styles 50,000 words / 5½ hours IMDI 3.0 session and (sub)corpus metadata Documentation 400+ MB

Resources to add in 2007 RBN (reference list of Dutch); RBBN (reference list of Flemish Dutch); OMBI (dictionary editor); several bilingual lexicons; Official Dutch word lists of 1995 and 2005 (lexicons); dictionary of Old-Dutch; E-LEX (large Dutch electronic single word and multi-word lexicon; 5, 27 and 38 million words written text corpora and exploration software; 20 million words Parole written text corpus, lexicon and exploration software; Neologisms list for Dutch; ANW corpus (100 million word text corpus); Nl-En, Nl-Fr, Fr-Nl and En-Nl translation dictionaries, corpora and software; Terminology lexicons and software (TermExtractor; E-ANS (Dutch grammar rules); Regional (dialect) dictionaries; STEVIN D-Coi (pilot for Written Dutch Corpus); STEVIN WDC (500 million word text corpus); STEVIN Jasmin-CGN (extension of CGN); STEVIN COREA (co-reference tools and annotated text corpus); STEVIN IRME (multi-word database and software); STEVIN Autonomata (spoken name database and software); STEVIN Daeso (corpus and software for semantics); STEVIN DPC (Nl-En and Nl-Fr parallel corpora); STEVIN Lassy (syntactically annotated corpus); STEVIN Midas (software for robustness in ASR); STEVIN N-best (benchmark for Dutch ASRs); STEVIN can Praat (software for speech research); STEVIN Spraak (ASR toolkit for Dutch); 01/03/2007 STEVIN Cornetto (software toolkit DAM-LR for lexical semantics); 14th century written text corpus; Eindhoven writtten text corpus (the first written Dutch text corpus)

Local Solution at the INL

Local Solution at the INL (1/3) Finding and implementing solutions for the four access pillars

Local Solution at the INL (2/3) 2004/5 Training at MPI also included IMDI portal software 2005 Pilot implementation of IMDI portal Test LR: CGN 1.0 Experiences/documentation on project wiki Solution for distributed metadata domain (IMDI)

Local Solution at the INL (3/3) 2006 Solution for URID service (Handle) Solution for authentication (LDAP) Solution for authorisation (Shibboleth) PKI: Leiden University as Certification Authority Portal beta online (http://imdi.inl.nl) CGN 2.0 and IFA corpus as test LRs Solutions for all pillars up and running Towards a distributed solution INL portal links to MPI and Lund archives Succesful Shibboleth test between MPI and INL (November 2006)

Summary DAM-LR at INL in close co-operation with TST-centrale and IT department Archive Formation + Local Solution 75 pm in 2005 (total: 108 pm) 52 pm in 2006 (total: 78 pm) Solution for distr. metadata domain: Solution for URID service: Solution for authentication: Solution for authorisation: IMDI Handle (licensed) (open)ldap Shibboleth Two large LRs in beta portal: http://imdi.inl.nl Start of distributed solution: links MPI & Lund Successful Shibboleth test between MPI and INL

End Questions?