Annotation in Language Documentation



Univ. Hamburg, Workshop Annotation
Sebastian Drude, 2015-10-29

Topics
1. Language Documentation
2. Data and Annotation (theory)
3. Types and interdependencies of Annotations
4. Annotation Tools (overview and comparison)
5. Annotation data formats (TRS, EAF, SF)
6. Standing challenges

1) Language Documentation
New subfield of linguistics (Himmelmann 1998): documentary linguistics, with language documentations as results
Triggered by language endangerment, enabled by the technical / digital revolution
Different from language description: in addition to the Boasian triad (grammar, dictionary, text collection), it yields corpora of annotated multimedia data

1) Language Documentation
A modern language documentation (LD) consists of a corpus of primary data (audio and video recordings) of utterances and texts from a broad spectrum of genres and domains
Annotation accompanies the utterances
An LD is digital and sustainable (metadata, open standards, archiving, maintenance)
It is generally accessible, e.g. via the internet

2) Data and Annotation
[Diagram: the components of a SESSION]
Metadata (describe the event and the respective data)
Primary data: video recording, audio recording
Secondary data (annotation):
  Transcription: orthographical / phonolog. ...
  Translation: word-by-word / idiomatic ...
  (Linguistic / ethnograph. comment ...)
  (Morpheme glosses ...)
  ...

2) Data and Annotation: Data
Data is always data FOR something, or at least OF something
Usually it is a systematic representation of physical states and events (facts used FOR a scientific argument)
In LD, primary data is a direct rendering or result of communicative (speech) events, for instance a written text or, in particular, an audio/video recording of a speech event

2) Data and Annotation: Annotation
Annotation of data is a symbolic representation of properties of the state/event represented in the data
In LD, the most common and basic types of (primary) annotation are a transcription and a translation of the expressions represented in the primary data (e.g., an a/v recording)

2) Data and Annotation
[Diagram: how annotation, primary data and reality relate]
Annotation = secondary data: symbolically represents properties of what is represented in the primary data
Primary data: direct measurement / rendering / result of REALITY (communicative events)

2) Data and Annotation: Global vs. unit-oriented annotation
Global or holistic annotation represents properties of the event as a whole and is, in LD, part of the metadata
Unit-oriented annotation refers to specific parts of the data, in particular utterances of individual sentences, words, sounds etc.
For unit-oriented annotation we speak of individual annotations (plural)

2) Data and Annotation: Secondary and derived data
If unit-oriented annotation is directly based on primary data (such as a written text or an audio or video recording), then it is secondary data
Annotation commenting on previous annotation is tertiary data, and so forth recursively
In sum, all unit-oriented annotation is derived data
There are other types of derived data (lexicon ...)
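The recursive definition (secondary data is based on primary data, tertiary data on secondary data, and so on) can be illustrated with a minimal sketch. The class `DataItem` and its field `based_on` are hypothetical names chosen for this illustration; they do not come from any annotation tool.

```python
from dataclasses import dataclass
from typing import Optional

LEVEL_NAMES = {1: "primary", 2: "secondary", 3: "tertiary"}


@dataclass
class DataItem:
    label: str
    based_on: Optional["DataItem"] = None  # None marks primary data

    def level(self) -> int:
        """1 for primary data, otherwise one more than the item it is based on."""
        return 1 if self.based_on is None else self.based_on.level() + 1

    def level_name(self) -> str:
        n = self.level()
        return LEVEL_NAMES.get(n, f"level-{n} derived")


recording = DataItem("audio recording")
transcription = DataItem("transcription", based_on=recording)
gloss = DataItem("morpheme gloss", based_on=transcription)

for item in (recording, transcription, gloss):
    print(f"{item.label}: {item.level_name()} data")
```

Everything with a level above 1 counts as derived data in the sense of the slide.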

2) Data and Annotation: Time-aligned annotation
Annotation of a media file is time-aligned annotation if each piece of annotation is explicitly associated with the corresponding chunk (time span, segment) of the media file
Time-linking is the activity and result of specifying the time-alignment of each annotation associated with a certain chunk in the media file

2) Data and Annotation: Time-aligned annotation (continued)
Time-linking is usually done by using time marks
Time marks: the start/end times of each chunk
Segmenting (of a media file): identification of relevant chunks and their time marks
Work-flow: first segmenting, then adding annotation
Older unit-oriented annotation can later be time-aligned, but this is very labour-intensive (but now see WebMAUS from CLARIN/BAS)
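The two-step work-flow (segmenting, then adding annotation) can be sketched as follows. This is only an illustration under simplifying assumptions: chunks are adjacent and non-overlapping, time marks are in milliseconds, and the names `Chunk`, `segment` and `annotate` as well as the boundary values are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    start_ms: int   # time mark: start of the chunk
    end_ms: int     # time mark: end of the chunk
    text: str = ""  # annotation value, added after segmenting


def segment(boundaries_ms: List[int]) -> List[Chunk]:
    """Step 1: turn a sorted list of boundaries into adjacent chunks."""
    return [Chunk(s, e) for s, e in zip(boundaries_ms, boundaries_ms[1:])]


def annotate(chunks: List[Chunk], texts: List[str]) -> None:
    """Step 2: attach one annotation value to each chunk, in order."""
    if len(chunks) != len(texts):
        raise ValueError("need exactly one annotation per chunk")
    for chunk, text in zip(chunks, texts):
        chunk.text = text


chunks = segment([0, 1200, 2750, 4100])  # made-up boundaries
annotate(chunks, ["first utterance", "second utterance", "third utterance"])
for c in chunks:
    print(c.start_ms, c.end_ms, c.text)
```

Time-aligning older annotation corresponds to running step 2 first and supplying the time marks afterwards, which is why it is so labour-intensive when done by hand.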

3) Types and Interdependencies of Annotations: Linguistic types of annotations
Annotations differ according to the types of properties of the speech event that are represented in the annotation
Annotations can be phonetic, phonological, morphological, syntactic, semantic, pragmatic (possibly others)
On each level they can focus on the units, on structures of units, on relations that hold among units, etc.

3) Types and Interdependencies of Annotations: Coverage of annotation
Basic annotation: only transcriptions, translations and optionally notes, on a sentence / clause / intonational unit level
Basic glossing: additionally, information on individual morphs: a gloss (indication of meaning or function) and perhaps a part-of-speech tag
Advanced glossing: one or several additional levels, from phonetic to pragmatic (for instance, a prosodic transcription, or annotation of the syntactic structure, of grammatical relations, etc.)

3) Types and Interdependencies of Annotations
Most often used format in language description: Interlinear Morpheme Translations / Glossing ("standard glossing")
C. Lehmann: Interlinear Morphemic Glossing. In: Morphology (2004, first version 1982)
Leipzig Glossing Rules: Linguistics @ MPI-EVA (B. Comrie, M. Haspelmath) & Linguistics @ Univ. Leipzig (B. Bickel), 2008

3) Types and Interdependencies of Annotations
Example:
time-o     ne       veni-a-t
fear-1.SG  NEG.VOL  come-SBJV.PRES-3.SG
'I am afraid he might come'
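The alignment that interlinear glossing relies on (one gloss per morph, one gloss word per text word) can be checked mechanically. The following is a minimal sketch, assuming that words are separated by whitespace and morphs by hyphens only; clitic boundaries (=) and other separators are ignored, and the function name `align_gloss` is made up for this example.

```python
from typing import List, Tuple


def align_gloss(form_line: str, gloss_line: str) -> List[List[Tuple[str, str]]]:
    """Pair each morph with its gloss, word by word.

    A mismatch in the number of words, or of morphs within a word,
    raises ValueError.
    """
    words = form_line.split()
    glosses = gloss_line.split()
    if len(words) != len(glosses):
        raise ValueError("form and gloss lines differ in number of words")
    aligned = []
    for word, gloss in zip(words, glosses):
        morphs, labels = word.split("-"), gloss.split("-")
        if len(morphs) != len(labels):
            raise ValueError(f"morph/gloss mismatch in {word!r} / {gloss!r}")
        aligned.append(list(zip(morphs, labels)))
    return aligned


pairs = align_gloss("time-o ne veni-a-t",
                    "fear-1.SG NEG.VOL come-SBJV.PRES-3.SG")
for word in pairs:
    print(word)
```

Note that the period inside a gloss (as in SBJV.PRES) is deliberately not split: it marks several labels attached to a single morph.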

3) Types and Interdependencies of Annotations
Problems:
Theory-specific (item-and-arrangement, not item-and-process nor word-and-paradigm)
Mixes morphology and syntax
Problems with synthetic word forms: timeo: 1P, SG, IND, ACT, PRES. Where is PRES? (Ø)
Analytical word forms (esp. discontinuous)
What do the labels designate? Meaning? Categories? Functions? Often undefined.

3) Types and Interdependencies of Annotations
Hans-Heinrich Lieb & Sebastian Drude: Advanced Glossing: A Language Documentation Format (DOBES Working Paper, November 2000)
http://dobes.mpi.nl/documents/advanced-Glossing1.pdf

Advanced Glossing (AG): a syntactic glossing table

Advanced Glossing (AG): a morphological glossing table

AG: A Glossing Table
[Diagram: a glossing table "is a list"; labelled parts: "a line", "a cell", "a holistic line"]

AG: A Glossing
[Diagram: a glossing comprises a glossing table and a comment; the general comment "is linked to" the glossing table]

AG: Syntactic and Morphological Glossings of a sentence
[Diagram: the syntactic glossing of a sentence and the morphological glossing of a sentence; the latter "is a glossing of" the sentence and consists of the morphological glossings of word 1, word 2, word 3, ...]

AG: Glossing of a Text
[Diagram: the glossings of the sentences form a list ("is a list"): syntactic and morphological glossings of sentence 1, of sentence 2, of sentence 3, ...; each with the internal structure shown on the previous slides (glossing table with lines and cells; morphological glossings of word 1, 2, 3, ...). Further components: general comment on the text, raw data, (other components)]

AG: The number line

AG: Phonetic Form and Intonation

AG: Phonological Forms and Intonation

AG: Orthographical Base Forms

AG: Lexical Categories and Form Categories

AG: Meanings and Semantic Effects

AG: Constituent Structure and Relations

AG: Orthographical Unit and Meaning

3) Types and Interdependencies of Annotations
Time-linked annotations for sentence-utterances
  Other dependent sentence-annotations
  Subdivision into annotations for syntactic units (can be internally time-aligned or not)
    Dependent syntactic-unit-annotations
    Further subdivision into annotations for morphs (hardly possible to time-align internally)
      Dependent morph-annotations
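This hierarchy can be modelled as a tree in which only some nodes carry their own time span and the others inherit it from the nearest time-aligned ancestor. The sketch below is an assumption-laden illustration, not the data model of any particular tool: the class `Unit`, its fields and the time values are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Unit:
    kind: str                                   # "sentence", "syntactic unit" or "morph"
    value: str
    span_ms: Optional[Tuple[int, int]] = None   # None: not time-aligned itself
    dependents: Dict[str, str] = field(default_factory=dict)  # e.g. translation, gloss
    parent: Optional["Unit"] = field(default=None, repr=False, compare=False)
    children: List["Unit"] = field(default_factory=list)

    def add(self, kind: str, value: str, span_ms=None) -> "Unit":
        """Subdivide this unit by attaching a child unit."""
        child = Unit(kind, value, span_ms, parent=self)
        self.children.append(child)
        return child

    def effective_span(self) -> Optional[Tuple[int, int]]:
        """Own span if time-aligned, else that of the nearest aligned ancestor."""
        node = self
        while node is not None:
            if node.span_ms is not None:
                return node.span_ms
            node = node.parent
        return None


sentence = Unit("sentence", "timeo ne veniat", span_ms=(0, 1800))  # made-up times
sentence.dependents["translation"] = "I am afraid he might come"
word = sentence.add("syntactic unit", "veniat", span_ms=(1100, 1800))
morph = word.add("morph", "veni")  # morphs are hardly ever time-aligned
morph.dependents["gloss"] = "come"

print(morph.effective_span())
```

Dependent annotations (translation, gloss, notes) hang off the unit they comment on and need no time marks of their own.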

4) Annotation Tools: Transcriber
Tool for the segmentation and transcription of audio files
Pros: compatible with Mac, Windows and Linux; very easy to use; produces simple XML files
Cons: no Unicode input possible; only one line of annotation; no video; no lexicon; outdated (new version not tested)

Transcriber

4) Annotation Tools: ELAN
Tool for the complex annotation of audio and video files
Pros: compatible with Mac, Windows and Linux; audio and multiple video files; unlimited tiers for different speakers; state-of-the-art; wide user community; XML output (but complex)
Cons: complex tool for beginners (but now: easier transcription mode); no lexicon (yet)

ELAN

4) Annotation Tools: Toolbox
Text-oriented general database tool for linguistic fieldwork with lexicon and texts
Pros: flexible and powerful; export to different formats (incl. XML), therefore easy to integrate with other tools; many users
Cons: too flexible; poor data format ("Standard Format"); complex to set up; tricky on Mac/Linux; no video and no time-aligning; at end of life cycle; produced by SIL

Toolbox

4) Annotation Tools: FLEX
Extensive linguistic database tool for linguistic fieldwork with lexicon and texts
Pros: powerful and well-designed; inbuilt ontology and analysis tools; growing user community
Cons: not flexible (8 tiers); one huge XML database with no good import or export function for texts; Windows only; difficult to configure; no audio, no video, no time-alignment; produced by SIL

FLEX

4) Annotation Tools: Other tools
Praat: for segmenting, best for phonetic annotation
CLAN: audio and video annotation in the CHAT or CA (Conversation Analysis) formats, for child language data (CHILDES project)
ANVIL: seems to be similar to ELAN (not tested)
EXMARaLDA Partitur Editor (U. Hamburg): widely used for discourse analysis
Audiamus and Eopas (N. Thieberger): organize (not create) annotation
Poio (developed in the context of CLARIN, API)
There are several others.

4) Annotation Tools: Comparison

                                     | Transcriber                | ELAN                       | Toolbox                       | FLEX
Complexity                           | Easy                       | Complex, with easier modes | Complex to configure          | Complex to configure
Audio                                | Yes                        | Yes                        | No (can play)                 | No
Video                                | No                         | Yes                        | No                            | No
Complex tiers                        | 1 per speaker              | Unlimited                  | Unlimited                     | Fixed: 8
Lexicon interop., automatic glossing | No                         | No (is planned)            | Yes                           | Yes
Unicode                              | No input                   | Yes                        | Yes                           | Yes
Data format                          | Simple XML                 | Complex XML                | Faulty TXT                    | XML database
Interoperability                     | Good                       | Fair                       | Good                          | Bad
User community / support             | Small?, no support?        | Large, good support        | Large, fair support           | Small, good support
Life cycle                           | Old (but new version 2011) | Constantly developed       | Not officially supported, old | New, being developed

5) Annotation data formats Transcriber *.TRS
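As a rough illustration of why the TRS format counts as "simple XML", the sketch below reads segments from a heavily trimmed, hand-written sample. The element and attribute names (`Trans`, `Episode`, `Section`, `Turn`, `Sync`, `speaker`, `startTime`, `endTime`, `time`) follow the Transcriber file format; the sample content and the function name `read_trs_segments` are invented, and inline elements such as `Event` or overlapping speech are deliberately not handled.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed sample; real files also carry a DTD
# reference, a speaker list and further attributes.
SAMPLE_TRS = """<Trans>
  <Episode>
    <Section type="report" startTime="0" endTime="5.0">
      <Turn speaker="spk1" startTime="0" endTime="5.0">
        <Sync time="0"/>
        first segment
        <Sync time="2.4"/>
        second segment
      </Turn>
    </Section>
  </Episode>
</Trans>"""


def read_trs_segments(xml_text):
    """Return (speaker, start_s, end_s, text) for each synchronised segment."""
    root = ET.fromstring(xml_text)
    segments = []
    for turn in root.iter("Turn"):
        syncs = turn.findall("Sync")
        turn_end = float(turn.get("endTime"))
        for i, sync in enumerate(syncs):
            start = float(sync.get("time"))
            # A segment ends at the next Sync, or at the end of the turn.
            end = float(syncs[i + 1].get("time")) if i + 1 < len(syncs) else turn_end
            # The transcription is the text that follows the Sync element.
            text = " ".join((sync.tail or "").split())
            segments.append((turn.get("speaker"), start, end, text))
    return segments


segments = read_trs_segments(SAMPLE_TRS)
for seg in segments:
    print(seg)
```

The single line of transcription per segment visible here is the "only one line of annotation" limitation noted for Transcriber.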

5) Annotation data formats ELAN *.EAF
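The way EAF separates time marks from annotations can be seen in a small reading sketch. The sample below is hand-written and trimmed to the elements needed here (a real EAF file also has a header, linguistic type definitions and namespaces on the root element); the tier names `tx` and `ft` and the function name `read_eaf` are assumptions made for the example. Time values are taken to be milliseconds, which is the usual unit declared in the file header.

```python
import xml.etree.ElementTree as ET

SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1800"/>
  </TIME_ORDER>
  <TIER TIER_ID="tx" LINGUISTIC_TYPE_REF="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>timeo ne veniat</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="ft" PARENT_REF="tx" LINGUISTIC_TYPE_REF="translation">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>I am afraid he might come</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_eaf(xml_text):
    """Return (time-aligned annotations, dependent annotations)."""
    root = ET.fromstring(xml_text)
    # Time slots without TIME_VALUE (unaligned slots) are skipped.
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT")
             if ts.get("TIME_VALUE") is not None}
    aligned, dependent = [], []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            aligned.append((tier_id, ann.get("ANNOTATION_ID"),
                            slots.get(ann.get("TIME_SLOT_REF1")),
                            slots.get(ann.get("TIME_SLOT_REF2")),
                            ann.findtext("ANNOTATION_VALUE", default="")))
        for ann in tier.iter("REF_ANNOTATION"):
            dependent.append((tier_id, ann.get("ANNOTATION_ID"),
                              ann.get("ANNOTATION_REF"),
                              ann.findtext("ANNOTATION_VALUE", default="")))
    return aligned, dependent


aligned, dependent = read_eaf(SAMPLE_EAF)
print(aligned)
print(dependent)
```

The translation carries no time marks of its own: it points to the transcription annotation, which is the secondary/tertiary dependency described in the theory section.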

5) Annotation data formats Toolbox standard format *.SDB, *.TBT, *.SF
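Standard Format files are plain text in which each field starts with a backslash marker. The sketch below groups fields into records. It is a simplification under stated assumptions: the markers `\ref`, `\tx`, `\mb`, `\ge` and `\ft` are common project conventions but are user-configurable, the column alignment of interlinear lines is ignored, and the sample text and the function name `read_sf_records` are invented.

```python
SAMPLE_SF = r"""\ref 001
\tx timeo ne veniat
\mb time-o ne veni-a-t
\ge fear-1.SG NEG.VOL come-SBJV.PRES-3.SG
\ft I am afraid he might come

\ref 002
\tx monitus eram
"""


def read_sf_records(text, record_marker="ref"):
    """Split backslash-marked text into records (dicts of marker -> values)."""
    records = []
    current = None  # record being filled
    last = None     # marker of the most recent field
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            if marker == record_marker:
                current = {}
                records.append(current)
            if current is None:
                continue  # header fields before the first record
            current.setdefault(marker, []).append(value.strip())
            last = marker
        elif current is not None and last is not None:
            # A line without a marker continues the previous field.
            joined = current[last][-1] + " " + line.strip()
            current[last][-1] = joined.strip()
    return records


records = read_sf_records(SAMPLE_SF)
for rec in records:
    print(rec)
```

Nothing in the file itself says which marker starts a record or how fields depend on each other; that knowledge lives in separate settings, which is one reason the format is hard to process reliably.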

6) Standing challenges
No standardized conventions for layers of linguistic annotation
Problems with interlinear morpheme glosses:
  Unclear status / interpretation of labels
  Different labels for same categories
  Different definitions for same categories based on different theories
CLARIN: partial solution: ISOcat, CLAVAS
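The "different labels for same categories" problem can be made concrete with a small normalisation sketch. The mapping table is a hypothetical, local stand-in for a shared category registry and is not an interface to ISOcat or CLAVAS; its target labels (PRS, PST, PRF, PTCP) are taken from the abbreviation list of the Leipzig Glossing Rules, and the source labels are those used in the examples of this presentation.

```python
import re

# Hypothetical project-specific labels mapped to one shared inventory.
LABEL_MAP = {
    "PRES": "PRS",   # present
    "PAST": "PST",   # past
    "PF": "PRF",     # perfect
    "PART": "PTCP",  # participle
}


def normalize_gloss(gloss: str, mapping=LABEL_MAP) -> str:
    """Replace known label variants; leave everything else untouched.

    Labels are taken to be maximal runs of capital letters, so lexical
    glosses in lower case and person numbers are not affected.
    """
    def swap(match):
        return mapping.get(match.group(0), match.group(0))

    return re.sub(r"[A-Z]+", swap, gloss)


print(normalize_gloss("come-SBJV.PRES-3.SG"))
print(normalize_gloss("ask-PART.PF.PASS"))
```

Such a table only solves the naming problem. Where two theories define a category differently, renaming the label does not make the annotations comparable.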

6) Standing challenges
EUROTYP: ca. 550 abbreviations of terms:
Morphological categories            246
Lexical word classes                114
Syntactic relations                  56
Syntactic constituent categories     27
Semantic roles                       16
Word order                           16
Sentence types                        2
Varieties and other                 6+2
Unspecific or unclear                78

6) Standing challenges
Inflection: analytical word forms
Where is the pluperfect (Plusquamperfekt) to be annotated?
moni  -t             -us         er    -a         -m
ask   -PART.PF.PASS  -NOM.SG.M   PASS  -IND.PAST  -1.SG.ACT
monitus eram (analytical form): 1P, Sg, Ind, Pass, Plpf, Nom V, Masc V
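One possible way to state the problem in data terms: the pluperfect is a property of the analytical form as a whole and appears on none of its morphs. The sketch below contrasts morph-level glosses with a form-level annotation that spans both tokens. The class `FormAnnotation` and the function `carried_by_morphs` are invented for this illustration, and the unclear categories "Nom V" and "Masc V" from the slide are left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class FormAnnotation:
    token_span: Tuple[int, int]   # first and last token index, inclusive
    categories: FrozenSet[str]    # categories of the word form as a whole


tokens = ["monitus", "eram"]

# Morph-level glosses as given in the example, one list per token.
morph_glosses: List[List[str]] = [
    ["ask", "PART.PF.PASS", "NOM.SG.M"],
    ["PASS", "IND.PAST", "1.SG.ACT"],
]

# Form-level annotation covering both tokens of the analytical form.
form = FormAnnotation((0, 1), frozenset({"1P", "SG", "IND", "PASS", "PLPF"}))


def carried_by_morphs(category: str, glosses: List[List[str]]) -> bool:
    """True if some morph gloss contains the category label."""
    labels = {label for word in glosses for g in word for label in g.split(".")}
    return category in labels


print(carried_by_morphs("PASS", morph_glosses))
print(carried_by_morphs("PLPF", morph_glosses))
print("PLPF" in form.categories)
```

An annotation scheme with only morph-level tiers has no natural place for the form-level record, which is the gap the slide points to.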