Annotation in Language Documentation Univ. Hamburg Workshop Annotation SEBASTIAN DRUDE 2015-10-29
Topics 1. Language Documentation 2. Data and Annotation (theory) 3. Types and interdependencies of Annotations 4. Annotation Tools (overview and comparison) 5. Annotation data formats (TRS, EAF, SF) 6. Standing challenges
1) Language Documentation
New subfield of linguistics (Himmelmann 1998): documentary linguistics, with language documentations as results.
Triggered by language endangerment, enabled by the technical / digital revolution.
Different from language description: in addition to the Boasian triad (grammar, dictionary, text collection), corpora of annotated multimedia data.
1) Language Documentation
A modern Language Documentation (LD) consists of a corpus of primary data (audio and video recordings) of utterances and texts from a broad spectrum of genres and domains.
Annotation accompanies the utterances.
An LD is digital and sustainable (metadata, open standards, archiving, maintenance).
It is generally accessible, e.g. via the internet.
2) Data and Annotation
[Diagram: structure of a session]
Session
Metadata (describe the event and the respective data)
Primary data: video recording, audio recording
Secondary data (annotation): transcription (orthographical / phonolog...), translation (word-by-word / idiomatic...), (linguistic / ethnograph. comment...), (morpheme glosses...) ...
2) Data and Annotation
Data
Data is always data FOR something, or at least OF something.
Usually it is a systematic representation of physical states and events (facts used FOR a scientific argument).
In LD, primary data is a direct rendering or result of communicative (speech) events, for instance a written text or, in particular, an audio/video recording of a speech event.
2) Data and Annotation
Annotation
Annotation of data is a symbolic representation of properties of the state/event represented in the data.
In LD, the most common and basic types of (primary) annotation are a transcription and a translation of the expressions represented in the primary data (e.g., an a/v recording).
2) Data and Annotation
[Diagram: three levels, from top to bottom]
Annotation = secondary data: represents symbolically properties of what is represented in the primary data.
Primary data: direct measurement / rendering / result of reality.
Reality (communicative events).
2) Data and Annotation
Global vs. unit-oriented annotation
Global or holistic annotation represents properties of the event as a whole and is, in LD, part of the metadata.
Unit-oriented annotation refers to specific parts of the data, in particular to utterances of individual sentences, words, sounds, etc.
For unit-oriented annotation we speak of individual annotations (plural).
2) Data and Annotation
Secondary and derived data
If unit-oriented annotation is directly based on primary data (such as a written text or an audio or video recording), then it is secondary data.
Annotation commenting on previous annotation is tertiary data, and so forth recursively.
In sum, all unit-oriented annotation is derived data.
There are other types of derived data (lexicon...).
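The recursive definition above (primary, secondary, tertiary, and so forth) can be illustrated with a minimal sketch. The class `DataItem` and its field `based_on` are hypothetical names introduced only for this illustration; they do not come from any annotation tool.

```python
from dataclasses import dataclass
from typing import Optional

ORDER_NAMES = {1: "primary", 2: "secondary", 3: "tertiary"}


@dataclass
class DataItem:
    """A piece of data; `based_on` is None for primary data (hypothetical model)."""
    label: str
    based_on: Optional["DataItem"] = None

    def order(self) -> int:
        # Primary data has order 1; each layer of annotation adds one.
        return 1 if self.based_on is None else self.based_on.order() + 1

    def kind(self) -> str:
        n = self.order()
        return ORDER_NAMES.get(n, f"order-{n} derived")


recording = DataItem("audio recording")
transcription = DataItem("transcription", based_on=recording)
gloss = DataItem("morpheme gloss", based_on=transcription)
```

Here `recording.kind()` gives "primary", `transcription.kind()` gives "secondary", and `gloss.kind()` gives "tertiary", because the gloss comments on the transcription rather than on the recording itself.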
2) Data and Annotation
Time-aligned annotation
Annotation of a media file is time-aligned annotation if each piece of annotation is explicitly associated with the corresponding chunk (time span, segment) of the media file.
Time-linking is the activity and result of specifying the time-alignment of each annotation associated with a certain chunk in the media file.
2) Data and Annotation
Time-linking is usually done by means of time marks.
Time marks: the start/end times of each chunk.
Segmenting (of a media file): identification of relevant chunks and their time marks.
Workflow: segmenting, then adding annotation.
Older unit-oriented annotation can later be time-aligned, but this is very labour-intensive (but now see WebMAUS from CLARIN/BAS).
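The two-step workflow above (segmenting, then adding annotation) can be sketched as follows. This is only an illustration under simple assumptions: time marks are given in milliseconds, and the names `TimeAlignedAnnotation`, `segment` and `annotate` are hypothetical, not taken from any of the tools discussed here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TimeAlignedAnnotation:
    start_ms: int     # time mark: start of the chunk
    end_ms: int       # time mark: end of the chunk
    value: str = ""   # annotation text, empty until it is filled in


def segment(time_marks: List[Tuple[int, int]]) -> List[TimeAlignedAnnotation]:
    """Step 1 (segmenting): create one empty annotation per chunk, in time order."""
    for start, end in time_marks:
        if start >= end:
            raise ValueError(f"invalid chunk: {start}-{end}")
    return [TimeAlignedAnnotation(s, e) for s, e in sorted(time_marks)]


def annotate(chunks: List[TimeAlignedAnnotation], values: List[str]):
    """Step 2 (adding annotation): fill the chunks in order."""
    if len(chunks) != len(values):
        raise ValueError("need exactly one value per chunk")
    for chunk, value in zip(chunks, values):
        chunk.value = value
    return chunks


chunks = segment([(1500, 3200), (0, 1400)])
annotate(chunks, ["first utterance", "second utterance"])
```

Each annotation carries its own pair of time marks, which is what makes it time-aligned in the sense defined above.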
3) Types and Interdependencies of Annotations
Linguistic types of annotations
Annotations differ according to the types of properties of the speech event that are represented in the annotation.
Annotations can be phonetic, phonological, morphological, syntactic, semantic, pragmatic (possibly others), and on each level they can focus on the units, on structures of units, on relations that hold among units, etc.
3) Types and Interdependencies of Annotations
Coverage of annotation
Basic annotation: only transcriptions, translations and optionally notes, on a sentence / clause / intonational unit level.
Basic glossing: additionally, information on individual morphs: a gloss (indication of meaning or function) and perhaps a part-of-speech tag.
Advanced glossing: one or several additional levels, from phonetic to pragmatic (for instance, a prosodic transcription, or annotation of the syntactic structure, of grammatical relations, etc.).
3) Types and Interdependencies of Annotations
Most often used format in language description: Interlinear Morpheme Translations / Glossing ("standard glossing").
C. Lehmann: Interlinear Morphemic Glossing. In Morphology (2004, first version 1982).
Leipzig Glossing Rules: Linguistics @ MPI-EVA (B. Comrie, M. Haspelmath) & Linguistics @ Univ. Leipzig (B. Bickel), 2008.
3) Types and Interdependencies of Annotations
Example:
time-o     ne       veni-a-t
fear-1.sg  NEG.VOL  come-sbjv.pres-3.sg
'I am afraid he might come'
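The example above relies on the standard glossing convention that words are separated by spaces and segmentable morphs by hyphens, with periods joining several labels for one morph. A minimal sketch of how such a form line and gloss line can be paired mechanically is given below; the function name `align_gloss` is hypothetical, and the sketch ignores further conventions such as clitic boundaries.

```python
def align_gloss(form_line: str, gloss_line: str):
    """Pair each morph with its gloss, word by word.

    Words are separated by whitespace and morphs by hyphens, so both
    lines must contain the same number of words and of hyphens per word.
    """
    words = form_line.split()
    glosses = gloss_line.split()
    if len(words) != len(glosses):
        raise ValueError("form and gloss lines differ in number of words")
    aligned = []
    for word, gloss in zip(words, glosses):
        morphs, labels = word.split("-"), gloss.split("-")
        if len(morphs) != len(labels):
            raise ValueError(f"morph/gloss mismatch in {word!r}")
        aligned.append(list(zip(morphs, labels)))
    return aligned


pairs = align_gloss("time-o ne veni-a-t",
                    "fear-1.sg NEG.VOL come-sbjv.pres-3.sg")
```

The result pairs `time` with `fear`, `o` with `1.sg`, and so on. The strict one-to-one check is exactly what forces every category onto some segment of the form line.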
3) Types and Interdependencies of Annotations
Problems:
Theory-specific (item-and-arrangement, not item-and-process nor word-and-paradigm).
Mixes morphology and syntax.
Problems with synthetic word forms: timeo is 1P, SG, IND, ACT, PRES, but where is PRES? (Ø)
Analytical word forms (especially discontinuous ones).
What do the labels designate? Meaning? Categories? Functions? Often undefined.
3) Types and Interdependencies of Annotations
Hans-Heinrich Lieb & Sebastian Drude: Advanced Glossing: A Language Documentation Format (DOBES Working Paper, November 2000)
http://dobes.mpi.nl/documents/advanced-Glossing1.pdf
Advanced Glossing (AG): a syntactic glossing table
Advanced Glossing (AG): a morphological glossing table
AG: A Glossing Table
[Diagram of a glossing table. Labels: glossing table; a line; a cell; a holistic line (twice); "is a list".]
AG: A Glossing
[Diagram of a glossing. Labels: glossing; glossing table; comment; general comment; "is linked to".]
AG: Syntactic and Morphological Glossings of a sentence
[Diagram. Labels: syntactic glossing of a sentence; morphological glossing of a sentence; "is a glossing of"; M. glossing of word 1; M. glossing of word 2; M. glossing of word 3.]
AG: Glossing of a Text
[Diagram. Labels: raw data; general comment on the text; glossings of the sentences ("is a list"): syntactic and morphological glossings of sentence 1, of sentence 2, of sentence 3; (other components). The diagrams of a glossing table and of the syntactic and morphological glossings of a sentence are repeated as insets.]
AG: The number line
AG: Phonetic Form and Intonation
AG: Phonological Forms and Intonation
AG: Orthographical Base Forms
AG: Lexical Categories and Form Categories
AG: Meanings and Semantic Effects
AG: Constituent Structure and Relations
AG: Orthographical Unit and Meaning
3) Types and Interdependencies of Annotations
Time-linked annotations for sentence-utterances
  Other dependent sentence-annotations
  Subdivision into annotations for syntactic units (can be internally time-aligned or not)
    Dependent syntactic-unit-annotations
    Further subdivision into annotations for morphs (hardly possible to time-align internally)
      Dependent morph-annotations
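The dependency structure listed above can be sketched as a small tree in which only the top level needs its own time marks, and everything below inherits them. The class `Annotation` and the helper `inherited_span` are hypothetical names for this illustration; real tools model tiers in their own ways.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Annotation:
    value: str
    # Only time-linked annotations carry their own time marks (in ms).
    span: Optional[Tuple[int, int]] = None
    # Dependent annotations (e.g. a translation or a gloss) refer to this one.
    dependents: dict = field(default_factory=dict)
    # Subdivisions (sentence -> syntactic units -> morphs).
    parts: List["Annotation"] = field(default_factory=list)


def inherited_span(path: List[Annotation]) -> Optional[Tuple[int, int]]:
    """Return the time span of the nearest time-linked annotation on the path."""
    for node in reversed(path):
        if node.span is not None:
            return node.span
    return None


sentence = Annotation("timeo ne veniat", span=(0, 1800),
                      dependents={"translation": "I am afraid he might come"})
word = Annotation("timeo")
morph = Annotation("time", dependents={"gloss": "fear"})
word.parts.append(morph)
sentence.parts.append(word)
```

A morph such as `time` has no time marks of its own; `inherited_span([sentence, word, morph])` falls back to the span of the sentence-utterance, which mirrors the observation that morphs can hardly be time-aligned internally.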
4) Annotation Tools
Transcriber
Tool for the segmentation and transcription of audio files.
Pros: compatible with Mac, Windows & Linux; very easy to use; produces simple XML files.
Cons: no Unicode input possible; only one line of annotation; no video; no lexicon; outdated (new version not tested).
Transcriber
4) Annotation Tools
ELAN
Tool for the complex annotation of audio and video files.
Pros: compatible with Mac, Windows & Linux; audio and multiple video files; unlimited tiers for different speakers; state-of-the-art; wide user community; XML output (but complex).
Cons: complex tool for beginners (but now: easier transcription mode); no lexicon (yet).
ELAN
4) Annotation Tools
Toolbox
Text-oriented general database tool for linguistic fieldwork, with lexicon and texts.
Pros: flexible and powerful; export to different formats (incl. XML), therefore easy to integrate with other tools; many users.
Cons: too flexible; poor data format ("Standard Format"); complex to set up; tricky on Mac/Linux; no video and no time-aligning; at end of life cycle; produced by SIL.
Toolbox
4) Annotation Tools
FLEX
Extensive linguistic database tool for linguistic fieldwork, with lexicon and texts.
Pros: powerful and well-designed; inbuilt ontology and analysis tools; growing user community.
Cons: not flexible (8 tiers); one huge XML database with no good import or export function for texts; Windows only; difficult to configure; no audio, no video, no time-alignment; produced by SIL.
FLEX
4) Annotation Tools
Other tools
Praat: for segmenting, best for phonetic annotation.
CLAN: audio and video annotation, in the CHAT or CA (Conversation Analysis) formats, for child language data (CHILDES project).
ANVIL: seems to be similar to ELAN (not tested).
EXMARaLDA Partitur Editor (U. Hamburg): widely used for discourse analysis.
Audiamus and Eopas (N. Thieberger): organize (not create) annotation.
Poio (developed in the context of CLARIN; API).
There are several others.
4) Annotation Tools

Criterion                             | Transcriber                | ELAN                     | Toolbox                       | FLEX
Complexity                            | Easy                       | Complex, w. easier modes | Complex to configure          | Complex to configure
Audio                                 | Yes                        | Yes                      | No (can play)                 | No
Video                                 | No                         | Yes                      | No                            | No
Complex tiers                         | 1 per speaker              | Unlimited                | Unlimited                     | Fixed: 8
Lexicon interop., automatic glossing  | No                         | No (is planned)          | Yes                           | Yes
Unicode                               | No input                   | Yes                      | Yes                           | Yes
Data format                           | Simple XML                 | Complex XML              | Faulty TXT                    | XML database
Interoperability                      | Good                       | Fair                     | Good                          | Bad
User community / support              | Small?, no support?        | Large, good support      | Large, fair support           | Small, good support
Life cycle                            | Old (but new version 2011) | Constantly developed     | Not officially supported, old | New, being developed
5) Annotation data formats Transcriber *.TRS
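As a rough illustration of why the TRS format counts as simple XML, the sketch below reads segments from a tiny hand-written sample. The sample is an assumption about the general shape of a Transcriber file (`Turn` elements with `startTime`/`endTime`, and `Sync` elements whose following text is the transcription); it is not copied from a real file, and real files contain further elements (speakers, events, sections) that this sketch ignores. The function name `read_segments` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, assumed to resemble a minimal Transcriber file.
SAMPLE_TRS = """<Trans version="1">
  <Episode>
    <Section type="report" startTime="0" endTime="5.2">
      <Turn speaker="spk1" startTime="0" endTime="5.2">
        <Sync time="0"/>
        first segment
        <Sync time="2.5"/>
        second segment
      </Turn>
    </Section>
  </Episode>
</Trans>"""


def read_segments(trs_text):
    """Return (start, end, text) triples.

    A segment starts at a Sync mark and ends at the next Sync mark,
    or at the end of the Turn for the last segment.
    """
    segments = []
    root = ET.fromstring(trs_text)
    for turn in root.iter("Turn"):
        syncs = turn.findall("Sync")
        starts = [float(s.get("time")) for s in syncs]
        ends = starts[1:] + [float(turn.get("endTime"))]
        for sync, start, end in zip(syncs, starts, ends):
            # The transcription is the text that follows the Sync element.
            segments.append((start, end, (sync.tail or "").strip()))
    return segments


segments = read_segments(SAMPLE_TRS)
```

Only the start of each segment is stored explicitly; the end time is implied by the next mark, which fits a tool with a single line of annotation.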
5) Annotation data formats ELAN *.EAF
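The following sketch shows, on a reduced hand-written sample, how time-aligned and dependent annotations are typically separated in an EAF file: time marks live in `TIME_SLOT` elements, aligned annotations point to two time slots, and dependent annotations point to another annotation. The sample omits the header, linguistic type definitions and other parts that a real EAF file requires, and the function name `read_eaf` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced, hand-written sample; a real EAF file contains more elements.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1800"/>
  </TIME_ORDER>
  <TIER TIER_ID="tx" LINGUISTIC_TYPE_REF="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>timeo ne veniat</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="ft" PARENT_REF="tx" LINGUISTIC_TYPE_REF="translation">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>I am afraid he might come</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_eaf(eaf_text):
    """Return two dicts keyed by annotation ID: time-aligned and dependent annotations."""
    root = ET.fromstring(eaf_text)
    # Time marks in milliseconds; slots without a value are skipped.
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT") if ts.get("TIME_VALUE")}
    aligned, dependent = {}, {}
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            aligned[ann.get("ANNOTATION_ID")] = (
                tier.get("TIER_ID"),
                slots.get(ann.get("TIME_SLOT_REF1")),
                slots.get(ann.get("TIME_SLOT_REF2")),
                ann.findtext("ANNOTATION_VALUE", default=""),
            )
        for ann in tier.iter("REF_ANNOTATION"):
            dependent[ann.get("ANNOTATION_ID")] = (
                tier.get("TIER_ID"),
                ann.get("ANNOTATION_REF"),  # the annotation this one depends on
                ann.findtext("ANNOTATION_VALUE", default=""),
            )
    return aligned, dependent


aligned, dependent = read_eaf(SAMPLE_EAF)
```

The indirection through time slots and annotation references is what makes the XML more complex than a TRS file, but it is also what allows an unlimited number of interdependent tiers.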
5) Annotation data formats Toolbox standard format *.SDB, *.TBT, *.SF
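A Standard Format file is plain text in which each field starts with a backslash marker. The sketch below splits such text into records. The markers used in the sample (`\ref`, `\tx`, `\mb`, `\ge`, `\ft`) follow a common convention but are configurable per project, so they are an assumption here, as is the function name `read_records`.

```python
# Hand-written sample; the marker names are a common but project-specific convention.
SAMPLE_SF = r"""\ref S.001
\tx timeo ne veniat
\mb time-o ne veni-a-t
\ge fear-1.sg NEG.VOL come-sbjv.pres-3.sg
\ft I am afraid
he might come

\ref S.002
\tx veniat
"""


def read_records(sf_text, record_marker="ref"):
    """Split Standard Format text into records, each a list of (marker, value)."""
    records = []
    for line in sf_text.splitlines():
        if not line.strip():
            continue  # blank lines carry no data
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            if marker == record_marker or not records:
                records.append([])  # a record marker opens a new record
            records[-1].append((marker, value.strip()))
        elif records and records[-1]:
            # A line without a marker continues the previous field.
            marker, value = records[-1][-1]
            records[-1][-1] = (marker, f"{value} {line.strip()}")
    return records


records = read_records(SAMPLE_SF)
```

Records are kept as lists of pairs rather than dictionaries because a marker may occur more than once in a record. Note that the alignment between the `\mb` and `\ge` lines is carried only by whitespace and hyphens, not by any explicit structure in the file.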
6) Standing challenges
No standardized conventions for layers of linguistic annotation.
Problems with interlinear morpheme glosses.
Unclear status / interpretation of labels.
Different labels for same categories.
Different definitions for same categories, based on different theories.
CLARIN: partial solution: ISOcat, CLAVAS.
6) Standing challenges
EUROTYP: ca. 550 abbreviations of terms:
Morphological categories: 246
Lexical word classes: 114
Syntactic relations: 56
Syntactic constituent categories: 27
Semantic roles: 16
Word order: 16
Sentence types: 2
Varieties and other: 6+2
Unspecific or unclear: 78
6) Standing challenges
Inflection: analytical word forms
Where is PLUPERFECT to be annotated?
moni  -t             -us         er    -a         -m
ask   -PART.PF.PASS  -NOM.SG.M   PASS  -IND.PAST  -1.SG.ACT
monitus eram (analytical form): 1P, Sg, Ind, Pass, Plpf, Nom V, Masc V