CoLang 2014 Data Management and Archiving Course Session 2 Nick Thieberger University of Melbourne
Quiz: In a morning recording session you recorded two speakers, each telling a story, then recorded both of them together answering your questions about what they had told you, this time in English. This was done in 3 audio files, 2 video files (of which the audio is also recorded) and 3 photographs. In addition you have handwritten notes. Later a speaker (whom you are paying for her assistance) will transcribe the media. How would you prepare this dataset? What files would there be? What naming conventions would you use? How would you name the transcripts?
Briefly Why Toolbox, or Flex?
Toolbox IGT: ID number; media reference; text line; parsed morphemic line; gloss line; free translation line
Parse function
Lookup function
Lexical database functions
Tracking text processing in Toolbox
How to get from the field to analysis to the archive? Using good tools, for example: Transcriber, Elan, Toolbox, Fieldworks. Why are these good tools?
Transcription with time alignment: A necessary step in building a corpus from which to make generalizations. Time alignment adds no extra cost. It gives an index of the media. Many outputs are possible. The tools create simple text files that encode the relationship of the text to the timecodes in the media.
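To illustrate how a time-aligned text file can act as an index of the media, here is a minimal Python sketch. The tab-delimited layout (media name, start, end, text) is an assumption made for the example, not the export format of any particular tool, and `build_index` is a hypothetical helper. The single segment is the example utterance used in this course.

```python
from collections import defaultdict

# Assumed layout (hypothetical): media name, start (s), end (s), text,
# separated by tabs, one transcribed segment per line.
EXPORT = "NT1-98002-A\t316.43\t320.317\tPreg nafkal skot namer nig Emlakul.\n"


def build_index(export_text):
    """Map each lower-cased word to the media segments it occurs in."""
    index = defaultdict(list)
    for line in export_text.splitlines():
        media, start, end, text = line.split("\t")
        for word in text.lower().split():
            word = word.strip(".,;:!?")  # drop trailing punctuation
            index[word].append((media, float(start), float(end)))
    return index


index = build_index(EXPORT)
hits = index["nafkal"]  # every segment in which the word was spoken
```

Each hit names a media file and a start and end time, so a search of the text leads straight to the matching stretch of the recording.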
Other transcription tools http://liceu.uab.es/~joaquim/phonetics/fon_anal_acus/herram_anal_acus.html
Other annotation tools: Anvil, ATLAS, CLAN, CSLU Toolkit, The MATE Workbench, Multitool, SignStream, SmartKom, SyncWRITER, TalkBank, Transana, Voicewalker. List of tools at this wiki: http://www.exmaralda.org/annotation/index.php/main_page
What is so good about these programs: Toolbox, Flex, Elan etc.? They produce good textual outputs Simple (but structured) text can be easily converted and archived
Working form The form in which information is stored as it is created and edited. Archival form The form in which information is stored for access long into the future. Presentation form The form in which information is presented to the public. Simons (2004)
Working form: The form in which information is stored as it is created and edited. Can include notes, not all of which may be useful later. E.g., files being processed (in Elan, Toolbox or Flex) and ancillary files (.typ, .prj, etc.) that are only necessary while working to create the annotated data.
Archival form The form in which information is stored for access long into the future. Highest resolution form of the data
Presentation form The form in which information is presented to the public. Derived from working or archival form. May be compressed and arranged to make it easier to deliver and interpret.
Reuse Recording >Transcript Preg nafkal skot namer nig Emlakul. NT1-98002-A 316.43 320.317
Reuse: Recording > Transcript > Interlinear
\id 061:005
\aud NT1-98002-A 316.43 320.317
\tx Preg nafkal skot namẽr nig Emlakul.
\mr preg nafkal skot namẽr nig Emlakul
\mg make war with people of p.name
\fg To fight the people of Malakula.
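Because the interlinear record is simple structured text, it can be read by a few lines of code and reused elsewhere. The following Python sketch is one way to do that, under stated assumptions: the record string is reconstructed from the slide above, `parse_record` and `media_reference` are hypothetical helper names, and a real Toolbox file may repeat markers within a record, which this sketch does not handle.

```python
import re

# Record reconstructed from the slide; only four of its fields are shown.
RECORD = r"""\id 061:005
\aud NT1-98002-A 316.43 320.317
\tx Preg nafkal skot namẽr nig Emlakul.
\fg To fight the people of Malakula.
"""


def parse_record(text):
    """Collect backslash-coded fields into a dict keyed by marker.

    If a marker is repeated, the last value wins (a simplification).
    """
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\\(\w+)\s*(.*)", line)
        if match:
            marker, value = match.groups()
            fields[marker] = value.strip()
    return fields


def media_reference(fields):
    """Split the \\aud field into media name, start and end (seconds)."""
    name, start, end = fields["aud"].split()
    return name, float(start), float(end)


record = parse_record(RECORD)
name, start, end = media_reference(record)
duration = round(end - start, 3)  # length of the segment in seconds
```

With the fields in a dictionary, the same record can feed a media player (name, start, end), a printed text (tx, fg) or a conversion to another format.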
Book production based on IGT
Playable media, produced in Elan Text, interlinearised in Fieldworks
Longevity of character encoding: Problems of legacy fonts. Fonts act as a variable representation of one underlying form, e.g., the ASCII character N could be displayed as ŋ if IPATimes was selected as the font.
Need for documents to be legible! Characters must be portable so that the documents are legible. International standard for character encoding: Unicode
Unicode issues: Choose characters carefully! Cf. Mia Kalish's article in LD&C (2): mixing different code sets can lead to sorting issues. à - option a; à - a plus combining diacritic U+0300; à - precomposed U+00E0
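The sorting problem can be seen in a short Python sketch that uses the standard `unicodedata` module. The word list is made up purely for illustration. Normalizing to NFC is shown as one possible remedy, not as the only one.

```python
import unicodedata

precomposed = "\u00E0"   # à as a single code point, U+00E0
combining = "a\u0300"    # a followed by the combining grave accent, U+0300

# The two strings look identical on screen but are not equal as data.
same_raw = precomposed == combining

# Normalization converts between the two representations.
nfc = unicodedata.normalize("NFC", combining)    # composed form
nfd = unicodedata.normalize("NFD", precomposed)  # decomposed form

# Made-up word list mixing both encodings of à.
words = ["a\u0300z", "\u00E0b", "b"]

# Sorting by code point separates the two à words, with "b" in between.
raw_order = sorted(words)

# After normalizing everything to NFC the two à words sort together.
normalized = [unicodedata.normalize("NFC", w) for w in words]
nfc_order = sorted(normalized)
```

Plain code-point sorting still puts à after b. Language-appropriate ordering needs a collation step on top of consistent encoding.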
Using Unicode: Setting up keyboards for more extensive Unicode entry. Windows XP: Tavultesoft Keyman: http://www.tavultesoft.com/keyman/ SIL IPA keyboard for Keyman: http://scripts.sil.org/keyman Microsoft Keyboard Layout Creator: http://www.microsoft.com/globaldev/tools/msklc.mspx MacOS X: Ukelele: http://scripts.sil.org/ukelele Linux: Keyboard Mapping for Linux: http://kmfl.sourceforge.net/ Handy tools for inserting characters: http://scripts.sil.org/inputtoollinks
Using Unicode Entering characters Windows XP: Windows XP character map - To Find Character Map: Programs > Accessories >System Tools > Character Map Use Alt-+-XXXX (e.g. Alt+00E9 [hex code]). Some applications (e.g. MS Word) support typing the hex code followed by Alt-x. Mac OS X: Mac - character palette Set up the Unicode Hex Input Keyboard (International Preferences). Use Option-XXXX (e.g. Option-00E9).
Questions / discussion?
Archiving
Archives, old places with old stuff http://www.kings.cam.ac.uk/library/archives/images/archives3.jpg
According to Jacobs & Humphrey (2004), "Data archiving is a process, not an end state where data is simply turned over to a repository at the conclusion of a study. Rather, data archiving should begin early in a project and incorporate a schedule for depositing products over the course of a project's life cycle and the creation and preservation of accurate metadata, ensuring the usability of the research data itself. Such practices would incorporate archiving as part of the research method." Jacobs, James A., & Charles Humphrey. (2004). Preserving Research Data. Communications of the ACM 47(9): 27-29.
Archiving If we make records of languages we need some way of making sure they last
Reusable and interoperable research data DVDs of field recordings archived directly from the northern Philippines
Why archive? Our responsibility to ensure long-term access to and availability of the data we record. For the speakers and their descendants. Centrality of data in language documentation.
Endangered data Digital word processing is our most advanced writing technology to date, but it is also the most ephemeral. Hardware and software technologies are changing so rapidly that a typical storage medium or file format is obsolete within 5 to 10 years. Digital records of endangered languages are in danger of dying out before the languages themselves
What should we do? 1. Put the materials into an enduring file format. 2. Provide sufficient metadata to make the files discoverable. 3. Deposit the materials with an archive that will make a practice of migrating them to new storage media as needed.
Archiving formats: Text: XML, PDF, TXT. Media: Audio: PCM/WAV/BWF; Images: TIFF; Video: JPEG2000, MXF. How do we get our data into these formats?