CoLang 2014 Data Management and Archiving Course. Session 2. Nick Thieberger University of Melbourne

Quiz: In a morning recording session you recorded two speakers, each telling a story, and then recorded yourself asking both of them together to describe, in English, what they had told you. This produced 3 audio files, 2 video files (which also carry the audio) and 3 photographs. In addition you have handwritten notes. Later a speaker (whom you are paying for her assistance) will transcribe the media. How would you prepare this dataset? What files would there be? What naming conventions would you use? How would you name the transcripts?
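One possible approach to the naming questions, sketched in Python. The collector code NT1, the item number 2014001, the part letters and the file extensions are all hypothetical choices for illustration (loosely modelled on identifiers such as NT1-98002-A that appear later in these slides), not a prescribed convention. The only point is that each transcript takes its name from the media file it describes.

```python
from pathlib import PurePosixPath

def item_names(collector, item, count, extension):
    """Build names like NT1-2014001-A.wav for `count` files of one type."""
    parts = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:count]
    return [f"{collector}-{item}-{part}{extension}" for part in parts]

def transcript_name(media_name, extension=".txt"):
    """Return a transcript filename that shares the media file's base name."""
    return PurePosixPath(media_name).stem + extension

# Hypothetical identifiers for the three audio files in the quiz.
audio = item_names("NT1", "2014001", 3, ".wav")
transcripts = [transcript_name(name) for name in audio]
```

With names built this way, a transcript can always be matched to its recording by dropping the extension.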

Briefly: why Toolbox, or Flex?

Toolbox IGT (interlinear glossed text): ID number, media reference, text line, parsed morphemic line, gloss line, free translation line

Parse function

Lookup function

Lexical database functions

Tracking text processing in Toolbox


How to get from the field to analysis to the archive? By using good tools, for example: Transcriber, Elan, Toolbox, Fieldworks. Why are these good tools?

Transcription with time-alignment: A necessary step in building a corpus from which to make generalizations. No extra cost to use time-alignment. Provides an index of the media. Many possible outputs. The tools create simple text files that encode the relationship of the text to the timecodes in the media.

Other transcription tools: http://liceu.uab.es/~joaquim/phonetics/fon_anal_acus/herram_anal_acus.html

Other annotation tools: Anvil, ATLAS, CLAN, CSLU Toolkit, The MATE Workbench, Multitool, SignStream, SmartKom, SyncWRITER, TalkBank, Transana, Voicewalker. List of tools at this wiki: http://www.exmaralda.org/annotation/index.php/main_page

What is so good about these programs (Toolbox, Flex, Elan, etc.)? They produce good textual outputs. Simple (but structured) text can be easily converted and archived.

Working form: the form in which information is stored as it is created and edited. Archival form: the form in which information is stored for access long into the future. Presentation form: the form in which information is presented to the public. Simons (2004)

Working form: the form in which information is stored as it is created and edited. Can include notes, not all of which may be useful later. E.g., files being processed (in Elan, Toolbox or Flex) and ancillary files (.typ, .prj, etc.) that are only necessary while working to create the annotated data.

Archival form: the form in which information is stored for access long into the future. The highest-resolution form of the data.

Presentation form: the form in which information is presented to the public. Derived from the working or archival form. May be compressed and arranged to make it easier to deliver and interpret.

Reuse: Recording > Transcript
Preg nafkal skot namer nig Emlakul. NT1-98002-A 316.43 320.317

Reuse: Recording > Transcript > Interlinear
\id 061:005
\aud NT1-98002-A 316.43 320.317
\tx Preg nafkal skot namẽr nig Emlakul.
\mr preg nafkal skot namẽr nig Emlakul
\mg make war with people of p.name
\fg To fight the people of Malakula.
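Because a record like this is simple structured text, it can be read by a few lines of code. The following is a minimal sketch in Python, not a full Toolbox parser. It assumes one field per line, no repeated markers, and that the morpheme and gloss lines each contain six whitespace-separated items (the record is written out that way here). It also assumes that the two numbers after the media name are start and end times in seconds. The function names are invented for illustration.

```python
def parse_record(text):
    """Split a Toolbox-style record into a {marker: value} dictionary."""
    fields = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # ignore anything that is not a field line
        marker, _, value = line[1:].partition(" ")
        fields[marker] = value.strip()
    return fields

def parse_media_reference(value):
    """Split 'NAME START END' into a name and two times (assumed seconds)."""
    name, start, end = value.split()
    return name, float(start), float(end)

record = r"""
\id 061:005
\aud NT1-98002-A 316.43 320.317
\tx Preg nafkal skot namẽr nig Emlakul.
\mr preg nafkal skot namẽr nig Emlakul
\mg make war with people of p.name
\fg To fight the people of Malakula.
"""

fields = parse_record(record)
media, start, end = parse_media_reference(fields["aud"])
# Pair each morpheme with its gloss, relying on whitespace alignment.
glossed = list(zip(fields["mr"].split(), fields["mg"].split()))
```

From `fields`, `media`, `start`, `end` and `glossed` the same record could be rewritten as XML, as a page of a book, or as a subtitle cue, which is the reuse the slide illustrates.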

Book production based on IGT

Playable media, produced in Elan; text, interlinearised in Fieldworks

Longevity of character encoding: Problems of legacy fonts: fonts act as variable representations of one underlying form, e.g., the ASCII character N could be displayed as ŋ if IPATimes was selected as the font. Without that font, the same file shows N.
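Text typed in a legacy font can be moved to Unicode by replacing each stored character with the character the font displayed. The sketch below assumes a hand-built mapping table. Only the N to ŋ pair comes from the slide; a real table would have to be worked out for each font, and the names `LEGACY_TO_UNICODE` and `legacy_to_unicode` are invented for illustration.

```python
# Hypothetical mapping from the characters stored in a legacy-font file
# to the Unicode characters that the font displayed.
LEGACY_TO_UNICODE = {"N": "\u014b"}  # U+014B LATIN SMALL LETTER ENG

def legacy_to_unicode(text, mapping=LEGACY_TO_UNICODE):
    """Replace each stored character with the character the font showed."""
    return text.translate(str.maketrans(mapping))

converted = legacy_to_unicode("aNa")
```

Note that this replaces every N in the text, so it is only safe on stretches that were actually set in the legacy font.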

Need for documents to be legible! Characters must be portable so that the documents are legible. The international standard for character encoding is Unicode.

Unicode issues: choose characters carefully! Cf. Mia Kalish's article in LD&C (2): mixing different code sets can lead to sorting issues. Three ways of producing what looks like the same character:
à - option a
à - a plus combining diacritic U+0300
à - U+00E0

Using Unicode: setting up keyboards for more extensive Unicode entry
Windows XP:
Tavultesoft Keyman: http://www.tavultesoft.com/keyman/
SIL IPA keyboard for Keyman: http://scripts.sil.org/keyman
Microsoft Keyboard Layout Creator: http://www.microsoft.com/globaldev/tools/msklc.mspx
Mac OS X:
Ukelele: http://scripts.sil.org/ukelele
Linux:
Keyboard Mapping for Linux: http://kmfl.sourceforge.net/
Handy tools for inserting characters: http://scripts.sil.org/inputtoollinks

Using Unicode: entering characters
Windows XP: Character Map, found under Programs > Accessories > System Tools > Character Map. Use Alt-+-XXXX (e.g. Alt+00E9, where 00E9 is the hex code). Some applications (e.g. MS Word) support typing the hex code followed by Alt-x.
Mac OS X: Character Palette. Set up the Unicode Hex Input keyboard (International Preferences). Use Option-XXXX (e.g. Option-00E9).
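The difference between a precomposed à and an a followed by a combining accent is invisible on screen but real in the file. A minimal sketch using Python's standard `unicodedata` module shows the effect on comparison and sorting. The word list is made up for illustration, and plain `sorted` orders by code point, which is not the same as a language-appropriate sort order.

```python
import unicodedata

precomposed = "\u00e0"   # à as one code point, U+00E0
combining = "a\u0300"    # a followed by combining grave accent, U+0300

# The two strings look alike but are not equal until normalized.
same_before = precomposed == combining
nfc_equal = unicodedata.normalize("NFC", combining) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == combining

# Mixed encodings in one word list: the two à-words end up apart.
words = ["\u00e0a", "a\u0300b", "b"]
sorted_raw = sorted(words)
sorted_nfc = sorted(unicodedata.normalize("NFC", w) for w in words)
```

In `sorted_raw` the word "b" falls between the two words that begin with à. After normalizing everything to one form (NFC here) they sort next to each other.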

Questions / discussion?

Archiving

Archives, old places with old stuff http://www.kings.cam.ac.uk/library/archives/images/archives3.jpg

According to Jacobs & Humphrey (2004), "Data archiving is a process, not an end state where data is simply turned over to a repository at the conclusion of a study. Rather, data archiving should begin early in a project and incorporate a schedule for depositing products over the course of a project's life cycle and the creation and preservation of accurate metadata, ensuring the usability of the research data itself. Such practices would incorporate archiving as part of the research method."
Jacobs, James A., & Charles Humphrey. (2004). Preserving Research Data. Communications of the ACM 47(9): 27-29.

Archiving: If we make records of languages, we need some way of making sure they last.

Reusable and interoperable research data: DVDs of field recordings archived directly from the northern Philippines

Why archive? It is our responsibility to ensure long-term access to and availability of the data we record. For the speakers and their descendants. Centrality of data in language documentation.

Endangered data: Digital word processing is our most advanced writing technology to date, but it is also the most ephemeral. Hardware and software technologies are changing so rapidly that a typical storage medium or file format is obsolete within 5 to 10 years. Digital records of endangered languages are in danger of dying out before the languages themselves.

What should we do? 1. Put the materials into an enduring file format. 2. Provide sufficient metadata to make the files discoverable. 3. Deposit the materials with an archive that will make a practice of migrating them to new storage media as needed.

Archiving formats
Text - XML, PDF, TXT
Media:
Audio - PCM/WAV/BWF
Images - TIF
Video - JPEG2000, MXF
How do we get our data into these formats?
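A first step towards answering that question is to find out which working files are not yet in one of the listed formats. The Python sketch below does this by file extension only. The extension spellings, the grouping and the example filenames are assumptions made for illustration, and an extension says nothing certain about how a file is actually encoded.

```python
from pathlib import PurePosixPath

# Extensions assumed to correspond to the formats named on the slide.
ARCHIVAL_EXTENSIONS = {
    "text": {".xml", ".pdf", ".txt"},
    "audio": {".wav"},          # BWF files normally also use .wav
    "image": {".tif", ".tiff"},
    "video": {".mxf", ".mj2"},
}

def needs_conversion(filenames):
    """Return the filenames whose extension is not in the archival list."""
    allowed = set().union(*ARCHIVAL_EXTENSIONS.values())
    return [
        name for name in filenames
        if PurePosixPath(name).suffix.lower() not in allowed
    ]

# Hypothetical working files from a recording session.
working_files = ["story1.wav", "story1.mp3", "notes.doc", "photo1.TIF"]
to_convert = needs_conversion(working_files)
```

Here `to_convert` lists the MP3 and the word-processor file, which would need to be exported or re-created in an archival format before deposit.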