"Best practices in digital language archiving of language and music data" Sep. 6 7, University of Cologne. Abstracts

The role of the archive in mediating endangered language documentation
Gary Holton, Alaska Native Language Archive, University of Alaska Fairbanks

In traditional approaches to archiving, the role of the archive has been limited to maintaining physical and intellectual control; that is, the archive strives to know what it has and where it is located. Making sense of the material, knowing what is relevant and why, is left to the user. Increasingly, though, archivists have become areal specialists, expected to know not just what items are held by the archive but also how those items relate to the larger intellectual effort in the field. This is certainly true of endangered language archives, which increasingly play an important role not just in preserving documentation but in contributing to further documentation and revitalization. To a certain extent this is not a new phenomenon. The line between language documenter and language archivist has always been a fine one, and language archivists have played important roles in moving the field of language documentation forward. By helping to identify relevant extant documentation for under-documented languages, Freeman and Smith's (1966) guide to the Native American collections at the American Philosophical Society inspired much new research in Native American languages. Similarly, Krauss and McGary's (1980) bibliographic catalog Alaskan Indian languages offers a critical analysis of scholarly contributions to Alaska Native language documentation, thus providing a basis for future research. This type of mediated access to language archives provides a point of entry which allows better utilization of the archive. In theory, mediated access should be less necessary for digital archives, since electronic access and rich metadata should allow for ready discovery of relevant materials. In practice, the sheer volume of accessible materials, coupled with often substandard metadata, complicates the resource discovery process. Users are flooded with too much data and may find it difficult to discern the more useful resources. To address this issue, the Alaska Native Language Archive has initiated an effort to create featured collections highlighting the most valuable and useful resources for each Alaska Native language. In this presentation we describe the process of creating these mediated collections and report on initial reactions from user groups. It is hoped that this presentation will inspire further discussion about the role of mediation in endangered language archives.

Metadata for Endangered Languages and Global Biodiversity
Mary S. Linn, Sam Noble Oklahoma Museum of Natural History, University of Oklahoma

The University of Oklahoma is home to a growing collection of North American indigenous languages, concentrating on the languages of Oklahoma and surrounding areas. The Native American Languages (NAL) collection is housed and curated in the Sam Noble Oklahoma Museum of Natural History. The metadata captured for language has more in common with that of the zoology departments (such as modern invertebrates and mammalogy), and even the paleontology department (such as paleobotany and vertebrate paleontology), than with that of the other anthropology departments (archeology and ethnology). We have been updating the NAL database to capture metadata that fits the standards for language set by the Open Language Archives Community (OLAC) and, to a lesser extent, the ISLE Metadata Initiative (IMDI). In addition, the database is designed to adhere closely to the standards for natural history collections set by the Global Biodiversity Information Facility (GBIF) and Biodiversity Information Standards (TDWG). Thus, the NAL database looks very much like a database found in any biodiversity collection. This paper will examine the correspondences and differences in standards and controlled vocabulary between the two systems. It will show how we embedded OLAC standards into a GBIF framework. Finally, the paper will discuss the benefits of treating language metadata as biodiversity metadata, including funders recognizing endangered language documentation and description as scientific investigation, and language collections as on par with other natural history collections.

ELAR: experiences from a social networking archive
David Nathan, ELAR, SOAS University of London

ELAR (the Endangered Languages Archive) launched its current catalogue platform (see http://elar-archive.org) in June 2012 as a "social networking archive". Its design was driven by three factors: research into features judged important for an archive dedicated to endangered languages documentation; lessons learned from pioneer archives in the field; and the broader technological trends and expectations that have evolved since 2005. While ELAR's platform currently has limited social networking features, and has not been operational long enough to enable robust conclusions, we can already detect benefits of the approach for depositors, users and language community members, for example through ELAR's method of implementing access protocols. On the other hand, we have also faced difficulties, and some plans and predictions have not come to fruition. This talk will discuss these pros and cons and describe ELAR's future plans.

What's so special about music? Some methodological aspects of ethnomusicological field data in the Phonogrammarchiv of the Austrian Academy of Sciences in Vienna
Jürgen Schöpf, Phonogrammarchiv, Austrian Academy of Sciences, Wien

In 1899, the then Imperial Academy of Sciences in Vienna founded what was to become the world's first sound archive. Today, the Phonogrammarchiv of the Austrian Academy of Sciences mirrors the Austrian research scene in many disciplines, from bioacoustics and anthropology to ethnomusicology, linguistics, and religious studies, to name just a few. Since 2000 it has also held video recordings. Interdisciplinary from the outset, the Phonogrammarchiv has always treated music and language on an equal footing. Most linguists doing fieldwork in recent decades have recorded music, be it as part of linguistic work that includes vocal music, as a way of documenting culture as a reference alongside language, or because of their personal interest or the interests of their informants. It therefore does not seem out of place to speak about music recording and archiving in linguistic circles, least of all at a time when researchers are encouraged to use archival resources across disciplines. In my presentation I argue that the demands of music in both fieldwork and archiving exceed the requirements of language in important respects. The arguments discussed will comprise technological ones (e.g. size of a sound source, duration of a performance, dynamic range), archival ones (e.g. multi-track), and legal and ethical ones (commodification of music, ethical demands). Since musicological demands in those respects appear to be higher, it is claimed that ethnomusicological approaches may lead the methodological discussions in the choir of disciplines.

Networking digital ethnographic archives
Nick Thieberger, University of Melbourne / PARADISEC

What can a network of digital endangered language archives offer that each archive on its own cannot? There are no doubt a number of areas that could be dealt with at this higher level, including: agreement on what (technical and metadata) standards should be used; perhaps providing accreditation of archives (similar to the five-star system used by the Open Language Archives Community); providing mirrored backup of each other's collections; jointly developing software (e.g., for cataloging, ingestion, metadata creation); and so on. In this presentation I want to focus on methods for locating endangered collections, digitising and accessioning them, and incorporating their metadata into federated search tools. An endangered collection may be a set of records in a deceased estate with no further information, or perhaps a set of described recordings made by a Native Patrol Officer and now held by their children. It is only by a concerted effort that we can locate these collections and obtain the trust of their owners to accession them into an archive. Of course, it is our responsibility as linguistic fieldworkers to create proper collections and ensure they are archived, but my experience over ten years of working with the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) is that linguists still do not take the creation of a research collection seriously. A federation of archives subscribing to a central body could apply collectively for philanthropic funds to support the location and accession of otherwise inaccessible primary materials. Much recorded material is outside of academia, and outside of state or national repositories. The federation could actively seek this material. For material that is in established collections, the federation could provide an online referral service, matching language codes to the URL of the collection, thus bringing it into the search mechanisms of the language archives community.

The Language Archive in the context of emerging research infrastructures
Paul Trilsbeek, The Language Archive, Max Planck Institute for Psycholinguistics, Nijmegen

Archiving and publication of research data is a hot topic in a growing number of research disciplines. It is also seen as very important by decision makers in the research funding landscape, which has resulted in significant sums of money being invested in projects that aim at developing the necessary technical infrastructure for archiving and utilizing research data. Interoperability between repositories and between research tools is a key aspect of these projects. The Language Archive at the Max Planck Institute for Psycholinguistics has been involved in a number of these research infrastructure projects during the past 5 years and has significantly contributed to the conceptualization and development of archiving and research infrastructures for language data and data in the humanities in general.
The European CLARIN project and the national CLARIN Germany and CLARIN Netherlands projects are examples of these, as are the more recent DASISH and EUDAT projects. In this talk I will give an overview of these developments and their relevance for archives of endangered languages.

Speech Resources and Tools at BAS - the World seen by Speech Database Providers
Christoph Draxler, Bavarian Archive for Speech Signals, München

The Bavarian Archive for Speech Signals has been creating, distributing and maintaining speech and multimodal databases since 1995. These databases were mainly collected for speech and video processing technology: speech recognition, speech synthesis, speaker identification, etc. A number of tools have been developed to facilitate or even automate some of the processing steps in creating, annotating and distributing these databases, and some of these tools have become de facto standards in the speech processing community. Increasingly, these databases are being used by phoneticians and linguists for a number of reasons: the databases are well documented, in most cases there are no or low license fees, the databases are large both in terms of speakers and speech phenomena, the recording quality is very high, and the databases have been validated for technical quality against their specifications. As a CLARIN-D centre, the BAS is currently making a subset of its speech and multimodal databases available online to the research community. Additionally, web-based speech processing services are now being implemented that allow students and researchers to perform complex speech processing tasks without the need to install software on their own computers. In this talk I will present an overview of the BAS tools and technology and discuss how they can be used to create speech databases of endangered languages.

Audit and Certification of Digital Repositories
Natascha Schumann, GESIS - Leibniz-Institut für Sozialwissenschaften, Datenarchiv für Sozialwissenschaften, Köln

GESIS Data Archive for the Social Sciences: overview and long-term preservation activities
Audit and certification
EU framework for audit and certification of digital repositories:
1. Support of efforts regarding trusted digital repositories
2. Harmonisation of existing initiatives
3. Memorandum of Understanding with three levels of certification
Brief overview of the three levels/criteria

Why and how we use the OAIS model for cost-effective long-term and medium-term preservation of digital resources
Bernard Bel, Laboratoire Parole et Langage, CNRS - Université d'Aix-Marseille