LAMUS & LAT Archiving Software
Daan Broeder
The Language Archive, Max Planck Institute for Psycholinguistics
Nijmegen, The Netherlands
The Language Archive - 2011
- MPI for Psycholinguistics research corpora: child language, bilingualism, gesture, sign language, Corpus Spoken Dutch, second language learner corpora, etc.
- Archive for the DOBES project
- Hosting (and inviting) corpora for other projects in need (UNESCO study: 80% of all material is endangered)
  - DBD, NGT, Leiden Univ. language documentation corpora
  - Donated endangered language corpora
  - Eibl-Eibesfeldt human ethology collection
- Maintain a metadata catalog for properly described resources from other institutes
  - BAS, C-ORAL-ROM (Univ. Florence), LR from Lund Univ., INL, other archive partners
  - Copy of CHILDES and TalkBank corpora from CMU
- Mainly annotated audio/video recordings
- 50 TB: 200k MD records, 250k AV resources, 200k annotation files, lexicons, sketch grammars, etc.
History
- Started in 2000 to try to solve the mounting data chaos at the MPI for Psycholinguistics
- First needed proper data descriptions
- Archive software development linked to the IMDI metadata set for Language Resources
- First archive was basically a file system with metadata descriptions and resource files
- Tools operating directly on the files
- A researcher's notebook disk was just as sophisticated
IMDI
- ISLE Metadata Initiative
- Metadata schema for Language Resources
- Developed from 2000 onward, also in several EU projects: ISLE, ECHO, INTERA
- Especially for multi-media/multi-modal recordings
- 3 XML metadata schemas + special profiles for specific communities: Sign Language, SL-acquisition
TLA Archive Organization
- Archiving formats only
- Metadata in XML files
- Relations represented by links
- DBs only as helpers
- Data safety through HSM, pushing data to TLs
[Diagram: TLA ARCHIVE; IMDI metadata nodes (language, expedition, age group, genre, session) link to resources (media file, annot. file)]
Archive Access
[Diagram labels: Browsing/Search/Visualization; WWW browser; TROVE; LARI; IMDI-Browser; local tools (ARBIL, ELAN); HTTP server; resource download; ARCHIVE (metadata, annotations, media files); LAMUS; AMS; PID service; upload data; LOCAL DATA]
- All resources accessible by HTTP if authorized
- All web-apps can be configured to use either Shibboleth or a local LDAP for authentication
Archive Administration
[Diagram labels: API; IMDI search; IMDI browser; content search; AMS; amsdb; IMDI Lucene index; imdidb; corpus structure; annexdb; lamusdb; crawler; archive manager; archive; LAMUS API]
Why user-managed deposition?
- Increasing costs
  - New, cheaper technologies for recording, digitization and storage cause a huge increase in data quantities.
- Using depositor knowledge
  - The researcher/depositor knows where to put the data in the logical structure (catalogue) of the archive.
  - Communication with archive managers is overhead.
- Offer remote archiving services
  - Support distributed projects
- Stricter checking
  - Make checks explicit
  - Archive managers have short contracts; knowledge tends to get lost.
- Maximizing deposition
  - 80 percent of all recordings are in danger (UNESCO report)
  - We want to open our archive to external depositors
  - But we cannot afford extra workload for archive managers
LAMUS
LAMUS is a web application that allows:
- Uploading and naming individual resources (media, annotations, information files)
- Specifying limited metadata and mutual relations for and between resources
- Creating relevant linguistic groupings for the data (subcorpora)
LAMUS will:
- Carry out checks for consistency and coherence: check for accepted formats etc. (configurable list)
- Update databases and indexes
- Issue PIDs for the new resources and metadata records
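The accepted-format check against a configurable list could look roughly like the sketch below. This is only an illustration, not LAMUS code: the extension list and the function name `check_upload` are assumptions, and a real check would likely inspect file content rather than just the file name.

```python
from pathlib import PurePosixPath

# Hypothetical configurable list of accepted archiving formats (by extension).
ACCEPTED_EXTENSIONS = {".wav", ".mpg", ".eaf", ".imdi", ".txt", ".pdf"}


def check_upload(filenames, accepted=ACCEPTED_EXTENSIONS):
    """Split uploaded file names into (accepted, rejected) lists."""
    accepted_files, rejected_files = [], []
    for name in filenames:
        # Compare extensions case-insensitively.
        suffix = PurePosixPath(name).suffix.lower()
        if suffix in accepted:
            accepted_files.append(name)
        else:
            rejected_files.append(name)
    return accepted_files, rejected_files
```

Keeping the list as a parameter mirrors the "configurable list" idea: an archive could pass its own set without changing the check itself.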
[Diagram labels: local disk; WORKSPACE; ARCHIVE]
Corpus check-out/check-in cycle
[Diagram: cycle around The Archive]
- Check out (using Arbil)
- Modify/add/.. with local tools: Arbil, ELAN, Shoebox
- Check in to the workspace using LAMUS
- Add to original after consistency check (versioning)
TLA Versioning of resources
TLA versioning policy:
- Nothing actually gets deleted
- Users can delete resources; these are removed from the visible collection (corpus tree) but remain in the archive
- Users can update (replace) existing resources
  - The new version gets a new PID
  - The old version is shelved but keeps its PID
  - Access to old versions is managed by the owner
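The versioning policy can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the TLA implementation: the class `Archive`, its fields, and the PID format `hdl:example/N` are all hypothetical.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Archive:
    visible: dict = field(default_factory=dict)  # node name -> current PID
    store: dict = field(default_factory=dict)    # PID -> content, never deleted
    shelved: dict = field(default_factory=dict)  # node name -> list of old PIDs
    _counter: itertools.count = field(default_factory=lambda: itertools.count(1))

    def _issue_pid(self):
        # Hypothetical PID format, for illustration only.
        return f"hdl:example/{next(self._counter)}"

    def add(self, name, content):
        pid = self._issue_pid()
        self.store[pid] = content
        self.visible[name] = pid
        return pid

    def replace(self, name, content):
        # The old version is shelved and keeps its PID; the new one gets a new PID.
        self.shelved.setdefault(name, []).append(self.visible[name])
        return self.add(name, content)

    def delete(self, name):
        # Removed from the visible corpus tree only; the content stays in the store.
        self.shelved.setdefault(name, []).append(self.visible.pop(name))
```

For example, after `add`, `replace`, and `delete` on the same name, both PIDs still resolve in `store` even though the name no longer appears in `visible`.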
AMS: Access Management System
[Diagram: corpus tree with Rule 1, Rule 2 and Rule 3 attached to nodes; "Sign academic license"]
- User role administration: archive manager, domain curator, domain manager, domain editor
- Set a required license
- Set access rules per media type: annotations, images, audio, video, info
- A rule grants or denies access to a user/group for a type of data
- Special groups: all, registered users
- Rules have priority
- Inheritance of rules by descendant nodes
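How rule priority and inheritance might combine can be sketched as follows. The slide does not say how AMS resolves conflicts, so the resolution order here is an assumption: all rules on the path from a node up to the root are candidates, the highest priority number wins, ties go to the rule closest to the node, and access is denied when no rule matches. The names `Rule`, `Node`, and `is_allowed` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    group: str       # "all", "registered", or a named group
    media_type: str  # "annotations", "images", "audio", "video", "info"
    allow: bool
    priority: int    # assumption: a higher number wins


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    rules: list = field(default_factory=list)


def is_allowed(node, user_groups, media_type):
    """Resolve access for a node, inheriting rules from its ancestors."""
    groups = set(user_groups) | {"all"}  # every user belongs to "all"
    best_key, best_rule = None, None
    depth = 0
    while node is not None:
        for rule in node.rules:
            if rule.media_type == media_type and rule.group in groups:
                key = (rule.priority, -depth)  # nearer nodes win ties
                if best_key is None or key > best_key:
                    best_key, best_rule = key, rule
        node = node.parent
        depth += 1
    return best_rule.allow if best_rule else False  # default: deny
```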
IMDI-Browser & Metadata Search
- Browse the hierarchy of corpora
- Inspect metadata records
- Create bookmarks
- Show PIDs, URLs for resources and metadata
- Make resource access requests
- Search the metadata: simple keyword, complex queries
[Screenshot: IMDI-Browser showing resources]
IMDI-Browser as a jump board
http://corpus1.mpi.nl/ds/imdi_browser?openpath=mpi541199%23
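A link of this shape can be generated from a node identifier. The sketch below assumes the `%23` in the URL is a percent-encoded `#` that belongs to the node identifier; the helper name `browser_link` is hypothetical, and only the base URL and the `openpath` parameter come from the slide.

```python
from urllib.parse import urlencode

BASE = "http://corpus1.mpi.nl/ds/imdi_browser"


def browser_link(node_id):
    """Build an IMDI-Browser link that opens the tree at the given node."""
    # urlencode percent-encodes reserved characters such as '#'.
    return f"{BASE}?{urlencode({'openpath': node_id})}"
```

Calling `browser_link("mpi541199#")` reproduces the URL shown on the slide.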
Publishing resources
Regional Archives Initiative
- Regional Archives Initiative: cooperation of TLA/MPI-PL with other organizations interested in EL archiving
- They use the TLA LAT archiving software
- Encourage local resource collecting & archiving
- A network of South American archives has been established, and contacts with LARA were made
Data Synchronization I
[Diagram: logical synchronization between archive trees]
Data Synchronization II
[Diagram labels: HTTP server; OSIX; archive; API; LAMUS]
- OSIX: complex logic to compare corpus trees and determine:
  - what is new
  - what to replace
  - what to add
  - what to delete
- In cooperation with CMU, OSIX is used to copy CHILDES and TalkBank corpora into our archive.
- CMU generates IMDI records on the fly from their DBs
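Comparing two corpus trees to decide what to add, replace, or delete can be sketched in a few lines. This is a simplified illustration, not the OSIX logic, which the slide describes as complex: each tree is assumed to be flattened into a mapping from node path to a checksum, and the function name `plan_sync` is hypothetical.

```python
def plan_sync(source, target):
    """Compare two flattened corpus trees (path -> checksum).

    Returns the paths to add, replace, and delete so that the target
    ends up matching the source.
    """
    to_add = sorted(p for p in source if p not in target)
    to_replace = sorted(
        p for p in source if p in target and source[p] != target[p]
    )
    to_delete = sorted(p for p in target if p not in source)
    return {"add": to_add, "replace": to_replace, "delete": to_delete}
```

Under the versioning policy stated earlier in the deck, a "delete" in such a plan would presumably only remove the node from the visible tree rather than erase data.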
Technical Info
- Java web applications running inside the Tomcat servlet container
- PostgreSQL DBMS
- Platform: Linux
- Web-app frameworks: JSP, Applets, JSF, FLEX, Wicket
- Works with most web browsers (Explorer, Firefox, Opera, Safari)
LAMUS & LAT Future
- TLA is part of CLARIN and is promoting CMDI, so we are planning the transition from LAMUS IMDI to LAMUS CMDI
- We analyzed our set-up and still like the LAT fundamentals, e.g. file-based, modularity
- But we will also alleviate some current problems and inconveniences:
  - Limited metadata editing in LAMUS
  - Insufficient provenance tracking of resources
  - Better handling of the download/modify/upload cycle
  - Better integration with other (LAT) archives and infrastructures
THANK YOU FOR YOUR ATTENTION