Automation of metadata processing

CLARIN Conference in Wroclaw, Poland, 15-17 October. Except where otherwise noted, content on this poster is licensed under a Creative Commons Attribution 4.0 International license.

Introduction
Repositories
Introduction to the HZSK-Repository (Daniel Jettka) and the LAUDATIO-Repository (Dennis Zielke)
Open-source technologies
Generalized model of the data ingest process
Role of standardized metadata in the import process
Validation of data
Modelling import formats and data structures
Indexing of metadata

Introduction: HZSK-Repository
The repository is based on the software triad Fedora, Islandora, and Drupal.
It currently contains 19 corpora of transcribed spoken language.
The stored research data includes texts, transcripts, audio and video data, images, metadata, and other data types.
It is connected to the CLARIN-D infrastructure on several levels, e.g. the central services Virtual Language Observatory (for metadata search) and CLARIN Federated Content Search (for searching directly in the content).

Introduction: LAUDATIO-Repository
The repository is an open access environment for persistent storage of historical texts and their annotations.
It currently contains historical corpora from various disciplines, with a total of 2000 texts that contain about two million word forms.
The main focus lies on German historical texts and their linguistic annotations, including all dialects, from time periods ranging from the 9th to the 19th century.

Introduction: LAUDATIO-Repository (technical)
The technical repository infrastructure is based on generalizable software modules, such as the graphical user interface and the data exchange module between the research data and the Fedora REST API.
The metadata search, with indexing and faceting, is based on the Lucene-based technology ElasticSearch.
The imported corpora are stored in their original structure in a permanent and unchangeable version.
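The faceted metadata search mentioned above can be illustrated with a minimal sketch. It only builds a request body in the Elasticsearch query DSL (a `match` query plus one `terms` aggregation per facet) and sends nothing over the network. The field names `corpus_title`, `period`, and `language` are assumptions made for illustration; they are not taken from the LAUDATIO index.

```python
import json


def build_facet_query(text, facet_fields, search_field="corpus_title", size=10):
    """Build a search body: one full-text match plus a terms aggregation per facet.

    All field names are hypothetical; a real index defines its own.
    """
    return {
        "query": {"match": {search_field: text}},
        # Each aggregation returns the most frequent values of one field,
        # which a user interface can render as a facet list.
        "aggs": {
            field: {"terms": {"field": field, "size": size}}
            for field in facet_fields
        },
    }


if __name__ == "__main__":
    body = build_facet_query("herbal", ["period", "language"])
    print(json.dumps(body, indent=2))
```

The resulting dictionary would be serialized to JSON and posted to the search endpoint of an index. Very old Elasticsearch releases used a separate facets API instead of aggregations, so the exact body depends on the installed version.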

LAUDATIO: Open-source technologies used (1)
CakePHP 2.4 as MVC web framework for PHP5: authorization and authentication in the user management via access control list
Fedora 3.6 for data storage: REST API for data exchange
ElasticSearch as search engine: REST API for data exchange; customized and versioned IndexMapping implemented
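One common way to keep an index mapping versioned is to put the version into the index name and keep the mapping body next to it. The sketch below shows that idea only; it is not the LAUDATIO implementation. The helper name `versioned_index`, the base name `corpus`, and all field names are hypothetical.

```python
def versioned_index(base_name, version, properties):
    """Return (index_name, mapping_body) for one version of a mapping.

    Assumption: the version is encoded as a suffix of the index name, so a new
    mapping version gets a fresh index and older versions stay untouched.
    """
    index_name = f"{base_name}_v{version}"
    # Recent Elasticsearch versions expect this shape; older releases nest
    # "properties" under a document type name and use different type names.
    body = {"mappings": {"properties": properties}}
    return index_name, body


if __name__ == "__main__":
    name, body = versioned_index(
        "corpus",
        2,
        {
            "title": {"type": "text"},       # analyzed for full-text search
            "period": {"type": "keyword"},   # exact values, suitable for facets
        },
    )
    print(name, body)
```

Creating the index would then be a single request against the ElasticSearch REST API with `body` as payload.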

LAUDATIO: Open-source technologies used (2)
External PID web service (EPIC API version 2) to assign persistent identifiers
Third-party open-source libraries on GitHub: http://tinyurl.com/lf26u97
Flat design (HTML5, CSS3) (coming soon)

LAUDATIO: Adopted data structure
TEI XML P5
Description of the corpus data structure using the TEI metadata standards
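To make the role of TEI P5 metadata more concrete, here is a minimal sketch that reads the title and authors from a TEI header using Python's standard library. The sample document is invented for illustration and is much smaller than a real corpus header; only the TEI namespace and the `teiHeader/fileDesc/titleStmt` path come from the TEI P5 standard.

```python
import xml.etree.ElementTree as ET

# Namespace of TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample header, used only to demonstrate the lookup.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Example corpus</title>
        <author>Jane Doe</author>
      </titleStmt>
      <publicationStmt><p>Unpublished sample</p></publicationStmt>
      <sourceDesc><p>Born digital</p></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def read_tei_header(xml_text):
    """Extract title and authors from the titleStmt of a TEI header."""
    root = ET.fromstring(xml_text)
    stmt = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt", TEI_NS)
    if stmt is None:
        raise ValueError("no teiHeader/fileDesc/titleStmt found")
    title = stmt.findtext("tei:title", default="", namespaces=TEI_NS).strip()
    authors = [(a.text or "").strip() for a in stmt.findall("tei:author", TEI_NS)]
    return {"title": title, "authors": authors}


if __name__ == "__main__":
    print(read_tei_header(SAMPLE_TEI))
```

A dictionary like this is a convenient intermediate form between the XML metadata and a search index document.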

LAUDATIO: View/Index Mapping (ElasticSearch)

LAUDATIO: Examples
ElasticSearch for indexing: IndexMapping and ViewMapping
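The slide text does not define ViewMapping. One plausible reading, assumed here purely for illustration, is a table that maps indexed field names to the labels shown in the user interface. The sketch below applies such a table to an indexed document; the function name, field names, and labels are all hypothetical.

```python
def render_view(document, view_mapping):
    """Return (label, value) pairs in the order given by the view mapping.

    Fields missing from the document are skipped, and fields not listed in
    the mapping are not displayed.
    """
    return [
        (label, document[field])
        for field, label in view_mapping.items()
        if field in document
    ]


if __name__ == "__main__":
    # Hypothetical indexed document and view mapping.
    doc = {"corpus_title": "Example corpus", "period": "16th century", "internal_id": 7}
    view = {"corpus_title": "Title", "period": "Time period", "language": "Language"}
    for label, value in render_view(doc, view):
        print(f"{label}: {value}")
```

Keeping display labels apart from the index mapping means the presentation can change without reindexing the metadata.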

LAUDATIO: Fedora object model, illustrated with the RIDGES corpus

LAUDATIO: Schema config stored in Fedora
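If a schema configuration is stored as a datastream of a Fedora 3 object, its content can be addressed through the Fedora 3 REST API path `/objects/{pid}/datastreams/{dsID}/content`. The sketch below only builds that URL. The base URL, the PID `laudatio:example`, and the datastream ID `SCHEMA_CONFIG` are assumptions for illustration; the identifiers actually used by LAUDATIO are not given in the slides.

```python
from urllib.parse import quote


def datastream_content_url(base_url, pid, ds_id):
    """Build the Fedora 3 REST URL for the content of one datastream."""
    # The colon in a PID such as "namespace:id" is kept unescaped.
    return (
        f"{base_url.rstrip('/')}/objects/{quote(pid, safe=':')}"
        f"/datastreams/{quote(ds_id)}/content"
    )


if __name__ == "__main__":
    # Hypothetical repository location and identifiers.
    print(datastream_content_url(
        "http://localhost:8080/fedora", "laudatio:example", "SCHEMA_CONFIG"
    ))
```

A GET request to the resulting URL would return the stored configuration, subject to the access rules of the repository.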

If you have questions, please contact us:
Dennis Zielke, Humboldt-Universität zu Berlin, e-mail: dennis.zielke@hu-berlin.de
Daniel Jettka, Hamburg Centre of spoken language corpora, e-mail: daniel.jettka@uni-hamburg.de