Uniting Diversity: The CLARIN Input Formats CMDI and TCF (Die Vielfalt vereinen: Die CLARIN-Eingangsformate CMDI und TCF)




Uniting Diversity: The CLARIN Input Formats CMDI and TCF
Susanne Haaf & Bryan Jurish, Deutsches Textarchiv

1. The Metadata Format CMDI

Metadata? Metadata format? ... and more: CMDI (Component Metadata Infrastructure)

CMDI? What's that?
- Component Metadata Infrastructure
- Metadata components (e.g. author, title, license, ...) are combined into metadata profiles (e.g. DTA Basisformat teiheader)
- Create new components/profiles or re-use those that already exist
- One basic CMDI structure that all resources have in common
- ISOcat data categories define the semantics of components

Why CMDI?
- CMDI is not a format per se but rather a framework
- Hence: I don't really have to decide on a format; I define the semantics of my metadata categories myself
- Plus: in CMDI you can describe any resource you like:
  - collections/corpora, single texts
  - historical sources, recent sources
  - sound (spoken, music), film, text, multimedia
  - lexical resources (lexica & dictionaries, treebanks, ...)
  - tools, services, applications
- These descriptions can then be represented as a whole
- Hence: get all there is in CLARIN through one portal: http://catalog.clarin.eu/vlo/?2

CMDI Basic Structure (Example DTA)

    <?xml version="1.0" encoding="utf-8"?>
    <CMD
        xsi:schemaLocation="http://www.clarin.eu/cmd/ http://media.dwds.de/dta/media/schema/cmdi-header.xsd"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns="http://www.clarin.eu/cmd/"
        CMDVersion="1.1">
      <Header> [...] </Header>
      <Resources> [...] </Resources>
      <Components> [...] </Components>
    </CMD>

- Schema specification (here: the DTA-CMDI profile)
- Namespace information
- Version information (N.b. a new version, CMDI 1.2, is coming up)

CMDI Basic Structure (Example DTA)

    <?xml version="1.0" encoding="utf-8"?>
    <CMD>
      <Header>
        <MdCreator>Deutsches Textarchiv</MdCreator>
        <MdCreationDate>2014-11-14</MdCreationDate>
        <MdSelfLink>
          http://www.deutschestextarchiv.de/api/cmdi/altmann_elementarorganismen_1890
        </MdSelfLink>
        <MdProfile>
          clarin.eu:cr1:p_1381926654438
        </MdProfile>
        <MdCollectionDisplayName>
          Deutsches Textarchiv (1600–1900)
        </MdCollectionDisplayName>
      </Header>
      <Resources>[...]</Resources>
      <Components>[...]</Components>
    </CMD>

Header: meta-metadata (metadata about the metadata record itself)
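The header fields can be read with any namespace-aware XML parser. The following is a minimal sketch in Python, assuming the record declares the CMDI 1.1 namespace `http://www.clarin.eu/cmd/` on its root element; the function name `read_header` and the shortened sample record are illustrative, not part of any CLARIN tool.

```python
import xml.etree.ElementTree as ET

# CMDI 1.1 namespace, as declared on the <CMD> root element.
CMD_NS = {"cmd": "http://www.clarin.eu/cmd/"}

# Shortened sample record (illustrative; Resources/Components left empty).
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<CMD xmlns="http://www.clarin.eu/cmd/" CMDVersion="1.1">
  <Header>
    <MdCreator>Deutsches Textarchiv</MdCreator>
    <MdCreationDate>2014-11-14</MdCreationDate>
    <MdSelfLink>
      http://www.deutschestextarchiv.de/api/cmdi/altmann_elementarorganismen_1890
    </MdSelfLink>
    <MdProfile>clarin.eu:cr1:p_1381926654438</MdProfile>
  </Header>
  <Resources/>
  <Components/>
</CMD>"""


def read_header(cmdi_xml: str) -> dict:
    """Return the <Header> children as a {local-name: text} dictionary."""
    root = ET.fromstring(cmdi_xml.encode("utf-8"))
    header = root.find("cmd:Header", CMD_NS)
    if header is None:
        raise ValueError("no <Header> element found")
    # Strip the "{namespace}" prefix ElementTree puts in front of tag names.
    return {
        child.tag.split("}", 1)[1]: (child.text or "").strip()
        for child in header
    }


if __name__ == "__main__":
    for key, value in read_header(SAMPLE).items():
        print(f"{key}: {value}")
```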

CMDI Basic Structure (Example DTA)

    <?xml version="1.0" encoding="utf-8"?>
    <CMD>
      <Header>[...]</Header>
      <Resources>
        <ResourceProxyList>
          <ResourceProxy id="dta-altmann_elementarorganismen_1890.landing_page">
            <ResourceType>LandingPage</ResourceType>
            <ResourceRef>
              http://www.deutschestextarchiv.de/altmann_elementarorganismen_1890
            </ResourceRef>
          </ResourceProxy>
        </ResourceProxyList>
        <JournalFileProxyList>[...]</JournalFileProxyList>
        <ResourceRelationList>[...]</ResourceRelationList>
        <IsPartOfList>
          <IsPartOf>[...]</IsPartOf>
        </IsPartOfList>
      </Resources>
      <Components>[...]</Components>
    </CMD>

Resources: the resources described by the record, plus resources related to them
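As a rough illustration of how the resource section might be consumed, the sketch below lists every `ResourceProxy` with its id, type, and reference. It assumes the CMDI 1.1 namespace on the root element; `list_resource_proxies` is a hypothetical helper name and the sample is a trimmed copy of the record above.

```python
import xml.etree.ElementTree as ET

CMD_NS = {"cmd": "http://www.clarin.eu/cmd/"}

# Trimmed sample record (illustrative).
SAMPLE = """<CMD xmlns="http://www.clarin.eu/cmd/" CMDVersion="1.1">
  <Header/>
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="dta-altmann_elementarorganismen_1890.landing_page">
        <ResourceType>LandingPage</ResourceType>
        <ResourceRef>
          http://www.deutschestextarchiv.de/altmann_elementarorganismen_1890
        </ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
  <Components/>
</CMD>"""


def list_resource_proxies(cmdi_xml: str) -> list:
    """Return (id, type, ref) tuples for every <ResourceProxy>."""
    root = ET.fromstring(cmdi_xml)
    path = "cmd:Resources/cmd:ResourceProxyList/cmd:ResourceProxy"
    proxies = []
    for proxy in root.iterfind(path, CMD_NS):
        rtype = proxy.findtext("cmd:ResourceType", default="", namespaces=CMD_NS)
        ref = proxy.findtext("cmd:ResourceRef", default="", namespaces=CMD_NS)
        proxies.append((proxy.get("id"), rtype.strip(), ref.strip()))
    return proxies


if __name__ == "__main__":
    for proxy_id, rtype, ref in list_resource_proxies(SAMPLE):
        print(proxy_id, rtype, ref)
```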

CMDI Components (Example DTA)

    <?xml version="1.0" encoding="utf-8"?>
    <CMD>
      <Header>[...]</Header>
      <Resources>[...]</Resources>
      <Components>
        <teiheader>
          <filedesc>
            <titlestmt>
              <title type="main">
                Die Elementarorganismen und ihre Beziehungen zu den Zellen
              </title>
              <author>[...]</author>
            </titlestmt>
            [...]
            <publicationstmt>[including availability]</publicationstmt>
            <sourcedesc>
              [including depository of the physical source]
            </sourcedesc>
          </filedesc>
          <encodingdesc>[...]</encodingdesc>
          <profiledesc>[including genre]</profiledesc>
        </teiheader>
      </Components>
    </CMD>

Components: the actual metadata of the resource described

The world of Components: Components http://catalog.clarin.eu/ds/componentregistry

The world of Components: ISOcat
- DC-2978
- Data Element Name: Person
- PID: http://www.isocat.org/datcat/dc-2978
- Definition: the name of a person
http://catalog.clarin.eu/ds/componentregistry

The world of Components: Profiles http://catalog.clarin.eu/ds/componentregistry

The world of Components
- Think of what you need
- Put together components
- Create your own CMDI profile
- Or: re-use something which is already there

Questions about CMDI?
- Helpdesk (Timm Lehmberg's talk)
- CLARIN Centers
- CLARIN User Guide

CMDI Components (Ex. WebLicht Webservices - CAB)

    <?xml version="1.0" encoding="utf-8"?>
    <CMD>[...]
      <Header>[...]</Header>
      <Resources>[...]</Resources>
      <Components>
        <WebLichtWebService>
          <Service>
            <Name>CAB orthographic canonicalizer</Name>
            <Description>
              orthographic normalization for historical German
            </Description>
            <TypeOfWebservice>RESTfull</TypeOfWebservice>
            <url>http://kaskade.dwds.de/demo/cab/query?fmt=tcf-orth</url>
            <LifeCycleStatus>production</LifeCycleStatus>
            <PublicationDate>2013-07-12T07:34:20Z</PublicationDate>
            <LastUpdate>2013-07-12T07:34:20Z</LastUpdate>
            <ServiceDescriptionLocation ref="s056"/>
            <Contact>
              <Email>jurish@bbaw.de</Email>
            </Contact>
            <Creation>[Information about creation and creators]</Creation>

Components: the actual metadata of the resource described

CMDI Components (Ex. WebLicht Webservices - CAB)

            <Operations><Operation>
              <Name>Default</Name>
              <Input><ParameterGroup>
                <Name>Input Parameters</Name>
                <Parameters>
                  <Parameter>
                    <Name>tokens</Name>
                    <AllowManualSelectionFallback>false</AllowManualSelectionFallback>
                  </Parameter>
                  <Parameter>
                    <Name>sentences</Name>
                    <AllowManualSelectionFallback>false</AllowManualSelectionFallback>
                  </Parameter>[...]
                </Parameters>[...]
              </ParameterGroup></Input>
              <Output><ParameterGroup>
                <Name>Output Parameters</Name>
                <ReplacesInput>false</ReplacesInput>
                <Parameters>
                  <Parameter>
                    <Name>orthography</Name>
                  </Parameter>
                </Parameters>
              </ParameterGroup></Output>
            </Operation></Operations>
          </Service>
        </WebLichtWebService>
      </Components>
    </CMD>

Components: the actual metadata of the resource described
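The input and output parameters declared here are what makes automatic chaining possible: a service is applicable once every layer it needs is present. The sketch below shows one way such a check could look in Python. It works on an un-namespaced fragment for brevity, and the helper names `io_layers` and `can_run` are hypothetical; how WebLicht itself evaluates these descriptions is not covered by the slides.

```python
import xml.etree.ElementTree as ET

# Un-namespaced fragment of the service description (illustrative).
SAMPLE = """<Operations><Operation>
  <Name>Default</Name>
  <Input><ParameterGroup>
    <Name>Input Parameters</Name>
    <Parameters>
      <Parameter><Name>tokens</Name></Parameter>
      <Parameter><Name>sentences</Name></Parameter>
    </Parameters>
  </ParameterGroup></Input>
  <Output><ParameterGroup>
    <Name>Output Parameters</Name>
    <Parameters>
      <Parameter><Name>orthography</Name></Parameter>
    </Parameters>
  </ParameterGroup></Output>
</Operation></Operations>"""


def io_layers(operations_xml: str) -> tuple:
    """Return (required input layers, produced output layers)."""
    root = ET.fromstring(operations_xml)

    def names(direction: str) -> list:
        path = f".//{direction}/ParameterGroup/Parameters/Parameter"
        return [p.findtext("Name") for p in root.findall(path)]

    return names("Input"), names("Output")


def can_run(available_layers, required_layers) -> bool:
    """A service is applicable if all layers it requires are available."""
    return set(required_layers) <= set(available_layers)


if __name__ == "__main__":
    required, produced = io_layers(SAMPLE)
    print("requires:", required, "produces:", produced)
    print("runnable on text only:", can_run({"text"}, required))
    print("runnable after tokenization:",
          can_run({"text", "tokens", "sentences"}, required))
```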

2. The Text Corpus Format TCF

TCF: Text Corpus Format
http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/the_tcf_format

What is it?
- XML stand-off format for linguistic annotations
- Developed for WebLicht in the context of CLARIN-D
- Compatibility:
  - LAF (Linguistic Annotation Format / ISO 24612:2012)
  - GrAF (Graph Annotation Format / Ide & Suderman, 2007)

What is it good for?
- Facilitates annotation-tool interoperability & orchestration
- Lingua franca for web-service execution ("tool chains")
- Explicit specification for concrete annotation tasks
- Incremental processing: annotation layers, e.g. tokens, sentences, PoS-tags, lemmata, parse trees, ...

TCF + WebLicht: Example Chain
- All tools use the same I/O format (TCF)
- Each tool adds one or more annotation layer(s)
- Existing layers are passed through unchanged: information from the input document is preserved

Some TCF layers: text, tokens, sentences, POStags, lemmas, parsing, depparsing, morphology, namedentities, references, matches, orthography, ... and more!
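The chaining principle can be sketched without any XML at all. In the toy Python model below, a document is a dictionary of layers and each tool returns a copy with its own layer added; the two tools (`tokenizer`, `lowercaser`) are trivial stand-ins invented for this illustration and do not reflect how real WebLicht services work internally.

```python
def tokenizer(layers: dict) -> dict:
    """Toy tool: adds a 'tokens' layer by splitting on whitespace."""
    out = dict(layers)
    out["tokens"] = layers["text"].split()
    return out


def lowercaser(layers: dict) -> dict:
    """Toy stand-in for a normalizer: adds an 'orthography' layer."""
    out = dict(layers)
    out["orthography"] = [tok.lower() for tok in layers["tokens"]]
    return out


def run_chain(document: dict, tools: list) -> dict:
    """Apply tools in order, checking that existing layers pass through."""
    for tool in tools:
        before = dict(document)
        document = tool(document)
        for name, content in before.items():
            if document.get(name) != content:
                raise ValueError(f"{tool.__name__} modified layer {name!r}")
    return document


if __name__ == "__main__":
    result = run_chain({"text": "EJn zamer Elephant"}, [tokenizer, lowercaser])
    for name, content in result.items():
        print(name, "->", content)
```

The pass-through check in `run_chain` mirrors the rule that a tool may add layers but must not alter what earlier tools produced.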

TCF Example (1): Input

Input: simple XML text

    <text>
      EJn zamer Elephant gilt ohngefähr zweyhundert Thaler.
      Ceterum censeo Carthaginem esse delendam.
    </text>

Converter: XML → TCF (text layer)
http://kaskade.dwds.de/demo/cab/file?a=null&fmt=tei&ofmt=tcf-text
- XML serialization
- Designed for DTABf

TCF Example (2): Text Layer

Output: TCF superstructure and text layer

    <D-Spin xmlns=... version="0.4">
      <TextCorpus xmlns=... lang="de">
        <text>
          EJn zamer Elephant gilt ohngefähr zweyhundert Thaler.
          Ceterum censeo Carthaginem esse delendam.
        </text>
      </TextCorpus>
    </D-Spin>

- version: TCF version
- lang: document language
- text: raw (serialized) document text
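To make the superstructure concrete, the following Python sketch wraps a plain string in a `D-Spin`/`TextCorpus`/`text` skeleton. It is a simplification: the `xmlns` declarations, which are elided in the slide, are left out here as well, so the output is not schema-valid TCF. The function name `make_text_layer` is hypothetical.

```python
import xml.etree.ElementTree as ET


def make_text_layer(text: str, lang: str = "de") -> ET.Element:
    """Build a TCF-like skeleton holding only the text layer.

    Simplification: the xmlns declarations that real TCF documents
    carry are omitted here.
    """
    root = ET.Element("D-Spin", {"version": "0.4"})
    corpus = ET.SubElement(root, "TextCorpus", {"lang": lang})
    ET.SubElement(corpus, "text").text = text
    return root


if __name__ == "__main__":
    doc = make_text_layer("Ceterum censeo Carthaginem esse delendam.", lang="la")
    print(ET.tostring(doc, encoding="unicode"))
```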

TCF Example (3): Tokenization
http://kaskade.dwds.de/demo/cab/file?a=null&fmt=tei&ofmt=tcf-tok

    <D-Spin... version="0.4">
      <TextCorpus... lang="de">
        <text>...</text>
        <tokens>
          <token ID="w1">EJn</token>
          <token ID="w2">zamer</token>
          <token ID="w3">Elephant</token>
          ...
        </tokens>
        <sentences>
          <sentence ID="s1" tokenIDs="w1 w2 w3 w4 w5 w6 w7 w8"/>
          <sentence ID="s2" tokenIDs="w9 wa wb wc wd we"/>
        </sentences>
      </TextCorpus>
    </D-Spin>

- Tokenization: tokens and sentences layers
- Unique IDs for inter-layer cross-references
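The ID-based cross-references can be resolved with a simple lookup table. The Python sketch below maps each sentence to its token strings; it uses a shortened, hypothetical sample without namespaces, and the attribute spelling `tokenIDs` is an assumption about the TCF schema that may need adjusting for other data.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample (namespaces omitted for brevity).
SAMPLE = """<TextCorpus lang="de">
  <tokens>
    <token ID="w1">EJn</token>
    <token ID="w2">zamer</token>
    <token ID="w3">Elephant</token>
    <token ID="w4">.</token>
    <token ID="w5">Ceterum</token>
    <token ID="w6">censeo</token>
    <token ID="w7">.</token>
  </tokens>
  <sentences>
    <sentence ID="s1" tokenIDs="w1 w2 w3 w4"/>
    <sentence ID="s2" tokenIDs="w5 w6 w7"/>
  </sentences>
</TextCorpus>"""


def sentences_as_tokens(corpus_xml: str) -> dict:
    """Map each sentence ID to the list of token strings it refers to."""
    corpus = ET.fromstring(corpus_xml)
    # Lookup table: token ID -> surface form.
    by_id = {tok.get("ID"): tok.text for tok in corpus.iterfind("tokens/token")}
    return {
        sent.get("ID"): [by_id[tid] for tid in sent.get("tokenIDs", "").split()]
        for sent in corpus.iterfind("sentences/sentence")
    }


if __name__ == "__main__":
    for sent_id, words in sentences_as_tokens(SAMPLE).items():
        print(sent_id, " ".join(words))
```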

TCF Example (4): (modern) Orthography
http://kaskade.dwds.de/demo/cab/file?fmt=tei&ofmt=tcf-orth

    <D-Spin... version="0.4">
      <TextCorpus... lang="de">
        <tokens>
          <token ID="w1">EJn</token>
          <token ID="w2">zamer</token>
          <token ID="w3">Elephant</token>
          ...
        </tokens>
        ...
        <orthography>
          <correction tokenIDs="w1" operation="replace">ein</correction>
          <correction tokenIDs="w2"...="replace">zahmer</correction>
          <correction tokenIDs="w3"...="replace">elefant</correction>
          ...
        </orthography>
      </TextCorpus>
    </D-Spin>

- Orthographic normalization: orthography layer
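Because the orthography layer is stand-off, the original tokens stay untouched and a consumer decides whether to apply the corrections. A minimal Python sketch of that step follows; it assumes un-namespaced input, the attribute spelling `tokenIDs`, and that only `operation="replace"` corrections matter. The token `w4` in the sample has no correction and is included to show the fallback; `normalized_tokens` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Illustrative sample (namespaces omitted; w4 deliberately has no correction).
SAMPLE = """<TextCorpus lang="de">
  <tokens>
    <token ID="w1">EJn</token>
    <token ID="w2">zamer</token>
    <token ID="w3">Elephant</token>
    <token ID="w4">gilt</token>
  </tokens>
  <orthography>
    <correction tokenIDs="w1" operation="replace">ein</correction>
    <correction tokenIDs="w2" operation="replace">zahmer</correction>
    <correction tokenIDs="w3" operation="replace">elefant</correction>
  </orthography>
</TextCorpus>"""


def normalized_tokens(corpus_xml: str) -> list:
    """Return token strings with 'replace' corrections applied."""
    corpus = ET.fromstring(corpus_xml)
    corrections = {}
    for corr in corpus.iterfind("orthography/correction"):
        if corr.get("operation") != "replace":
            continue  # other operations are out of scope for this sketch
        for tid in corr.get("tokenIDs", "").split():
            corrections[tid] = corr.text
    # Fall back to the original form when no correction exists.
    return [
        corrections.get(tok.get("ID"), tok.text)
        for tok in corpus.iterfind("tokens/token")
    ]


if __name__ == "__main__":
    print(" ".join(normalized_tokens(SAMPLE)))
```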

TCF Example (5): Part-of-Speech Tags
http://kaskade.dwds.de/demo/cab/file?fmt=tei&ofmt=tcf

    <D-Spin... version="0.4">
      <TextCorpus... lang="de">
        <tokens>
          <token ID="w1">EJn</token>
          <token ID="w2">zamer</token>
          <token ID="w3">Elephant</token>
          ...
        </tokens>
        ...
        <POStags tagset="stts">
          <tag tokenIDs="w1">art</tag>
          <tag tokenIDs="w2">adja</tag>
          <tag tokenIDs="w3">nn</tag>
          ...
        </POStags>
      </TextCorpus>
    </D-Spin>

- PoS-tagging: POStags layer (+ tagset attribute)

TCF Example (6): (modern) Lemmata
http://kaskade.dwds.de/demo/cab/file?fmt=tei&ofmt=tcf

    <D-Spin... version="0.4">
      <TextCorpus... lang="de">
        <tokens>
          <token ID="w1">EJn</token>
          <token ID="w2">zamer</token>
          <token ID="w3">Elephant</token>
          ...
        </tokens>
        ...
        <lemmas>
          <lemma tokenIDs="w1">eine</lemma>
          <lemma tokenIDs="w2">zahm</lemma>
          <lemma tokenIDs="w3">elefant</lemma>
          ...
        </lemmas>
      </TextCorpus>
    </D-Spin>

- Lemmatization: lemmas layer
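Since every annotation layer points back at token IDs, the layers can be joined into one row per token. The Python sketch below does this for the POStags and lemmas layers. It assumes un-namespaced input, the attribute spelling `tokenIDs`, and that each `tag`/`lemma` element refers to exactly one token; `annotation_rows` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Illustrative sample combining both layers (namespaces omitted).
SAMPLE = """<TextCorpus lang="de">
  <tokens>
    <token ID="w1">EJn</token>
    <token ID="w2">zamer</token>
    <token ID="w3">Elephant</token>
  </tokens>
  <POStags tagset="stts">
    <tag tokenIDs="w1">art</tag>
    <tag tokenIDs="w2">adja</tag>
    <tag tokenIDs="w3">nn</tag>
  </POStags>
  <lemmas>
    <lemma tokenIDs="w1">eine</lemma>
    <lemma tokenIDs="w2">zahm</lemma>
    <lemma tokenIDs="w3">elefant</lemma>
  </lemmas>
</TextCorpus>"""


def annotation_rows(corpus_xml: str) -> list:
    """Return (token ID, form, PoS tag, lemma) tuples, one per token."""
    corpus = ET.fromstring(corpus_xml)

    def layer(path: str) -> dict:
        # Assumes one token ID per annotation element.
        return {el.get("tokenIDs"): el.text for el in corpus.iterfind(path)}

    pos = layer("POStags/tag")
    lemma = layer("lemmas/lemma")
    rows = []
    for tok in corpus.iterfind("tokens/token"):
        tid = tok.get("ID")
        rows.append((tid, tok.text, pos.get(tid), lemma.get(tid)))
    return rows


if __name__ == "__main__":
    for row in annotation_rows(SAMPLE):
        print("\t".join(str(field) for field in row))
```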

WebLicht
- Further processing of TCF data within CLARIN's WebLicht
- cf. Thorsten Trippel's talk
- http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/main_page