Uniting Diversity: The CLARIN Input Formats CMDI and TCF

1 Uniting Diversity: The CLARIN Input Formats CMDI and TCF
  Susanne Haaf & Bryan Jurish, Deutsches Textarchiv

2 1. The Metadata Format CMDI

3 Metadata? Metadata Format? and more

5 Metadata? Metadata Format? and more
  CMDI (Component Metadata Infrastructure)

6 CMDI? What's that?
  Component Metadata Infrastructure
  Metadata components (e.g. author, title, license, ...) are combined into metadata profiles (e.g. DTA Basisformat teiHeader)
  Create new components/profiles or re-use those that already exist
  One basic CMDI structure that all resources have in common
  ISOcat data categories define the semantics of components

7 Why CMDI?
  CMDI is not a format per se but rather a framework
  Hence: I don't really have to decide on a format
  I define the semantics of my metadata categories myself
  Plus: in CMDI you can describe any resource you like:
    collections/corpora, single texts
    historical sources, recent sources
    sound (spoken, music), film, text, multimedia
    lexical resources (lexica & dictionaries, treebanks, ...)
    tools, services, applications
  These descriptions can then be represented as a whole
  Hence: get everything there is in CLARIN through one portal

8 CMDI Basic Structure (Example DTA)
  <?xml version="1.0" encoding="utf-8"?>
  <CMD xsi:schemaLocation="[...]" xmlns:xsi="[...]" xmlns="[...]" CMDVersion="1.1">
    <Header> [...] </Header>
    <Resources> [...] </Resources>
    <Components> [...] </Components>
  </CMD>
  Annotations on the <CMD> start tag: schema specification (here: DTA-CMDI profile), namespace information, version information (N.B. new version CMDI 1.2 coming up)

9 CMDI Basic Structure (Example DTA)
  <?xml version="1.0" encoding="utf-8"?>
  <CMD>
    <Header>
      <MdCreator>Deutsches Textarchiv</MdCreator>
      <MdCreationDate>[...]</MdCreationDate>
      <MdSelfLink>[...]</MdSelfLink>
      <MdProfile>clarin.eu:cr1:p_[...]</MdProfile>
      <MdCollectionDisplayName>Deutsches Textarchiv ([...])</MdCollectionDisplayName>
    </Header>
    <Resources>[...]</Resources>
    <Components>[...]</Components>
  </CMD>
  Annotation: Header for meta-metadata

10 CMDI Basic Structure (Example DTA)
  <?xml version="1.0" encoding="utf-8"?>
  <CMD>
    <Header>[...]</Header>
    <Resources>
      <ResourceProxyList>
        <ResourceProxy id="dta-altmann_elementarorganismen_1890.landing_page">
          <ResourceType>LandingPage</ResourceType>
          <ResourceRef>[...]</ResourceRef>
        </ResourceProxy>
      </ResourceProxyList>
      <JournalFileProxyList>[...]</JournalFileProxyList>
      <ResourceRelationList>[...]</ResourceRelationList>
      <IsPartOfList>
        <IsPartOf>[...]</IsPartOf>
      </IsPartOfList>
    </Resources>
    <Components>[...]</Components>
  </CMD>
  Annotation: Resources described and resources somehow related to them

11 CMDI Components (Example DTA)
  <?xml version="1.0" encoding="utf-8"?>
  <CMD>
    <Header>[...]</Header>
    <Resources>[...]</Resources>
    <Components>
      <teiHeader>
        <fileDesc>
          <titleStmt>
            <title type="main">Die Elementarorganismen und ihre Beziehungen zu den Zellen</title>
            <author>[...]</author>
          </titleStmt>
          [...]
          <publicationStmt>[including availability]</publicationStmt>
          <sourceDesc>[including depository of the physical source]</sourceDesc>
        </fileDesc>
        <encodingDesc>[...]</encodingDesc>
        <profileDesc>[including genre]</profileDesc>
      </teiHeader>
    </Components>
  </CMD>
  Annotation: Components: actual metadata of the resource described
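The three-part layout shown on the preceding slides (Header, Resources, Components) can be inspected with a few lines of Python. This is only a minimal sketch: the record below is a hypothetical, namespace-free skeleton modelled on the slide excerpts, and the function name `summarize_cmdi` is an assumption for illustration. Real CMDI records declare namespaces and a profile schema, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free CMDI skeleton modelled on the slides.
# Real CMDI records declare namespaces and a profile schema on <CMD>.
CMDI_SKELETON = """\
<CMD CMDVersion="1.1">
  <Header>
    <MdCreator>Deutsches Textarchiv</MdCreator>
  </Header>
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="dta-altmann_elementarorganismen_1890.landing_page">
        <ResourceType>LandingPage</ResourceType>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
  <Components>
    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title type="main">Die Elementarorganismen und ihre Beziehungen zu den Zellen</title>
        </titleStmt>
      </fileDesc>
    </teiHeader>
  </Components>
</CMD>"""


def summarize_cmdi(xml_text):
    """Return top-level sections, profile root, resource proxies and main title."""
    root = ET.fromstring(xml_text)
    components = root.find("Components")
    return {
        # The basic structure every CMDI record shares.
        "sections": [child.tag for child in root],
        # The first child of <Components> is the root of the profile in use.
        "profile_root": components[0].tag,
        # Map each resource proxy id to its declared resource type.
        "proxies": {
            proxy.get("id"): proxy.findtext("ResourceType")
            for proxy in root.iter("ResourceProxy")
        },
        "main_title": root.findtext(".//title[@type='main']"),
    }


if __name__ == "__main__":
    for key, value in summarize_cmdi(CMDI_SKELETON).items():
        print(key, "=", value)
```

The same function works for any profile, since only the content below `<Components>` differs between profiles.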

12 The world of Components: Components

13 The world of Components: ISOcat
  DC-2978
  Data Element Name: Person
  PID: [...]
  Definition: the name of a person

14 The world of Components: Profiles

15 The world of Components
  Think of what you need
  Put together components
  Create your own CMDI profile
  Or: re-use something which is already there
  Questions about CMDI?
    Helpdesk (Timm Lehmberg's talk)
    CLARIN Centers
    CLARIN User Guide

16 CMDI Components (Ex. WebLicht Webservices - CAB)
  <?xml version="1.0" encoding="utf-8"?>
  <CMD>[...]
    <Header>[...]</Header>
    <Resources>[...]</Resources>
    <Components>
      <WebLichtWebService>
        <Service>
          <Name>CAB orthographic canonicalizer</Name>
          <Description>orthographic normalization for historical German</Description>
          <TypeOfWebservice>RESTfull</TypeOfWebservice>
          <url>[...]</url>
          <LifeCycleStatus>production</LifeCycleStatus>
          <PublicationDate>[...]T07:34:20Z</PublicationDate>
          <LastUpdate>[...]T07:34:20Z</LastUpdate>
          <ServiceDescriptionLocation ref="s056"/>
          <Contact>[e-mail address]</Contact>
          <Creation>[Information about creation and creators]</Creation>
  Annotation: Components: actual metadata of the resource described

17 CMDI Components (Ex. WebLicht Webservices - CAB)
          <Operations><Operation>
            <Name>Default</Name>
            <Input><ParameterGroup>
              <Name>Input Parameters</Name>
              <Parameters>
                <Parameter>
                  <Name>tokens</Name>
                  <AllowManualSelectionFallback>false</AllowManualSelectionFallback>
                </Parameter>
                <Parameter>
                  <Name>sentences</Name>
                  <AllowManualSelectionFallback>false</AllowManualSelectionFallback>
                </Parameter>[...]
              </Parameters>[...]
            </ParameterGroup></Input>
            <Output><ParameterGroup>
              <Name>Output Parameters</Name>
              <ReplacesInput>false</ReplacesInput>
              <Parameters>
                <Parameter>
                  <Name>orthography</Name>
                </Parameter>
              </Parameters>
            </ParameterGroup></Output>
          </Operation></Operations>
        </Service>
      </WebLichtWebService>
    </Components>
  </CMD>
  Annotation: Components: actual metadata of the resource described
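The input and output parameters in the service description above state which annotation layers a web service needs and which it produces. As a rough illustration, the sketch below reads these names from a shortened, namespace-free copy of the `<Operation>` element and checks whether a document with given layers could be sent to the service. The helper names `io_layers` and `can_run` are assumptions for illustration, not part of WebLicht.

```python
import xml.etree.ElementTree as ET

# Shortened, namespace-free copy of the <Operation> element from the slide.
OPERATION_XML = """\
<Operation>
  <Name>Default</Name>
  <Input><ParameterGroup>
    <Name>Input Parameters</Name>
    <Parameters>
      <Parameter><Name>tokens</Name></Parameter>
      <Parameter><Name>sentences</Name></Parameter>
    </Parameters>
  </ParameterGroup></Input>
  <Output><ParameterGroup>
    <Name>Output Parameters</Name>
    <Parameters>
      <Parameter><Name>orthography</Name></Parameter>
    </Parameters>
  </ParameterGroup></Output>
</Operation>"""

# Path from <Input> or <Output> down to the individual parameter names.
PARAM_PATH = "ParameterGroup/Parameters/Parameter/Name"


def io_layers(operation_xml):
    """Return (input layer names, output layer names) of one operation."""
    operation = ET.fromstring(operation_xml)
    inputs = [name.text for name in operation.findall("Input/" + PARAM_PATH)]
    outputs = [name.text for name in operation.findall("Output/" + PARAM_PATH)]
    return inputs, outputs


def can_run(operation_xml, available_layers):
    """True if every required input layer is already present in the document."""
    inputs, _ = io_layers(operation_xml)
    return set(inputs) <= set(available_layers)


if __name__ == "__main__":
    print(io_layers(OPERATION_XML))
    print(can_run(OPERATION_XML, ["text", "tokens", "sentences"]))
```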

18 2. The Text Corpus Format TCF

19 TCF: Text Corpus Format
  What is it?
    XML stand-off format for linguistic annotations
    Developed for WebLicht in the context of CLARIN-D
  Compatibility
    LAF (Linguistic Annotation Framework / ISO 24612:2012)
    GrAF (Graph Annotation Format / Ide & Suderman, 2007)
  What is it good for?
    Facilitates annotation-tool interoperability & orchestration
    Lingua franca for web-service execution ("tool chains")
    Explicit specification for concrete annotation tasks
    Incremental processing: annotation layers, e.g. tokens, sentences, PoS-tags, lemmata, parse trees, ...

20 TCF + WebLicht: Example Chain
  All tools use the same I/O format (TCF)
  Each tool adds one or more annotation layer(s)
  Existing layers are passed through unchanged: information from the input document is preserved
  Some TCF layers: text, tokens, sentences, POStags, lemmas, parsing, depparsing, morphology, namedEntities, references, matches, orthography ... and more!
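The chaining principle described on this slide can be modelled in a few lines. The following is a toy model, not WebLicht itself: a document is a plain dictionary of layers, the two "tools" are hypothetical stand-ins (a whitespace tokenizer and a lowercasing "normalizer"), and `run_chain` checks that each tool only adds layers and leaves the existing ones untouched.

```python
def tokenize(doc):
    """Toy tokenizer: adds a 'tokens' layer by splitting on whitespace."""
    return {**doc, "tokens": doc["text"].split()}


def lowercase(doc):
    """Toy stand-in for a normalizer: adds an 'orthography' layer."""
    return {**doc, "orthography": [token.lower() for token in doc["tokens"]]}


def run_chain(doc, tools):
    """Apply tools in order; fail if a tool changes or drops an existing layer."""
    for tool in tools:
        result = tool(doc)
        for name, layer in doc.items():
            if result.get(name) != layer:
                raise ValueError(f"{tool.__name__} modified layer {name!r}")
        doc = result
    return doc


if __name__ == "__main__":
    output = run_chain({"text": "EJn zamer Elephant"}, [tokenize, lowercase])
    for name, layer in output.items():
        print(name, "=", layer)
```

Because every tool reads and writes the same structure, tools can be reordered or swapped as long as the layers each one requires are present.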

21 TCF Example (1): Input
  Input: simple XML text
    <text> EJn zamer Elephant gilt ohngefähr zweyhundert Thaler. Ceterum censeo Carthaginem esse delendam. </text>
  Converter: XML → TCF (text layer)
    XML serialization
    Designed for DTABf

22 TCF Example (2): Text Layer
  Output: TCF superstructure and text layer
    <D-Spin xmlns=... version="0.4">
      <TextCorpus xmlns=... lang="de">
        <text> EJn zamer Elephant gilt ohngefähr zweyhundert Thaler. Ceterum censeo Carthaginem esse delendam. </text>
      </TextCorpus>
    </D-Spin>
  Annotations: TCF version; document language; raw (serialized) document text
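Wrapping raw text in the TCF superstructure is mechanical. The sketch below builds the three nested elements shown on this slide with Python's standard library. It is a simplification under stated assumptions: the function name `make_tcf_text_layer` is hypothetical, and the namespace declarations, which the slide elides, are left out entirely, so the result is not a schema-valid TCF document.

```python
import xml.etree.ElementTree as ET


def make_tcf_text_layer(text, lang="de", version="0.4"):
    """Wrap raw text in a namespace-free D-Spin/TextCorpus/text skeleton."""
    root = ET.Element("D-Spin", {"version": version})
    corpus = ET.SubElement(root, "TextCorpus", {"lang": lang})
    ET.SubElement(corpus, "text").text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(make_tcf_text_layer("Ceterum censeo Carthaginem esse delendam."))
```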

23 TCF Example (3): Tokenization
  <D-Spin... version="0.4">
    <TextCorpus... lang="de">
      <text>...</text>
      <tokens>
        <token ID="w1">EJn</token>
        <token ID="w2">zamer</token>
        <token ID="w3">Elephant</token>...
      </tokens>
      <sentences>
        <sentence ID="s1" tokenIDs="w1 w2 w3 w4 w5 w6 w7 w8"/>
        <sentence ID="s2" tokenIDs="w9 wa wb wc wd we"/>
      </sentences>
    </TextCorpus>
  </D-Spin>
  Annotations: tokenization; tokens and sentences layers; unique IDs for inter-layer cross-references
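Because TCF is a stand-off format, a sentence does not contain its words; it only lists token IDs. Resolving those references is a dictionary lookup, as the following sketch shows. It uses a shortened, namespace-free fragment with only the three tokens spelled out on the slide, writes the reference attribute as `tokenIDs`, and the function name `sentences_as_text` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, namespace-free fragment; only the three tokens given on the slide.
TCF_FRAGMENT = """\
<TextCorpus lang="de">
  <tokens>
    <token ID="w1">EJn</token>
    <token ID="w2">zamer</token>
    <token ID="w3">Elephant</token>
  </tokens>
  <sentences>
    <sentence ID="s1" tokenIDs="w1 w2 w3"/>
  </sentences>
</TextCorpus>"""


def sentences_as_text(tcf_xml):
    """Map each sentence ID to the list of token strings it refers to."""
    corpus = ET.fromstring(tcf_xml)
    # Index the tokens layer by ID so other layers can point into it.
    token_text = {token.get("ID"): token.text for token in corpus.iter("token")}
    return {
        sentence.get("ID"): [
            token_text[token_id] for token_id in sentence.get("tokenIDs").split()
        ]
        for sentence in corpus.iter("sentence")
    }


if __name__ == "__main__":
    print(sentences_as_text(TCF_FRAGMENT))
```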

24 TCF Example (4): (modern) Orthography
  <D-Spin... version="0.4">
    <TextCorpus... lang="de">
      <tokens>
        <token ID="w1">EJn</token>
        <token ID="w2">zamer</token>
        <token ID="w3">Elephant</token>...
      </tokens>...
      <orthography>
        <correction tokenIDs="w1" operation="replace">ein</correction>
        <correction tokenIDs="w2" ...="replace">zahmer</correction>
        <correction tokenIDs="w3" ...="replace">elefant</correction>...
      </orthography>
    </TextCorpus>
  </D-Spin>
  Annotations: orthographic normalization; orthography layer
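The orthography layer leaves the tokens layer untouched and stores the modern forms separately. A consumer that wants normalized text has to overlay the corrections on the tokens, roughly as sketched below. The fragment is namespace-free and shortened, the attribute is written `tokenIDs`, the fourth token ("gilt", from the input sentence) is added here without a correction to show the fallback, and the function name `normalized_tokens` is hypothetical. Only the `replace` operation shown on the slide is handled.

```python
import xml.etree.ElementTree as ET

# Namespace-free fragment; token w4 deliberately has no correction.
ORTHOGRAPHY_FRAGMENT = """\
<TextCorpus lang="de">
  <tokens>
    <token ID="w1">EJn</token>
    <token ID="w2">zamer</token>
    <token ID="w3">Elephant</token>
    <token ID="w4">gilt</token>
  </tokens>
  <orthography>
    <correction tokenIDs="w1" operation="replace">ein</correction>
    <correction tokenIDs="w2" operation="replace">zahmer</correction>
    <correction tokenIDs="w3" operation="replace">elefant</correction>
  </orthography>
</TextCorpus>"""


def normalized_tokens(tcf_xml):
    """Return token strings with 'replace' corrections applied where present."""
    corpus = ET.fromstring(tcf_xml)
    replacements = {}
    for correction in corpus.iter("correction"):
        if correction.get("operation") == "replace":
            for token_id in correction.get("tokenIDs").split():
                replacements[token_id] = correction.text
    # Fall back to the original token text when no correction exists.
    return [
        replacements.get(token.get("ID"), token.text)
        for token in corpus.iter("token")
    ]


if __name__ == "__main__":
    print(normalized_tokens(ORTHOGRAPHY_FRAGMENT))
```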

25 TCF Example (5): Part-of-Speech Tags
  <D-Spin... version="0.4">
    <TextCorpus... lang="de">
      <tokens>
        <token ID="w1">EJn</token>
        <token ID="w2">zamer</token>
        <token ID="w3">Elephant</token>...
      </tokens>...
      <POStags tagset="stts">
        <tag tokenIDs="w1">art</tag>
        <tag tokenIDs="w2">adja</tag>
        <tag tokenIDs="w3">nn</tag>...
      </POStags>
    </TextCorpus>
  </D-Spin>
  Annotations: PoS-tagging; POStags layer (+ tagset attribute)

26 TCF Example (6): (modern) Lemmata
  <D-Spin... version="0.4">
    <TextCorpus... lang="de">
      <tokens>
        <token ID="w1">EJn</token>
        <token ID="w2">zamer</token>
        <token ID="w3">Elephant</token>...
      </tokens>...
      <lemmas>
        <lemma tokenIDs="w1">eine</lemma>
        <lemma tokenIDs="w2">zahm</lemma>
        <lemma tokenIDs="w3">elefant</lemma>...
      </lemmas>
    </TextCorpus>
  </D-Spin>
  Annotations: lemmatization; lemmas layer
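Since the POStags and lemmas layers both point at token IDs, they can be joined into one row per token. The sketch below does this for a shortened, namespace-free fragment that combines the two preceding examples. It assumes that each `<tag>` and `<lemma>` refers to exactly one token, as in the slides, writes the attribute as `tokenIDs`, and uses the hypothetical function name `annotation_rows`.

```python
import xml.etree.ElementTree as ET

# Namespace-free fragment combining the POStags and lemmas examples.
ANNOTATED_FRAGMENT = """\
<TextCorpus lang="de">
  <tokens>
    <token ID="w1">EJn</token>
    <token ID="w2">zamer</token>
    <token ID="w3">Elephant</token>
  </tokens>
  <POStags tagset="stts">
    <tag tokenIDs="w1">art</tag>
    <tag tokenIDs="w2">adja</tag>
    <tag tokenIDs="w3">nn</tag>
  </POStags>
  <lemmas>
    <lemma tokenIDs="w1">eine</lemma>
    <lemma tokenIDs="w2">zahm</lemma>
    <lemma tokenIDs="w3">elefant</lemma>
  </lemmas>
</TextCorpus>"""


def annotation_rows(tcf_xml):
    """Return (token ID, token, PoS tag, lemma) tuples, one per token.

    Assumes every <tag> and <lemma> references exactly one token.
    """
    corpus = ET.fromstring(tcf_xml)
    tags = {tag.get("tokenIDs"): tag.text for tag in corpus.iter("tag")}
    lemmas = {lemma.get("tokenIDs"): lemma.text for lemma in corpus.iter("lemma")}
    return [
        (token.get("ID"), token.text, tags.get(token.get("ID")), lemmas.get(token.get("ID")))
        for token in corpus.iter("token")
    ]


if __name__ == "__main__":
    for row in annotation_rows(ANNOTATED_FRAGMENT):
        print(*row, sep="\t")
```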

27 WebLicht
  Further processing of TCF data within CLARIN's WebLicht (cf. Thorsten Trippel's talk)