Uniting Diversity: The CLARIN Input Formats CMDI and TCF
Susanne Haaf & Bryan Jurish, Deutsches Textarchiv
1. The Metadata Format CMDI
Metadata? Metadata Format? and more CMDI (Component Metadata Infrastructure)
CMDI? What's that?
- Component Metadata Infrastructure
- Metadata components (e.g. author, title, license, ...) are combined into metadata profiles (e.g. DTA Basisformat teiHeader)
- Create new components/profiles or re-use those which are already there
- One basic CMDI structure that all resources have in common
- ISOcat Data Categories define the semantics of components
Why CMDI?
- CMDI is not a format per se but rather a framework
- Hence: I don't really have to decide on a format
- I define the semantics of my metadata categories myself
- Plus: in CMDI you can describe any resource you like:
  - collections/corpora, single texts
  - historical sources, recent sources
  - sound (spoken, music), film, text, multimedia
  - lexical resources (lexica & dictionaries, treebanks, ...)
  - tools, services, applications
- These descriptions can then be represented as a whole
- Hence: get all there is in CLARIN through one portal: http://catalog.clarin.eu/vlo/?2
CMDI Basic Structure (Example DTA)

    <?xml version="1.0" encoding="utf-8"?>
    <CMD
      xsi:schemaLocation="http://www.clarin.eu/cmd/ http://media.dwds.de/dta/media/schema/cmdi-header.xsd"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://www.clarin.eu/cmd/"
      CMDVersion="1.1">
      <Header> [...] </Header>
      <Resources> [...] </Resources>
      <Components> [...] </Components>
    </CMD>

Annotations on the root element:
- xsi:schemaLocation: schema specification (here: the DTA-CMDI profile)
- xmlns:xsi, xmlns: namespace information
- CMDVersion: version information (N.b. new version CMDI 1.2 coming up)
CMDI Basic Structure (Example DTA)

    <?xml version="1.0" encoding="utf-8"?>
    <CMD>
      <Header>
        <MdCreator>Deutsches Textarchiv</MdCreator>
        <MdCreationDate>2014-11-14</MdCreationDate>
        <MdSelfLink>
          http://www.deutschestextarchiv.de/api/cmdi/altmann_elementarorganismen_1890
        </MdSelfLink>
        <MdProfile>
          clarin.eu:cr1:p_1381926654438
        </MdProfile>
        <MdCollectionDisplayName>
          Deutsches Textarchiv (1600-1900)
        </MdCollectionDisplayName>
      </Header>
      <Resources>[...]</Resources>
      <Components>[...]</Components>
    </CMD>

Header: meta-metadata (metadata about the metadata record itself)
CMDI Basic Structure (Example DTA)

    <?xml version="1.0" encoding="utf-8"?>
    <CMD>
      <Header>[...]</Header>
      <Resources>
        <ResourceProxyList>
          <ResourceProxy id="dta-altmann_elementarorganismen_1890.landing_page">
            <ResourceType>LandingPage</ResourceType>
            <ResourceRef>
              http://www.deutschestextarchiv.de/altmann_elementarorganismen_1890
            </ResourceRef>
          </ResourceProxy>
        </ResourceProxyList>
        <JournalFileProxyList>[...]</JournalFileProxyList>
        <ResourceRelationList>[...]</ResourceRelationList>
        <IsPartOfList>
          <IsPartOf>[...]</IsPartOf>
        </IsPartOfList>
      </Resources>
      <Components>[...]</Components>
    </CMD>

Resources: the resources described and the resources related to them
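The Header and Resources sections shown above can be read with any namespace-aware XML parser. The following is a minimal sketch using Python's standard library; the record is abridged from the DTA example, the namespace URI is the one declared on the CMD root element, and the helper names `read_header` and `read_resource_proxies` are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# Namespace as declared by xmlns on the CMD root element of the DTA example.
CMD_NS = {"cmd": "http://www.clarin.eu/cmd/"}

# Abridged record modelled on the DTA example; elided parts are left out.
RECORD = """<CMD xmlns="http://www.clarin.eu/cmd/" CMDVersion="1.1">
  <Header>
    <MdCreator>Deutsches Textarchiv</MdCreator>
    <MdCreationDate>2014-11-14</MdCreationDate>
    <MdProfile>clarin.eu:cr1:p_1381926654438</MdProfile>
  </Header>
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="dta-altmann_elementarorganismen_1890.landing_page">
        <ResourceType>LandingPage</ResourceType>
        <ResourceRef>http://www.deutschestextarchiv.de/altmann_elementarorganismen_1890</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
  <Components/>
</CMD>"""


def read_header(root):
    """Return the children of <Header> as a {local-name: text} dict."""
    header = root.find("cmd:Header", CMD_NS)
    return {
        child.tag.split("}", 1)[1]: (child.text or "").strip()
        for child in header
    }


def read_resource_proxies(root):
    """Return one dict per <ResourceProxy> with its id, type and reference."""
    path = "cmd:Resources/cmd:ResourceProxyList/cmd:ResourceProxy"
    proxies = []
    for proxy in root.iterfind(path, CMD_NS):
        proxies.append({
            "id": proxy.get("id"),
            "type": proxy.findtext("cmd:ResourceType", default="", namespaces=CMD_NS).strip(),
            "ref": proxy.findtext("cmd:ResourceRef", default="", namespaces=CMD_NS).strip(),
        })
    return proxies


root = ET.fromstring(RECORD)
header = read_header(root)
proxies = read_resource_proxies(root)
print(header["MdProfile"])
print(proxies[0]["type"], proxies[0]["ref"])
```

Because every CMDI record shares this envelope, the same two functions should work regardless of which profile fills the Components section.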
CMDI Components (Example DTA)

    <?xml version="1.0" encoding="utf-8"?>
    <CMD>
      <Header>[...]</Header>
      <Resources>[...]</Resources>
      <Components>
        <teiHeader>
          <fileDesc>
            <titleStmt>
              <title type="main">
                Die Elementarorganismen und ihre Beziehungen zu den Zellen
              </title>
              <author>[...]</author>
              [...]
            </titleStmt>
            <publicationStmt>[including availability]</publicationStmt>
            <sourceDesc>
              [including depository of the physical source]
            </sourceDesc>
          </fileDesc>
          <encodingDesc>[...]</encodingDesc>
          <profileDesc>[including genre]</profileDesc>
        </teiHeader>
      </Components>
    </CMD>

Components: the actual metadata of the resource described
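Since the content of Components depends on the profile, a generic consumer cannot hard-code paths. One possible approach, sketched here under assumptions, is to flatten the component tree into path/value pairs. The sample is abridged from the DTA example, the element names are written in TEI-style camel case (an assumption about the profile's spelling), and `leaf_paths` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Abridged from the DTA example; camel-case element names are an assumption.
COMPONENTS = """<Components xmlns="http://www.clarin.eu/cmd/">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title type="main">Die Elementarorganismen und ihre Beziehungen zu den Zellen</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</Components>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def leaf_paths(elem, prefix=""):
    """Yield (path, text) for every element that has no child elements."""
    path = f"{prefix}/{local(elem.tag)}"
    children = list(elem)
    if not children:
        yield path, (elem.text or "").strip()
    for child in children:
        yield from leaf_paths(child, path)


paths = dict(leaf_paths(ET.fromstring(COMPONENTS)))
for path, value in paths.items():
    print(path, "=", value)
```

Note that repeated elements would share a path and overwrite each other in the dict; a real consumer would collect lists instead.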
The world of Components: Components http://catalog.clarin.eu/ds/componentregistry
The world of Components: ISOcat
http://catalog.clarin.eu/ds/componentregistry
- Data Category: DC-2978
- Data Element Name: Person
- PID: http://www.isocat.org/datcat/dc-2978
- Definition: the name of a person
The world of Components: Profiles http://catalog.clarin.eu/ds/componentregistry
The world of Components
- Think of what you need
- Put together components
- Create your own CMDI profile
- Or: re-use something which is already there

Questions about CMDI?
- Helpdesk (Timm Lehmberg's talk)
- CLARIN Centers
- CLARIN User Guide
CMDI Components (Ex. WebLicht Webservices - CAB)

    <?xml version="1.0" encoding="utf-8"?>
    <CMD>[...]
      <Header>[...]</Header>
      <Resources>[...]</Resources>
      <Components>
        <WebLichtWebService>
          <Service>
            <Name>CAB orthographic canonicalizer</Name>
            <Description>
              orthographic normalization for historical German
            </Description>
            <TypeOfWebservice>RESTfull</TypeOfWebservice>
            <url>http://kaskade.dwds.de/demo/cab/query?fmt=tcf-orth</url>
            <LifeCycleStatus>production</LifeCycleStatus>
            <PublicationDate>2013-07-12T07:34:20Z</PublicationDate>
            <LastUpdate>2013-07-12T07:34:20Z</LastUpdate>
            <ServiceDescriptionLocation ref="s056"/>
            <Contact>
              <Email>jurish@bbaw.de</Email>
            </Contact>
            <Creation>[Information about creation and creators]</Creation>

Components: the actual metadata of the resource described
CMDI Components (Ex. WebLicht Webservices - CAB), continued

            <Operations><Operation>
              <Name>Default</Name>
              <Input><ParameterGroup>
                <Name>Input Parameters</Name>
                <Parameters>
                  <Parameter>
                    <Name>tokens</Name>
                    <AllowManualSelectionFallback>false</AllowManualSelectionFallback>
                  </Parameter>
                  <Parameter>
                    <Name>sentences</Name>
                    <AllowManualSelectionFallback>false</AllowManualSelectionFallback>
                  </Parameter>[...]
                </Parameters>[...]
              </ParameterGroup></Input>
              <Output><ParameterGroup>
                <Name>Output Parameters</Name>
                <ReplacesInput>false</ReplacesInput>
                <Parameters>
                  <Parameter>
                    <Name>orthography</Name>
                  </Parameter>
                </Parameters>
              </ParameterGroup></Output>
            </Operation></Operations>
          </Service>
        </WebLichtWebService>
      </Components>
    </CMD>

Components: the actual metadata of the resource described
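The Input and Output parameter groups are what make a service description usable for chaining: they state which layers a service needs and which it produces. The sketch below reads those names from an abridged copy of the CAB description and checks them against a set of available layers. It assumes, for illustration only, that parameter names can be compared directly with layer names; namespaces are omitted, and `parameter_names` and `can_run` are hypothetical helpers, not part of WebLicht.

```python
import xml.etree.ElementTree as ET

# Abridged from the CAB service description; namespaces omitted for brevity.
SERVICE = """<Service>
  <Name>CAB orthographic canonicalizer</Name>
  <Operations><Operation>
    <Name>Default</Name>
    <Input><ParameterGroup>
      <Name>Input Parameters</Name>
      <Parameters>
        <Parameter><Name>tokens</Name></Parameter>
        <Parameter><Name>sentences</Name></Parameter>
      </Parameters>
    </ParameterGroup></Input>
    <Output><ParameterGroup>
      <Name>Output Parameters</Name>
      <Parameters>
        <Parameter><Name>orthography</Name></Parameter>
      </Parameters>
    </ParameterGroup></Output>
  </Operation></Operations>
</Service>"""


def parameter_names(service, direction):
    """Collect parameter names under <Input> or <Output> of every operation."""
    path = f"Operations/Operation/{direction}/ParameterGroup/Parameters/Parameter/Name"
    return {name.text.strip() for name in service.iterfind(path)}


def can_run(service, available_layers):
    """True if every required input parameter is among the available layers."""
    return parameter_names(service, "Input") <= set(available_layers)


service = ET.fromstring(SERVICE)
inputs = parameter_names(service, "Input")
outputs = parameter_names(service, "Output")
print(sorted(inputs), "->", sorted(outputs))
print(can_run(service, ["text", "tokens", "sentences"]))
```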
2. The Text Corpus Format TCF
TCF: Text Corpus Format
http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/the_tcf_format

What is it?
- XML stand-off format for linguistic annotations
- Developed for WebLicht in the context of CLARIN-D
- Compatible with:
  - LAF (Linguistic Annotation Framework, ISO 24612:2012)
  - GrAF (Graph Annotation Format; Ide & Suderman, 2007)

What is it good for?
- Facilitates annotation-tool interoperability & orchestration
  - Lingua franca for web-service execution ("tool chains")
  - Explicit specification for concrete annotation tasks
- Incremental processing: annotation layers, e.g. tokens, sentences, PoS-tags, lemmata, parse trees, ...
TCF + WebLicht: Example Chain
- All tools use the same I/O format (TCF)
- Each tool adds one or more annotation layer(s)
- Existing layers are passed through unchanged, so information from the input document is preserved
- Some TCF layers: text, tokens, sentences, POStags, lemmas, parsing, depparsing, morphology, namedentities, references, matches, orthography, ... and more!
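The chain principle (same format in and out, each tool adds layers, existing layers pass through unchanged) can be illustrated with a toy model in which a document is a dict of layers. This is only a sketch of the idea: the two "tools" are placeholders, not real WebLicht services, and `run_chain` is a hypothetical helper that enforces the pass-through rule.

```python
def tokenizer(doc):
    """Toy tokenizer: adds a 'tokens' layer by splitting 'text' on whitespace."""
    return {**doc, "tokens": doc["text"].split()}


def tagger(doc):
    """Toy tagger: adds a placeholder tag per token (not a real PoS model)."""
    return {**doc, "POStags": ["X"] * len(doc["tokens"])}


def run_chain(doc, tools):
    """Apply tools in order and verify that existing layers stay unchanged."""
    for tool in tools:
        before = dict(doc)
        doc = tool(doc)
        for layer, content in before.items():
            if doc.get(layer) != content:
                raise ValueError(f"{tool.__name__} modified layer {layer!r}")
    return doc


result = run_chain({"text": "EJn zamer Elephant"}, [tokenizer, tagger])
print(list(result))
```

Keeping earlier layers intact is what lets later tools, and the end user, always trace an annotation back to the original text.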
TCF Example (1): Input

Input: simple XML text

    <text>
      EJn zamer Elephant gilt ohngefähr zweyhundert Thaler.
      Ceterum censeo Carthaginem esse delendam.
    </text>

Converter: XML to TCF (text layer)
- http://kaskade.dwds.de/demo/cab/file?a=null&fmt=tei&ofmt=tcf-text
- XML serialization
- Designed for DTABf
TCF Example (2): Text Layer

Output: TCF superstructure and text layer

    <D-Spin xmlns=... version="0.4">
      <TextCorpus xmlns=... lang="de">
        <text>
          EJn zamer Elephant gilt ohngefähr zweyhundert Thaler.
          Ceterum censeo Carthaginem esse delendam.
        </text>
      </TextCorpus>
    </D-Spin>

Annotations:
- version: TCF version
- lang: document language
- text: raw (serialized) document text
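To make the superstructure concrete, here is a minimal sketch that wraps a plain string in the D-Spin / TextCorpus / text nesting shown on the slide. The two namespace URIs are assumptions (the slide elides them as `xmlns=...`) and should be checked against the TCF specification; `make_tcf` is a hypothetical helper, and the sketch mirrors only the elements shown on the slide.

```python
import xml.etree.ElementTree as ET

# ASSUMED namespace URIs: the slide elides them ("xmlns=..."), so verify
# these against the TCF specification before relying on them.
DSPIN_NS = "http://www.dspin.de/data"
TC_NS = "http://www.dspin.de/data/textcorpus"


def make_tcf(text, lang="de", version="0.4"):
    """Build the D-Spin > TextCorpus > text nesting around a raw string."""
    root = ET.Element(f"{{{DSPIN_NS}}}D-Spin", {"version": version})
    corpus = ET.SubElement(root, f"{{{TC_NS}}}TextCorpus", {"lang": lang})
    ET.SubElement(corpus, f"{{{TC_NS}}}text").text = text
    return root


doc = make_tcf("EJn zamer Elephant gilt ohngefähr zweyhundert Thaler.")
xml_out = ET.tostring(doc, encoding="unicode")
print(xml_out)
```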
TCF Example (3): Tokenization
http://kaskade.dwds.de/demo/cab/file?a=null&fmt=tei&ofmt=tcf-tok

    <D-Spin... version="0.4">
      <TextCorpus... lang="de">
        <text>...</text>
        <tokens>
          <token ID="w1">EJn</token>
          <token ID="w2">zamer</token>
          <token ID="w3">Elephant</token>...
        </tokens>
        <sentences>
          <sentence ID="s1" tokenIDs="w1 w2 w3 w4 w5 w6 w7 w8"/>
          <sentence ID="s2" tokenIDs="w9 wa wb wc wd we"/>
        </sentences>
      </TextCorpus>
    </D-Spin>

Annotations:
- Tokenization: tokens and sentences layers
- Unique IDs for inter-layer cross-references
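The stand-off idea is that sentences do not contain text; they only point at token IDs. A minimal sketch of resolving those references follows. The sample is shortened to three tokens, namespaces are omitted for brevity, and the hypothetical helper `token_ids` accepts either capitalisation of the reference attribute, since the exact spelling is an assumption here.

```python
import xml.etree.ElementTree as ET

# Shortened to three tokens; namespaces omitted to keep the sketch short.
TCF = """<TextCorpus lang="de">
  <tokens>
    <token ID="w1">EJn</token>
    <token ID="w2">zamer</token>
    <token ID="w3">Elephant</token>
  </tokens>
  <sentences>
    <sentence ID="s1" tokenIDs="w1 w2 w3"/>
  </sentences>
</TextCorpus>"""


def token_ids(elem):
    """Return the referenced token IDs, accepting both attribute spellings."""
    return (elem.get("tokenIDs") or elem.get("tokenids") or "").split()


def sentences_as_text(corpus):
    """Map each sentence ID to the space-joined text of its tokens."""
    tokens = {t.get("ID"): t.text for t in corpus.iterfind("tokens/token")}
    return {
        s.get("ID"): " ".join(tokens[tid] for tid in token_ids(s))
        for s in corpus.iterfind("sentences/sentence")
    }


sents = sentences_as_text(ET.fromstring(TCF))
print(sents)
```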
TCF Example (4): (modern) Orthography
http://kaskade.dwds.de/demo/cab/file?fmt=tei&ofmt=tcf-orth

    <D-Spin... version="0.4">
      <TextCorpus... lang="de">
        <tokens>
          <token ID="w1">EJn</token>
          <token ID="w2">zamer</token>
          <token ID="w3">Elephant</token>...
        </tokens>...
        <orthography>
          <correction tokenIDs="w1" operation="replace">ein</correction>
          <correction tokenIDs="w2"...="replace">zahmer</correction>
          <correction tokenIDs="w3"...="replace">elefant</correction>...
        </orthography>
      </TextCorpus>
    </D-Spin>

Annotation:
- Orthographic normalization: orthography layer
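The orthography layer leaves the original tokens untouched and records modern spellings separately, so a normalized reading has to be assembled by the consumer. The sketch below does this for single-token `replace` corrections only. In the sample, the attribute elided on the slide is assumed to be `operation`, token `w4` is added from the input sentence to show a token without a correction, and namespaces are omitted; `normalized_tokens` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Abridged sample. Assumptions: the elided attribute is operation="replace";
# w4 is taken from the input sentence to show an uncorrected token.
TCF = """<TextCorpus lang="de">
  <tokens>
    <token ID="w1">EJn</token>
    <token ID="w2">zamer</token>
    <token ID="w3">Elephant</token>
    <token ID="w4">gilt</token>
  </tokens>
  <orthography>
    <correction tokenIDs="w1" operation="replace">ein</correction>
    <correction tokenIDs="w2" operation="replace">zahmer</correction>
    <correction tokenIDs="w3" operation="replace">elefant</correction>
  </orthography>
</TextCorpus>"""


def token_ids(elem):
    """Return the referenced token IDs, accepting both attribute spellings."""
    return (elem.get("tokenIDs") or elem.get("tokenids") or "").split()


def normalized_tokens(corpus):
    """Return token texts with single-token 'replace' corrections applied."""
    corrections = {}
    for corr in corpus.iterfind("orthography/correction"):
        ids = token_ids(corr)
        if corr.get("operation") == "replace" and len(ids) == 1:
            corrections[ids[0]] = corr.text
    return [
        corrections.get(tok.get("ID"), tok.text)
        for tok in corpus.iterfind("tokens/token")
    ]


normalized = normalized_tokens(ET.fromstring(TCF))
print(" ".join(normalized))
```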
TCF Example (5): Part-of-Speech Tags
http://kaskade.dwds.de/demo/cab/file?fmt=tei&ofmt=tcf

    <D-Spin... version="0.4">
      <TextCorpus... lang="de">
        <tokens>
          <token ID="w1">EJn</token>
          <token ID="w2">zamer</token>
          <token ID="w3">Elephant</token>...
        </tokens>...
        <POStags tagset="stts">
          <tag tokenIDs="w1">art</tag>
          <tag tokenIDs="w2">adja</tag>
          <tag tokenIDs="w3">nn</tag>...
        </POStags>
      </TextCorpus>
    </D-Spin>

Annotation:
- PoS-tagging: POStags layer (+ tagset attribute)
TCF Example (6): (modern) Lemmata
http://kaskade.dwds.de/demo/cab/file?fmt=tei&ofmt=tcf

    <D-Spin... version="0.4">
      <TextCorpus... lang="de">
        <tokens>
          <token ID="w1">EJn</token>
          <token ID="w2">zamer</token>
          <token ID="w3">Elephant</token>...
        </tokens>...
        <lemmas>
          <lemma tokenIDs="w1">eine</lemma>
          <lemma tokenIDs="w2">zahm</lemma>
          <lemma tokenIDs="w3">elefant</lemma>...
        </lemmas>
      </TextCorpus>
    </D-Spin>

Annotation:
- Lemmatization: lemmas layer
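Because every layer refers back to the same token IDs, layers produced by different tools can be joined into one table per token. A minimal sketch, assuming single-token references and omitting namespaces; the sample combines the POStags and lemmas values from the two preceding examples, and `layer_by_token` and `token_table` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Sample combining the POStags and lemmas layers; namespaces omitted.
TCF = """<TextCorpus lang="de">
  <tokens>
    <token ID="w1">EJn</token>
    <token ID="w2">zamer</token>
    <token ID="w3">Elephant</token>
  </tokens>
  <POStags tagset="stts">
    <tag tokenIDs="w1">art</tag>
    <tag tokenIDs="w2">adja</tag>
    <tag tokenIDs="w3">nn</tag>
  </POStags>
  <lemmas>
    <lemma tokenIDs="w1">eine</lemma>
    <lemma tokenIDs="w2">zahm</lemma>
    <lemma tokenIDs="w3">elefant</lemma>
  </lemmas>
</TextCorpus>"""


def token_ids(elem):
    """Return the referenced token IDs, accepting both attribute spellings."""
    return (elem.get("tokenIDs") or elem.get("tokenids") or "").split()


def layer_by_token(corpus, path):
    """Index the annotations found at `path` by the token IDs they refer to."""
    mapping = {}
    for elem in corpus.iterfind(path):
        for tid in token_ids(elem):
            mapping[tid] = elem.text
    return mapping


def token_table(corpus):
    """Join tokens with their PoS tag and lemma; missing values become None."""
    pos = layer_by_token(corpus, "POStags/tag")
    lemma = layer_by_token(corpus, "lemmas/lemma")
    return [
        {
            "id": tok.get("ID"),
            "token": tok.text,
            "pos": pos.get(tok.get("ID")),
            "lemma": lemma.get(tok.get("ID")),
        }
        for tok in corpus.iterfind("tokens/token")
    ]


rows = token_table(ET.fromstring(TCF))
for row in rows:
    print(row["id"], row["token"], row["pos"], row["lemma"], sep="\t")
```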
WebLicht
- Further processing of TCF data within CLARIN's WebLicht
- cf. Thorsten Trippel's talk
- http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/main_page