Canonical Text Service. Jochen Tiepmar ScaDS, ASV, Uni Leipzig



Similar documents
Canonical Text Service

Convergence of Translation Memory and Statistical Machine Translation


Business Process Management

ETD The application of Persistent Identifiers as one approach to ensure long-term referencing of Online-Theses

About Contract Management

GATE Mímir and cloud services. Multi-paradigm indexing and search tool Pay-as-you-go large-scale annotation

DATA CITATION. what you need to know

Berlin-Brandenburg Academy of sciences and humanities (BBAW) resources / services

CPAS Overview. Josh Eckels LabKey Software

A Framework-based Online Question Answering System. Oliver Scheuer, Dan Shen, Dietrich Klakow

AAI for Mobile Apps How mobile Apps can use SAML Authentication and Attributes. Lukas Hämmerle

2012 LABVANTAGE Solutions, Inc. All Rights Reserved.

The presentation explains how to create and access the web services using the user interface. WebServices.ppt. Page 1 of 14

FRACTAL SYSTEM & PROJECT SUITE: ENGINEERING TOOLS FOR IMPROVING DEVELOPMENT AND OPERATION OF THE SYSTEMS. (Spain); ABSTRACT 1.

Web Archiving and Scholarly Use of Web Archives

WESTERNACHER OUTLOOK -MANAGER OPERATING MANUAL

Object systems available in R. Why use classes? Information hiding. Statistics 771. R Object Systems Managing R Projects Creating R Packages

Creating a TEI-Based Website with the exist XML Database

The Compatible One Application and Platform Service 1 (COAPS) API User Guide

Working with the British Library and DataCite A guide for Higher Education Institutions in the UK

Visualizing an Auto-Generated Topic Map

An Introduction to TextGrid

Big Data and Scripting map/reduce in Hadoop

Protein Protein Interaction Networks

Oracle BI Publisher Enterprise Cluster Deployment. An Oracle White Paper August 2007

Introducing Apache Pivot. Greg Brown, Todd Volkert 6/10/2010

Apache Lucene. Searching the Web and Everything Else. Daniel Naber Mindquarry GmbH ID 380

USM Web Content Management System

CISC 275: Introduction to Software Engineering. Lab 5: Introduction to Revision Control with. Charlie Greenbacker University of Delaware Fall 2011

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

ManageEngine (division of ZOHO Corporation) Infrastructure Management Solution (IMS)

Research Data Management Guide

OpenAIRE Research Data Management Briefing paper

Cite My Data M2M Service Technical Description

Visual Aids. Release 2.4 Version 1.1

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 412, University of Maryland. Guest lecturer: David Hovemeyer.

So in order to grab all the visitors requests we add to our workbench a non-test-element of the proxy type.

Enabling Better Business Intelligence and Information Architecture With SAP PowerDesigner Software

Literature Review Service Frameworks and Architectural Design Patterns in Web Development

Strategic solutions to drive results in matrix organizations

Business Intelligence with AxCMS.net

Sector vs. Hadoop. A Brief Comparison Between the Two Systems

How To Make A Credit Risk Model For A Bank Account

Complex Networks Analysis: Clustering Methods

LCH.Clearnet Version 1.0 Date : 26 June /10

LIVE CHAT CLOUD SECURITY Everything you need to know about live chat and communicating with your customers securely

Scholarly Use of Web Archives

Distributed Computing and Big Data: Hadoop and MapReduce

VMware vcenter Log Insight User's Guide

Monitoring Pramati EJB Server

Session Service Architecture

Specify the location of an HTML control stored in the application repository. See Using the XPath search method, page 2.

Product description of the docuform Fleet & Service Management software

Data Management in an International Data Grid Project. Timur Chabuk 04/09/2007

9. Text & Documents. Visualizing and Searching Documents. Dr. Thorsten Büring, 20. Dezember 2007, Vorlesung Wintersemester 2007/08

10A CA Plex in the Cloud. Rob Layzell CA Technologies

Novell Identity Manager

Capacity Auction System

Checklist for a Data Management Plan draft

Application of Next Generation Telecom Network Management Architecture to Satellite Ground Systems

Product Navigator User Guide

Flattening Enterprise Knowledge

How To Install Powerpoint 6 On A Windows Server With A Powerpoint 2.5 (Powerpoint) And Powerpoint On A Microsoft Powerpoint 4.5 Powerpoint (Powerpoints) And A Powerpoints 2

Deploying Oracle Business Intelligence Publisher in J2EE Application Servers Release

COL- CC and ESA Ground Segment. Page 1

Tutorial 5: Developing Java applications

Semantic Interoperability

A Peer-to-Peer Approach to Content Dissemination and Search in Collaborative Networks

Richmond SupportDesk Web Reports Module For Richmond SupportDesk v6.72. User Guide

Web Service Facade for PHP5. Andreas Meyer, Sebastian Böttner, Stefan Marr

... Introduction... 17

SEO Toolkit Magento Extension User Guide Official extension page: SEO Toolkit

Resource Management on Computational Grids

Working with WebSphere 4.0

OrecX Oreka Total Recorder Application Notes

Continuous Integration. CSC 440: Software Engineering Slide #1

Shoal: IaaS Cloud Cache Publisher

Introduction to Python for Text Analysis

Request for Information (RFI) Supply of information on an Enterprise Integration Solution to CSIR

How To Useuk Data Service

Moodle Administrator s Manual Table of Contents

PSW Guide. Version 4.7 April 2013

Recommendations for Interdisciplinary Interoperability (R 3.3.1)

Step by step guide to implement SMS authentication to Cisco ASA Clientless SSL VPN and Cisco VPN

Web. Services. Web Technologies. Today. Web. Technologies. Internet WWW. Protocols TCP/IP HTTP. Apache. Next Time. Lecture # Apache.

Transcription:

Canonical Text Service Jochen Tiepmar ScaDS, ASV, Uni Leipzig

Overview Canonical Text Services (CTS) protocol for a webbased citable text service Unique Identifiers(Unique Resource Name, URN) refer to text passages Developed in Homer Multitext Project(www.homermultitext.org), Smith et.al.2009 http://www.homermultitext.org/hmt-docs/specifications/ctsurn/ http://www.homermultitext.org/hmt-docs/specifications/cts/ This implementation was done in Billion Words Project Implementation for Tripelstore and XML-DB not suitable for BW-Project Demo webpage: www.urncts.de

Hierarchical Documents Shakespeare, Sonet 1, Vers 1 Shakespeare Sonetts Sonet 1 Sonet 35 Sonet 154 Vers 1 Vers 5 Word 1 Word 10

Citation Outer Hierarchy Shakespeare Sonnets english 1st edition Inner Hierarchy Sonnet 1 Vers 1 Combination Shakespeare Sonnets english 1st edition Sonnet 1 Vers 1 CTS-URN urn:cts:demo:shakespeare.sonnets.en.1:1.1

Canonical Text Services (CTS) urn:cts:demo:shakespeare.sonnets.en.1:1.1 From fairest creatures we desire increase, Shakespear e Kapitel 2 Shakespeare Sonette Satz1 Kapitel154 Sonett 1 Sonett 35 Sonett 154 CTS Vers 1 Vers 5 Wort 410 Wort 115 Wort 1 Wort 10 So ne Ve tt 35 rs 1 Shakes peare Sonette Wort 1 Sone tt 1 Shak espe are Sone Sone tte tt 35 Vers 1 Vers 5 Wort 0 Sone tt 154 Sonett 154

urn:cts:demo:shakespeare.sonnets: urn:cts:demo:shakespeare.sonnets.de: Kanonische Zitation Shakespeare Sonette Sonett 1 Sonett 35 Sonett 154 Vers 1 Vers 5 Wort 1 Wort 10

urn:cts:demo:shakespeare.sonnets:35.4 Kanonische Zitation Shakespeare Sonette Sonett 1 Sonett 35 Sonett 154 Vers 1 Vers 5 Wort 1 Wort 10

urn:cts:demo:shakespeare.sonnets:35 Kanonische Zitation Shakespeare Sonette Sonett 1 Sonett 35 Sonett 154 Vers 1 Vers 5 Wort 1 Wort 10

Kanonische Zitation urn:cts:demo:shakespeare.sonnets:35.1-35.5 urn:cts:demo:shakespeare.sonnets:35.1-35 Shakespeare Sonette Sonett 1 Sonett 35 Sonett 154 Vers 1 Vers 5 Wort 1 Wort 10

Kanonische Zitation urn:cts:demo:shakespeare.sonnets:35.1@grieved-35.5@faults[1] Shakespeare Sonette Sonett 1 Sonett 35 Sonett 154 Vers 1 Vers 5 Wort 1 Wort 10

Functions GetCapabilities ( ) GetValidReff ( urn, level ) GetFirstUrn ( urn ) GetPrevNextUrn ( urn ) GetLabel ( urn ) GetPassage ( urn ) GetPassagePlus ( urn ) Function name & Parameters as GET in HTTP-Request

Functions GetCapabilities ( ) Textinventory GetValidReff ( urn, level ) GetFirstUrn ( urn ) GetPrevNextUrn ( urn ) GetLabel ( urn ) GetPassage ( urn ) GetPassagePlus ( urn )

Functions GetCapabilities ( ) GetValidReff ( urn, level ) All URNs belonging to [urn] with distance [level] all URNs from chapter 8 GetFirstUrn ( urn ) GetPrevNextUrn ( urn ) GetLabel ( urn ) GetPassage ( urn ) GetPassagePlus ( urn )

Functions GetCapabilities ( ) GetValidReff ( urn, level ) GetFirstUrn ( urn ) First URN that belongs to [urn] 1 st URN in chapter 8 GetPrevNextUrn ( urn ) GetLabel ( urn ) GetPassage ( urn ) GetPassagePlus ( urn )

Functions GetCapabilities ( ) GetValidReff ( urn, level ) GetFirstUrn ( urn ) GetPrevNextUrn ( urn ) Left and right neighbour URNs from [urn] URNs left and right of line 8 GetLabel ( urn ) GetPassage ( urn ) GetPassagePlus ( urn )

Functions GetCapabilities ( ) GetValidReff ( urn, level ) GetFirstUrn ( urn ) GetPrevNextUrn ( urn ) GetLabel ( urn ) Informal desciption of [urn] Shakespears Sonnet 36 Line 9 GetPassage ( urn ) GetPassagePlus ( urn )

Functions GetCapabilities ( ) GetValidReff ( urn, level ) GetFirstUrn ( urn ) GetPrevNextUrn ( urn ) GetLabel ( urn ) GetPassage ( urn ) Passage specified by [urn] the text GetPassagePlus ( urn )

Functions GetCapabilities ( ) GetValidReff ( urn, level ) GetFirstUrn ( urn ) GetPrevNextUrn ( urn ) GetLabel ( urn ) GetPassage ( urn ) GetPassagePlus ( urn ) All of the above except the textinventory

CTS-Iterator Mapping HTTP Requests and XML Documents on Iterator functions -> More attractive for Developers JAVA and Ruby-Wrapper done, more when it s done Open Source CTSAccess cts = new CTSAccess(); String url = "http://ctstest.informatik.uni-leipzig.de/dta/cts/"; cts.setbaseurl(url); EditionIterator it = cts.geteditioniterator(); while(it.hasnext()){ Edition e = it.next(); UrnIterator urns = e.geturniterator(); while(urns.hasnext()) while(urns.hasnext(2)) { String urn = urns.next(); String urn = urns.next(2); System.out.print("->"+cts.getLabel(urn)+"->"+cts.getPassage(urn)); } }

Data structure urn:cts:demo:shakespeare.sonnets.en.1:1.1 urn:cts:demo:shakespeare.sonnets.en.1:20.4 urn:cts:demo:goethe.faust1.de.2:12.2 urn:cts:demo:goethe.faust2.de.2:12.2 urn:cts:demo: goethe.faust shakespeare:sonnets:en.1: 1.de.2:12.2 2.de.2:12.2 1.1 20.4

Data structure shakespeare:sonnets:en.1: 1 20. 1 2 3 4 a 3 Text Text Text Text Text Text 1 2 3 4 5 6

Data structure - Advantages Prefix Optimisation urn:cts:demo:shakespeare.sonnets.en.1:1.1 urn:cts:demo:shakespeare.sonnets.en. _20.4 urn:cts:demo_goethe.faust1.de.2:12.2 urn:cts:demo:goethe.faus_2.de.2:12.2 1 Text shakespeare:sonnets:en.1: 1 20. 2 Text 3 Text 4 a 3 Text Text 1 2 3 4 5 6 Text Logarithmical search times

Data Tokens DTA, Deutsches Text Archive 334 820 482 Various german texts each in 3 editions PBC, Parallel Bible Corpus 247 292 629 831 Bible translations Perseus 27 295 030 greeklit, latinlit, farsilit, pdlrefwk 100k 1 281 272 600 Randomly generated documents

DTA BBAW 5 136 Editions 3 Editions per Document (translit, transcript, norm) 1 Citationlevel (Sentence) Kafka, Goethe, Kant, Gauss

PBC 831 Editions Bible translations 5 german bibles 3 Citationlevels (Book, Chapter, Sentence)

Perseus Alphios CTS 1 175 Editions greeklit, latinlit, farsilit, pdlrefwk Heterogenely structured

Statistics Test PC: - Ubuntu-Server - VM in Universities network Testsetup: - Get all editions - Request the passage spanning the first 2 URNs Canoncal Text Services - Jochen Tiepmar 2014

DTA Benchmarks DTA (avg length 204) Time in MS min 31 avg 36,36 max 100 0 500 1000 Canoncal Text Services - Jochen Tiepmar 2014

PBC Benchmarks PBC (avg length 280'931) Time in MS min 33 avg 72,39 max 318 0 500 1000 Canoncal Text Services - Jochen Tiepmar 2014

Perseus Benchmarks Perseus (avg length 31'234) Time in MS min 30 avg 37,65 max 531 0 500 1000 Canoncal Text Services - Jochen Tiepmar 2014

100k Benchmarks 100k (avg length 19'892) Time in MS min 33 avg 80,79 max 705 0 500 1000 Canoncal Text Services - Jochen Tiepmar 2014

Statistics Shortest 1/4 (MS) Longest 1/4 (MS) DTA 36,00 37,29 PBC 60,70 91,50 Perseus 33,76 47,86 100K 78,64 83,08 Shortest 1/10 (MS) Longest 1/10 (MS) DTA 35.90 37,93 PBC 56,62 98,05 Perseus 33,59 60,24 100K 75,31 81,51 Canoncal Text Services - Jochen Tiepmar 2014

Advantages of CTS Online Access Central Dataset Less Redundancy Mapping large texts on small URNs Standardisation Requests independent from name of text units (Chapter, Song, Book) Text passage always in<passage> - element

Advantages of this CTS Online Access Central Dataset Less Redundancy Mapping large texts on small URNs Standardisation Requests independent from name of text units (Chapter, Song, Book) Text passage always in<passage> - element Configurable Views Passage is constructed from its text units -> rules for construction can be changed to present diff. Views Views per GET-Parameter at runtime Standardisation of text units creates access points for generic tools

Views - Plain Text urn:cts:songs:christmas.ohtennenbaum.de.1:1-1.2.4 <passage> O Tannenbaum, O Tannenbaum, ( ) Wie grün sind deine Blätter! O Tannenbaum, O Tannenbaum, ( ) Ein Baum von dir mich hoch erfreut! </passage>

Views - <div> urn:cts:songs:christmas.ohtennenbaum.de.1:1-1.2.4 <passage> <div1 n="1" type="song"> <div2 n="1" type="strophe"> <div3 n="1" type="line">o Tannenbaum, O Tannenbaum, </div3> ( ) <div3 n="6" type="line">wie grün sind deine Blätter! </div3> </div1> </passage> </div2> <div2 n="2" type="strophe"> <div3 n="1" type="line">o Tannenbaum, O Tannenbaum, </div3> ( ) <div3 n= 6" type="line">ein Baum von dir mich hoch erfreut!</div3> </div2>

Views - EpiDoc urn:cts:songs:christmas.ohtennenbaum.de.1:1-1.2.4 <passage><tei:tei><tei:text><tei:body> <tei:div n="1" type="song"> <tei:div n="1" type="strophe"> <l n="1">o Tannenbaum, O Tannenbaum, </l> ( ) <l n="6">wie grün sind deine Blätter! </l> </tei:div> <tei:div n="2" type="strophe"> <l n="1">o Tannenbaum, O Tannenbaum, </l> ( ) <l n="4">ein Baum von dir mich hoch erfreut!</l> </tei:div> </tei:div> </tei:body></tei:text></tei:tei></passage>

Views - Metainformation urn:cts:songs:christmas.ohtennenbaum.de.1:1.1.1 <passage> <div3 n="1" type="line>o Tannenbaum, O Tannenbaum, </div3> <passage> VS. <passage> <div3 n="1" type="line" letters="24" tokens="4" avg_tokensize="6">o Tannenbaum, O Tannenbaum, </div3> <passage>

Views - Various DeleteXML Strip XML from text EscapePassage Escape the passage FormatXML Format XML in passage

Use Cases -> Alignment -> Generic tools and reader -> CTS Text Miner

Alignment urn:cts:dta:albertinus.landtstoertzer011615.de.translit:11 Visualisierung: Stefan Jaenickes TRAViz

Alignierung http://ctstest.informatik.uni-leipzig.de/traviz2/test/bible/urns.html?urn=urn:cts:pbc:deu.elberfelder1905:1.1.1-1.1.4 Visualisierung: Stefan Jaenickes TRAViz Stefan Jänicke, Leipzig University DEV in BMBF-project etraces (PN 01UA1101A)

Generic Reader/Browser 2014 Leipzig University // Martin Reckziegel

Generic Reader TEI vs. Styled 2014 Leipzig University // Martin Reckziegel

CTS Text Miner (CTSTM) CTS Text Mining Framework Broad and comprehensive framework for text analysis Done: Term-Document Matrix Token/Types per Document/Corpus Document- and Termbased Pruning + lists of Stopwords Tokensequence /(Kookurenz) -> Fulltextsearch -> Citation Analysis.

CTS Additional Functions More request possibilities without XML documents Editions Authors CompactDepthTypes DepthTypes AlignmentCandidates Alignment (Bible ind) Alignment (Bible de)

CTS Admin Tool Implemented by Sascha Ludwig

CTS Cloning URNs specify @n-value of <div>s -> @n-values can be used to reconstruct URNs -> Content of one CTS can be cloned Data can be narrowed down from left to right by URNs Clone everything from Shakespeare: urn:cts:demo:shakespeare.sonnets.en.1:1.1 <passage> <div1 n="1" type="song"> <div2 n="1" type="strophe"> <div3 n="1" type="line"> </div3> </div2> <div2 n="2" type="strophe"> <div3 n="1" type="line"> </div3> </div2> </div1> </passage>

CTS Cloning Backup Data http://hdw.eweb4.com/out/1369880.html

Big Picture Backup Data global decentralised community organised community backup ed open access standardized persistent citable easy to install text repository for browsing, searching and analysis of text resources. http://hdw.eweb4.com/out/1369880.html