Content Management for Content Enrichment: Architectural Issues and Strategies


Content Management for Content Enrichment: Architectural Issues and Strategies
Evan Owens, Chief Information Officer, AIP Publishing, LLC
STM E-Production Seminar 2013

This Presentation
- Historical Introduction
- Content Management & Content Enrichment
- Architectural Issues, Questions, Strategies, Use Cases
- AIP Publishing Case Study:
  - New Thesaurus
  - Author Disambiguation
  - Affiliation Disambiguation
A related presentation: The Evolving Information Ecosystem of Publishing, JATS-Con Proceedings 2010, http://www.ncbi.nlm.nih.gov/books/nbk48528

Publishing & Content Management in the 1990s
Publishing is adding a useful degree of uniformity to information
How were we preparing for the digital future?
- Creating a version of record in SGML/XML full text
- Making the perfect master file
- Preparing to publish simultaneously to print and online
- Article SGML/XML file as a pseudo-database or pseudo-CMS: <article copyeditor="XYZ" maildate="00/00/00">
A document-centric approach!

Publishing & Content Management Today
Publishing is adding value to a collection of content by enrichment and by managing the information life cycle
What has changed?
- Content management is now multi-dimensional, multi-system
- Publishing is a much more complex ecosystem
  - DOIs, ORCID, DataCite, Thesauri, etc., etc., etc.
- Less static and less document-centric, more database-like
- More complex information data models
- No longer "publish and be done"
- Life cycle management is now an essential component

Content Management
- A set of processes and technologies that support the evolutionary life cycle of digital information
- Capture, storage, security, revision control, retrieval, distribution, preservation, and description of documents and content (Wikipedia, 2007)
Some CM components that support enrichment:
- Version control
- Technical metadata (formats, format versions, validation)
- Provenance metadata (processing history)

Questions for Enrichment Implementations
1. Does the enrichment benefit from author vetting?
2. Is the enrichment part of the permanent scholarly record?
3. How standardized is the enriched information?
4. How volatile is the enriched information?

Some Use Case Examples
Use cases:
- Reference Linking
- Keywords
- Affiliations
- Author Identity
- Funding Information
Key questions for each: Author vetting? Scholarly record? How standardized? How volatile?

Key Architectural Choices
When is the content enhanced?
- By the author
- During submission
- During production/editorial process
- By the delivery/hosting system
Where does the enhanced information live?
- Embedded in the content
  - In the archival XML or in the exported XML
- External to the content
  - Layered information architectures

Key Design Challenges
- What is the master source/copy of the information?
- Is the information normalized or de-normalized?
  - e.g., repeating parent metadata across child elements
- How to synchronize between multiple systems?
  - e.g., Peer Review System and XML content
As we move from document-centric to more complex information models and architectures, robust entity-relationship modeling becomes critical.
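The normalized versus de-normalized question can be made concrete with a small sketch. Everything here is hypothetical (the record shapes, the ids, and the helper names `denormalize` and `stale_copies` are illustrative assumptions, not anything from the AIPP systems); it only shows why repeating parent metadata across child records creates a synchronization problem once the master record changes.

```python
# Hypothetical records; names and fields are illustrative only.
journals = {"J1": {"title": "Journal of Examples", "issn": "0000-0000"}}

# Normalized: each article stores only a key pointing at its parent journal.
articles_normalized = [
    {"id": "A1", "journal_id": "J1"},
    {"id": "A2", "journal_id": "J1"},
]


def denormalize(articles, journals):
    """Copy parent journal metadata into every child article record."""
    return [{**a, "journal": dict(journals[a["journal_id"]])} for a in articles]


def stale_copies(denormalized, journals):
    """Return ids of articles whose embedded copy no longer matches the master."""
    return [
        a["id"] for a in denormalized if a["journal"] != journals[a["journal_id"]]
    ]


articles_denormalized = denormalize(articles_normalized, journals)

# A change to the master record is seen by normalized readers immediately,
# but every de-normalized copy is out of date until it is re-exported.
journals["J1"]["title"] = "Journal of Renamed Examples"
print(stale_copies(articles_denormalized, journals))
```

The design choice is a trade-off: the de-normalized form is self-contained and easy to export, while the normalized form has a single master copy and nothing to re-synchronize.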

AIPP CASE STUDY:
NEW PHYSICS THESAURUS
AUTHOR DISAMBIGUATION
AFFILIATION DISAMBIGUATION

Enrichment / Disambiguation Goal
[Diagram showing Author Pages, Subject Pages, Institution Pages, and Articles]

Key Strategic Decision (Business & Technical)
- Semantic enrichment and disambiguation to be considered as a feature of the delivery platform
  - Not as part of the publication or version of record
- Past practice: PACS codes printed on pages and in the PDF
  - Resulted in mismatches between older and newer content
  - Differences visible in previous hosting platform

Delivery/Hosting System Architecture
- Publishing Technology's Pub2Web (P2W) hosting platform is built on an RDF triple store
- RDF is ideal for expressing complex relationships, but:
  - P2W manages RDF changes via set algebra
  - P2W displays links dynamically based on the RDF
  - Both parts of a relationship have to be present at RDF loading
  - Content loading is very resource intensive
- Every technology has its quirks
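As a rough illustration of what "managing RDF changes via set algebra" can mean, the sketch below treats a graph as a Python set of (subject, predicate, object) tuples. This is an assumption-laden toy, not a description of how Pub2Web works internally: the identifiers, the predicates, and the helpers `diff_triples` and `dangling_objects` are all made up for the example.

```python
# Triples as (subject, predicate, object) tuples; identifiers are made up.
old = {
    ("article:1", "hasAuthor", "author:10"),
    ("article:1", "hasSubject", "term:optics"),
}
new = {
    ("article:1", "hasAuthor", "author:10"),
    ("article:1", "hasSubject", "term:photonics"),
}


def diff_triples(old, new):
    """Set difference in both directions gives the triples to delete and to add."""
    return old - new, new - old


def dangling_objects(triples, known_entities):
    """Objects that are referenced but not loaded, so the link cannot resolve yet."""
    return {o for (_, _, o) in triples if o not in known_entities}


to_delete, to_add = diff_triples(old, new)
missing = dangling_objects(new, known_entities={"article:1", "author:10"})
print(to_delete, to_add, missing)
```

The second helper mirrors the point that both parts of a relationship have to be present at loading: here the new subject term is referenced before it has been loaded, so it is reported as missing.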

AIPP's Implementation Choices
- XML master in RSuite archive
  - With all article assets: print, online, supplemental
  - Export packaging for hosting platform
- Version control via RSuite
  - Interactions between AI and P2W managed by RSuite
  - Processing history captured in RSuite, not in the XML
- Semantic enrichment embedded in article XML
  - Keywords and inline tagging
  - XML markup strippable via XSLT
- Disambiguation in separate XML files
  - Pointing back into the articles
  - An external annotation layer
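The "external annotation layer" idea can be sketched as a stand-off file that points back into an article by id. The element names, attribute names, and ids below are hypothetical and are not AIPP's actual schema; the sketch uses Python's standard `xml.etree.ElementTree` only to show the shape of the lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical article fragment; element and attribute names are illustrative.
ARTICLE = """<article id="A1">
  <contrib id="c1"><name>J. Smith</name></contrib>
  <contrib id="c2"><name>Jane Smith</name></contrib>
</article>"""

# Hypothetical stand-off file: it lives outside the article and points back
# into it by article id and element id.
ANNOTATIONS = """<disambiguation article="A1">
  <author target="c1" person="P-0001"/>
  <author target="c2" person="P-0001"/>
</disambiguation>"""


def resolve(article_xml, annotation_xml):
    """Map each annotated contributor name to its disambiguated person id."""
    article = ET.fromstring(article_xml)
    layer = ET.fromstring(annotation_xml)
    if layer.get("article") != article.get("id"):
        raise ValueError("annotation layer points at a different article")
    by_id = {el.get("id"): el for el in article.iter() if el.get("id")}
    resolved = {}
    for ann in layer.findall("author"):
        target = by_id.get(ann.get("target"))
        if target is None:
            raise KeyError(f"no element with id {ann.get('target')!r}")
        resolved[target.findtext("name")] = ann.get("person")
    return resolved


result = resolve(ARTICLE, ANNOTATIONS)
print(result)
```

Because the disambiguation lives in its own file, it can be regenerated or corrected without touching the archived article XML, which is consistent with treating it as a feature of the delivery platform and not as part of the version of record.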

Implementation Overview
[Diagram: Thesaurus; Subject Pages (dynamically created by Pub2Web); Articles (XML + semantic, not disambiguation); Author Pages and Institution Pages (each fed by XML from AI that points into articles); a further "XML from AI" feed]


AIPP JATS XML: Keywords
Keyword group in header:
Keywords inline:
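As a rough illustration of the two keyword placements named on this slide, the sketch below uses a made-up article fragment. `<kwd-group>`, `<kwd>`, and `<named-content>` are standard JATS elements, but using `<named-content content-type="kwd">` for inline keywords is an assumption made for this example, as are the `kwd-group-type` value and the sample text; AIPP's actual markup may differ. Python's `xml.etree.ElementTree` stands in here for the XSLT mentioned in the implementation choices.

```python
import xml.etree.ElementTree as ET

# Made-up fragment. The inline markup convention is an assumption.
ARTICLE = """<article>
  <front><article-meta>
    <kwd-group kwd-group-type="thesaurus"><kwd>Lasers</kwd><kwd>Optical fibers</kwd></kwd-group>
  </article-meta></front>
  <body><p>We study <named-content content-type="kwd">optical fibers</named-content> pumped by <named-content content-type="kwd">lasers</named-content>.</p></body>
</article>"""

root = ET.fromstring(ARTICLE)


def header_keywords(root):
    """Keywords listed in the header keyword group."""
    return [kwd.text for kwd in root.iter("kwd")]


def inline_keywords(root):
    """Keywords tagged inline in the running text."""
    return [
        el.text
        for el in root.iter("named-content")
        if el.get("content-type") == "kwd"
    ]


def plain_text(paragraph):
    """Drop the inline wrappers but keep their text, i.e. strip the enrichment."""
    return "".join(paragraph.itertext())


print(header_keywords(root))
print(inline_keywords(root))
print(plain_text(root.find("body/p")))
```

The last helper shows why inline tagging can still be considered strippable: removing the wrappers leaves the original sentence intact.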

AIPP Disambiguated Author XML

AIPP Disambiguated Affiliation XML

AIPP Lifecycle Use Cases: Add / Change / Delete
- Content
  - Corrections could impact author / affiliation / keywords
- Thesaurus Terms
  - Vocabulary will evolve
- Enrichment Rules
  - Quarterly reruns of entire corpus with latest rule set
- Author Disambiguation
  - New info could cause merge or split
- Institution Disambiguation
  - Organizational changes (names, mergers, etc.)
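The merge and split cases for author disambiguation can be sketched against a tiny in-memory registry. The registry shape, the person ids, and the helpers `merge` and `split` are hypothetical and only illustrate the life-cycle operations; they do not describe AIPP's or any vendor's implementation.

```python
# Hypothetical registry: person id -> set of (article id, contributor id) claims.
registry = {
    "P-0001": {("A1", "c1"), ("A2", "c3")},
    "P-0002": {("A3", "c1")},
}


def merge(registry, keep, drop):
    """New evidence says two person ids are one author: fold drop into keep."""
    updated = {pid: set(claims) for pid, claims in registry.items()}
    updated[keep] |= updated.pop(drop)
    return updated


def split(registry, source, new_id, claims_to_move):
    """New evidence says one person id covers two authors: move some claims out."""
    updated = {pid: set(claims) for pid, claims in registry.items()}
    moving = set(claims_to_move)
    if not moving <= updated[source]:
        raise ValueError("cannot move claims the source does not hold")
    updated[source] -= moving
    updated[new_id] = moving
    return updated


merged = merge(registry, keep="P-0001", drop="P-0002")
resplit = split(merged, "P-0001", "P-0003", [("A3", "c1")])
print(merged)
print(resplit)
```

Both operations return a new registry and leave the article records untouched, which fits an approach where disambiguation is kept outside the article XML and can be rerun as new information arrives.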

Publishing is adding value to a collection of content by enrichment and by managing the information life cycle
NO SINGLE MAGIC SOLUTION
YOUR MILEAGE MAY VARY!
QUESTIONS? COMMENTS?