Automated Workflow for the Ingest and Preservation of Electronic Journals. Evan Owens Chief Technology Officer Portico evan.owens@portico.

Similar documents
Portico. An Electronic Archiving Service

Monitor file integrity using MultiHasher

USGS Guidelines for the Preservation of Digital Scientific Data

Preservation Handbook

Ensure your digital records have a future. Digital preservation

Content Management for Content Enrichment: Architectural Issues and Strategies

11 ways to migrate Lotus Notes applications to SharePoint and Office 365

Chapter 5: The DAITSS Archiving Process

Ex Libris Rosetta: A Digital Preservation System Product Description

The Rutgers Workflow Management System. Workflow Management System Defined. The New Jersey Digital Highway

Digital Preservation Recorder 6.0.0

Archiving Systems. Uwe M. Borghoff Universität der Bundeswehr München Fakultät für Informatik Institut für Softwaretechnologie.

Electronic Records Management Guidelines - File Formats

AHDS Digital Preservation Glossary

Introduction Thanks Survey of attendees Questions at the end

WHY DIGITAL ASSET MANAGEMENT? WHY ISLANDORA?

THE BRITISH LIBRARY. Unlocking The Value. The British Library s Collection Metadata Strategy Page 1 of 8

Introduction to Digital Workflow Ticketing

Archival Data Format Requirements

Technical concepts of kopal. Tobias Steinke, Deutsche Nationalbibliothek June 11, 2007, Berlin

ERA Challenges. Draft Discussion Document for ACERA: 10/7/30

IFI Irish Film Archive Digital Preservation & Access Strategy

Migration of Controlled. Compliant SharePoint Document Management Systems. Presented by: Joe Lucadamo

Go Paperless. Save time and money by eliminating filing cabinets and organizing files electronically.

Software Requirements Specification vyasa

Bradford Scholars Digital Preservation Policy

Plagiarism. Dr. M.G. Sreekumar UNESCO Coordinator, Greenstone Support for South Asia Head, LRC & CDDL, IIM Kozhikode

Digital Rights Management - The Difference Between DPM and CM

DSpace: An Institutional Repository from the MIT Libraries and Hewlett Packard Laboratories

KM COLUMN. what is a content management system? JUNE CMS: A working definition. The business problem. Business benefits

The Benefit of Experience: the first four years of digital archiving at the National Archives of Australia

Chapter 6: Developing a Proper Audit Trail for your EBS Environment

Swarthmore College Libraries Digital Collection Development Policy

Overview of NDNP Technical Specifications

Clinical Data Management (Process and practical guide) Nguyen Thi My Huong, MD. PhD WHO/RHR/SIS

Feel the difference PRODUCT- AND PRICE CATALOGUE, 2015

A guide to the lifeblood of DAM:

Do you know how your TSM environment is evolving?

Evaluating File Formats for Long-term Preservation

Introduction to Research Data Management. Tom Melvin, Anita Schwartz, and Jessica Cote April 13, 2016

Introduction. 1. Name of your organisation: 2. Country (of your organisation): Page 2

Improved document archiving speeds; data enters the FileNexus System at a faster rate! See benchmark test spreadsheet.

Archival of Digital Assets.

Working With Templates in Web Publisher. Contributed by Paul O Mahony Developer Program

STATE OF NEBRASKA STATE RECORDS ADMINISTRATOR DURABLE MEDIUM WRITTEN BEST PRACTICES & PROCEDURES (ELECTRONIC RECORDS GUIDELINES) OCTOBER 2009

Why enterprise data archiving is critical in a changing landscape

Format OCR ICR. ID Protect From Vanguard Systems, Inc.

Two new DB2 Web Query options expand Microsoft integration As printed in the September 2009 edition of the IBM Systems Magazine

Harvard Library Preparing for a Trustworthy Repository Certification of Harvard Library s DRS.

How To Migrate From Eroom To Sharepoint From Your Computer To Your Computer

Portico Digital Preservation Service: Helping libraries transition from print to e-resources

Adding Robust Digital Asset Management to Oracle s Storage Archive Manager (SAM)

Preserving digital data - risk assessment and digital preservation strategies

HOW ARKIVUM USES LTFS TO PROVIDE ONLINE ARCHIVING AS A SERVICE

How To Build A Connector On A Website (For A Nonprogrammer)

Consulting Services. Kiersted is proud to provide a. wide range of legal technology and. consulting services to meet the

Standards Development. PROS 14/00x Specification 3: Long term preservation formats

Archiving digital documents and s in PDF/A

GFI LANguard 9.0 ReportPack. Manual. By GFI Software Ltd.

Exhibit 1 to March 12, 2015 Letter to Joint Chairmen

E-Content Service Group Virtual Meeting. Digital Preservation: How to Get Started

Shared service components infrastructure for enriching electronic publications with online reading and full-text search

WMO Climate Database Management System Evaluation Criteria

Recovering from a System Crash

InfoCenter Suite and the FDA s 21 CFR part 11 Electronic Records; Electronic Signatures

ALTERNATIVE DATA STORAGE:

INSTALLATION GUIDE Datapolis Process System v

Digital Assets Repository 3.0. PASIG User Group Conference Noha Adly Bibliotheca Alexandrina

EventTracker: Configuring DLA Extension for AWStats Report AWStats Reports

The Australian War Memorial s Digital Asset Management System

Digital Asset Management

ECM Migration Without Disrupting Your Business: Seven Steps to Effectively Move Your Documents

The Key Elements of Digital Asset Management

The Spectrum of Data Integration Solutions: Why You Should Have Them All

Data Management Plans & the DMPTool. IAP: January 26, 2016

Oracle Database. How To Get Started. April g Release 2 (10.2) for or IBM z/os (OS/390) B

TIBCO Spotfire and S+ Product Family

Realizing business flexibility through integrated SOA policy management.

SINTERO SERVER. Simplifying interoperability for distributed collaborative health care

The Evolution of PACS Data Migration

Digital Preservation Guidance Note: Selecting File Formats for Long-Term Preservation

Title: Harnessing Collaboration: SharePoint and Document Management

Digital Archiving at the National Archives of Australia: Putting Principles into Practice

The legal admissibility of information stored on electronic document management systems

BUSINESS FOCUSED EXCHANGE REPORTING. Guide To Successful Microsoft Exchange Reporting & Analysis

BLUESKIES. Microsoft SharePoint and Integration with Content Management Platforms. FileHold - Providing Advanced Content Management Functionality

Planning and Infrastructure for Analog to Digital Preservation Projects

2/24/2010 ClassApps.com

CDISC Journal. Using CDISC ODM to Migrate Data. By Alan Yeomans. Abstract. Setting the Scene

Defining Digital Forensic Examination and Analysis Tools Using Abstraction Layers

METADATA: EXAMINING. Records Managers. Its Role in E-Discovery. Julie Gable, CRM, CDIA, FAI

How To Use Access Online

DIGITAL PRESERVATION AT THE U.S. GOVERNMENT PRINTING OFFICE: WHITE PAPER. Version July 2008 UNITED STATES GOVERNMENT PRINTING OFFICE

Papermule Workflow. Workflow and Asset Management Software. Papermule Ltd

EventTracker: Configuring DLA Extension for AWStats report AWStats Reports

How To Create Employee Review Documents In Peoplesoft

Secstate: Flexible Lockdown, Auditing, and Remediation

Putting it all together Using technology to drive tax business processes

In this Lecture you will Learn: Implementation. Software Implementation Tools. Software Implementation Tools

Protecting Business Information With A SharePoint Data Governance Model. TITUS White Paper

Transcription:

Automated Workflow for the Ingest and Preservation of Electronic Journals Evan Owens Chief Technology Officer Portico evan.owens@portico.org

Portico Background Trusted third party archiving solution Funding from libraries, publishers, foundations, etc. Initial pilot described at Archiving 2004 Business and access model heavily revised Launched to publishers in September 2005 Launched to libraries in January 2005 13 publishers signed; more to follow soon ~3500 titles; ~7M articles See www.portico.org for latest information 25 May 2006 Archiving 2006 2

Electronic Journals and Digital Preservation Journal publishing models are evolving Publishing practice varies: Print only, E-only, both More / less / same in each edition E-product varies: HTML Header & PDF HTML Full-text with links and supplemental stuff & PDF HTML only A work with multiple manifestations XML or SGML source files Print PDF used to drive printing press Web PDF optimized for online delivery HTML header or full text Often generated from XML or SGML source 25 May 2006 Archiving 2006 3

Portico Archival Strategy for E-Journals Source file archiving Preserve the components not the rendition Include high-resolution files (PDF and figures) if available All e-only components (data, media, etc.) SGML / XML structured text by preference HTML as last resort Preserve intellectual content not look and feel of HTML HTML renditions are an artifact of current (and past) technology Often dynamically generated Fragile technology, overdue for change Preserve only essential features of the user interface Reference linking, other content-based features Not generic navigation or search or e-commerce features 25 May 2006 Archiving 2006 4

Electronic Journal Data Issues Inputs Whatever the publisher wants to send us However the publisher wants to send it to us Per article: one text or metadata file, zero or more other files Arbitrary (publisher-specific) collections of data Proprietary file & directory naming conventions Proprietary formats Undocumented business rules hidden in the data Output to Archive Normalized content (pre-emptive migration of proprietary formats) Automated metadata capture/generation: technical, descriptive, events Packaged in Portico s variant of METS 25 May 2006 Archiving 2006 5

Process Overview 25 May 2006 Archiving 2006 6

Automated Processing for E-Journal Content (high-level summary) Check for Viruses Verify Format Resolve File References Verify Checksums Apply Exclusion Rules Establish Unit Identity Extract Descriptive MD Is there a Layer? No Normalize Files (Based on Policies) Generate Technical MD Yes Extract File References Ready For QC Remove Layer 25 May 2006 Archiving 2006 7

PublisherA 0008543X 2006 106 8 CNCR21779 21779_ftp.pdf 21779_ftp.sgm equation aueq001.tif aueq002.tif nueq001.gif nueq002.gif image_m mfig001.jpg image_n nfig001.jpg image_t tfig001.gif 25 May 2006 Archiving 2006 8

Content Unit (Article) Text: Marked Up Text 21779_ftp.sgm Rendition: Page Images 21779_ftp.pdf Component: Formula Graphic aueq001.tif nueq001.gif Component: Formula Graphic aueq002.tif nueq002.gif Component: Figure Graphic mfig001.jpg nfig001.jpg tfig001.gif 25 May 2006 Archiving 2006 9

the following statistical model was fitted to the data: <UEQN NAME="ueq002" LOC="FIXED"></ UEQN> in which <I>T</I> &equals; 1 if Grade 3 or 4 neutropenia was present The overall survival for all patients is illustrated in Figure <FIGR HREF="fig1">1</ FIGR>. <FIG ID="fig1" LOC="FLOAT"><GRAPHIC NAME="fig001"></GRAPHIC><NUMBER>1</NUMBER> <CAPTION><P>Overall survival for 160 eligible and evaluable patients with recurrent solid tumors who were enrolled on Children&apos;s Cancer Group Study 0962.</P></CAPTION></FIG> 25 May 2006 Archiving 2006 10

25 May 2006 Archiving 2006 11

Automated Processing after QC (for all content types) Calculate Checksums Release to Archive Generate Portico METS Cleanup ConPrep Validate Portico METS End 25 May 2006 Archiving 2006 12

Archive Ingest Processing Check Agreement ID Validate Checksums Check Format and Preservation Level Validate Asset Inventory Add Ingest Event to Portico METS Load into Archive End 25 May 2006 Archiving 2006 13

Some Critical Issues Content isn t perfect Must have policies and workflow for invalid data There are degrees of badness Strict format validity does not equate to usefulness or usability E.g., Well-formed but not valid PDF E.g., Valid PDF with bad embedded font E.g., Invalid JPEG Content creation practices change over time Publishers (content providers) aren t consistent Or don t warn you that they are changing something Defensive programming required Software isn t perfect Assume that there will be internal failures Reversibility and audit trail are essential Portico Tool Registry and events metadata 25 May 2006 Archiving 2006 14

A Sample Tool Event <EventTransformedFile Timestamp="2006-05-22T11:39:46.830-04:00"> <Tool> <ToolInfo> <RunDate>2006-05-22T11:39:46.150-04:00</RunDate> <ToolWrapper> <RegisteredName>BepressTransformTool:1.0:2006-04-21</RegisteredName> <RuntimeEnv>Java:Sun Microsystems Inc.:1.5.0_04:SunOS:sparc:5.9</RuntimeEnv> <DependentLibSet> <DependentLib>bepress.xsl</DependentLib> <DependentLib>insert-titles.xsl</DependentLib> <DependentLib>insert-portico-doctype.xsl</DependentLib> <DependentLib>nlmpub2_1_to_ptc2_0.xsl</DependentLib> <DependentLib>porticoCommon_1_1.xsl</DependentLib> <DependentLib>gentext.xsl</DependentLib> </DependentLibSet> </ToolWrapper> <VendorTool> <VendorToolName>com.icl.saxon.TransformerFactoryImpl</VendorToolName> <RuntimeEnv>Java:Sun Microsystems Inc.:1.5.0_04:SunOS:sparc:5.9</RuntimeEnv> </VendorTool> </ToolInfo> </Tool> </EventTransformedFile> 25 May 2006 Archiving 2006 15

Envoi Digital preservation as interoperability with the future Let s test now while we can still recover If we can t move content from party to party today, Why do we think we will be able to in the future? Data exchange is valuable To both parties There is knowledge transfer as well as data transfer Standards are only standard when practiced, in use Robust electronic content production and content management systems will help to make preservation easier and cheaper We have to get ahead of the problem, not just clean up the mess afterwards 25 May 2006 Archiving 2006 16