veraPDF: Building the definitive PDF/A validator
The European Union's PREFORMA project

Boris Doubrov (Dual Lab) and Duff Johnson (PDF Association), on behalf of the veraPDF consortium; slightly changed by H-J Hübner, PDF Association

Why PDF/A?
- Traditional archival methods are becoming dated; more documents become final without ever being printed (e-akte?)
- As a document format, PDF is far superior to TIFF, and it accommodates raster images of pages equally well
- As a preservation format, PDF/A restricts PDF only as needed to address the specific issues affecting reproducibility
- PDF/A is the platform on which to build solutions for creating, retaining, sharing, authenticating and using reliable, high-quality, high-value electronic documents

Why does validating PDF/A matter?
- Validation is crucial to the archiving mission
- Numerous regulated industries (nuclear, aerospace, engineering, financial services, legal) demand a means of assessing compliance with regulated practices
- In the real world: if PDF/A software disagrees with other PDF/A software, what's a user to do?
- The PDF/A flag is only metadata; validation is laborious
- PDF/A is complex; more on that later

Why veraPDF?
- PDF/A, and the standards on which PDF/A is based, are themselves complex and therefore imperfect
- When there's not much revenue in validation per se, widespread adoption demands open source solutions
- Industry acceptance requires that no commercial entity can define conformance with PDF/A
- A definitive validator guarantees the means of interoperability (it's PDF's holy grail, after all!)

veraPDF's target audience
- ISO committee and PDF Association members
- PDF software developers (from Adobe Systems to zpdf)
- ECM / DMS / CCM industry product managers
- Digital preservation practitioners
- Government and industry regulators
- Procurement organizations
- IT researchers
- End users checking PDF/A conformance at verapdf.org

Business case
- For organizations creating and using documents
  - Reduced operational risks and legal liabilities
  - More capable, more interoperable software, enabling
    - Easier and more feature-rich document sharing and management
    - Process automation
    - Archival of rich engineering content
    - Enhanced accessibility
- For PDF technology developers
  - Reduced software development costs
  - Reduced support costs

The veraPDF consortium
- Digital preservationist community
  - Lead: Open Preservation Foundation
  - Digital Preservation Coalition
  - KEEP Solutions
- PDF technology industry
  - Lead: PDF Association and members of its PDF Validation TWG
  - Dual Lab (veraPDF lead developer)

Why PDF/A validation is so special
- Relies on two different versions of PDF (1.4, 1.7), with specifications 1000+ pages long
- PDF/A comes in three versions, three levels and two technical corrigenda for PDF/A-1
- The PDF/A standard aims for long-term (100+ years!) preservation of electronic documents
- Needs more formal analysis of the requirements

A way to start: test corpora
- Isartor, Bavaria, BFO: 323 atomic, self-documented test files
- Initial test case analysis shows about 500 extra files needed for proper test coverage of all PDF/A versions and levels
  - subject to adjustment based on real-world cases and practical considerations
  - for comparison: the W3C corpus for XML 1.1 (specified on 38 pages) contains over 600 test files

How test files are created
Sample clause of the PDF/A-1 specification:
- ISO 19005-1:2005, 6.1.6 String objects: Hexadecimal strings shall contain an even number of non white-space characters, each in the range 0 to 9, A to F or a to f.
- ISO 19005-1:2005, 3.17 white-space character: NULL (00h), HORIZONTAL TABULATION (09h), LINE FEED (0Ah), FORM FEED (0Ch), CARRIAGE RETURN (0Dh) or SPACE (20h) character
Test cases identified:
- Odd number of characters 0-9, A-F, a-f (fail)
- Non white-space characters outside the range 0-9, A-F, a-f (fail)
- Even number of 0-9, A-F, a-f; all possible white-space characters (pass)
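
The three test cases follow directly from the quoted clauses, and the check itself can be sketched in a few lines. This is only an illustration of clauses 6.1.6 and 3.17 as quoted on the slide, not the veraPDF implementation; the function name `is_valid_hex_string` is hypothetical, and the input is assumed to be the bytes between the `<` and `>` delimiters of a hexadecimal string.

```python
# White-space characters as listed in ISO 19005-1:2005, 3.17.
WHITE_SPACE = frozenset(b"\x00\x09\x0a\x0c\x0d\x20")
# Characters allowed by ISO 19005-1:2005, 6.1.6.
HEX_DIGITS = frozenset(b"0123456789ABCDEFabcdef")


def is_valid_hex_string(body: bytes) -> bool:
    """Check the bytes between '<' and '>' of a hexadecimal string.

    Hypothetical helper: white-space is ignored, every remaining byte
    must be a hex digit, and the number of digits must be even.
    """
    digits = [b for b in body if b not in WHITE_SPACE]
    if any(b not in HEX_DIGITS for b in digits):
        return False
    return len(digits) % 2 == 0


# One input per test case identified on the slide.
print(is_valid_hex_string(b"486"))               # odd number of digits
print(is_valid_hex_string(b"48 6G"))             # character outside the range
print(is_valid_hex_string(b"48 65\n6C\t6C 6F"))  # even count, mixed white-space
```

The first two calls correspond to the "fail" cases and the third to the "pass" case.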

How test files are created
Test files created (QA engineer):
- veraPDF test suite 6-1-6-t01-fail-a.pdf
- veraPDF test suite 6-1-6-t01-fail-b.pdf
- veraPDF test suite 6-1-6-t01-pass-a.pdf
A GitHub pull request is submitted (QA engineer)
The file is merged to the master branch after the internal review (Architect): https://github.com/verapdf/verapdf-corpus-pdfa-1b
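
The file names appear to encode the clause, a test number, the expected outcome and a variant letter. A minimal sketch of reading that information back, assuming the pattern inferred from the three names above holds for the rest of the corpus (the pattern, `CorpusFileInfo` and `parse_test_file_name` are all hypothetical):

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred only from the three file names on the slide:
# <clause with dashes>-t<test number>-<pass|fail>-<variant letter>.pdf
NAME_PATTERN = re.compile(
    r"(?P<clause>\d+(?:-\d+)*)-t(?P<test>\d+)"
    r"-(?P<outcome>pass|fail)-(?P<variant>[a-z])\.pdf$"
)


class CorpusFileInfo(NamedTuple):
    clause: str          # e.g. "6.1.6"
    test_number: int     # e.g. 1
    expected_pass: bool  # True for "pass", False for "fail"
    variant: str         # e.g. "a"


def parse_test_file_name(name: str) -> Optional[CorpusFileInfo]:
    """Return the information encoded in a test file name, or None."""
    match = NAME_PATTERN.search(name)
    if match is None:
        return None
    return CorpusFileInfo(
        clause=match["clause"].replace("-", "."),
        test_number=int(match["test"]),
        expected_pass=match["outcome"] == "pass",
        variant=match["variant"],
    )


print(parse_test_file_name("veraPDF test suite 6-1-6-t01-fail-a.pdf"))
```

A validator test harness could use the `expected_pass` field to compare its own verdict with the verdict the file documents about itself.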

Acceptance process
- In case the specification allows several interpretations, or when a representative set of existing validators generates mixed results:
  - Discuss with industry experts on the PDF Validation TWG mailing list
  - Review and resolve discussions during TWG meetings + ISO Committee
- The corpus is announced as a release candidate (GitHub)
- Formal acceptance by PDF Validation TWG
- The next milestone: July 15, 2015 (release candidate of the enhanced PDF/A-1b corpus)

Examples of TWG discussions
- Private data: does it need to be validated for File Structure requirements?
- Which glyphs shall have widths synchronized with embedded font data?
- What is the length of a Name, which is subject to implementation limits?
- Is it a violation of PDF/A-2 and PDF/A-3 requirements if a Page object does not contain a Resources dictionary, but inherits resources from its parent Pages object?

Validation extents
- The PDF/A specification refers to the PDF specifications, which rely on a number of external standards.
- How far shall PDF/A validation go to guarantee the reliable long-term preservation of PDF documents?
- Validity of embedded fonts, ICC profiles, font CMaps, XMP metadata, image compression and digital certificates is crucial!
- Collaboration with experts from relevant technologies is necessary for complete coverage

Validation model
- Hierarchy of object types
- Inheritable properties and named links to other object types
- Validation rules defined per object type (inheritable) by boolean expressions in JavaScript
- Each rule has metadata including the ID, description, normative references, error message
- Rules are combined into validation profiles and signed
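
The model described above can be sketched roughly as follows. This is an illustration only: Python callables stand in for the JavaScript boolean expressions, profile signing is left out, and the class names `PdfObject`, `Rule` and `Profile` are hypothetical rather than taken from the veraPDF code base.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Type


class PdfObject:
    """Root of the (hypothetical) object type hierarchy."""


class CosObject(PdfObject):
    """Parent of all basic PDF object types."""


class CosArray(CosObject):
    def __init__(self, values):
        self.values = list(values)

    @property
    def size(self) -> int:
        return len(self.values)


@dataclass
class Rule:
    rule_id: str
    description: str
    reference: str
    error_message: str
    test: Callable[[PdfObject], bool]  # stands in for the JavaScript expression


@dataclass
class Profile:
    rules_by_type: Dict[Type[PdfObject], List[Rule]] = field(default_factory=dict)

    def add(self, object_type: Type[PdfObject], rule: Rule) -> None:
        self.rules_by_type.setdefault(object_type, []).append(rule)

    def rules_for(self, obj: PdfObject) -> List[Rule]:
        # Walk the class hierarchy so rules on a parent type are inherited.
        rules: List[Rule] = []
        for cls in type(obj).__mro__:
            rules.extend(self.rules_by_type.get(cls, []))
        return rules

    def validate(self, obj: PdfObject) -> List[str]:
        """Return the error messages of all rules the object fails."""
        return [r.error_message for r in self.rules_for(obj) if not r.test(obj)]


profile = Profile()
profile.add(CosArray, Rule(
    rule_id="1a-6-1-12-r05",
    description="Maximum capacity of array in elements less than 8192",
    reference="ISO 19005-1:2005, 6.1.12",
    error_message="Capacity of array greater than 8191",
    test=lambda array: array.size < 8192,
))
print(profile.validate(CosArray(range(8192))))
```

Because `rules_for` walks the class hierarchy, a rule registered for `CosObject` would also apply to every `CosArray`, which is one way to read "inheritable" on the slide.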

PDF model in Xtext syntax

    # a single root type
    type Object {};

    # parent object for all basic object types in PDF
    type CosObject extends Object {};

    # the object representing PDF Array type
    type CosArray extends CosObject {
        property size: Integer;
        link values: CosObject*;
    };

    # parent object for all content stream operators
    type Operator extends Object {};

    # Operator q (save graphics state)
    type qoperator extends Operator {
        property nestinglevel: Integer;
    };

XML syntax for validation rules

    <rule id="1a-6-1-12-r05" object="CosArray">
      <description>Maximum capacity of array in elements less than 8192</description>
      <test>size &lt; 8192</test>
      <error>
        <message>Capacity of array greater than 8191</message>
      </error>
      <reference>
        <specification>ISO 19005-1:2005</specification>
        <clause>6.1.12</clause>
      </reference>
      <reference>
        <specification>PDF Reference 1.4</specification>
        <clause>Table C.1</clause>
      </reference>
    </rule>
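
To make the rule format concrete, here is a minimal sketch of loading such a rule and applying it to an object's properties. It is not how veraPDF evaluates rules: the slides state that tests are JavaScript expressions, whereas this sketch only understands tests of the form `<property> <operator> <integer>`. The function `check` and the plain dictionary of properties are assumptions made for the example. Note that `<` inside the `test` element has to be written as `&lt;` for the rule to be well-formed XML.

```python
import operator
import re
import xml.etree.ElementTree as ET

RULE_XML = """
<rule id="1a-6-1-12-r05" object="CosArray">
  <description>Maximum capacity of array in elements less than 8192</description>
  <test>size &lt; 8192</test>
  <error>
    <message>Capacity of array greater than 8191</message>
  </error>
</rule>
"""

OPERATORS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}
# Deliberately tiny expression grammar; real rules are JavaScript.
SIMPLE_TEST = re.compile(r"^\s*(\w+)\s*(<=|>=|==|!=|<|>)\s*(-?\d+)\s*$")


def check(rule: ET.Element, properties: dict):
    """Return None if the rule passes, else the rule's error message."""
    match = SIMPLE_TEST.match(rule.findtext("test", default=""))
    if match is None:
        raise ValueError("only '<property> <op> <integer>' tests are supported")
    name, op, limit = match.groups()
    if OPERATORS[op](properties[name], int(limit)):
        return None
    return rule.findtext("error/message")


rule = ET.fromstring(RULE_XML.strip())
print(check(rule, {"size": 3}))
print(check(rule, {"size": 8192}))
```

The first call passes the rule and yields no message; the second exceeds the limit and yields the error message defined in the rule.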

PDF Parser implementation
- The PDF parser implements the interfaces generated automatically from the PDF model syntax
- The proof of concept is implemented using PDFBox, an open-source Java library
- Due to the licensing requirements of PREFORMA, all code of the PDF/A Conformance Checker shall be delivered under the dual license GPLv3 + MPLv2
- As PDFBox is Apache-licensed, PREFORMA requested a greenfield implementation of a PDF parser

Development processes
- GitHub is the repository for both code and test corpora: https://github.com/verapdf
- Travis manages the continuous builds: https://travis-ci.org/verapdf
- Jenkins manages automated tests and deployment: http://jenkins.opf-labs.org/job/verapdf-library/
- Sonar monitors code quality: http://sonar.opf-labs.org/dashboard/index/8021

Minimum Viable Product (MVP) demo
- Clone the Java repository
- Build the veraPDF library and the veraPDF GUI client
- Check the location of test files and validation profiles
- Run the veraPDF client, selecting any PDF file and any validation profile
- Generate the machine-readable report (XML)
- Convert it to HTML and view it in a system browser

Report generation
- Machine-readable reports are generated in XML format
- Error messages are defined in validation profiles
- Conversion to HTML via XSLT
- Conversion to PDF via XSLT to XSL-FO, then to PDF with the Apache FOP processor
- Localization of the final report messages via the standard TMX translation syntax
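
The XML-to-HTML step can be illustrated with a small sketch. Two things here are assumptions: the report layout (`report`, `check`, `message`, the file name and the second rule ID) is invented for the example and is not the veraPDF report schema, and plain Python stands in for XSLT because the Python standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical report layout; the real veraPDF report schema may differ.
REPORT_XML = """
<report file="sample.pdf" profile="PDF/A-1b">
  <check rule="1a-6-1-12-r05" status="failed">
    <message>Capacity of array greater than 8191</message>
  </check>
  <check rule="1a-6-1-6-r01" status="passed"/>
</report>
"""


def report_to_html(report_xml: str) -> str:
    """Render the hypothetical XML report as a simple HTML table."""
    root = ET.fromstring(report_xml.strip())
    rows = []
    for check in root.iter("check"):
        message = check.findtext("message", default="")
        rows.append(
            "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
                escape(check.get("rule", "")),
                escape(check.get("status", "")),
                escape(message),
            )
        )
    title = escape("{} ({})".format(root.get("file", ""), root.get("profile", "")))
    return (
        "<html><body><h1>{}</h1><table>"
        "<tr><th>Rule</th><th>Status</th><th>Message</th></tr>{}"
        "</table></body></html>"
    ).format(title, "".join(rows))


print(report_to_html(REPORT_XML))
```

Keeping the error messages in the XML report, rather than in the rendering code, is what allows the same report to be rendered as HTML or PDF and localized afterwards.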

Other components
- Metadata Fixer
  - Fix any incorrect claims of specific PDF/A validity
  - Add PDF/A identification to an otherwise conforming PDF/A document
- Policy Checker
  - Generate a PDF Features Report with page count, lists of embedded fonts and images, and other information
  - Verify the Report against Policy requirements in Schematron syntax
- Plug-in infrastructure
  - Third-party tools can perform additional validation of embedded data
  - Interoperability with other PREFORMA suppliers
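
The "PDF/A identification" that the Metadata Fixer corrects or adds lives in the document's XMP metadata, as the `pdfaid:part` and `pdfaid:conformance` properties. A minimal sketch of reading that claim from an XMP packet is shown below; it is not the Metadata Fixer itself, the function name `read_pdfa_claim` is hypothetical, and the packet is assumed to have already been extracted from the PDF as a string.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"

SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
   <pdfaid:part>1</pdfaid:part>
   <pdfaid:conformance>B</pdfaid:conformance>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def read_pdfa_claim(xmp_packet: str):
    """Return (part, conformance) claimed in the XMP, or None if absent."""
    root = ET.fromstring(xmp_packet)
    values = {}
    for description in root.iter("{%s}Description" % RDF_NS):
        for key in ("part", "conformance"):
            qname = "{%s}%s" % (PDFAID_NS, key)
            # XMP allows a simple property as an attribute or a child element.
            value = description.get(qname)
            if value is None:
                value = description.findtext(qname)
            if value is not None:
                values[key] = value.strip()
    if "part" not in values:
        return None
    return values["part"], values.get("conformance")


print(read_pdfa_claim(SAMPLE_XMP))
```

Comparing this claim with the actual validation result is what decides whether the claim is incorrect and should be removed, or missing and may be added.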

Timeline
- 2015-04-14 Start of Phase 2 (implementation)
- 2015-06-08 First MVP released
- 2015-07-15 Prototype PDF/A-1b support
- 2015-10-30 Release candidate PDF/A-1b support, prototype PDF/A-1a, 2, 3
- 2016-02-28 Release candidate PDF/A-1a, 2, 3 support
- 2016-06-30 TWG approval for PDF/A-1b corpora
- 2016-12-20 TWG approval for PDF/A-1a, 2, 3 corpora
- 2016-12-31 End of Phase 2
- 2017-01 Beginning of Phase 3 (acceptance testing)
- 2017-06 End of Phase 3 (final delivery)

Get involved!
- Join the PDF Association
- Join the PDF Validation Technical Working Group
- Download the candidate test suite files from GitHub
- Share your test files
- Test the code posted on GitHub and report issues
- Develop plug-ins for validating embedded data
- Think about how you could use definitive open source PDF/A validation