A Digital Library Feasibility Study C. Henshaw, D. Thompson, M. Savage-Jones Wellcome Library London, UK LIBER Annual Conference Aarhus, Denmark June 2010
Introduction 1. Who we are 2. Vision and strategy 3. Aims of the Feasibility Study 4. Methodology 5. Outcomes 6. Concluding remarks
Who we are One of the world's major resources for the study of medical history Modern and historical collections Special collections: archives, manuscripts, artworks, audio/visual Digitised collections, picture library
Vision and strategy Transformation of the Wellcome Library Ambitious plan to digitise up to 30m images over 5+ years. Mainly historic collections, but also modern (subject to copyright clearance). Almost entirely internally funded. Wellcome Digital Library Phase 1: 2010-2012, infrastructure and pilot digitisation 2012: seek remainder of funding for main programme Main programme: 2012 onward, digitise all suitable content
Strategy Phase 1: 2010-2012 Build a sustainable and expandable mechanism for creating, storing and delivering data: the foundation stone of the future WDL; Digitise key library holdings relating to one of the Trust's major challenges as set out in the strategic plan (Modern Genetics and its Foundations); Fund the digitisation of important third party content which complements our holdings; Use innovative web tools to encourage discovery and use of these collections; Explore commercial partnerships for cost-effective digitisation of other parts of our collection.
Digitisation Phase 1: 2010-2012 Archival records: Crick, Sanger, etc., 500,000 images Printed books: 1,400 genetics-related books from 1850-1990 Commercial partnership: to be determined! External content: Will identify relevant external collections and fund digitisation; content will be ingested into the Wellcome Digital Library
Infrastructure We do not have the infrastructure required to create an integrated digital library providing a seamless interface across catalogues and digital collections. Digital Library requires several components: 1. Search and Discovery (Encore, catalogues) 2. Digital Delivery system 3. Digital Asset Management system (SDB) 4. Full-text index 5. Workflow system for managing digitisation and ingest
Aims of the Feasibility Study To answer some questions around the infrastructure, primarily: SDB (Safety Deposit Box - preservation system for born digital) Q: Can it be used for large-scale digitisation? Delivery system Q: How does it fit into the system architecture, what do we need to look for, what are the options? METS Q: Is METS really the way to go, do we need a Profile, how will we create METS files? Full-text index Q: How is an index constructed, what do we need, how does it fit into the system architecture? Workflow system Q: What are the key requirements, what are the options?
Methodology SDB: Commissioned the suppliers Tessella to investigate how SDB could be used as a DAM for the digital library. 1. Report: recommend modifications to SDB to meet requirements. 2. Proof-of-concept demonstration: to demonstrate the capabilities of SDB to ingest and manage digitised content and to make that content available to a 3rd party system.
Methodology CCS: Commissioned CCS (Content Conversion Systems) to provide information and recommendations: 1. Report: recommendations for implementation of METS, full-text index, and front-end delivery, including conversion of JPEG 2000 on-the-fly. 2. Proof-of-concept demonstration: using CCS's Veridian system, to demonstrate ability to request and retrieve content from SDB. Did NOT look at front-end design, Web 2.0, authentication.
Methodology Workflow system: Research carried out in-house. Came to realise that ad-hoc tracking and project monitoring systems were not suitable for large-scale digitisation and ingest of this content into the DAM. Workflow system requirements: Track and monitor digitisation and ingest activities; Aggregate metadata; Output XML as METS and other required formats. Open question: how will the system fit into the overall infrastructure?
Outcomes - SDB 1. PoC successfully ingested JPEG 2000 files Used a mocked up SIP containing content and metadata for a Logical Object (book, file of letters, etc.) Characterised the JPEG 2000 on ingest using JHOVE, adding administrative metadata to SDB Does not currently characterise audio/visual formats, but this can be added in at any time with a tool such as MediaInfo 2. PoC successfully delivered content to Veridian Accepted a remote request for content and was able to pass that content to the remote system (Veridian)
Outcomes - SDB/Veridian interop (flowchart) 1. Veridian requests image. 2. Image in cache? If yes, go to step 10. 3. If no, Veridian submits webservices request to SDB. 4. SDB informs Veridian call was received. 5. SDB processes request and makes image available to ftp server. 6. Request callback is sent. 7. Veridian downloads source image from ftp server. 8. JPEG2000 converted to JPEG. 9. JPEG copied to cache. 10. Veridian displays image for end user.
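The request flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `DeliveryCache` and the helpers `fake_sdb_fetch` and `fake_convert` are hypothetical stand-ins, not part of SDB or Veridian, and the web-service call, callback and FTP download are collapsed into a single fetch function.

```python
from typing import Callable, Dict


class DeliveryCache:
    """Hypothetical stand-in for the delivery system's cache-or-fetch logic."""

    def __init__(self,
                 fetch_source: Callable[[str], bytes],
                 convert: Callable[[bytes], bytes]) -> None:
        # fetch_source stands in for: web-service request, callback, FTP download
        self._fetch_source = fetch_source
        # convert stands in for JPEG 2000 -> JPEG conversion
        self._convert = convert
        self._cache: Dict[str, bytes] = {}
        self.source_requests = 0  # how often the DAM had to be asked

    def get_image(self, image_id: str) -> bytes:
        # Image in cache? Yes: serve it straight away.
        if image_id in self._cache:
            return self._cache[image_id]
        # No: request the source image, convert it, then cache the JPEG.
        self.source_requests += 1
        source = self._fetch_source(image_id)
        jpeg = self._convert(source)
        self._cache[image_id] = jpeg
        return jpeg


def fake_sdb_fetch(image_id: str) -> bytes:
    """Assumed placeholder returning pretend JPEG 2000 bytes."""
    return b"JP2:" + image_id.encode("ascii")


def fake_convert(jp2: bytes) -> bytes:
    """Assumed placeholder for the image conversion step."""
    return jp2.replace(b"JP2:", b"JPEG:", 1)
```

Requesting the same image twice through `get_image` should reach the source system only once; the second request is served from the cache.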
Further work - SDB 1. Investigate further Tessella's recommended modifications to SDB, including development of APIs for remote request, and new ingest workflows. 2. Compare the value of customising SDB to other potential DAM systems on the market. 3. Carry out a full tender for the system as appropriate.
Outcomes - Delivery system As neither the Library nor SDB has an appropriate delivery system on hand, CCS's Veridian system was used as part of the proof-of-concept. 1. Veridian successfully requested and retrieved content from SDB. 2. Veridian successfully converted JPEG 2000 files on-the-fly, using a limited cache. The Library will use JPEG 2000 archive and access files for all of its image content, so it was important to look at how on-the-fly conversion could be implemented and whether it was feasible. 3. On-the-fly conversion was slow: remote locations meant using the Internet, rather than an internal network as in a real-life situation.
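The slide does not say how the limited cache was managed. As one possible reading, the sketch below bounds the cache by item count and evicts the least recently used entry; both the eviction policy and the name `LimitedCache` are assumptions made for illustration, not a description of Veridian.

```python
from collections import OrderedDict
from typing import Optional


class LimitedCache:
    """Bounded cache with least-recently-used eviction (assumed policy)."""

    def __init__(self, max_items: int) -> None:
        if max_items < 1:
            raise ValueError("max_items must be at least 1")
        self._max_items = max_items
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None
        # Mark the entry as recently used so it is evicted last.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        # Drop the oldest entry once the limit is exceeded.
        if len(self._items) > self._max_items:
            self._items.popitem(last=False)
```

A cache sized in bytes rather than items may suit image delivery better; the item count is used here only to keep the example short.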
Further work - Delivery 1. Actual speed of content delivery to users to be tested further using the Wellcome's server and storage network. 2. Draw up complete specifications for delivery system. 3. Carry out a full tender for the system.
Outcomes - Full-text index We will OCR all printed text. CCS made some recommendations: 1. Use of ICR (Intelligent Content Structure) 2. Dictionaries, word lists, alternate spellings 3. Set up indexing profiles for different types of content 4. Solr, based on Lucene, is a suitable architecture for a large-scale word index 5. Consider transcribing hand-written content
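To show what a word index with alternate spellings does, here is a toy inverted index in plain Python. It is not Solr or Lucene and does not use their APIs; the spelling list `ALTERNATE_SPELLINGS`, the page identifiers and all function names are invented for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical alternate-spelling list mapping variants to one indexed form.
ALTERNATE_SPELLINGS = {"haemoglobin": "hemoglobin", "foetus": "fetus"}


def normalise(token: str) -> str:
    """Lower-case a token and fold known variant spellings together."""
    token = token.lower()
    return ALTERNATE_SPELLINGS.get(token, token)


def tokenise(text: str) -> List[str]:
    """Split OCR text into normalised alphabetic tokens."""
    return [normalise(t) for t in re.findall(r"[A-Za-z]+", text)]


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each term to the set of page identifiers that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokenise(text):
            index[term].add(page_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the pages containing every term in the query."""
    terms = tokenise(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Because both the indexed text and the query pass through `normalise`, a search for either spelling of a word finds pages using the other.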
Further work - Full-text index 1. Investigate further the recommended indexing options to improve search results. 2. Investigate further how the index could be accessed by Encore. 3. Draw up specifications for indexing solution. 4. Carry out tender for the final system as appropriate.
Outcomes - METS We expect to use METS as a wrapper for descriptive and administrative metadata. This would be used by the delivery system. CCS made some recommendations: 1. METS is a useful metadata format for this purpose. 2. The Wellcome should develop its own profile as a reference. 3. Use METS/ALTO for full-text content to provide a structure for the textual content. 4. Use MODS and MIX metadata standards in the METS.
Further work - METS 1. Ensure METS will indeed be implemented by chosen delivery system. 2. Determine what metadata standard(s) to use in the METS for descriptive and technical metadata. 3. Consider further the use of ALTO extension. 4. Finalise model for Wellcome METS profile.
Outcomes - Workflow system It became clear, through looking at existing workflow systems on the market, that metadata processing should be carried out by a separate system. 1. A workflow tracking system (WTS) should be implemented. 2. This should focus on tracking and managing processing of content and ingest. 3. A separate system, a metadata normaliser (MNS), should be implemented.
Outcomes - WTS key requirements 1. Allow projects to be managed on a Project, Batch and Unit level 2. Associate descriptive metadata with each unit using barcodes 3. Perform command line actions (such as converting images to JPEG 2000) where possible 4. Allow for flexibility in workflow steps for different workstreams 5. Store metadata in an industry-standard database
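The Project, Batch and Unit levels, with units identified by barcode and workflow steps defined per project, could be modelled roughly as follows. This is a minimal in-memory sketch, not a specification: the class and field names are hypothetical, and a real system would keep this data in a database, as requirement 5 states.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Unit:
    """One physical item (book, file of letters), identified by barcode."""
    barcode: str
    descriptive_metadata: Dict[str, str] = field(default_factory=dict)
    completed_steps: List[str] = field(default_factory=list)


@dataclass
class Batch:
    """A group of units processed together."""
    name: str
    units: List[Unit] = field(default_factory=list)


@dataclass
class Project:
    """A digitisation project with its own ordered workflow steps."""
    name: str
    workflow_steps: List[str]
    batches: List[Batch] = field(default_factory=list)

    def find_unit(self, barcode: str) -> Optional[Unit]:
        """Look a unit up by barcode across all batches."""
        for batch in self.batches:
            for unit in batch.units:
                if unit.barcode == barcode:
                    return unit
        return None

    def pending_steps(self, barcode: str) -> List[str]:
        """List the workflow steps a unit has not yet completed, in order."""
        unit = self.find_unit(barcode)
        if unit is None:
            raise KeyError(barcode)
        return [s for s in self.workflow_steps if s not in unit.completed_steps]
```

Keeping the step list on the project rather than hard-coding it is one way to allow different workstreams to follow different workflows.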
Outcomes - MNS key requirements 1. Separate system, but would utilise the same database as the WTS 2. Map descriptive metadata from the Library cataloguing systems to a set of database fields 3. Map administrative metadata from the DAM (file names and unique identifiers, etc.) 4. Aggregate all ingested metadata into a unified database 5. Output XML from the database using a number of templates (depending on type of content), e.g. METS
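Requirement 5, outputting METS from aggregated metadata, might look something like the sketch below, which builds a skeletal METS document with Python's standard `xml.etree.ElementTree`. The function `build_mets`, the identifiers and the choice of MODS for the descriptive section are assumptions for illustration (MODS was a CCS recommendation, not a settled decision), and the output has not been validated against the METS schema or any Wellcome profile.

```python
import xml.etree.ElementTree as ET
from typing import List

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
MODS_NS = "http://www.loc.gov/mods/v3"

ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)
ET.register_namespace("mods", MODS_NS)


def q(namespace: str, tag: str) -> str:
    """Build a namespace-qualified tag name for ElementTree."""
    return f"{{{namespace}}}{tag}"


def build_mets(object_id: str, title: str, file_names: List[str]) -> str:
    """Return a skeletal METS document for one logical object."""
    root = ET.Element(q(METS_NS, "mets"), {"OBJID": object_id})

    # Descriptive metadata: a MODS title wrapped in a dmdSec.
    dmd = ET.SubElement(root, q(METS_NS, "dmdSec"), {"ID": "DMD1"})
    wrap = ET.SubElement(dmd, q(METS_NS, "mdWrap"), {"MDTYPE": "MODS"})
    xml_data = ET.SubElement(wrap, q(METS_NS, "xmlData"))
    mods = ET.SubElement(xml_data, q(MODS_NS, "mods"))
    title_info = ET.SubElement(mods, q(MODS_NS, "titleInfo"))
    ET.SubElement(title_info, q(MODS_NS, "title")).text = title

    # File inventory and a physical structure map, one page per file.
    file_sec = ET.SubElement(root, q(METS_NS, "fileSec"))
    group = ET.SubElement(file_sec, q(METS_NS, "fileGrp"), {"USE": "access"})
    struct_map = ET.SubElement(root, q(METS_NS, "structMap"), {"TYPE": "PHYSICAL"})
    top = ET.SubElement(struct_map, q(METS_NS, "div"), {"DMDID": "DMD1"})

    for order, name in enumerate(file_names, start=1):
        file_id = f"FILE{order:04d}"
        file_el = ET.SubElement(group, q(METS_NS, "file"),
                                {"ID": file_id, "MIMETYPE": "image/jp2"})
        ET.SubElement(file_el, q(METS_NS, "FLocat"),
                      {"LOCTYPE": "URL", q(XLINK_NS, "href"): name})
        page = ET.SubElement(top, q(METS_NS, "div"),
                             {"TYPE": "page", "ORDER": str(order)})
        ET.SubElement(page, q(METS_NS, "fptr"), {"FILEID": file_id})

    return ET.tostring(root, encoding="unicode")
```

In practice the title and file names would come from the unified database rather than function arguments, and other templates would produce the other output formats the slide mentions.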
Further work - WTS and MNS 1. Complete specifications for WTS and MNS. 2. Investigate options for off-the-shelf and bespoke systems. 3. Carry out a full tender for the systems.
Concluding remarks Feasibility study gave us: 1. Far more understanding about how all the elements of a Digital Library fit together. 2. The tools to develop precise specifications for what we need to develop and/or procure. 3. Ability to start addressing the issues around gaps and dependencies in existing and new systems. 4. Insights into how to work with suppliers, particularly where multiple suppliers need to communicate with each other. 5. Plan of action to start actually developing our Digital Library