Austrian Books Online

Austrian Books Online: Google Books based mass digitisation. Stefan Majewski. OPF Hackathon, 2.12. - 4.12.2013, Austrian National Library, Vienna

Overview
The project
How the data is acquired, from carrying the book to storing the files
The delights and perils of mass digitisation
Some challenges
How to work with the data?
Data organisation

Austrian Books Online: The Project

Key Facts
Scope: 600,000 volumes, 200 million pages
Progress: 180,000 -> 5,500 / 3 weeks
Workforce: 20+ FTE -> 60+ P
Areas: Logistics, Metadata, Conservation, Download & QA, Online Presentation, Storage, PM

Material: legal deposit >> wide variety of material, from the 16th century to the 2nd half of the 19th century

Public Access: Google Books; Digital Library of the Austrian National Library

13 libraries in Europe, 5 national libraries: Italy, Austria, The Netherlands, Czech Republic, Great Britain

> 20 million books; > 50% non-English; ~ 75% from libraries; ~ 2 million books from European libraries; > 3 million books in the public domain

Digitisation of the entire historical book holdings of the Austrian National Library, 16th to 19th century

70+ staff members, 20+ exclusively for the project: book logistics, metadata adaptation, cataloguing, conservation / restoration, quality control, software implementation, project management

48.8 person-years

Austrian Books Online, centuries [pie chart]. Labels: 16th century, 17th century, 18th century, 19th century, no year. Values shown: 2%, 10%, 43%, 31%, 14%.

Austrian Books Online, languages [pie chart]. Labels: eng, ita, fre, lat, ger, others. Values shown: 3%, 8%, 13%, 31%, 14%, 31%.

Austrian Books Online [bar chart]. Languages (eng, ita, fre, lat, ger) by century (16th to 19th), y-axis 0% to 70%.

End of 2013: ~185,000 volumes digitised

ÖNB book viewer

52+ million pages, 1+ billion distinct terms

Information

Further volumes

Austrian Books Online: Delights and Perils

... and yet, varying quality

OCR: German

OCR: Latin

OCR: Hungarian

Example of Fraktur (poor quality): Dis ist das buch der wyszheit der alten wysen von geschlecht der welt.; Bidpai, person of antiquity or the Middle Ages; Straßburg: Grüninger; 1501 Hainrich; 1618

Austrian Books Online www.onb.ac.at/ev/austrianbooksonline/

Austrian Books Online: Working with the Data

Book logistics, Digitisation, Data download, ADOCO (Austrian Books Online Download & Control), Storage, QA, Access

Workflow in ADOCO
Download package via HTTP
Decrypt with gnupg
Unzip tarball
MD5 sum
Store to pairtree
Unified access pairtree (symlinks)
Update metadata
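
The checksum and unpack steps of this workflow can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the ADOCO implementation: the function names are hypothetical, the package is assumed to arrive with a known MD5 digest, and the HTTP download and gnupg decryption steps are left out.

```python
import hashlib
import shutil
import tarfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unpack_verified(package, expected_md5, target_dir):
    """Check the package checksum, then extract its regular files.

    Hypothetical helper: raises ValueError on a checksum mismatch or if
    an archive member would be written outside target_dir.
    """
    actual = md5_of_file(package)
    if actual != expected_md5.lower():
        raise ValueError(f"MD5 mismatch for {package}: {actual}")
    root = Path(target_dir)
    root.mkdir(parents=True, exist_ok=True)
    root = root.resolve()
    extracted = []
    with tarfile.open(package, "r:*") as archive:
        for member in archive.getmembers():
            destination = (root / member.name).resolve()
            # Refuse members that would escape the target directory.
            if not destination.is_relative_to(root):
                raise ValueError(f"unsafe path in archive: {member.name}")
            if member.isfile():
                destination.parent.mkdir(parents=True, exist_ok=True)
                with archive.extractfile(member) as source, destination.open("wb") as out:
                    shutil.copyfileobj(source, out)
                extracted.append(member.name)
    return extracted
```

Only regular files are written; directories are created as needed and links inside the archive are skipped, which is a cautious choice for packages received over the network.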

Volume
Average per volume (~book): 101 MB
101 MB * 600,000 = 60.6 TB, i.e. roughly 60 TB

Big data. Image courtesy of The University of Pennsylvania and Michel T. Huber, www.fi.edu

Data storage & access
Data storage: in-house
Data stored redundantly
Access copies generated on the fly

Download and storage
ADOCO
ABO NAS storage
Pairtree algorithm
approx. 60 TB
JPEG2000, hOCR, METS, TXT

Pairtree on the ABO NAS
Identifier: +Z156941203
Path: ^2/bz/15/69/41/20/3/abo/
Files: ONB_+Z156941203.xml, 00000001.html, 00000001.jp2, 00000001.txt
Reference: https://confluence.ucop.edu/display/curation/pairtree
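
The mapping from identifier to directory path can be sketched as follows. The cleaning rules are those of the Pairtree specification (hex-encode a fixed set of characters as ^hh, substitute three single characters, then split into two-character segments); the function names and the `/abo_nas` root are hypothetical, and `abo` as the object directory is taken from the slide. Note that this mapping preserves case and yields `bZ` as the second segment, whereas the slide shows `bz`.

```python
from pathlib import PurePosixPath

# Characters the Pairtree spec hex-encodes as ^hh before splitting.
HEX_ENCODED = set('"*+,<=>?\\^|')
# Single-character substitutions applied after hex-encoding.
SINGLE_CHAR_MAP = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_identifier(identifier):
    """Apply the two Pairtree cleaning steps to an identifier."""
    cleaned = []
    for byte in identifier.encode("utf-8"):
        char = chr(byte)
        if byte < 0x21 or byte > 0x7E or char in HEX_ENCODED:
            cleaned.append(f"^{byte:02x}")
        else:
            cleaned.append(char)
    return "".join(cleaned).translate(SINGLE_CHAR_MAP)


def pairtree_segments(identifier):
    """Split the cleaned identifier into segments of at most two characters."""
    cleaned = clean_identifier(identifier)
    return [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]


def volume_dir(identifier, root="/abo_nas", object_dir="abo"):
    """Build the directory of one volume; root is a made-up mount point."""
    return PurePosixPath(root).joinpath(*pairtree_segments(identifier), object_dir)


print(volume_dir("+Z156941203"))
```

Short, fixed-length segments keep the number of entries per directory small, which is the main reason for using this layout on a store with several hundred thousand volumes.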

Data organisation
METS (Metadata Encoding & Transmission Standard) http://loc.gov/standards/mets/
MARC/XML / MODS
PREMIS
GBS specific metadata
Images (JPEG2000)
OCR data: coordinated OCR, plain TXT

OCR sample: " uod ſingular. contigit, ut _ - iungantur THARINGO RVM ~ multorum dio coñiunctum eſt. quae hinc Orta eſt, laetitia, RVM GENTIvM prouínciis, ac NI finibus, continetur,º ſed' et in ultimas usque terras terrarum,
Data arrangement
METS: ONB_+Z165967208.xml
TEI (text/xml): ONB_+Z165967208.tei
Manifest: checksum.md5
Images (JPEG2000): 001.jp2
coordocr, hOCR (xhtml): 001.html
OCR (text/plain, UTF-8): 001.txt

METS
Reference: http://www.loc.gov/standards/mets/metsoverview.v2.html
Namespaces:
xmlns:mets="http://www.loc.gov/METS/"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:gbs="http://books.google.com/gbs"
xmlns:premis="info:lc/xmlns/premis-v2"
xmlns:marc="http://www.loc.gov/MARC21/slim"

METS structure: METS:mets contains METS:metsHdr, METS:dmdSec, METS:amdSec, METS:fileSec, METS:structMap
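
Reading this structure requires namespace-aware parsing. The sketch below uses Python's `xml.etree.ElementTree` with the namespace URI registered for the METS schema, `http://www.loc.gov/METS/` (namespace URIs are case-sensitive). The sample document is invented for illustration and is much smaller than a real ABO METS file; the IDs, `USE` values and file names in it are assumptions.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Invented miniature METS document, for illustration only.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:metsHdr/>
  <mets:dmdSec ID="DMD1"/>
  <mets:amdSec ID="AMD1"/>
  <mets:fileSec>
    <mets:fileGrp USE="image">
      <mets:file ID="IMG00000001" MIMETYPE="image/jp2">
        <mets:FLocat LOCTYPE="URL" xlink:href="00000001.jp2"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR">
      <mets:file ID="TXT00000001" MIMETYPE="text/plain">
        <mets:FLocat LOCTYPE="URL" xlink:href="00000001.txt"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap><mets:div/></mets:structMap>
</mets:mets>
"""


def top_level_sections(root):
    """Return the local names of the direct children of METS:mets."""
    return [child.tag.split("}", 1)[1] for child in root]


def list_files(root):
    """Return (ID, MIME type, href) for every file in the fileSec."""
    files = []
    for file_el in root.findall("./mets:fileSec/mets:fileGrp/mets:file", NS):
        locat = file_el.find("mets:FLocat", NS)
        href = locat.get(XLINK_HREF) if locat is not None else None
        files.append((file_el.get("ID"), file_el.get("MIMETYPE"), href))
    return files


root = ET.fromstring(SAMPLE_METS)
print(top_level_sections(root))
print(list_files(root))
```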

METS:metsHdr

METS:dmdSec

METS:amdSec

METS:fileSec

METS:structMap

METS:amdSec, METS:techMD: production notes (badpages, missing Pages, tightboundpages); method of image production; calibration target; definition of gbs:pagetag

METS:amdSec, METS:digiprovMD: production notes (badpages, missing Pages, tightboundpages); method of image production; calibration target; definition of gbs:pagetag

METS:amdSec
METS:sourceMD: source library information
METS:digiprovMD: PREMIS:premis representation; scanning date; processing date; analyzed date; rubbish

hOCR: https://docs.google.com/document/d/1qqniqtvdac_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0/
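
hOCR stores layout information in XHTML `class` and `title` attributes, so word coordinates can be read with an ordinary XML parser. The sketch below assumes the class name `ocrx_word` and the `bbox x0 y0 x1 y1` property from the hOCR specification; whether the ABO files use exactly these class names is not stated in the slides, and the sample markup is invented.

```python
import re
import xml.etree.ElementTree as ET

BBOX = re.compile(r"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")

# Invented hOCR fragment, for illustration only.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="ocr_page" title="bbox 0 0 1000 1500">
      <span class="ocr_line" title="bbox 100 200 900 240">
        <span class="ocrx_word" title="bbox 100 200 180 240">Dis</span>
        <span class="ocrx_word" title="bbox 190 200 240 240">ist</span>
      </span>
    </div>
  </body>
</html>
"""


def words_with_boxes(hocr_xml, word_class="ocrx_word"):
    """Return (text, (x0, y0, x1, y1)) for each word element."""
    root = ET.fromstring(hocr_xml)
    words = []
    for element in root.iter():
        if word_class not in element.get("class", "").split():
            continue
        match = BBOX.search(element.get("title", ""))
        if match:
            text = "".join(element.itertext()).strip()
            words.append((text, tuple(int(value) for value in match.groups())))
    return words


for text, box in words_with_boxes(SAMPLE_HOCR):
    print(text, box)
```

Matching on the `class` attribute rather than on tag names keeps the function independent of the XHTML namespace prefix used in a given file.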

Using the data, locally: https://www.dropbox.com/s/zpb7jzti0f8gsxn/pairtree.sh
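
The contents of pairtree.sh are not reproduced in the slides. As an independent sketch of local access, the following walks a pairtree root and indexes each volume's METS file by identifier. It assumes the layout from the pairtree slide (an `abo` object directory holding `ONB_<identifier>.xml`); the function name is hypothetical.

```python
from pathlib import Path


def index_volumes(pairtree_root, object_dir="abo"):
    """Map volume identifiers to their METS file below a pairtree root."""
    volumes = {}
    for mets_path in Path(pairtree_root).rglob("ONB_*.xml"):
        # Only accept METS files that sit directly in an object directory.
        if mets_path.parent.name == object_dir:
            identifier = mets_path.stem[len("ONB_"):]
            volumes[identifier] = mets_path
    return dict(sorted(volumes.items()))
```

On a full store a recursive walk touches every directory, so for repeated use it is cheaper to compute a volume's path directly from its identifier or to keep a list of paths.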

Using the data, cluster
Paths:
/user/onbfue/input/abo/paths/mets/abo_mets_file_paths.txt
/user/onbfue/input/abo/paths/text/abo_text_file_paths.txt
/user/onbfue/input/abo/paths/html/abo_html_file_paths.txt
Data:
/user/onbfue/input/abo/data/html/seqfiles (page level)
/user/onbfue/input/abo/data/text/seqfiles (book level)
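
One way to process the sequence files is Hadoop Streaming, which can hand SequenceFile records to a script as tab-separated key/value lines when the job uses `SequenceFileAsTextInputFormat`. The sketch below counts terms in such records. It assumes that each key is an identifier and each value is OCR text, which the slides do not confirm; the function names are hypothetical.

```python
import re
import sys
from collections import Counter

TOKEN = re.compile(r"\w+")


def map_record(line):
    """Turn one 'key<TAB>text' record into (term, 1) pairs."""
    _key, _sep, text = line.rstrip("\n").partition("\t")
    return [(term.lower(), 1) for term in TOKEN.findall(text)]


def reduce_pairs(pairs):
    """Sum the counts per term."""
    totals = Counter()
    for term, count in pairs:
        totals[term] += count
    return dict(totals)


def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Streaming mapper loop: read records, write 'term<TAB>1' lines."""
    for line in stdin:
        for term, count in map_record(line):
            stdout.write(f"{term}\t{count}\n")
```

In a streaming job, a small script calling `run_mapper()` would be passed as the mapper, with one of the seqfiles directories above as input; the reducer would sum the sorted `term<TAB>count` lines in the same way `reduce_pairs` does in memory.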