Planning a digitisation project: a rough guide Digiwiki seminar 12.3.2009, Helsinki, Finland edwin.klijn@kb.nl
Stepping stones... 1. Writing a project proposal (incl. business case) 2. Acquiring finances 3. Writing a detailed project plan (incl. detailed specs) 4. Setting up a project organisation 5. Managing the project flow 6. Wrapping up the project (exploitation plan, long-term preservation)
Sample 1: ANP radio news bulletins - 1,5 million hand-typed news items from Dutch radio (1937-1995) - June 2007 to October 2008 - Text mass digitisation project - Budget: 0,5 million Euros - Funded by the Memory of the Netherlands programme - URL: http://anp.kb.nl (in Dutch only)
Sample 2: Databank of Digital Daily Newspapers - 8 million newspaper pages - Selection of Dutch local, regional, national and colonial newspapers, 1618-1995 - 2006-2011 - Budget: 12,5 million Euros - Funded by the National Programme Investments in Large-Scale Research Facilities - 25 billion words, text mass digitisation - URL: http://www.kb.nl/projectdagbladen/
Stepping stones... 1. Writing a project proposal (incl. business case) 2. Acquiring finances 3. Writing a detailed project plan (incl. detailed specs) 4. Setting up a project organisation 5. Managing the project flow 6. Concluding the project (exploitation plan, long-term maintenance)
1. Writing a project proposal - Business case: how does our organisation benefit from starting to digitise? - Planning: how long will it take? - Which resources are needed? Staff, equipment, etc. Estimate of the required budget. - Risk assessment: what are the potential risks that can prevent us from reaching our goals? - Very down-to-earth description of the final deliverables - Who will be the owner of the digital collection after the project?
Sample case 1: ANP news bulletins - Goal: make the collection accessible online for researchers and the general public - Output-oriented = access project - Why? Task of our organisation - Deliverables: 1,5 million JPEG q. 10 files, 1,5 million ALTO XML files, a website with full-text search-and-retrieval functionality - Estimated budget: 0,5 million Euro; easy material, so 0,33 Euro per page
Sample case 2: DDD Dutch newspapers - Goal: make the collection accessible online for researchers and the general public. - Not a preservation project, but because the material is vulnerable, re-scanning in the future is not an option; hence a production project - Why? Task of the KB. - Article-level access
Sample case 2: DDD Dutch newspapers (2) - Deliverables: 8 million JP2 master files, 8 million JP2 access images, PDF per issue, text file per article, MPEG21 per issue, 8 million MIX-files and website with fulltext and advanced search - Estimated budget: 12,5 million Euro
I digitise because - my users frequently use this material - I think my users will frequently use this material - I want to save and protect my vulnerable originals - my users give me money to do so
Digitisation on demand: Stadsarchief Amsterdam - Threshold: information in the original should be readable - 1 copy for the customer, 1 copy for the reading room - No separate, uncompressed master files (JPEG 10) - Now: 32 kilometers of archives digitally available - 0,50 Euro per scan
Cost estimate - Be realistic - Calculate all costs - Use realisation data from other projects - Beware of all the work BESIDES the actual scanning
Sample: ANP news bulletins and DDD newspapers - Cost categories in both breakdowns: staff; hard- and software; research and development; scanning, OCR, metadata - Totals shown: 1,50 Euro and 0,33 Euro
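The per-page figures can be cross-checked against the budgets and volumes quoted in the two sample cases. The sketch below is only an illustration: the function name `cost_per_page` is hypothetical, and it assumes the whole project budget (staff, hard- and software, R&D, scanning) is spread evenly over all pages.

```python
def cost_per_page(total_budget_eur: float, pages: int) -> float:
    """Average all-in cost per page: total budget divided by page count."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return total_budget_eur / pages


# Budgets and volumes as quoted in the sample cases.
anp = cost_per_page(500_000, 1_500_000)      # ANP radio news bulletins
ddd = cost_per_page(12_500_000, 8_000_000)   # DDD newspapers

print(f"ANP: {anp:.2f} Euro per page")
print(f"DDD: {ddd:.2f} Euro per page")
```

The ANP result matches the 0,33 Euro on the slide; the DDD result comes out slightly above the 1,50 Euro shown, which suggests the slide figure is rounded or based on a somewhat different page count.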
Outsourcing digitisation: different prices
Outsourcing?
Pitfall: intellectual property rights - Three copyright moments: 1. Making a (digital) copy 2. Making a copy for an internal network 3. Making a copy for the internet - Protection lasts 70 years after the death of the author and/or the date of publication - Legal obligation to trace all rights holders, a time-consuming activity in (mass) digitisation projects - Commission Digiti©E: agency responsible for dealing with claims and tracing rights holders?
Privacy laws?
Pitfall: know thy originals! - How many? - What condition? - Where are they? - Available metadata - Dimensions - Colour/greyscale/B&W - Available alternatives (e.g. microfilm vs. originals)
Case 1: ANP news bulletins
Case 2: DDD newspapers - The Hague - Stockholm - Vatican Secret Archives - Dresden - Paramaribo
Acquiring finances - Resources within own institution - National government - European Union - Private funds - Current figures: only 2 to 3% of all Dutch institutionalised cultural heritage is available in digital format.
National government Netherlands: the Dutch government encourages: - cross-sector cooperation between heritage institutions - open standards - service-oriented architecture (SOA) - mass digitisation - digitisation incorporated into overall policy ("Digitaliseren met beleid") - more uniformity of digitisation activities - main target groups: education and research
National government Netherlands - Dutch Ministry of Education, Culture and Science, but also other ministries. - Digitisation programmes: * Erfgoed van de Oorlog * Memory of the Netherlands http://www.geheugenvannederland.nl * Images for the future http://www.beeldenvoordetoekomst.nl
National preservation programme: Metamorfoze - KB, National Archives - Since 1997 - Funded by the Ministry of Education, Culture and Science - Preservation of Dutch paper-based heritage - Project bureau - 30% own contribution - Before 2007: microfilming; after 2007: preservation imaging - http://www.metamorfoze.nl/
Future: project Nederlands Erfgoed: Digitaal! - Consortium of 10 Dutch heritage institutions - Combining parts of their collections - Cross-media, subject-oriented focus - Target groups: education, research, tourism and creative industry - "Canon van Nederland": highlights of Dutch history - 2009-2014 - Overall estimated cost 186 million Euro, estimated benefits 172-223 million Euro (!)
Koninklijke Bibliotheek: Digital Library programme - Current digitisation projects (to 2011): 41 million pages, budget 54 million Euro - Digital Library programme 2009-2013 - Target: 20% of all books, newspapers and journals published in the Netherlands digitised by 2013
European projects - Financial commitment from the institution (matching principle) - Aimed at innovative initiatives - Aimed at combining collections on an international level, e.g. Europeana (http://www.europeana.eu) and European Digital Library (http://www.theeuropeanlibrary.org/portal/index.htm)
Stepping stones... 1. Writing a project proposal (incl. business case) 2. Acquiring finances 3. Writing a detailed project plan (incl. detailed specs) 4. Setting up a project organisation 5. Managing the project flow 6. Concluding the project (exploitation plan, long-term maintenance)
Project plan - Detailed time schedule with milestones and deliverables - Division into work packages (in case of large projects) - Clear overview of dependencies - Detailed risk assessment - Business case: should be re-checked during the project - Translation of the project aim into detailed specifications
http://www.dbnl.org
Technical specifications affected by project aim - Image quality: resolution, tonal range, detail reproduction, polarity (B/W, greyscale, colour) - Image format: lossy vs. lossless, compressed vs. uncompressed - Metadata: a lot vs. a little vs. somewhere in between - Image manipulation: yes vs. no vs. a little - Good technical manuals available: * JISC Digital Media at http://www.jiscdigitalmedia.ac.uk/ * Cornell University at http://www.library.cornell.edu/preservation/tutorial/
Search-and-retrieval problems - Large amounts of data: how to find your way - Limited capacity of search engine - Limitations of Optical Character Recognition (OCR) software
Optical character recognition (OCR) Blijkens verschillende mededeeelingen in de dagbladen is de Indische regeering den laatsten tijd regelend opgetreden ten aanzien van het Indische handelsverkeer, in het bijzonder ten aanzien van den uitvoer van Indische producten.
Optical character recognition (OCR) - Word accuracy: 7 errors in 33 words = 79% - Character accuracy: 7 errors in 202 characters = 97% - OCR output: Blijkena verachillende mededeeelingen in de dagbladen is de Indische regeering den 1aatsten tijd regelond opgetreden ten aanzien van het Indische handelsverkeer, in het lijzonder ten a3nzien van den uitvoar van Indische producten.
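The two accuracy figures follow from the same formula, accuracy = 1 - errors / total, applied once to words and once to characters. A minimal sketch, assuming the error and total counts from the slide; the helper names `accuracy` and `count_word_errors` are hypothetical, and the word comparison assumes the OCR output has the same number of words as the reference (no split or merged words).

```python
def accuracy(errors: int, total: int) -> float:
    """Fraction of units (words or characters) recognised correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1.0 - errors / total


def count_word_errors(reference: str, ocr: str) -> int:
    """Count positions where the OCR word differs from the reference word."""
    ref_words = reference.split()
    ocr_words = ocr.split()
    if len(ref_words) != len(ocr_words):
        raise ValueError("word counts differ; a real tool would align the texts first")
    return sum(r != o for r, o in zip(ref_words, ocr_words))


# Counts as given on the slide: 7 errors in 33 words and in 202 characters.
word_acc = accuracy(7, 33)
char_acc = accuracy(7, 202)
print(f"word accuracy: {word_acc:.0%}, character accuracy: {char_acc:.0%}")

# Word-level comparison on a fragment of the sample sentence.
fragment_errors = count_word_errors(
    "ten aanzien van den uitvoer",
    "ten a3nzien van den uitvoar",
)
```

The gap between the two numbers is the point of the slide: a single wrong character spoils a whole word, so a character accuracy that looks excellent can still leave one word in five unsearchable.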
Optical character recognition (OCR) IINCOLXis strangely forgotten by b visitors to in Washington Washingt onthe Thesightseers who whotluck flock to the National ntionnl Capital at all sea seasons scaon8 seasons Lsons on8 of the year for som e som unknown reason jeeni to find more moreinteresting moreintne8t ing moreinteresting interestingthe thing things of less historic importane than the therelics thcrelic therelics relicspertaining pertainh g iu ti > the fmt martyred President whose un untimely untimely untimely timelydeath was as mourned by the entire oitihzed world Source: http://www.loc.gov/chroniclingamerica/
Automated OCR - Pilot project Historische kranten (bitonal, from microfilm): between 60% and 70% word accuracy - Results for historical texts very low. EU-project IMPACT (Improving Access to Texts, URL: http://www.impact-project.eu/)
Stepping stones... 1. Writing a project proposal (incl. business case) 2. Acquiring finances 3. Writing a detailed project plan (incl. detailed specs) 4. Setting up a project organisation 5. Managing the project flow 6. Concluding the project (exploitation plan, long-term maintenance)
Steering group - representative of end-users - representative of initiating party - representative responsible for quality assurance
Project manager - reports to steering group - responsible for day-to-day work (budget etc.) - management by exception
Project leader - reports to project manager - responsible for workpackage
Selection: Scientific Advisory Committee - Advises on titles to be selected - Advises on search functionality on the website (user perspective) - Advises on content on the website
Stepping stones... 1. Writing a project proposal (incl. business case) 2. Acquiring finances 3. Writing a detailed project plan (incl. detailed specs) 4. Setting up a project organisation 5. Managing the project flow 6. Concluding the project (exploitation plan, long-term maintenance)
Managing the project flow - Are the specifications met within timeframe and budget? - Are there any new developments that affect the business case of the project?
Stepping stones... 1. Writing a project proposal (incl. business case) 2. Acquiring finances 3. Writing a detailed project plan (incl. detailed specs) 4. Setting up a project organisation 5. Managing the project flow 6. Concluding the project (exploitation plan, long-term maintenance, lessons learned)
Exploitation plan - Who will be responsible for maintaining the website? - What possible future purposes can be served? - What costs are involved in maintaining the website and in long-term preservation of the deliverables?
Pitfall: It ain't over when it's over... - Koninklijke Bibliotheek: 41 million pages to be digitised up to 2011. - Required storage space: 1 petabyte - Current estimated storage costs: long-term preservation system (e-depot) 1 TB = 8,500 Euro a year - Current estimated storage costs: webserver 1 TB = 7,500 Euro a year - Structural costs in the long run: millions!
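The "millions" follow directly from the figures above. The sketch below is a rough illustration, not the KB's own calculation: it assumes decimal units (1 PB = 1000 TB) and, as a deliberate simplification, that the full petabyte is held in each system; the name `annual_storage_cost` is hypothetical.

```python
TB_PER_PB = 1000  # assumption: decimal units, 1 petabyte = 1000 terabytes


def annual_storage_cost(terabytes: float, eur_per_tb_year: float) -> float:
    """Yearly storage cost for a given volume at a given price per TB per year."""
    if terabytes < 0 or eur_per_tb_year < 0:
        raise ValueError("volume and price must be non-negative")
    return terabytes * eur_per_tb_year


volume_tb = 1 * TB_PER_PB

# Prices per TB per year as quoted on the slide.
edepot_cost = annual_storage_cost(volume_tb, 8_500)     # long-term preservation system
webserver_cost = annual_storage_cost(volume_tb, 7_500)  # access copies on the webserver

print(f"e-depot:   {edepot_cost:,.0f} Euro a year")
print(f"webserver: {webserver_cost:,.0f} Euro a year")
```

Under these assumptions each system alone costs several million Euro a year, and the cost recurs every year after the project has formally ended.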
Koninklijke Bibliotheek: big issues - Expensive scanning price of 1,3 Euro per page - Intellectual property rights - Quality control of files delivered by suppliers (> 2 million files a month) - Storage - Long-term preservation of all files produced - Inefficient search-and-retrieval software. But we have already come a long way since 1999!