Archiving the Web: the mass preservation challenge
|
|
|
- Leona Melton
- 10 years ago
- Views:
Transcription
1 Archiving the Web: the mass preservation challenge Catherine Lupovici Chargée de Mission auprès du Directeur des Services et des Réseaux Bibliothèque nationale de France 1-, Koninklijke Bibliotheek, Den Haag 1
2 Preservation of the Web Initial concern has been to capture the Web to be able to save it. Emphasis put on the crawlers capabilities NEDLIB harvester Heritrix developed by International Internet Preservation consortium (IIPC) members Second concern has been to provide access to users: search and navigate through time Rendering problems identified Long term preservation is now coming on the front, with the web archives to be ingested in the Digital preservation repositories Denmark started to work on its trusted Digital repository with the Web archiving 1-, Koninklijke Bibliotheek, Den Haag 2
3 The Web content specificity Web content Classical types from different previous domains (Publishing, Institutional records, Audiovisual ) Continuous emerging types: blogs Web content technical characteristics & challenges Mass, continuing resources, dynamic content and streaming Surface web or visible web (open web content) and deep web (restricted access for the robots or non harvestable contents for instance accessible via data bases) Interlinking 1-, Koninklijke Bibliotheek, Den Haag 3
4 Links are part of the web content 1-, Koninklijke Bibliotheek, Den Haag 4
5 Web size estimates OCLC Web Characterization study: Unique sites Public sites Indexable Web 200 million pages in 1997 (K.Bharat and A.Broder) 11,5 billion pages end January 2005 (A.Gulli and A.Signorini) A snapshot of.fr domain November million files, hosts, 7.23 Tb 1-, Koninklijke Bibliotheek, Den Haag 5
6 Web archiving models Whole domain approach through broad crawl Internet Archive, National Library Sweden Selection: collection of specified site snapshots National Library of Australia, UKWAC Specific selection through thematic approach Presidential elections at LoC Deposit (legal or voluntary deposit codes) Netherlands e-deposit Combine approaches Denmark, France 1-, Koninklijke Bibliotheek, Den Haag 6
7 Web preservation characteristics Scale. Web archiving produces massive and technically heterogeneous digital collections even with a selective policy Packaging of data ARC files format that is used to store web crawls as sequences of content blocks harvested. Each block has a header with some metadata (mime file type) WARC which is an extension of ARC. Currently at the ISO Committee Draft status 1-, Koninklijke Bibliotheek, Den Haag 7
8 WARC format Ability to store arbitrary metadata linked to other stored data (e.g., subject classifier, discovered language, encoding) Support for data compression and maintenance of data record integrity Ability to store all control information from the harvesting protocol (e.g., request headers), not just response information Ability to store the results of data migrations linked to other stored data Ability to store a duplicate detection event Ability to store globally unique record identifiers Support for deterministic handling of long records (e.g., truncation, segmentation). 1-, Koninklijke Bibliotheek, Den Haag 8
9 International Internet Preservation Consortium IIPC launched in July Technical standardisation at the domain scale: functional architecture and standard APIs, archival format, metadata, permanent identification Strong basic tools for all the chain from acquisition to access processes Open source, free licence tools Provide a forum for sharing knowledge about Internet content archiving both within the Consortium and beyond Working group on Preservation created in January 1-, Koninklijke Bibliotheek, Den Haag 9
10 IIPC preservation WG program of work Identification of applicable standards and tools not necessarily developed solely for web archives Test the use of existing format tools on ARC and WARC files Liaison with other related efforts Planets (EU funded) Web at Risk, California Digital Library (NDIIPP) Report by end of December 1-, Koninklijke Bibliotheek, Den Haag 10
11 Conclusion The work achieved so far leads to post process the collections to ingest them in the trusted repositories (a large scale migration test) The main question for the future is how to do it in the workflow of collecting, pre-ingest and ingest at the Web scale. 1-, Koninklijke Bibliotheek, Den Haag 11
Web Archiving Tools: An Overview
Web Archiving Tools: An Overview JISC, the DPC and the UK Web Archiving Consortium Workshop Missing links: the enduring web Helen Hockx-Yu Web Archiving Programme Manager July 2009 Shape of the Web: HTML
Web Archiving at BnF September 2006
Hosting the IIPC Steering Committee gives us the opportunity to give an update on BnF s organisation and projects. In 2006, we have been mostly focusing on building our own organisation and internal dissemination
Long-term archiving and preservation planning
Long-term archiving and preservation planning Workflow in digital preservation Hilde van Wijngaarden Head, Digital Preservation Department National Library of the Netherlands The Challenge: Long-term Preservation
How To Understand Web Archiving Metadata
Web Archiving Metadata Prepared for RLG Working Group The following document attempts to clarify what metadata is involved in / required for web archiving. It examines: The source of metadata The object
Web Harvesting and Archiving
Web Harvesting and Archiving João Miranda Instituto Superior Técnico Technical University of Lisbon [email protected] Keywords: web harvesting, digital archiving, internet preservation Abstract. This
Evaluating File Formats for Long-term Preservation
Evaluating File Formats for Long-term Preservation Abstract Judith Rog, Caroline van Wijk National Library of the Netherlands; The Hague, The Netherlands [email protected], [email protected] National
Analysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
Design and selection criteria for a national web archive
Design and selection criteria for a national web archive Daniel Gomes Sérgio Freitas Mário J. Silva University of Lisbon Daniel Gomes http://xldb.fc.ul.pt/daniel/ 1 The digital era has begun The web is
Kris Carpenter Negulescu, Director The Internet Archive, Web Group
Opportunities for Global Cooperation & Collaboration in Web Archiving Kris Carpenter Negulescu, Director The Internet Archive, Web Group Agenda Welcome & Overview The Internet Archive (IA) The International
Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data
Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data David Minor 1, Reagan Moore 2, Bing Zhu, Charles Cowart 4 1. (88)4-104 [email protected] San Diego Supercomputer Center
Scholarly Use of Web Archives
Scholarly Use of Web Archives Helen Hockx-Yu Head of Web Archiving British Library 15 February 2013 Web Archiving initiatives worldwide http://en.wikipedia.org/wiki/file:map_of_web_archiving_initiatives_worldwide.png
SEVENTH FRAMEWORK PROGRAMME THEME ICT -1-4.1 Digital libraries and technology-enhanced learning
Briefing paper: Value of software agents in digital preservation Ver 1.0 Dissemination Level: Public Lead Editor: NAE 2010-08-10 Status: Draft SEVENTH FRAMEWORK PROGRAMME THEME ICT -1-4.1 Digital libraries
Web Archiving and Scholarly Use of Web Archives
Web Archiving and Scholarly Use of Web Archives Helen Hockx-Yu Head of Web Archiving British Library 15 April 2013 Overview 1. Introduction 2. Access and usage: UK Web Archive 3. Scholarly feedback on
Bibliothèque numérique de l enssib
Bibliothèque numérique de l enssib Turning the library inside out, 4 au 7 juillet 2006 35 e congrès LIBER Permanent access to the record of science: the international role of the e-depot at the KB, National
Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web. NLA Gordon Mohr March 28, 2012
Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web NLA Gordon Mohr March 28, 2012 Overview The tools: Heritrix crawler Wayback browse access Lucene/Hadoop utilities:
Overview Motivation MapReduce/Hadoop in a nutshell Experimental cluster hardware example Application areas at the Austrian National Library
Overview Motivation MapReduce/Hadoop in a nutshell Experimental cluster hardware example Application areas at the Austrian National Library Web Archiving Austrian Books Online SCAPE at the Austrian National
Oxford Digital Asset Management System (DAMS) Update
Oxford Digital Asset Management System (DAMS) Update Neil Jefferies R&D Project Manager Systems & eresearch Services (SERS) Oxford University Library Services (OULS) Agenda Overview Fedora-Commons Honeycomb/ST5800
Digital Preservation Strategy, 2012-2015
Digital Preservation Strategy, 2012-2015 Preface This digital preservation strategy sets out what the National Library of Wales (NLW) intends to do to preserve digital materials over the next three years.
Building a master s degree on digital archiving and web archiving. Sara Aubry (IT department, BnF) Clément Oury (Legal Deposit department, BnF)
Building a master s degree on digital archiving and web archiving Sara Aubry (IT department, BnF) Clément Oury (Legal Deposit department, BnF) Objectives of the presentation > Present and discuss an experiment
Harvard Library Preparing for a Trustworthy Repository Certification of Harvard Library s DRS.
National Digital Stewardship Residency - Boston Project Summaries 2015-16 Residency Harvard Library Preparing for a Trustworthy Repository Certification of Harvard Library s DRS. Harvard Library s Digital
Archival Data Format Requirements
Archival Data Format Requirements July 2004 The Royal Library, Copenhagen, Denmark The State and University Library, Århus, Denmark Main author: Steen S. Christensen The Royal Library Postbox 2149 1016
Facilitating the discovery of free online content: the librarian perspective. June 2013. Open Access Author Survey March 2013 1
Facilitating the discovery of free online content: the librarian perspective June 2013 Open Access Author Survey March 2013 1 Acknowledgements This research was carried out on behalf of Taylor & Francis
Statement of Work (SOW) for Web Harvesting U.S. Government Printing Office Office of Information Dissemination
Statement of Work (SOW) for Web Harvesting U.S. Government Printing Office Office of Information Dissemination Scope The U.S. Government Printing Office (GPO) requires the services of a vendor that can
The challenges of digital preservation to support research in the digital age
DRAFT FOR DISCUSSION WITH ADVISORY COUNCIL MEMBERS ONLY The challenges of digital preservation to support research in the digital age Lynne Brindley CEO, The British Library November 2005 Agenda UK developments
Standard Big Data Architecture and Infrastructure
Standard Big Data Architecture and Infrastructure Wo Chang Digital Data Advisor Information Technology Laboratory (ITL) National Institute of Standards and Technology (NIST) [email protected] May 20, 2016
How To Manage Pandora
PANDORA - past, present, and future National web archiving in Australia Dr Paul Koerbin Manager Web Archiving National Library of Australia National Conference on eresources in Malaysia Penang, Malaysia,
Recommendations for the Implementation of Article 37 of the Spanish Science, Technology and Innovation Act: Open Access Dissemination SUMMARY
Recommendations for the Implementation of Article 37 of the Spanish Science, Technology and Innovation Act: Open Access Dissemination SUMMARY CC BY - 1 Publication date: October 2014 Index 1. Introduction
Collecting and Providing Access to Large Scale Archived Web Data. Helen Hockx-Yu Head of Web Archiving, British Library
Collecting and Providing Access to Large Scale Archived Web Data Helen Hockx-Yu Head of Web Archiving, British Library Web Archives key characteristics Snapshots of web resources, taken at given point
A Service for Data-Intensive Computations on Virtual Clusters
A Service for Data-Intensive Computations on Virtual Clusters Executing Preservation Strategies at Scale Rainer Schmidt, Christian Sadilek, and Ross King [email protected] Planets Project Permanent
Summary Report T2 Migration Test
Summary Report T2 Migration Test Version: 1.0 Author: Caroline van Wijk Date: 31 October 2006 National Library of the Netherlands (Koninklijke Bibliotheek) Digital Preservation Department Version 1.0 Pag:
SURFsara Data Services
SURFsara Data Services SUPPORTING DATA-INTENSIVE SCIENCES Mark van de Sanden The world of the many Many different users (well organised (international) user communities, research groups, universities,
IBM Unstructured Data Identification and Management
IBM Unstructured Data Identification and Management Discover, recognize, and act on unstructured data in-place Highlights Identify data in place that is relevant for legal collections or regulatory retention.
The challenges of becoming a Trusted Digital Repository
The challenges of becoming a Trusted Digital Repository Annemieke de Jong is Preservation Officer at the Netherlands Institute for Sound and Vision (NISV) in Hilversum. She is responsible for setting out
The Deposit System for Electronic Publications. A Process Model. Titia van der Werf. Koninklijke Bibliotheek. Report Series 6
The Deposit System for Electronic Publications A Process Model Titia van der Werf Koninklijke Bibliotheek Report Series 6 DSEP: A Process Model The Deposit System for Electronic Publications: A Process
WEB ARCHIVING AT SCALE
WEB ARCHIVING AT SCALE (updated 12/19/14) by Rosalie Lack, Stephen Abrams, Trisha Cruse THE VALUE OF WEB CONTENT Web content holds a critical place in modern library collections. Web archiving is an essential
Ex Libris Rosetta: A Digital Preservation System Product Description
Ex Libris Rosetta: A Digital Preservation System Product Description CONFIDENTIAL INFORMATION The information herein is the property of Ex Libris Ltd. or its affiliates and any misuse or abuse will result
Webarchiving: Legal Deposit of Internet in Denmark. A Curatorial Perspective
MDR, Vol. 41, pp. 110 120, December 2012 Copyright by Walter de Gruyter Berlin Boston. DOI 10.1515/mir-2012-0018 Webarchiving: Legal Deposit of Internet in Denmark. A Curatorial Perspective Sabine Schostag
E-Content Service Group Virtual Meeting. Digital Preservation: How to Get Started
E-Content Service Group Virtual Meeting Digital Preservation: How to Get Started Slide 2 of 17 Agenda Committee Members E-Content Purpose Presentation by Greg Zick Questions Discussion for May Meeting
DIGITAL ARCHIVES & PRESERVATION SYSTEMS
DIGITAL ARCHIVES & PRESERVATION SYSTEMS Part 1 Overview (part 1 of 7) Kari R. Smith, MIT Institute Archives Session Overview Digital archives and digital preservation systems. These open source tools are
of the public interface service and will also act as the national aggregator for Europeana.
The Finnish National Digital Library: National Library of Finland developing a national infrastructure in collaboration with libraries, archives and museums Kristiina Hormia-Poutanen Deputy National Librarian
Chapter 434-662 WAC PRESERVATION OF ELECTRONIC PUBLIC RECORDS
Chapter 434-662 WAC PRESERVATION OF ELECTRONIC PUBLIC RECORDS WAC 434-662-010 Purpose. Pursuant to the provisions of chapters 40.14 and 42.56 RCW, and RCW 43.105.250, the rules contained in this chapter
The Australian War Memorial s Digital Asset Management System
The Australian War Memorial s Digital Asset Management System Abstract The Memorial is currently developing an Enterprise Content Management System (ECM) of which a Digital Asset Management System (DAMS)
Response to Invitation to Tender: requirements and feasibility study on preservation of e-prints
Response to Invitation to Tender: requirements and feasibility study on preservation of e-prints A proposal to the JISC from the Arts and Humanities Data Service and the University of Nottingham, Project
One System to rule them all - an epic journey to manage our collections
One System to rule them all - an epic journey to manage our collections Lynne Billington Manager, Data Quality, Systems and Standards Euwe Ermita Systems and Applications Leader One system to rule them
Image Data, RDA and Practical Policies
Image Data, RDA and Practical Policies Rainer Stotzka and many others KIT University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association www.kit.edu Data Life Cycle Lab
Digital Preservation. OAIS Reference Model
Digital Preservation OAIS Reference Model Stephan Strodl, Andreas Rauber Institut für Softwaretechnik und Interaktive Systeme TU Wien http://www.ifs.tuwien.ac.at/dp Aim OAIS model Understanding the functionality
Questionnaire on Digital Preservation in Local Authority Archive Services
Questionnaire on Digital Preservation in Local Authority Archive Services A - Digital Preservation Planning 1. Would you describe your Archive Service as: Actively seeking digital material Reacting to
EUDAT. Towards a pan-european Collaborative Data Infrastructure. Willem Elbers
EUDAT Towards a pan-european Collaborative Data Infrastructure Willem Elbers EUDAT / MPI-TLA Focus meeting: Data repositories SURF, Utrecht March 3, 2014 Outline EUDAT project EUDAT services Summary and
Management of Storage Devices and File Formats in Web Archive Systems
The Ninth International Symposium on Operations Research and Its Applications (ISORA 10) Chengdu-Jiuzhaigou, China, August 19 23, 2010 Copyright 2010 ORSC & APORC, pp. 356 361 Management of Storage Devices
OpenAIRE Research Data Management Briefing paper
OpenAIRE Research Data Management Briefing paper Understanding Research Data Management February 2016 H2020-EINFRA-2014-1 Topic: e-infrastructure for Open Access Research & Innovation action Grant Agreement
Columbia University Libraries / Information Services
Stephen Davis, October 28, 2010 Columbia University Libraries / Information Services Digital Asset Management Digital Preservation Digital Publishing Introductions Stephen Paul Davis Director, Libraries
METADATA STANDARDS AND GUIDELINES RELEVANT TO DIGITAL AUDIO
This chart provides a quick overview of metadata standards and guidelines that are in use with digital audio, including metadata used to describe the content of the files; metadata used to describe properties
Archiving the Web and Beyond: A Look at Twi8er and Facebook (and some other things too)
Archiving the Web and Beyond: A Look at Twi8er and Facebook (and some other things too) July 23, 2014 Benn Joseph Manuscript Librarian Northwestern University Library Outline of discussion Digital preservanon
Preserving digital data - risk assessment and digital preservation strategies
Preserving digital data - risk assessment and digital preservation strategies Martin Iordanidis, M.A. epublishing systems Exponential growth of digital data Exponential growth of digital data tenfold growth
Functional Requirements for Digital Asset Management Project version 3.0 11/30/2006
/30/2006 2 3 4 5 6 7 8 9 0 2 3 4 5 6 7 8 9 20 2 22 23 24 25 26 27 28 29 30 3 32 33 34 35 36 37 38 39 = required; 2 = optional; 3 = not required functional requirements Discovery tools available to end-users:
ProLibis Solutions for Libraries in the Digital Age. Automation of Core Library Business Processes
ProLibis Solutions for Libraries in the Digital Age Automation of Core Library Business Processes We see a modern library not only as a book repository, but also and most importantly as a state-of-the-art,
STANDARDS FOR AGENTS AND AGENT BASED SYSTEMS (FIPA)
Course Number: SENG 609.22 Session: Fall, 2003 Course Name: Agent-based Software Engineering Department: Electrical and Computer Engineering Document Type: Tutorial Report STANDARDS FOR AGENTS AND AGENT
STEPPING UP TO THE ELECTRONIC ARCHIVING CHALLENGE: OCLC S ROLE. Andrea Keyhani Director, Licensing & Publisher Relations
STEPPING UP TO THE ELECTRONIC ARCHIVING CHALLENGE: OCLC S ROLE Andrea Keyhani Director, Licensing & Publisher Relations OCLC.org Nonprofit, membership organization 36,000 libraries in 76 countries Mission:
eifl-ip Handbook on Copyright and Related Issues for Libraries LEGAL DEPOSIT
LEGAL DEPOSIT Legal deposit is a legal obligation that requires publishers to deposit a copy (or copies) of their publications within a specified period of time in a designated national institution. The
Bibliothèque numérique de l enssib
Bibliothèque numérique de l enssib European integration: conditions and challenges for libraries, 3 au 7 juillet 2007 36 e congrès LIBER How much does it cost? The LIFE Project Davies, Richard LIFE 2 Project
Audiovisual Archive Management System (AVAMS) Project
Audiovisual Archive Management System (AVAMS) Project Overview of Achievements Rose Holley Project Manager 22 August 2014 Overview of: NAA Audiovisual Archive AVAMS project scope and budget Achievements
Archiving Social Media in the Context of Non-print Legal Deposit
Submitted on: 30/07/2013 Archiving Social Media in the Context of Non-print Legal Deposit Helen Hockx-Yu Head of Web Archiving, British Library, London, United Kingdom. E-mail address: [email protected]
Digital preservation: an introduction
Digital preservation: an introduction Michael Day UKOLN, University of Bath, UK [email protected] University of the West of England, MSc in Information and Library Management, Advanced Information Systems
nestor - Network of Expertise in Long-Term Storage and Long-Term availability of Digital Resources in Germany
nestor - Network of Expertise in Long-Term Storage and Long-Term availability of Digital Resources in Germany 1. The challenge We still do not know how to archive digital scientific publications, works
How To Build A Map Library On A Computer Or Computer (For A Museum)
Russ Hunt OCLC Tools Managing & Preserving Digitised Map Libraries Keywords: Digital collection management; digital collection access; digital preservation; long term preservation. Summary This paper explains
AHDS Digital Preservation Glossary
AHDS Digital Preservation Glossary Final version prepared by Raivo Ruusalepp Estonian Business Archives, Ltd. January 2003 Table of Contents 1. INTRODUCTION...1 2. PROVENANCE AND FORMAT...1 3. SCOPE AND
THE BRITISH LIBRARY BOARD BLB 12/29
IN CONFIDENCE THE BRITISH LIBRARY BOARD BLB 12/29 BRITISH LIBRARY DIGITAL STRATEGY TO 2015 1. PURPOSE OF THE PAPER To respond to members request for an overarching overview of how digital activities in
