Archiving the Web: the mass preservation challenge



Similar documents
Web Archiving Services in The British Library: An Update

Web Archiving Tools: An Overview

Web Archiving at BnF September 2006

Long-term archiving and preservation planning

CREATING WEB ARCHIVING SERVICES IN THE BRITISH LIBRARY

Archiving the Internet

The International Internet Preservation Consortium (IIPC)

How To Understand Web Archiving Metadata

The KB e-depot in development Integrating research results in the library organisation

Web Harvesting and Archiving

PASIG San Diego March 13, 2015

Evaluating File Formats for Long-term Preservation

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Design and selection criteria for a national web archive

Archiving the Social Web MARAC Spring 2013 Conference

Kris Carpenter Negulescu, Director The Internet Archive, Web Group

Web and Twitter Archiving at the Library of Congress

Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data

Libraries and Disaster Recovery

Scholarly Use of Web Archives

The NDSA Content Working Group s National Agenda Digital Content Area Discussions: Web and Social Media

SEVENTH FRAMEWORK PROGRAMME THEME ICT Digital libraries and technology-enhanced learning

Web Archiving and Scholarly Use of Web Archives

Bibliothèque numérique de l enssib

Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web. NLA Gordon Mohr March 28, 2012

Overview Motivation MapReduce/Hadoop in a nutshell Experimental cluster hardware example Application areas at the Austrian National Library

Oxford Digital Asset Management System (DAMS) Update

Digital Preservation Strategy,

Building a master s degree on digital archiving and web archiving. Sara Aubry (IT department, BnF) Clément Oury (Legal Deposit department, BnF)

Harvard Library Preparing for a Trustworthy Repository Certification of Harvard Library s DRS.

Archival Data Format Requirements

Meeting Increasing Demands for Metadata

Design and Selection Criteria for a National Web Archive

Press Dossier April 2010

Facilitating the discovery of free online content: the librarian perspective. June Open Access Author Survey March

Statement of Work (SOW) for Web Harvesting U.S. Government Printing Office Office of Information Dissemination

The challenges of digital preservation to support research in the digital age

Tools and Services for the Long Term Preservation and Access of Digital Archives

Standard Big Data Architecture and Infrastructure

How To Manage Pandora

Collaborative Open Source Software Production & APIs. Tom Cramer Stanford

Recommendations for the Implementation of Article 37 of the Spanish Science, Technology and Innovation Act: Open Access Dissemination SUMMARY

Collecting and Providing Access to Large Scale Archived Web Data. Helen Hockx-Yu Head of Web Archiving, British Library

Report on the Crawl and Harvest of the Whole Australian Web Domain Undertaken during June and July 2005

A Service for Data-Intensive Computations on Virtual Clusters

Summary Report T2 Migration Test

ERA Challenges. Draft Discussion Document for ACERA: 10/7/30

Scalable Services for Digital Preservation

SURFsara Data Services

IBM Unstructured Data Identification and Management

The challenges of becoming a Trusted Digital Repository

The Deposit System for Electronic Publications. A Process Model. Titia van der Werf. Koninklijke Bibliotheek. Report Series 6

WEB ARCHIVING AT SCALE

Advertising that targets children: Ensuring the best protection possible

Ex Libris Rosetta: A Digital Preservation System Product Description

Webarchiving: Legal Deposit of Internet in Denmark. A Curatorial Perspective

E-Content Service Group Virtual Meeting. Digital Preservation: How to Get Started

DIGITAL ARCHIVES & PRESERVATION SYSTEMS

ALEX THURMAN. PCC Participants Meeting ALA Annual June 30, Columbia University Libraries

Second EUDAT Conference, October 2013 Workshop: Digital Preservation of Cultural Data Scalability in preservation of cultural heritage data

of the public interface service and will also act as the national aggregator for Europeana.

Information and documentation Statistics and Quality Indicators for Web Archiving

Chapter WAC PRESERVATION OF ELECTRONIC PUBLIC RECORDS

The International e-depot

The Australian War Memorial s Digital Asset Management System

Response to Invitation to Tender: requirements and feasibility study on preservation of e-prints

University of California Libraries. Shared Print Program. Strategic Plan

One System to rule them all - an epic journey to manage our collections

Image Data, RDA and Practical Policies

Digital Preservation. OAIS Reference Model

Questionnaire on Digital Preservation in Local Authority Archive Services

EUDAT. Towards a pan-european Collaborative Data Infrastructure. Willem Elbers

Management of Storage Devices and File Formats in Web Archive Systems

OpenAIRE Research Data Management Briefing paper

WRANGLING DIGITAL CHAOS: CHARACTERIZATION & INGEST

Europeana: a shared resource

Using Business Vocabulary to build a strong Document Management Solution

Executive has authority to determine the above recommendation.

The DK domain: in words and figures

Columbia University Libraries / Information Services

METADATA STANDARDS AND GUIDELINES RELEVANT TO DIGITAL AUDIO

National Digital Preservation Initiatives:

Archiving the Web and Beyond: A Look at Twi8er and Facebook (and some other things too)

Preserving digital data - risk assessment and digital preservation strategies

Capturing the Web WEB ARCHIVING. Tools for the Capture of Digital Assets on Websites March 27, 2006 Kelly Eubank

Combining Technologies to Create New Solutions

Functional Requirements for Digital Asset Management Project version /30/2006

ProLibis Solutions for Libraries in the Digital Age. Automation of Core Library Business Processes

STANDARDS FOR AGENTS AND AGENT BASED SYSTEMS (FIPA)

STEPPING UP TO THE ELECTRONIC ARCHIVING CHALLENGE: OCLC S ROLE. Andrea Keyhani Director, Licensing & Publisher Relations

eifl-ip Handbook on Copyright and Related Issues for Libraries LEGAL DEPOSIT

Bibliothèque numérique de l enssib

Audiovisual Archive Management System (AVAMS) Project

Archiving Social Media in the Context of Non-print Legal Deposit

Digital preservation: an introduction

nestor - Network of Expertise in Long-Term Storage and Long-Term availability of Digital Resources in Germany

How To Build A Map Library On A Computer Or Computer (For A Museum)

AHDS Digital Preservation Glossary

THE BRITISH LIBRARY BOARD BLB 12/29

Transcription:

Archiving the Web: the mass preservation challenge Catherine Lupovici Chargée de Mission auprès du Directeur des Services et des Réseaux Bibliothèque nationale de France 1-, Koninklijke Bibliotheek, Den Haag 1

Preservation of the Web Initial concern has been to capture the Web to be able to save it. Emphasis put on the crawlers capabilities NEDLIB harvester Heritrix developed by International Internet Preservation consortium (IIPC) members Second concern has been to provide access to users: search and navigate through time Rendering problems identified Long term preservation is now coming on the front, with the web archives to be ingested in the Digital preservation repositories Denmark started to work on its trusted Digital repository with the Web archiving 1-, Koninklijke Bibliotheek, Den Haag 2

The Web content specificity Web content Classical types from different previous domains (Publishing, Institutional records, Audiovisual ) Continuous emerging types: blogs Web content technical characteristics & challenges Mass, continuing resources, dynamic content and streaming Surface web or visible web (open web content) and deep web (restricted access for the robots or non harvestable contents for instance accessible via data bases) Interlinking 1-, Koninklijke Bibliotheek, Den Haag 3

Links are part of the web content 1-, Koninklijke Bibliotheek, Den Haag 4

Web size estimates OCLC Web Characterization study: http://wcp.oclc.org 1998 2000 2002 Unique sites 2 636 000 7 128 000 8 712 000 Public sites 1 457 000 2 942 000 3 080 000 Indexable Web 200 million pages in 1997 (K.Bharat and A.Broder) 11,5 billion pages end January 2005 (A.Gulli and A.Signorini) A snapshot of.fr domain November 2006 271.7 million files, 2 928 000 hosts, 7.23 Tb 1-, Koninklijke Bibliotheek, Den Haag 5

Web archiving models Whole domain approach through broad crawl Internet Archive, National Library Sweden Selection: collection of specified site snapshots National Library of Australia, UKWAC Specific selection through thematic approach Presidential elections at LoC Deposit (legal or voluntary deposit codes) Netherlands e-deposit Combine approaches Denmark, France 1-, Koninklijke Bibliotheek, Den Haag 6

Web preservation characteristics Scale. Web archiving produces massive and technically heterogeneous digital collections even with a selective policy Packaging of data ARC files format that is used to store web crawls as sequences of content blocks harvested. Each block has a header with some metadata (mime file type) WARC which is an extension of ARC. Currently at the ISO Committee Draft status 1-, Koninklijke Bibliotheek, Den Haag 7

WARC format Ability to store arbitrary metadata linked to other stored data (e.g., subject classifier, discovered language, encoding) Support for data compression and maintenance of data record integrity Ability to store all control information from the harvesting protocol (e.g., request headers), not just response information Ability to store the results of data migrations linked to other stored data Ability to store a duplicate detection event Ability to store globally unique record identifiers Support for deterministic handling of long records (e.g., truncation, segmentation). 1-, Koninklijke Bibliotheek, Den Haag 8

International Internet Preservation Consortium IIPC launched in July 2003 http://netpreserve.org Technical standardisation at the domain scale: functional architecture and standard APIs, archival format, metadata, permanent identification Strong basic tools for all the chain from acquisition to access processes Open source, free licence tools Provide a forum for sharing knowledge about Internet content archiving both within the Consortium and beyond Working group on Preservation created in January 1-, Koninklijke Bibliotheek, Den Haag 9

IIPC preservation WG program of work Identification of applicable standards and tools not necessarily developed solely for web archives Test the use of existing format tools on ARC and WARC files Liaison with other related efforts Planets (EU funded) Web at Risk, California Digital Library (NDIIPP) Report by end of December 1-, Koninklijke Bibliotheek, Den Haag 10

Conclusion The work achieved so far leads to post process the collections to ingest them in the trusted repositories (a large scale migration test) The main question for the future is how to do it in the workflow of collecting, pre-ingest and ingest at the Web scale. 1-, Koninklijke Bibliotheek, Den Haag 11