How To Understand Web Archiving Metadata




Web Archiving Metadata
Prepared for RLG Working Group

The following document attempts to clarify what metadata is involved in, or required for, web archiving. It examines:

  The source of metadata
  The object the metadata applies to
  Existing standards the metadata may need to conform to

Some of the data listed in this document assume that the Heritrix crawler is being used.

A page for web archiving metadata is on the CDL Web-at-Risk wiki. (Note that the web archiving objects diagram is simplified and should perhaps be expanded; a file can be a component of a page, and W/ARC files can be components of a crawl.)

Sample data: ARC Files (current Heritrix format)

NOTE: the ARC file is the source of much of the information that displays when you click the metadata button in the WERA display tool (see below).

  url == e.g., "http://www.alexa.com:80/"
  ip_address == e.g., 192.216.46.98
  archive-date == date archived
  content-type == MIME type of data (e.g., "text/html")
  length == ASCII representation of size of network doc in bytes
  date == YYYYMMDDhhmmss (Greenwich Mean Time)
  result-code == result code or response code (e.g., 200 or 302)
  checksum == ASCII representation of a checksum of the data

WERA Metadata Screen

(Add section detailing what metadata will be available with forthcoming WARC format.)
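The field list above can be illustrated with a minimal sketch of reading one ARC record header. It assumes the five-field, space-separated version 1 layout (url, ip_address, archive-date, content-type, length); files that also carry result-code and checksum use a longer layout that this sketch does not handle. The names ArcRecordHeader and parse_arc_header are hypothetical, and the date and length in the sample line are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ArcRecordHeader:
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one space-separated ARC header line (assumed v1 layout)."""
    parts = line.strip().split(" ")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    url, ip_address, archive_date, content_type, length = parts
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        # Dates are YYYYMMDDhhmmss in Greenwich Mean Time.
        archive_date=datetime.strptime(archive_date, "%Y%m%d%H%M%S").replace(
            tzinfo=timezone.utc
        ),
        content_type=content_type,
        length=int(length),
    )


# Hypothetical header line: the date and length values are made up.
header = parse_arc_header(
    "http://www.alexa.com:80/ 192.216.46.98 19961104142103 text/html 2514"
)
print(header.url, header.archive_date.isoformat(), header.length)
```

Keeping the date as a timezone-aware value makes it harder to confuse capture time with local time when records from several crawls are compared.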

Sample: Archive-It collection-level metadata.

Suggestion: poll the following to see what metadata they currently allow curators to enter, and what they plan:

  Archive-It
  Web Archiving Service
  Web Curator Tool (IIPC, New Zealand)
  NetArchiveDK (Denmark)
  ECHO DEPository (OCLC)

Possibilities:

  html <title>, as distinct from what the curator determines as the site name.
  html <meta> tag contents.
  Topics and keywords derived from text analysis tools. Recurrence of terms can imply different things on page level vs site level.

Some samples, and some of the data they contain. Descriptions taken from the Heritrix User Manual: http://crawler.archive.org/articles/user_manual.html

order.xml
Heritrix report specifying exactly what capture settings were used to gather material. This is critical for determining whether the copy you have in your archive might be a complete capture of the site.

hosts-report.txt
Contains an overview of what hosts were crawled and how many documents and bytes were downloaded from each.

mimetype-report.txt
Contains an overview of the number of documents downloaded per MIME type (e.g., pdf, html). Also has the amount of data downloaded per MIME type.

responsecode-report.txt
Contains an overview of the number of documents downloaded per Heritrix status code (see the Status codes list for further information).

crawl.log
The crawl log is the most detailed account of capture activity, providing a separate line of information for every URL attempted. This includes a timestamp for the moment the capture was attempted, a status code indicating whether the capture was successful or encountered errors, the document size, the URL of the document, a discovery path code explaining how the document was captured, and more. For a guide to interpreting the capture log, see the Heritrix documentation for log files; scroll to section 8.2.1: Crawl.log.

Note: there are some gems lurking in the crawl log in terms of useful metadata for helping curators understand why certain pages were captured, particularly the discovery path code.

These are more likely to be sources of preservation metadata pertaining to file formats, HTML validation etc.
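As a rough illustration of the crawl.log fields described above, the sketch below reads the first five whitespace-separated columns of a line (timestamp, status code, size, URL, discovery path) and spells out the discovery path. The column order and the hop letters (L, E, R, X, P) follow the Heritrix documentation as best understood here and should be checked against the crawler version in use. The names CrawlLogEntry, parse_crawl_log_line and explain_path are hypothetical, and the sample line is invented.

```python
from typing import List, NamedTuple, Optional

# Discovery path letters; assumed from the Heritrix log documentation.
HOP_TYPES = {
    "L": "link",
    "E": "embed",
    "R": "redirect",
    "X": "speculative embed",
    "P": "prerequisite",
}


class CrawlLogEntry(NamedTuple):
    timestamp: str
    status: int
    size: Optional[int]
    url: str
    discovery_path: str


def parse_crawl_log_line(line: str) -> CrawlLogEntry:
    """Read the first five columns of a crawl.log line; later columns are ignored."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected at least 5 whitespace-separated fields")
    timestamp, status, size, url, path = fields[:5]
    # The size column may hold "-" when nothing was downloaded.
    return CrawlLogEntry(
        timestamp, int(status), int(size) if size.isdigit() else None, url, path
    )


def explain_path(path: str) -> List[str]:
    """Translate a discovery path such as 'LE' into readable hop names."""
    return [HOP_TYPES.get(hop, "unknown") for hop in path if hop != "-"]


# Hypothetical log line: an image embedded in a page reached by a link.
entry = parse_crawl_log_line(
    "2007-03-01T12:00:00.000Z 200 5120 http://www.example.org/logo.gif LE "
    "http://www.example.org/about.html image/gif"
)
print(entry.url, entry.status, explain_path(entry.discovery_path))
```

A curator-facing display could show the translated path next to each captured URL to explain why a page is in the archive.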

Relevant Standards

METS

The METS profile in use by CDL's Web Archiving Service is posted on the wiki page. The NDIIPP ECHO DEPository group is also currently developing a METS profile for web archiving. While both profiles are works in progress, they already represent different approaches to describing the results of a capture.

The CDL approach is a lightweight one, using the METS file to link to the Heritrix logs and reports that provide details about capture results. CDL does not attempt to repeat or describe the capture results within the METS file. Captures are preserved as intact .arc files in the CDL repository.

The ECHO DEPository approach is more finely grained, breaking the .arc file into separate components and using the METS struct map to link each of the individual files.
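To make the lightweight approach more concrete, here is a minimal sketch of a METS document that only points at capture files rather than describing their contents. It is not the CDL profile or the ECHO DEPository profile; the element layout is a generic use of the METS fileSec and structMap, and the function name build_mets, the file IDs, the USE and TYPE values, and the file names are all hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(capture_files: dict) -> ET.Element:
    """Build a bare METS tree that links to each capture file by reference."""
    mets = ET.Element(f"{{{METS_NS}}}mets")
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "capture"})
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
    div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"TYPE": "crawl"})
    for file_id, href in capture_files.items():
        # One file entry per artifact, pointing at it instead of embedding it.
        file_el = ET.SubElement(file_grp, f"{{{METS_NS}}}file", {"ID": file_id})
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href},
        )
        # The struct map refers back to the file by its ID.
        ET.SubElement(div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return mets


# Hypothetical file names for one crawl.
mets = build_mets(
    {"arc-1": "capture-0001.arc", "crawl-log": "crawl.log", "order": "order.xml"}
)
print(ET.tostring(mets, encoding="unicode"))
```

A finer-grained profile would instead add one file entry, and one struct map pointer, for each component extracted from the .arc file.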