Web Archiving Metadata
Prepared for RLG Working Group

The following document attempts to clarify what metadata is involved in, or required for, web archiving. It examines:

- The source of metadata
- The object the metadata applies to
- Existing standards the metadata may need to conform to

Some of the data listed in this document assumes that the Heritrix crawler is being used.

A page for web archiving metadata is on the CDL Web-at-Risk wiki. (Note that the web archiving objects diagram is simplified and should perhaps be expanded; a file can be a component of a page, and W/ARC files can be components of a crawl.)

NOTE: this is the source of much of the information that displays when you click the metadata button in the WERA display tool (see below).

Sample data: ARC Files (current Heritrix format)
url == e.g., "http://www.alexa.com:80/"
ip_address == e.g., 192.216.46.98
archive-date == date archived
content-type == MIME type of data (e.g., "text/html")
length == ASCII representation of the size of the network document, in bytes
date == YYYYMMDDhhmmss (Greenwich Mean Time)
result-code == result or response code (e.g., 200 or 302)
checksum == ASCII representation of a checksum of the data

WERA Metadata Screen

(Add a section detailing what metadata will be available with the forthcoming WARC format.)
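The ARC header fields listed above can be illustrated with a short parsing sketch. This assumes the space-separated, one-line record header layout of an ARC version-1 file (url, IP address, archive date, content type, length); the function name and the sample header line are illustrative, not part of any official library.

```python
# Sketch: split one ARC v1 record header line into the named fields
# described in this document. Assumes five space-separated values.

def parse_arc_header(line):
    """Parse an ARC version-1 record header line into a dict."""
    url, ip_address, archive_date, content_type, length = line.strip().split(" ")
    return {
        "url": url,
        "ip_address": ip_address,
        "archive-date": archive_date,   # YYYYMMDDhhmmss, Greenwich Mean Time
        "content-type": content_type,   # MIME type, e.g. "text/html"
        "length": int(length),          # size of the network doc in bytes
    }

# Hypothetical sample header line.
header = "http://www.alexa.com:80/ 192.216.46.98 20060301120000 text/html 12345"
record = parse_arc_header(header)
print(record["content-type"], record["length"])
```

Real ARC files also carry a filedesc record and, in later versions, additional header fields (such as result code and checksum), so production code should use a dedicated ARC/WARC reading library rather than hand parsing.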
Sample: Archive-It collection-level metadata.

Suggestion: poll the following to see what metadata they currently allow curators to enter, and what they plan to allow:

- Archive-It Web Archiving Service
- Web Curator Tool (IIPC, New Zealand)
- NetArchiveDK (Denmark)
- ECHO DEPository (OCLC)
Possibilities:

- html <title>, as distinct from what the curator determines as the site name
- html <meta> tag contents
- topics and keywords derived from text-analysis tools; recurrence of terms can imply different things at the page level vs. the site level

Some samples, and some of the data they contain. Descriptions taken from the Heritrix User Manual: http://crawler.archive.org/articles/user_manual.html

order.xml
Heritrix report specifying exactly what capture settings were used to gather material. This is critical for determining whether the copy you have in your archive might be a complete capture of the site.

hosts-report.txt
Contains an overview of which hosts were crawled and how many documents and bytes were downloaded from each.

mimetype-report.txt
Contains an overview of the number of documents downloaded per MIME type (e.g., pdf, html), along with the amount of data downloaded per MIME type.

responsecode-report.txt
Contains an overview of the number of documents downloaded per Heritrix status code; see the Status Codes list for further information.

crawl.log
The crawl log is the most detailed account of capture activity, providing a separate line of information for every URL attempted. This includes a timestamp for the moment the capture was attempted, a status code indicating whether the capture was successful or encountered errors, the document size, the URL of the document, a discovery path code explaining how the document was captured, and more. For a guide to interpreting the capture log, see the Heritrix documentation for log files; scroll to section 8.2.1: Crawl.log.

Note: there are some gems lurking in the crawl log in terms of useful metadata for helping curators understand why certain pages were captured, particularly the discovery path code. These are more likely to be sources of preservation metadata pertaining to file formats, HTML validation, etc.
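The crawl.log fields described above can be pulled out with a few lines of code. This is a minimal sketch, assuming the first five whitespace-separated columns of a line are timestamp, status code, size, URL, and discovery path; the sample line is invented, and the authoritative column layout is in the Heritrix log-file documentation (section 8.2.1).

```python
# Sketch: extract the per-URL capture metadata this document describes
# (timestamp, status, size, URL, discovery path) from one crawl.log line.
# Column positions are an assumption; verify against the Heritrix docs.

def parse_crawl_log_line(line):
    fields = line.split()
    return {
        "timestamp": fields[0],       # when the capture was attempted
        "status": int(fields[1]),     # Heritrix/HTTP status code
        "size": fields[2],            # document size in bytes ("-" if unknown)
        "url": fields[3],             # URL of the document
        "discovery_path": fields[4],  # how the document was found (e.g. "LLE")
    }

# Hypothetical crawl.log line.
sample = ("2006-03-01T12:43:13.002Z 200 4352 http://www.example.org/index.html "
          "LLE http://www.example.org/ text/html #004 - - - -")
entry = parse_crawl_log_line(sample)
print(entry["status"], entry["discovery_path"])
```

A curator-facing tool could aggregate the discovery-path column this way to explain why particular pages ended up in a capture.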
Relevant Standards

METS

The METS profile in use by CDL's Web Archiving Service is posted on the wiki page. The NDIIPP ECHO DEPository group is also currently developing a METS profile for web archiving. While both profiles are works in progress, they already represent different approaches to describing the results of a capture.

The CDL approach is a lightweight one, using the METS file to link to the Heritrix logs and reports that provide details about capture results. CDL does not attempt to repeat or describe the capture results within the METS file. Captures are preserved as intact .arc files in the CDL repository.

The ECHO DEPository approach is more finely grained, breaking the .arc file into separate components and using the METS structMap to link each of the individual files.
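The "lightweight" linking approach described above can be sketched as a METS fileSec whose <file> entries simply point at the intact .arc file and the Heritrix logs and reports, rather than describing capture results inline. The element names below follow the METS schema, but the file names and grouping are illustrative assumptions, not the actual CDL profile.

```python
# Sketch: build a minimal METS fragment that links to capture outputs
# (the .arc file plus Heritrix logs/reports) instead of describing them.
# File names and the USE attribute value are hypothetical.

import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)

mets = ET.Element(f"{{{METS_NS}}}mets")
file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", USE="capture")

# Hypothetical outputs of a single crawl.
for file_id, href in [
    ("ARC1", "crawl-20060301.arc"),
    ("LOG1", "crawl.log"),
    ("RPT1", "hosts-report.txt"),
]:
    f = ET.SubElement(file_grp, f"{{{METS_NS}}}file", ID=file_id)
    ET.SubElement(f, f"{{{METS_NS}}}FLocat",
                  attrib={"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href})

mets_xml = ET.tostring(mets, encoding="unicode")
print(mets_xml)
```

The finer-grained ECHO DEPository approach would instead enumerate each component file extracted from the .arc and wire them together through a structMap.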