Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web NLA Gordon Mohr March 28, 2012
Overview
- The tools:
  - Heritrix crawler
  - Wayback browse access
  - Lucene/Hadoop utilities: JBs (indexing) and TNH (searching)
- Example uses:
  - CDL Web Archive Service
  - NetArchive Suite (DK, FR, AT)
  - archive.org worldwide archive & Archive-It
Internet Archive
- Established in 1996
- 501(c)(3) non-profit organization
- Over seven petabytes (compressed) of publicly accessible archival material
- Technology partner to libraries, archives, museums, universities, research institutes, and memory institutions
- Archiving books, texts, film, video, audio, images, software, educational content and
> 175 billion captures (URL+datetime) > 2+ petabytes compressed > 15 years (1996-)
- Collects anything accessible to the public
- Obeys robots.txt restrictions
- Respects rightsholder/site-owner takedown requests
Web Archiving Partners
Heritrix crawling
What is Heritrix?
- Open-source, archival-quality, flexible, extensible, web-scale web crawling software
- http://crawler.archive.org
Heritrix major components
- Scope / DecideRules
  - URIs in or out
- Frontier
  - URI queues, queues of queues, seen-set
- Processors
  - Prep, Fetch, Extract, Write, Schedule, etc.
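The division of labour among scope, frontier, and processors can be illustrated with a toy crawl loop. This is a minimal Python sketch and not Heritrix code: the names `in_scope`, `Frontier`, and `crawl` are hypothetical, and the "web" is an in-memory dictionary mapping each URI to the links found on it, so nothing is fetched over the network.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def in_scope(uri, allowed_hosts):
    """Scope rule: rule a URI in only if it is http(s) on an allowed host."""
    parts = urlsplit(uri)
    return parts.scheme in ("http", "https") and parts.hostname in allowed_hosts


class Frontier:
    """One FIFO queue per host, a rotating queue of those queues, and a seen-set."""

    def __init__(self):
        self.queues = OrderedDict()  # host -> deque of pending URIs
        self.seen = set()            # every URI ever scheduled

    def schedule(self, uri):
        if uri in self.seen:
            return False
        self.seen.add(uri)
        host = urlsplit(uri).hostname
        self.queues.setdefault(host, deque()).append(uri)
        return True

    def next_uri(self):
        """Take one URI from the host at the front, then rotate that host to the back."""
        if not self.queues:
            return None
        host, queue = next(iter(self.queues.items()))
        uri = queue.popleft()
        del self.queues[host]
        if queue:
            self.queues[host] = queue
        return uri


def crawl(seeds, fake_web, allowed_hosts):
    """Processor chain in miniature: fetch, extract, write, schedule."""
    frontier = Frontier()
    for seed in seeds:
        if in_scope(seed, allowed_hosts):
            frontier.schedule(seed)
    written = []
    while True:
        uri = frontier.next_uri()
        if uri is None:
            break
        links = fake_web.get(uri, [])   # "fetch" and "extract"
        written.append(uri)             # "write"
        for link in links:              # "schedule"
            if in_scope(link, allowed_hosts):
                frontier.schedule(link)
    return written
```

Rotating hosts after every URI is one simple way to keep a single large site from monopolizing the crawl; a real crawler would also add politeness delays per queue.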
Heritrix writes ARCs or WARCs
- Both: sequence of content blocks, each introduced by a small text header
- ARCs:
  - 1-line header
  - verbatim protocol response
- WARCs add:
  - multi-line header with extensible fields
  - new record types: Request, Response, Resource, Metadata, Revisit, Conversion, Warcinfo, Continuation
  - ISO standardization
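To make the "multi-line header, then content block" layout concrete, here is a minimal Python sketch that builds and re-reads a single uncompressed WARC-style record. It is an illustration under assumptions, not how Heritrix writes files: the helper names `build_warc_record` and `parse_warc_record` are hypothetical, only a handful of header fields are emitted, and real WARC files are normally gzip-compressed per record.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type, target_uri, block, when=None):
    """Return one record: version line, named header fields, blank line, content block."""
    when = when or datetime.now(timezone.utc)
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Length", str(len(block))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    # Each record ends with two CRLFs after the content block.
    return head.encode("utf-8") + block + b"\r\n\r\n"


def parse_warc_record(data):
    """Split a single record into (version, header fields, content block)."""
    head, _, rest = data.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version, fields = lines[0], {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    length = int(fields["Content-Length"])
    return version, fields, rest[:length]


# The content block of a response record is the protocol response, stored verbatim.
http_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
record = build_warc_record("response", "http://example.com/", http_response)
version, fields, block = parse_warc_record(record)
```

Because `Content-Length` gives the exact block size, a reader can skip from record to record without interpreting the archived content itself.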
Heritrix vs. other web copiers
- Powerful (but complicated) config
  - pluggable extractors, fetchers
- Not optimized for site/hostname-centric use
  - bulk content mixed together
- Content never unrolled or rewritten
  - requires access tools (Wayback)
- Good options for giant crawls
  - millions of sites, 100s of TB
Wayback browsing
What is Wayback?
- Open-source, Java, modular, scalable, customizable web archive access tool
- http://archive-access.sourceforge.net/projects/wayback
Wayback Features
- Starting with a URL:
  - See list of captures by date
  - See extension URLs (same site)
  - View a capture
- Once browsing ("replay"):
  - Browse web as it was
  - Best-match clickthroughs
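A "best-match clickthrough" can be pictured as follows: when a link is followed during replay, the target is resolved to the capture closest in time to the page being viewed. The Python sketch below shows that idea only. The function names, the 14-digit `YYYYMMDDhhmmss` timestamps, the `prefix/timestamp/url` address layout, and the `wayback.example.org` prefix are all assumptions made for the example, not a description of Wayback's internals.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a 14-digit YYYYMMDDhhmmss timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def best_match(capture_timestamps, requested):
    """Pick the capture whose time is nearest to the requested time."""
    target = parse_ts(requested)
    return min(capture_timestamps, key=lambda ts: abs(parse_ts(ts) - target))


def archival_url(prefix, timestamp, url):
    """Compose a replay address of the assumed form prefix + timestamp + '/' + original URL."""
    return "%s%s/%s" % (prefix, timestamp, url)


captures = ["20100101000000", "20120301120000", "20120601000000"]
chosen = best_match(captures, "20120328000000")
replay = archival_url("http://wayback.example.org/wayback/", chosen, "http://example.com/")
```

Here the request for 28 March 2012 resolves to the 1 March 2012 capture, because it is nearer in time than the June capture.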
Wayback: Modular Components
- Query User Interface: Calendar, Search Engine, XML
- Replay User Interface: Archival URL, Timeline, Proxy
- Resource Index: CDX, BDB, Remote, Aggregated
- Resource Store: Local ARC, HTTP 1.1 Remote ARC
Wayback vs. other access
- Many deployment configurations
- All replay handled at browse-time
  - issues fixed in code or tolerated
- Many UI customizations
Wayback: Memento
- http://www.mementoweb.org/
- Collaboration
  - Los Alamos National Lab
  - Old Dominion University
  - Library of Congress
- APIs for time dimension
  - not just external archives
  - API for Wayback
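In Memento, the time dimension is expressed through HTTP: a client asks for a resource as of a given moment by sending an `Accept-Datetime` request header. The Python sketch below only prepares such a request and does not send it. The function name `timegate_request` and the `wayback.example.org/timegate/` endpoint are hypothetical; the address pattern a given Wayback deployment exposes may differ.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def timegate_request(timegate_prefix, original_url, when):
    """Return (url, headers) for a datetime-negotiated request.

    `when` should be a timezone-aware datetime; it is converted to GMT
    and formatted as an HTTP date.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return timegate_prefix + original_url, {"Accept-Datetime": http_date}


url, headers = timegate_request(
    "http://wayback.example.org/timegate/",   # assumed endpoint
    "http://example.com/",
    datetime(2012, 3, 28, 0, 0, 0, tzinfo=timezone.utc),
)
```

The returned pair can be passed to any HTTP client, for example `urllib.request.Request(url, headers=headers)`.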
Formats
- ARC/WARC
- CDX
  - simple, flat file indexes
- WAT
  - web-capture specific metadata
  - Data exchange and analysis
  - Less than full WARC, more than CDX
  - JSON
  - Minimizes data exchange worries: copyright, privacy
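A flat CDX index is simple enough to read with a few lines of code. The Python sketch below assumes a space-delimited file whose first line declares the field letters, and assumes the legend in `FIELD_NAMES`; actual CDX files can declare different layouts. The sample records, digests, and file names are invented for illustration, and `parse_cdx` and `locate` are hypothetical helper names.

```python
# Assumed legend for the field letters used in the example header below.
FIELD_NAMES = {
    "N": "urlkey", "b": "timestamp", "a": "original", "m": "mimetype",
    "s": "status", "k": "digest", "r": "redirect", "M": "metatags",
    "S": "length", "V": "offset", "g": "filename",
}


def parse_cdx(lines):
    """Parse CDX text: the first line names the fields, later lines are records."""
    lines = [ln for ln in lines if ln.strip()]
    header = lines[0].split()
    if not header or header[0] != "CDX":
        raise ValueError("missing CDX header line")
    names = [FIELD_NAMES.get(letter, letter) for letter in header[1:]]
    records = []
    for line in lines[1:]:
        values = line.split()
        if len(values) != len(names):
            raise ValueError("unexpected field count: %r" % line)
        records.append(dict(zip(names, values)))
    return records


def locate(records, urlkey, timestamp):
    """Return (filename, offset) for an exact urlkey + timestamp match, else None."""
    for rec in records:
        if rec["urlkey"] == urlkey and rec["timestamp"] == timestamp:
            return rec["filename"], int(rec["offset"])
    return None


# Invented sample data: two captures of the same URL in different files.
SAMPLE = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20100101000000 http://example.com/ text/html 200 DIGESTA - - 1043 512 crawl-0001.warc.gz",
    "com,example)/ 20120328000000 http://example.com/ text/html 200 DIGESTB - - 1100 9000 crawl-0042.warc.gz",
]
records = parse_cdx(SAMPLE)
where = locate(records, "com,example)/", "20120328000000")
```

Each record pairs a URL and capture time with the file and byte offset where the content lives, which is all an access tool needs to fetch one capture without scanning whole archive files.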
Lucene/Hadoop-based utils: JBs (indexing) TNH (searching)
Lucene & Hadoop
- Open Source
- Java
- Full-Text Indexing
- Bulk Processing (Map-Reduce)
- Bulk Storage (HDFS)
- Large ecosystem
Hadoop
- HDFS
  - Distributed storage
  - Durable, default 3x replication
  - Scalable: Yahoo! 60+PB HDFS
- MapReduce
  - Distributed computation, Java jobs
  - Hadoop distributes work across cluster
  - Tolerates & retries failures
- & more
  - Pig, HBase, Mahout, Hue
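The MapReduce model itself fits in a few lines. The sketch below runs the map, group, and reduce steps in a single Python process to build a tiny inverted index (word to document ids). It does not use the Hadoop API, and `map_phase`, `reduce_phase`, and `run_job` are hypothetical names; on a cluster, Hadoop would distribute the same steps across machines and retry failed tasks.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair per word in a document."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_phase(word, doc_ids):
    """Reduce: collapse all values for one key into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))


def run_job(docs):
    """Group mapped pairs by key (the 'shuffle'), then reduce each group."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_phase(doc_id, text):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())


index = run_job({"d1": "web archive web", "d2": "web crawl"})
```

Because each map call sees only one document and each reduce call sees only one key, the work can be split across many machines without the functions knowing about each other.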
JBs/TNH Background
- Lucene
  - Open-source Java full-text indexing
  - Popular, mature
- Nutch
  - Extensions to Lucene
  - For web content, access, scale
- Hadoop
  - Spun off from Nutch
  - Inspired by Google's Map-Reduce
JBs/TNH
- Replaces the earlier NutchWax
- JBs: utilities for bulk Lucene indexing
  - ARCs/WARCs
  - dates, duplicates
- TNH: OpenSearch service
  - efficient collapsing
  - query reformulation
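"Collapsing" in a search service usually means limiting how many results any one site contributes, so a single large site cannot fill the whole result page. The slide does not describe how TNH does this; the Python sketch below is one simple way to get the effect, with the hypothetical function `collapse_by_site` and invented hits given as `(score, url)` pairs.

```python
from urllib.parse import urlsplit


def collapse_by_site(hits, per_site=1):
    """Keep at most `per_site` hits per host, preferring higher scores."""
    kept, counts = [], {}
    for score, url in sorted(hits, key=lambda hit: hit[0], reverse=True):
        host = urlsplit(url).hostname
        if counts.get(host, 0) < per_site:
            kept.append((score, url))
            counts[host] = counts.get(host, 0) + 1
    return kept


hits = [
    (0.9, "http://a.example/page1"),
    (0.8, "http://a.example/page2"),
    (0.7, "http://b.example/"),
    (0.6, "http://a.example/page3"),
]
top = collapse_by_site(hits, per_site=1)
```

With `per_site=1`, only the best hit from each host survives, in descending score order.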
The Ecosystem
- Each tool stewarded at IA
- Sponsorship by partners
- Driven by projects-of-the-moment
- Use by many institutions
  - CDL Web Archive Service
  - NetArchive Suite
  - archive.org Wayback Machine, Archive-It
Thank You Gordon Mohr Internet Archive Web Group gojomo@archive.org