Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web. NLA Gordon Mohr March 28, 2012

1 Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web NLA Gordon Mohr March 28, 2012

2 Overview
The tools:
- Heritrix: crawler
- Wayback: browse access
- Lucene/Hadoop utilities: JBs (indexing) and TNH (searching)
Example uses:
- CDL Web Archive Service
- NetArchive Suite (DK, FR, AT)
- archive.org worldwide archive & Archive-It

3 Internet Archive
- Established in 1996
- 501(c)(3) non-profit organization
- Over seven petabytes (compressed) of publicly accessible archival material
- Technology partner to libraries, archives, museums, universities, research institutes, and memory institutions
- Archiving books, texts, film, video, audio, images, software, educational content and

4
> 175 billion captures (URL+datetime)
> 2+ petabytes compressed
> 15 years (1996-)

5
- Collects anything accessible to the public
- Obeys robots.txt restrictions
- Respects rightsholder/site-owner takedown requests
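
As an illustration of the robots.txt check mentioned above, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not how Heritrix implements the check; the robots.txt content, the user-agent name `example-archiver`, and the URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/a.html"):
        print(url, may_fetch(ROBOTS_TXT, "example-archiver", url))
```

A crawler would run a check like this before fetching each URL and skip any URL the site has disallowed.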

6 Web Archiving Partners

7 Heritrix crawling

8 What is Heritrix?
Open-source, archival-quality, flexible, extensible, web-scale web crawling software

9 Heritrix major components
- Scope / DecideRules: URIs in or out
- Frontier: URI queues, queues of queues, seen-set
- Processors: Prep, Fetch, Extract, Write, Schedule, etc.
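
The three components can be pictured with a toy crawl loop. This is a conceptual sketch only, not Heritrix code or its API: the `Frontier` class, `in_scope`, `crawl`, and the `FAKE_WEB` link table are hypothetical names invented here, and the "fetch" step reads from an in-memory dictionary rather than the network.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical in-memory "web": URL -> list of outlinks on that page.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/x", "http://b.example/",
                          "http://other.test/"],
    "http://a.example/x": ["http://a.example/"],
    "http://b.example/": [],
}


def in_scope(url, allowed_hosts):
    """Scope rule: a URI is 'in' only if its host is on the allowed list."""
    return urlsplit(url).hostname in allowed_hosts


class Frontier:
    """Per-host URI queues, a queue of ready hosts, and a seen-set."""

    def __init__(self):
        self.queues = {}      # host -> deque of pending URLs
        self.ready = deque()  # hosts that currently have pending URLs
        self.seen = set()     # every URL ever scheduled

    def schedule(self, url):
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).hostname
        queue = self.queues.setdefault(host, deque())
        if not queue:  # host had nothing pending, so mark it ready
            self.ready.append(host)
        queue.append(url)
        return True

    def next(self):
        if not self.ready:
            return None
        host = self.ready.popleft()
        url = self.queues[host].popleft()
        if self.queues[host]:  # host still has work: rotate to the back
            self.ready.append(host)
        return url


def crawl(seeds, allowed_hosts, web=FAKE_WEB):
    """Processor chain in miniature: fetch, extract, write, schedule."""
    frontier = Frontier()
    for seed in seeds:
        if in_scope(seed, allowed_hosts):
            frontier.schedule(seed)
    written = []
    while (url := frontier.next()) is not None:
        outlinks = web.get(url, [])       # "fetch" + "extract"
        written.append(url)               # "write"
        for link in outlinks:             # "schedule"
            if in_scope(link, allowed_hosts):
                frontier.schedule(link)
    return written
```

Calling `crawl(["http://a.example/"], {"a.example", "b.example"})` visits the three in-scope pages once each and never schedules `http://other.test/`.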

10 Heritrix writes ARCs or WARCs
- Both: sequence of content blocks, each introduced by a small text header
- ARCs: 1-line header, then the verbatim protocol response
- WARCs add:
  - multi-line header with extensible fields
  - new record types: Request, Response, Resource, Metadata, Revisit, Conversion, Warcinfo, Continuation
  - ISO standardization
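
To make the "multi-line header plus content block" idea concrete, here is a minimal sketch that builds one uncompressed WARC/1.0 `resource` record by hand. It is an illustration, not Heritrix's writer: the function name `warc_record` and the example URI are assumptions, and real WARC files are normally gzip-compressed per record and carry more header fields.

```python
import uuid
from datetime import datetime, timezone


def warc_record(record_type: str, target_uri: str, payload: bytes,
                content_type: str, when: datetime) -> bytes:
    """Build one uncompressed WARC/1.0 record as bytes.

    Layout: version line, named header fields, blank line,
    content block, then two CRLFs ending the record.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {stamp}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return header + payload + b"\r\n\r\n"


if __name__ == "__main__":
    record = warc_record(
        "resource",
        "http://example.com/hello.txt",   # hypothetical URI
        b"hello",
        "text/plain",
        datetime(2012, 3, 28, tzinfo=timezone.utc),
    )
    print(record.decode("utf-8"))
```

An ARC record would carry the same payload behind a single space-separated header line instead of the named fields shown here.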

11 Heritrix vs. other web copiers
- Powerful (but complicated) config: pluggable extractors, fetchers
- Not optimized for site/hostname-centric use: bulk content mixed together
- Content never unrolled or rewritten: requires access tools (Wayback)
- Good options for giant crawls: millions of sites, 100s of TB

12 Wayback browsing

13 What is Wayback?
Open-source, Java, modular, scalable, customizable web archive access tool

14 Wayback Features
Starting with a URL:
- See list of captures by date
- See extension URLs (same site)
- View a capture
Once browsing ("replay"):
- Browse web as it was
- Best-match clickthroughs
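
"Best-match" replay can be pictured as picking the capture closest in time to the one being viewed. The sketch below is a generic illustration and not Wayback's actual code: the function names and the `http://wayback.example/web/` prefix are assumptions, and captures are represented only by their 14-digit `YYYYMMDDhhmmss` timestamps.

```python
from bisect import bisect_left
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse a 14-digit YYYYMMDDhhmmss capture timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def closest_capture(sorted_timestamps, wanted: str) -> str:
    """Return the capture timestamp nearest to `wanted`.

    `sorted_timestamps` must be sorted ascending; fixed-width digit
    strings sort in chronological order.
    """
    if not sorted_timestamps:
        raise ValueError("no captures for this URL")
    i = bisect_left(sorted_timestamps, wanted)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_timestamps[max(i - 1, 0):i + 1]
    target = parse_ts(wanted)
    return min(candidates, key=lambda ts: abs(parse_ts(ts) - target))


def archival_url(prefix: str, timestamp: str, url: str) -> str:
    """Build an archival-style replay URL: <prefix>/<timestamp>/<original URL>."""
    return f"{prefix.rstrip('/')}/{timestamp}/{url}"


if __name__ == "__main__":
    captures = ["20100101000000", "20120301000000", "20120401120000"]
    best = closest_capture(captures, "20120328000000")
    print(archival_url("http://wayback.example/web/", best, "http://example.com/"))
```

When a user clicks a link inside a replayed page, a lookup like this lets the tool serve the nearest capture of the target even when no capture exists for the exact same moment.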

15 Wayback: Modular Components
- Query User Interface: Calendar, Search Engine, XML
- Replay User Interface: Archival URL, Timeline, Proxy
- Resource Index: CDX, BDB, Remote, Aggregated
- Resource Store: Local ARC, HTTP 1.1 Remote ARC

16 Wayback vs. other access
- Many deployment configurations
- All replay handled at browse-time: issues fixed in code or tolerated
- Many UI customizations

17 Wayback: Memento
- Collaboration: Los Alamos National Lab, Old Dominion University, Library of Congress
- APIs for time dimension, not just external archives
- API for Wayback
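
In the Memento protocol, a client expresses the "time dimension" by sending an `Accept-Datetime` request header in HTTP-date format. The sketch below only builds that header and a request URL; it makes no network call, and the TimeGate prefix `http://timegate.example/` is a placeholder, not a real Wayback endpoint.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(when: datetime) -> dict:
    """Build the Accept-Datetime header for a Memento-style request."""
    utc = when.astimezone(timezone.utc)
    # usegmt=True yields the HTTP-date form ending in "GMT".
    return {"Accept-Datetime": format_datetime(utc, usegmt=True)}


def timegate_url(timegate_prefix: str, original_url: str) -> str:
    """Join a (hypothetical) TimeGate prefix with the original URL."""
    return timegate_prefix.rstrip("/") + "/" + original_url


if __name__ == "__main__":
    when = datetime(2012, 3, 28, tzinfo=timezone.utc)
    print(timegate_url("http://timegate.example/", "http://example.com/"))
    print(memento_headers(when))
```

A server that supports the protocol answers such a request by pointing the client at the capture nearest to the requested datetime.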

18 Formats
- ARC/WARC
- CDX: simple, flat file indexes
- WAT: web-capture specific metadata
  - Data exchange and analysis
  - Less than full WARC, more than CDX
  - JSON
  - Minimizes data exchange worries: copyright, privacy
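
A CDX file is a sorted, space-separated text index with one capture per line. The sketch below parses one line, assuming the common 11-field layout (URL key, timestamp, original URL, MIME type, status, digest, redirect, meta tags, compressed size, offset, file name). CDX field order varies between deployments and is declared in each file's header line, so treat this layout, the `CdxRecord` name, and the sample line as assumptions.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    """One capture, under the assumed 11-field CDX layout."""
    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str
    redirect: str
    metatags: str
    length: str
    offset: str
    filename: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one CDX line into named fields; '-' marks an empty field."""
    fields = line.split()
    if len(fields) != len(CdxRecord._fields):
        raise ValueError(f"expected {len(CdxRecord._fields)} fields, got {len(fields)}")
    return CdxRecord(*fields)


if __name__ == "__main__":
    # Made-up sample line for illustration.
    sample = ("com,example)/ 20120328000000 http://example.com/ text/html 200 "
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 - - 1234 5678 example.warc.gz")
    rec = parse_cdx_line(sample)
    print(rec.timestamp, rec.filename, rec.offset)
```

The `filename` and `offset` fields are what let an access tool jump straight to the right record inside a large ARC or WARC file.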

19 Lucene/Hadoop-based utils: JBs (indexing), TNH (searching)

20 Lucene & Hadoop
- Open Source
- Java
- Full-Text Indexing
- Bulk Processing (Map-Reduce)
- Bulk Storage (HDFS)
- Large ecosystem

21 Hadoop
- HDFS: distributed storage
  - Durable, default 3x replication
  - Scalable: Yahoo! 60+PB HDFS
- MapReduce: distributed computation, Java jobs
  - Hadoop distributes work across cluster
  - Tolerates & retries failures
- & more: Pig, HBase, Mahout, Hue
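
The MapReduce model itself is small enough to show in a few lines. The following is a single-process sketch of the map, shuffle, and reduce phases, counting captures per host; it does not use Hadoop, and `run_job`, `map_capture`, and `reduce_counts` are names invented for the example. Real Hadoop jobs are typically written in Java, with the framework handling distribution and retries.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_capture(url):
    """Map phase: emit (host, 1) for each captured URL."""
    yield urlsplit(url).hostname, 1


def reduce_counts(host, values):
    """Reduce phase: sum all counts emitted for one host."""
    return host, sum(values)


def run_job(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


if __name__ == "__main__":
    captures = ["http://a.example/1", "http://b.example/", "http://a.example/2"]
    print(run_job(captures, map_capture, reduce_counts))
```

Because each map call looks at one record and each reduce call at one key, a cluster can split both phases across many machines and rerun any piece that fails.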

22 JBs/TNH Background
- Lucene: open-source Java full-text indexing; popular, mature
- Nutch: extensions to Lucene for web content, access, scale
- Hadoop: spun off from Nutch; inspired by Google's Map-Reduce

23 JBs/TNH
- Replaces the earlier NutchWax
- JBs: utilities for bulk Lucene indexing
  - ARCs/WARCs
  - dates, duplicates
- TNH: OpenSearch service
  - efficient collapsing
  - query reformulation
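
"Collapsing" in search results usually means limiting how many hits a single site may contribute, so one large site cannot fill the whole page. The slide does not describe how TNH does this; the sketch below is a generic illustration with a hypothetical `collapse_by_site` function and made-up hits.

```python
from urllib.parse import urlsplit


def collapse_by_site(hits, per_site=1):
    """Keep at most `per_site` hits per host, preserving rank order.

    `hits` is a list of (url, score) pairs already sorted best-first.
    """
    kept = []
    counts = {}
    for url, score in hits:
        host = urlsplit(url).hostname
        if counts.get(host, 0) < per_site:
            kept.append((url, score))
            counts[host] = counts.get(host, 0) + 1
    return kept


if __name__ == "__main__":
    ranked = [
        ("http://a.example/1", 0.9),
        ("http://a.example/2", 0.8),
        ("http://b.example/", 0.7),
        ("http://a.example/3", 0.6),
    ]
    print(collapse_by_site(ranked, per_site=1))
```

In a web archive the same page is often captured many times, so collapsing matters even more than in a live-web search engine.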

24 The Ecosystem
- Each tool stewarded at IA
- Sponsorship by partners
- Driven by projects-of-the-moment
- Use by many institutions
  - CDL Web Archive Service
  - NetArchive Suite
  - archive.org Wayback Machine, Archive-It

25 Thank You Gordon Mohr Internet Archive Web Group
