A survey of web archive search architectures

Similar documents
A Survey of Web Archive Search Architectures

Design and selection criteria for a national web archive

Full Text Search of Web Archive Collections

LARGE SCALE INTERNET SERVICES

How To Build A Portuguese Web Search Engine

System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks

Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web. NLA Gordon Mohr March 28, 2012

Practical Options for Archiving Social Media

Archive-IT Services Andrea Mills Booksgroup Collections Specialist

TABLE OF CONTENTS THE SHAREPOINT MVP GUIDE TO ACHIEVING HIGH AVAILABILITY FOR SHAREPOINT DATA. Introduction. Examining Third-Party Replication Models

Peer-to-Peer Networks. Chapter 6: P2P Content Distribution

Business Software Defined DMS DOCUMENT MANAGEMENT SYSTEM ZETA SOFTWARE

Web Search by the people, for the people Michael Christen,

Web Archiving and Scholarly Use of Web Archives

Technical White Paper: Clustering QlikView Servers

The XLDB Group at CLEF 2004

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Distributed Architecture of Oracle Database In-memory

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Plagiarism. Dr. M.G. Sreekumar UNESCO Coordinator, Greenstone Support for South Asia Head, LRC & CDDL, IIM Kozhikode

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Hypertable Architecture Overview

Diagram 1: Islands of storage across a digital broadcast workflow

High Throughput Computing on P2P Networks. Carlos Pérez Miguel

DMS Document Management System

Introduction to Hadoop

Search and Information Retrieval

Scholarly Use of Web Archives

SharePoint Server 2010 Capacity Management: Software Boundaries and Limits

EMC APPLICATIONXTENDER 8.0 Real-Time Document Management

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

A Peer-to-Peer Approach to Content Dissemination and Search in Collaborative Networks

Service Description Cloud Storage Openstack Swift

PI Cloud Connect Overview

How To Scale Out Of A Nosql Database

Hadoop and Map-Reduce. Swati Gore

Next Generation. School Management System. EduSwift. School Management System.

Cassandra A Decentralized, Structured Storage System

Search and Real-Time Analytics on Big Data

SCALABLE DATA SERVICES

FEDERATED DATA SYSTEMS WITH EIQ SUPERADAPTERS VS. CONVENTIONAL ADAPTERS WHITE PAPER REVISION 2.7

Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data

A1 and FARM scalable graph database on top of a transactional memory layer

How To Create A Multi Disk Raid

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

CLOUD COMPUTING. When it's smarter to rent than to buy.. Presented by Anand Tirumani

Invenio: A Modern Digital Library for Grey Literature

GRAPHICAL USER INTERFACE, ACCESS, SEARCH AND REPORTING

Intelligent Dashboards made Simple! Using Excel Services

1. Comments on reviews a. Need to avoid just summarizing web page asks you for:

Understanding Object Storage and How to Use It

Web Archiving Tools: An Overview

How To Build A Connector On A Website (For A Nonprogrammer)

Bigtable is a proven design Underpins 100+ Google services:

Get More Hits to Your Website

Digital Libraries and Content Management

Document Management System (DMS) Release 4.5 User Guide

Analysis of Web Archives. Vinay Goel Senior Data Engineer

The Laserfiche Rio Advantage. Automate, Optimize and Transform Business Processes. Unlimited document repositories and servers

Green Code Lab Challenge : First step for the challenge

I-Motion SQL Server admin concerns

Audit compliance and long-term archiving for SharePoint

NoSQL document datastore as a backend of the visualization platform for ECM system

The Hybrid Cloud Advantage White Paper

Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution

AlienVault Unified Security Management (USM) 4.x-5.x. Deployment Planning Guide

The Role and uses of Peer-to-Peer in file-sharing. Computer Communication & Distributed Systems EDA 390

SEO Techniques for various Applications - A Comparative Analyses and Evaluation

On the features and challenges of security and privacy in distributed internet of things. C. Anurag Varma CpE /24/2016

Monitoring Elastic Cloud Services

White Paper. Optimizing the Performance Of MySQL Cluster

Apache Lucene. Searching the Web and Everything Else. Daniel Naber Mindquarry GmbH ID 380

Availability Digest. HPE Helion Private Cloud and Cloud Broker Services February 2016

Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building Blocks. An Oracle White Paper April 2003

DFID Research Open and Enhanced Access Policy: Implementation guide

Real Time Performance Dashboard for SOA Web Services ORION SOA

Easier - Faster - Better

Building a Scalable Big Data Infrastructure for Dynamic Workflows

Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju

Information access through information technology

Introduction to Hadoop

1/12/11 THE VALUE OF OPERATING IN THE CLOUD INTRODUCTIONS THE VALUE OF OPERATING IN THE CLOUD

Transcription:

A survey of web archive search architectures Miguel Costa, Daniel Gomes (Portuguese Web Archive@FCCN) Francisco Couto, Mário J. Silva (University of Lisbon)

The Internet Archive was founded in 1996 Web-archived page of the first Web Archive

Web archiving has been growing 77 web archiving initiatives 282 billion web-archived files

Web archives must be searchable Users demand Google-like search Searchable means at least full-text search Unsearchable=Useless

How to enable web archive search? Our pursued answer since 2001

Research on web archiving (2001) A web archive of online publications Project between the University of Lisbon and the National Library of Portugal

Research on search engines (2001) Portuguese-web search engine

Portuguese Web Archive = Web search + Web archiving 2008 2010 2013

Survey about web archiving initiatives (2011) URL search: 89% Meta-data search: 79% Full-text search: 67% (28 initiatives) The knowledge is out there

For this survey of web archive search architectures Identified prevalent search architectures Compared main features based on: Available publications (still few) Our experience

Portuguese Web Archive Based on NutchWAX Archive-access tools are widely used to support search Full-text search over 1.2B docs at archive.pt

Search workflow Machine 1 Query Server Index partition 1 Query Query Broker Machine 2 Query Server Machine N Query Server Index partition 2 Index partition N For large collections, indexes must be partitioned across several machines

Time, document partitioning (PWA) document identifiers Advantages Selects time partitions according to query timespan Progressive degradation All computers have all terms One partition fails, remaining respond to query term No index rebuilding required to add new collections lexicon after boat car... you zebra 1 2 3 4 5 <3,1> Disadvantages <2,1> <5,2> <1,3> <4,2> <2,1> <4,1> xbox <5,1> document-based partition (1-2) document-based partition (3-4) High workload: all document partitions within timespan must be scanned for each query Centralized data center approach

Everlast P2P architecture Different types of nodes Crawlers Version directories Indexes Low cost nodes Full-text search Unlimited scalability Tested in laboratory

Term, time partitioning (Everlast) document identifiers lexicon after boat car dad 1 2 3 4 5 <3,1> <2,1> <5,2> <1,3> <4,2> <2,1> <4,1>... <5,1> term-based partition (a-b) term-based partition (c-d) Advantages Robustness of decentralized architecture Lower workload: only one term partition is contacted for each query Disadvantages Index updates to add new collections Term partition unreachable may prevent response to query term Redundancy required Latency due to network

Wayback Machine (URL search) Doesn t use inverted indexes Flat sorted files of URLs URL partitioning Advantages High throughput with millions of queries daily Easy to manage: no phd required Disadvantages High communication workload because all queries are broadcasted to all index partitions Limited search features

Overall comparison Search requirement Storage and workload scalability Wayback Machine Portuguese Web Archive Everlast High High Very High Service reliability High High Medium Time-aware indexing Performance of response times and throughput No Yes Yes High Very High Medium None is the best, just different. Our objective was to improve documentation about web archive search

Food-for-thought

Problem for existing web archives Users don t know where to search for past web content Page unavailable means lost forever Dissemination of web archive services is expensive

It would be nice to have a single portal for cross-web archive search but Web-archived data is spread Search architectures are different Search technologies are different Interoperability is required

How to design a cross-web archive search architecture?

Our proposal: Web archive metasearch based on OpenSearch MyWebArchive search OpenSearch Web archive Metasearch Portal OpenSearch client YourWebArchive search OpenSearch Live-web search OpenSearch OpenSearch is a widely supported and simple technology Most web archives use NutchWAX and it supports OpenSearch Portal would be simple and cheap to implement Extremely useful to web users Increase visibility of web archiving initiatives Easily combines live-web with past-web search results

Successfully tested by Computer Science students Web applications that gather information about politicians from several sources: Wikipedia, Youtube, Twitter, Portuguese Web Archive

Web archive search easily integrated on web browsers

Required research to cross-web archive search Cross-web archive ranking algorithms How to rank search results? User interface design How to adequately present results from different sources?

Conclusions Web archives must support full-text search Web archive search architectures are different but search interoperability should be a requirement OpenSearch has potential to quickly enable cross-web archive search What do you think?

Contact us whenever you like Thanks. www.archive.pt daniel.gomes@fccn.pt