How To Build A Portuguese Web Search Engine




The Case for a Portuguese Web Search Engine
Mário J. Gaspar da Silva
FCUL/DI and LASIGE/XLDB
mjs@di.fc.ul.pt

Community Web
- Community webs have distinct social patterns (Gibson, 1998)
- Community webs can be identified through Web linkage (Kumar, 1999)

Portuguese Web
- There is an identifiable community web, which we call the Portuguese Web
- It is the web of the people directly related to Portugal
- This is NOT a small community web
  - Population of Portugal: 10M
  - 3+ M users
  - No identifiable topic

Community Identification
- Flake, Lawrence & Giles: a community web is the subset of all web pages that are referenced mostly from within that community web
- Seeds + Expectation-Maximization iterative procedure
  - A max-flow/min-cut algorithm performs the E step
  - A focused crawler performs the M step
- National web: some tweaking required
  - Some sites may be too close to the center of the web graph
  - Many may not have incoming links at all
  - More identification features are needed (language, ...)

Portuguese Web
- Seeds:
  - All sites registered under the .pt TLD
  - Sites hinted by users and verified
- Web pages hosted under .COM, .NET, .ORG, .TV and .tk:
  - M step: linked from a .pt site
  - E step: written in Portuguese

Tumba! (Temos um Motor de Busca Alternativo!, "We have an alternative search engine!")
- Public service
- Community web search engine
- Web archive
- Research infrastructure
- See it in action at http://tumba.pt
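The membership rule for the Portuguese Web can be written as a small predicate. This is only a sketch of the rule as stated on the slide: the function names and the boolean inputs (`linked_from_pt`, `is_portuguese`, `user_verified`) are hypothetical, and in a real crawler they would come from the link graph, the language identifier and the list of user-submitted sites.

```python
from urllib.parse import urlparse

# Generic TLDs named on the slide; .pt is handled separately.
GENERIC_TLDS = {"com", "net", "org", "tv", "tk"}


def tld(url: str) -> str:
    """Return the last label of the URL's host name, lower-cased."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def in_portuguese_web(url: str, linked_from_pt: bool, is_portuguese: bool,
                      user_verified: bool = False) -> bool:
    """Decide membership using the seed rule and the M/E-step conditions."""
    top = tld(url)
    # Seeds: every .pt site, plus sites hinted by users and verified.
    if top == "pt" or user_verified:
        return True
    # Otherwise: hosted under a listed generic TLD, linked from a .pt site
    # (M step) and written in Portuguese (E step).
    return top in GENERIC_TLDS and linked_from_pt and is_portuguese
```

A page on a generic domain is accepted only when both conditions hold, so a Portuguese-language page with no incoming link from .pt is left out unless a user submits it.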

Motivations for a Portuguese Web Search Engine
- Sociological: what kind of stuff do the Portuguese people look for on the web?
- Cultural: information preservation
- Linguistic: what language do these people use to communicate? What vocabulary? An easy way to build corpora
- Security & protection

Tumba!
- Modest effort: 1 professor, 4-5 graduate students, 4-5 servers for 2 years
- Still beta!
  - Fault tolerance will require substantially more hardware (replication)
  - Periodic updates will demand more storage
  - Full-time operators?
- Encouraging feedback
- http://tumba.pt

Statistics
- Up to 20,000 queries/day
- 3.5 million documents under .pt: the deepest crawl!
- 95% of responses in under 0.5 sec

Tumba!
[Architecture diagram: Web Crawlers, Repository, Indexing Engine, Ranking Engine, Presentation Engine]

Tumba!: crawling + archiving
[Diagram components]
- Seed URLs: .PT DNS authority, user input
- ViúvaNegra (crawling engine), crawling the Web
- Versus (meta-data repository)
- WebStore (contents repository)

Versus: Basic Idea
- Combines the Versions & Workspaces model for engineering data management with parallel processing techniques
- Designed for web data warehousing applications in general
- Web data repository with a time dimension
  - Ability to see a web as it was at some point in the past

Versus: Class Model
[Class diagram: <<abstract>> Source (externid), PartitionKey (name, value), Layer (externaltime), Version (content), VersionProperty (name, value), Facet]
- A Source is a reference to a Web document
- A Version is a snapshot of a source at a given instant
- A Layer represents a time unit in the repository
- A PartitionKey is a property associated with a source, and therefore with every version of it, used for partitioning
- A VersionProperty is a property associated with a certain version
- A Facet holds a reference to a content associated with a version
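The class model can be sketched with plain data classes. This is an illustration of the relationships described on the slide, not the Versus schema itself: the field names, the `snapshot` helper, the use of dictionaries for properties and the concrete (non-abstract) `Source` class are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class Layer:
    """A time unit in the repository (for example, one crawling period)."""
    external_time: datetime


@dataclass
class Facet:
    """A reference to one content associated with a version."""
    view: str          # hypothetical label, e.g. "text" or "links"
    content_key: str   # hypothetical reference into the contents repository


@dataclass
class Version:
    """A snapshot of a source at a given instant."""
    layer: Layer
    facets: List[Facet] = field(default_factory=list)
    # VersionProperty entries, stored as name -> value.
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Source:
    """A reference to a Web document."""
    extern_id: str
    # PartitionKey entries, stored as name -> value; shared by all versions.
    partition_keys: Dict[str, str] = field(default_factory=dict)
    versions: List[Version] = field(default_factory=list)

    def snapshot(self, layer: Layer, **properties: str) -> Version:
        """Record a new version of this source in the given layer."""
        version = Version(layer=layer, properties=dict(properties))
        self.versions.append(version)
        return version
```

For example, `Source("http://tumba.pt/", {"host": "tumba.pt"}).snapshot(Layer(datetime(2003, 3, 1)), mime="text/html")` records one version whose partition key applies to every later snapshot of the same source.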

Layers & Resolvers
- Layer: aggregates a set of versions created within a crawling period
- Resolver: a function that defines the layers upon which a client operates
- Access to a specific version of a content is resolved based on the layers (and search order) to be considered
- Incremental crawls are merged with stored data through this mechanism

Facets
- Alternative views of crawled and archived contents
  - Text view
  - Links view
  - Converted views
- Serve both search engine and preservation requirements
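The layer and resolver mechanism can be illustrated with a toy lookup. In this sketch a layer is just a dictionary from source URL to content key, layer identifiers are ISO-style date strings (so that sorting them is chronological), and the resolver names `newest_first` and `as_of` are hypothetical; none of this is taken from the Versus API.

```python
from typing import Callable, Dict, List, Optional

# layer id (ISO-style date string) -> {source URL -> content key}
Layers = Dict[str, Dict[str, str]]
# A resolver returns the layers a client operates on, in search order.
Resolver = Callable[[Layers], List[str]]


def newest_first(layers: Layers) -> List[str]:
    """Resolver for the live search engine: all layers, latest first."""
    return sorted(layers, reverse=True)


def as_of(date: str) -> Resolver:
    """Build a resolver that only sees layers created up to `date`."""
    def resolver(layers: Layers) -> List[str]:
        return sorted((lid for lid in layers if lid <= date), reverse=True)
    return resolver


def lookup(layers: Layers, resolver: Resolver, url: str) -> Optional[str]:
    """Return the first version of `url` found in the resolver's search order."""
    for layer_id in resolver(layers):
        key = layers[layer_id].get(url)
        if key is not None:
            return key
    return None
```

With a full crawl in layer `"2003-01"` and an incremental crawl in `"2003-02"` that re-fetched only some pages, `newest_first` returns the fresh copy where one exists and falls back to the older layer otherwise, which is how an incremental crawl merges with stored data. `as_of("2003-01")` shows the web as it was before the second crawl.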

WebStore: Architecture
[Diagram: a client uses the VCR API (Java library) over NFS to reach READ_ONLY and WRITABLE volumes on host1, host2, host3, ..., hostn, backed by RAID disks]
- Contents are spread across multiple volumes
- When the storage capacity of a volume is exhausted, it becomes read-only
- New volumes are added as needed to augment capacity or improve performance

WebStore: Software Architecture
[Layer diagram, top to bottom]
- Clients
- WebStore API
- WebStore volume management (content keys, duplicates management, compression)
- Network File System, Lustre, IBM SAN File System, ...

Content Addressing
- WebStore receives content and returns a key
- Duplicates receive the already assigned key
  - About 30% of URLs are duplicates
- The key encodes: volume + location of contents in volume + checksum

Preservation Issues
- No backups!
- Volumes implemented with low-cost hard disks as network appliances
- Self-describing volumes accessible through standard protocols
- Assumes human infrastructure
  - Historical archives also do!
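The content-addressing scheme (content in, key out, duplicates mapped to the key already assigned) can be sketched in memory. The class name `ToyWebStore`, the key layout `volume:offset:length:checksum`, the choice of SHA-1 and the in-memory volumes are assumptions made for illustration; the slides only say that the key encodes the volume, the location within the volume and a checksum.

```python
import hashlib
from typing import Dict, List


class ToyWebStore:
    """In-memory stand-in for a multi-volume, content-addressed store."""

    def __init__(self, volume_capacity: int = 1024) -> None:
        self.volume_capacity = volume_capacity
        self.volumes: List[bytearray] = [bytearray()]  # only the last one is writable
        self.by_checksum: Dict[str, str] = {}          # checksum -> key already assigned

    def put(self, content: bytes) -> str:
        """Store content and return its key; duplicates get the existing key."""
        checksum = hashlib.sha1(content).hexdigest()
        if checksum in self.by_checksum:
            return self.by_checksum[checksum]
        # When the writable volume is exhausted, open a new one;
        # the previous volume is never written again (read-only).
        if len(self.volumes[-1]) + len(content) > self.volume_capacity:
            self.volumes.append(bytearray())
        volume = len(self.volumes) - 1
        offset = len(self.volumes[volume])
        self.volumes[volume] += content
        key = f"{volume}:{offset}:{len(content)}:{checksum}"
        self.by_checksum[checksum] = key
        return key

    def get(self, key: str) -> bytes:
        """Fetch content by key and verify it against the embedded checksum."""
        volume, offset, length, checksum = key.split(":")
        start = int(offset)
        data = bytes(self.volumes[int(volume)][start:start + int(length)])
        if hashlib.sha1(data).hexdigest() != checksum:
            raise ValueError("checksum mismatch")
        return data
```

Because the key carries its own volume, location and checksum, a reader needs no central catalogue to fetch or verify a document, which fits the idea of self-describing volumes.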

Tumba!: query processing architecture (indexing phase)
[Diagram components; SIDRA labels the indexing and ranking engines]
- Versus (meta-data repository)
- WebStore (contents repository)
- Index DataStructs Generator
- Page Attributes (Authority)
- Word Index

SIDRA: Word Index Data Structure
- 2 files
  - Term -> {docID}
  - <Term, docID> -> {hit}
- Hit = position + attrib
- DocIDs are assigned in static rank order

SIDRA: Index Range Partitioning
[Diagram: docids index and hits index]
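The two-structure word index can be sketched with dictionaries standing in for the two files. This is a minimal illustration, assuming whitespace tokenization and a precomputed static rank per document; the function name `build_index` is hypothetical, and hit attributes (the "attrib" part of a hit) are left out for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index(docs: Dict[str, Tuple[float, str]]):
    """Build the two index structures from url -> (static_rank, text).

    Returns (ranked_urls, term_docs, term_hits) where
    term_docs maps Term -> [docid] and term_hits maps (Term, docid) -> [position].
    """
    # DocIDs are assigned in static rank order: docid 0 is the best-ranked page.
    ranked = sorted(docs, key=lambda url: -docs[url][0])
    term_docs: Dict[str, List[int]] = defaultdict(list)
    term_hits: Dict[Tuple[str, int], List[int]] = defaultdict(list)
    for docid, url in enumerate(ranked):
        for position, term in enumerate(docs[url][1].lower().split()):
            postings = term_docs[term]
            if not postings or postings[-1] != docid:
                postings.append(docid)
            term_hits[(term, docid)].append(position)
    return ranked, dict(term_docs), dict(term_hits)
```

Since documents are visited in docid order, every posting list comes out already sorted by static rank, so matching documents can be read best-first without a sort at query time.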

SIDRA: Ranking Engine
[Diagram: Word Indexes, Query Servers, Query Broker, Page Attributes]

Addressing Multi-dimensionality
- Generalization: page rank (a page importance measure) is but one of the possible ranking contexts
- Query Servers may index data according to other dimensions: time, location, ...
- Query Brokers perform the results fusion

Tumba!: query processing architecture (run-time phase)
[Diagram: Page Attributes, Word Index, Query Processing & Ranking Engine, Presentation Engine, WebStore (contents repository)]

Tumba! User Interface

Time Navigation
- See previous versions of any page
  - Internet Archive: yes
  - Tumba! (www.tumba.pt): yes, a query to Versus gives the previous versions stored
- Navigate on previous date
  - Internet Archive: experimental
  - Tumba!: requires definition of a resolve function for each period
- Search at a past time
  - Internet Archive: not supported
  - Tumba!: enabled by appropriate choice of indexes

Conclusion
- Tumba! demonstrates a running search engine for a relatively large (national, 10M people) community
- Integrates tools specific to this community:
  - Term tools (spell checking, pronunciation, dictionary, ...)
  - Integration with deep webs
- Supports archiving for preservation
- Scalable software architecture that operates on low-cost hardware
- Testbed for research activities in Computational Processing of Portuguese

Q? http://tumba.pt

References (http://xldb.fc.ul.pt)

XMLBASE
- João Campos, Mário J. Silva. Versus: A Model for a Web Repository. CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, November 2001.
- Daniel Gomes, Mário J. Silva. Tarântula - Sistema de Recolha de Documentos da Web. CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, November 2001.
- Daniel Gomes, João Campos, Mário J. Silva. Versus: A Web Repository. WDAS-2002: Workshop on Distributed Data & Structures, Paris, 2002.

Tumba!
- Bruno Martins, Mário J. Silva. Is it Portuguese? Language detection in large document collections. CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, November 2001.
- Miguel Costa, Mário J. Silva. Ranking no Motor de Busca TUMBA. CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, November 2001.
- Mário J. Silva. Tumba!: Relatório de Actividades de 2003 e Plano para 2003. Relatório Técnico TUMBA TR-02-1, Grupo XLDB da Faculdade de Ciências da Universidade de Lisboa, December 2002.
- Mário J. Silva. The Case for a Portuguese Web Search Engine. Relatório Técnico DI/FCUL TR-03-3, Departamento de Informática da Faculdade de Ciências da Universidade de Lisboa, March 2003.
- Rachel Aires, Sandra Aluísio, Paulo Quaresma, Diana Santos, Mário J. Silva. An initial proposal for cooperative evaluation on information retrieval in Portuguese. Propor'2003 - VI Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada, Faro, June 2003 (accepted).
- José Borbinha, Nuno Freire, Mário J. Silva, Bruno Martins. Internet Search Engines and OPACs: Getting the best of two worlds. ElPub 2003 - ICCC/IFIP 7th International Conference on Electronic Publishing (accepted).
- Mário J. Silva. Tumba! A Web Search and Archive Combination.

Language Identification
- Problem: identify pages in domains other than .pt that are written in Portuguese (during the crawl)
- Approach: use a categorization technique based on n-gram analysis (Dunning 1994, Adams 1997)
- Open problem: how to identify Brazilian Portuguese pages?
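To make the n-gram approach concrete, here is a toy character-trigram classifier with add-one smoothing. It is a simplified stand-in for the cited techniques, not the identifier used in Tumba!: the training sentences in `SAMPLES`, the trigram size and the smoothing are all assumptions, and a usable model would need far more training text per language.

```python
import math
from collections import Counter
from typing import Dict, List

# Tiny hypothetical training samples, one per language.
SAMPLES = {
    "pt": "o motor de busca português procura páginas na web portuguesa para a comunidade",
    "en": "the search engine looks for pages on the web for the community",
}


def trigrams(text: str) -> List[str]:
    """Split text into overlapping character trigrams after normalizing spaces."""
    norm = " ".join(text.lower().split())
    return [norm[i:i + 3] for i in range(len(norm) - 2)]


def train(samples: Dict[str, str]) -> Dict[str, Counter]:
    """Count trigram frequencies for each language."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}


def identify(text: str, models: Dict[str, Counter]) -> str:
    """Return the language whose trigram model gives the text the highest log-likelihood."""
    vocab = len(set().union(*models.values()))
    best_lang, best_score = "", -math.inf
    for lang, counts in models.items():
        total = sum(counts.values())
        # Add-one smoothing so unseen trigrams do not zero out the score.
        score = sum(math.log((counts[g] + 1) / (total + vocab)) for g in trigrams(text))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


MODELS = train(SAMPLES)
```

Calling `identify(page_text, MODELS)` during the crawl would let the crawler keep only pages labelled `"pt"` on non-.pt domains. Separating Brazilian from European Portuguese would require separate models for the two variants, which is the open problem noted above.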

Crawling Policy
- Search engine requirements
  - Home page and all important pages
  - Up to X pages, anti-spam protection
  - Frequent updating
- Archival system requirements
  - Anything that is worth preserving
  - Images
- We can support both within our community, with incremental crawling and indexing
  - Multiple indexes and index selection

Matching & Ranking Algorithm
- Phase 1: Query Matching
  - QueryServers fetch matching docids (pre-sorted in static ranking order)
  - QueryBrokers merge results using a distributed merge-sort algorithm (preserves ranking order)
- Phase 2: Ranking
  - Pick the first N (1000) results from phase 1
  - Compute final rank using hits data
    - Are terms also in the title?
    - What is the distance among query terms in the page?
    - Terms in bold, italic?
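The two-phase matching and ranking scheme can be sketched as follows. Phase 1 merges the docid lists returned by the query servers, which are already sorted in static rank order; phase 2 re-scores the first N candidates with hit data. The function names, the scoring weights and the two features used (a title match flag and the minimum distance between query terms) are hypothetical choices for illustration, not the SIDRA ranking function.

```python
import heapq
from typing import Dict, Iterable, List


def broker_merge(server_results: Iterable[List[int]], n: int = 1000) -> List[int]:
    """Phase 1: merge pre-sorted docid lists, keeping the first n distinct docids."""
    if n <= 0:
        return []
    merged: List[int] = []
    # heapq.merge preserves order because every input list is already sorted.
    for docid in heapq.merge(*server_results):
        if not merged or merged[-1] != docid:
            merged.append(docid)
            if len(merged) == n:
                break
    return merged


def final_rank(candidates: List[int], in_title: Dict[int, bool],
               min_distance: Dict[int, int]) -> List[int]:
    """Phase 2: reorder candidates using hit data (title match, term proximity)."""
    def score(docid: int) -> float:
        title_bonus = 1.0 if in_title.get(docid, False) else 0.0
        proximity = 1.0 / (1 + min_distance.get(docid, 1000))
        return title_bonus + proximity
    # sorted() is stable, so ties keep their static rank order from phase 1.
    return sorted(candidates, key=lambda docid: -score(docid))
```

Because docids encode static rank, the broker can stop as soon as it has N results; no query server needs to return its full posting list.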

Scalability Analysis
[Diagram: Presentation Engine, Query Broker, Query Servers, Word Indexes, Page Attributes, VCR (contents repository)]
- User requests may be balanced among multiple Presentation Engines
- Contents may be replicated
- Requests may be balanced among multiple Query Brokers
- Page Attributes may be replicated
- Query Brokers may balance requests to multiple Query Servers
- Multiple Query Servers for a Word Index
- Word indexes may be replicated