How To Build A Portuguese Web Search Engine


1 The Case for a Portuguese Web Search Engine
Mário J. Gaspar da Silva, FCUL/DI and LASIGE/XLDB

Community Web
Community webs have distinct social patterns (Gibson, 1998).
Community webs can be identified through web linkage (Kumar, 1999).

2 Portuguese Web
There is an identifiable community web, which we call the Portuguese Web: the web of the people directly related to Portugal.
This is NOT a small community web:
  Portugal has a population of 10M
  3+ M users
  No identifiable topic

Community Identification
Flake, Lawrence & Giles: a community web is the subset of all web pages that are referenced mostly from within that community web.
  Seeds + iterative Expectation-Maximization procedure
  A max-flow/min-cut algorithm performs the E step
  A focused crawler performs the M step
National web: some tweaking required
  Some sites may be too close to the center of the web graph
  Many sites have no incoming links at all
  More identification features are needed (language, ...)
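The community definition quoted on the Community Identification slide can be checked directly on a small link graph. The sketch below is only an illustration of that definition, not the max-flow/min-cut procedure itself; the function names and the example hosts are hypothetical. It also shows why a national web needs extra features: a page with no incoming links can never satisfy the link-based test.

```python
from typing import Dict, Set


def incoming_links(links: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert an adjacency map (page -> pages it links to)."""
    incoming: Dict[str, Set[str]] = {}
    for source, targets in links.items():
        for target in targets:
            incoming.setdefault(target, set()).add(source)
    return incoming


def referenced_mostly_from(links: Dict[str, Set[str]],
                           members: Set[str]) -> Dict[str, bool]:
    """For each candidate member, test whether most of its incoming
    links come from other candidate members."""
    incoming = incoming_links(links)
    result = {}
    for page in members:
        sources = incoming.get(page, set())
        inside = len(sources & members)
        outside = len(sources) - inside
        # A page without incoming links fails the test by construction.
        result[page] = len(sources) > 0 and inside > outside
    return result


links = {
    "a.pt": {"b.pt", "c.pt"},
    "b.pt": {"a.pt"},
    "c.pt": {"a.pt"},
    "x.com": {"a.pt", "c.pt"},
    "y.com": {"c.pt"},
}
candidates = {"a.pt", "b.pt", "c.pt", "d.pt"}
print(referenced_mostly_from(links, candidates))
```

Here `c.pt` is referenced mostly from outside the candidate set and `d.pt` has no incoming links, so both fail the purely link-based test.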

3 Portuguese Web
Seeds:
  All sites registered under the .pt TLD
  Sites hinted by users and verified
M step: web pages hosted linked from a .pt site
E step: written in Portuguese, under .COM, .NET, .ORG, .TV and .tk

Tumba! (Temos um Motor de Busca Alternativo!, "We Have an Alternative Search Engine!")
Public service
Community web search engine
Web archive
Research infrastructure
See it in action at
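One possible reading of the membership rules on the Portuguese Web slide is sketched below. It assumes that pages under .pt always belong, and that pages under the other listed TLDs belong only when they are written in Portuguese and linked from a .pt site. The function names, the `language` code and the `linked_from_pt` flag are hypothetical inputs that a crawler and a language detector would have to supply.

```python
from urllib.parse import urlsplit

# TLDs named on the slide for pages outside .pt
OTHER_TLDS = {"com", "net", "org", "tv", "tk"}


def top_level_domain(url: str) -> str:
    """Return the last label of the URL's host name, lowercased."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def in_portuguese_web(url: str, language: str, linked_from_pt: bool) -> bool:
    """Assumed rule: .pt pages always belong; pages under the other
    TLDs must be in Portuguese and linked from a .pt site."""
    tld = top_level_domain(url)
    if tld == "pt":
        return True
    return tld in OTHER_TLDS and language == "pt" and linked_from_pt


print(in_portuguese_web("http://www.example.pt/index.html", "en", False))
print(in_portuguese_web("http://example.com/noticias", "pt", True))
print(in_portuguese_web("http://example.com/news", "en", True))
```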

4 Motivations for a Portuguese Web Search Engine
Sociological: what kind of content do Portuguese people look for on the web?
Cultural: information preservation
Linguistic: what language do these people use to communicate? What vocabulary? An easy way to build corpora
Security & protection

Tumba!
Modest effort: 1 professor, 4-5 graduate students, 4-5 servers for 2 years
Still beta!
  Fault tolerance will require substantially more hardware (replication)
  Periodic updates will demand more storage
  Full-time operators?
Encouraging feedback

5 Statistics
Up to 20,000 queries/day
3.5 million documents under .pt: the deepest crawl!
95% of responses in under 0.5 s

Tumba! components: Web Crawlers, Repository, Indexing Engine, Ranking Engine, Presentation Engine

6 Tumba! components: Web Crawlers, Repository, Indexing Engine, Ranking Engine, Presentation Engine

Crawling + archiving
Diagram labels: .PT DNS Authority, Seed URLs, User Input, Web, ViúvaNegra (Crawling Engine), Versus (Meta-data Repository), WebStore (Contents Repository)

7 Versus: Basic Idea
Combine the Versions & Workspaces model for engineering data management with parallel processing techniques
Designed for web data warehousing applications in general
Web data repository with a time dimension
  Ability to see a web as it was at some point in the past

Versus: Class Model
Class diagram: Source (abstract), PartitionKey, Layer, Version, VersionProperty, Facet; attributes shown include externid, externaltime, content, name and value.
A Source is a reference to a web document.
A Version is a snapshot of a source at a given instant.
A Layer represents a time unit in the repository.
A PartitionKey is a property associated with a source, and therefore with every version of it, used for partitioning.
A VersionProperty is a property associated with a particular version.
A Facet holds a reference to a content associated with a version.
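The class model described on the Versus slide could be written down roughly as follows. This is a sketch only: the class names come from the slide, but which attribute belongs to which class, the field types and the helper `versions_of` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Source:
    """Reference to a web document."""
    externid: str  # assumed to be an external identifier such as the URL
    partition_keys: Dict[str, str] = field(default_factory=dict)


@dataclass
class Layer:
    """A time unit in the repository."""
    externaltime: str  # assumed to be a label for the crawl period


@dataclass
class Facet:
    """Reference to one content (view) associated with a version."""
    name: str
    content_key: str


@dataclass
class Version:
    """Snapshot of a source at a given instant."""
    source: Source
    layer: Layer
    properties: Dict[str, str] = field(default_factory=dict)
    facets: List[Facet] = field(default_factory=list)


def versions_of(versions: List[Version], externid: str) -> List[Version]:
    """All stored snapshots of one source, oldest layer first."""
    matching = [v for v in versions if v.source.externid == externid]
    return sorted(matching, key=lambda v: v.layer.externaltime)


page = Source("http://www.example.pt/", {"host": "www.example.pt"})
history = [
    Version(page, Layer("2003-02"), {"status": "200"}),
    Version(page, Layer("2003-01"), {"status": "200"}),
]
print([v.layer.externaltime for v in versions_of(history, page.externid)])
```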

8 Layers & Resolvers
Layer: aggregates the set of versions created within a crawling period.
Resolver: a function that defines the layers upon which a client operates.
Access to a specific version of a content is resolved based on the layers (and search order) to be considered.
Incremental crawls are merged with stored data through this mechanism.

Facets
Alternative views of crawled and archived contents:
  Text view
  Links view
  Converted views
Serve both search engine and preservation requirements
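The layer and resolver mechanism can be illustrated with a small lookup function. This is a hedged sketch, not the Versus API: the dictionary layout, the function name `resolve` and the layer labels are assumptions. The resolver is represented simply as an ordered list of layers, and the first layer that holds a version of the requested page wins.

```python
from typing import Dict, List, Optional, Tuple

# (layer, url) -> content key; a hypothetical in-memory stand-in
# for the meta-data repository
Versions = Dict[Tuple[str, str], str]


def resolve(versions: Versions, layers: List[str],
            url: str) -> Optional[Tuple[str, str]]:
    """Return (layer, content key) from the first layer, in search
    order, that holds a version of the URL."""
    for layer in layers:
        key = versions.get((layer, url))
        if key is not None:
            return layer, key
    return None


stored = {
    # Full crawl
    ("2003-01", "http://www.example.pt/"): "key-home-v1",
    ("2003-01", "http://www.example.pt/about"): "key-about-v1",
    # Incremental crawl: only the changed page was fetched again
    ("2003-02", "http://www.example.pt/"): "key-home-v2",
}

current_view = ["2003-02", "2003-01"]  # newest layer searched first
past_view = ["2003-01"]

print(resolve(stored, current_view, "http://www.example.pt/"))
print(resolve(stored, current_view, "http://www.example.pt/about"))
print(resolve(stored, past_view, "http://www.example.pt/"))
```

Searching the incremental layer before the full crawl merges the two without copying data, and restricting the layer list gives a view of the web as it was in an earlier period.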

9 WebStore - Architecture
Diagram labels: Client, VCR API (Java library), NFS, read-only volumes, writable volumes, host1, host2, host3, ..., hostn, RAID disks
Contents are spread over multiple volumes.
When the storage capacity of a volume is exhausted, it becomes read-only.
New volumes are added as needed to augment capacity or improve performance.

WebStore Software Architecture
Clients
WebStore API
WebStore Volume Management (content keys, duplicates management, compression)
Network File System, Lustre, IBM SAN File System, ...
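The volume policy on the WebStore architecture slide (a full volume becomes read-only and a new one is added) can be sketched as below. The classes, names and the single-writable-volume simplification are assumptions for illustration; the slide shows several writable volumes on different hosts, and this sketch does not handle a content larger than a whole volume.

```python
class Volume:
    """Hypothetical in-memory stand-in for one storage volume."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.used = 0
        self.writable = True
        self.blobs = {}  # offset -> stored bytes


class VolumeManager:
    def __init__(self, capacity):
        self.capacity = capacity
        self.volumes = [Volume("vol0", capacity)]

    def write(self, data):
        """Store data on the current writable volume, rolling over to
        a new volume when the current one cannot hold it."""
        volume = self.volumes[-1]
        if volume.used + len(data) > volume.capacity:
            volume.writable = False  # exhausted volumes become read-only
            volume = Volume("vol%d" % len(self.volumes), self.capacity)
            self.volumes.append(volume)
        offset = volume.used
        volume.blobs[offset] = data
        volume.used += len(data)
        return volume.name, offset

    def read(self, name, offset):
        for volume in self.volumes:
            if volume.name == name:
                return volume.blobs[offset]
        raise KeyError(name)


manager = VolumeManager(capacity=10)
print(manager.write(b"12345678"))
print(manager.write(b"abcd"))
print([(v.name, v.writable) for v in manager.volumes])
```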

10 Content Addressing
WebStore receives a content and returns a key.
Duplicates receive the already assigned key.
  About 30% of URLs are duplicates
A key encodes: volume + location of the contents in the volume + checksum

Preservation Issues
No backups!
  Volumes are implemented with low-cost hard disks as network appliances.
  Self-describing volumes are accessible through standard protocols.
Assumes human infrastructure
  Historical archives do too!
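Content addressing with duplicate elimination, as described on the Content Addressing slide, might look like the following sketch. The key layout (volume, offset, length, checksum joined by colons), the use of SHA-1 and the class name `ContentStore` are assumptions; the slide only states that a key encodes the volume, the location and a checksum.

```python
import hashlib


class ContentStore:
    """Hypothetical single-volume store that returns a key per content."""

    def __init__(self, volume="vol0"):
        self.volume = volume
        self.data = bytearray()
        self.key_by_checksum = {}

    def put(self, content: bytes) -> str:
        checksum = hashlib.sha1(content).hexdigest()
        # A duplicate receives the key that was already assigned.
        if checksum in self.key_by_checksum:
            return self.key_by_checksum[checksum]
        offset = len(self.data)
        self.data += content
        key = "%s:%d:%d:%s" % (self.volume, offset, len(content), checksum)
        self.key_by_checksum[checksum] = key
        return key

    def get(self, key: str) -> bytes:
        volume, offset, length, checksum = key.split(":")
        start = int(offset)
        content = bytes(self.data[start:start + int(length)])
        # The checksum in the key lets a reader verify the contents.
        if hashlib.sha1(content).hexdigest() != checksum:
            raise ValueError("checksum mismatch for key " + key)
        return content


store = ContentStore()
first = store.put(b"<html>Ola</html>")
mirror = store.put(b"<html>Ola</html>")  # same bytes under another URL
other = store.put(b"<html>Adeus</html>")
print(first == mirror, first == other)
print(store.get(other))
```

Because the key carries its own checksum, a volume can be validated without any external catalogue, which fits the self-describing volumes mentioned under preservation.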

11 Tumba! components: Web Crawlers, Repository, Indexing Engine, Ranking Engine, Presentation Engine
SIDRA

Query Processing Architecture (indexing phase)
Diagram labels: Versus (Meta-data Repository), WebStore (Contents Repository), Index DataStructs Generator, Page Attributes (Authority), Word Index

12 SIDRA - Word Index Data Structure
2 files:
  Term -> {docid}
  <Term, docID> -> {hit}
Hit = position + attrib
DocIDs are assigned in static rank order

SIDRA Index Range Partitioning
Diagram labels: docids index, hits index
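The two-part word index can be illustrated in memory as follows. This is a sketch under assumptions, not SIDRA's file format: the names `docids_index` and `hits_index` are borrowed from the diagram labels, and the input layout (URL mapped to a list of word and attribute pairs) and the `static_rank` scores are hypothetical. The point it shows is that assigning docids in static rank order makes every posting list come out already sorted by rank.

```python
def build_index(docs, static_rank):
    """docs: url -> list of (word, attrib); static_rank: url -> score."""
    # The best-ranked page receives docid 0, the next one docid 1, ...
    ordered = sorted(docs, key=lambda url: static_rank[url], reverse=True)
    docid_of = {url: docid for docid, url in enumerate(ordered)}

    docids_index = {}  # term -> list of docids, ascending
    hits_index = {}    # (term, docid) -> list of (position, attrib)
    for url in ordered:
        docid = docid_of[url]
        for position, (word, attrib) in enumerate(docs[url]):
            hits_index.setdefault((word, docid), []).append((position, attrib))
            postings = docids_index.setdefault(word, [])
            if not postings or postings[-1] != docid:
                postings.append(docid)
    return docid_of, docids_index, hits_index


docs = {
    "u1": [("lisboa", "title"), ("porto", "body")],
    "u2": [("lisboa", "body"), ("lisboa", "bold")],
}
static_rank = {"u1": 0.2, "u2": 0.9}
docid_of, docids_index, hits_index = build_index(docs, static_rank)
print(docid_of)
print(docids_index)
print(hits_index[("lisboa", 0)])
```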

13 SIDRA - Ranking Engine
Diagram labels: Word Indexes, Query Server, Query Broker, Page Attributes

Addressing Multi-dimensionality
Generalization: page rank (a page importance measure) is but one of the possible ranking contexts.
Query Servers may index data according to other dimensions:
  Time
  Location
  ...
Query Brokers perform the fusion of results.

14 Tumba! components: Web Crawlers, Repository, Indexing Engine, Ranking Engine, Presentation Engine

Query Processing Architecture (run-time phase)
Diagram labels: Page Attributes, Word Index, Query Processing & Ranking Engine, Presentation Engine, WebStore (Contents Repository)

15 Tumba! User Interface

Time Navigation
See previous versions of any page
  Internet Archive: yes
  Tumba!: yes, a query to Versus gives the previous versions stored
Navigate on a previous date
  Internet Archive: experimental
  Tumba!: requires the definition of a resolve function for each period
Search at a past time
  Internet Archive: not supported
  Tumba!: enabled by an appropriate choice of indexes

16 Conclusion
Tumba! demonstrates a running search engine for a relatively large (national, 10M people) community.
Integrates tools specific to this community:
  Term tools (spell checking, pronunciation, dictionary, ...)
  Integration with deep webs
Supports archiving for preservation
Scalable software architecture that operates on low-cost hardware
Testbed for research activities in Computational Processing of Portuguese

Q?

17 References

XMLBASE
João Campos, Mário J. Silva, Versus: A Model for a Web Repository, CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, Novembro de
Daniel Gomes, Mário J. Silva, Tarântula - Sistema de Recolha de Documentos da Web, CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, Novembro de
Daniel Gomes, João Campos, Mário J. Silva, Versus: A Web Repository, WDAS-2002: Workshop on Distributed Data & Structures, Paris,

Tumba!
Bruno Martins, Mário J. Silva, Is it Portuguese? Language detection in large document collections, CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, Novembro de
Miguel Costa, Mário J. Silva, Ranking no Motor de Busca TUMBA, CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, Novembro de
Mário J. Silva, Tumba!: Relatório de Actividades de 2003 e Plano para 2003, Relatório Técnico TUMBA TR-02-1, Grupo XLDB da Faculdade de Ciências da Universidade de Lisboa, Dezembro de
Mário J. Silva, The Case for a Portuguese Web Search Engine, Relatório Técnico DI/FCUL TR-03-3, Departamento de Informática da Faculdade de Ciências da Universidade de Lisboa, Março de
Rachel Aires, Sandra Aluísio, Paulo Quaresma, Diana Santos, Mário J. Silva, An initial proposal for cooperative evaluation on information retrieval in Portuguese, Propor' VI Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada, Faro, Junho de (accepted)
José Borbinha, Nuno Freire, Mário J. Silva, Bruno Martins, Internet Search Engines and OPACs: Getting the best of two worlds, ElPub ICCC/IFIP 7th International Conference on Electronic Publishing (accepted)
Mário J. Silva, Tumba! A Web Search and Archive Combination.
Mário J. Silva, The Case for a Portuguese Web Search Engine,

Language Identification
Problem: identify pages in domains other than .pt that are written in Portuguese (during the crawl).
Approach: use a categorization technique based on n-gram analysis (Dunning 1994, Adams 1997).
Open problem: how to identify Brazilian Portuguese pages?
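A toy version of n-gram based language identification is sketched below. It is not the classifier used by Tumba! or the exact methods of the cited papers: it swaps in character trigram counts compared by cosine similarity, and the training sentences are invented, far too short for real use, and present only to make the example self-contained.

```python
from collections import Counter
from math import sqrt


def trigrams(text):
    """Count character trigrams after lowercasing and collapsing spaces."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def train(samples):
    """samples: language code -> training text."""
    return {lang: trigrams(text) for lang, text in samples.items()}


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    grams = trigrams(text)
    return max(profiles, key=lambda lang: cosine(grams, profiles[lang]))


profiles = train({
    "pt": "o motor de busca procura páginas da web escritas em português "
          "para as pessoas de portugal",
    "en": "the search engine looks for web pages written in english "
          "for the people of the country",
})
print(identify("páginas escritas em português", profiles))
print(identify("the people of the country", profiles))
```

A profile per variety would be one naive way to approach the open problem on the slide, but trigram statistics of European and Brazilian Portuguese overlap heavily, so this sketch makes no claim about that case.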

18 Crawling Policy
Search engine requirements:
  Home page and all important pages
  Up to X pages, anti-spam protection
  Frequent updating
Archival system requirements:
  Anything that is worth preserving
  Images
We can support both within our community, with incremental crawling and indexing
  Multiple indexes and index selection

Matching & Ranking Algorithm
Phase 1: Query Matching
  QueryServers fetch matching docids (pre-sorted in static ranking order)
  QueryBrokers merge results using a distributed merge-sort algorithm (preserves ranking order)
Phase 2: Ranking
  Pick the first N (1000) results from phase 1
  Compute the final rank using hits data:
    Are terms also in the title?
    What is the distance among query terms in the page?
    Are terms in bold or italic?
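The two phases of the Matching & Ranking slide can be sketched on a single machine. The code below assumes that each query server returns a docid list already sorted in static rank order (lower docid means better static rank), merges the lists with `heapq.merge`, keeps the first N, and then re-ranks them from hit attributes. The attribute weights and function names are hypothetical, and the distance between query terms is left out to keep the sketch short.

```python
import heapq
from itertools import islice

# Hypothetical weights for where a term occurs in the page
WEIGHTS = {"title": 5, "bold": 2, "italic": 2, "body": 1}


def broker_merge(server_results, n):
    """Phase 1: merge the sorted docid lists, preserving static rank
    order, and keep only the first n candidates."""
    return list(islice(heapq.merge(*server_results), n))


def final_score(docid, hits):
    """Phase 2 score: sum of attribute weights over the document's hits."""
    return sum(WEIGHTS.get(attrib, 1) for _position, attrib in hits[docid])


def rank(server_results, hits, n=1000):
    candidates = broker_merge(server_results, n)
    # Higher score first; ties keep static rank order (lower docid first).
    return sorted(candidates, key=lambda d: (-final_score(d, hits), d))


server_results = [[0, 3, 7], [1, 4], [2, 9]]
hits = {
    0: [(5, "body")],
    1: [(0, "title")],
    2: [(3, "bold")],
    3: [(8, "body")],
}
print(broker_merge(server_results, 4))
print(rank(server_results, hits, n=4))
```

Limiting phase 2 to the first N candidates bounds the cost of reading hits data, at the price of never promoting a page whose static rank places it outside that window.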

19 Scalability Analysis
Diagram labels: Presentation Engine, Query Broker, Query Server, Word Indexes, Page Attributes, VCR (Contents Repository)
User requests may be balanced among multiple Presentation Engines
Contents may be replicated
Requests may be balanced among multiple Query Brokers
Page Attributes may be replicated
Query Brokers may balance requests to multiple Query Servers
Multiple Query Servers for a Word Index
Word indexes may be replicated