Peer-to-Peer Data Management

Peer-to-Peer Data Management
Wolf-Tilo Balke, Sascha Tönnies
Institut für Informationssysteme, Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de

4.1 Overview
1. Introduction
2. Content Searching in Peer-to-Peer Applications
   1. Problems in Peer-to-Peer Information Retrieval
   2. Related Work in Distributed Information Retrieval
3. Index Structures for Query Routing
   1. Distributed Hash Tables for Information Retrieval
   2. Routing Indexes for Information Retrieval
   3. Locality-Based Routing Indexes
4. Supporting Effective Information Retrieval
   1. Providing Collection-Wide Information
   2. Estimating the Document Overlap
   3. Prestructuring Collections with Taxonomies
5. Summary and Conclusion

4.1 What is IR?
- Information retrieval (IR) is the science of searching for documents, for information within documents, and for metadata about documents
- A user enters a query, i.e. an information need, into the system
- Several objects may match the query with different degrees of relevance

4.1 Representing Text
- How do we represent the complexities of language? Computers don't understand documents or queries
- Simple, yet effective approach: bag of words
  - Treat all the words in a document as index terms for that document
  - Assign a weight to each term based on its importance
  - Disregard order, structure, meaning, etc. of the words

4.1 Representing Text
Example document:
McDonald's slims down spuds. Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $.8 to $34.9, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
Bag of words:
- 16: said
- 14: McDonalds
- 12: fat
- 11: fries
- 8: new
- 6: company, french, nutrition
- 5: food, oil, percent, reduce, taste, Tuesday
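The bag-of-words idea can be sketched in a few lines of Python. This is only an illustration: the tokenizer (lowercasing, splitting on non-letters) and the small `STOPWORDS` set are assumptions, not part of the slides.

```python
import re
from collections import Counter

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"the", "a", "of", "to", "in", "is", "its", "it", "for", "s"}

def bag_of_words(text):
    """Lowercase the text, split on non-letters, drop stopwords, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

bag = bag_of_words(
    "McDonald's is cutting the amount of bad fat in its french fries. "
    "The fries keep the same great french-fry taste."
)
print(bag["fries"], bag["french"], bag["fat"])
```

Word order and sentence structure are lost; only the term counts survive, which is exactly the simplification the slide describes.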

4.1 Retrieval
- Retrieving relevant information is hard!
  - Evolving, ambiguous user needs, context, etc.
  - Complexities of language
- To operationalize information retrieval, we must vastly simplify the picture
- Information retrieval is all (and only) about matching words in documents with words in queries
  - Obviously, not true
  - But it works pretty well!

4.1 Representing Documents as Vectors
Document 1: The quick brown fox jumped over the lazy dog's back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword list: for, is, of, the, to

Term    Document 1  Document 2
aid     0           1
all     0           1
back    1           0
brown   1           0
come    0           1
dog     1           0
fox     1           0
good    0           1
jump    1           0
lazy    1           0
men     0           1
now     0           1
over    1           0
party   0           1
quick   1           0
their   0           1
time    0           1

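4.1 Representing Text
How to compare documents and queries?
[Figure: logical view of a document. A document (text + structure) passes through structure recognition, handling of accents, spacing, etc., stopword removal, noun-group detection, stemming, and automatic or manual indexing; the resulting views are structure, full text, and index terms]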

4.1 Boolean Retrieval
- Weights assigned to terms are either 0 or 1
  - 0 represents absence: the term isn't in the document
  - 1 represents presence: the term is in the document
- Build queries by combining terms with the Boolean operators AND, OR, NOT
- The system returns all documents that satisfy the query

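4.1 Boolean View of a Document Set (= Collection)

Term   Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
aid    0      0      0      1      0      0      0      1
all    0      1      0      1      0      1      0      0
back   1      0      1      0      0      0      1      0
brown  1      0      1      0      1      0      1      0
come   0      1      0      1      0      1      0      1
dog    0      0      1      0      1      0      0      0
fox    0      0      1      0      1      0      1      0
good   0      1      0      1      0      1      0      1
jump   0      0      1      0      0      0      0      0
lazy   1      0      1      0      1      0      1      0
men    0      1      0      1      0      0      0      1
now    0      1      0      0      0      1      0      1
over   1      0      1      0      1      0      1      1
party  0      0      0      0      0      1      0      1
quick  1      0      1      0      0      0      0      0
their  1      0      0      0      1      0      1      0
time   0      1      0      1      0      1      0      0

- Each column represents the view of a particular document: what terms are contained in this document?
- Each row represents the view of a particular term: what documents contain this term?
- To execute a query, pick out the rows corresponding to the query terms and then apply the logic table of the corresponding Boolean operator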

4.1 Sample Queries

Term   Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
dog    0      0      1      0      1      0      0      0
fox    0      0      1      0      1      0      1      0

- dog AND fox: Doc 3, Doc 5
- dog OR fox: Doc 3, Doc 5, Doc 7
- dog NOT fox: empty
- fox NOT dog: Doc 7

Term   Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
good   0      1      0      1      0      1      0      1
party  0      0      0      0      0      1      0      1
over   1      0      1      0      1      0      1      1

- good AND party: Doc 6, Doc 8
- good AND party NOT over: Doc 6
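The sample queries map directly onto set operations. The sketch below assumes the document numbers per term that the sample results imply (for example, dog in Doc 3 and Doc 5); the helper name `docs` is made up for this illustration.

```python
# Document numbers per term, as implied by the sample query results.
POSTINGS = {
    "dog": {3, 5},
    "fox": {3, 5, 7},
    "good": {2, 4, 6, 8},
    "party": {6, 8},
    "over": {1, 3, 5, 7, 8},
}

def docs(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return POSTINGS.get(term, set())

dog_and_fox = docs("dog") & docs("fox")   # AND is set intersection
dog_or_fox = docs("dog") | docs("fox")    # OR is set union
dog_not_fox = docs("dog") - docs("fox")   # NOT is set difference
fox_not_dog = docs("fox") - docs("dog")
good_party_not_over = (docs("good") & docs("party")) - docs("over")
```

Every matching document is returned with equal standing; there is no notion of one result being better than another.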

4.1 The Perfect Query Paradox
- Every information need has a perfect set of documents
  - If not, there would be no sense doing retrieval
- Every document set has a perfect query
  - AND every word in a document to get a query for it
  - Repeat for each document in the set
  - OR every document query to get the set query
- But can users realistically be expected to formulate this perfect query?
  - Boolean query formulation is hard!

4.1 Why Boolean Retrieval Fails
- Natural language is way more complex
- AND "discovers" nonexistent relationships
  - Terms in different sentences, paragraphs, ...
- Guessing terminology for OR is hard
  - good, nice, excellent, outstanding, awesome, ...
- Guessing terms to exclude is even harder!
  - Democratic party, party to a lawsuit, ...

4.1 Strengths and Weaknesses
- Strengths
  - Precise, if you have a clear idea of what you're looking for
  - Efficient for the computer
- Weaknesses
  - Users must learn Boolean logic
  - Boolean logic is insufficient to capture the richness of language
  - No control over the size of the result set: either too many documents or none
  - All documents in the result set are considered equally good
  - What about partial matches? Documents that don't quite match the query may also be useful

4.1 Ranked Retrieval
- Order documents by how likely they are to be relevant to the information need
  - Present hits one screen at a time
  - At any point, users can continue browsing through the ranked list or reformulate the query
- Attempts to retrieve relevant documents directly, not merely provide tools for doing so

4.1 Why Ranked Retrieval?
- Arranging documents by relevance is
  - Closer to how humans think: some documents are better than others
  - Closer to user behavior: users can decide when to stop reading
- Best (partial) match: documents need not have all query terms
  - Although documents with more query terms should be better

4.1 Similarity-based Retrieval
- Let's replace relevance with similarity
  - Rank documents by their similarity to the query
- Treat the query as if it were a document
  - Create a query bag of words
  - Find its similarity to each document
  - Rank the documents by similarity
- Surprisingly, this works pretty well!

4.1 Vector Space Model
[Figure: documents d1 to d5 drawn as vectors in a space spanned by the term axes t1, t2, t3, with angles θ and φ between vectors]
- Postulate: documents that are close together in vector space talk about the same things
- Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ "closeness")
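One common way to turn "closeness" into a number is the cosine of the angle between the query vector and each document vector. The slides do not fix a particular measure, so the cosine and the toy weights below are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return document ids ordered by decreasing similarity to the query."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)

# Hypothetical term weights for three documents.
docs = {
    "d1": {"dog": 2.0, "fox": 1.0},
    "d2": {"party": 1.0, "good": 1.0},
    "d3": {"dog": 1.0, "lazy": 3.0},
}
ranking = rank({"dog": 1.0, "fox": 1.0}, docs)
```

Here the query is treated exactly like a document: it is just another vector in the same term space.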

4.1 How to Weight Terms?
- Idea: Hans Peter Luhn, 1958, IBM
- Here's the intuition:
  - Terms that appear often in a document should get high weights: the more often a document contains the term "dog", the more likely it is that the document is about dogs
  - Terms that appear in many documents should get low weights: words like "the", "a", "of" appear in (nearly) all documents
- How do we capture this mathematically?
  - Term frequency
  - Inverse document frequency

4.1 TFxIDF
- TFxIDF [Gerald Salton, 96]
- Term frequency (TF)
  - How often a term appears in a document
  - Can be calculated locally
- Document frequency (DF)
  - Number of documents that contain a specific term
  - Needs global (system-wide) knowledge
- Inverse document frequency (IDF)
  - Discriminator for the importance of a term regarding the number of occurrences in all documents
  - Needs global (system-wide) knowledge
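A minimal sketch of one common TFxIDF variant, assuming the weight tf · log(N / df); many other normalizations exist and the slides do not say which one is meant. Note that TF comes from a single document, while DF and N require a pass over the whole collection.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: {doc_id: list of terms}. Returns {doc_id: {term: weight}}."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for terms in docs.values() for t in set(terms))
    return {
        doc_id: {t: tf * math.log(n / df[t]) for t, tf in Counter(terms).items()}
        for doc_id, terms in docs.items()
    }

weights = tf_idf({
    1: ["the", "dog", "dog"],
    2: ["the", "fox"],
    3: ["the", "dog"],
})
```

A term that occurs in every document ("the") gets weight 0, while a term repeated within one document ("dog" in document 1) is weighted up.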

4.1 Working on Indices
[Figure: term-document matrix with the terms quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party as rows and Doc 1 to Doc 8 as columns]
- The term-document matrix again holds the bag-of-words information about the collection

4.1 Small yet Fast?
- Can we make this data structure smaller, keeping in mind the need for fast retrieval?
- Observations:
  - The nature of the search problem requires us to quickly find which documents contain a term
  - The term-document matrix is very sparse
  - Some terms are more useful than others

4.1 Posting Lists

Term    Postings
aid     4, 8
all     2, 4, 6
back    1, 3, 7
brown   1, 3, 5, 7
come    2, 4, 6, 8
dog     3, 5
fox     3, 5, 7
good    2, 4, 6, 8
jump    3
lazy    1, 3, 5, 7
men     2, 4, 8
now     2, 6, 8
over    1, 3, 5, 7, 8
party   6, 8
quick   1, 3
their   1, 5, 7
time    2, 4, 6

4.1 Inverted Document Index
- The terms together with their posting lists form the inverted document index

4.1 What Goes in the Postings?
- Boolean retrieval: just the document number
- Ranked retrieval: document number and term weight (tf.idf, ...)
- Proximity operators: word offsets for each occurrence of the term
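The following sketch builds an inverted index whose postings carry word offsets, which covers both the Boolean case (document numbers only) and proximity operators. Whitespace tokenization and the three toy documents are assumptions for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [word offsets of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {
    1: "quick brown fox",
    2: "lazy dog",
    3: "brown dog and brown fox",
}
index = build_index(docs)

boolean_postings = sorted(index["brown"])   # document numbers only
offsets_in_doc3 = index["brown"][3]         # positions, for proximity operators
```

For ranked retrieval, the per-document list of offsets would be replaced or complemented by a term weight.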

4.2 Information Retrieval in P2P Systems
- Information retrieval deals with complex documents
  - Metadata can only capture some aspects of a document, but cannot anticipate all semantic searches
  - E.g. a sports-related newspaper article, but no names, locations, etc.
  - Support for full-text searches is needed
- Find the best-matching document from the best-connected peer
  - Unlike in file sharing, the emphasis is on document quality
  - If multiple sources offer documents of similar quality, choose the best peer according to connection, etc.

4.2 Challenges in P2P IR
- Efficient query evaluation scheme
  - A central inverted index of documents is expensive to maintain
  - How to disseminate a peer's query? Simple flooding of all queries does not scale if the best documents have to be found (not just some match)
- Dealing with network churn
  - A peer can at any time alter the set of documents offered, or significantly change individual documents
  - Peers may join and leave the network, i.e. whole document collections may disappear or be added
- Integration of collection-wide information
  - Peers cannot calculate IR-style scores from local knowledge alone; they need some knowledge of the (virtual) merged collection
  - Constant dissemination of collection-wide information needs a lot of bandwidth

4.2 Example: Problem of Collection-wide Information
- Example: different news collections, query on the keyword "basketball"
- General news collection
  - Many articles, only few about basketball, therefore the keyword is rare and its IDF is high
  - Keyword discriminates well between articles
- NBA news collection
  - Few articles, almost all about basketball, therefore the keyword is common and its IDF is low
  - Keyword hardly discriminates between articles
- Merged collection: IDF medium
- But how do independent collections (peers) exchange their information?
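The effect can be checked with a few numbers. The sketch assumes IDF = log(N / df), so a keyword contained in few documents gets a high IDF; all collection sizes below are hypothetical.

```python
import math

def idf(num_docs, doc_freq):
    """IDF of a term contained in doc_freq out of num_docs documents."""
    return math.log(num_docs / doc_freq)

# (number of articles, articles containing "basketball"); made-up values.
general = (10_000, 50)
nba = (500, 450)
merged = (general[0] + nba[0], general[1] + nba[1])

idf_general = idf(*general)
idf_nba = idf(*nba)
idf_merged = idf(*merged)
print(round(idf_general, 2), round(idf_nba, 2), round(idf_merged, 2))
```

Scores computed with the two local IDF values are not comparable with each other; only the merged value reflects the collection that a network-wide query actually addresses.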

4.2 Example: Problem of Collection-wide Information
[Figure: a querying peer sends the query "A and B" to Peer 1 and Peer 2. With global scoring (IDF = 6/3 for both terms) all objects score identically. With local scoring the two peers compute different IDF values for A and B (3/2 and 3/1, swapped between the peers), so each peer reports a different top object]

4.2 Distributed IR
- Distributed information retrieval techniques have grown increasingly important for searching Web sources
- Abstracts of information sources
  - To support distributed retrieval, sources have to register abstracts or keyword sets
  - Abstracts can either be kept in a central repository or distributed by gossiping algorithms, e.g. PlanetP [Cuenca-Acuna et al., 03]
- Collection selection
  - Without a central index, a sophisticated way of choosing the most promising collections for querying is needed

4.2 Distributed IR
- Such abstracts can be compactly represented by Bloom filters, i.e. bit vectors that allow membership queries
  - Each term is hashed with n different functions, and the position in the bit vector for each hash value is set to 1
  - Allows for false positives, but no false negatives
  - In counting Bloom filters, objects can also be removed
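A minimal Bloom filter sketch in Python. The filter size, the number of hash functions, and the way the n hash functions are derived (salting SHA-256 with an index) are all assumptions for illustration; real systems tune these to the expected number of terms.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=256, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, term):
        # Derive num_hashes positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, term):
        for p in self._positions(term):
            self.bits[p] = 1

    def __contains__(self, term):
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(term))

abstract = BloomFilter()
for term in ["basketball", "nba", "playoffs"]:
    abstract.add(term)
```

A counting variant would store small counters instead of bits, incrementing on `add` and decrementing on removal.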

4.2 Distributed IR
- Benefit estimators for collection selection use aggregated statistics about individual collections, e.g. the CORI measure [Callan et al., 95]
- CORI calculates a collection score s_i for collection i regarding query q, based on n, the number of collections; cdf, the collection document frequency; cdf_max, the maximum cdf; and cf_t, the collection frequency of term t
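For orientation, one formulation of the CORI score that uses exactly these quantities is sketched below. It is reproduced from memory of later P2P search work and should be read as an assumption, not as the formula shown on the slide; α and β are tuning constants (values around 0.4 are often quoted).

```latex
s_i = \sum_{t \in q} \frac{s_{i,t}}{|q|},
\qquad
s_{i,t} = \alpha + (1 - \alpha)\, T_{i,t}\, I_{i,t}

T_{i,t} = \beta + (1 - \beta)\,
          \frac{\log\left(cdf_{i,t} + 0.5\right)}{\log\left(cdf_i^{max} + 1.0\right)},
\qquad
I_{i,t} = \frac{\log\left(\frac{n + 0.5}{cf_t}\right)}{\log\left(n + 1\right)}
```

T grows with the number of documents in collection i that contain t, while I plays the role of an IDF at the level of collections: terms offered by few collections contribute more.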

4.3 Index Structures for Query Routing
- Traditional index structures cannot be readily employed in P2P systems
  - High degree of distribution
  - High degree of volatility (churn)
  - High degree of index maintenance
- Distributed paradigms are needed to route queries to appropriate peers
  - Simple flooding does not scale
  - Distributed hash table lookup
  - Using indexed routing information
  - Using shortcut overlays

4.3 Distributed Hash Tables for IR
- Distributed hash tables
  - Route queries to appropriate peers with a number of hops logarithmic in network size
  - No peer needs to maintain more than a logarithmic amount of routing information
- But
  - Exact-match queries only
  - All new content has to be published if peers join or change
  - Old content has to be unpublished if peers leave
  - Documents added or removed contain many different terms to be published or unpublished, so usually many index peers have to be addressed
  - A conjunction of query terms needs to access many peers, and there is still no guarantee that a single document containing the conjunction exists

4.3 Distributed Hash Tables for IR
- Improvement: hybrid P2P infrastructures [Loo et al., 04]
- DHTs are least efficient when highly replicated items are requested
  - Experiments show worse behavior than flooding, degrading with churn
- Querying and content allocation follow a Zipf distribution
  - Only few items are highly replicated and often queried
  - "People are looking for hay, not for needles" (S. Shenker)
- Hybrid P2P infrastructures use DHTs only for the less replicated and rarely queried items; all other queries are flooded
  - Still, DHTs have to be maintained for the majority of query terms
[Figure: Query Frequency Distribution; x-axis: query, y-axis: occurrence frequency in percent]

4.3 Routing Indexes for IR
- Routing indexes are local collections of (key, peer) pairs
  - The key is either a keyword or a query
  - The peer is the address of a peer that either offers relevant results or routes the query to other peers with relevant results
- In contrast to flooding, only interesting directions are queried
  - A distinction is often made between links in the default network (directions of content providers) and an overlay structure of direct links to content providers ("shortcuts")
  - First introduced by [Crespo & Garcia-Molina, 02] to choose the best neighbors in the default network for query forwarding
- Index maintenance is of a local nature, and index coverage is usually high due to the Zipf distribution of requests
- Correctness of the index is influenced by network volatility/churn

4.3 Routing Indexes for IR
- Routing index policies in the face of network churn
- With restricted index sizes, new entries are always collected and stored; if the maximum size is reached, some stale information is replaced
  - A simple strategy always replaces the currently oldest index entries
  - The least recently used (LRU) strategy assigns higher usefulness to entries that have been successfully used recently
  - The optimal index size is a problematic parameter
- Indexes with unrestricted size have to combat network churn differently
  - "Time to live" assigns an expiry time to each new index entry
  - "Forgetting factors" can periodically weigh down the reliability of link information
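The LRU and time-to-live policies can be combined in one small data structure. The class below is a hypothetical sketch (the name `RoutingIndex` and its methods are invented here); the current time is passed in explicitly so that the behavior is easy to follow.

```python
from collections import OrderedDict

class RoutingIndex:
    """Bounded (key, peer) index with LRU replacement and per-entry expiry."""

    def __init__(self, max_size, ttl):
        self.max_size = max_size
        self.ttl = ttl
        self.entries = OrderedDict()  # key -> (peer, expiry time)

    def put(self, key, peer, now):
        self.entries.pop(key, None)
        self.entries[key] = (peer, now + self.ttl)
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # evict the least recently used entry

    def lookup(self, key, now):
        entry = self.entries.get(key)
        if entry is None:
            return None
        peer, expiry = entry
        if now >= expiry:
            del self.entries[key]  # entry is stale, forget it
            return None
        self.entries.move_to_end(key)  # successful use refreshes the LRU position
        return peer

index = RoutingIndex(max_size=2, ttl=10)
index.put("basketball", "P1", now=0)
index.put("nba", "P2", now=1)
hit = index.lookup("basketball", now=2)
index.put("soccer", "P3", now=3)   # evicts "nba", the least recently used key
```

A forgetting factor would instead keep a reliability score per entry and multiply it by a constant below 1 at regular intervals.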

4.3 An Algorithm for Correct Query Routing
- Goal: progressive distributed top-k ranking of documents
- Putting techniques together to design an efficient top-k algorithm
  - Minimal number of object transfers
  - Optimal number of object accesses
- Features of the P2P-based approach
  - Optimized query routing
  - No global index
  - Query-driven term indexing

4.3 Bird's View
1. Distribute the query through the network (Routing)
2. Every peer scores documents locally (Ranking)
3. Hierarchical construction of the final result (Merging)
4. Optimized query routing (Index)

4.3 Building Blocks
- Structured network
- Local ranking
- Result merging
- Query-driven index

4.3 Network Structure
- Observation: peers strongly differ in availability, bandwidth, computing power, ...
- Hierarchical network structure with super-peers
  - Query routing
  - Result merging
  - Indexes

4.3 Network Topology
- Super-peers are organized as a hypercube (HyperCuP protocol)
  - Resilient against leaving peers
  - Broadcast with (n-1) messages and log2(n) hops
  - Minimal spanning tree
[Figure: hypercube of super-peers SP1 to SP8 and the corresponding broadcast spanning tree]
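The message and hop counts can be reproduced with a simplified broadcast on a complete hypercube. This is a sketch of the general idea only (a node that received the message over dimension d forwards it only over higher dimensions); it is not the HyperCuP protocol itself and ignores churn.

```python
def hypercube_broadcast(dims, origin=0):
    """Broadcast in a complete hypercube with 2**dims nodes.

    Returns (number of messages, {node: hop count}).
    """
    hops = {origin: 0}
    messages = 0
    # Each entry: (node, lowest dimension the node may still forward on).
    frontier = [(origin, 0)]
    while frontier:
        node, start = frontier.pop()
        for d in range(start, dims):
            neighbor = node ^ (1 << d)  # flip bit d to reach the neighbor
            hops[neighbor] = hops[node] + 1
            messages += 1
            frontier.append((neighbor, d + 1))
    return messages, hops

messages, hops = hypercube_broadcast(dims=3)  # n = 8 super-peers
```

Because every node is reached exactly once, the forwarding paths form a spanning tree of the hypercube.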

4.3 Local Ranking
- A super-peer (SP) asks for local rankings of its peers' collections
- Top-k results (plus metric-dependent information) are returned to the SP
- Arbitrary similarity measures can be used
  - TFxIDF
  - Similarities in taxonomies

4.3 Result Merging
- Results are merged at the super-peers
  - Unique scoring function
  - Maximum of k messages per SP-SP edge
[Figure: super-peers SP A to SP D with their attached peers and the query Q]

4.3 Indexing
- Super-peers keep indexes
  - IDFs (collection-wide information): IDF values for query terms
  - Top peers (routing): list of peers that have already contributed to a previous top-k result
  - Others are possible, e.g. for taxonomies
- Index entries are query-driven

4.3 Routing Indexes Example: Top k Query Routing
- Example for routing indexes in P2P networks with a super-peer backbone holding the routing indexes
- Progressive P2P top-k algorithm [Balke et al., 04]
  - If query q is indexed, distribute the query and collect the results
  - Otherwise flood the query and
    - compute ranks at local peers
    - merge results at super-peers
    - use statistics for a new entry in the routing index (routing information, collection-wide information, etc.)
- Data structures at super-peers
  - RequestResults: peers which are queried for results (index information)
  - BestPeers: peers which delivered the recent best result
  - TopRes: current top results
  - Delivered: delivered results
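The ranking and merging steps can be sketched as follows. This is a simplified, non-progressive illustration: each peer returns its local top k, and a super-peer merges the lists. Peer names, document names, and scores are hypothetical, and the full algorithm's bookkeeping (RequestResults, Delivered) is left out.

```python
import heapq

def local_top_k(scores, k):
    """scores: {doc_id: score}. Return the k best (score, doc_id) pairs."""
    return heapq.nlargest(k, ((score, doc) for doc, score in scores.items()))

def merge_top_k(result_lists, k):
    """Merge several local top-k lists into one global top-k list."""
    merged = (item for results in result_lists for item in results)
    return heapq.nlargest(k, merged)

# Hypothetical local scores for one query.
peers = {
    "P1": {"d11": 0.8, "d12": 0.3, "d13": 0.2},
    "P2": {"d21": 0.7, "d22": 0.4, "d23": 0.3},
    "P3": {"d31": 0.6, "d32": 0.6},
    "P4": {"d41": 0.5, "d42": 0.5},
}
k = 2
local = {peer: local_top_k(scores, k) for peer, scores in peers.items()}
top = merge_top_k(local.values(), k)

# Peers that contributed to the result are candidates for the routing index.
best_peers = {peer for peer, results in local.items() if any(r in top for r in results)}
```

Merging only works if all peers use the same scoring function with the same collection-wide statistics; otherwise the scores are not comparable.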

4.3 Routing Indexes Example: Top k Query Routing
[Figure sequence, five frames: a backbone of super-peers SP1 to SP8 with attached peers P1 to P4, each holding locally scored documents. A query q asks for the top 2 documents and arrives at SP4, whose routing index is initially empty. The frames show the data structures RequestResults, BestPeers, TopRes, and Delivered changing at the super-peers as results are requested from P2, P3, P4, and neighboring super-peers, then merged and delivered. The final result is q: {(d11, 0.8), (d21, 0.7)}, and the last frame shows the routing index entries for q at SP1, SP2, and SP4]

4.3 Query Routing
- At the first appearance of a query, peers only send out their input for the IDF computation
  - Super-peers aggregate the IDFs and build the index
- Whenever a query is repeated, super-peers send the recent IDF values together with the query terms
  - Peers use these IDFs for local score computation
- Disadvantage: at its first occurrence, a query has to be sent twice
  - The Zipf distribution of queries keeps the number of affected queries small
- Advantages
  - No effort for maintaining a global IDF index
  - Values for frequently occurring queries are kept up to date

4.3 Query Routing and Network Churn
- Query index strategy: send queries only to peers that have recently contributed to answering a query
- Problem: the volatility of the network and of each peer
  - Solution 1: send queries also to a randomly selected set of peers
  - Solution 2: "best before" timestamp
[Figure: super-peer backbone SP1 to SP8 with several elements crossed out]

4.3 Locality-Based Routing Indexes
- Refinement of routing indexes by social metaphors
  - Retrieval process similar to real life
  - Every person has only limited knowledge of the environment
  - Who knows about a certain topic?
  - Who might know other people who know about the topic?
  - Try to build (short) chains of acquaintances that bring you close to the requested information
- Aims at building social networks as overlays
  - Peers semantically connected by certain topics form small-world networks, e.g. [Milgram, 67; Kleinberg, 00]
- Paradigm of interest-based locality
  - If a peer has relevant content for a user's query, it very often also has other content that this user might be interested in

4.3 Locality-Based Routing Indexes
- For information retrieval in P2P networks, this enables new routing in interest-based overlay structures
  - Route queries to peers with documents matching semantically close queries
- Traces on practical data collections show that
  - Peers get well connected
  - The overlay graph is highly clustered, with a small minimum distance between any two nodes
- Overhearing communications routed through a peer can be used to enhance its local index
- Randomly sending queries also to peers from the default network helps to extend knowledge and can remedy the effect of network churn

4.4 Supporting Effective P2P IR
- P2P information retrieval has to deal with the trade-off between
  - Efficient local maintenance of statistics / index information, where information can be stale (incorrect)
  - Expensive global maintenance of statistics / index information, where information is always accurate
- What is needed is just the right level of dissemination of statistics to guarantee sufficiently effective retrieval
- Some techniques help to support efficient retrieval
  - Providing adequate collection-wide information
  - Estimating the document overlap between peers
  - Pre-structuring collections by categories / taxonomies

4.4 Providing Collection-Wide Information
- Collection-wide information (CWI), e.g. IDFs, is important for retrieval quality, but cannot be calculated locally
- Some systems, e.g. PlanetP, do not use CWI directly, but circumvent the problem by using an inverted peer frequency (IPF), computed from N, the number of all peers, and N_t, the number of peers offering documents on term t
  - If summarizations of peers (abstracts) are eagerly disseminated, each peer can locally determine the values of N and N_t
  - The relevance of a peer in multi-keyword queries is simply the sum of the IPFs of the individual terms
  - Practical tests show an average overlap of about 7% between result sets retrieved with IDFs and those retrieved with IPFs
  - Using IPFs, the scalability is, however, still limited
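A small sketch of peer ranking by inverted peer frequency. It assumes the form IPF_t = log(1 + N / N_t), which is how the measure is usually stated for PlanetP; treat the exact formula as an assumption, since it is not spelled out here. Peer abstracts are modeled as plain term sets instead of Bloom filters.

```python
import math

def ipf(num_peers, peers_with_term):
    """Inverted peer frequency, assuming the form log(1 + N / N_t)."""
    return math.log(1 + num_peers / peers_with_term)

def rank_peers(query_terms, peer_terms):
    """peer_terms: {peer: set of terms}. Rank peers by the sum of IPFs
    of the query terms they offer."""
    n = len(peer_terms)
    n_t = {t: sum(t in terms for terms in peer_terms.values()) for t in query_terms}
    scores = {
        peer: sum(ipf(n, n_t[t]) for t in query_terms if t in terms)
        for peer, terms in peer_terms.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical peer abstracts.
peer_terms = {
    "P1": {"basketball", "nba"},
    "P2": {"basketball"},
    "P3": {"politics"},
}
ranking = rank_peers(["basketball", "nba"], peer_terms)
```

Only N and N_t are needed, and both can be derived locally from the disseminated abstracts, without knowing any document-level statistics of other peers.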

4.4 Providing Collection-Wide Information
- Tests in Web information retrieval, e.g. [Viles & French, 95], show that CWI stays relatively stable over the whole collection of Web sites even with churn
  - Only joining or leaving corpora on completely new topics result in significant change
- Indexing CWI in a similar way as the routing information for queries is possible [Balke et al., 05]
  - In structured networks, CWI can be aggregated along the backbone and indexed
  - CWI can be distributed together with the query
- New queries have to be flooded/routed twice
  - The first flooding collects and aggregates CWI
  - The second one provides the correct CWI for local scoring
  - Non-expired indexed CWI can always be used when available

4.4 Estimating the Document Overlap
- Assessing the novelty of collections also supports retrieval quality
  - Pre-computed statistics about the expected result quality in each collection are often used to minimize the number of queried collections
  - Choosing collections with high overlap for querying will usually not improve result sets sufficiently to justify the access costs
  - Especially progressive searches, like top-k searches, profit from focusing on collections with small overlaps, since result merging procedures will ignore identical/similar results
- The novelty of a collection can only be calculated with respect to some reference collection(s), e.g. those collection(s) already in a peer's local routing index

4.4 Estimating the Document Overlap
- A definition of the novelty of a peer p's collection C_p with respect to a reference collection C_ref is given in [Bender et al., 05]
- Since the information about exactly which documents a peer offers is usually not disseminated, the values have to be approximated from statistics
- E.g. if abstracts in the form of Bloom filters are given, a combined Bloom filter b_p can be calculated by bitwise logical AND between p's Bloom filters for all keywords in a query
- Novelty can then be estimated by comparing b_p to the union of the Bloom filters b_i of the set of collections S that have already been retrieved
- The degree of novelty is given by counting the positions where p's Bloom filter has set bits that the union does not
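A minimal sketch of the Bloom-filter-based novelty estimate, with filters stored as Python integers used as bit masks. The filter size `M`, the number of hash functions `K`, the SHA-256-based hashing and all function names are assumptions made for illustration. The returned bit count is only a proxy for the number of novel documents.

```python
import hashlib

M = 256  # filter size in bits (illustrative assumption)
K = 3    # number of hash functions (illustrative assumption)

def bloom(doc_ids):
    """Build a Bloom filter (as an int bit mask) over a set of document ids."""
    bits = 0
    for doc_id in doc_ids:
        for i in range(K):
            digest = hashlib.sha256(f"{i}:{doc_id}".encode()).digest()
            bits |= 1 << (int.from_bytes(digest[:4], "big") % M)
    return bits

def combined_filter(keyword_filters, query):
    """b_p: bitwise AND of the peer's per-keyword filters for all query terms."""
    result = (1 << M) - 1
    for term in query:
        result &= keyword_filters.get(term, 0)
    return result

def novelty(b_p, retrieved_filters):
    """Count bit positions set in b_p but not in the union of retrieved filters."""
    b_ref = 0
    for b_i in retrieved_filters:
        b_ref |= b_i  # union of the filters of already retrieved collections
    return bin(b_p & ~b_ref).count("1")

# Hypothetical peer abstract: one filter per keyword over matching document ids.
peer = {"london": bloom({"d1", "d2", "d3"}), "olympics": bloom({"d2", "d3", "d4"})}
b_p = combined_filter(peer, ["london", "olympics"])
already_retrieved = [bloom({"d2"})]
estimate = novelty(b_p, already_retrieved)
```

A peer whose combined filter is fully covered by the reference union gets a novelty of zero and can be skipped during collection selection.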

4.4 Prestructuring Collections with Taxonomies
- Retrieval in P2P systems generally considers two basic paradigms
  - Fulltext-based queries
  - Metadata-based queries
- Integrating these paradigms can support retrieval effectiveness
  - Structuring document collections
  - Disambiguation of query terms
- Peers often host collections of similar documents, e.g. a similar kind of information (newspaper articles, etc.) on similar topics
- Scalability and successful use of statistics are strongly improved if a common system of categories can be used to classify the documents
- Since categories are more or less similar to each other, a taxonomy on categories makes it easy to find semantically similar documents

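4.4 Prestructuring Collections with Taxonomies
- Topical similarity within a taxonomy is defined by [Li et al., 03]

  $sim(c_1, c_2) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}$

  - l: length of the shortest path between categories c_1 and c_2
  - h: level of the common subsumer
  - Common values $\alpha = 0.2$, $\beta = 0.6$ (experimentally determined)
- E.g. newspaper articles:

  News
    Business
    Politics
      Foreign
      Domestic
    Sports
      Tennis

  - sim(Politics, Sports): l = 2, h = 1, sim = 0.35
  - sim(Politics, Foreign): l = 1, h = 2, sim = 0.68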
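The newspaper example can be checked with a short sketch. It assumes the similarity has the form exp(-alpha * l) * tanh(beta * h) with alpha = 0.2 and beta = 0.6, and that the root category News is at level 1. The child-to-parent encoding of the taxonomy and the function names are hypothetical.

```python
import math

# Hypothetical encoding of the example taxonomy as child -> parent links.
PARENT = {
    "Business": "News", "Politics": "News", "Sports": "News",
    "Foreign": "Politics", "Domestic": "Politics", "Tennis": "Sports",
}

def path_to_root(category):
    """Return the list of categories from the given one up to the root."""
    path = [category]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def similarity(c1, c2, alpha=0.2, beta=0.6):
    """Topical similarity, ASSUMING the form exp(-alpha*l) * tanh(beta*h)."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    subsumer = next(c for c in p1 if c in p2)    # lowest common subsumer
    l = p1.index(subsumer) + p2.index(subsumer)  # shortest path length
    h = len(path_to_root(subsumer))              # level of subsumer, root = 1
    return math.exp(-alpha * l) * math.tanh(beta * h)

sim_sports = similarity("Politics", "Sports")    # l = 2, h = 1
sim_foreign = similarity("Politics", "Foreign")  # l = 1, h = 2
```

Under these assumptions the two values come out at roughly 0.36 and 0.68. The first matches the listed .35 only if that figure was truncated rather than rounded.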

4.4 Combination of Topics and Keywords
- Topics dominate keywords
- Cooperative filter: relax on topics until k results have been found
- Example: [<Politics>, "London Olympics"]
[Figure: topic similarity over the text collection; the taxonomy categories Politics, Foreign, Domestic, Sports, Business and Tennis, with the relaxation proceeding from Politics via Foreign to Sports]
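One possible reading of the cooperative filter, as a sketch: topics are visited in order of decreasing similarity to the query topic, and relaxation stops as soon as k keyword matches have been collected. The function name, the document tuples and the similarity values in `topic_sim` are illustrative assumptions, not taken from the slides.

```python
def relaxed_search(keywords, docs, topic_sim, k):
    """Return up to k keyword matches, visiting topics by decreasing similarity.

    topic_sim maps each topic to its (assumed) similarity to the query topic.
    """
    results = []
    for topic in sorted(topic_sim, key=topic_sim.get, reverse=True):
        for doc_id, doc_topic, text in docs:
            words = text.lower().split()
            if doc_topic == topic and all(w in words for w in keywords):
                results.append((doc_id, topic, topic_sim[topic]))
        if len(results) >= k:  # stop relaxing once k results have been found
            break
    return results[:k]

# Hypothetical documents: (id, topic, text).
docs = [
    ("d1", "Politics", "London Olympics security debate"),
    ("d2", "Foreign", "Olympics boycott talks in London"),
    ("d3", "Sports", "London Olympics tennis draw"),
    ("d4", "Politics", "Budget vote"),
]
# Illustrative similarities to the query topic <Politics>.
topic_sim = {"Politics": 1.0, "Foreign": 0.68, "Domestic": 0.68, "Sports": 0.36}

top2 = relaxed_search(["london", "olympics"], docs, topic_sim, k=2)
top3 = relaxed_search(["london", "olympics"], docs, topic_sim, k=3)
```

With k = 2 the search stops after relaxing from Politics to Foreign. Only with k = 3 does it relax far enough to reach the Sports document, which reflects that topics dominate keywords.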

4.4 Combination of Topics and Keywords
[Figure: example of a top-k query processed over a backbone of super-peers (SP) with attached peers (P); the peers hold documents scored against the topics Politics, News and Sports, and the super-peers exchange RequestResults, TopRes and Delivered sets]

4.5 Overview
1. Introduction
2. Content Searching in Peer-to-Peer Applications
   1. Problems in Peer-to-Peer Information Retrieval
   2. Related Work in Distributed Information Retrieval
3. Index Structures for Query Routing
   1. Distributed Hash Tables for Information Retrieval
   2. Routing Indexes for Information Retrieval
   3. Locality-Based Routing Indexes
4. Supporting Effective Information Retrieval
   1. Providing Collection-Wide Information
   2. Estimating the Document Overlap
   3. Prestructuring Collections with Taxonomies
5. Summary and Conclusion

4.5 Summary and Conclusion
- In today's P2P systems only exact-match keyword retrieval is prevalent (usually on metadata)
- Information retrieval in P2P scenarios is needed
  - Individual, loosely coupled document collections need fulltext retrieval and ranking techniques
  - Applications range from shared working environments, e.g. in project groups, to distributed digital libraries
- Almost all IR systems use at least some global statistics; in P2P infrastructures the dissemination of the necessary statistics becomes a performance bottleneck
  - The trade-off between cached, but sometimes stale statistics and new, but expensively updated statistics needs to be managed
  - How much staleness does a sufficient retrieval effectiveness allow?

4.5 Summary and Conclusion
- Choosing the right collections for querying improves retrieval efficiency
  - Collections containing the most promising documents with as little overlap as possible
  - Small worlds offer quick connections to semantically close collections
- Query routing indexes can handle some network churn while providing results of sufficient quality
  - Local indexes can be efficiently maintained
  - They can exploit Zipf-distributed content allocation and querying behavior
  - Only small numbers of peers need to be contacted
- Supporting techniques like efficient CWI estimation / dissemination or taxonomies of document categories further improve retrieval