Peer-to-Peer Data Management

Transcription

1 Peer-to-Peer Data Management Wolf-Tilo Balke Sascha Tönnies Institut für Informationssysteme Technische Universität Braunschweig

2 4.1 Overview
1. Introduction
2. Content Searching in Peer-to-Peer Applications
   1. Problems in Peer-to-Peer Information Retrieval
   2. Related Work in Distributed Information Retrieval
3. Index Structures for Query Routing
   1. Distributed Hash Tables for Information Retrieval
   2. Routing Indexes for Information Retrieval
   3. Locality-Based Routing Indexes
4. Supporting Effective Information Retrieval
   1. Providing Collection-Wide Information
   2. Estimating the Document Overlap
   3. Prestructuring Collections with Taxonomies
5. Summary and Conclusion

3 4.1 What is IR? Information retrieval (IR) is the science of searching for documents, for information within documents, and for metadata about documents. A user enters a query, i.e. an expression of an information need, into the system. Several objects may match the query with different degrees of relevance.

4 4.1 Representing Text. How do we represent the complexities of language? Computers don't understand documents or queries. Simple, yet effective approach: bag of words. Treat all the words in a document as index terms for that document. Assign a weight to each term based on its importance. Disregard order, structure, meaning, etc. of the words.
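
The bag-of-words idea can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the tokenizer, the tiny stopword set, and the sample sentence are assumptions made for the example, and raw counts stand in for term weights.

```python
import re
from collections import Counter

# Hypothetical stopword set for this sketch (not taken from the slides).
STOPWORDS = {"the", "a", "of", "in", "its", "is", "to"}

def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, drop stopwords,
    and count how often each remaining term occurs (order is discarded)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

bag = bag_of_words(
    "McDonald's is cutting the fat in its french fries. The fries taste the same."
)
print(bag.most_common(3))
```

Using a `Counter` keeps exactly the information the slide asks for: which terms occur and how often, with no positional or structural information.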

5 4.1 Representing Text. Example article: McDonald's slims down spuds. Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $.8 to $34.9, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. Bag of words: 16 × said; 14 × McDonalds; 12 × fat; 11 × fries; 8 × new; 6 × company, french, nutrition; 5 × food, oil, percent, reduce, taste, Tuesday.

6 4.1 Retrieval. Retrieving relevant information is hard! Evolving, ambiguous user needs, context, etc. Complexities of language. To operationalize information retrieval, we must vastly simplify the picture: information retrieval is all (and only) about matching words in documents with words in queries. Obviously, this is not true. But it works pretty well!

7 4.1 Representing Documents as Vectors. Document 1: The quick brown fox jumped over the lazy dog's back. Document 2: Now is the time for all good men to come to the aid of their party. Terms: aid, all, back, brown, come, dog, fox, good, jump, lazy, men, now, over, party, quick, their, time. Stopword list: for, is, of, the, to.

8 4.1 Representing Text. How to compare documents and queries? [Figure: text-processing pipeline from the document (text + structure) through structure recognition; accents, spacing, etc.; stopwords; noun groups; stemming; and automatic or manual indexing, yielding structure, full text, and index terms.]

9 4.1 Boolean Retrieval. Weights assigned to terms are either 0 or 1. 0 represents absence: the term isn't in the document. 1 represents presence: the term is in the document. Build queries by combining terms with the Boolean operators AND, OR, NOT. The system returns all documents that satisfy the query.

10 4.1 Boolean View of a Document Set (= Collection). [Table: term-document matrix with one column per document (Doc 1 to Doc 8) and one row per term: aid, all, back, brown, come, dog, fox, good, jump, lazy, men, now, over, party, quick, their, time.] Each column represents the view of a particular document: what terms are contained in this document? Each row represents the view of a particular term: what documents contain this term? To execute a query, pick out the rows corresponding to the query terms and then apply the logic table of the corresponding Boolean operator.

11 4.1 Sample Queries. [Table: the rows for dog, fox, good, party, and over from the term-document matrix (Doc 1 to Doc 8).] dog AND fox: Doc 3, Doc 5. dog OR fox: Doc 3, Doc 5, Doc 7. dog NOT fox: empty. fox NOT dog: Doc 7. good AND party: Doc 6, Doc 8. good AND party NOT over: Doc 6.
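
The sample queries map directly onto set operations. The sketch below assumes posting sets for five terms that are consistent with the query results on the slide; the document numbers are a reconstruction under the assumption that the digits missing from this transcript are 1s.

```python
# Posting sets (document numbers) per term. Assumed values, chosen to be
# consistent with the query results listed on the slide.
postings = {
    "dog": {3, 5},
    "fox": {3, 5, 7},
    "good": {2, 4, 6, 8},
    "party": {6, 8},
    "over": {1, 3, 5, 7, 8},
}

# AND is set intersection, OR is union, NOT (as "a NOT b") is set difference.
dog_and_fox = postings["dog"] & postings["fox"]
dog_or_fox = postings["dog"] | postings["fox"]
dog_not_fox = postings["dog"] - postings["fox"]
fox_not_dog = postings["fox"] - postings["dog"]
good_and_party = postings["good"] & postings["party"]
good_and_party_not_over = good_and_party - postings["over"]

print(sorted(dog_and_fox), sorted(dog_or_fox), sorted(fox_not_dog))
```

Every matching document is returned with equal standing, which is exactly the lack of ranking that the following slides criticize.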

12 4.1 The Perfect Query Paradox. Every information need has a perfect set of documents; if not, there would be no sense in doing retrieval. Every document set has a perfect query: AND every word in a document to get a query for it, repeat for each document in the set, then OR every document query to get the set query. But can users realistically be expected to formulate this perfect query? Boolean query formulation is hard!

13 4.1 Why Boolean Retrieval Fails. Natural language is way more complex. AND discovers nonexistent relationships: terms in different sentences, paragraphs, ... Guessing terminology for OR is hard: good, nice, excellent, outstanding, awesome, ... Guessing terms to exclude is even harder: Democratic party, party to a lawsuit, ...

14 4.1 Strengths and Weaknesses. Strengths: precise, if you have a clear idea of what you're looking for; efficient for the computer. Weaknesses: users must learn Boolean logic; Boolean logic is insufficient to capture the richness of language; no control over the size of the result set (either too many documents or none); all documents in the result set are considered equally good. What about partial matches? Documents that don't quite match the query may also be useful.

15 4.1 Ranked Retrieval. Order documents by how likely they are to be relevant to the information need. Present hits one screen at a time. At any point, users can continue browsing through the ranked list or reformulate the query. Attempts to retrieve relevant documents directly, not merely provide tools for doing so.

16 4.1 Why Ranked Retrieval? Arranging documents by relevance is closer to how humans think (some documents are better than others) and closer to user behavior (users can decide when to stop reading). Best (partial) match: documents need not have all query terms, although documents with more query terms should be better.

17 4.1 Similarity-Based Retrieval. Let's replace relevance with similarity: rank documents by their similarity with the query. Treat the query as if it were a document: create a query bag-of-words, find its similarity to each document, and rank order the documents by similarity. Surprisingly, this works pretty well!

18 4.1 Vector Space Model. [Figure: documents d1 to d5 drawn as vectors in a space spanned by the terms t1, t2, and t3, with angles θ and φ between vectors.] Postulate: documents that are close together in vector space talk about the same things. Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ "closeness").
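
One common way to measure closeness in the vector space is the cosine of the angle between two term vectors. The slide does not name a specific measure, so cosine similarity is an assumption here, as are the three toy documents and the query.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term vectors (0.0 if one is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy collection and query (illustrative only); the query is treated as a document.
docs = {
    "d1": Counter("quick brown fox".split()),
    "d2": Counter("lazy dog".split()),
    "d3": Counter("brown dog dog".split()),
}
query = Counter("brown dog".split())

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranking)
```

Documents that share no term with the query get similarity 0, while partial matches still receive a graded score, which is what makes ranking possible.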

19 4.1 How to Weight Terms? Idea: Hans Peter Luhn, 1958, IBM. Here's the intuition: terms that appear often in a document should get high weights: the more often a document contains the term "dog", the more likely it is that the document is about dogs. Terms that appear in many documents should get low weights: words like "the", "a", "of" appear in (nearly) all documents. How do we capture this mathematically? Term frequency and inverse document frequency.

20 4.1 TFxIDF. TFxIDF [Gerald Salton, 96]. Term Frequency (TF): how often a term appears in a document; can be calculated locally. Document Frequency (DF): number of documents that contain a specific term; needs global (system-wide) knowledge. Inverse Document Frequency (IDF): discriminator for the importance of a term regarding the number of occurrences in all documents; needs global (system-wide) knowledge.
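
The split between local and global knowledge becomes visible in code. The sketch below uses one common weighting variant, tf · log(N/df); the slides do not fix a particular formula, so this choice and the three toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(collection):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(collection)
    # DF needs global knowledge: every document has to be inspected.
    df = Counter(term for doc in collection for term in set(doc))
    weights = []
    for doc in collection:
        tf = Counter(doc)  # TF is purely local to one document.
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

collection = [["dog", "dog", "the"], ["the", "fox"], ["the", "party"]]
w = tf_idf(collection)
print(w[0])
```

A term that occurs in every document ("the") ends up with weight 0, while a term repeated within a single document ("dog") gets the highest weight.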

21 4.1 Working on Indices. [Table: term-document matrix with columns Doc 1 to Doc 8 and rows for the terms quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party.] The term-document matrix again has bag-of-words information about the collection.

22 4.1 Small yet Fast? Can we make this data structure smaller, keeping in mind the need for fast retrieval? Observations: the nature of the search problem requires us to quickly find which documents contain a term; the term-document matrix is very sparse; some terms are more useful than others.

23 4.1 Posting Lists. [Table: the term-document matrix (Doc 1 to Doc 8) next to the posting list of each term.]
Term: Postings
aid: 4, 8
all: 2, 4, 6
back: 1, 3, 7
brown: 1, 3, 5, 7
come: 2, 4, 6, 8
dog: 3, 5
fox: 3, 5, 7
good: 2, 4, 6, 8
jump: 3
lazy: 1, 3, 5, 7
men: 2, 4, 8
now: 2, 6, 8
over: 1, 3, 5, 7, 8
party: 6, 8
quick: 1, 3
their: 1, 5, 7
time: 2, 4, 6

24 4.1 Inverted Document Index. Term → Postings: aid → 4, 8; all → 2, 4, 6; back → 1, 3, 7; brown → 1, 3, 5, 7; come → 2, 4, 6, 8; dog → 3, 5; fox → 3, 5, 7; good → 2, 4, 6, 8; jump → 3; lazy → 1, 3, 5, 7; men → 2, 4, 8; now → 2, 6, 8; over → 1, 3, 5, 7, 8; party → 6, 8; quick → 1, 3; their → 1, 5, 7; time → 2, 4, 6.

25 4.1 What Goes in the Postings? Boolean retrieval: just the document number. Ranked retrieval: document number and term weight (tf.idf, ...). Proximity operators: word offsets for each occurrence of the term.
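
A single positional index can serve all three uses named above. The following sketch is an illustration under assumptions: the function names, the whitespace tokenizer, and the two toy documents are hypothetical, and term frequency stands in for a full term weight.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [word offsets]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(offset)
    return index

def phrase_docs(index, first, second):
    """Proximity query: documents where `second` directly follows `first`."""
    hits = set()
    for doc_id, offsets in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(o + 1 in following for o in offsets):
            hits.add(doc_id)
    return hits

docs = {1: "quick brown fox", 2: "brown dog chases brown fox"}
index = build_index(docs)

boolean_postings = sorted(index["brown"])                          # document numbers only
term_frequencies = {d: len(p) for d, p in index["brown"].items()}  # input for term weights
print(boolean_postings, term_frequencies, phrase_docs(index, "brown", "dog"))
```

Storing offsets makes the postings larger, so an index that only has to answer Boolean queries would keep just the document numbers.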

27 4.2 Information Retrieval in P2P Systems. Information retrieval deals with complex documents. Metadata can only capture some aspects of a document, but cannot anticipate all semantic searches, e.g. a sports-related newspaper article, but no names, locations, etc. Support for full-text searches is needed. Find the best-matching document from the best-connected peer: unlike in file sharing, the emphasis is on document quality. If there are multiple sources offering documents of similar quality, choose the best peer according to connection, etc.

28 4.2 Challenges in P2P IR. Efficient query evaluation scheme: a central inverted index of documents is expensive to maintain. How to disseminate a peer's query? Simple flooding of all queries is not scalable if the best documents have to be found (not just some match). Dealing with network churn: a peer can always alter the set of documents offered, or significantly change individual documents. Peers may join and leave the network, i.e. whole document collections may disappear or be added. Integration of collection-wide information: peers are not able to calculate IR-style scores from local knowledge, but need some knowledge of the (virtual) merged collection. Constant dissemination of collection-wide information needs a lot of bandwidth.

29 4.2 Example: Problem of Collection-wide Information. Example: different news collections, query on the keyword "basketball". General news collection: many articles, only few about basketball, therefore the IDF is high; the keyword discriminates well between articles. NBA news collection: few articles, almost all about basketball, therefore the IDF is low; the keyword hardly discriminates between articles. Merged collection: medium IDF. But how do independent collections (peers) exchange their information?

30 4.2 Example: Problem of Collection-wide Information. [Figure: a querying peer issues the query "A and B" to Peer 1 and Peer 2. With local scoring, each peer computes TF and IDF from its own collection (IDF values such as 3/2 and 3/1) and reports a different top object. With global scoring over the merged collection (IDF = 6/3), all objects score identically.]

31 4.2 Distributed IR. Distributed information retrieval techniques grew increasingly important for searching Web sources. Abstracts of information sources: to support distributed retrieval, sources have to register abstracts or keyword sets. Abstracts can either be kept in a central repository or distributed by gossiping algorithms, e.g. PlanetP [Cuenca-Acuna et al., 03]. Collection selection: having no central index requires a sophisticated way of choosing the most promising collections for querying.

32 4.2 Distributed IR. Such abstracts can be compactly represented by Bloom filters, i.e. bit vectors that allow membership queries. Each term is hashed with n different functions, and the position in the bit vector for each hash value is set to 1. This allows for false positives, but no false negatives. In counting Bloom filters, objects can also be removed.
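
A Bloom filter abstract of a peer's vocabulary can be sketched as follows. The class name, the filter size, and the use of salted SHA-256 digests to simulate n independent hash functions are assumptions made for this illustration; a real system would tune the size and number of hashes to the expected vocabulary.

```python
import hashlib

class BloomFilter:
    """Bit vector with num_hashes hash functions; supports add and membership test."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bit vector

    def _positions(self, term):
        # Simulate independent hash functions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos  # set the bit for each hash value to 1

    def __contains__(self, term):
        # True if all positions are set: may be a false positive, never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(term))

abstract = BloomFilter()
for term in ["basketball", "nba", "playoffs"]:
    abstract.add(term)
print("nba" in abstract)
```

Because bits are only ever set, a plain Bloom filter cannot forget a term; the counting variant mentioned on the slide replaces each bit with a small counter to allow removal.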

33 4.2 Distributed IR. Benefit estimators for collection selection use aggregated statistics about individual collections, e.g. the CORI measure [Callan et al., 95]. CORI calculates a collection score s_i for collection i regarding query q (formulas not preserved in this transcript), where n is the number of collections, cdf the collection document frequency, cdf_max the maximum cdf, and cf_t the collection frequency of term t.

35 4.3 Index Structures for Query Routing. Traditional index structures cannot be readily employed in P2P systems: high degree of distribution, high degree of volatility (churn), high degree of index maintenance. Distributed paradigms are needed to route queries to appropriate peers: the simple flooding method does not scale; distributed hash table lookup; using indexed routing information; using shortcut overlays.

36 4.3 Distributed Hash Tables for IR. Distributed hash tables route queries to appropriate peers with a number of hops logarithmic in the network size. No peer needs to maintain more than a logarithmic amount of routing information. But: exact-match queries only. All new content has to be published if peers join or change; old content has to be unpublished if peers leave. Documents added or removed will contain a lot of different terms to be published or unpublished; thus, usually many index peers have to be addressed. A conjunction of query terms needs to access many peers, but there is still no guarantee that a single document matching the conjunction exists.

37 4.3 Distributed Hash Tables for IR. Improvement: hybrid P2P infrastructures [Loo et al., 04]. The efficiency of DHTs is worst if highly replicated items are requested; experiments show worse behavior than flooding, degrading with churn. Querying and content allocation follow a Zipf distribution: only few items are highly replicated and often queried. "People are looking for hay, not for needles" (S. Shenker). Hybrid P2P infrastructures use DHTs only for the less replicated and rarely queried items; all other queries are flooded. Still, DHTs have to be maintained for the majority of query terms. [Figure: query frequency distribution; x-axis: query, y-axis: occurrence frequency in %.]

38 4.3 Routing Indexes for IR. Routing indexes are local collections of (key, peer) pairs. The key is either a keyword or a query. The peer is the address of a peer that either offers relevant results or routes the query to other peers with relevant results. In contrast to flooding, only interesting directions are queried. A distinction is often made between links in the default network (directions of content providers) and an overlay structure of direct links to content providers ("shortcuts"). First introduced by [Crespo & Garcia-Molina, 02] to choose the best neighbors in the default network for query forwarding. Index maintenance is of local nature, and index coverage is usually high due to the Zipf distribution of requests. The correctness of the index is influenced by network volatility/churn.

39 4.3 Routing Indexes for IR. Routing index policies in the face of network churn. With restricted index sizes, new entries are collected and always stored; if the maximum size is reached, some stale information is replaced. A simple strategy always replaces the currently oldest index entries. The least recently used (LRU) strategy assigns higher usefulness to entries that have been successfully used recently. The optimal index size is a problematic parameter. Indexes with unrestricted size have to combat network churn differently: a "time to live" assigns an expiry time to each new index entry; "forgetting factors" can periodically weigh down the reliability of link information.
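
The replacement policies can be illustrated with a small (key, peer) index. The slide presents LRU replacement (for size-restricted indexes) and time to live (for unrestricted ones) as separate options; combining both in one class is a simplification for this sketch, and the class and parameter names are hypothetical.

```python
from collections import OrderedDict

class RoutingIndex:
    """(key, peer) index with LRU replacement and a time to live per entry."""

    def __init__(self, max_size, ttl):
        self.max_size = max_size
        self.ttl = ttl
        self.entries = OrderedDict()  # key -> (peer, expires_at), oldest use first

    def put(self, key, peer, now):
        self.entries[key] = (peer, now + self.ttl)
        self.entries.move_to_end(key)
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # evict the least recently used entry

    def lookup(self, key, now):
        entry = self.entries.get(key)
        if entry is None:
            return None
        peer, expires_at = entry
        if now >= expires_at:          # entry is stale: forget it
            del self.entries[key]
            return None
        self.entries.move_to_end(key)  # a successful use refreshes the entry
        return peer

index = RoutingIndex(max_size=2, ttl=10)
index.put("basketball", "P1", now=0)
index.put("politics", "P2", now=1)
print(index.lookup("basketball", now=2))
```

Timestamps are passed in explicitly to keep the example deterministic; a deployed index would read a clock instead.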

40 4.3 An Algorithm for Correct Query Routing. Goal: progressive distributed top-k ranking of documents. Putting techniques together to design an efficient top-k algorithm: minimal number of object transfers, optimal number of object accesses. Features of the P2P-based approach: optimized query routing, no global index, query-driven term indexing.

41 4.3 Bird's View. 1. Distribute the query through the network (Routing). 2. Every peer scores documents locally (Ranking). 3. Hierarchical construction of the final result (Merging). 4. Optimized query routing (Index).

42 4.3 Building Blocks: structured network, local ranking, result merging, query-driven index.

43 4.3 Network Structure. Observation: peers strongly differ in availability, bandwidth, computing power, ... Hierarchical network structure with super-peers for query routing, result merging, and indexes.

44 4.3 Network Topology. Super-peers are organized as a hypercube (HyperCuP protocol). Resilient against leaving peers. Broadcast with (n-1) messages and log2(n) hops along a minimal spanning tree. [Figure: hypercube of super-peers SP1 to SP8 and the spanning tree used for a broadcast.]
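
The message and hop counts can be checked with a small simulation. This is a sketch of the dimension-ordered broadcast idea only, assuming a complete hypercube of n = 2^d super-peers identified by d-bit integers; it is not the full HyperCuP protocol, which also has to cope with incomplete cubes and with joining or leaving peers.

```python
def hypercube_broadcast(dimensions, source=0):
    """Broadcast from `source` in a complete hypercube.

    A node that received the message over dimension i forwards it only over
    dimensions greater than i, so every node is reached exactly once.
    Returns ({node: hop count}, number of messages sent).
    """
    hops = {source: 0}
    messages = 0
    frontier = [(source, -1)]  # (node, dimension the message arrived on)
    while frontier:
        node, arrived_on = frontier.pop()
        for dim in range(arrived_on + 1, dimensions):
            neighbor = node ^ (1 << dim)  # flipping one bit crosses one edge
            hops[neighbor] = hops[node] + 1
            messages += 1
            frontier.append((neighbor, dim))
    return hops, messages

hops, messages = hypercube_broadcast(dimensions=3)
print(len(hops), messages, max(hops.values()))
```

The forwarding rule implicitly defines a spanning tree rooted at the sender, which is why no super-peer receives the query twice.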

45 4.3 Local Ranking. The super-peer asks for local rankings of its peers' collections. Top-k results (plus metric-dependent information) are returned to the SP. Arbitrary similarity measures can be used: TFxIDF, similarities in taxonomies.

46 4.3 Result Merging. Results are merged at the super-peers. Unique scoring function. Maximum of k messages per SP-SP edge. [Figure: super-peers SP A to SP D with their attached peers and a query Q.]
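
Merging at a super-peer can be sketched as a k-way merge of the ranked lists it receives. The sketch assumes that all peers use the same scoring function, so scores are directly comparable, and that each incoming list is already sorted by descending score; the function name and the sample lists are illustrative.

```python
import heapq
from itertools import islice

def merge_top_k(ranked_lists, k):
    """Merge lists of (score, doc_id) sorted by descending score; keep the k best."""
    merged = heapq.merge(*ranked_lists, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))

# Local top-2 lists as three peers might report them (illustrative values).
from_p2 = [(0.7, "d21"), (0.4, "d22")]
from_p3 = [(0.6, "d31"), (0.6, "d32")]
from_p4 = [(0.5, "d41"), (0.5, "d42")]

print(merge_top_k([from_p2, from_p3, from_p4], k=2))
```

Since only the k best entries are kept at every merge step, no more than k results have to travel over any super-peer link, which matches the bound on the slide.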

47 4.3 Indexing. Super-peers keep indexes. IDFs (collection-wide information): IDF values for query terms. Top peers (routing): list of peers that have already contributed to a previous top-k result. Others are possible, e.g. for taxonomies. Index entries are query-driven.

48 4.3 Routing Indexes Example: Top-k Query Routing. Example for routing indexes in P2P networks with a super-peer backbone holding the routing indexes. Progressive P2P top-k algorithm [Balke et al., 04]: if query q is indexed, distribute the query and collect results; otherwise, flood the query, compute ranks at the local peers, merge results at the super-peers, and use statistics for a new entry in the routing index (routing information, collection-wide information, etc.). Data structures at super-peers: RequestResults: peers which are queried for results (index information); BestPeer: peers which delivered recent best results; TopRes: current top results; Delivered: delivered results.

49 4.3 Routing Indexes Example: Top-k Query Routing. [Figure: super-peer backbone SP1 to SP8; query q ("find top 2 documents") arrives at SP4, whose routing index is empty. Local document scores: P1: d11 0.8, d12 0.3, d13 0.2; P2: d21 0.7, d22 0.4, d23 0.3; P3: d31 0.6, d32 0.6, d33 (score not preserved); P4: d41 0.5, d42 0.5, d43 0.2.] State at SP4: RequestResults {SP8, P2, P3, P4}; BestPeers {}; TopRes {}; Delivered {}.

50 4.3 Routing Indexes Example: Top-k Query Routing. [Figure, next animation step: SP4 has queried SP8, P2, P3, and P4 and merged their local top results in TopRes; BestPeers becomes {P2}, and the best result (P2, d21, 0.7) moves to Delivered.]

51 4.3 Routing Indexes Example: Top-k Query Routing. [Figure, next animation step: at SP1, RequestResults is {SP3, SP5, P1}; P1 contributes (P1, d11, 0.8) and SP2 forwards (SP2, d21, 0.7); BestPeers becomes {P1}, and (d11, 0.8) is the first result returned for q.]

52 4.3 Routing Indexes Example: Top-k Query Routing. [Figure, next animation step: SP1 answers q with {(d11, 0.8), (d21, 0.7)}; Delivered at SP1 is {(P1, d11, 0.8), (SP2, d21, 0.7)}, the BestPeers entries name P1 and SP2, and (P1, d12, 0.3) remains in TopRes.]

53 4.3 Routing Indexes Example: Top-k Query Routing. [Figure, final state: the super-peers on the result path (SP1, SP2, SP4) now hold routing index entries for q, e.g. BestPeers {SP2}, so that a repeated q is forwarded only along these entries; the result is {(d11, 0.8), (d21, 0.7)}.]

54 4.3 Query Routing. At the first appearance of a query, peers only send out their input for IDF computation. Super-peers aggregate IDFs and build the index. Whenever a query is repeated, SPs will send recent IDF values together with the query terms. Peers will use the IDFs for local score computation. Disadvantage: at the first occurrence of a query, it has to be sent twice; the Zipf distribution minimizes the number of queries concerned. Advantages: no effort for maintaining a global IDF index; values for often occurring queries are kept up to date.

55 4.3 Query Routing and Network Churn. Query index strategy: send queries only to peers that have recently contributed to answering a query. Problem: the network's and each peer's volatility. Solution 1: send queries also to a randomly selected set of peers. Solution 2: "best before" timestamp. [Figure: super-peer backbone SP1 to SP8 with several connections crossed out (X).]

56 4.3 Locality-Based Routing Indexes. Refinement of routing indexes by social metaphors. Similar retrieval process as in real life: every person has only limited knowledge of the environment. Who knows about a certain topic? Who might know other people who know about the topic? Try to build (short) chains of acquaintances that will bring you close to the requested information. Aims at building social networks as overlays: peers semantically connected by certain topics form small-world networks, e.g. [Milgram, 67; Kleinberg, 00]. Paradigm of interest-based locality: if a peer has relevant content for a user's query, it very often also has some other content that this user might be interested in.

57 4.3 Locality-Based Routing Indexes. For information retrieval in P2P networks, this enables new routing in interest-based overlay structures: route queries to peers with documents matching semantically close queries. Traces on practical data collections show that peers get well connected: the overlay graph shows highly clustered characteristics with a small minimum distance between any two nodes. Overhearing of communications routed through a peer can be used to enhance its local index. Randomly sending queries also to peers from the default network helps to extend knowledge and can remedy the effect of network churn.

59 4.4 Supporting Effective P2P IR. P2P information retrieval has to deal with the trade-off between efficient local maintenance of statistics / index information, where information can be stale (incorrect), and expensive global maintenance of statistics / index information, where information is always accurate. What is needed is just the right level of dissemination of statistics to guarantee sufficiently effective retrieval. Some techniques help to support efficient retrieval: providing adequate collection-wide information; estimating the document overlap between peers; pre-structuring collections by categories / taxonomies.

60 4.4 Providing Collection-Wide Information. Collection-wide information (CWI), e.g. IDFs, is important for retrieval quality, but cannot be calculated locally. Some systems, e.g. PlanetP, do not use CWI directly, but circumvent the problem by using an inverted peer frequency $IPF_t = \log(1 + N/N_t)$, where N is the number of all peers and N_t is the number of peers offering documents on term t. If summarizations of peers (abstracts) are eagerly disseminated, each peer can locally determine the values for N and N_t. The relevance of a peer for a multi-keyword query is simply the sum of the IPFs for the individual terms. Practical tests show an average overlap of about 7% between result sets retrieved with IDFs and those retrieved with IPFs. Using IPFs, the scalability is, however, still limited.
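
Peer selection by inverted peer frequency can be sketched as follows. The sketch assumes the PlanetP-style definition IPF_t = log(1 + N/N_t) and sums the IPFs of those query terms a peer actually offers; plain Python sets stand in for the disseminated peer abstracts (Bloom filters in PlanetP), and all names and sample data are illustrative.

```python
import math

def ipf(num_peers, peers_with_term):
    """Inverted peer frequency, assuming IPF_t = log(1 + N / N_t)."""
    return math.log(1 + num_peers / peers_with_term)

def rank_peers(peer_terms, query):
    """Rank peers by the summed IPF of the query terms found in their abstract."""
    n = len(peer_terms)
    # N_t: number of peers whose abstract contains term t.
    n_t = {t: sum(t in terms for terms in peer_terms.values()) for t in query}
    scores = {
        peer: sum(ipf(n, n_t[t]) for t in query if t in terms)
        for peer, terms in peer_terms.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Abstracts as known locally at the querying peer (illustrative data).
peer_terms = {
    "P1": {"basketball", "nba"},
    "P2": {"basketball"},
    "P3": {"politics"},
}
ranking = rank_peers(peer_terms, ["basketball", "nba"])
print(ranking)
```

Everything needed for the ranking is available locally once the abstracts have been gossiped, which is the point of replacing IDF by IPF.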

61 4.4 Providing Collection-Wide Information. Tests in Web information retrieval, e.g. [Viles & French, 95], show that CWI stays relatively stable over the whole collection of Web sites even with churn; only joining or leaving corpora on completely new topics results in significant change. Indexing CWI in a similar way as the routing information for queries is possible [Balke et al., 05]. In structured networks, CWI can be aggregated along the backbone, and indexed CWI can be distributed together with the query. New queries have to be flooded/routed twice: the first flooding collects and aggregates the CWI, the second one provides the correct CWI for local scoring. Non-expired indexed CWI can always be used when available.

62 4.4 Estimating the Document Overlap. Assessing the novelty of collections also supports retrieval quality. Pre-computed statistics about the expected result quality in each collection are often used to minimize the number of queried collections. Choosing collections with high overlap for querying will usually not improve result sets sufficiently to justify the access costs. Especially progressive searches, like top-k searches, profit from focusing on collections with small overlaps, since result merging procedures will ignore identical/similar results. The novelty of a collection can only be calculated with respect to some reference collection(s), e.g. those collection(s) already in a peer's local routing index.

63 4.4 Estimating the Document Overlap. A definition of the novelty of a peer p's collection C_p with respect to a reference collection C_ref is given in [Bender et al., 05] (formula not preserved in this transcript). Since the information about which exact documents a peer offers is usually not disseminated, the values have to be approximated from statistics. E.g. if abstracts in the form of Bloom filters are given, a combined Bloom filter b_p can be calculated by a bitwise logical AND of p's Bloom filters for all keywords in a query. Novelty can then be estimated by comparing b_p to the union of those Bloom filters b_i of the set of collections S that have already been retrieved. The degree of novelty is given by counting the locations where p's Bloom filter has differing set bits.
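
The Bloom-filter-based estimate can be sketched with plain integers as bit vectors. The sketch assumes that all filters have the same length and use the same hash functions, and it interprets "differing set bits" as bits set in the peer's combined filter but not in the union of the already retrieved filters; function names and bit patterns are illustrative.

```python
def combined_filter(keyword_filters):
    """Bitwise AND of a peer's per-keyword Bloom filters (ints as bit vectors)."""
    result = keyword_filters[0]
    for bloom in keyword_filters[1:]:
        result &= bloom
    return result

def novelty(peer_filter, retrieved_filters):
    """Count bits set in the peer's filter that are not set in the union
    of the filters of the collections retrieved so far."""
    reference = 0
    for bloom in retrieved_filters:
        reference |= bloom  # bitwise OR builds the union
    return bin(peer_filter & ~reference).count("1")

# Peer p's filters for two query keywords, and two already retrieved collections.
b_p = combined_filter([0b101101, 0b111100])
print(novelty(b_p, [0b001000, 0b000100]))
```

A peer whose estimated novelty is close to zero is unlikely to contribute documents that are not already in the result, so it can be skipped during collection selection.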

64 4.4 Prestructuring Collections with Taxonomies. Retrieval in P2P systems generally considers two basic paradigms: full-text-based queries and metadata-based queries. Integrating these paradigms can support retrieval effectiveness: structuring document collections, disambiguation of query terms. Peers often host collections of similar documents, e.g. a similar kind of information (newspaper articles, etc.) on similar topics. Scalability and the successful use of statistics are strongly improved if a common system of categories to classify the documents can be used. Since categories are more or less similar to each other, a taxonomy on categories allows for easily finding semantically similar documents.

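65 4.4 Prestructuring Collections with Taxonomies. Topical similarity within a taxonomy is defined by [Li et al., 03]: $sim(c_1, c_2) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}$, with l: length of the shortest path between categories c_1 and c_2; h: level of the common subsumer. Common values: α = 0.2, β = 0.6 (experimentally determined). E.g. newspaper articles, with the taxonomy News → {Business, Politics, Sports}, Politics → {Foreign, Domestic}, Sports → {Tennis}: sim(Politics, Sports): l = 2, h = 1, sim = 0.35; sim(Politics, Foreign): l = 1, h = 2, sim = 0.68.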
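
The similarity measure can be tried out on the newspaper taxonomy. The sketch assumes the form sim = e^(-αl) · tanh(βh) for the measure of Li et al., counts the root category as level 1, and encodes the taxonomy as a child-to-parent map; these choices are assumptions for illustration. Under them the two example pairs evaluate to roughly 0.36 and 0.68, close to the values quoted on the slide.

```python
import math

# Child -> parent map of the example taxonomy (News is the root).
PARENT = {
    "Business": "News", "Politics": "News", "Sports": "News",
    "Foreign": "Politics", "Domestic": "Politics", "Tennis": "Sports",
}

def path_to_root(category):
    """List of categories from `category` up to the root."""
    path = [category]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def similarity(c1, c2, alpha=0.2, beta=0.6):
    p1, p2 = path_to_root(c1), path_to_root(c2)
    subsumer = next(c for c in p1 if c in p2)      # lowest common subsumer
    l = p1.index(subsumer) + p2.index(subsumer)    # shortest path length
    h = len(path_to_root(subsumer))                # level, root = 1 (assumption)
    # tanh(x) equals (e^x - e^-x) / (e^x + e^-x)
    return math.exp(-alpha * l) * math.tanh(beta * h)

print(round(similarity("Politics", "Sports"), 2),
      round(similarity("Politics", "Foreign"), 2))
```

Deeper common subsumers and shorter paths both raise the similarity, so sibling categories low in the taxonomy count as closer than categories that only share the root.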

66 4.4 Combination of Topics and Keywords. Topics dominate keywords. Cooperative filter: relax on topics until k results have been found. Example: [<Politics>, "London Olympics"]. [Figure: topic similarity is evaluated on the taxonomy (Politics, Foreign, Domestic, Sports, Business, Tennis) and combined with text retrieval on the collection.]

67 4.4 Combination of Topics and Keywords. [Figure: the top-k routing example on the super-peer backbone SP1 to SP8, extended by topics. Each document carries a category label and a score, e.g. d21 [PD, 0.7] and d41 [S, 0.9]; entries in TopRes and Delivered have the form (peer, document, [category, score]), e.g. (P1, d11, [P, 0.8]) and (SP2, d21, [P, 0.7]). P2, P3, and P4 are labeled Politics, News, and Sports.]

69 4.5 Summary and Conclusion. In today's P2P systems, only exact-match keyword retrieval is prevalent (usually on metadata). Information retrieval in P2P scenarios is needed: individual, loosely coupled document collections need full-text retrieval and ranking techniques. Applications range from shared working environments, e.g. in project groups, to distributed digital libraries. Almost all IR systems use at least some global statistics; in P2P infrastructures, the dissemination of the necessary statistics becomes a performance bottleneck. The trade-off between cached, but sometimes stale statistics and new, but expensively updated statistics needs to be managed. How much staleness does a sufficient retrieval effectiveness allow?

70 4.5 Summary and Conclusion. Choosing the right collections for querying improves retrieval efficiency: those containing the most promising documents with possibly little overlap. Small worlds offer quick connections to semantically close collections. Query routing indexes can handle some network churn while providing results of sufficient quality: local indexes can be efficiently maintained, can exploit the Zipf-distributed content allocation and querying behavior, and need to contact only small numbers of peers. Supporting techniques like efficient CWI estimation/dissemination or taxonomies of document categories further improve retrieval.