Peer-to-Peer Data Management

1 Peer-to-Peer Data Management
Wolf-Tilo Balke, Sascha Tönnies
Institut für Informationssysteme, Technische Universität Braunschweig

2 4.1 Overview
1. Introduction
2. Content Searching in Peer-to-Peer Applications
   1. Problems in Peer-to-Peer Information Retrieval
   2. Related Work in Distributed Information Retrieval
3. Index Structures for Query Routing
   1. Distributed Hash Tables for Information Retrieval
   2. Routing Indexes for Information Retrieval
   3. Locality-Based Routing Indexes
4. Supporting Effective Information Retrieval
   1. Providing Collection-Wide Information
   2. Estimating the Document Overlap
   3. Prestructuring Collections with Taxonomies
5. Summary and Conclusion

3 4.1 What is IR?
- Information retrieval (IR) is the science of searching for documents, for information within documents, and for metadata about documents
- A user enters a query, i.e. an information need, into the system
- Several objects may match the query with different degrees of relevance

4 4.1 Representing Text
- How do we represent the complexities of language?
  - Computers don't understand documents or queries
- Simple, yet effective approach: bag of words
  - Treat all the words in a document as index terms for that document
  - Assign a weight to each term based on its importance
  - Disregard order, structure, meaning, etc. of the words

5 4.1 Representing Text
Example document:
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $.8 to $34.9, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
Bag of words (most frequent terms first; terms with equal counts grouped): said; McDonalds; fat; fries; new; company, french, nutrition; food, oil, percent, reduce, taste, Tuesday
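The bag-of-words idea can be sketched in a few lines. This is a minimal illustration, not the tooling used for the slide; the tokenizer (lowercase alphabetic runs) and the name `bag_of_words` are assumptions made for the example.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map each term to its count, ignoring case, word order and punctuation.

    Assumption: a token is any run of ASCII letters; real systems also apply
    stopword removal and stemming.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


snippet = "McDonald's slims down spuds. McDonald's will cut fat in its french fries."
bag = bag_of_words(snippet)
# Terms sorted by descending count, as on the slide
ranked_terms = [term for term, _ in bag.most_common()]
```

Order, structure and meaning are discarded on purpose: only the term counts survive.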

6 4.1 Retrieval
- Retrieving relevant information is hard!
  - Evolving, ambiguous user needs, context, etc.
  - Complexities of language
- To operationalize information retrieval, we must vastly simplify the picture
- Information retrieval is all (and only) about matching words in documents with words in queries
  - Obviously, not true
  - But it works pretty well!

7 4.1 Representing Documents as Vectors
Document 1: The quick brown fox jumped over the lazy dog's back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword list: for, is, of, the, to

Term    Document 1  Document 2
aid     0           1
all     0           1
back    1           0
brown   1           0
come    0           1
dog     1           0
fox     1           0
good    0           1
jump    1           0
lazy    1           0
men     0           1
now     0           1
over    1           0
party   0           1
quick   1           0
their   0           1
time    0           1

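8 4.1 Representing Text
- How to compare documents and queries?
[Figure: text-processing pipeline. A document (text + structure) passes through structure recognition; handling of accents, spacing, etc.; stopword removal; noun groups; stemming; and automatic or manual indexing. Depending on the stage, the document is represented by its structure, its full text, or a set of index terms.]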

9 4.1 Boolean Retrieval
- Weights assigned to terms are either 0 or 1
  - 0 represents absence: the term isn't in the document
  - 1 represents presence: the term is in the document
- Build queries by combining terms with the Boolean operators AND, OR, NOT
- The system returns all documents that satisfy the query

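10 4.1 Boolean View of a Document Set (= Collection)

Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
aid     0      0      0      1      0      0      0      1
all     0      1      0      1      0      1      0      0
back    1      0      1      0      0      0      1      0
brown   1      0      1      0      1      0      1      0
come    0      1      0      1      0      1      0      1
dog     0      0      1      0      1      0      0      0
fox     0      0      1      0      1      0      1      0
good    0      1      0      1      0      1      0      1
jump    0      0      1      0      0      0      0      0
lazy    1      0      1      0      1      0      1      0
men     0      1      0      1      0      0      0      1
now     0      1      0      0      0      1      0      1
over    1      0      1      0      1      0      1      1
party   0      0      0      0      0      1      0      1
quick   1      0      1      0      0      0      0      0
their   1      0      0      0      1      0      1      0
time    0      1      0      1      0      1      0      0

- Each column represents the view of a particular document: what terms are contained in this document?
- Each row represents the view of a particular term: what documents contain this term?
- To execute a query, pick out the rows corresponding to the query terms and then apply the logic table of the corresponding Boolean operator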

11 4.1 Sample Queries

Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
dog     0      0      1      0      1      0      0      0
fox     0      0      1      0      1      0      1      0

- dog AND fox: Doc 3, Doc 5
- dog OR fox: Doc 3, Doc 5, Doc 7
- dog NOT fox: empty
- fox NOT dog: Doc 7

Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
good    0      1      0      1      0      1      0      1
party   0      0      0      0      0      1      0      1
over    1      0      1      0      1      0      1      1

- good AND party: Doc 6, Doc 8
- good AND party NOT over: Doc 6
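The sample queries map directly onto set operations over the documents that contain each term. The sketch below assumes the document sets implied by the query results on the slide (the set for "over" is reconstructed and should be read as an assumption).

```python
# Term -> set of document numbers containing the term (assumed from the slide)
postings = {
    "dog": {3, 5},
    "fox": {3, 5, 7},
    "good": {2, 4, 6, 8},
    "party": {6, 8},
    "over": {1, 3, 5, 7, 8},
}

dog_and_fox = postings["dog"] & postings["fox"]        # AND is intersection
dog_or_fox = postings["dog"] | postings["fox"]         # OR is union
dog_not_fox = postings["dog"] - postings["fox"]        # NOT is set difference
fox_not_dog = postings["fox"] - postings["dog"]
good_and_party = postings["good"] & postings["party"]
good_and_party_not_over = good_and_party - postings["over"]
```

Every matching document is returned with equal standing; there is no notion of a partial or better match.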

12 4.1 The Perfect Query Paradox
- Every information need has a perfect set of documents
  - If not, there would be no sense in doing retrieval
- Every document set has a perfect query
  - AND every word in a document to get a query for it
  - Repeat for each document in the set
  - OR every document query to get the set query
- But can users realistically be expected to formulate this perfect query?
  - Boolean query formulation is hard!

13 4.1 Why Boolean Retrieval Fails
- Natural language is far more complex
- AND discovers nonexistent relationships
  - Terms in different sentences, paragraphs, ...
- Guessing terminology for OR is hard
  - good, nice, excellent, outstanding, awesome, ...
- Guessing terms to exclude is even harder!
  - Democratic party, party to a lawsuit, ...

14 4.1 Strengths and Weaknesses
- Strengths
  - Precise, if you have a clear idea of what you're looking for
  - Efficient for the computer
- Weaknesses
  - Users must learn Boolean logic
  - Boolean logic is insufficient to capture the richness of language
  - No control over the size of the result set: either too many documents or none
  - All documents in the result set are considered equally good
  - What about partial matches? Documents that don't quite match the query may also be useful

15 4.1 Ranked Retrieval
- Order documents by how likely they are to be relevant to the information need
  - Present hits one screen at a time
  - At any point, users can continue browsing through the ranked list or reformulate the query
- Attempts to retrieve relevant documents directly, not merely provide tools for doing so

16 4.1 Why Ranked Retrieval?
- Arranging documents by relevance is
  - Closer to how humans think: some documents are better than others
  - Closer to user behavior: users can decide when to stop reading
- Best (partial) match: documents need not have all query terms
  - Although documents with more query terms should be better

17 4.1 Similarity-based Retrieval?
- Let's replace relevance with similarity
  - Rank documents by their similarity with the query
- Treat the query as if it were a document
  - Create a query bag-of-words
  - Find its similarity to each document
  - Rank order the documents by similarity
- Surprisingly, this works pretty well!

18 4.1 Vector Space Model
[Figure: documents d1 to d5 drawn as vectors in a space spanned by the terms t1, t2 and t3, with angles θ and φ between vectors]
- Postulate: documents that are close together in vector space talk about the same things
- Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ closeness)
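A common way to make "closeness" concrete is the cosine of the angle between two term vectors. The slide does not name a specific measure, so the choice of cosine similarity here is an assumption; vectors are represented as hypothetical term-to-weight dictionaries.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


query = {"dog": 1.0, "fox": 1.0}
doc_a = {"dog": 2.0, "fox": 2.0}    # same direction as the query
doc_b = {"party": 1.0, "men": 1.0}  # shares no term with the query
```

Because only the angle matters, `doc_a` is a perfect match although its weights are twice as large as those of the query.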

19 4.1 How to Weight Terms?
- Idea: Hans Peter Luhn, 1958, IBM
- Here's the intuition:
  - Terms that appear often in a document should get high weights: the more often a document contains the term "dog", the more likely it is that the document is about dogs
  - Terms that appear in many documents should get low weights: words like "the", "a", "of" appear in (nearly) all documents
- How do we capture this mathematically?
  - Term frequency
  - Inverse document frequency

20 4.1 TFxIDF
- TFxIDF [Gerard Salton, 96]
- Term Frequency (TF)
  - How often a term appears in a document
  - Can be calculated locally
- Document Frequency (DF)
  - Number of documents which contain a specific term
  - Needs global (system-wide) knowledge
- Inverse Document Frequency (IDF)
  - Discriminator for the importance of a term, based on the number of documents in the whole collection that contain it
  - Needs global (system-wide) knowledge
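As a rough sketch of how the three quantities combine, the code below uses raw term counts for TF and log(N / DF) for IDF. Many variants exist; this particular weighting is an assumption, not necessarily the one used in the lecture.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists. TF is the raw count inside one document (local
    knowledge); DF and IDF need the whole collection (global knowledge).
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return weights


collection = [["the", "dog", "dog"], ["the", "cat"]]
weights = tf_idf(collection)
```

The term "the" occurs in every document, so its IDF and therefore its weight is zero, while "dog" is weighted up by its frequency within the first document.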

21 4.1 Working on Indices
[Table: term-document matrix with one row per term (quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party) and one column per document (Doc 1 to Doc 8)]
- The term-document matrix again holds the bag-of-words information about the collection

22 4.1 Small yet Fast?
- Can we make this data structure smaller, keeping in mind the need for fast retrieval?
- Observations:
  - The nature of the search problem requires us to quickly find which documents contain a term
  - The term-document matrix is very sparse
  - Some terms are more useful than others

23 4.1 Posting Lists
For each term of the term-document matrix (Doc 1 to Doc 8), keep only the numbers of the documents that contain it:

Term    Postings
aid     4, 8
all     2, 4, 6
back    1, 3, 7
brown   1, 3, 5, 7
come    2, 4, 6, 8
dog     3, 5
fox     3, 5, 7
good    2, 4, 6, 8
jump    3
lazy    1, 3, 5, 7
men     2, 4, 8
now     2, 6, 8
over    1, 3, 5, 7, 8
party   6, 8
quick   1, 3
their   1, 5, 7
time    2, 4, 6

24 4.1 Inverted Document Index
(same Term / Postings table as on slide 23, shown without the term-document matrix)
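Building such an index from a collection takes one pass over the documents. The following is a minimal sketch with hypothetical names (`build_inverted_index`, `tiny_collection`); postings are kept sorted by document number so that lists can later be intersected efficiently.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: list of terms}. Returns {term: sorted list of doc_ids}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):               # ascending ids keep postings sorted
        for term in sorted(set(docs[doc_id])):
            index[term].append(doc_id)
    return dict(index)


# Toy collection, not the eight documents from the slide
tiny_collection = {
    1: ["quick", "brown", "fox"],
    2: ["good", "men"],
    3: ["brown", "dog", "fox"],
}
index = build_inverted_index(tiny_collection)
```

Only the non-zero cells of the sparse term-document matrix are stored, which is what makes the structure small.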

25 4.1 What Goes in the Postings?
- Boolean retrieval
  - Just the document number
- Ranked retrieval
  - Document number and term weight (tf.idf, ...)
- Proximity operators
  - Word offsets for each occurrence of the term

27 4.2 Information Retrieval in P2P Systems
- Information retrieval deals with complex documents
  - Metadata can only capture some aspects of a document, but cannot anticipate all semantic searches
  - E.g. a sports-related newspaper article, but no names, locations, etc.
  - Support for full-text searches is needed
- Find the best-matching document from the best-connected peer
  - Unlike in file sharing, the emphasis is on document quality
  - If there are multiple sources offering documents of similar quality, choose the best peer according to connection, etc.

28 4.2 Challenges in P2P IR
- Efficient query evaluation scheme
  - A central inverted index of documents is expensive to maintain
  - How to disseminate a peer's query? Simple flooding of all queries is not scalable if the best documents have to be found (not just some match)
- Dealing with network churn
  - A peer can always alter the set of documents offered, or significantly change individual documents
  - Peers may join and leave the network, i.e. whole document collections may disappear or be added
- Integration of collection-wide information
  - Peers are not able to calculate IR-style scores from local knowledge alone, but need some knowledge of the (virtual) merged collection
  - Constant dissemination of collection-wide information needs a lot of bandwidth

29 4.2 Example: Problem of Collection-wide Information
- Example: different news collections, query on keyword "basketball"
- General news collection
  - Many articles, only few about basketball, therefore the document frequency is low and the IDF is high
  - Keyword discriminates well between articles
- NBA news collection
  - Few articles, almost all about basketball, therefore the document frequency is high and the IDF is low
  - Keyword hardly discriminates between articles
- Merged collection: IDF medium
- But how do independent collections (peers) exchange their information?

30 4.2 Example: Problem of Collection-wide Information
[Figure: a querying peer sends the query "A and B" to Peer 1 and Peer 2. Each peer scores its objects with TF and IDF values computed from its local collection only (local scoring) and reports a different top object. Under global scoring over the merged collection, all objects score identically.]

31 4.2 Distributed IR
- Distributed information retrieval techniques grew increasingly important for searching Web sources
- Abstracts of information sources
  - To support distributed retrieval, sources have to register abstracts or keyword sets
  - Abstracts can either be kept in a central repository or distributed by gossiping algorithms, e.g. PlanetP [Cuenca-Acuna et al., 03]
- Collection selection
  - Without a central index, a sophisticated way of choosing the most promising collections for querying is needed

32 4.2 Distributed IR
- Such abstracts can be compactly represented by Bloom filters, i.e. bit vectors that allow membership queries
  - Each term is hashed with n different functions, and the position in the bit vector for each hash value is set to 1
  - Allows for false positives, but no false negatives
  - In counting Bloom filters, objects can also be removed
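The mechanism can be sketched as follows. The filter size, the number of hash functions and the way the hash functions are derived (salting SHA-256 with the function index) are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Bit vector supporting add and membership test; no removal."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, term):
        # Derive num_hashes different hash values by salting with the index
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos] = 1

    def __contains__(self, term):
        # True may be a false positive; False is always correct
        return all(self.bits[pos] for pos in self._positions(term))


abstract = BloomFilter()
for keyword in ["basketball", "nba", "playoffs"]:
    abstract.add(keyword)
```

A peer can publish `abstract.bits` instead of its full keyword list; a counting variant would store small counters instead of single bits so that terms can be removed again.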

33 4.2 Distributed IR
- Benefit estimators for collection selection use aggregated statistics about individual collections, e.g. the CORI measure [Callan et al., 95]
- CORI calculates a collection score s_i for collection i regarding query q from the following quantities:
  - n: the number of collections
  - cdf: the collection document frequency
  - cdf_max: the maximum cdf
  - cf_t: the collection frequency of term t
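The formulas themselves are not preserved in this transcription. One formulation of CORI found in the P2P collection-selection literature uses exactly these symbols and is sketched below; the smoothing constants 0.5 and 1.0 and the tuning parameters α and β are assumptions taken from that literature, not from the slide.

```latex
s_i = \sum_{t \in q} \frac{s_{i,t}}{|q|},
\qquad
s_{i,t} = \alpha + (1 - \alpha) \cdot T_{i,t} \cdot I_{i,t}

T_{i,t} = \beta + (1 - \beta) \cdot
          \frac{\log\left(cdf_{i,t} + 0.5\right)}{\log\left(cdf_i^{max} + 1.0\right)},
\qquad
I_{i,t} = \frac{\log\left(\frac{n + 0.5}{cf_t}\right)}{\log\left(n + 1\right)}
```

Here T_{i,t} plays the role of a term frequency at collection level, and I_{i,t} that of an inverse document frequency over collections.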

35 4.3 Index Structures for Query Routing
- Traditional index structures cannot be readily employed in P2P systems
  - High degree of distribution
  - High degree of volatility (churn)
  - High degree of index maintenance
- Distributed paradigms are needed to route queries to appropriate peers
  - Simple flooding does not scale
  - Distributed hash table lookup
  - Using indexed routing information
  - Using shortcut overlays

36 4.3 Distributed Hash Tables for IR
- Distributed hash tables
  - Route queries to appropriate peers with a number of hops logarithmic in network size
  - No peer needs to maintain more than a logarithmic amount of routing information
- But
  - Exact match queries only
  - All new content has to be published if peers join or change
  - Old content has to be unpublished if peers leave
  - Documents added or removed contain many different terms to be published or unpublished, so usually many index peers have to be addressed
  - A conjunction of query terms needs to access many peers, but there is still no guarantee that a single document containing the conjunction exists

37 4.3 Distributed Hash Tables for IR
- Improvement: hybrid P2P infrastructures [Loo et al., 04]
- DHTs are least efficient when highly replicated items are requested
  - Experiments show worse behavior than flooding, degrading with churn
- Querying and content allocation follow a Zipf distribution
  - Only few highly replicated and often queried items
  - "People are looking for hay, not for needles" (S. Shenker)
- Hybrid P2P infrastructures use DHTs only for the less replicated and rarely queried items; all other queries are flooded
  - Still, DHTs have to be maintained for the majority of query terms
[Figure: query frequency distribution; x-axis: query, y-axis: occurrence frequency in %]

38 4.3 Routing Indexes for IR
- Routing indexes are local collections of (key, peer) pairs
  - The key is either a keyword or a query
  - The peer is the address of a peer that either offers relevant results or routes the query to other peers with relevant results
- In contrast to flooding, only interesting directions are queried
  - A distinction is often made between links in the default network (directions of content providers) and an overlay structure of direct links to content providers ("shortcuts")
  - First introduced by [Crespo & Garcia-Molina, 02] to choose the best neighbors in the default network for query forwarding
- Index maintenance is local, and index coverage is usually high due to the Zipf distribution of requests
- Correctness of the index is influenced by network volatility (churn)

39 4.3 Routing Indexes for IR
- Routing index policies in the face of network churn
- With restricted index sizes, new entries are collected and always stored; if the maximum size is reached, some stale information is replaced
  - A simple strategy always replaces the currently oldest index entries
  - The least recently used (LRU) strategy assigns higher usefulness to entries that have been successfully used recently
  - The optimal index size is a problematic parameter
- Indexes with unrestricted size have to combat network churn differently
  - "Time to live" assigns an expiry time to each new index entry
  - "Forgetting factors" can periodically weigh down the reliability of link information
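Two of these policies, LRU replacement for a size-restricted index and a time-to-live per entry, can be combined in one small structure. The sketch below is an illustration only; the class name `RoutingIndex`, its parameters and the use of plain numbers as timestamps are assumptions.

```python
from collections import OrderedDict


class RoutingIndex:
    """Size-restricted (key -> peer) index with LRU replacement and expiry."""

    def __init__(self, max_size, ttl):
        self.max_size = max_size
        self.ttl = ttl
        self.entries = OrderedDict()  # key -> (peer, expiry time), LRU first

    def put(self, key, peer, now):
        self.entries.pop(key, None)
        self.entries[key] = (peer, now + self.ttl)
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # evict least recently used entry

    def lookup(self, key, now):
        entry = self.entries.get(key)
        if entry is None:
            return None
        peer, expiry = entry
        if now >= expiry:                     # stale entry: forget it
            del self.entries[key]
            return None
        self.entries.move_to_end(key)         # mark as recently used
        return peer


idx = RoutingIndex(max_size=2, ttl=10)
idx.put("basketball", "P1", now=0)
idx.put("politics", "P2", now=1)
hit = idx.lookup("basketball", now=2)   # refreshes the LRU position
idx.put("tennis", "P3", now=3)          # evicts "politics", the LRU entry
```

A forgetting factor would replace the hard expiry by a reliability score that is multiplied by a constant below 1 at regular intervals.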

40 4.3 An Algorithm for Correct Query Routing
- Goal: progressive distributed top-k ranking of documents
- Putting techniques together to design an efficient top-k algorithm
  - Minimal number of object transfers
  - Optimal number of object accesses
- Features of the P2P-based approach
  - Optimized query routing
  - No global index
  - Query-driven term indexing

41 4.3 Bird's View
1. Distribute the query through the network (routing)
2. Every peer scores documents locally (ranking)
3. Hierarchical construction of the final result (merging)
4. Optimized query routing (index)

42 4.3 Building Blocks
- Structured network
- Local ranking
- Result merging
- Query-driven index

43 4.3 Network Structure
- Observation: peers strongly differ in availability, bandwidth, computing power, ...
- Hierarchical network structure with super-peers
  - Query routing
  - Result merging
  - Indexes

44 4.3 Network Topology
- Super-peers form a hypercube (HyperCuP protocol)
- Resilient against leaving peers
- Broadcast with (n-1) messages in log_2(n) hops
- Minimal spanning tree
[Figure: super-peers SP1 to SP8 arranged as a hypercube, next to the spanning tree used for broadcasting from one super-peer]
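The message and hop counts can be checked with a small simulation. It assumes a complete hypercube whose nodes are numbered 0 to n-1 and a forwarding rule in the spirit of HyperCuP: a node that received the message over dimension d forwards it only over higher dimensions. Function and variable names are hypothetical.

```python
def hypercube_broadcast(dimensions, origin=0):
    """Simulate a broadcast; return (number of messages, maximum hop count)."""
    messages = 0
    max_hops = 0
    reached = {origin}
    # Work items: (node, lowest dimension it may forward on, hops so far)
    pending = [(origin, 0, 0)]
    while pending:
        node, first_dim, hops = pending.pop()
        max_hops = max(max_hops, hops)
        for dim in range(first_dim, dimensions):
            neighbor = node ^ (1 << dim)   # flip one bit: neighbor along dim
            messages += 1
            reached.add(neighbor)
            pending.append((neighbor, dim + 1, hops + 1))
    assert len(reached) == 2 ** dimensions  # every super-peer got the query
    return messages, max_hops


msgs, hops = hypercube_broadcast(3)  # 8 super-peers, as in the figure
```

Each super-peer receives the message exactly once, so the forwarding paths form a spanning tree of the hypercube.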

45 4.3 Local Ranking
- The super-peer asks for local rankings of the peers' collections
- Top-k results (plus metric-dependent information) are returned to the super-peer (SP)
- Arbitrary similarity measures can be used
  - TFxIDF
  - Similarities in taxonomies

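46 4.3 Result Merging
- Results are merged at the super-peers
- Unique scoring function
- Maximum of k messages per SP-SP edge
[Figure: peers attached to the super-peers SP A to SP D return their results for a query Q; the results are merged along the super-peer backbone on the way back to the querying peer]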

47 4.3 Indexing
- Super-peers keep indexes
  - IDFs (collection-wide information): IDF values for query terms
  - Top peers (routing): list of peers that have already contributed to a previous top-k result
  - Others are possible, e.g. for taxonomies
- Index entries are query-driven

48 4.3 Routing Indexes Example: Top-k Query Routing
- Example for routing indexes in P2P networks with a super-peer backbone holding the routing indexes
- Progressive P2P top-k algorithm [Balke et al., 04]
  - If query q is indexed, distribute the query and collect results
  - Otherwise flood the query and
    - Compute ranks at local peers
    - Merge results at super-peers
    - Use statistics for a new entry in the routing index (routing information, collection-wide information, etc.)
- Data structures at super-peers
  - RequestResults: peers which are queried for results (index information)
  - BestPeers: peers which delivered recent best results
  - TopRes: current top results
  - Delivered: delivered results
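The merge step at a single super-peer can be sketched as below. This is a simplified, non-progressive illustration and not the algorithm of Balke et al.; the function name, the data layout and the example scores are assumptions.

```python
import heapq


def merge_top_k(local_rankings, k):
    """Merge per-peer rankings into one global top-k list.

    local_rankings: {peer: [(doc, score), ...]}
    Returns (top_results, best_peers), where top_results is a list of
    (peer, doc, score) sorted by descending score.
    """
    candidates = [
        (score, peer, doc)
        for peer, ranking in local_rankings.items()
        for doc, score in ranking
    ]
    best = heapq.nlargest(k, candidates)
    top_results = [(peer, doc, score) for score, peer, doc in best]
    best_peers = {peer for peer, _, _ in top_results}  # candidates for the index
    return top_results, best_peers


# Hypothetical local rankings returned to a super-peer
rankings = {
    "P1": [("d11", 0.8), ("d12", 0.3)],
    "P2": [("d21", 0.7), ("d22", 0.4)],
    "P3": [("d31", 0.6)],
}
top_res, best_peers = merge_top_k(rankings, k=2)
```

Storing `best_peers` under the query as index key is what lets a repeated query be routed to P1 and P2 only, instead of being flooded.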

49 4.3 Routing Indexes Example: Top-k Query Routing
- Query q: find the top 2 documents
- The routing index at SP4 is empty
- State at SP4: RequestResults {SP8, P2, P3, P4}, BestPeers {}, TopRes {}, Delivered {}
[Figure: super-peer backbone SP1 to SP8 with attached peers P1 to P4 and their locally scored documents]

50 4.3 Routing Indexes Example: Top-k Query Routing
[Figure, animated: SP4 collects the local rankings of P2, P3 and P4 and merges them into TopRes; the best result, a document from P2 with score 0.7, moves to Delivered, and P2 is recorded in BestPeers]

51 4.3 Routing Indexes Example: Top-k Query Routing
[Figure, animated: the super-peer of P1 merges the result delivered via SP2 (score 0.7) with the best document of P1 (score 0.8) and records P1 in BestPeers]

52 4.3 Routing Indexes Example: Top-k Query Routing
[Figure, animated: the final top-2 result for q consists of the document from P1 with score 0.8 and the document delivered via SP2 with score 0.7; the remaining candidates stay in TopRes]

53 4.3 Routing Indexes Example: Top-k Query Routing
[Figure: resulting routing index entries for q at the super-peers along the result path; when q is repeated, each super-peer forwards it only to the super-peers and peers listed in its RequestResults entry]

54 4.3 Query Routing
- At the first appearance of a query, peers only send out their input for the IDF computation
- Super-peers aggregate the IDFs and build the index
- Whenever a query is repeated, super-peers send recent IDF values together with the query terms
- Peers use the IDFs for local score computation
- Disadvantage: at its first occurrence, a query has to be sent twice
  - The Zipf distribution of queries minimizes the number of queries concerned
- Advantages:
  - No effort for maintaining a global IDF index
  - Values for frequently occurring queries are kept up to date

55 4.3 Query Routing and Network Churn
- Query index strategy
  - Send queries only to peers that have recently contributed to answering a query
- Problem: the network's and each peer's volatility
  - Solution 1: send queries also to a randomly selected set of peers
  - Solution 2: "best before" timestamp
[Figure: super-peer backbone SP1 to SP8 in which several links are crossed out]

56 4.3 Locality-Based Routing Indexes
- Refinement of routing indexes by social metaphors
- Retrieval process similar to real life
  - Every person has only limited knowledge of the environment
  - Who knows about a certain topic?
  - Who might know other people who know about the topic?
  - Try to build (short) chains of acquaintances that will bring you close to the requested information
- Aims at building social networks as overlays
  - Peers semantically connected by certain topics form small-world networks, e.g. [Milgram, 67; Kleinberg, 00]
- Paradigm of interest-based locality
  - If a peer has relevant content for a user's query, it very often also has other content that this user might be interested in

57 4.3 Locality-Based Routing Indexes
- For information retrieval in P2P networks, this enables new routing in interest-based overlay structures
  - Route queries to peers with documents matching semantically close queries
- Traces on practical data collections show that
  - Peers get well-connected
  - The overlay graph is highly clustered, with a small minimum distance between any two nodes
- Overhearing communications routed through a peer can be used to enhance its local index
- Randomly sending queries also to peers from the default network helps to extend knowledge and can remedy the effect of network churn

59 4.4 Supporting Effective P2P IR
- P2P information retrieval has to deal with the trade-off between
  - Efficient local maintenance of statistics and index information, where information can be stale (incorrect)
  - Expensive global maintenance of statistics and index information, where information is always accurate
- What is needed is just the right level of dissemination of statistics to guarantee sufficiently effective retrieval
- Some techniques help to support efficient retrieval
  - Providing adequate collection-wide information
  - Estimating the document overlap between peers
  - Pre-structuring collections by categories or taxonomies

60 4.4 Providing Collection-Wide Information
- Collection-wide information (CWI), e.g. IDFs, is important for retrieval quality, but cannot be calculated locally
- Some systems, e.g. PlanetP, do not use CWI directly, but circumvent the problem by using an inverse peer frequency (IPF), computed from N, the number of all peers, and N_t, the number of peers offering documents on term t
  - If summarizations of peers (abstracts) are eagerly disseminated, each peer can locally determine the values of N and N_t
  - The relevance of a peer for a multi-keyword query is simply the sum of the IPFs of the individual terms
- Practical tests show an average overlap of about 7% between result sets retrieved with IDFs and those retrieved with IPFs
- Using IPFs, however, scalability is still limited
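A small sketch of IPF-based peer ranking follows. It assumes the form IPF_t = log(1 + N / N_t), which is how PlanetP defines the measure to the best of my knowledge, and it assumes that a peer's abstract is available as a plain set of terms rather than a Bloom filter; all names are hypothetical.

```python
import math


def ipf(num_peers, peers_with_term):
    """Inverse peer frequency, assumed form: log(1 + N / N_t)."""
    return math.log(1 + num_peers / peers_with_term)


def rank_peers(query_terms, peer_abstracts):
    """Score each peer by the summed IPF of the query terms it offers."""
    n = len(peer_abstracts)
    n_t = {
        t: sum(1 for terms in peer_abstracts.values() if t in terms)
        for t in query_terms
    }
    scores = {
        peer: sum(ipf(n, n_t[t]) for t in query_terms if t in terms)
        for peer, terms in peer_abstracts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


abstracts = {
    "A": {"basketball", "nba"},
    "B": {"basketball"},
    "C": {"politics"},
}
ranking = rank_peers(["basketball", "nba"], abstracts)
```

Both N and N_t are derived purely from the disseminated abstracts, so no peer needs document-level statistics of any other peer.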

61 4.4 Providing Collection-Wide Information
- Tests in Web information retrieval, e.g. [Viles & French, 95], show that CWI stays relatively stable over the whole collection of Web sites even with churn
  - Only joining or leaving corpora on completely new topics result in significant change
- Indexing CWI in a similar way as the routing information for queries is possible [Balke et al., 05]
  - In structured networks, CWI can be aggregated along the backbone and indexed
  - CWI can be distributed together with the query
- New queries have to be flooded or routed twice
  - The first flooding collects and aggregates CWI
  - The second one provides the correct CWI for local scorings
  - Non-expired indexed CWI can always be used when available

62 4.4 Estimating the Document Overlap
- Assessing the novelty of collections also supports retrieval quality
  - Pre-computed statistics about the expected result quality in each collection are often used to minimize the number of queried collections
  - Choosing collections with high overlap for querying will usually not improve result sets sufficiently to justify the access costs
  - Especially progressive searches, like top-k searches, profit from focusing on collections with small overlaps, since result merging procedures will ignore identical or similar results
- The novelty of a collection can only be calculated with respect to some reference collection(s), e.g. those collection(s) already in a peer's local routing index

63 4.4 Estimating the Document Overlap
- A definition of the novelty of a peer p's collection C_p with respect to a reference collection C_ref is given in [Bender et al., 05]
- Since the information about which exact documents a peer offers is usually not disseminated, the values have to be approximated from statistics
  - E.g. if abstracts in the form of Bloom filters are given, a combined Bloom filter b_p can be calculated by bitwise logical AND of p's Bloom filters for all keywords in a query
  - Novelty can then be estimated by comparing b_p to the union of the Bloom filters b_i of the set S of collections that have already been retrieved
  - The degree of novelty is given by counting the positions where only p's Bloom filter has a set bit
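The estimate described above can be sketched with Bloom filters represented as equal-length lists of 0/1 bits. The helper names and the tiny example vectors are assumptions; the sketch only counts bits and does not convert the count into a number of documents.

```python
def combine_and(filters):
    """Bitwise AND of a peer's per-keyword Bloom filters (conjunctive query)."""
    return [int(all(bits)) for bits in zip(*filters)]


def union_or(filters):
    """Bitwise OR of the Bloom filters of collections already retrieved."""
    return [int(any(bits)) for bits in zip(*filters)]


def novelty(candidate, reference):
    """Count positions set in the candidate filter but not in the reference."""
    return sum(1 for c, r in zip(candidate, reference) if c == 1 and r == 0)


# Hypothetical 8-bit filters of peer p for the two query keywords
b_p = combine_and([
    [1, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 1, 0, 0],
])
# Filters of the collections in S that were already queried
b_ref = union_or([
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
])
estimated_novelty = novelty(b_p, b_ref)
```

A peer whose combined filter adds no new set bits is unlikely to contribute unseen documents and can be skipped during collection selection.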

64 4.4 Prestructuring Collections with Taxonomies
- Retrieval in P2P systems generally considers two basic paradigms
  - Fulltext-based queries
  - Metadata-based queries
- Integrating these paradigms can support retrieval effectiveness
  - Structuring document collections
  - Disambiguation of query terms
- Peers often host collections of similar documents, e.g. a similar kind of information (newspaper articles, etc.) on similar topics
- Scalability and the successful use of statistics are strongly improved if a common system of categories can be used to classify the documents
- Since categories are more or less similar to each other, a taxonomy on categories allows for easily finding semantically similar documents

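65 4.4 Prestructuring Collections with Taxonomies
- Topical similarity within a taxonomy is defined by [Li et al., 03]:

  $$\mathrm{sim}(c_1, c_2) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}$$

  - l: length of the shortest path between categories c_1 and c_2
  - h: level of the common subsumer
  - Common values: α = 0.2, β = 0.6 (experimentally determined)
- E.g. newspaper articles
  - Taxonomy: News, with subcategories Business, Politics (Foreign, Domestic) and Sports (Tennis)
  - sim(Politics, Sports): l = 2, h = 1, sim = 0.35
  - sim(Politics, Foreign): l = 1, h = 2, sim = 0.68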
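The similarity measure is easy to evaluate numerically. The sketch assumes the form published by Li et al., sim = exp(-α·l) · tanh(β·h), with the common values α = 0.2 and β = 0.6; the function name is hypothetical.

```python
import math


def topic_similarity(path_length, subsumer_level, alpha=0.2, beta=0.6):
    """Similarity of two categories from path length l and subsumer level h.

    tanh(beta * h) equals (e^{beta h} - e^{-beta h}) / (e^{beta h} + e^{-beta h}).
    """
    return math.exp(-alpha * path_length) * math.tanh(beta * subsumer_level)


# Politics vs. Sports: two edges apart, common subsumer News on level 1
sim_politics_sports = topic_similarity(path_length=2, subsumer_level=1)
# Politics vs. Foreign: one edge apart, common subsumer Politics on level 2
sim_politics_foreign = topic_similarity(path_length=1, subsumer_level=2)
```

Under these assumptions the two values come out at roughly 0.36 and 0.68, in line with the figures on the slide: a short path and a deep (more specific) common subsumer both increase similarity.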

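66 4.4 Combination of Topics and Keywords
- Topics dominate keywords
- Cooperative filter: relax on topics until k results have been found
- Example: [<Politics>, "London Olympics"]
[Figure: the query topic Politics is relaxed by topic similarity within the taxonomy (Politics, Foreign, Domestic, Sports, Business, Tennis) and matched against a text collection whose documents are labeled with topics such as Politics, Foreign and Sports]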

67 4.4 Combination of Topics and Keywords
[Figure, animated: the top-k query routing example extended with topics. The peers P2, P3 and P4 hold collections on Politics, News and Sports; every document carries a topic label (P, D, S) in addition to its score. The entries in TopRes and Delivered at the super-peers have the form (peer, document, [topic, score]).]

69 4.5 Summary and Conclusion
- In today's P2P systems, only exact-match keyword retrieval is prevalent (usually on metadata)
- Information retrieval in P2P scenarios is needed
  - Individual, loosely coupled document collections need fulltext retrieval and ranking techniques
  - Applications range from shared working environments, e.g. in project groups, to distributed digital libraries
- Almost all IR systems use at least some global statistics; in P2P infrastructures, the dissemination of the necessary statistics becomes a performance bottleneck
  - The trade-off between cached, but sometimes stale statistics and fresh, but expensively updated statistics needs to be managed
  - How much staleness does sufficient retrieval effectiveness allow?

70 4.5 Summary and Conclusion
- Choosing the right collections for querying improves retrieval efficiency
  - Collections containing the most promising documents with as little overlap as possible
  - Small worlds offer quick connections to semantically close collections
- Query routing indexes can handle some network churn while providing results of sufficient quality
  - Local indexes can be efficiently maintained
  - Can exploit Zipf-distributed content allocation and querying behavior
  - Need to contact only small numbers of peers
- Supporting techniques like efficient CWI estimation and dissemination or taxonomies of document categories further improve retrieval


More information

Topic Communities in P2P Networks

Topic Communities in P2P Networks Topic Communities in P2P Networks Joint work with A. Löser (IBM), C. Tempich (AIFB) SNA@ESWC 2006 Budva, Montenegro, June 12, 2006 Two opposite challenges when considering Social Networks Analysis Nodes/Agents

More information

1 o Semestre 2007/2008

1 o Semestre 2007/2008 Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction

More information

Efficient Search in Gnutella-like Small-World Peerto-Peer

Efficient Search in Gnutella-like Small-World Peerto-Peer Efficient Search in Gnutella-like Small-World Peerto-Peer Systems * Dongsheng Li, Xicheng Lu, Yijie Wang, Nong Xiao School of Computer, National University of Defense Technology, 410073 Changsha, China

More information

Search Engines. Stephen Shaw <stesh@netsoc.tcd.ie> 18th of February, 2014. Netsoc

Search Engines. Stephen Shaw <stesh@netsoc.tcd.ie> 18th of February, 2014. Netsoc Search Engines Stephen Shaw Netsoc 18th of February, 2014 Me M.Sc. Artificial Intelligence, University of Edinburgh Would recommend B.A. (Mod.) Computer Science, Linguistics, French,

More information

Medical Information-Retrieval Systems. Dong Peng Medical Informatics Group

Medical Information-Retrieval Systems. Dong Peng Medical Informatics Group Medical Information-Retrieval Systems Dong Peng Medical Informatics Group Outline Evolution of medical Information-Retrieval (IR). The information retrieval process. The trend of medical information retrieval

More information

Information Retrieval Elasticsearch

Information Retrieval Elasticsearch Information Retrieval Elasticsearch IR Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches

More information

Peer-to-Peer Data Management

Peer-to-Peer Data Management Peer-to-Peer Data Management Wolf-Tilo Balke Sascha Tönnies Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de Overview Why Peer-to-Peer Databases? Federation

More information

Using Peer to Peer Dynamic Querying in Grid Information Services

Using Peer to Peer Dynamic Querying in Grid Information Services Using Peer to Peer Dynamic Querying in Grid Information Services Domenico Talia and Paolo Trunfio DEIS University of Calabria HPC 2008 July 2, 2008 Cetraro, Italy Using P2P for Large scale Grid Information

More information

A Collaborative and Semantic Data Management Framework for Ubiquitous Computing Environment

A Collaborative and Semantic Data Management Framework for Ubiquitous Computing Environment A Collaborative and Semantic Data Management Framework for Ubiquitous Computing Environment Weisong Chen, Cho-Li Wang, and Francis C.M. Lau Department of Computer Science, The University of Hong Kong {wschen,

More information

The Case for a Hybrid P2P Search Infrastructure

The Case for a Hybrid P2P Search Infrastructure The Case for a Hybrid P2P Search Infrastructure Boon Thau Loo Ryan Huebsch Ion Stoica Joseph M. Hellerstein University of California at Berkeley Intel Research Berkeley boonloo, huebsch, istoica, jmh @cs.berkeley.edu

More information

Simulating a File-Sharing P2P Network

Simulating a File-Sharing P2P Network Simulating a File-Sharing P2P Network Mario T. Schlosser, Tyson E. Condie, and Sepandar D. Kamvar Department of Computer Science Stanford University, Stanford, CA 94305, USA Abstract. Assessing the performance

More information

KEYWORD SEARCH IN RELATIONAL DATABASES

KEYWORD SEARCH IN RELATIONAL DATABASES KEYWORD SEARCH IN RELATIONAL DATABASES N.Divya Bharathi 1 1 PG Scholar, Department of Computer Science and Engineering, ABSTRACT Adhiyamaan College of Engineering, Hosur, (India). Data mining refers to

More information

Information Searching Methods In P2P file-sharing systems

Information Searching Methods In P2P file-sharing systems Information Searching Methods In P2P file-sharing systems Nuno Alberto Ferreira Lopes PhD student (nuno.lopes () di.uminho.pt) Grupo de Sistemas Distribuídos Departamento de Informática Universidade do

More information

Enhancing P2P File-Sharing with an Internet-Scale Query Processor

Enhancing P2P File-Sharing with an Internet-Scale Query Processor Enhancing P2P File-Sharing with an Internet-Scale Query Processor Boon Thau Loo Joseph M. Hellerstein Ryan Huebsch Scott Shenker Ion Stoica UC Berkeley, Intel Research Berkeley and International Computer

More information

D1.1 Service Discovery system: Load balancing mechanisms

D1.1 Service Discovery system: Load balancing mechanisms D1.1 Service Discovery system: Load balancing mechanisms VERSION 1.0 DATE 2011 EDITORIAL MANAGER Eddy Caron AUTHORS STAFF Eddy Caron, Cédric Tedeschi Copyright ANR SPADES. 08-ANR-SEGI-025. Contents Introduction

More information

Content Delivery Network (CDN) and P2P Model

Content Delivery Network (CDN) and P2P Model A multi-agent algorithm to improve content management in CDN networks Agostino Forestiero, forestiero@icar.cnr.it Carlo Mastroianni, mastroianni@icar.cnr.it ICAR-CNR Institute for High Performance Computing

More information

Introduction to Information Retrieval http://informationretrieval.org

Introduction to Information Retrieval http://informationretrieval.org Introduction to Information Retrieval http://informationretrieval.org IIR 6&7: Vector Space Model Hinrich Schütze Institute for Natural Language Processing, University of Stuttgart 2011-08-29 Schütze:

More information

RESEARCH ISSUES IN PEER-TO-PEER DATA MANAGEMENT

RESEARCH ISSUES IN PEER-TO-PEER DATA MANAGEMENT RESEARCH ISSUES IN PEER-TO-PEER DATA MANAGEMENT Bilkent University 1 OUTLINE P2P computing systems Representative P2P systems P2P data management Incentive mechanisms Concluding remarks Bilkent University

More information

SwanLink: Mobile P2P Environment for Graphical Content Management System

SwanLink: Mobile P2P Environment for Graphical Content Management System SwanLink: Mobile P2P Environment for Graphical Content Management System Popovic, Jovan; Bosnjakovic, Andrija; Minic, Predrag; Korolija, Nenad; and Milutinovic, Veljko Abstract This document describes

More information

Approximate Object Location and Spam Filtering on Peer-to-Peer Systems

Approximate Object Location and Spam Filtering on Peer-to-Peer Systems Approximate Object Location and Spam Filtering on Peer-to-Peer Systems Feng Zhou, Li Zhuang, Ben Y. Zhao, Ling Huang, Anthony D. Joseph and John D. Kubiatowicz University of California, Berkeley The Problem

More information

Homework 2. Page 154: Exercise 8.10. Page 145: Exercise 8.3 Page 150: Exercise 8.9

Homework 2. Page 154: Exercise 8.10. Page 145: Exercise 8.3 Page 150: Exercise 8.9 Homework 2 Page 110: Exercise 6.10; Exercise 6.12 Page 116: Exercise 6.15; Exercise 6.17 Page 121: Exercise 6.19 Page 122: Exercise 6.20; Exercise 6.23; Exercise 6.24 Page 131: Exercise 7.3; Exercise 7.5;

More information

CS5412: TIER 2 OVERLAYS

CS5412: TIER 2 OVERLAYS 1 CS5412: TIER 2 OVERLAYS Lecture VI Ken Birman Recap 2 A week ago we discussed RON and Chord: typical examples of P2P network tools popular in the cloud Then we shifted attention and peeked into the data

More information

System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks

System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks OnurSoft Onur Tolga Şehitoğlu November 10, 2012 v1.0 Contents 1 Introduction 3 1.1 Purpose..............................

More information

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02) Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

More information

Performance Tuning for the Teradata Database

Performance Tuning for the Teradata Database Performance Tuning for the Teradata Database Matthew W Froemsdorf Teradata Partner Engineering and Technical Consulting - i - Document Changes Rev. Date Section Comment 1.0 2010-10-26 All Initial document

More information

Adapting Distributed Hash Tables for Mobile Ad Hoc Networks

Adapting Distributed Hash Tables for Mobile Ad Hoc Networks University of Tübingen Chair for Computer Networks and Internet Adapting Distributed Hash Tables for Mobile Ad Hoc Networks Tobias Heer, Stefan Götz, Simon Rieche, Klaus Wehrle Protocol Engineering and

More information

Static IP Routing and Aggregation Exercises

Static IP Routing and Aggregation Exercises Politecnico di Torino Static IP Routing and Aggregation xercises Fulvio Risso August 0, 0 Contents I. Methodology 4. Static routing and routes aggregation 5.. Main concepts........................................

More information

Bloom Filter based Inter-domain Name Resolution: A Feasibility Study

Bloom Filter based Inter-domain Name Resolution: A Feasibility Study Bloom Filter based Inter-domain Name Resolution: A Feasibility Study Konstantinos V. Katsaros, Wei Koong Chai and George Pavlou University College London, UK Outline Inter-domain name resolution in ICN

More information

International journal of Engineering Research-Online A Peer Reviewed International Journal Articles available online http://www.ijoer.

International journal of Engineering Research-Online A Peer Reviewed International Journal Articles available online http://www.ijoer. RESEARCH ARTICLE ISSN: 2321-7758 GLOBAL LOAD DISTRIBUTION USING SKIP GRAPH, BATON AND CHORD J.K.JEEVITHA, B.KARTHIKA* Information Technology,PSNA College of Engineering & Technology, Dindigul, India Article

More information

The Role and uses of Peer-to-Peer in file-sharing. Computer Communication & Distributed Systems EDA 390

The Role and uses of Peer-to-Peer in file-sharing. Computer Communication & Distributed Systems EDA 390 The Role and uses of Peer-to-Peer in file-sharing Computer Communication & Distributed Systems EDA 390 Jenny Bengtsson Prarthanaa Khokar jenben@dtek.chalmers.se prarthan@dtek.chalmers.se Gothenburg, May

More information

Wireless Sensor Networks Chapter 3: Network architecture

Wireless Sensor Networks Chapter 3: Network architecture Wireless Sensor Networks Chapter 3: Network architecture António Grilo Courtesy: Holger Karl, UPB Goals of this chapter Having looked at the individual nodes in the previous chapter, we look at general

More information

Incorporating Window-Based Passage-Level Evidence in Document Retrieval

Incorporating Window-Based Passage-Level Evidence in Document Retrieval Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological

More information

LOAD BALANCING WITH PARTIAL KNOWLEDGE OF SYSTEM

LOAD BALANCING WITH PARTIAL KNOWLEDGE OF SYSTEM LOAD BALANCING WITH PARTIAL KNOWLEDGE OF SYSTEM IN PEER TO PEER NETWORKS R. Vijayalakshmi and S. Muthu Kumarasamy Dept. of Computer Science & Engineering, S.A. Engineering College Anna University, Chennai,

More information

Six Degrees of Separation in Online Society

Six Degrees of Separation in Online Society Six Degrees of Separation in Online Society Lei Zhang * Tsinghua-Southampton Joint Lab on Web Science Graduate School in Shenzhen, Tsinghua University Shenzhen, Guangdong Province, P.R.China zhanglei@sz.tsinghua.edu.cn

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Predicting the Stock Market with News Articles

Predicting the Stock Market with News Articles Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is

More information

MIDAS: Multi-Attribute Indexing for Distributed Architecture Systems

MIDAS: Multi-Attribute Indexing for Distributed Architecture Systems MIDAS: Multi-Attribute Indexing for Distributed Architecture Systems George Tsatsanifos (NTUA) Dimitris Sacharidis (R.C. Athena ) Timos Sellis (NTUA, R.C. Athena ) 12 th International Symposium on Spatial

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig

More information

How To Create A P2P Network

How To Create A P2P Network Peer-to-peer systems INF 5040 autumn 2007 lecturer: Roman Vitenberg INF5040, Frank Eliassen & Roman Vitenberg 1 Motivation for peer-to-peer Inherent restrictions of the standard client/server model Centralised

More information

Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.

Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs. Multimedia Databases Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 0 Organizational Issues Lecture 21.10.2014 03.02.2015

More information

Scalable Prefix Matching for Internet Packet Forwarding

Scalable Prefix Matching for Internet Packet Forwarding Scalable Prefix Matching for Internet Packet Forwarding Marcel Waldvogel Computer Engineering and Networks Laboratory Institut für Technische Informatik und Kommunikationsnetze Background Internet growth

More information

A Review on Efficient File Sharing in Clustered P2P System

A Review on Efficient File Sharing in Clustered P2P System A Review on Efficient File Sharing in Clustered P2P System Anju S Kumar 1, Ratheesh S 2, Manoj M 3 1 PG scholar, Dept. of Computer Science, College of Engineering Perumon, Kerala, India 2 Assisstant Professor,

More information

Efficient Content Location Using Interest-Based Locality in Peer-to-Peer Systems

Efficient Content Location Using Interest-Based Locality in Peer-to-Peer Systems Efficient Content Location Using Interest-Based Locality in Peer-to-Peer Systems Kunwadee Sripanidkulchai Bruce Maggs Hui Zhang Carnegie Mellon University, Pittsburgh, PA 15213 {kunwadee,bmm,hzhang}@cs.cmu.edu

More information

Components: Interconnect Page 1 of 18

Components: Interconnect Page 1 of 18 Components: Interconnect Page 1 of 18 PE to PE interconnect: The most expensive supercomputer component Possible implementations: FULL INTERCONNECTION: The ideal Usually not attainable Each PE has a direct

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

Distributed Computing over Communication Networks: Topology. (with an excursion to P2P)

Distributed Computing over Communication Networks: Topology. (with an excursion to P2P) Distributed Computing over Communication Networks: Topology (with an excursion to P2P) Some administrative comments... There will be a Skript for this part of the lecture. (Same as slides, except for today...

More information

A Reputation Management System in Structured Peer-to-Peer Networks

A Reputation Management System in Structured Peer-to-Peer Networks A Reputation Management System in Structured Peer-to-Peer Networks So Young Lee, O-Hoon Kwon, Jong Kim and Sung Je Hong Dept. of Computer Science & Engineering, Pohang University of Science and Technology

More information

Introduction to Information Retrieval http://informationretrieval.org

Introduction to Information Retrieval http://informationretrieval.org Introduction to Information Retrieval http://informationretrieval.org IIR 7: Scores in a Complete Search System Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-05-07

More information

Raddad Al King, Abdelkader Hameurlain, Franck Morvan

Raddad Al King, Abdelkader Hameurlain, Franck Morvan Raddad Al King, Abdelkader Hameurlain, Franck Morvan Institut de Recherche en Informatique de Toulouse (IRIT), Université Paul Sabatier 118, route de Narbonne, F-31062 Toulouse Cedex 9, France E-mail:

More information

Mining Text Data: An Introduction

Mining Text Data: An Introduction Bölüm 10. Metin ve WEB Madenciliği http://ceng.gazi.edu.tr/~ozdemir Mining Text Data: An Introduction Data Mining / Knowledge Discovery Structured Data Multimedia Free Text Hypertext HomeLoan ( Frank Rizzo

More information

P2P VoIP for Today s Premium Voice Service 1

P2P VoIP for Today s Premium Voice Service 1 1 P2P VoIP for Today s Premium Voice Service 1 Ayaskant Rath, Stevan Leiden, Yong Liu, Shivendra S. Panwar, Keith W. Ross ARath01@students.poly.edu, {YongLiu, Panwar, Ross}@poly.edu, Steve.Leiden@verizon.com

More information

Towards a Next- Generation Inter-domain Routing Protocol. L. Subramanian, M. Caesar, C.T. Ee, M. Handley, Z. Mao, S. Shenker, and I.

Towards a Next- Generation Inter-domain Routing Protocol. L. Subramanian, M. Caesar, C.T. Ee, M. Handley, Z. Mao, S. Shenker, and I. Towards a Next- Generation Inter-domain Routing Protocol L. Subramanian, M. Caesar, C.T. Ee, M. Handley, Z. Mao, S. Shenker, and I. Stoica Routing 1999 Internet Map Coloured by ISP Source: Bill Cheswick,

More information

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. White Paper Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. Using LSI for Implementing Document Management Systems By Mike Harrison, Director,

More information

Peer-to-Peer Networks. Chapter 6: P2P Content Distribution

Peer-to-Peer Networks. Chapter 6: P2P Content Distribution Peer-to-Peer Networks Chapter 6: P2P Content Distribution Chapter Outline Content distribution overview Why P2P content distribution? Network coding Peer-to-peer multicast Kangasharju: Peer-to-Peer Networks

More information

Reputation Management Algorithms & Testing. Andrew G. West November 3, 2008

Reputation Management Algorithms & Testing. Andrew G. West November 3, 2008 Reputation Management Algorithms & Testing Andrew G. West November 3, 2008 EigenTrust EigenTrust (Hector Garcia-molina, et. al) A normalized vector-matrix multiply based method to aggregate trust such

More information

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

low-level storage structures e.g. partitions underpinning the warehouse logical table structures DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: low-level storage structures e.g. partitions underpinning the warehouse logical table structures low-level structures

More information

Virtual Landmarks for the Internet

Virtual Landmarks for the Internet Virtual Landmarks for the Internet Liying Tang Mark Crovella Boston University Computer Science Internet Distance Matters! Useful for configuring Content delivery networks Peer to peer applications Multiuser

More information

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

More information

Interactive Recovery of Requirements Traceability Links Using User Feedback and Configuration Management Logs

Interactive Recovery of Requirements Traceability Links Using User Feedback and Configuration Management Logs Interactive Recovery of Requirements Traceability Links Using User Feedback and Configuration Management Logs Ryosuke Tsuchiya 1, Hironori Washizaki 1, Yoshiaki Fukazawa 1, Keishi Oshima 2, and Ryota Mibe

More information

Load Balancing in Structured Overlay Networks. Tallat M. Shafaat tallat(@)kth.se

Load Balancing in Structured Overlay Networks. Tallat M. Shafaat tallat(@)kth.se Load Balancing in Structured Overlay Networks Tallat M. Shafaat tallat(@)kth.se Overview Background The problem : load imbalance Causes of load imbalance Solutions But first, some slides from previous

More information

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015 W. Heath Rushing Adsurgo LLC Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare Session H-1 JTCC: October 23, 2015 Outline Demonstration: Recent article on cnn.com Introduction

More information

query enabled P2P networks 2009. 08. 27 Park, Byunggyu

query enabled P2P networks 2009. 08. 27 Park, Byunggyu Load balancing mechanism in range query enabled P2P networks 2009. 08. 27 Park, Byunggyu Background Contents DHT(Distributed Hash Table) Motivation Proposed scheme Compression based Hashing Load balancing

More information

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28 Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object

More information

Effective Keyword-based Selection of Relational Databases

Effective Keyword-based Selection of Relational Databases Effective Keyword-based Selection of Relational Databases Bei Yu National University of Singapore Guoliang Li Tsinghua University Anthony K. H. Tung National University of Singapore Karen Sollins MIT ABSTRACT

More information

Development of an Enhanced Web-based Automatic Customer Service System

Development of an Enhanced Web-based Automatic Customer Service System Development of an Enhanced Web-based Automatic Customer Service System Ji-Wei Wu, Chih-Chang Chang Wei and Judy C.R. Tseng Department of Computer Science and Information Engineering Chung Hua University

More information

Analysis on Leveraging social networks for p2p content-based file sharing in disconnected manets

Analysis on Leveraging social networks for p2p content-based file sharing in disconnected manets Analysis on Leveraging social networks for p2p content-based file sharing in disconnected manets # K.Deepika 1, M.Tech Computer Science Engineering, Mail: medeepusony@gmail.com # K.Meena 2, Assistant Professor

More information

American Journal of Engineering Research (AJER) 2013 American Journal of Engineering Research (AJER) e-issn: 2320-0847 p-issn : 2320-0936 Volume-2, Issue-4, pp-39-43 www.ajer.us Research Paper Open Access

More information

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 ushaduraisamy@yahoo.co.in 2 anishgracias@gmail.com Abstract A vast amount of assorted

More information

Christian Bettstetter. Mobility Modeling, Connectivity, and Adaptive Clustering in Ad Hoc Networks

Christian Bettstetter. Mobility Modeling, Connectivity, and Adaptive Clustering in Ad Hoc Networks Christian Bettstetter Mobility Modeling, Connectivity, and Adaptive Clustering in Ad Hoc Networks Contents 1 Introduction 1 2 Ad Hoc Networking: Principles, Applications, and Research Issues 5 2.1 Fundamental

More information

Cassandra A Decentralized, Structured Storage System

Cassandra A Decentralized, Structured Storage System Cassandra A Decentralized, Structured Storage System Avinash Lakshman and Prashant Malik Facebook Published: April 2010, Volume 44, Issue 2 Communications of the ACM http://dl.acm.org/citation.cfm?id=1773922

More information

Distributed Hash Tables in P2P Systems - A literary survey

Distributed Hash Tables in P2P Systems - A literary survey Distributed Hash Tables in P2P Systems - A literary survey Timo Tanner Helsinki University of Technology tstanner@cc.hut.fi Abstract Distributed Hash Tables (DHT) are algorithms used in modern peer-to-peer

More information

Performance of networks containing both MaxNet and SumNet links

Performance of networks containing both MaxNet and SumNet links Performance of networks containing both MaxNet and SumNet links Lachlan L. H. Andrew and Bartek P. Wydrowski Abstract Both MaxNet and SumNet are distributed congestion control architectures suitable for

More information

High Throughput Computing on P2P Networks. Carlos Pérez Miguel carlos.perezm@ehu.es

High Throughput Computing on P2P Networks. Carlos Pérez Miguel carlos.perezm@ehu.es High Throughput Computing on P2P Networks Carlos Pérez Miguel carlos.perezm@ehu.es Overview High Throughput Computing Motivation All things distributed: Peer-to-peer Non structured overlays Structured

More information

Taxonomies in Practice Welcome to the second decade of online taxonomy construction

Taxonomies in Practice Welcome to the second decade of online taxonomy construction Building a Taxonomy for Auto-classification by Wendi Pohs EDITOR S SUMMARY Taxonomies have expanded from browsing aids to the foundation for automatic classification. Early auto-classification methods

More information

Introduction to LAN/WAN. Network Layer

Introduction to LAN/WAN. Network Layer Introduction to LAN/WAN Network Layer Topics Introduction (5-5.1) Routing (5.2) (The core) Internetworking (5.5) Congestion Control (5.3) Network Layer Design Isues Store-and-Forward Packet Switching Services

More information

Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum

Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum Statistical Validation and Data Analytics in ediscovery Jesse Kornblum Administrivia Silence your mobile Interactive talk Please ask questions 2 Outline Introduction Big Questions What Makes Things Similar?

More information

Decentralized Peer-to-Peer Network Architecture: Gnutella and Freenet

Decentralized Peer-to-Peer Network Architecture: Gnutella and Freenet Decentralized Peer-to-Peer Network Architecture: Gnutella and Freenet AUTHOR: Jem E. Berkes umberkes@cc.umanitoba.ca University of Manitoba Winnipeg, Manitoba Canada April 9, 2003 Introduction Although

More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

Eng. Mohammed Abdualal
