Using Bloom Filters to Refine Web Search Results
Navendu Jain, Department of Computer Sciences, University of Texas at Austin, Austin, TX
Mike Dahlin, Department of Computer Sciences, University of Texas at Austin, Austin, TX
Renu Tewari, IBM Almaden Research Center, 650 Harry Road, San Jose, CA

ABSTRACT
Search engines have primarily focused on presenting the most relevant pages to the user quickly. A less well explored aspect of improving the search experience is to remove or group all near-duplicate documents in the results presented to the user. In this paper, we apply a Bloom filter based similarity detection technique to address this issue by refining the search results presented to the user. First, we present and analyze our technique for finding similar documents using content-defined chunking and Bloom filters, and demonstrate its effectiveness in compactly representing and quickly matching pages for similarity testing. Later, we demonstrate how a number of results of popular and random search queries retrieved from different search engines, Google, Yahoo, MSN, are similar and can be eliminated or re-organized.

1. INTRODUCTION
Enterprise and web search has become a ubiquitous part of the web experience. Numerous studies have shown that the ad-hoc distribution of information on the web has resulted in a high degree of content aliasing (i.e., the same data contained in pages from different URLs) [14], which adversely affects the performance of search engines [6]. The initial study by Broder et al. in 1997 [7], and the later one by Fetterly et al. [11], show that around 29.2% of data is common across pages in a sample of 150 million pages. This common data, when presented to the user on a search query, degrades the user experience by repeating the same information on every click. Similar data can be grouped or eliminated to improve the search experience. Similarity based grouping is also useful for organizing the results presented by meta-crawlers (e.g., vivisimo, metacrawler, dogpile, copernic).
The findings by searchenginejournal.com [2] show a significant overlap of search results returned by the Google and Yahoo search engines: the top 2 keyword searches from Google had about 4% identical or similar pages to the Yahoo results. Sometimes search results may appear different purely due to the restructuring and reformatting of data. For example, one site may format a document into multiple web pages, with the top level page only containing a fraction of the document along with a next link to follow to the remaining part, while another site may have the entire document in the same web page. An effective similarity detection technique should find these contained documents and label them as similar.

This work was supported in part by the Texas Advanced Technology Program, the National Science Foundation (CNS ), and an IBM Faculty Partnership Award. This work was done during an internship at IBM Almaden. Copyright is held by the author/owner(s). Eighth International Workshop on the Web and Databases (WebDB 2005), June 16-17, 2005, Baltimore, Maryland.

Although improving search results by identifying near-duplicates had been proposed for Altavista [6], we found that popular search engines, Google, Yahoo, MSN, even today have a significant fraction of near-duplicates in their top results¹. For example, consider the results of the query "emacs manual" using the Google search engine. We focus on the top 20 results (i.e., first 2 pages) as they represent the results most likely to be viewed by the user. Four of the results, toc.html, toc.html, www. dc.urkuak.fi/docs/gnu/emacs/emacs_toc.html, and chapter/emacs_toc.html, on the first page (top-10 results), were highly similar; in fact, they had nearly identical content but different page headers, disclaimers, and logo images. For this particular query, on the whole, 7 out of 20 documents were redundant (3 identical pairs and 4 similar to one top page document). Similar results were found using the Yahoo, MSN², and A9³ search engines.
In this paper, we study the current state of popular search engines and evaluate the application of a Bloom filter based near-duplicate detection technique on search results. We demonstrate, using multiple search engines, how a number of results (ranging from 7% to 60%) on search queries are similar and can be eliminated or re-organized. Later, we explore the use of Bloom filters for finding similar objects and demonstrate their effectiveness in compactly representing and quickly matching pages for similarity testing. Although Bloom filters have been extensively used for set membership checks, they have not been analyzed for similarity detection between text documents. Finally, we apply our Bloom filter based technique to effectively remove similar search results and improve user experience.

Our evaluation of search results shows that the occurrence of near-duplicates is strongly correlated to: i) the relevance of the document and ii) the popularity of the query. Documents that are considered more relevant and have a higher rank also have more near-duplicates compared to less relevant documents. Similarly, results from the more popular queries have more near-duplicates compared to the less popular ones.

Our similarity matcher can be deployed as a filter over any search engine's result set. Integrating our similarity detection algorithm with a search engine adds only about 0.4% extra bytes per document and provides fast matching on the order of milliseconds, as described later in Section 3. Note that we focus on one main aspect of similarity: text content. This might not completely capture the human-judgement notion of similarity in all cases. However, our technique can be easily extended to include link structure based similarity measures by comparing Bloom filters generated from hyperlinks embedded in web pages.

The rest of the paper is organized as follows. Similarity detection using Bloom filters is described and analyzed in Section 2. Section 3 evaluates and compares our similarity technique to improve search results from multiple engines and for different workloads. Finally, Section 4 covers related work and we conclude with Section 5.

¹ Google does have a patent [17] for near-duplicate detection although it is not clear which approach they use.
² Results for a recently popular query, "ohio court battle", from both Google and MSN search had a similar behavior, with 1 and 4 out of the top 20 results being identical, respectively.
³ A9 states that it uses a Google back-end for part of its search.

2. SIMILARITY DETECTION USING BLOOM FILTERS
Our similarity detection algorithm proceeds in three steps as follows. First, we use content-defined chunking (CDC) to extract document features that are resilient to modifications. Second, we use these features as set elements for generating Bloom filters⁴. Third, we compare the Bloom filters to detect near-duplicate documents above a certain similarity threshold (say 70%). We start with an overview of Bloom filters and CDCs, and later present and analyze the similarity detection technique for refining web search results.

2.1 Bloom Filter Overview
A Bloom filter of a set $U$ is implemented as an array of $m$ bits [4]. Each element $u$ ($u \in U$) of the set is hashed using $k$ independent hash functions $h_1, \ldots, h_k$. Each hash function $h_i(u)$ for $1 \le i \le k$ maps to one bit in the array $\{1 \ldots m\}$. Thus, when an element is added to the set, it sets $k$ bits, each bit corresponding to a hash function, in the Bloom filter array to 1. If a bit was already set it stays 1.
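The insertion and membership rules just described can be illustrated with a minimal Python sketch. It is not the authors' C/Perl implementation: the class name `BloomFilter`, the use of a Python integer as the bit array, and the derivation of the k hash functions by salting SHA-256 with the index i are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative m-bit Bloom filter with k hash functions (hypothetical helper)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m      # number of bits in the array
        self.k = k      # number of hash functions
        self.bits = 0   # a Python int used as an m-bit vector

    def _positions(self, element: bytes):
        # Assumption: salting one cryptographic hash with the index i
        # stands in for k independent hash functions h_1 .. h_k.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + element).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, element: bytes) -> None:
        # Adding an element sets k bits; a bit that is already 1 stays 1.
        for pos in self._positions(element):
            self.bits |= 1 << pos

    def __contains__(self, element: bytes) -> bool:
        # Membership holds only if all k corresponding bits are set.
        return all((self.bits >> pos) & 1 for pos in self._positions(element))
```

For example, after `bf = BloomFilter(m=2048, k=4)` and `bf.add(b"chunk")`, the test `b"chunk" in bf` is true, while an element that was never added is reported absent unless all k of its bits happen to collide with bits already set.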
For set membership checks, Bloom filters may yield a false positive, where it may appear that an element $v$ is in $U$ even though it is not. From the analysis in [8], given $n = |U|$ and the Bloom filter size $m$, the optimal value of $k$ that minimizes the false positive probability $p^k$, where $p$ denotes the probability that a given bit is set in the Bloom filter, is $k = \frac{m}{n} \ln 2$. Previously, Bloom filters have primarily been used for finding set-membership [8].

2.2 Content-defined Chunking Overview
To compute the Bloom filter of a document, we first need to split it into a set of elements. Observe that splitting a document using a fixed block size makes it very susceptible to modifications, thereby making it useless for similarity comparison. For effective similarity detection, we need a mechanism that is more resilient to changes in the document. CDC splits a document into variable-sized blocks whose boundaries are determined by its Rabin fingerprint matching a predetermined marker value [18]. The number of bits in the Rabin fingerprint that are used to match the marker determines the expected chunk size. For example, given a marker 0x78 and an expected chunk size of $2^k$, a rolling (overlapping sequence) 48-byte fingerprint is computed. If the lower $k$ bits of the fingerprint equal 0x78, a new chunk boundary is set. Since the chunk boundaries are content-based, any modifications should affect only a couple of neighboring chunks and not the entire document. CDC has been used in LBFS [15], REBL [13] and other systems for redundancy elimination.

⁴ Within a search engine context, the CDCs and the Bloom filters of the documents can be computed offline and stored.

2.3 Bloom Filters for Similarity Testing
Observe that we can view each document to be a set in Bloom filter parlance whose elements are the CDCs that it is composed of⁵. Given that Bloom filters compactly represent a set, they can also be used to approximately match two sets.
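The content-defined chunking step of Section 2.2 can be sketched as follows. This is a simplified illustration, not the LBFS code the paper builds on: a polynomial rolling hash stands in for a true Rabin fingerprint, and the constants `BASE` and `MOD` as well as the function name `chunk` are assumptions. The 48-byte window, the marker 0x78, the 256-byte expected chunk size (8 low bits), and the 64 KB maximum chunk size come from the text.

```python
WINDOW = 48             # bytes covered by the rolling fingerprint
MASK_BITS = 8           # low k bits compared: expected chunk size 2**8 = 256 bytes
MARKER = 0x78           # marker value from the text
MAX_CHUNK = 64 * 1024   # upper bound on the size of a single chunk
BASE = 257              # assumption: parameters of the stand-in rolling hash
MOD = (1 << 61) - 1


def chunk(data: bytes) -> list:
    """Split data where the low bits of a rolling hash of the last 48 bytes equal MARKER."""
    mask = (1 << MASK_BITS) - 1
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte that leaves the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                   # shift in the newest byte
        window_full = i + 1 >= WINDOW
        at_marker = window_full and (h & mask) == MARKER
        if at_marker or i - start + 1 >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])   # trailing partial chunk
    return chunks
```

Because each boundary depends only on the 48 bytes preceding it, inserting a few bytes at the front of a document changes the first chunk but leaves the later chunks byte-for-byte identical, which is the resilience property that fixed-size blocks lack.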
Bloom filters, however, cannot be used for exact matching as they have a finite false-match probability, but they are naturally suited for similarity matching. For finding similar documents, we compare the Bloom filter of one with that of the other. In case the two documents share a large number of 1's (bit-wise AND) they are marked as similar. In this case, the bit-wise AND can also be perceived as the dot product of the two bit vectors. If the set bits in the Bloom filter of a document are a complete subset of that of another filter then it is highly probable that the document is included in the other.

Web pages are typically composed of fragments, either static ones (e.g., logo images), or dynamic (e.g., personalized product promotions, local weather) [19]. When targeting pages for a similarity based grouping, the test for similarity should be on the fragment of interest and not the entire page.

Bloom filters, when applied to similarity detection, have several advantages. First, the compactness of Bloom filters is very attractive for storage and transmission whenever we want to minimize the meta-data overheads. Second, Bloom filters enable fast comparison as matching is a bitwise-AND operation. Third, since Bloom filters are a complete representation of a set rather than a deterministic sample (e.g., shingling), they can determine inclusions effectively.

To demonstrate the effectiveness of Bloom filters for similarity detection, consider, for example, the pages from the Money/CNN web server (money.cnn.com). We crawled 13 MB of data from the site that resulted in 1753 documents. We compared the top-level page marsh_ceo/index.html with all the other pages from the site. For each document, we converted it into a canonical representation as described later in Section 3. The CDCs of the pages were computed using an expected and maximum chunk size of 256 bytes and 64 KB respectively. The corresponding Bloom filter was of size 256 bytes.
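The comparison described in this section, counting the 1's that survive a bit-wise AND, can be sketched in a few lines of Python. The function names `bloom_bits`, `match_fraction`, and `is_similar`, the SHA-256 based hashing, and the default k = 4 are assumptions for illustration; the 2048-bit (256-byte) filter size and the 70% threshold are taken from the text. As a simplification, repeated chunks collapse onto the same bits here, whereas the paper makes each copy unique first.

```python
import hashlib


def bloom_bits(elements, m: int = 2048, k: int = 4) -> int:
    """Return an m-bit Bloom filter, stored as a Python int, for an iterable of bytes."""
    bits = 0
    for element in elements:
        for i in range(k):
            # Assumption: index-salted SHA-256 stands in for k independent hashes.
            digest = hashlib.sha256(i.to_bytes(4, "big") + element).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % m)
    return bits


def match_fraction(source: int, candidate: int) -> float:
    """Fraction of the source filter's set bits that are also set in the candidate."""
    set_in_source = bin(source).count("1")
    if set_in_source == 0:
        return 0.0
    return bin(source & candidate).count("1") / set_in_source


def is_similar(source: int, candidate: int, threshold: float = 0.7) -> bool:
    """Mark two documents as similar when enough of the source's bits match."""
    return match_fraction(source, candidate) >= threshold
```

If every chunk of one document also occurs in another, the first filter's set bits are a subset of the second's and `match_fraction` returns 1.0, which corresponds to the containment case mentioned above.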
Figure 1 shows that two other copies of the page, one with the URI /24/1/25/news/fortune5/marsh_ceo/index.htm and another one with a dynamic URI /24/1/25/news/fortune5/marsh_ceo/index.htm?cnn=yes, matched with all set bits in the Bloom filter of the original document.

As another example, we crawled around 2 MB of data (59 documents) from the IBM web site. We compared the page /investor/corpgovernance/index.phtml with all the other crawled pages from the site. The chunk sizes were chosen as above. Figure 2 shows that two other pages with the URIs /investor/corpgovernance/cgcoi.phtml and /investor/corpgovernance/cgblaws.phtml appeared similar, matching in 53% and 69% of the bits in the Bloom filter, respectively.

To further illustrate that Bloom filters can differentiate between multiple similar documents, we extracted a technical documentation file foo (say) (of size 17 KB) incrementally from a CVS archive, generating 20 different versions, with foo being the original, foo.1 being the first version (with a change of 415 bytes from foo) and foo.19 being the last. As shown in Figure 3, the Bloom filter for foo matched the most (98%) with the closest version foo.1.

⁵ For multisets, we make each CDC unique before Bloom filter generation to differentiate multiple copies of the same CDC.
[Figure 1: Comparison of the document marsh_ceo/index.html with all pages from the money.cnn.com web site. Panel title: Document Similarity using Bloom Filter: marsh_ceo/index.html. Y-axis: fraction of 1's matched in the AND outputs. X-axis: web documents in the money.cnn.com source tree.]

Analysis
The main consideration when using Bloom filters for similarity detection is the false match probability of the above algorithm as a function of similarity between the source and a candidate document. Extending the analysis for membership testing in [4] to similarity detection, we proceed to determine the expected number of inferred matches between the two sets. Let $A$ and $B$ be the two sets being compared for similarity. Let $m$ denote the number of bits (size) in the Bloom filter. For simplicity, assume that both sets have the same number of elements. Let $n$ denote the number of elements in both sets $A$ and $B$, i.e., $|A| = |B| = n$. As before, $k$ denotes the number of hash functions. The probability that a bit is set by a hash function $h_i$ for $1 \le i \le k$ is $\frac{1}{m}$. A bit can be set by any of the $k$ hash functions for each of the $n$ elements. Therefore, the probability that a bit is not set by any hash function for any element is $(1 - \frac{1}{m})^{nk}$. Thus, the probability, $p$, that a given bit is set in the Bloom filter of $A$ is given by:

$$p = 1 - \left(1 - \frac{1}{m}\right)^{nk} \approx 1 - e^{-nk/m} \qquad (1)$$

For an element to be considered a member of the set, all the corresponding $k$ bits should be set. Thus, the probability of a false match, i.e., an outside element is inferred as being in set $A$, is $p^k$. Let $C$ denote the intersection of sets $A$ and $B$ and $c$ denote its cardinality, i.e., $C = A \cap B$ and $|C| = c$. For similarity comparison, let us take each element in set $B$ and check if it belongs to the Bloom filter of the given set $A$.
We should find that the $c$ common elements will definitely match and a few of the other $(n - c)$ may also match due to the false match probability. By Linearity of Expectation, the expected number of elements of $B$ inferred to have matched with $A$ is

$$E[\text{\# of inferred matches}] = c + (n - c)\,p^k$$

To minimize the false matches, this expected number should be as close to $c$ as possible. For that, $(n - c)\,p^k$ should be close to 0, i.e., $p^k$ should approach 0. This happens to be the same as minimizing the probability of a false positive. Expanding $p$ and under asymptotic analysis, it reduces to minimizing $(1 - e^{-nk/m})^k$. Using the same analysis for minimizing the false positive rate given in [8], the minima obtained after differentiation is when $k = \frac{m}{n} \ln 2$. Thus, the expected number of inferred matches for this value of $k$ becomes

$$E[\text{\# of inferred matches}] = c + (n - c)(0.6185)^{m/n}$$

[Figure 2: Comparison of the document investor/corpgovernance/index.phtml with pages from the IBM web site. Panel title: Document Similarity using Bloom Filter: investor/corpgovernance/index.phtml. Y-axis: fraction of 1's matched in the AND outputs. X-axis: web documents in the source tree.]

[Figure 3: Comparison of the original file foo with later versions foo.1, foo.2, ..., foo.19. Panel title: File Similarity using Bloom Filter: CVS Repository Benchmark. Y-axis: fraction of 1's matched in the AND outputs. X-axis: foo versions.]

Thus, the expected number of bits set corresponding to inferred matches is

$$E[\text{\# of matched bits}] = m\left[1 - e^{-\frac{k}{m}\left(c + (n - c)(0.6185)^{m/n}\right)}\right]$$

Under the assumption of perfectly random hash functions, the expected number of total bits set in the Bloom filter of the source set $A$ is $mp$. The ratio, then, of the expected number of matched bits corresponding to inferred matches in $A \cap B$ to the expected total number of bits set in the Bloom filter of $A$ is:

$$\frac{E[\text{\# of matched bits}]}{E[\text{\# total bits set}]} = \frac{1 - e^{-\frac{k}{m}\left(c + (n - c)(0.6185)^{m/n}\right)}}{1 - e^{-nk/m}}$$

Observe that this ratio equals 1 when all the elements match, i.e., $c = n$. If there are no matching elements, i.e., $c = 0$, the ratio is $2\left(1 - (0.5)^{(0.6185)^{m/n}}\right)$. For $m = n$, this evaluates to 0.6973, i.e., 69% of matching bits may be false.
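The ratio of matched bits to total set bits can be checked numerically. The sketch below assumes the optimal number of hash functions, k = (m/n) ln 2, so that the per-element false-match probability is (0.6185)^(m/n); the function names `matched_bit_ratio` and `false_match_ratio` are hypothetical and chosen only for this illustration.

```python
import math


def matched_bit_ratio(m: float, n: float, c: float) -> float:
    """Expected matched bits over expected set bits, for k = (m/n) ln 2."""
    k = (m / n) * math.log(2)
    # Expected number of elements of B inferred to match A.
    inferred = c + (n - c) * 0.6185 ** (m / n)
    return (1 - math.exp(-k * inferred / m)) / (1 - math.exp(-n * k / m))


def false_match_ratio(bits_per_element: float) -> float:
    """Closed form of the ratio when no elements are shared (c = 0)."""
    return 2.0 * (1.0 - 0.5 ** (0.6185 ** bits_per_element))
```

With one bit per element (m = n) `false_match_ratio(1)` is about 0.6973, and with eight bits per element (m = 8n) it drops to about 0.0295, so growing the filter relative to the number of chunks quickly suppresses spurious matches.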
For larger values, $m = 2n, 4n, 8n, 10n, 11n$, the corresponding ratios are 0.4658, 0.1929, 0.0295, 0.0113, and 0.007 respectively. Thus, for $m = 11n$, on an average, less than 1% of the bits set may match incorrectly. The expected ratio of matching bits is highly correlated to the expected ratio of matching elements. Thus, if a large fraction of the bits match, then it is highly likely that a large fraction of the elements are common.

2.4 Discussion
Previous work on document similarity has mostly been based on shingling or super fingerprints. Using this method, for each object, all the k consecutive words of a document (called k-shingles) are hashed using Rabin fingerprint [18] to create a set of fingerprints (also called features or pre-images). These fingerprints are then sampled to compute a super-fingerprint of the document. Many variants have been proposed that use different techniques on how the shingle fingerprints are sampled (min-hashing, Mod_m, Min_s etc.) and matched [7, 6, 5]. While Mod_m selects all fingerprints whose value modulo m is zero, Min_s selects the set of s fingerprints with the smallest value. The min-hashing approach further refines the sampling to be the min values of say 84 random min-wise independent permutations (or hashes) of the set of all shingle fingerprints. This results in a fixed size sample of 84 fingerprints that is the resulting feature vector. To further simplify matching, these 84 fingerprints can be grouped as 6 super-shingles by concatenating 14 adjacent fingerprints [11]. In [13] these are called super-fingerprints. A pair of objects are then considered similar if either all or a large fraction of the values in the super-fingerprints match.

Our Bloom filter based similarity detection differs from the shingling technique in several ways. It should be noted, however, that the variants of shingling discussed above improve upon the original approach and we provide a comparison of our technique with these variants wherever applicable.
First, shingling (Mod_m, Min_s) computes document similarity using the intersection of the two feature sets. In our approach, it requires only the bit-wise AND of the two Bloom filters (e.g., two 128 bit vectors). Next, shingling has a higher computational overhead as it first segments the document into k-word shingles (k = 5 in [11]), resulting in a shingle set size of about $S - k + 1$, where $S$ is the document size. Later, it computes the image (value) of each shingle by applying a set (say $H$) of min-wise independent hash functions ($|H| = 84$ as used in [11]) and then, for each function, selecting the shingle corresponding to the minimum image. On the other hand, we apply a set of independent hash functions (typically less than 8) to the chunk set of size on average $S/c$, where $c$ is the expected chunk size (e.g., $c$ = 256 bytes for an $S$ = 8 KB document).

Third, the size of the feature set (number of shingles) depends on the sampling technique in shingling. For example, in Mod_m, even some large documents might have very few features whereas small documents might have zero features. Some shingling variants (e.g., Min_s, Mod_{2^i}) aim to select roughly a constant number of features. Our CDC based approach only varies the chunk size c, to determine the number of chunks as a trade-off between performance and fine-grained matching. We leave the empirical comparison with shingling as future work. In general, a compact Bloom filter is easier to attach as a document tag and can be compared simply by matching the bits. Thus, Bloom filter based matching is more suitable for meta crawlers and can be added on to existing search engines without any significant changes.

3. EXPERIMENTAL EVALUATION
In this section, we evaluate Bloom filter-based similarity detection using several types of query results obtained from querying different search engines using the keywords posted on Google Zeitgeist, Yahoo Buzz (buzz.yahoo.com), and MSN Search Insider.

3.1 Methodology
We have implemented our similarity detection module using C and Perl. The code for content defined chunking is based on the CDC implementation of LBFS [15].
The experimental testbed used a 933 MHz Intel Pentium III workstation with 512 MB of RAM running a Linux kernel. The three commercial search engines used in our evaluation are Google, Yahoo Search, and MSN Search. The Google search results were obtained using the GoogleAPI [1]; for each of the search queries, the API was called to return the top 1000 search results. Although we requested 1000 results, the API, due to some internal errors, always returned fewer than 1000 entries, varying from 481 to 897. For each search result, the document from the corresponding URL was fetched from the original web server to compute its Bloom filter. Each document was converted into a canonical form by removing all the HTML markups and tags, bullets and numberings such as a.1, extra white space, colons, replacing dashes, single-quotes and double-quotes with single space, and converting all the text to lower case to make the comparison case insensitive. In many cases, due to server unavailability, incorrect document links, page not found errors, and network timeouts, the entire set of requested documents could not always be retrieved.

Size of the Bloom Filter
As we discussed in Section 2, the fraction of bits that match incorrectly depends on the size of the Bloom filter. For a 97% accurate match, the number of bits in the Bloom filter should be 8x the number of elements (chunks) in the set (document). When applying CDC to each document, we use the expected chunk size of 256 bytes, while limiting the maximum chunk size to 64 KB. For an average document of size 8 KB, this results in around 32 chunks. The Bloom filter is set to be 8x this value, i.e., 256 bits. To accommodate large documents, we set the maximum document size to 64 KB (corresponding to the maximum chunk size). Therefore, the Bloom filter size is set to be 8x the expected number of chunks (256 for a document size of 64 KB), i.e., 2048 bits or 256 bytes, which is a 3.2% and 0.4% overhead for document sizes of 8 KB and 64 KB respectively.

Example.
When we applied the Bloom filter based matcher to the "emacs manual" query (Section 1), we found that the page chapter/emacs_toc.html matched the other three, emacs_toc.html, toc.html, and toc.html, with 74%, 81% and 95% of the Bloom filter bits matching, respectively. A 70% matching threshold would have identified and grouped all these 4 pages together.

[Figure 4: "emacs manual" query search results (Google). Panel title: Near-Duplicate Results for "emacs manual" search on Google. Y-axis: percentage of duplicate documents. Series: 50%, 60%, 70%, 80%, and 90% similar.]

3.2 Effect of the Degree of Similarity
In this section, we evaluate how the degree of similarity affects the number of documents that are marked similar. The degree of similarity is the percentage of the document data that matches (e.g., a 100% degree of similarity is an identical document). Intuitively, the higher the degree of similarity, the lower the number of documents that should match. Moreover, the number of documents that are similar depends on the total number of documents retrieved by the query. Although we initially expected a linear behavior, we observed that the higher ranked results (the top 10 to 20 results) were also the ones that were more duplicated.

Using GoogleAPI, we retrieved 493 results for the "emacs manual" query. To determine the number of documents that are similar among the set of retrieved documents, we use a union-find data structure for clustering Bloom filters of the documents based on similarity. Figure 4 shows that for 493 documents retrieved, the number of document clusters were 56, 22, 317, 328, 340, when the degree of similarity was 50, 60, 70, 80, 90%, respectively. Each cluster represents a set of similar documents (or a single document if no similar ones are found). We assume that a document belongs to a cluster if it is similar to a document in the cluster, i.e., we assume that similarity is transitive for high values of the degree of similarity (as in [9]).
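The union-find clustering described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: filters are represented as Python integers, the function name `cluster` is hypothetical, and treating a pair as similar when the match fraction reaches the threshold in either direction is an assumption.

```python
def cluster(filters, threshold: float = 0.7):
    """Group document indices whose Bloom filters match at or above the threshold."""
    parent = list(range(len(filters)))

    def find(x: int) -> int:
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    def fraction(a: int, b: int) -> float:
        # Share of a's set bits that survive the bit-wise AND with b.
        ones = bin(a).count("1")
        return bin(a & b).count("1") / ones if ones else 0.0

    for i in range(len(filters)):
        for j in range(i + 1, len(filters)):
            # Assumption: a match in either direction joins the two clusters.
            if max(fraction(filters[i], filters[j]),
                   fraction(filters[j], filters[i])) >= threshold:
                union(i, j)

    groups = {}
    for i in range(len(filters)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Because unions are transitive, a document joins a cluster as soon as it is similar to any one member, which mirrors the transitivity assumption stated above. The number of duplicates for a query is then the number of documents minus the number of clusters returned.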
The fraction of duplicate documents, as shown in Figure 4, decreases from 88% to 31% as the degree of similarity increases from 50% to 90%. As the number of retrieved results increases from 1 to 493, the fraction of duplicate documents initially decreases and then increases, forming a minimum around 25 results. The decrease was due to the larger aliasing of better ranked documents. However, as the number of results increases, the initial set of documents gets repeated more frequently, increasing the number of duplicates. Similar results were obtained for a number of other queries that we evaluated.

3.3 Effect of the Search Query Popularity
To get a representative collection of the types of queries performed on search engines, we selected samples from Google Zeitgeist (Nov. 2004) of three different query popularities: i) Most Popular, ii) Medium-Popular, and iii) Random.

[Figure 5: Near-duplicate results for popular searches ("jon stewart crossfire", "electoral college", "day of the dead"). Y-axis: number of near-duplicate documents (70% similar).]
[Figure 6: Near-duplicate results for medium-popular searches ("republican national convention", "national hurricane center", "indian larry"). Y-axis: number of near-duplicate documents (70% similar).]
[Figure 7: Near-duplicate results for random searches ("olympics 2004 doping", "hawking black hole bet", "x prize spaceship"). Y-axis: number of near-duplicate documents (70% similar).]

For most-popular search queries, the three queries selected in order were "jon stewart crossfire" (TP1), "electoral college" (TP2) and "day of the dead" (TP3). We computed the number of duplicates having 70% similarity (at least 70% of the bits in the filter matched) in the search results. Figure 5 shows the corresponding number of duplicates for a maximum of 870 search results from the Google search API. The TP1 query had the maximum fraction of near-duplicates, 44.3%, while the other two, TP2 and TP3, had 29.7% and 24.3%, respectively. Observe that the most popular query, TP1, was the one with the most duplicates.

For the medium popular queries, we selected three queries from the list "Google Top 10 Gaining Queries" for the week ending Aug. 3, 2004 on the Google Zeitgeist: "indian larry" (MP1), "national hurricane center" (MP2) and "republican national convention" (MP3). Figure 6 shows the corresponding search results having 70% similarity for a maximum of 880 documents from the Google search engine. The fraction of near-duplicates among 880 search results ranged from 16% for MP1 to 28% for MP2.
For a non-popular query sample, we selected three queries at random: "olympics 2004 doping", "hawking black hole bet", and "x prize spaceship". The Google API retrieved only about 36 results for the first two queries and 32 results for the third query. Figure 7 shows the number of near-duplicate documents in the search results corresponding to the three queries. The fraction of near-duplicates in all these queries was in the same range, around 18%. As we observed earlier, as the popularity of queries decreases, so does the number of duplicate results. The most popular queries had the largest number of near-duplicate results, the medium ones fewer, and the random queries the lowest.

3.4 Behavior of different search engines
The previous experiments all compared the results from the Google search engine. We next evaluate the behavior of all three search engines, Google, Yahoo and MSN search, in returning near-duplicate documents for the 10 popular queries featured on their respective web sites. To our knowledge, Yahoo and MSN search do not provide an API similar to the GoogleAPI for doing automated retrieval of search results. Therefore, we manually made HTTP requests to the URLs corresponding to the first 5 search results for a query. We plot the minimum, average and maximum number of near-duplicate (at least 70% similar) search results in the 10 popular queries. The three whiskers on each vertical bar in Figures 8, 9, and 10 represent min., avg., and max. in order. Figure 8 shows the results for Google, with the average number of near-duplicates ranging from 7% to 23%. Figure 9 shows near-duplicates in Yahoo results ranging from 12% to 25%. Figure 10 shows the results for MSN, where the near-duplicates range from 18% to 26%. For comparison, on the earlier "emacs manual" query, MSN had 32% near-duplicates while Yahoo had 22%. These experiments support our hypothesis that current search engines return a significant number of near-duplicates.
However, these results do not in any way suggest that any particular search engine performs better than the others.

[Figure 8: Search results for 10 popular queries on Google. Y-axis: number of near-duplicate documents (70% similar).]
[Figure 9: Search results for 10 popular queries on Yahoo Search. Y-axis: number of near-duplicate documents (70% similar).]
[Figure 10: Search results for 10 popular queries on MSN Search. Y-axis: number of near-duplicate documents (70% similar).]

3.5 Analyzing Response Times
In this section, we analyze the response times for performing similarity comparisons using Bloom filters. The timings include (a) the (offline) computation time to compute the document CDC hashes and generate the Bloom filter, and (b) the (online) matching time to determine similarity using bitwise AND on Bloom filters and the time for insertions and unions in a union-find data structure for clustering.

[Table 1: CDC hash computation time for different files and expected chunk sizes. Columns: file size; time (ms) for expected chunk sizes of 256 bytes, 512 bytes, 2 KB, and 8 KB.]
[Table 2: Time (ms) for Bloom filter generation for different document sizes (expected chunk size 256 bytes). Columns: document size; # of chunks (n); time (ms) for k = 2, k = 4, and k = 8.]
[Table 3: Time (microseconds) for computing the bitwise AND of Bloom filters for different sizes. Columns: Bloom filter size (bits); time (µsec).]
[Table 4: Matching and clustering time (in ms). Columns: search query ("emacs manual", "ohio court battle", "hawking black hole bet"); no. of results.]

Table 1 shows the CDC hash computation times for a complete document (of size 1 KB, 1 KB, 1 MB, 1 MB) for different expected chunk sizes (256 bytes, 512 bytes, 2 KB, 8 KB). The Bloom filter generation times are shown in Table 2 for different values (2, 4, 8) of the number of hash functions (k) and different numbers of chunks (n). Although the Bloom filter generation times appear high relative to the CDC times, this is more an artifact of the implementation of the Bloom filter code in Perl instead of C and not due to any inherent complexity in the Bloom filter code. A preliminary implementation in C reduced the Bloom filter generation time by an order of magnitude.

For the matching time overhead, Table 3 shows the pairwise matching time for two Bloom filters for different filter sizes ranging from 1 bits to 5 bits. The overall matching and clustering time for different query requests is shown in Table 4. Overall, using untuned Perl and C code, clustering 80 results each of size 1 KB for the "emacs manual" query would take around 80 × 0.3 ms + 80 × 14 ms + 66 ms = 1210 ms. However, the Bloom filters can be computed and stored a priori, reducing the time to 66 ms.

4. RELATED WORK
The problem of near-duplicate detection consists of two major components: (a) extracting document representations, aka features (e.g., shingles using Rabin fingerprints [18], super-shingles [11], super-fingerprints [13]), and (b) computing the similarity between the feature sets. As discussed in Section 2, many variants have been proposed that use different techniques on how the shingle fingerprints are sampled (e.g., min-hashing, Mod_m, Min_s) and matched [7, 6, 5]. Google's patent for near-duplicate detection uses another shingling variant to compute fingerprints from the shingles [17]. Our similarity detection algorithm uses CDC [15] for computing document features and then applies Bloom filters for similarity testing. In contrast to existing approaches, our technique is simple to implement, incurs only about 0.4% extra bytes per document, and performs faster matching using only bit-wise AND operations.
Bloom filters have been proposed to estimate the cardinality of set intersection in [8] but have not been applied to near-duplicate elimination in web search. We recently learned about Bloom filter replacements [16], which we will explore in the future.

Page and site similarity has been extensively studied for web data in various contexts, from syntactic clustering of web data [7] and its applications for filtering near duplicates in search engines [6] to storage space and bandwidth reduction for web crawlers and search engines. In [9], replica identification was also proposed for organizing web search results. Fetterly et al. examined the amount of textual change in individual web pages over time in the PageTurner study [12] and later investigated the temporal evolution of clusters of near-duplicate pages [11]. Bharat and Broder investigated the problem of identifying mirrored host pairs on the web [3]. Dasu et al. used min-hashing and sketches to identify fields having similar values in database tables [10].

5. CONCLUSIONS

In this paper, we applied a Bloom filter based similarity detection technique to refine the search results presented to the user. Bloom filters compactly represent the entire document and can be used for quick matching. We demonstrated that a number of the results of popular and random search queries retrieved from different search engines (Google, Yahoo, MSN) are similar and can be eliminated or re-organized.

6. ACKNOWLEDGMENTS

We thank Rezaul Chowdhury, Vijaya Ramachandran, Sridhar Rajagopalan, Madhukar Korupolu, and the anonymous reviewers for giving us valuable comments.

7. REFERENCES

[1] Google web APIs (beta).
[2] Yahoo results getting more similar to Google. http://www.searchenginejournal.com/index.php?p=584&c=1.
[3] K. Bharat and A. Broder. Mirror, mirror on the web: a study of host pairs with replicated content. Comput. Networks, 31(11-16).
[4] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7), 1970.
[5] A. Z. Broder. On the resemblance and containment of documents. In SEQUENCES.
[6] A. Z. Broder. Identifying and filtering near-duplicate documents. In COM, pages 1-10, 2000.
[7] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In WWW, 1997.
[8] A. Z. Broder and M. Mitzenmacher. Network applications of Bloom filters: A survey. In Allerton, 2002.
[9] J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated web collections. SIGMOD Rec., 2000.
[10] T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In SIGMOD, 2002.
[11] D. Fetterly, M. Manasse, and M. Najork. On the evolution of clusters of near-duplicate web pages. In LA-WEB, 2003.
[12] D. Fetterly, M. Manasse, M. Najork, and J. Wiener. A large-scale study of the evolution of web pages. In WWW, 2003.
[13] P. Kulkarni, F. Douglis, J. D. LaVoie, and J. M. Tracey. Redundancy elimination within large collections of files. In USENIX Annual Technical Conference, General Track, pages 59-72, 2004.
[14] J. C. Mogul, Y.-M. Chan, and T. Kelly. Design, implementation, and evaluation of duplicate transfer detection in HTTP. In NSDI, pages 43-56, 2004.
[15] A. Muthitacharoen, B. Chen, and D. Mazieres. A low-bandwidth network file system. In SOSP, 2001.
[16] R. Pagh, A. Pagh, and S. S. Rao. An optimal Bloom filter replacement. In SODA, 2005.
[17] W. Pugh and M. Henzinger. Detecting duplicate and near-duplicate files, US Patent #
[18] M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Harvard University.
[19] L. Ramaswamy, A. Iyengar, L. Liu, and F. Douglis. Automatic detection of fragments in dynamically generated web pages. In WWW, 2004.