Using Bloom Filters to Refine Web Search Results

Navendu Jain
Department of Computer Sciences, University of Texas at Austin, Austin, TX

Mike Dahlin
Department of Computer Sciences, University of Texas at Austin, Austin, TX

Renu Tewari
IBM Almaden Research Center, 650 Harry Road, San Jose, CA

ABSTRACT

Search engines have primarily focused on presenting the most relevant pages to the user quickly. A less well explored aspect of improving the search experience is to remove or group all near-duplicate documents in the results presented to the user. In this paper, we apply a Bloom filter based similarity detection technique to address this issue by refining the search results presented to the user. First, we present and analyze our technique for finding similar documents using content-defined chunking and Bloom filters, and demonstrate its effectiveness in compactly representing and quickly matching pages for similarity testing. Later, we demonstrate how a number of results of popular and random search queries retrieved from different search engines (Google, Yahoo, MSN) are similar and can be eliminated or re-organized.

1. INTRODUCTION

Enterprise and web search has become a ubiquitous part of the web experience. Numerous studies have shown that the ad-hoc distribution of information on the web has resulted in a high degree of content aliasing (i.e., the same data contained in pages from different URLs) [14], which adversely affects the performance of search engines [6]. The initial study by Broder et al. in 1997 [7], and the later one by Fetterly et al. [11], show that around 29.2% of data is common across pages in a sample of 15 million pages. This common data, when presented to the user on a search query, degrades the user experience by repeating the same information on every click. Similar data can be grouped or eliminated to improve the search experience. Similarity-based grouping is also useful for organizing the results presented by meta-crawlers (e.g., vivisimo, metacrawler, dogpile, copernic).
The findings by searchenginejournal.com [2] show a significant overlap of search results returned by the Google and Yahoo search engines: the top 2 keyword searches from Google had about 4% identical or similar pages to the Yahoo results. Sometimes search results may appear different purely due to the restructuring and reformatting of data. For example, one site may format a document into multiple web pages, with the top-level page containing only a fraction of the document along with a "next" link to follow to the remaining part, while another site may have the entire document in the same web page. An effective similarity detection technique should find these contained documents and label them as similar.

Although improving search results by identifying near-duplicates had been proposed for Altavista [6], we found that popular search engines (Google, Yahoo, MSN) even today have a significant fraction of near-duplicates in their top results¹. For example, consider the results of the query "emacs manual" using the Google search engine. We focus on the top 20 results (i.e., the first 2 pages) as they represent the results most likely to be viewed by the user. Four of the results, toc.html, toc.html, www. dc.urkuak.fi/docs/gnu/emacs/emacs_toc.html, and chapter/emacs_toc.html, on the first page (top-10 results), were highly similar; in fact, they had nearly identical content but different page headers, disclaimers, and logo images. For this particular query, on the whole, 7 out of 20 documents were redundant (3 identical pairs and 4 similar to one top page document). Similar results were found using the Yahoo, MSN², and A9³ search engines.

This work was supported in part by the Texas Advanced Technology Program, the National Science Foundation (CNS ), and an IBM Faculty Partnership Award. This work was done during an internship at IBM Almaden.

Copyright is held by the author/owner(s). Eighth International Workshop on the Web and Databases (WebDB 2005), June 16-17, 2005, Baltimore, Maryland.
In this paper, we study the current state of popular search engines and evaluate the application of a Bloom filter based near-duplicate detection technique on search results. We demonstrate, using multiple search engines, how a number of results (ranging from 7% to 60%) on search queries are similar and can be eliminated or re-organized. Later, we explore the use of Bloom filters for finding similar objects and demonstrate their effectiveness in compactly representing and quickly matching pages for similarity testing. Although Bloom filters have been extensively used for set membership checks, they have not been analyzed for similarity detection between text documents. Finally, we apply our Bloom filter based technique to effectively remove similar search results and improve user experience.

Our evaluation of search results shows that the occurrence of near-duplicates is strongly correlated to: i) the relevance of the document and ii) the popularity of the query. Documents that are considered more relevant and have a higher rank also have more near-duplicates compared to less relevant documents. Similarly, results from the more popular queries have more near-duplicates compared to the less popular ones.

Our similarity matcher can be deployed as a filter over any search engine's result set. Integrating our similarity detection algorithm with a search engine adds only about 0.4% extra bytes per document and provides fast matching on the order of milliseconds, as described later in Section 3. Note that we focus on one main aspect of similarity: text content. This might not completely capture the human-judgement notion of similarity in all cases. However, our technique can be easily extended to include link structure based similarity measures by comparing Bloom filters generated from hyperlinks embedded in web pages.

The rest of the paper is organized as follows. Similarity detection using Bloom filters is described and analyzed in Section 2. Section 3 evaluates and compares our similarity technique to improve search results from multiple engines and for different workloads. Finally, Section 4 covers related work and we conclude with Section 5.

¹ Google does have a patent [17] for near-duplicate detection, although it is not clear which approach they use.
² Results for a recently popular query, "ohio court battle", from both Google and MSN search had a similar behavior, with 1 and 4 out of the top 20 results being identical, respectively.
³ A9 states that it uses a Google back-end for part of its search.

2. SIMILARITY DETECTION USING BLOOM FILTERS

Our similarity detection algorithm proceeds in three steps as follows. First, we use content-defined chunking (CDC) to extract document features that are resilient to modifications. Second, we use these features as set elements for generating Bloom filters⁴. Third, we compare the Bloom filters to detect near-duplicate documents above a certain similarity threshold (say 70%). We start with an overview of Bloom filters and CDCs, and later present and analyze the similarity detection technique for refining web search results.

2.1 Bloom Filter Overview

A Bloom filter of a set $U$ is implemented as an array of $m$ bits [4]. Each element $u$ ($u \in U$) of the set is hashed using $k$ independent hash functions $h_1, \ldots, h_k$. Each hash function $h_i(u)$ for $1 \le i \le k$ maps to one bit in the array $\{1 \ldots m\}$. Thus, when an element is added to the set, it sets $k$ bits, each bit corresponding to a hash function, in the Bloom filter array to 1. If a bit was already set it stays 1.
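The insertion and membership-check behavior described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the class name `BloomFilter`, the use of salted SHA-1 to simulate $k$ independent hash functions, and the sample values of $m$ and $k$ are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Array of m bits with k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = 0  # a Python int serves as the m-bit array

    def _positions(self, element: bytes):
        # Assumption: k "independent" hashes are simulated by salting SHA-1
        # with the hash index; the paper does not specify its hash functions.
        for i in range(self.k):
            digest = hashlib.sha1(bytes([i]) + element).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, element: bytes) -> None:
        for pos in self._positions(element):
            self.bits |= 1 << pos  # a bit that is already 1 stays 1

    def might_contain(self, element: bytes) -> bool:
        # True for every added element; may also be True for an element
        # that was never added (a false positive).
        return all((self.bits >> pos) & 1 for pos in self._positions(element))


bf = BloomFilter(m=2048, k=4)
for chunk in (b"alpha", b"beta", b"gamma"):
    bf.add(chunk)
```

Storing the bit array in a single integer keeps the later bit-wise AND comparison a one-line operation; a production filter would more likely use a fixed-size byte array.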
For set membership checks, Bloom filters may yield a false positive, where it may appear that an element $v$ is in $U$ even though it is not. From the analysis in [8], given $n = |U|$ and the Bloom filter size $m$, the optimal value of $k$ that minimizes the false positive probability, $p^k$, where $p$ denotes the probability that a given bit is set in the Bloom filter, is $k = \frac{m}{n} \ln 2$. Previously, Bloom filters have primarily been used for finding set-membership [8].

2.2 Content-defined Chunking Overview

To compute the Bloom filter of a document, we first need to split it into a set of elements. Observe that splitting a document using a fixed block size makes it very susceptible to modifications, thereby making it useless for similarity comparison. For effective similarity detection, we need a mechanism that is more resilient to changes in the document. CDC splits a document into variable-sized blocks whose boundaries are determined by its Rabin fingerprint matching a predetermined marker value [18]. The number of bits in the Rabin fingerprint that are used to match the marker determines the expected chunk size. For example, given a marker 0x78 and an expected chunk size of $2^k$, a rolling (overlapping sequence) 48-byte fingerprint is computed. If the lower $k$ bits of the fingerprint equal 0x78, a new chunk boundary is set. Since the chunk boundaries are content-based, any modifications should affect only a couple of neighboring chunks and not the entire document. CDC has been used in LBFS [15], REBL [13] and other systems for redundancy elimination.

⁴ Within a search engine context, the CDCs and the Bloom filters of the documents can be computed offline and stored.

2.3 Bloom Filters for Similarity Testing

Observe that we can view each document to be a set in Bloom filter parlance whose elements are the CDCs that it is composed of⁵. Given that Bloom filters compactly represent a set, they can also be used to approximately match two sets.
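The content-defined chunking of Section 2.2 can be sketched as below. This is a hedged illustration only: a simple polynomial rolling hash stands in for the Rabin fingerprint, and the names `chunk_boundaries` and `chunks`, the constants `BASE` and `MOD`, and the sample data are assumptions for the example. Only the 48-byte window, the marker 0x78, and the 64 KB maximum chunk size come from the text.

```python
import random

WINDOW = 48          # bytes in the rolling window, as in the text
MARKER = 0x78        # marker value from the text
BASE = 257           # assumed base of the stand-in polynomial hash
MOD = (1 << 61) - 1  # assumed modulus of the stand-in polynomial hash


def chunk_boundaries(data: bytes, k_bits: int = 8, max_chunk: int = 64 * 1024):
    """Return end offsets of chunks with expected size 2**k_bits bytes."""
    mask = (1 << k_bits) - 1
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    boundaries, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # shift in the new byte
        at_marker = i + 1 >= WINDOW and (h & mask) == MARKER
        if at_marker or i + 1 - start >= max_chunk:
            boundaries.append(i + 1)
            start = i + 1
    if start < len(data):
        boundaries.append(len(data))  # trailing partial chunk
    return boundaries


def chunks(data: bytes, **kw):
    """Split data at the content-defined boundaries."""
    out, prev = [], 0
    for end in chunk_boundaries(data, **kw):
        out.append(data[prev:end])
        prev = end
    return out


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20000))
edited = b"XYZ" + original  # a small insertion at the front
```

Because each boundary depends only on the 48 bytes preceding it, the insertion in `edited` changes the first chunk or two, while the remaining chunks are byte-for-byte identical to those of `original`.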
Bloom filters, however, cannot be used for exact matching as they have a finite false-match probability, but they are naturally suited for similarity matching. For finding similar documents, we compare the Bloom filter of one with that of the other. If the two documents share a large number of 1's (bit-wise AND), they are marked as similar. In this case, the bit-wise AND can also be perceived as the dot product of the two bit vectors. If the set bits in the Bloom filter of a document are a complete subset of those of another filter, then it is highly probable that the document is included in the other.

Web pages are typically composed of fragments, either static ones (e.g., logo images) or dynamic ones (e.g., personalized product promotions, local weather) [19]. When targeting pages for a similarity based grouping, the test for similarity should be on the fragment of interest and not the entire page.

Bloom filters, when applied to similarity detection, have several advantages. First, the compactness of Bloom filters is very attractive for storage and transmission whenever we want to minimize the meta-data overheads. Second, Bloom filters enable fast comparison as matching is a bitwise-AND operation. Third, since Bloom filters are a complete representation of a set rather than a deterministic sample (e.g., shingling), they can determine inclusions effectively.

To demonstrate the effectiveness of Bloom filters for similarity detection, consider, for example, the pages from the Money/CNN web server (money.cnn.com). We crawled 13 MB of data from the site that resulted in 1753 documents. We compared the top-level page marsh_ceo/index.html with all the other pages from the site. For each document, we converted it into a canonical representation as described later in Section 3. The CDCs of the pages were computed using an expected and maximum chunk size of 256 bytes and 64 KB, respectively. The corresponding Bloom filter was of size 256 bytes.
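The bit-wise AND comparison described in this section might look like the following sketch. The helper names `bloom_of` and `match_fraction`, the salted SHA-1 hashing, the choice of four hash functions, and the synthetic chunk lists are assumptions for illustration; only the 256-byte (2048-bit) filter size is taken from the text.

```python
import hashlib

M_BITS = 2048  # 256-byte filter, as in the text
K_HASHES = 4   # assumed number of hash functions


def bloom_of(elements, m: int = M_BITS, k: int = K_HASHES) -> int:
    """Build a Bloom filter (as an int bit vector) over a set of chunks."""
    bits = 0
    for element in set(elements):  # duplicates collapse into one element
        for i in range(k):
            digest = hashlib.sha1(bytes([i]) + element).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % m)
    return bits


def match_fraction(source: int, candidate: int) -> float:
    """Fraction of the source filter's 1 bits also set in the candidate."""
    set_in_source = bin(source).count("1")
    if set_in_source == 0:
        return 0.0
    return bin(source & candidate).count("1") / set_in_source


doc_a = [b"chunk-%d" % i for i in range(32)]
doc_b = doc_a[:24] + [b"other-%d" % i for i in range(8)]  # shares 24 chunks
doc_c = doc_a + [b"header", b"footer"]                    # contains all of doc_a
fa, fb, fc = bloom_of(doc_a), bloom_of(doc_b), bloom_of(doc_c)
```

Here `match_fraction(fa, fc)` is exactly 1.0 because every chunk of `doc_a` also appears in `doc_c`, which mirrors the inclusion test above, while `match_fraction(fa, fb)` falls below 1.0 because a quarter of the chunks differ.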
Figure 1 shows that two other copies of the page, one with the URI /24/1/25/news/fortune5/marsh_ceo/index.htm and another with a dynamic URI /24/1/25/news/fortune5/marsh_ceo/index.htm?cnn=yes, matched with all set bits in the Bloom filter of the original document.

As another example, we crawled around 2 MB of data (59 documents) from the IBM web site. We compared the page /investor/corpgovernance/index.phtml with all the other crawled pages from the site. The chunk sizes were chosen as above. Figure 2 shows that two other pages with the URIs /investor/corpgovernance/cgcoi.phtml and /investor/corpgovernance/cgblaws.phtml appeared similar, matching in 53% and 69% of the bits in the Bloom filter, respectively.

To further illustrate that Bloom filters can differentiate between multiple similar documents, we extracted a technical documentation file (say, foo) of size 17 KB incrementally from a CVS archive, generating 20 different versions, with foo being the original, foo.1 being the first version (with a change of 415 bytes from foo) and foo.19 being the last. As shown in Figure 3, the Bloom filter for foo matched the most (98%) with the closest version foo.1.

⁵ For multisets, we make each CDC unique before Bloom filter generation to differentiate multiple copies of the same CDC.

[Figure 1: Comparison of the document marsh_ceo/index.html with all pages from the money.cnn.com web site. x-axis: web documents in the money.cnn.com source tree; y-axis: fraction of 1's matched in the AND outputs. Labeled points: marsh_ceo/index.html, marsh_ceo/index.htm, marsh_ceo/index.htm?cnn=yes.]

Analysis

The main consideration when using Bloom filters for similarity detection is the false match probability of the above algorithm as a function of similarity between the source and a candidate document. Extending the analysis for membership testing in [4] to similarity detection, we proceed to determine the expected number of inferred matches between the two sets. Let $A$ and $B$ be the two sets being compared for similarity. Let $m$ denote the number of bits (size) in the Bloom filter. For simplicity, assume that both sets have the same number of elements. Let $n$ denote the number of elements in both sets $A$ and $B$, i.e., $|A| = |B| = n$. As before, $k$ denotes the number of hash functions. The probability that a bit is set by a hash function $h_i$ for $1 \le i \le k$ is $\frac{1}{m}$. A bit can be set by any of the $k$ hash functions for each of the $n$ elements. Therefore, the probability that a bit is not set by any hash function for any element is $\left(1 - \frac{1}{m}\right)^{nk}$. Thus, the probability, $p$, that a given bit is set in the Bloom filter of $A$ is given by:

$$p = 1 - \left(1 - \frac{1}{m}\right)^{nk} \approx 1 - e^{-nk/m} \qquad (1)$$

For an element to be considered a member of the set, all the corresponding $k$ bits should be set. Thus, the probability of a false match, i.e., an outside element is inferred as being in set $A$, is $p^k$. Let $C$ denote the intersection of sets $A$ and $B$ and $c$ denote its cardinality, i.e., $C = A \cap B$ and $|C| = c$. For similarity comparison, let us take each element in set $B$ and check if it belongs to the Bloom filter of the given set $A$.
We should find that the $c$ common elements will definitely match and a few of the other $(n - c)$ may also match due to the false match probability. By linearity of expectation, the expected number of elements of $B$ inferred to have matched with $A$ is

$$E[\text{\# of inferred matches}] = c + (n - c)\,p^k$$

To minimize the false matches, this expected number should be as close to $c$ as possible. For that, $(n - c)\,p^k$ should be close to 0, i.e., $p^k$ should approach 0. This happens to be the same as minimizing the probability of a false positive. Expanding $p$ and under asymptotic analysis, it reduces to minimizing $\left(1 - e^{-nk/m}\right)^k$. Using the same analysis for minimizing the false positive rate given in [8], the minimum obtained after differentiation is at $k = \frac{m}{n} \ln 2$. Thus, the expected number of inferred matches for this value of $k$ becomes

$$E[\text{\# of inferred matches}] = c + (n - c)(0.6185)^{m/n}$$

[Figure 2: Comparison of the document investor/corpgovernance/index.phtml with pages from the IBM web site. x-axis: web documents in the source tree; y-axis: fraction of 1's matched in the AND outputs. Labeled points: investor/corpgovernance/index.phtml, investor/corpgovernance/cgcoi.phtml, investor/corpgovernance/cgblaws.phtml.]

[Figure 3: Comparison of the original file foo with later versions foo.1, foo.2, ..., foo.19 (CVS repository benchmark). x-axis: foo versions; y-axis: fraction of 1's matched in the AND outputs.]

Thus, the expected number of bits set corresponding to inferred matches is

$$E[\text{\# of matched bits}] = m\left[1 - \left(1 - \frac{1}{m}\right)^{k\left(c + (n - c)(0.6185)^{m/n}\right)}\right] \approx m\left[1 - e^{-\frac{k}{m}\left(c + (n - c)(0.6185)^{m/n}\right)}\right]$$

Under the assumption of perfectly random hash functions, the expected number of total bits set in the Bloom filter of the source set $A$ is $mp$. The ratio, then, of the expected number of matched bits corresponding to inferred matches in $A \cap B$ to the expected total number of bits set in the Bloom filter of $A$ is:

$$\frac{E[\text{\# of matched bits}]}{E[\text{\# total bits set}]} = \frac{1 - e^{-\frac{k}{m}\left(c + (n - c)(0.6185)^{m/n}\right)}}{1 - e^{-nk/m}}$$

Observe that this ratio equals 1 when all the elements match, i.e., $c = n$. If there are no matching elements, i.e., $c = 0$, the ratio is $2\left(1 - (0.5)^{(0.6185)^{m/n}}\right)$. For $m = n$, this evaluates to 0.6973, i.e., 69% of matching bits may be false.
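The ratio above can be checked numerically with a short sketch. The function name `expected_match_ratio` and its argument order are assumptions for illustration; the formula itself is the one derived in this section, evaluated at the optimal $k = \frac{m}{n} \ln 2$.

```python
import math


def expected_match_ratio(n: int, m: int, c: int) -> float:
    """Expected matched bits over expected total bits set, at the optimal k."""
    k = (m / n) * math.log(2)             # optimal number of hash functions
    false_match = 0.6185 ** (m / n)       # p**k evaluated at the optimal k
    inferred = c + (n - c) * false_match  # expected number of inferred matches
    matched_bits = 1 - math.exp(-(k / m) * inferred)
    total_bits = 1 - math.exp(-n * k / m)
    return matched_bits / total_bits


no_overlap = expected_match_ratio(n=100, m=100, c=0)
full_overlap = expected_match_ratio(n=100, m=100, c=100)
```

With $c = 0$ and $m = n$ the function reproduces the 0.6973 figure stated above, and with $c = n$ it returns 1 (up to floating-point rounding), so it is a convenient way to explore how the filter size trades off against falsely matched bits.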
For larger values, $m = 2n, 4n, 8n, 10n, 11n$, the corresponding ratios are 0.4658, 0.1929, 0.0295, 0.0113, and 0.007, respectively. Thus, for $m = 11n$, on average, less than 1% of the bits set may match incorrectly. The expected ratio of matching bits is highly correlated to the expected ratio of matching elements. Thus, if a large fraction of the bits match, then it is highly likely that a large fraction of the elements are common.

2.4 Discussion

Previous work on document similarity has mostly been based on shingling or super fingerprints. Using this method, for each object, all the $k$ consecutive words of a document (called $k$-shingles) are hashed using Rabin fingerprint [18] to create a set of fingerprints (also called features or pre-images). These fingerprints are then sampled to compute a super-fingerprint of the document. Many variants have been proposed that use different techniques on how the shingle fingerprints are sampled (min-hashing, Mod_m, Min_s, etc.) and matched [7, 6, 5]. While Mod_m selects all fingerprints whose value modulo m is zero, Min_s selects the set of s fingerprints with the smallest value. The min-hashing approach further refines the sampling to be the min values of, say, 84 random min-wise independent permutations (or hashes) of the set of all shingle fingerprints. This results in a fixed size sample of 84 fingerprints that is the resulting feature vector. To further simplify matching, these 84 fingerprints can be grouped as 6 "super-shingles" by concatenating 14 adjacent fingerprints [11]. In [13] these are called super-fingerprints. A pair of objects is then considered similar if either all or a large fraction of the values in the super-fingerprints match.

Our Bloom filter based similarity detection differs from the shingling technique in several ways. It should be noted, however, that the variants of shingling discussed above improve upon the original approach, and we provide a comparison of our technique with these variants wherever applicable.
First, shingling (Mod_m, Min_s) computes document similarity using the intersection of the two feature sets. In our approach, it requires only the bit-wise AND of the two Bloom filters (e.g., two 128 bit vectors). Second, shingling has a higher computational overhead as it first segments the document into $k$-word shingles ($k = 5$ in [11]), resulting in a shingle set size of about $S - k + 1$, where $S$ is the document size. Later, it computes the image (value) of each shingle by applying a set (say $H$) of min-wise independent hash functions ($|H| = 84$ as used in [11]) and then, for each function, selecting the shingle corresponding to the minimum image. On the other hand, we apply a set of independent hash functions (typically less than 8) to the chunk set of size on average $\frac{S}{c}$, where $c$ is the expected chunk size (e.g., $c = 256$ bytes for an $S = 8$ KB document). Third, the size of the feature set (number of shingles) depends on the sampling technique in shingling. For example, in Mod_m, even some large documents might have very few features whereas small documents might have zero features. Some shingling variants (e.g., Min_s, Mod_{2^i}) aim to select roughly a constant number of features. Our CDC based approach only varies the chunk size $c$ to determine the number of chunks as a trade-off between performance and fine-grained matching. We leave the empirical comparison with shingling as future work.

In general, a compact Bloom filter is easier to attach as a document tag and can be compared simply by matching the bits. Thus, Bloom filter based matching is more suitable for meta crawlers and can be added on to existing search engines without any significant changes.

3. EXPERIMENTAL EVALUATION

In this section, we evaluate Bloom filter-based similarity detection using several types of query results obtained from querying different search engines using the keywords posted on Google Zeitgeist, Yahoo Buzz (buzz.yahoo.com), and MSN Search Insider.

3.1 Methodology

We have implemented our similarity detection module using C and Perl. The code for content-defined chunking is based on the CDC implementation of LBFS [15].
The experimental testbed used a 933 MHz Intel Pentium III workstation with 512 MB of RAM running Linux kernel. The three commercial search engines used in our evaluation are Google, Yahoo Search, and MSN Search. The Google search results were obtained using the GoogleAPI [1]; for each of the search queries, the API was called to return the top 1000 search results. Although we requested 1000 results, the API, due to some internal errors, always returned fewer than 1000 entries, varying from 481 to 897. For each search result, the document from the corresponding URL was fetched from the original web server to compute its Bloom filter. Each document was converted into a canonical form by removing all the HTML markups and tags, bullets and numberings such as "a.1", extra white space, and colons; replacing dashes, single-quotes and double-quotes with a single space; and converting all the text to lower case to make the comparison case insensitive. In many cases, due to server unavailability, incorrect document links, page not found errors, and network timeouts, the entire set of requested documents could not always be retrieved.

Size of the Bloom Filter

As we discussed in Section 2, the fraction of bits that match incorrectly depends on the size of the Bloom filter. For a 97% accurate match, the number of bits in the Bloom filter should be 8x the number of elements (chunks) in the set (document). When applying CDC to each document, we use the expected chunk size of 256 bytes, while limiting the maximum chunk size to 64 KB. For an average document of size 8 KB, this results in around 32 chunks. The Bloom filter is set to be 8x this value, i.e., 256 bits. To accommodate large documents, we set the maximum document size to 64 KB (corresponding to the maximum chunk size). Therefore, the Bloom filter size is set to be 8x the expected number of chunks (256 for a document size of 64 KB), i.e., 2048 bits or 256 bytes, which is a 3.2% and 0.4% overhead for document sizes of 8 KB and 64 KB, respectively.

Example. When we applied the Bloom filter based matcher to the "emacs manual" query (Section 1), we found that the page chapter/emacs_toc.html matched the other three, emacs_toc.html, toc.html, and toc.html, with 74%, 81% and 95% of the Bloom filter bits matching, respectively. A 70% matching threshold would have identified and grouped all these 4 pages together.

[Figure 4: "emacs manual" query search results (Google). y-axis: percentage of duplicate documents; curves for 50%, 60%, 70%, 80%, and 90% similar.]

3.2 Effect of the Degree of Similarity

In this section, we evaluate how the degree of similarity affects the number of documents that are marked similar. The degree of similarity is the percentage of the document data that matches (e.g., a 100% degree of similarity is an identical document). Intuitively, the higher the degree of similarity, the lower the number of documents that should match. Moreover, the number of documents that are similar depends on the total number of documents retrieved by the query. Although we initially expected a linear behavior, we observed that the higher ranked results (the top 10 to 20 results) were also the ones that were more duplicated.

Using GoogleAPI, we retrieved 493 results for the "emacs manual" query. To determine the number of documents that are similar among the set of retrieved documents, we use a union-find data structure for clustering Bloom filters of the documents based on similarity. Figure 4 shows that for 493 documents retrieved, the number of document clusters was 56, 22, 317, 328, 340 when the degree of similarity was 50, 60, 70, 80, 90%, respectively. Each cluster represents a set of similar documents (or a single document if no similar ones are found). We assume that a document belongs to a cluster if it is similar to a document in the cluster, i.e., we assume that similarity is transitive for high values of the degree of similarity (as in [9]).
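The union-find clustering of Bloom filters described above could be sketched as follows. This is an illustrative assumption of how the pieces fit together, not the authors' Perl/C code: the function names, the rule that a pair is similar when either filter has at least the threshold fraction of its set bits matched, and the tiny 8-bit sample filters are all hypothetical.

```python
def popcount(x: int) -> int:
    return bin(x).count("1")


def similar(a: int, b: int, threshold: float) -> bool:
    # Assumption: a pair counts as similar if either filter has at least
    # `threshold` of its set bits matched in the bit-wise AND.
    shared = popcount(a & b)
    return any(n and shared / n >= threshold for n in (popcount(a), popcount(b)))


def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x


def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra


def cluster(filters, threshold: float = 0.7):
    """Group filter indices into clusters, treating similarity as transitive."""
    parent = list(range(len(filters)))
    for i in range(len(filters)):
        for j in range(i + 1, len(filters)):
            if similar(filters[i], filters[j], threshold):
                union(parent, i, j)
    groups = {}
    for i in range(len(filters)):
        groups.setdefault(find(parent, i), []).append(i)
    return list(groups.values())


# Toy 8-bit "filters" standing in for per-document Bloom filters.
filters = [0b11110000, 0b11110001, 0b00001111, 0b00001110, 0b10101010]
clusters = cluster(filters)  # [[0, 1], [2, 3], [4]]
```

The number of near-duplicates then follows as the number of documents minus the number of clusters; the pairwise loop is quadratic in the number of results, which is acceptable for result sets of a few hundred documents.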
The fraction of duplicate documents, as shown in Figure 4, decreases from 88% to 31% as the degree of similarity increases from 50% to 90%. As the number of retrieved results increases from 1 to 493, the fraction of duplicate documents initially decreases and then increases, forming a minimum around 25 results. The decrease was due to the larger aliasing of better ranked documents. However, as the number of results increases, the initial set of documents gets repeated more frequently, increasing the number of duplicates. Similar results were obtained for a number of other queries that we evaluated.

3.3 Effect of the Search Query Popularity

To get a representative collection of the types of queries performed on search engines, we selected samples from Google Zeitgeist (Nov. 2004) of three different query popularities: i) Most Popular, ii) Medium-Popular, and iii) Random.

[Figure 5: Near-duplicate results for popular search queries ("jon stewart crossfire", "electoral college", "day of the dead"). y-axis: number of near-duplicate documents (70% similar).]

[Figure 6: Near-duplicate results for medium-popular search queries ("republican national convention", "national hurricane center", "indian larry"). y-axis: number of near-duplicate documents (70% similar).]

[Figure 7: Near-duplicate results for random search queries ("olympics 2004 doping", "hawking black hole bet", "x prize spaceship"). y-axis: number of near-duplicate documents (70% similar).]

For most-popular search queries, the three queries selected in order were "jon stewart crossfire" (TP1), "electoral college" (TP2) and "day of the dead" (TP3). We computed the number of duplicates having 70% similarity (at least 70% of the bits in the filter matched) in the search results. Figure 5 shows the corresponding number of duplicates for a maximum of 87 search results from the Google search API. The TP1 query had the maximum fraction of near-duplicates, 44.3%, while the other two, TP2 and TP3, had 29.7% and 24.3%, respectively. Observe that the most popular query, TP1, was the one with the most duplicates.

For the medium-popular queries, we selected three queries from the list "Google Top 10 Gaining Queries" for the week ending Aug. 30, 2004 on the Google Zeitgeist: "indian larry" (MP1), "national hurricane center" (MP2) and "republican national convention" (MP3). Figure 6 shows the corresponding search results having 70% similarity for a maximum of 88 documents from the Google search engine. The fraction of near-duplicates among 88 search results ranged from 16% for MP1 to 28% for MP2.
For a non-popular query sample, we selected three queries at random: "olympics 2004 doping", "hawking black hole bet", and "x prize spaceship". The Google API retrieved only about 36 results for the first two queries and 32 results for the third query. Figure 7 shows the number of near-duplicate documents in the search results corresponding to the three queries. The fraction of near-duplicates in all these queries was in the same range, around 18%. As we observed earlier, as the popularity of queries decreases, so does the number of duplicate results. The most popular queries had the largest number of near-duplicate results, the medium ones fewer, and the random queries the lowest.

3.4 Behavior of different search engines

The previous experiments all compared the results from the Google search engine. We next evaluate the behavior of all three search engines, Google, Yahoo and MSN search, in returning near-duplicate documents for the 10 popular queries featured on their respective web sites. To our knowledge, Yahoo and MSN search do not provide an API similar to the GoogleAPI for automated retrieval of search results. Therefore, we manually made HTTP requests to the URLs corresponding to the first 5 search results for a query. We plot the minimum, average and maximum number of near-duplicate (at least 70% similar) search results in the 10 popular queries. The three whiskers on each vertical bar in Figures 8, 9, and 10 represent min., avg., and max., in order. Figure 8 shows the results for Google, with the average number of near-duplicates ranging from 7% to 23%. Figure 9 shows near-duplicates in Yahoo results ranging from 12% to 25%. Figure 10 shows the results for MSN, where the near-duplicates range from 18% to 26%. For the earlier "emacs manual" query, MSN had 32% near-duplicates while Yahoo had 22%. These experiments support our hypothesis that current search engines return a significant number of near-duplicates.
However, these results do not in any way suggest that any particular search engine performs better than the others.

3.5 Analyzing Response Times

In this section, we analyze the response times for performing similarity comparisons using Bloom filters. The timings include (a) the (offline) computation time to compute the document CDC hashes and generate the Bloom filter, and (b) the (online) matching time to determine similarity using bitwise AND on Bloom filters and the time for insertions and unions in a union-find data structure for clustering.

[Table 1: CDC hash computation time (ms) for different file sizes and expected chunk sizes (256 bytes, 512 bytes, 2 KB, 8 KB).]

[Table 2: Time (ms) for Bloom filter generation for different document sizes and numbers of chunks (n), with k = 2, 4, 8 hash functions (expected chunk size 256 bytes).]

[Table 3: Time (microseconds) for computing the bitwise AND of Bloom filters for different sizes (bits).]

Table 1 shows the CDC hash computation times for a complete document (of size 1 KB, 1 KB, 1 MB, 1 MB) for different expected chunk sizes (256 bytes, 512 bytes, 2 KB, 8 KB). The Bloom filter generation times are shown in Table 2 for different values (2, 4, 8) of the number of hash functions (k) and different numbers of chunks (n). Although the Bloom filter generation times appear high relative to the CDC times, this is more an artifact of the implementation of the Bloom filter code in Perl instead of C and not due to any inherent complexity in the Bloom filter code. A preliminary implementation in C reduced the Bloom filter generation time by an order of magnitude.

For the matching time overhead, Table 3 shows the pairwise matching time for two Bloom filters for different filter sizes ranging from 1 bits to 5 bits. The overall matching and clustering time for different query requests is shown in Table 4. Overall, using untuned Perl and C code, clustering 80 results each of size 1 KB for the "emacs manual" query would take around 80 × 0.3 ms + 80 × 14 ms + 66 ms = 1210 ms. However, the Bloom filters can be computed and stored a priori, reducing the time to 66 ms.

[Figure 8: Near-duplicate results for 10 popular queries on Google. y-axis: number of near-duplicate documents (70% similar).]

[Figure 9: Near-duplicate results for 10 popular queries on Yahoo Search. y-axis: number of near-duplicate documents (70% similar).]

[Figure 10: Near-duplicate results for 10 popular queries on MSN Search. y-axis: number of near-duplicate documents (70% similar).]

[Table 4: Matching and clustering time (in ms) by number of results for the search queries "emacs manual", "ohio court battle", and "hawking black hole bet".]

4. RELATED WORK

The problem of near-duplicate detection consists of two major components: (a) extracting document representations, also known as features (e.g., shingles using Rabin fingerprints [18], super-shingles [11], super-fingerprints [13]), and (b) computing the similarity between the feature sets. As discussed in Section 2, many variants have been proposed that use different techniques on how the shingle fingerprints are sampled (e.g., min-hashing, Mod_m, Min_s) and matched [7, 6, 5]. Google's patent for near-duplicate detection uses another shingling variant to compute fingerprints from the shingles [17]. Our similarity detection algorithm uses CDC [15] for computing document features and then applies Bloom filters for similarity testing. In contrast to existing approaches, our technique is simple to implement, incurs only about 0.4% extra bytes per document, and performs faster matching using only bit-wise AND operations.
Bloom filters have been proposed to estimate the cardinality of set intersection in [8] but have not been applied to near-duplicate elimination in web search. We recently learned about Bloom filter replacements [16], which we will explore in the future.

Page and site similarity has been extensively studied for web data in various contexts, from syntactic clustering of web data [7] and its applications for filtering near duplicates in search engines [6] to storage space and bandwidth reduction for web crawlers and search engines. In [9], replica identification was also proposed for organizing web search results. Fetterly et al. examined the amount of textual changes in individual web pages over time in the PageTurner study [12] and later investigated the temporal evolution of clusters of near-duplicate pages [11]. Bharat and Broder investigated the problem of identifying mirrored host pairs on the web [3]. Dasu et al. used min hashing and sketches to identify fields having similar values in database tables [10].

5. CONCLUSIONS

In this paper, we applied a Bloom filter based similarity detection technique to refine the search results presented to the user. Bloom filters compactly represent the entire document and can be used for quick matching. We demonstrated how a number of results of popular and random search queries retrieved from different search engines (Google, Yahoo, MSN) are similar and can be eliminated or re-organized.

6. ACKNOWLEDGMENTS

We thank Rezaul Chowdhury, Vijaya Ramachandran, Sridhar Rajagopalan, Madhukar Korupolu, and the anonymous reviewers for giving us valuable comments.

7. REFERENCES

[1] Google web apis (beta).
[2] Yahoo results getting more similar to google. http://www.searchenginejournal.com/index.php?p=584&c=1.
[3] K. Bharat and A. Broder. Mirror, mirror on the web: a study of host pairs with replicated content. Comput. Networks, 31(11-16).
[4] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7), 1970.
[5] A. Z. Broder. On the resemblance and containment of documents. In SEQUENCES.
[6] A. Z. Broder. Identifying and filtering near-duplicate documents. In COM, pages 1-10, 2000.
[7] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In WWW, 1997.
[8] A. Z. Broder and M. Mitzenmacher. Network applications of bloom filters: A survey. In Allerton, 2002.
[9] J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated web collections. SIGMOD Rec., 2000.
[10] T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In SIGMOD, 2002.
[11] D. Fetterly, M. Manasse, and M. Najork. On the evolution of clusters of near-duplicate web pages. In LA-WEB, 2003.
[12] D. Fetterly, M. Manasse, M. Najork, and J. Wiener. A large-scale study of the evolution of web pages. In WWW, 2003.
[13] P. Kulkarni, F. Douglis, J. D. LaVoie, and J. M. Tracey. Redundancy elimination within large collections of files. In USENIX Annual Technical Conference, General Track, pages 59-72, 2004.
[14] J. C. Mogul, Y.-M. Chan, and T. Kelly. Design, implementation, and evaluation of duplicate transfer detection in HTTP. In NSDI, pages 43-56, 2004.
[15] A. Muthitacharoen, B. Chen, and D. Mazieres. A low-bandwidth network file system. In SOSP, 2001.
[16] R. Pagh, A. Pagh, and S. S. Rao. An optimal bloom filter replacement. In SODA, 2005.
[17] W. Pugh and M. Henzinger. Detecting duplicate and near-duplicate files, US Patent #
[18] M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Harvard University.
[19] L. Ramaswamy, A. Iyengar, L. Liu, and F. Douglis. Automatic detection of fragments in dynamically generated web pages. In WWW, 2004.

A framework for performance monitoring, load balancing, adaptive timeouts and quality of service in digital libraries

A framework for performance monitoring, load balancing, adaptive timeouts and quality of service in digital libraries Int J Digit Libr (2000) 3: 9 35 INTERNATIONAL JOURNAL ON Digital Libraries Springer-Verlag 2000 A fraework for perforance onitoring, load balancing, adaptive tieouts and quality of service in digital libraries

More information

An Innovate Dynamic Load Balancing Algorithm Based on Task

An Innovate Dynamic Load Balancing Algorithm Based on Task An Innovate Dynaic Load Balancing Algorith Based on Task Classification Hong-bin Wang,,a, Zhi-yi Fang, b, Guan-nan Qu,*,c, Xiao-dan Ren,d College of Coputer Science and Technology, Jilin University, Changchun

More information

Applying Multiple Neural Networks on Large Scale Data

Applying Multiple Neural Networks on Large Scale Data 0 International Conference on Inforation and Electronics Engineering IPCSIT vol6 (0) (0) IACSIT Press, Singapore Applying Multiple Neural Networks on Large Scale Data Kritsanatt Boonkiatpong and Sukree

More information

Information Processing Letters

Information Processing Letters Inforation Processing Letters 111 2011) 178 183 Contents lists available at ScienceDirect Inforation Processing Letters www.elsevier.co/locate/ipl Offline file assignents for online load balancing Paul

More information

Online Bagging and Boosting

Online Bagging and Boosting Abstract Bagging and boosting are two of the ost well-known enseble learning ethods due to their theoretical perforance guarantees and strong experiental results. However, these algoriths have been used

More information

Analyzing Spatiotemporal Characteristics of Education Network Traffic with Flexible Multiscale Entropy

Analyzing Spatiotemporal Characteristics of Education Network Traffic with Flexible Multiscale Entropy Vol. 9, No. 5 (2016), pp.303-312 http://dx.doi.org/10.14257/ijgdc.2016.9.5.26 Analyzing Spatioteporal Characteristics of Education Network Traffic with Flexible Multiscale Entropy Chen Yang, Renjie Zhou

More information

Searching strategy for multi-target discovery in wireless networks

Searching strategy for multi-target discovery in wireless networks Searching strategy for ulti-target discovery in wireless networks Zhao Cheng, Wendi B. Heinzelan Departent of Electrical and Coputer Engineering University of Rochester Rochester, NY 467 (585) 75-{878,

More information

Preference-based Search and Multi-criteria Optimization

Preference-based Search and Multi-criteria Optimization Fro: AAAI-02 Proceedings. Copyright 2002, AAAI (www.aaai.org). All rights reserved. Preference-based Search and Multi-criteria Optiization Ulrich Junker ILOG 1681, route des Dolines F-06560 Valbonne ujunker@ilog.fr

More information

INTEGRATED ENVIRONMENT FOR STORING AND HANDLING INFORMATION IN TASKS OF INDUCTIVE MODELLING FOR BUSINESS INTELLIGENCE SYSTEMS

INTEGRATED ENVIRONMENT FOR STORING AND HANDLING INFORMATION IN TASKS OF INDUCTIVE MODELLING FOR BUSINESS INTELLIGENCE SYSTEMS Artificial Intelligence Methods and Techniques for Business and Engineering Applications 210 INTEGRATED ENVIRONMENT FOR STORING AND HANDLING INFORMATION IN TASKS OF INDUCTIVE MODELLING FOR BUSINESS INTELLIGENCE

More information

Real Time Target Tracking with Binary Sensor Networks and Parallel Computing

Real Time Target Tracking with Binary Sensor Networks and Parallel Computing Real Tie Target Tracking with Binary Sensor Networks and Parallel Coputing Hong Lin, John Rushing, Sara J. Graves, Steve Tanner, and Evans Criswell Abstract A parallel real tie data fusion and target tracking

More information

An Approach to Combating Free-riding in Peer-to-Peer Networks

An Approach to Combating Free-riding in Peer-to-Peer Networks An Approach to Cobating Free-riding in Peer-to-Peer Networks Victor Ponce, Jie Wu, and Xiuqi Li Departent of Coputer Science and Engineering Florida Atlantic University Boca Raton, FL 33431 April 7, 2008

More information

Software Quality Characteristics Tested For Mobile Application Development

Software Quality Characteristics Tested For Mobile Application Development Thesis no: MGSE-2015-02 Software Quality Characteristics Tested For Mobile Application Developent Literature Review and Epirical Survey WALEED ANWAR Faculty of Coputing Blekinge Institute of Technology

More information

A Fast Algorithm for Online Placement and Reorganization of Replicated Data

A Fast Algorithm for Online Placement and Reorganization of Replicated Data A Fast Algorith for Online Placeent and Reorganization of Replicated Data R. J. Honicky Storage Systes Research Center University of California, Santa Cruz Ethan L. Miller Storage Systes Research Center

More information

Dynamic Placement for Clustered Web Applications

Dynamic Placement for Clustered Web Applications Dynaic laceent for Clustered Web Applications A. Karve, T. Kibrel, G. acifici, M. Spreitzer, M. Steinder, M. Sviridenko, and A. Tantawi IBM T.J. Watson Research Center {karve,kibrel,giovanni,spreitz,steinder,sviri,tantawi}@us.ib.co

More information

Extended-Horizon Analysis of Pressure Sensitivities for Leak Detection in Water Distribution Networks: Application to the Barcelona Network

Extended-Horizon Analysis of Pressure Sensitivities for Leak Detection in Water Distribution Networks: Application to the Barcelona Network 2013 European Control Conference (ECC) July 17-19, 2013, Zürich, Switzerland. Extended-Horizon Analysis of Pressure Sensitivities for Leak Detection in Water Distribution Networks: Application to the Barcelona

More information

Energy Proportionality for Disk Storage Using Replication

Energy Proportionality for Disk Storage Using Replication Energy Proportionality for Disk Storage Using Replication Jinoh Ki and Doron Rote Lawrence Berkeley National Laboratory University of California, Berkeley, CA 94720 {jinohki,d rote}@lbl.gov Abstract Energy

More information

Entity Search Engine: Towards Agile Best-Effort Information Integration over the Web

Entity Search Engine: Towards Agile Best-Effort Information Integration over the Web Entity Search Engine: Towards Agile Best-Effort Inforation Integration over the Web Tao Cheng, Kevin Chen-Chuan Chang University of Illinois at Urbana-Chapaign {tcheng3, kcchang}@cs.uiuc.edu. INTRODUCTION

More information

Approximately-Perfect Hashing: Improving Network Throughput through Efficient Off-chip Routing Table Lookup

Approximately-Perfect Hashing: Improving Network Throughput through Efficient Off-chip Routing Table Lookup Approxiately-Perfect ing: Iproving Network Throughput through Efficient Off-chip Routing Table Lookup Zhuo Huang, Jih-Kwon Peir, Shigang Chen Departent of Coputer & Inforation Science & Engineering, University

More information

RECURSIVE DYNAMIC PROGRAMMING: HEURISTIC RULES, BOUNDING AND STATE SPACE REDUCTION. Henrik Kure

RECURSIVE DYNAMIC PROGRAMMING: HEURISTIC RULES, BOUNDING AND STATE SPACE REDUCTION. Henrik Kure RECURSIVE DYNAMIC PROGRAMMING: HEURISTIC RULES, BOUNDING AND STATE SPACE REDUCTION Henrik Kure Dina, Danish Inforatics Network In the Agricultural Sciences Royal Veterinary and Agricultural University

More information

Cooperative Caching for Adaptive Bit Rate Streaming in Content Delivery Networks

Cooperative Caching for Adaptive Bit Rate Streaming in Content Delivery Networks Cooperative Caching for Adaptive Bit Rate Streaing in Content Delivery Networs Phuong Luu Vo Departent of Coputer Science and Engineering, International University - VNUHCM, Vietna vtlphuong@hciu.edu.vn

More information

Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2

Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2 Exploiting Hardware Heterogeneity within the Sae Instance Type of Aazon EC2 Zhonghong Ou, Hao Zhuang, Jukka K. Nurinen, Antti Ylä-Jääski, Pan Hui Aalto University, Finland; Deutsch Teleko Laboratories,

More information

Partitioned Elias-Fano Indexes

Partitioned Elias-Fano Indexes Partitioned Elias-ano Indexes Giuseppe Ottaviano ISTI-CNR, Pisa giuseppe.ottaviano@isti.cnr.it Rossano Venturini Dept. of Coputer Science, University of Pisa rossano@di.unipi.it ABSTRACT The Elias-ano

More information

Local Area Network Management

Local Area Network Management Technology Guidelines for School Coputer-based Technologies Local Area Network Manageent Local Area Network Manageent Introduction This docuent discusses the tasks associated with anageent of Local Area

More information

Fuzzy Sets in HR Management

Fuzzy Sets in HR Management Acta Polytechnica Hungarica Vol. 8, No. 3, 2011 Fuzzy Sets in HR Manageent Blanka Zeková AXIOM SW, s.r.o., 760 01 Zlín, Czech Republic blanka.zekova@sezna.cz Jana Talašová Faculty of Science, Palacký Univerzity,

More information

Audio Engineering Society. Convention Paper. Presented at the 119th Convention 2005 October 7 10 New York, New York USA

Audio Engineering Society. Convention Paper. Presented at the 119th Convention 2005 October 7 10 New York, New York USA Audio Engineering Society Convention Paper Presented at the 119th Convention 2005 October 7 10 New York, New York USA This convention paper has been reproduced fro the authors advance anuscript, without

More information

Machine Learning Applications in Grid Computing

Machine Learning Applications in Grid Computing Machine Learning Applications in Grid Coputing George Cybenko, Guofei Jiang and Daniel Bilar Thayer School of Engineering Dartouth College Hanover, NH 03755, USA gvc@dartouth.edu, guofei.jiang@dartouth.edu

More information

6. Time (or Space) Series Analysis

6. Time (or Space) Series Analysis ATM 55 otes: Tie Series Analysis - Section 6a Page 8 6. Tie (or Space) Series Analysis In this chapter we will consider soe coon aspects of tie series analysis including autocorrelation, statistical prediction,

More information

The Research of Measuring Approach and Energy Efficiency for Hadoop Periodic Jobs

The Research of Measuring Approach and Energy Efficiency for Hadoop Periodic Jobs Send Orders for Reprints to reprints@benthascience.ae 206 The Open Fuels & Energy Science Journal, 2015, 8, 206-210 Open Access The Research of Measuring Approach and Energy Efficiency for Hadoop Periodic

More information

Managing Complex Network Operation with Predictive Analytics

Managing Complex Network Operation with Predictive Analytics Managing Coplex Network Operation with Predictive Analytics Zhenyu Huang, Pak Chung Wong, Patrick Mackey, Yousu Chen, Jian Ma, Kevin Schneider, and Frank L. Greitzer Pacific Northwest National Laboratory

More information

Data Set Generation for Rectangular Placement Problems

Data Set Generation for Rectangular Placement Problems Data Set Generation for Rectangular Placeent Probles Christine L. Valenzuela (Muford) Pearl Y. Wang School of Coputer Science & Inforatics Departent of Coputer Science MS 4A5 Cardiff University George

More information

Media Adaptation Framework in Biofeedback System for Stroke Patient Rehabilitation

Media Adaptation Framework in Biofeedback System for Stroke Patient Rehabilitation Media Adaptation Fraework in Biofeedback Syste for Stroke Patient Rehabilitation Yinpeng Chen, Weiwei Xu, Hari Sundara, Thanassis Rikakis, Sheng-Min Liu Arts, Media and Engineering Progra Arizona State

More information

Modeling Parallel Applications Performance on Heterogeneous Systems

Modeling Parallel Applications Performance on Heterogeneous Systems Modeling Parallel Applications Perforance on Heterogeneous Systes Jaeela Al-Jaroodi, Nader Mohaed, Hong Jiang and David Swanson Departent of Coputer Science and Engineering University of Nebraska Lincoln

More information

This paper studies a rental firm that offers reusable products to price- and quality-of-service sensitive

This paper studies a rental firm that offers reusable products to price- and quality-of-service sensitive MANUFACTURING & SERVICE OPERATIONS MANAGEMENT Vol., No. 3, Suer 28, pp. 429 447 issn 523-464 eissn 526-5498 8 3 429 infors doi.287/so.7.8 28 INFORMS INFORMS holds copyright to this article and distributed

More information

Calculating the Return on Investment (ROI) for DMSMS Management. The Problem with Cost Avoidance

Calculating the Return on Investment (ROI) for DMSMS Management. The Problem with Cost Avoidance Calculating the Return on nvestent () for DMSMS Manageent Peter Sandborn CALCE, Departent of Mechanical Engineering (31) 45-3167 sandborn@calce.ud.edu www.ene.ud.edu/escml/obsolescence.ht October 28, 21

More information

Standards and Protocols for the Collection and Dissemination of Graduating Student Initial Career Outcomes Information For Undergraduates

Standards and Protocols for the Collection and Dissemination of Graduating Student Initial Career Outcomes Information For Undergraduates National Association of Colleges and Eployers Standards and Protocols for the Collection and Disseination of Graduating Student Initial Career Outcoes Inforation For Undergraduates Developed by the NACE

More information

Evaluating Inventory Management Performance: a Preliminary Desk-Simulation Study Based on IOC Model

Evaluating Inventory Management Performance: a Preliminary Desk-Simulation Study Based on IOC Model Evaluating Inventory Manageent Perforance: a Preliinary Desk-Siulation Study Based on IOC Model Flora Bernardel, Roberto Panizzolo, and Davide Martinazzo Abstract The focus of this study is on preliinary

More information

A quantum secret ballot. Abstract

A quantum secret ballot. Abstract A quantu secret ballot Shahar Dolev and Itaar Pitowsky The Edelstein Center, Levi Building, The Hebrerw University, Givat Ra, Jerusale, Israel Boaz Tair arxiv:quant-ph/060087v 8 Mar 006 Departent of Philosophy

More information

Exercise 4 INVESTIGATION OF THE ONE-DEGREE-OF-FREEDOM SYSTEM

Exercise 4 INVESTIGATION OF THE ONE-DEGREE-OF-FREEDOM SYSTEM Eercise 4 IVESTIGATIO OF THE OE-DEGREE-OF-FREEDOM SYSTEM 1. Ai of the eercise Identification of paraeters of the euation describing a one-degree-of- freedo (1 DOF) atheatical odel of the real vibrating

More information

Generating Certification Authority Authenticated Public Keys in Ad Hoc Networks

Generating Certification Authority Authenticated Public Keys in Ad Hoc Networks SECURITY AND COMMUNICATION NETWORKS Published online in Wiley InterScience (www.interscience.wiley.co). Generating Certification Authority Authenticated Public Keys in Ad Hoc Networks G. Kounga 1, C. J.

More information

Reliability Constrained Packet-sizing for Linear Multi-hop Wireless Networks

Reliability Constrained Packet-sizing for Linear Multi-hop Wireless Networks Reliability Constrained acket-sizing for inear Multi-hop Wireless Networks Ning Wen, and Randall A. Berry Departent of Electrical Engineering and Coputer Science Northwestern University, Evanston, Illinois

More information

ADJUSTING FOR QUALITY CHANGE

ADJUSTING FOR QUALITY CHANGE ADJUSTING FOR QUALITY CHANGE 7 Introduction 7.1 The easureent of changes in the level of consuer prices is coplicated by the appearance and disappearance of new and old goods and services, as well as changes

More information

- 265 - Part C. Property and Casualty Insurance Companies

- 265 - Part C. Property and Casualty Insurance Companies Part C. Property and Casualty Insurance Copanies This Part discusses proposals to curtail favorable tax rules for property and casualty ("P&C") insurance copanies. The syste of reserves for unpaid losses

More information

Data Streaming Algorithms for Estimating Entropy of Network Traffic

Data Streaming Algorithms for Estimating Entropy of Network Traffic Data Streaing Algoriths for Estiating Entropy of Network Traffic Ashwin Lall University of Rochester Vyas Sekar Carnegie Mellon University Mitsunori Ogihara University of Rochester Jun (Ji) Xu Georgia

More information

ASIC Design Project Management Supported by Multi Agent Simulation

ASIC Design Project Management Supported by Multi Agent Simulation ASIC Design Project Manageent Supported by Multi Agent Siulation Jana Blaschke, Christian Sebeke, Wolfgang Rosenstiel Abstract The coplexity of Application Specific Integrated Circuits (ASICs) is continuously

More information

Work Travel and Decision Probling in the Network Marketing World

Work Travel and Decision Probling in the Network Marketing World TRB Paper No. 03-4348 WORK TRAVEL MODE CHOICE MODELING USING DATA MINING: DECISION TREES AND NEURAL NETWORKS Chi Xie Research Assistant Departent of Civil and Environental Engineering University of Massachusetts,

More information

Equivalent Tapped Delay Line Channel Responses with Reduced Taps

Equivalent Tapped Delay Line Channel Responses with Reduced Taps Equivalent Tapped Delay Line Channel Responses with Reduced Taps Shweta Sagari, Wade Trappe, Larry Greenstein {shsagari, trappe, ljg}@winlab.rutgers.edu WINLAB, Rutgers University, North Brunswick, NJ

More information

Markovian inventory policy with application to the paper industry

Markovian inventory policy with application to the paper industry Coputers and Cheical Engineering 26 (2002) 1399 1413 www.elsevier.co/locate/copcheeng Markovian inventory policy with application to the paper industry K. Karen Yin a, *, Hu Liu a,1, Neil E. Johnson b,2

More information

Use of extrapolation to forecast the working capital in the mechanical engineering companies

Use of extrapolation to forecast the working capital in the mechanical engineering companies ECONTECHMOD. AN INTERNATIONAL QUARTERLY JOURNAL 2014. Vol. 1. No. 1. 23 28 Use of extrapolation to forecast the working capital in the echanical engineering copanies A. Cherep, Y. Shvets Departent of finance

More information

arxiv:0805.1434v1 [math.pr] 9 May 2008

arxiv:0805.1434v1 [math.pr] 9 May 2008 Degree-distribution stability of scale-free networs Zhenting Hou, Xiangxing Kong, Dinghua Shi,2, and Guanrong Chen 3 School of Matheatics, Central South University, Changsha 40083, China 2 Departent of

More information

Adaptive Modulation and Coding for Unmanned Aerial Vehicle (UAV) Radio Channel

Adaptive Modulation and Coding for Unmanned Aerial Vehicle (UAV) Radio Channel Recent Advances in Counications Adaptive odulation and Coding for Unanned Aerial Vehicle (UAV) Radio Channel Airhossein Fereidountabar,Gian Carlo Cardarilli, Rocco Fazzolari,Luca Di Nunzio Abstract In

More information

Airline Yield Management with Overbooking, Cancellations, and No-Shows JANAKIRAM SUBRAMANIAN

Airline Yield Management with Overbooking, Cancellations, and No-Shows JANAKIRAM SUBRAMANIAN Airline Yield Manageent with Overbooking, Cancellations, and No-Shows JANAKIRAM SUBRAMANIAN Integral Developent Corporation, 301 University Avenue, Suite 200, Palo Alto, California 94301 SHALER STIDHAM

More information

The Design and Implementation of an Enculturated Web-Based Intelligent Tutoring System

The Design and Implementation of an Enculturated Web-Based Intelligent Tutoring System The Design and Ipleentation of an Enculturated Web-Based Intelligent Tutoring Syste Phaedra Mohaed Departent of Coputing and Inforation Technology The University of the West Indies phaedra.ohaed@gail.co

More information

Efficient Key Management for Secure Group Communications with Bursty Behavior

Efficient Key Management for Secure Group Communications with Bursty Behavior Efficient Key Manageent for Secure Group Counications with Bursty Behavior Xukai Zou, Byrav Raaurthy Departent of Coputer Science and Engineering University of Nebraska-Lincoln Lincoln, NE68588, USA Eail:

More information

AN ALGORITHM FOR REDUCING THE DIMENSION AND SIZE OF A SAMPLE FOR DATA EXPLORATION PROCEDURES

AN ALGORITHM FOR REDUCING THE DIMENSION AND SIZE OF A SAMPLE FOR DATA EXPLORATION PROCEDURES Int. J. Appl. Math. Coput. Sci., 2014, Vol. 24, No. 1, 133 149 DOI: 10.2478/acs-2014-0011 AN ALGORITHM FOR REDUCING THE DIMENSION AND SIZE OF A SAMPLE FOR DATA EXPLORATION PROCEDURES PIOTR KULCZYCKI,,

More information

Evaluating the Effectiveness of Task Overlapping as a Risk Response Strategy in Engineering Projects

Evaluating the Effectiveness of Task Overlapping as a Risk Response Strategy in Engineering Projects Evaluating the Effectiveness of Task Overlapping as a Risk Response Strategy in Engineering Projects Lucas Grèze Robert Pellerin Nathalie Perrier Patrice Leclaire February 2011 CIRRELT-2011-11 Bureaux

More information

Optimal Resource-Constraint Project Scheduling with Overlapping Modes

Optimal Resource-Constraint Project Scheduling with Overlapping Modes Optial Resource-Constraint Proect Scheduling with Overlapping Modes François Berthaut Lucas Grèze Robert Pellerin Nathalie Perrier Adnène Hai February 20 CIRRELT-20-09 Bureaux de Montréal : Bureaux de

More information

SAMPLING METHODS LEARNING OBJECTIVES

SAMPLING METHODS LEARNING OBJECTIVES 6 SAMPLING METHODS 6 Using Statistics 6-6 2 Nonprobability Sapling and Bias 6-6 Stratified Rando Sapling 6-2 6 4 Cluster Sapling 6-4 6 5 Systeatic Sapling 6-9 6 6 Nonresponse 6-2 6 7 Suary and Review of

More information

An improved TF-IDF approach for text classification *

An improved TF-IDF approach for text classification * Zhang et al. / J Zheiang Univ SCI 2005 6A(1:49-55 49 Journal of Zheiang University SCIECE ISS 1009-3095 http://www.zu.edu.cn/zus E-ail: zus@zu.edu.cn An iproved TF-IDF approach for text classification

More information

SUPPORTING YOUR HIPAA COMPLIANCE EFFORTS

SUPPORTING YOUR HIPAA COMPLIANCE EFFORTS WHITE PAPER SUPPORTING YOUR HIPAA COMPLIANCE EFFORTS Quanti Solutions. Advancing HIM through Innovation HEALTHCARE SUPPORTING YOUR HIPAA COMPLIANCE EFFORTS Quanti Solutions. Advancing HIM through Innovation

More information

Implementation of Active Queue Management in a Combined Input and Output Queued Switch

Implementation of Active Queue Management in a Combined Input and Output Queued Switch pleentation of Active Queue Manageent in a obined nput and Output Queued Switch Bartek Wydrowski and Moshe Zukeran AR Special Research entre for Ultra-Broadband nforation Networks, EEE Departent, The University

More information

The AGA Evaluating Model of Customer Loyalty Based on E-commerce Environment

The AGA Evaluating Model of Customer Loyalty Based on E-commerce Environment 6 JOURNAL OF SOFTWARE, VOL. 4, NO. 3, MAY 009 The AGA Evaluating Model of Custoer Loyalty Based on E-coerce Environent Shaoei Yang Econoics and Manageent Departent, North China Electric Power University,

More information

A Scalable Application Placement Controller for Enterprise Data Centers

A Scalable Application Placement Controller for Enterprise Data Centers W WWW 7 / Track: Perforance and Scalability A Scalable Application Placeent Controller for Enterprise Data Centers Chunqiang Tang, Malgorzata Steinder, Michael Spreitzer, and Giovanni Pacifici IBM T.J.

More information

On Computing Nearest Neighbors with Applications to Decoding of Binary Linear Codes

On Computing Nearest Neighbors with Applications to Decoding of Binary Linear Codes On Coputing Nearest Neighbors with Applications to Decoding of Binary Linear Codes Alexander May and Ilya Ozerov Horst Görtz Institute for IT-Security Ruhr-University Bochu, Gerany Faculty of Matheatics

More information

Method of supply chain optimization in E-commerce

Method of supply chain optimization in E-commerce MPRA Munich Personal RePEc Archive Method of supply chain optiization in E-coerce Petr Suchánek and Robert Bucki Silesian University - School of Business Adinistration, The College of Inforatics and Manageent

More information

Energy Efficient VM Scheduling for Cloud Data Centers: Exact allocation and migration algorithms

Energy Efficient VM Scheduling for Cloud Data Centers: Exact allocation and migration algorithms Energy Efficient VM Scheduling for Cloud Data Centers: Exact allocation and igration algoriths Chaia Ghribi, Makhlouf Hadji and Djaal Zeghlache Institut Mines-Téléco, Téléco SudParis UMR CNRS 5157 9, Rue

More information

Factored Models for Probabilistic Modal Logic

Factored Models for Probabilistic Modal Logic Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008 Factored Models for Probabilistic Modal Logic Afsaneh Shirazi and Eyal Air Coputer Science Departent, University of Illinois

More information

The Velocities of Gas Molecules

The Velocities of Gas Molecules he Velocities of Gas Molecules by Flick Colean Departent of Cheistry Wellesley College Wellesley MA 8 Copyright Flick Colean 996 All rights reserved You are welcoe to use this docuent in your own classes

More information

CRM FACTORS ASSESSMENT USING ANALYTIC HIERARCHY PROCESS

CRM FACTORS ASSESSMENT USING ANALYTIC HIERARCHY PROCESS 641 CRM FACTORS ASSESSMENT USING ANALYTIC HIERARCHY PROCESS Marketa Zajarosova 1* *Ph.D. VSB - Technical University of Ostrava, THE CZECH REPUBLIC arketa.zajarosova@vsb.cz Abstract Custoer relationship

More information

The Fundamentals of Modal Testing

The Fundamentals of Modal Testing The Fundaentals of Modal Testing Application Note 243-3 Η(ω) = Σ n r=1 φ φ i j / 2 2 2 2 ( ω n - ω ) + (2ξωωn) Preface Modal analysis is defined as the study of the dynaic characteristics of a echanical

More information

Protecting Small Keys in Authentication Protocols for Wireless Sensor Networks

Protecting Small Keys in Authentication Protocols for Wireless Sensor Networks Protecting Sall Keys in Authentication Protocols for Wireless Sensor Networks Kalvinder Singh Australia Developent Laboratory, IBM and School of Inforation and Counication Technology, Griffith University

More information

Red Hat Enterprise Linux: Creating a Scalable Open Source Storage Infrastructure

Red Hat Enterprise Linux: Creating a Scalable Open Source Storage Infrastructure Red Hat Enterprise Linux: Creating a Scalable Open Source Storage Infrastructure By Alan Radding and Nick Carr Abstract This paper discusses the issues related to storage design and anageent when an IT

More information

ESTIMATING LIQUIDITY PREMIA IN THE SPANISH GOVERNMENT SECURITIES MARKET
