Using Bloom Filters to Refine Web Search Results



Similar documents
A framework for performance monitoring, load balancing, adaptive timeouts and quality of service in digital libraries

An Innovate Dynamic Load Balancing Algorithm Based on Task

Applying Multiple Neural Networks on Large Scale Data

Information Processing Letters

Online Bagging and Boosting

Analyzing Spatiotemporal Characteristics of Education Network Traffic with Flexible Multiscale Entropy

Searching strategy for multi-target discovery in wireless networks

Preference-based Search and Multi-criteria Optimization

INTEGRATED ENVIRONMENT FOR STORING AND HANDLING INFORMATION IN TASKS OF INDUCTIVE MODELLING FOR BUSINESS INTELLIGENCE SYSTEMS

Real Time Target Tracking with Binary Sensor Networks and Parallel Computing

An Approach to Combating Free-riding in Peer-to-Peer Networks

Software Quality Characteristics Tested For Mobile Application Development

A Fast Algorithm for Online Placement and Reorganization of Replicated Data

Dynamic Placement for Clustered Web Applications

Extended-Horizon Analysis of Pressure Sensitivities for Leak Detection in Water Distribution Networks: Application to the Barcelona Network

Energy Proportionality for Disk Storage Using Replication

Entity Search Engine: Towards Agile Best-Effort Information Integration over the Web

Approximately-Perfect Hashing: Improving Network Throughput through Efficient Off-chip Routing Table Lookup

RECURSIVE DYNAMIC PROGRAMMING: HEURISTIC RULES, BOUNDING AND STATE SPACE REDUCTION. Henrik Kure

Cooperative Caching for Adaptive Bit Rate Streaming in Content Delivery Networks

Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2

Partitioned Elias-Fano Indexes

Local Area Network Management

Fuzzy Sets in HR Management

Audio Engineering Society. Convention Paper. Presented at the 119th Convention 2005 October 7 10 New York, New York USA

Machine Learning Applications in Grid Computing

6. Time (or Space) Series Analysis

The Research of Measuring Approach and Energy Efficiency for Hadoop Periodic Jobs

Managing Complex Network Operation with Predictive Analytics

Data Set Generation for Rectangular Placement Problems

Media Adaptation Framework in Biofeedback System for Stroke Patient Rehabilitation

Modeling Parallel Applications Performance on Heterogeneous Systems

This paper studies a rental firm that offers reusable products to price- and quality-of-service sensitive

Calculating the Return on Investment (ROI) for DMSMS Management. The Problem with Cost Avoidance

Standards and Protocols for the Collection and Dissemination of Graduating Student Initial Career Outcomes Information For Undergraduates

Evaluating Inventory Management Performance: a Preliminary Desk-Simulation Study Based on IOC Model

A quantum secret ballot. Abstract

Exercise 4 INVESTIGATION OF THE ONE-DEGREE-OF-FREEDOM SYSTEM

Generating Certification Authority Authenticated Public Keys in Ad Hoc Networks

Reliability Constrained Packet-sizing for Linear Multi-hop Wireless Networks

ADJUSTING FOR QUALITY CHANGE

Part C. Property and Casualty Insurance Companies

Data Streaming Algorithms for Estimating Entropy of Network Traffic

ASIC Design Project Management Supported by Multi Agent Simulation

Work Travel and Decision Probling in the Network Marketing World

Equivalent Tapped Delay Line Channel Responses with Reduced Taps

Markovian inventory policy with application to the paper industry

Use of extrapolation to forecast the working capital in the mechanical engineering companies

arxiv: v1 [math.pr] 9 May 2008

Adaptive Modulation and Coding for Unmanned Aerial Vehicle (UAV) Radio Channel

Airline Yield Management with Overbooking, Cancellations, and No-Shows JANAKIRAM SUBRAMANIAN

The Design and Implementation of an Enculturated Web-Based Intelligent Tutoring System

Efficient Key Management for Secure Group Communications with Bursty Behavior

AN ALGORITHM FOR REDUCING THE DIMENSION AND SIZE OF A SAMPLE FOR DATA EXPLORATION PROCEDURES

Evaluating the Effectiveness of Task Overlapping as a Risk Response Strategy in Engineering Projects

Optimal Resource-Constraint Project Scheduling with Overlapping Modes

SAMPLING METHODS LEARNING OBJECTIVES

An improved TF-IDF approach for text classification *

SUPPORTING YOUR HIPAA COMPLIANCE EFFORTS

Implementation of Active Queue Management in a Combined Input and Output Queued Switch

The AGA Evaluating Model of Customer Loyalty Based on E-commerce Environment

A Scalable Application Placement Controller for Enterprise Data Centers

On Computing Nearest Neighbors with Applications to Decoding of Binary Linear Codes

Method of supply chain optimization in E-commerce

Energy Efficient VM Scheduling for Cloud Data Centers: Exact allocation and migration algorithms

Factored Models for Probabilistic Modal Logic

The Velocities of Gas Molecules

CRM FACTORS ASSESSMENT USING ANALYTIC HIERARCHY PROCESS

The Fundamentals of Modal Testing

Protecting Small Keys in Authentication Protocols for Wireless Sensor Networks

Red Hat Enterprise Linux: Creating a Scalable Open Source Storage Infrastructure

ESTIMATING LIQUIDITY PREMIA IN THE SPANISH GOVERNMENT SECURITIES MARKET

Calculation Method for evaluating Solar Assisted Heat Pump Systems in SAP July 2013

Resource Allocation in Wireless Networks with Multiple Relays

The Benefit of SMT in the Multi-Core Era: Flexibility towards Degrees of Thread-Level Parallelism

Performance Evaluation of Machine Learning Techniques using Software Cost Drivers

An Application Research on the Workflow-based Large-scale Hospital Information System Integration

REQUIREMENTS FOR A COMPUTER SCIENCE CURRICULUM EMPHASIZING INFORMATION TECHNOLOGY SUBJECT AREA: CURRICULUM ISSUES

AUC Optimization vs. Error Rate Minimization

Identification and Analysis of hard disk drive in digital forensic

Reconnect 04 Solving Integer Programs with Branch and Bound (and Branch and Cut)

PERFORMANCE METRICS FOR THE IT SERVICES PORTFOLIO

A Study on the Chain Restaurants Dynamic Negotiation Games of the Optimization of Joint Procurement of Food Materials

How To Get A Loan From A Bank For Free

AutoHelp. An 'Intelligent' Case-Based Help Desk Providing. Web-Based Support for EOSDIS Customers. A Concept and Proof-of-Concept Implementation

Multi-level Metadata Management Scheme for Cloud Storage System

Image restoration for a rectangular poor-pixels detector

An Improved Decision-making Model of Human Resource Outsourcing Based on Internet Collaboration

High Performance Chinese/English Mixed OCR with Character Level Language Identification

Construction Economics & Finance. Module 3 Lecture-1

SOME APPLICATIONS OF FORECASTING Prof. Thomas B. Fomby Department of Economics Southern Methodist University May 2008

Research Article Performance Evaluation of Human Resource Outsourcing in Food Processing Enterprises

New for 2016! Get Licensed

Modeling Cooperative Gene Regulation Using Fast Orthogonal Search

Online Community Detection for Large Complex Networks

A Multi-Core Pipelined Architecture for Parallel Computing

ON SELF-ROUTING IN CLOS CONNECTION NETWORKS. BARRY G. DOUGLASS Electrical Engineering Department Texas A&M University College Station, TX

Study on the development of statistical data on the European security technological and industrial base

Multi-Class Deep Boosting

Halloween Costume Ideas for the Wii Game

Transcription:

Using Bloom Filters to Refine Web Search Results

Navendu Jain, Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712, nav@cs.utexas.edu
Mike Dahlin, Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712, dahlin@cs.utexas.edu
Renu Tewari, IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95111, tewarir@us.ibm.com

ABSTRACT

Search engines have primarily focused on presenting the most relevant pages to the user quickly. A less well explored aspect of improving the search experience is to remove or group all near-duplicate documents in the results presented to the user. In this paper, we apply a Bloom filter based similarity detection technique to address this issue by refining the search results presented to the user. First, we present and analyze our technique for finding similar documents using content-defined chunking and Bloom filters, and demonstrate its effectiveness in compactly representing and quickly matching pages for similarity testing. Later, we demonstrate how a number of results of popular and random search queries retrieved from different search engines (Google, Yahoo!, MSN) are similar and can be eliminated or re-organized.

1. INTRODUCTION

Enterprise and web search has become a ubiquitous part of the web experience. Numerous studies have shown that the ad-hoc distribution of information on the web has resulted in a high degree of content aliasing (i.e., the same data contained in pages from different URLs) [14], which adversely affects the performance of search engines [6]. The initial study by Broder et al. in 1997 [7], and the later one by Fetterly et al. [11], show that around 29.2% of data is common across pages in a sample of 150 million pages. This common data, when presented to the user in response to a search query, degrades the user experience by repeating the same information on every click. Similar data can be grouped or eliminated to improve the search experience.
Similarity based grouping is also useful for organizing the results presented by meta-crawlers (e.g., vivisimo, metacrawler, dogpile, copernic). The findings by searchenginejournal.com [2] show a significant overlap between the search results returned by the Google and Yahoo! search engines: the top 20 keyword searches from Google had about 40% identical or similar pages to the Yahoo! results. Sometimes search results may appear different purely due to the restructuring and reformatting of data. For example, one site may format a document into multiple web pages, with the top level page containing only a fraction of the document along with a "next" link leading to the remaining part, while another site may have the entire document in a single web page. An effective similarity detection technique should find these contained documents and label them as similar.

(This work was supported in part by the Texas Advanced Technology Program, the National Science Foundation (CNS-41126), and an IBM Faculty Partnership Award. This work was done during an internship at IBM Almaden. Copyright is held by the author/owner(s). Eighth International Workshop on the Web and Databases (WebDB 2005), June 16-17, 2005, Baltimore, Maryland.)

Although improving search results by identifying near-duplicates had been proposed for AltaVista [6], we found that the popular search engines, Google, Yahoo!, and MSN, even today have a significant fraction of near-duplicates in their top results. (Google does have a patent [17] for near-duplicate detection, although it is not clear which approach they use.) For example, consider the results of the query "emacs manual" using the Google search engine. We focus on the top 20 results (i.e., the first 2 pages) as they represent the results most likely to be viewed by the user. Four of the results, www.delorie.com/gnu/docs/emacs/emacs_toc.html, www.cs.utah.edu/dept/old/texinfo/emacs19/emacs_toc.html, www.dc.turkuamk.fi/docs/gnu/emacs/emacs_toc.html, and www.linuxselfhelp.com/gnu/emacs/html_chapter/emacs_toc.html, on the first page (top-10 results), were highly similar; in fact, they had nearly identical content but different page headers, disclaimers, and logo images. For this particular query, on the whole, 7 out of 20 documents were redundant (3 identical pairs and 4 similar to one top page document). Similar results were found using the Yahoo!, MSN, and A9 search engines. (Results for a recently popular query, "ohio court battle", from both Google and MSN search showed similar behavior, with 10 and 4 of the top 20 results being identical, respectively. A9 states that it uses a Google back-end for part of its search.)

In this paper, we study the current state of popular search engines and evaluate the application of a Bloom filter based near-duplicate detection technique on search results. We demonstrate, using multiple search engines, how a number of results (ranging from 7% to 60%) of search queries are similar and can be eliminated or re-organized. Later, we explore the use of Bloom filters for finding similar objects and demonstrate their effectiveness in compactly representing and quickly matching pages for similarity testing. Although Bloom filters have been extensively used for set membership checks, they have not been analyzed for similarity detection between text documents. Finally, we apply our Bloom filter based technique to effectively remove similar search results and improve the user experience. Our evaluation of search results shows that the occurrence of near-duplicates is strongly correlated with: i) the relevance of the document, and ii) the popularity of the query. Documents that are considered more relevant and have a higher rank also have more near-duplicates compared to less relevant documents. Similarly, results from the more popular queries have more near-duplicates compared to the less popular ones. Our similarity matcher can be deployed as a filter over

any search engine's result set. The overhead of integrating our similarity detection algorithm with a search engine is small: it adds only about 0.4% extra bytes per document and provides fast matching, on the order of milliseconds, as described later in Section 3. Note that we focus on one main aspect of similarity: text content. This might not completely capture the human-judgement notion of similarity in all cases. However, our technique can easily be extended to include link structure based similarity measures by comparing Bloom filters generated from the hyperlinks embedded in web pages.

The rest of the paper is organized as follows. Similarity detection using Bloom filters is described and analyzed in Section 2. Section 3 evaluates our similarity technique for improving search results from multiple engines and for different workloads. Finally, Section 4 covers related work, and we conclude with Section 5.

2. SIMILARITY DETECTION USING BLOOM FILTERS

Our similarity detection algorithm proceeds in three steps. First, we use content-defined chunking (CDC) to extract document features that are resilient to modifications. Second, we use these features as set elements for generating Bloom filters. Third, we compare the Bloom filters to detect near-duplicate documents above a certain similarity threshold (say 70%). We start with an overview of Bloom filters and CDCs, and later present and analyze the similarity detection technique for refining web search results.

2.1 Bloom Filter Overview

A Bloom filter of a set U is implemented as an array of m bits [4]. Each element u (u ∈ U) of the set is hashed using k independent hash functions h_1, ..., h_k. Each hash function h_i(u), for 1 ≤ i ≤ k, maps to one bit in the array {1, ..., m}. Thus, when an element is added to the set, it sets k bits in the Bloom filter array to 1, each bit corresponding to one hash function. If a bit was already set, it stays 1.
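The insertion procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: salted SHA-1 digests stand in for the k independent hash functions, and the values of m and k are arbitrary choices for the example.

```python
import hashlib

def bloom_filter(elements, m=2048, k=8):
    """Build an m-bit Bloom filter; each element sets k bits."""
    bits = [0] * m
    for e in elements:
        for i in range(k):
            # Salt one digest k ways to emulate k independent hash functions.
            d = hashlib.sha1(f"{i}:{e}".encode()).digest()
            bits[int.from_bytes(d[:8], "big") % m] = 1
    return bits

# Each element sets at most k bits; re-adding an element changes nothing.
bf = bloom_filter(["chunk-a", "chunk-b"])
assert sum(bf) <= 2 * 8
assert bloom_filter(["chunk-a", "chunk-b", "chunk-a"]) == bf
```

Note that a bit that is already 1 simply stays 1, which is why duplicate insertions leave the filter unchanged.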
For set membership checks, Bloom filters may yield a false positive, where it may appear that an element v is in U even though it is not. From the analysis in [8], given n = |U| and the Bloom filter size m, the optimal value of k that minimizes the false positive probability p^k, where p denotes the probability that a given bit is set in the Bloom filter, is k = (m/n) ln 2. Previously, Bloom filters have primarily been used for finding set membership [8].

2.2 Content-defined Chunking Overview

To compute the Bloom filter of a document, we first need to split it into a set of elements. Observe that splitting a document using a fixed block size makes it very susceptible to modifications, thereby making it useless for similarity comparison. For effective similarity detection, we need a mechanism that is more resilient to changes in the document. CDC splits a document into variable-sized blocks whose boundaries are determined by its Rabin fingerprint matching a predetermined marker value [18]. The number of bits in the Rabin fingerprint that are used to match the marker determines the expected chunk size. For example, given a marker 0x78 and an expected chunk size of 2^k, a rolling (overlapping sequence) 48-byte fingerprint is computed. If the lower k bits of the fingerprint equal 0x78, a new chunk boundary is set. Since the chunk boundaries are content-based, any modifications should affect only a couple of neighboring chunks and not the entire document. CDC has been used in LBFS [15], REBL [13], and other systems for redundancy elimination. (Within a search engine context, the CDCs and the Bloom filters of the documents can be computed offline and stored.)

2.3 Bloom Filters for Similarity Testing

Observe that we can view each document as a set, in Bloom filter parlance, whose elements are the CDCs that it is composed of. Given that Bloom filters compactly represent a set, they can also be used to approximately match two sets.
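The chunking step that supplies these set elements can be sketched as follows. This is an illustrative reconstruction: a simple polynomial rolling hash stands in for the 48-byte Rabin fingerprint, and the constants (base B, modulus M) are arbitrary choices, not from the paper.

```python
import random

def cdc_chunks(data: bytes, mask_bits: int = 8, window: int = 48, marker: int = 0x78):
    """Content-defined chunking sketch: cut a chunk whenever the low
    mask_bits bits of a rolling hash over the last `window` bytes equal
    `marker`, giving an expected chunk size of about 2**mask_bits bytes."""
    M = (1 << 61) - 1            # modulus of the rolling hash
    B = 263                      # base of the rolling hash
    outw = pow(B, window, M)     # weight of the byte leaving the window
    mask = (1 << mask_bits) - 1
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * B + b) % M
        if i >= window:
            h = (h - data[i - window] * outw) % M
        if i >= window - 1 and (h & mask) == marker:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Boundaries depend only on local content, so an insertion in the middle
# of the data disturbs only the neighboring chunks.
random.seed(1)
data = bytes(random.randrange(256) for _ in range(20000))
before = cdc_chunks(data)
after = cdc_chunks(data[:10000] + b"\x00" + data[10000:])
assert b"".join(before) == data
assert len(set(before) & set(after)) > len(before) // 2
```

The final assertion illustrates the resilience property the text describes: unlike fixed-size blocking, almost all chunks survive a one-byte insertion.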
Bloom filters, however, cannot be used for exact matching, as they have a finite false-match probability, but they are naturally suited for similarity matching. For finding similar documents, we compare the Bloom filter of one with that of the other. If the two documents share a large number of 1's (bit-wise AND), they are marked as similar. In this case, the bit-wise AND can also be perceived as the dot product of the two bit vectors. If the set bits in the Bloom filter of a document are a complete subset of those of another filter, then it is highly probable that the document is included in the other.

Web pages are typically composed of fragments, either static ones (e.g., logo images) or dynamic (e.g., personalized product promotions, local weather) [19]. When targeting pages for a similarity based grouping, the test for similarity should be on the fragment of interest and not the entire page.

Bloom filters, when applied to similarity detection, have several advantages. First, the compactness of Bloom filters is very attractive for storage and transmission whenever we want to minimize the meta-data overheads. Second, Bloom filters enable fast comparison, as matching is a bitwise-AND operation. Third, since Bloom filters are a complete representation of a set rather than a deterministic sample (e.g., shingling), they can determine inclusions effectively.

To demonstrate the effectiveness of Bloom filters for similarity detection, consider, for example, the pages from the Money/CNN web server (money.cnn.com). We crawled 13 MB of data from the site, resulting in 1753 documents. We compared the top-level page marsh_ceo/index.html with all the other pages from the site. For each document, we converted it into a canonical representation as described later in Section 3. The CDCs of the pages were computed using an expected and maximum chunk size of 256 bytes and 64 KB, respectively. The corresponding Bloom filter was of size 256 bytes.
Figure 1 shows that two other copies of the page, one with the URI /2004/10/25/news/fortune500/marsh_ceo/index.htm and another with a dynamic URI /2004/10/25/news/fortune500/marsh_ceo/index.htm?cnn=yes, matched with all set bits in the Bloom filter of the original document.

As another example, we crawled around 2 MB of data (59 documents) from the IBM web site (www.ibm.com). We compared the page /investor/corpgovernance/index.phtml with all the other crawled pages from the site. The chunk sizes were chosen as above. Figure 2 shows that two other pages, with the URIs /investor/corpgovernance/cgcoi.phtml and /investor/corpgovernance/cgblaws.phtml, appeared similar, matching in 53% and 69% of the bits in the Bloom filter, respectively.

To further illustrate that Bloom filters can differentiate between multiple similar documents, we extracted a technical documentation file, foo (say), of size 17 KB, incrementally from a CVS archive, generating 20 different versions, with foo being the original, foo.1 being the first version (with a change of 415 bytes from foo), and foo.19 being the last. As shown in Figure 3, the Bloom filter for foo matched the most (98%) with the closest version, foo.1. (For multisets, we make each CDC unique before Bloom filter generation to differentiate multiple copies of the same CDC.)
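The bit-wise AND comparison used in these examples can be sketched as follows. This is illustrative only: the toy `bloom` builder, the chunk names, and the parameter values are made up for the example, not taken from the paper's implementation.

```python
import hashlib

def bloom(elements, m=2048, k=8):
    """Toy Bloom filter builder (salted SHA-1 in place of k hash functions)."""
    bits = [0] * m
    for e in elements:
        for i in range(k):
            d = hashlib.sha1(f"{i}:{e}".encode()).digest()
            bits[int.from_bytes(d[:8], "big") % m] = 1
    return bits

def similarity(bf_a, bf_b):
    """Fraction of the set bits of bf_a that are also set in bf_b,
    computed from the bit-wise AND of the two filters."""
    total = sum(bf_a)
    return sum(a & b for a, b in zip(bf_a, bf_b)) / total if total else 0.0

doc_a = [f"chunk{i}" for i in range(32)]              # ~8 KB doc at 256 B/chunk
doc_b = doc_a[:24] + [f"other{i}" for i in range(8)]  # 75% of chunks shared
assert similarity(bloom(doc_a), bloom(doc_a)) == 1.0
assert similarity(bloom(doc_a), bloom(doc_b)) > 0.7   # passes a 70% threshold
```

Because `similarity` normalizes by the set bits of the first filter, a document fully contained in another scores close to 1.0 against it, which is the inclusion property noted above.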

[Figure 1: Comparison of the document marsh_ceo/index.html with all pages from the money.cnn.com source tree. Y-axis: fraction of 1's matched in the AND outputs; the copies index.htm and index.htm?cnn=yes match in all set bits.]

2.3.1 Analysis

The main consideration when using Bloom filters for similarity detection is the false match probability of the above algorithm as a function of the similarity between the source and a candidate document. Extending the analysis for membership testing in [4] to similarity detection, we proceed to determine the expected number of inferred matches between the two sets.

Let A and B be the two sets being compared for similarity. Let m denote the number of bits (size) in the Bloom filter. For simplicity, assume that both sets have the same number of elements, and let n denote the number of elements in each, i.e., |A| = |B| = n. As before, k denotes the number of hash functions. The probability that a bit is set by a hash function h_i, for 1 ≤ i ≤ k, is 1/m. A bit can be set by any of the k hash functions for each of the n elements. Therefore, the probability that a bit is not set by any hash function for any element is (1 - 1/m)^{nk}. Thus, the probability p that a given bit is set in the Bloom filter of A is:

p = 1 - (1 - 1/m)^{nk} ≈ 1 - e^{-nk/m}    (1)

For an element to be considered a member of the set, all of the corresponding k bits should be set. Thus, the probability of a false match, i.e., of an outside element being inferred as belonging to set A, is p^k. Let C denote the intersection of sets A and B and c its cardinality, i.e., C = A ∩ B and |C| = c. For the similarity comparison, let us take each element in set B and check whether it belongs to the Bloom filter of the given set A.
We should find that the c common elements will definitely match, and a few of the other (n - c) may also match due to the false match probability. By linearity of expectation, the expected number of elements of B inferred to have matched with A is

E[# of inferred matches] = c + (n - c)p^k

To minimize the false matches, this expected number should be as close to c as possible. For that, (n - c)p^k should be close to 0, i.e., p^k should approach 0. This is the same as minimizing the probability of a false positive. Expanding p, under asymptotic analysis this reduces to minimizing (1 - e^{-nk/m})^k. Using the same analysis for minimizing the false positive rate given in [8], the minimum obtained after differentiation is at k = (m/n) ln 2. Thus, the expected number of inferred matches for this value of k becomes

E[# of inferred matches] = c + (n - c)(0.6185)^{m/n}

Thus, the expected number of bits set that correspond to inferred matches is

E[# of matched bits] = m[1 - (1 - 1/m)^{k(c + (n - c)(0.6185)^{m/n})}]

Under the assumption of perfectly random hash functions, the expected number of total bits set in the Bloom filter of the source set A is mp. The ratio of the expected number of matched bits corresponding to inferred matches in A ∩ B to the expected total number of bits set in the Bloom filter of A is then:

E[# of matched bits] / E[# of total bits set] = (1 - e^{-k(c + (n - c)(0.6185)^{m/n})/m}) / (1 - e^{-nk/m})

Observe that this ratio equals 1 when all the elements match, i.e., c = n. If there are no matching elements, i.e., c = 0, the ratio is 2(1 - (0.5)^{(0.6185)^{m/n}}). For m = n, this evaluates to 0.6973, i.e., 69% of the matching bits may be false.

[Figure 2: Comparison of the document investor/corpgovernance/index.phtml with pages from the www.ibm.com source tree; y-axis: fraction of 1's matched in the AND outputs.]

[Figure 3: Comparison of the original file foo with its later versions foo.1, foo.2, ..., foo.19 (CVS repository benchmark); y-axis: fraction of 1's matched in the AND outputs.]
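As a quick numerical sanity check of the derivation (ours, not the paper's), the closed-form ratio reproduces the figures quoted in the analysis:

```python
import math

def match_ratio(c, n, m):
    """Expected fraction of the set bits of A that B matches, at the
    optimal k = (m/n) ln 2, when c of the n elements are common."""
    k = (m / n) * math.log(2)
    inferred = c + (n - c) * 0.6185 ** (m / n)   # expected inferred matches
    return (1 - math.exp(-k * inferred / m)) / (1 - math.exp(-n * k / m))

n = 1000
assert abs(match_ratio(0, n, n) - 0.6973) < 1e-3       # c = 0, m = n
assert abs(match_ratio(0, n, 8 * n) - 0.0295) < 1e-3   # c = 0, m = 8n
assert abs(match_ratio(n, n, 8 * n) - 1.0) < 1e-12     # c = n: ratio is 1
```

The constant 0.6185 here is e^{-(ln 2)^2}, i.e., p^k at the optimal k, matching its use in the equations above.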
For larger values, m = 2n, 4n, 8n, 10n, 11n, the corresponding ratios are 0.4658, 0.1929, 0.0295, 0.0113, and 0.007, respectively. Thus, for m = 11n, on average, less than 1% of the set bits may match incorrectly. The expected ratio of matching bits is highly correlated with the expected ratio of matching elements. Thus, if a large fraction of the bits match, then it is highly likely that a large fraction of the elements are common.

2.4 Discussion

Previous work on document similarity has mostly been based on shingling or super fingerprints. Using this method, for each object, all the k consecutive words of a document (called k-shingles) are hashed using a Rabin fingerprint [18] to create a set of fingerprints (also called features or pre-images). These fingerprints are then sampled to compute a super-fingerprint of the document. Many variants have been proposed that use different techniques for how the shingle fingerprints are sampled (min-hashing, Mod m, Min s, etc.) and matched [7, 6, 5]. While Mod m selects all fingerprints whose value modulo m is zero, Min s selects the set of s fingerprints with the smallest values. The min-hashing approach further refines the sampling to be the min values of, say, 84 random min-wise independent permutations (or hashes) of the set of all shingle fingerprints. This results in a fixed size sample of 84 fingerprints that is the resulting feature vector. To further simplify matching, these 84 fingerprints can be grouped as 6 super-shingles by concatenating 14 adjacent fingerprints [11]; in [13] these are called super-fingerprints. A pair of objects is then considered similar if either all or a large fraction of the values in the super-fingerprints match.

Our Bloom filter based similarity detection differs from the shingling technique in several ways. It should be noted, however, that the variants of shingling discussed above improve upon the original approach, and we provide a comparison of our technique with these variants wherever applicable.
First, shingling (Mod m, Min s) computes document similarity using the intersection of the two feature sets; in our approach, it requires only the bit-wise AND of the two Bloom filters (e.g., two 128-bit vectors). Next, shingling has a higher computational overhead: it first segments the document into k-word shingles (k = 5 in [11]), resulting in a shingle set of size about S - k + 1, where S is the document size. Later, it computes the image (value) of each shingle by applying a set (say H) of min-wise independent hash functions (|H| = 84 as used in [11]) and then, for each function, selecting the shingle corresponding to the minimum image. On the other hand, we apply a set of independent hash functions (typically fewer than 8) to the chunk set of average size S/c, where c is the expected chunk size (e.g., c = 256 bytes for an S = 8 KB document). Third, the size of the feature set (number of shingles) depends on the sampling technique in shingling. For example, in Mod m, even some large documents might have very few features, whereas small documents might have zero features. Some shingling variants (e.g., Min s, Mod 2^i) aim to select roughly a constant number of features. Our CDC based approach only varies the chunk size c to determine the number of chunks, as a trade-off between performance and fine-grained matching. We leave the empirical comparison with shingling as future work. In general, a compact Bloom filter is easier to attach as a document tag and can be compared simply by matching the bits. Thus, Bloom filter based matching is more suitable for meta-crawlers and can be added on to existing search engines without any significant changes.

3. EXPERIMENTAL EVALUATION

In this section, we evaluate Bloom filter-based similarity detection using several types of query results obtained by querying different search engines with the keywords posted on Google Zeitgeist (www.google.com/press/zeitgeist.html), Yahoo! Buzz (buzz.yahoo.com), and MSN Search Insider (www.imagine-msn.com/insider).

3.1 Methodology

We have implemented our similarity detection module using C and Perl. The code for content-defined chunking is based on the CDC implementation of LBFS [15]. The experimental testbed used a 933 MHz Intel Pentium III workstation with 512 MB of RAM running Linux kernel 2.4.22.
The three commercial search engines used in our evaluation are Google (www.google.com), Yahoo! Search (www.yahoo.com), and MSN Search (www.msnsearch.com). The Google search results were obtained using the GoogleAPI [1]; for each of the search queries, the API was called to return the top 1000 search results. Although we requested 1000 results, the API, due to some internal errors, always returned fewer than 1000 entries, varying from 481 to 897. For each search result, the document at the corresponding URL was fetched from the original web server to compute its Bloom filter. Each document was converted into a canonical form by removing all the HTML markups and tags, bullets and numberings such as a.1, extra white space, and colons; replacing dashes, single-quotes, and double-quotes with a single space; and converting all the text to lower case to make the comparison case insensitive. In many cases, due to server unavailability, incorrect document links, page not found errors, and network timeouts, the entire set of requested documents could not always be retrieved.

3.1.1 Size of the Bloom Filter

As discussed in Section 2, the fraction of bits that match incorrectly depends on the size of the Bloom filter. For a 97% accurate match, the number of bits in the Bloom filter should be 8x the number of elements (chunks) in the set (document). When applying CDC to each document, we use an expected chunk size of 256 bytes, while limiting the maximum chunk size to 64 KB. For an average document of size 8 KB, this results in around 32 chunks. The Bloom filter is set to be 8x this value, i.e., 256 bits. To accommodate large documents, we set the maximum document size to 64 KB (corresponding to the maximum chunk size). Therefore, the Bloom filter size is set to be 8x the expected number of chunks (256 for a document size of 64 KB), i.e., 2048 bits or 256 bytes, which is a 3.2% and 0.4% overhead for document sizes of 8 KB and 64 KB, respectively.

Example.
When we applied the Bloom filter based matcher to the "emacs manual" query (Section 1), we found that the page www.linuxselfhelp.com/gnu/emacs/html_chapter/emacs_toc.html matched the other three, www.delorie.com/gnu/docs/emacs/emacs_toc.html, www.cs.utah.edu/dept/old/texinfo/emacs19/emacs_toc.html, and www.dc.turkuamk.fi/docs/gnu/emacs/emacs_toc.html, with 74%, 81%, and 95% of the Bloom filter bits matching, respectively. A 70% matching threshold would have identified and grouped all these 4 pages together.

[Figure 4: "emacs manual" query search results (Google). Y-axis: percentage of duplicate documents; curves for 50%, 60%, 70%, 80%, and 90% similarity thresholds.]

3.2 Effect of the Degree of Similarity

In this section, we evaluate how the degree of similarity affects the number of documents that are marked similar. The degree of similarity is the percentage of the document data that matches (e.g., a 100% degree of similarity is an identical document). Intuitively, the higher the degree of similarity, the lower the number of documents that should match. Moreover, the number of documents that are similar depends on the total number of documents retrieved by the query. Although we initially expected a linear behavior, we observed that the higher ranked results (the top 10 to 20 results) were also the ones that were more duplicated. Using the GoogleAPI, we retrieved 493 results for the "emacs manual" query. To determine the number of documents that are similar among the set of retrieved documents, we use a union-find data structure for clustering Bloom filters of the documents based on similarity. Figure 4 shows that, for the 493 documents retrieved, the number of document clusters was 56, 220, 317, 328, and 340 when the degree of similarity was 50, 60, 70, 80, and 90%, respectively. Each cluster represents a set of similar documents (or a single document if no similar ones are found).
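The union-find clustering step can be sketched as follows. This is an illustrative reconstruction under our own assumptions; the paper's C/Perl implementation is not shown, and the tiny 8-bit filters below exist only to make the example readable.

```python
def cluster_duplicates(filters, threshold=0.7):
    """Union-find clustering: merge two documents whenever the fraction of
    matching set bits in their Bloom filters reaches `threshold`.
    Returns the number of resulting clusters."""
    parent = list(range(len(filters)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def sim(a, b):
        total = sum(a)
        return sum(x & y for x, y in zip(a, b)) / total if total else 0.0

    for i in range(len(filters)):
        for j in range(i + 1, len(filters)):
            if sim(filters[i], filters[j]) >= threshold:
                parent[find(i)] = find(j)   # union the two clusters

    return len({find(i) for i in range(len(filters))})

# Two similar filters and one disjoint filter yield 2 clusters.
a = [1, 1, 1, 1, 0, 0, 0, 0]
b = [1, 1, 1, 0, 0, 0, 0, 0]   # matches 3/4 of a's set bits
c = [0, 0, 0, 0, 1, 1, 1, 1]
assert cluster_duplicates([a, b, c]) == 2
```

Union-find makes the transitivity assumption from the text explicit: a document joins a cluster as soon as it is similar to any member of it.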
We assume that a document belongs to a cluster if it is similar to a document in the cluster, i.e., we assume that similarity is transitive for high values of the degree of similarity (as in [9]). The fraction of duplicate documents, as shown in Figure 4, decreases from 88% to 31% as the degree of similarity increases from 50% to 90%. As the number of retrieved results increases from 1 to 493, the fraction of duplicate documents initially decreases and then increases, forming a minimum around 250 results. The initial decrease was due to the larger aliasing of better ranked documents. However, as the number of results increases, the initial set of documents gets repeated more frequently, increasing the number of duplicates. Similar results were obtained for a number of other queries that we evaluated.

3.3 Effect of the Search Query Popularity

To get a representative collection of the types of queries

performed on search engines, we selected samples from Google Zeitgeist (Nov. 2004) of three different query popularities: i) Most Popular, ii) Medium-Popular, and iii) Random. For the most popular search queries, the three queries selected, in order, were jon stewart crossfire (TP1), electoral college (TP2), and day of the dead (TP3). We computed the number of duplicates having 70% similarity (at least 70% of the bits in the filter matched) in the search results. Figure 5 shows the corresponding number of duplicates for a maximum of 870 search results from the Google search API. The TP1 query had the maximum fraction of near-duplicates, 44.3%, while the other two, TP2 and TP3, had 29.7% and 24.3%, respectively. Observe that the most popular query, TP1, was also the one with the most duplicates. For the medium-popular queries, we selected three queries from the Google Top 10 Gaining Queries list for the week ending Aug. 30, 2004 on the Google Zeitgeist: indian larry (MP1), national hurricane center (MP2), and republican national convention (MP3). Figure 6 shows the corresponding search results having 70% similarity for a maximum of 880 documents from the Google search engine. The fraction of near-duplicates among the 880 search results ranged from 16% for MP1 to 28% for MP2.

Figure 5: Number of near-duplicate documents (70% similar) vs. number of results for the top 3 popular queries ("jon stewart crossfire", "electoral college", "day of the dead")
Figure 6: Number of near-duplicate documents (70% similar) vs. number of results for 3 medium-popular queries ("republican national convention", "national hurricane center", "indian larry")
Figure 7: Number of near-duplicate documents (70% similar) vs. number of results for 3 random queries ("olympics 2004 doping", "hawking black hole bet", "x prize spaceship")
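The near-duplicate percentages quoted in these experiments reduce to simple arithmetic over the cluster labels; a small hypothetical helper (not part of the paper's code) makes it explicit: documents minus distinct clusters, over total documents retrieved.

```python
def duplicate_fraction(cluster_labels):
    """Near-duplicate fraction of a result set: number of documents
    minus number of distinct clusters, over number of documents."""
    n = len(cluster_labels)
    return (n - len(set(cluster_labels))) / n

# e.g., 10 results falling into 7 clusters -> 30% near-duplicates
assert duplicate_fraction([0, 0, 1, 2, 3, 3, 3, 4, 5, 6]) == 0.3
```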
For a non-popular query sample, we selected three queries at random: olympics 2004 doping, hawking black hole bet, and x prize spaceship. The Google API retrieved only about 360 results for the first two queries and 320 results for the third query. Figure 7 shows the number of near-duplicate documents in the search results corresponding to the three queries. The fraction of near-duplicates in all these queries was in the same range, around 18%. As we observed earlier, as the popularity of queries decreases, so does the number of duplicate results. The most popular queries had the largest number of near-duplicate results, the medium ones fewer, and the random queries the lowest.

3.4 Behavior of different search engines

The previous experiments all compared the results from the Google search engine. We next evaluate the behavior of all three search engines, Google, Yahoo and MSN Search, in returning near-duplicate documents for the 10 popular queries featured on their respective web sites. To our knowledge, Yahoo and MSN Search do not provide an API similar to the GoogleAPI for automated retrieval of search results. Therefore, we manually made HTTP requests to the URLs corresponding to the first 50 search results for each query. We plot the minimum, average and maximum number of near-duplicate (at least 70% similar) search results over the 10 popular queries. The three whiskers on each vertical bar in Figures 8, 9, and 10 represent min., avg., and max., in order. Figure 8 shows the results for Google, with the average number of near-duplicates ranging from 7% to 23%. Figure 9 shows near-duplicates in Yahoo results ranging from 12% to 25%. Figure 10 shows the results for MSN, where the near-duplicates range from 18% to 26%. For the earlier emacs manual query, MSN had 32% near-duplicates while Yahoo had 22%. These experiments support our hypothesis that current search engines return a significant number of near-duplicates.
However, these results do not in any way suggest that any particular search engine performs better than the others.

3.5 Analyzing Response Times

In this section, we analyze the response times for performing similarity comparisons using Bloom filters. The timings include (a) the (offline) computation time to compute the document CDC hashes and generate the Bloom filter, and (b) the (online) matching time to determine similarity using bitwise AND on Bloom filters, plus the time for insertions and unions in a union-find data structure for clustering.

File Size   256 Bytes (ms)   512 Bytes (ms)   2 KB (ms)   8 KB (ms)
10 KB       0.3              0.3              0.2         0.2
100 KB      4                3                3           2
1 MB        29               27               26          24
10 MB       405              321              267         259
Table 1: CDC hash computation time for different file sizes and expected chunk sizes

Document Size   # of chunks (n)   k = 2 (ms)   k = 4 (ms)   k = 8 (ms)
10 KB           35                11           12           14
100 KB          390               118          120          126
1 MB            2959              961          1042         1198
10 MB           34630             11792        11960        12860
Table 2: Time (ms) for Bloom filter generation for different document sizes (expected chunk size 256 bytes)

Bloom Filter Size (bits)   100   300   625   1250   2500   5000
Time (µsec)                1.9   2.4   2.9   3.9    6.2    10.7
Table 3: Time (microseconds) for computing the bitwise AND of Bloom filters of different sizes

Table 1 shows the CDC hash computation times for a complete document (of size 10 KB, 100 KB, 1 MB, 10 MB) for different expected chunk sizes (256 bytes, 512 bytes, 2 KB, 8 KB). The Bloom filter generation times are shown in Table 2 for different values (2, 4, 8) of the number of hash functions (k) and different numbers of chunks (n). Although the Bloom filter generation times appear high relative to the CDC times, this is more an artifact of the Bloom filter code being implemented in Perl instead of C, and not due to any inherent complexity of the Bloom filter computation. A preliminary implementation in C reduced the Bloom filter generation time by an order of magnitude. For the matching time overhead, Table 3 shows the pairwise matching time for two Bloom filters for different filter

sizes ranging from 100 bits to 5000 bits.

Figure 8: Search results for 10 popular queries on Google (min./avg./max. number of near-duplicate documents, 70% similar)
Figure 9: Search results for 10 popular queries on Yahoo Search (min./avg./max. number of near-duplicate documents, 70% similar)
Figure 10: Search results for 10 popular queries on MSN Search (min./avg./max. number of near-duplicate documents, 70% similar)

Search Query             No. of Results:   10   20   40   80   160   320
emacs manual                               1    4    15   66   286   1233
ohio court battle                          1    7    24   98   369   1426
hawking black hole bet                     1    6    23   88   364   1470
Table 4: Matching and clustering time (in ms)

The overall matching and clustering time for different numbers of query results is shown in Table 4. Overall, using untuned Perl and C code, clustering 80 results, each of size 10 KB, for the emacs manual query would take around 80*0.3 ms + 80*14 ms + 66 ms = 1210 ms. However, the Bloom filters can be computed and stored a priori, reducing the time to 66 ms.

4. RELATED WORK

The problem of near-duplicate detection consists of two major components: (a) extracting document representations, aka features (e.g., shingles using Rabin fingerprints [18], supershingles [11], super-fingerprints [13]), and (b) computing the similarity between the feature sets. As discussed in Section 2, many variants have been proposed that differ in how the shingle fingerprints are sampled (e.g., min-hashing, Modm, Mins) and matched [7, 6, 5]. Google's patent for near-duplicate detection uses another shingling variant to compute fingerprints from the shingles [17]. Our similarity detection algorithm uses CDC [15] to compute document features and then applies Bloom filters for similarity testing.
In contrast to existing approaches, our technique is simple to implement, incurs only about 0.4% extra bytes per document, and performs faster matching using only bitwise AND operations. Bloom filters have been proposed for estimating the cardinality of set intersection in [8], but have not been applied to near-duplicate elimination in web search. We recently learned about Bloom filter replacements [16], which we will explore in the future. Page and site similarity has been extensively studied for web data in various contexts, from syntactic clustering of web data [7] and its applications for filtering near-duplicates in search engines [6], to storage space and bandwidth reduction for web crawlers and search engines. In [9], replica identification was also proposed for organizing web search results. Fetterly et al. examined the amount of textual change in individual web pages over time in the PageTurner study [12] and later investigated the temporal evolution of clusters of near-duplicate pages [11]. Bharat and Broder investigated the problem of identifying mirrored host pairs on the web [3]. Dasu et al. used min hashing and sketches to identify fields having similar values in database tables [10].

5. CONCLUSIONS

In this paper, we applied a Bloom filter based similarity detection technique to refine the search results presented to the user. Bloom filters compactly represent an entire document and can be used for quick matching. We demonstrated how a number of results of popular and random search queries retrieved from different search engines (Google, Yahoo, and MSN) are similar and can be eliminated or re-organized.

6. ACKNOWLEDGMENTS

We thank Rezaul Chowdhury, Vijaya Ramachandran, Sridhar Rajagopalan, Madhukar Korupolu, and the anonymous reviewers for their valuable comments.

7. REFERENCES

[1] Google Web APIs (beta). http://www.google.com/apis.
[2] Yahoo results getting more similar to Google. http://www.searchenginejournal.com/index.php?p=584&c=1.
[3] K. Bharat and A. Broder.
Mirror, mirror on the web: a study of host pairs with replicated content. Comput. Networks, 31(11-16):1579-1590, 1999.
[4] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422-426, 1970.
[5] A. Z. Broder. On the resemblance and containment of documents. In SEQUENCES, 1997.
[6] A. Z. Broder. Identifying and filtering near-duplicate documents. In CPM, pages 1-10, 2000.
[7] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In WWW, 1997.
[8] A. Z. Broder and M. Mitzenmacher. Network applications of Bloom filters: A survey. In Allerton, 2002.
[9] J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated web collections. SIGMOD Rec., 2000.
[10] T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In SIGMOD, 2002.
[11] D. Fetterly, M. Manasse, and M. Najork. On the evolution of clusters of near-duplicate web pages. In LA-WEB, 2003.
[12] D. Fetterly, M. Manasse, M. Najork, and J. Wiener. A large-scale study of the evolution of web pages. In WWW, 2003.
[13] P. Kulkarni, F. Douglis, J. D. LaVoie, and J. M. Tracey. Redundancy elimination within large collections of files. In USENIX Annual Technical Conference, General Track, pages 59-72, 2004.
[14] J. C. Mogul, Y.-M. Chan, and T. Kelly. Design, implementation, and evaluation of duplicate transfer detection in HTTP. In NSDI, pages 43-56, 2004.
[15] A. Muthitacharoen, B. Chen, and D. Mazieres. A low-bandwidth network file system. In SOSP, 2001.
[16] R. Pagh, A. Pagh, and S. S. Rao. An optimal Bloom filter replacement. In SODA, 2005.
[17] W. Pugh and M. Henzinger. Detecting duplicate and near-duplicate files. US Patent #6658423.
[18] M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Harvard University, 1981.
[19] L. Ramaswamy, A. Iyengar, L. Liu, and F. Douglis. Automatic detection of fragments in dynamically generated web pages. In WWW, 2004.