Clustering and Load Balancing Optimization for Redundant Content Removal
Shanzhong Zhu (Ask.com), Alexandra Potapova (University of California Santa Barbara), Maha Alabduljalil (University of California Santa Barbara), Xin Liu (Amazon.com), Tao Yang (University of California Santa Barbara)

ABSTRACT

Removing redundant content is an important data processing operation in search engines and other web applications. An offline approach can be important for reducing the engine's cost, but it is challenging to scale such an approach for a large data set which is updated continuously. This paper discusses our experience in developing a scalable approach with parallel clustering that detects and removes near duplicates incrementally when processing billions of web pages. It presents a multidimensional mapping to balance the load among multiple machines. It further describes several approximation techniques to efficiently manage distributed duplicate groups with transitive relationships. The experimental results evaluate the efficiency and accuracy of the incremental clustering, assess the effectiveness of the multidimensional mapping, and demonstrate the impact on online cost reduction and search quality.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Search Process, Clustering; H.3.4 [Systems and Software]: Distributed Systems

General Terms: Parallel Algorithms, Performance

Keywords: Parallel Near Duplicate Analysis, Incremental Computing, Scalability

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2012, April 16-20, 2012, Lyon, France.

1. INTRODUCTION

A large portion of pages crawled from the web have similar content. Detection of redundant content is an important data processing operation which facilitates the removal of near duplicates and mirrors, as well as of other low-quality content such as spam and dead pages, in search and web mining applications. Duplicate comparison algorithms have been studied extensively in previous work [5, 17, 6, 13, 8, 15, 21]. Since the precision requirement for removing redundant content in a production search engine is high, it is time consuming to use a combination of these comparison algorithms and conduct accurate computation for billions of documents in an offline system. As a web data collection is updated continuously, it is even more challenging to keep an accurate and consistent view of near-duplicate relations. Major search engine companies have often taken a compromise approach: while a relatively small percentage of duplicates can be removed conservatively in the offline process, a large portion of duplicates is kept in the live online database. Duplicates are then detected and removed during online query processing among the matched results. Such an approach simplifies offline processing, but increases the cost of online serving. From the cost saving perspective, it is attractive to reduce and balance online and offline expense because online expense is often proportional to the amount of traffic. Cost reduction is important for companies in developing cost-conscious search applications.

This paper discusses our experience in developing scalable offline techniques for removing a large amount of redundant content and provides evaluation results on their effectiveness. It addresses two technical issues arising in such a setting. First, we study the balancing of distributed workload on parallel machines by partitioning offline data. We use document lengths to guide data mapping, which helps eliminate unnecessary machine communication for page comparisons. Because web document length distribution is highly skewed, we have developed a multidimensional mapping scheme to optimize load balancing and improve the overall data processing throughput during parallel comparisons.
Second, because it is easier and less expensive to manage a cluster of duplicates instead of individual duplicate pairs, our design uses a clustering approach with incremental update and transitive approximation. We present experimental results to demonstrate the effectiveness of our scheme and its impact on cost saving and search quality.

The rest of this paper is organized as follows. Section 2 discusses background and related work. Section 3 discusses our overall design considerations. Section 4 presents the multidimensional mapping. Section 5 discusses incremental clustering. Section 6 presents experimental results. Finally, Section 7 concludes the paper.
2. BACKGROUND AND RELATED WORK

Search services need to return the most relevant answers to a query, and presenting identical or near-identical results leads to a bad user experience. By removing duplicates, search result presentation becomes more compact, saving users time in browsing results. Pairwise comparison algorithms for duplicate detection have been studied extensively in previous work [5, 17, 6, 13, 8, 15, 21]. In general, duplicate removal can be conducted in an online result displaying stage or in an offline processing stage. An offline data processing pipeline includes 1) a crawling system which continuously spiders pages from the web, and 2) a data generation system that parses crawled pages and produces an inverted index for the online system to search. Duplicate detection and other data processing subsystems assist the main pipeline processing. The processing in the main pipeline is continuous, as existing pages need to be refreshed to catch up with updated content and new web pages need to be included as soon as possible.

Removing redundant content in an offline system requires a very high precision in duplicate judgment. Fast approximation algorithms [5, 17, 6, 8, 15, 21] are often not sufficient to simultaneously deliver high precision and high recall. A combination of these algorithms with an increased scope of feature sampling and additional features can improve precision (e.g. [13]). This increases processing cost, especially for conducting accurate comparison over billions of pages. Accurate offline duplicate analysis with both high precision and high recall can become a processing bottleneck and affect the freshness of the online index. As a result, major search engines often resort to online duplicate elimination. In this online process, duplicates are detected among the top results matching a query, and only one result from each set of duplicate pages is shown in the final search results.
The disadvantage of delaying duplicate removal to the online process is that duplicates appear in the live index, which increases the serving cost. Several previous studies [10, 4] indicate that over 30% of crawled pages are near duplicates. Our experience shows that a significant engine cost saving can be achieved by removing these duplicates, as we discuss in Section 6.

A related area in the database field is the similarity join operation, which finds all pairs of objects whose similarity exceeds a given threshold. An algorithmic challenge is how to perform the similarity join in an efficient and scalable way. LSH-based approximation [11] has been used to map pages with similar signatures into the same partition. Another study by Arasu et al. [2] shows that exact algorithms can still deliver performance competitive to LSH when using several candidate pair filtering methods, including inverted index-based methods [20]. Additional filtering methods based on prefixes and thresholds are proposed by Bayardo et al. [3] and Xiao et al. [23]. Other related work is sentence-level duplicate detection among web pages, studied by Zhang et al. [24] with MapReduce [9]. MapReduce-based parallel implementation is discussed by Lin [16] for top-k similarity computation and demonstrated by Wang et al. [22] for duplicate detection. The work by Peng and Dabek [19] presents systems support for incremental page processing using distributed transactions and notifications. There, duplicate detection is accomplished through key lookup using a content hash of pages or redirection targets. This technique is effective for handling exact duplicates, but is not targeted at similarity and near-duplicate detection, where complex all-to-all comparison is needed.

It should be mentioned that there are two types of duplicates among web pages: 1) Content-based duplicates, where two web pages have near-identical content. This is measured by the percentage of content overlap between the two pages.
2) Redirection-based duplicates, which arise when one web page redirects to another, either through a permanent or a temporary HTTP redirect. Although the source and destination of a redirection have identical content, the crawler may visit them at different times and download different content versions.

Two pages q1 and q2 are considered duplicates or near duplicates if a pairwise duplicate detection function F(q1, q2) is true or if the two pages are connected through a sequence of redirections. For content-based duplicates, the similarity function between two pages q1 and q2 is defined as Sim(q1, q2). Let τ be the control threshold of near-duplicate similarity; F(q1, q2) is true if Sim(q1, q2) > τ. The primary metric for computing Sim(q1, q2) in previous work [5, 17, 6, 13, 21] has been the Jaccard formula, while the Cosine formula can also be a choice [12, 15].

3. DESIGN CONSIDERATIONS

Our task is to shift most duplicate removal operations from the online engine to offline processing. The key saving is that the online service does not need to host an index with duplicated documents, and an online engine can collect more duplicate-free candidate results in matching a query when there is a limitation on the total number of candidates selected. In this way, an online multi-tier serving architecture can become more effective. We discuss evaluation results in Section 6.

For offline near-duplicate removal from a large dataset, we first study parallelism optimization for task and data mapping to improve load balancing and reduce communication. We assume exact duplicates are removed first in the offline data processing, using techniques such as representing each page by one hash signature [19]. Our objective is to add an offline service with the following functionality: given a set of newly crawled pages, this service answers whether a page is a duplicate of others or not.
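As a concrete but deliberately simplified sketch of such a pairwise test, the code below computes F(q1, q2) with the Jaccard formula over k-gram shingles. The function names and the defaults k = 3 and τ = 0.9 are illustrative assumptions; the production algorithm combines multiple heuristics and is not reproduced here.

```python
def shingles(words, k=3):
    # Set of k-gram shingles of a tokenized page. Following the
    # convention described in Section 4.2, k-1 dummy words are
    # appended so that short documents still produce shingles.
    padded = list(words) + ["#"] * (k - 1)
    return {tuple(padded[i:i + k]) for i in range(len(words))}

def jaccard(a, b):
    # Jaccard similarity |A ∩ B| / |A ∪ B| of two shingle sets.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def F(q1_words, q2_words, k=3, tau=0.9):
    # Pairwise near-duplicate test: true when Sim(q1, q2) > tau.
    return jaccard(shingles(q1_words, k), shingles(q2_words, k)) > tau
```

Two identical token sequences yield similarity 1.0 and are flagged as duplicates, while two pages with no common shingles yield 0.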
The result of such a query is advisory, in the sense that a page may be indexed and delivered to the live online database before its duplicate status is determined; in that case, the system can decide whether such a page should be removed from the online database. In this way, the duplicate analysis does not become a bottleneck for the main pipeline in delivering fresh documents to the online database. On the other hand, this duplicate analysis service needs to have a high throughput so that the offline system can report the duplicate status of a page as soon as possible. The focus of our optimization to improve the overall throughput of an offline parallel near-duplicate service is on two aspects: 1) data partitioning for load balancing; 2) distributed duplicate information management and incremental update. We discuss these two issues in Sections 4 and 5.

4. LOAD BALANCING WITH MULTIDIMENSIONAL PARTITIONING

The duplicate service uses a pairwise algorithm for identifying duplicate pairs. Since many web pages do not change for a long period of time, we only compare a set of new or
updated pages with existing pages.

Figure 1: Size distribution for a web page collection. The X-axis lists the page size, ranging from 20 to 1686 words. The Y-axis is the number of pages with each particular size.

In using a cluster of machines for a many-to-all comparison service, we need to map data and computation to each machine. We choose to use page-based mapping, which assigns each page to one partition and one or a few partitions to a physical machine. Another option for machine mapping is to use a fingerprint hash value based on shingling [4] or LSH [11]. There are two reasons we did not use the latter mapping. First, such a method maps a page to multiple values, and similarity computation counts the matches over multiple fingerprint computing functions for each page pair, which increases inter-machine communication for result aggregation. Such overhead is significant considering that the number of near-duplicate pairs is huge. Second, to deliver high accuracy and recall in near-duplicate detection, a combination of multiple methods with different types of features [13, 12] is needed in practice. For example, Henzinger [13] reported 38% precision for the shingling algorithm by Broder et al. [5] and 50% precision for Simhash by Charikar [6], while a combined algorithm achieved 79% precision. Ask.com has combined multiple pairwise comparison heuristics with various features to reach high precision and high recall for offline duplicate removal. Thus, it is not easy to map documents to machines using a shingling or LSH based mapping.

We have the following design objectives for data and computation mapping. 1) Pages assigned to different machines should be unlikely to be near duplicates, so that a new or updated page is sent to the smallest possible number of machines for comparison. 2) The number of pages assigned to each machine should be about the same.
This helps load balancing of the comparison computation because the load generated on each machine is approximately proportional to the number of pages assigned. 3) As page content changes, the page-to-machine mapping may need to change. The algorithm should minimize the chance of remapping, as remapping causes overhead in data migration and service disruption.

We consider using the page length to guide the mapping of data to machines. For similarity join, Arasu et al. [2] describe the use of feature set size for dividing a general similarity join instance, observing that two sets have similar sizes if they have high Jaccard similarity. For near-duplicate detection on a single machine, Theobald et al. [21] describe a condition using the length of signatures to identify pages that cannot be duplicates. Motivated by such approaches, we initially used the page length to map documents. Our experiments find that one-dimensional data mapping based on page length does not balance the computation well across machines, as web document length distribution is highly skewed. It is also not easy for machine load balancing to adapt to page length changes. Figure 1 shows the sampled size distribution of 2 million pages randomly selected from a 100 million page dataset. The X-axis is the size value from 20 to 1686 words, and the Y-axis is the number of pages of the corresponding page size. The size distribution has certain spikes and a high concentration at sizes around 237. With a skewed distribution of web page lengths, some partitions contain an excessive number of pages, which becomes a throughput bottleneck. Often, one heavily-loaded partition has many more pages than the others, which significantly slows down the overall throughput of a many-to-all comparison service.

Figure 2: Examples of mapping a page with a 2D length vector to partitions.
4.1 A multidimensional mapping algorithm

We have developed a multidimensional algorithm that maps data to machines with better load balancing while assigning dissimilar pages to different partitions as much as possible. For simplicity of presentation, we first assume each machine hosts one partition. When there are more partitions than machines, we can evenly map partitions to machines. This algorithm uses a combination of language and multidimensional signatures to partition web pages. It first partitions pages with respect to their languages (this paper deals with English documents only). It segments an English dictionary into N disjoint subsets, and then splits each page document into N subdocuments. Each subdocument only uses words in the corresponding dictionary subset and has a feature length computed only from those words. Therefore, we obtain N length features for each page, one per dimension.

Next, we describe how to partition each dimension. We slice and divide the values of each dimension i into d_i intervals. The slicing method is discussed in the next subsection. We define a length range over N dimensions as a tuple containing a specific interval for each dimension. The total number of N-dimensional ranges is d_1 × d_2 × ... × d_N. Each partition contains pages
whose length vectors fall within a specific N-dimensional range. For example, in the 2-dimensional setting illustrated in Figure 2, assume that each dimension is divided into two intervals [0,300) and [300,∞). There are 4 ranges: ([0,300), [0,300)), ([0,300), [300,∞)), ([300,∞), [0,300)), and ([300,∞), [300,∞)). Thus, we can create 4 corresponding partitions, and every page will be mapped to one of them.

When receiving a new or updated page, an N-dimensional length vector for this page (called A) is computed, and this page is then sent to the machines that host partitions with pages possibly similar to A. We would like to send this page to the smallest possible number of partitions for comparison. For the example in Figure 2, when A = (100,500) with threshold τ = 0.9, only the partition containing length range ([0,300), [300,∞)) is contacted for further duplicate comparison. If A = (280,500), then the two partitions with ranges ([0,300), [300,∞)) and ([300,∞), [300,∞)) will be contacted.

Formally, let A^k denote the set of k-gram shingle signatures derived from a page A, let A^1_i denote the subdocument in the i-th dimension of the unigram representation A^1, and let L(A^1_i) be the length of subdocument A^1_i. Note that this length function is the summation of the frequencies of all words appearing in the subdocument. The vector (L(A^1_1), L(A^1_2), ..., L(A^1_N)) is the N-dimensional length vector of this document. We can show that only documents whose length vectors fall into a bounding space with the following i-th dimension range can be candidate near duplicates of A:

( L(A^1_i) · τ / ρ , L(A^1_i) · ρ / τ )   (1)

where ρ is an interval enlarging factor defined in the next subsection. Thus, feature vector A only needs to be sent to machines that host partitions with ranges intersecting the above bounding space; only those partitions can contain pages which may be near duplicates of A. A justification is provided below.
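As a minimal sketch of this routing, the code below computes a page's N-dimensional length vector and the candidate partitions whose ranges intersect the bounding space of Expression (1). Two illustrative assumptions are made: a stable word hash stands in for the paper's N disjoint dictionary subsets, and ρ defaults to 1.0 (the paper tests values between 1 and 1.3).

```python
import hashlib
from itertools import product

def length_vector(words, n_dims):
    # N-dimensional length vector (L(A^1_1), ..., L(A^1_N)).
    # Each component sums the frequencies of the page's words that
    # fall into the corresponding (hash-emulated) dictionary subset.
    vec = [0] * n_dims
    for w in words:
        vec[hashlib.md5(w.encode("utf-8")).digest()[0] % n_dims] += 1
    return vec

def candidate_partitions(length_vec, boundaries, tau, rho=1.0):
    # Partitions whose N-dimensional range intersects the bounding
    # space (L(A^1_i)*tau/rho, L(A^1_i)*rho/tau) of Expression (1).
    # boundaries[i] lists the interval left bounds of dimension i,
    # e.g. [0, 300] encodes the intervals [0,300) and [300,inf).
    per_dim = []
    for L, bounds in zip(length_vec, boundaries):
        lo, hi = L * tau / rho, L * rho / tau
        idxs = []
        for j, left in enumerate(bounds):
            right = bounds[j + 1] if j + 1 < len(bounds) else float("inf")
            if left < hi and right > lo:  # [left, right) meets (lo, hi)
                idxs.append(j)
        per_dim.append(idxs)
    return list(product(*per_dim))
```

For the Figure 2 example with τ = 0.9, the vector (100, 500) is routed to a single partition, while (280, 500) must be sent to two partitions, matching the discussion above.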
4.2 Similarity analysis

We analyze which partitions should be compared when identifying near duplicates of a page with feature vector A. We use the Jaccard similarity of two pages represented by consecutive k-gram shingles, as defined by Broder et al. [5, 4]. Let |A^k| be the size of the shingle set, which only counts unique members. In computing the k-gram shingles of a page, we append k−1 dummy words at the end of the document. When k is not too small, our experiments show that L(A^1) is relatively close to |A^k|; let ε be the relative difference ratio between these two terms, satisfying

1 − ε ≤ |A^k| / L(A^1) ≤ 1 + ε.

There is a small length variation δ in distributing the words of page A into N subdocuments, namely

| L(A^1) / (N · L(A^1_i)) − 1 | ≤ δ.

We define the interval enlarging factor ρ = (1+δ)(1+ε) / ((1−δ)(1−ε)). For another existing page B whose length vector (L(B^1_1), ..., L(B^1_N)) does not fall in the range defined in Expression (1), there exists a dimension i such that

L(B^1_i) ≤ L(A^1_i) · τ / ρ  or  L(B^1_i) ≥ L(A^1_i) · ρ / τ.   (2)

Note that the Jaccard similarity of these two pages with respect to k-grams is Sim(A,B) = |A^k ∩ B^k| / |A^k ∪ B^k|. Given the above terminology, we show that A and B cannot be duplicates. Since |A^k|/L(A^1) and |B^k|/L(B^1) are within [1−ε, 1+ε], and

|A^k ∩ B^k| / |A^k ∪ B^k| ≤ min( |A^k|/|B^k| , |B^k|/|A^k| ),

we have

Sim(A,B) ≤ min( |A^k|/|B^k| , |B^k|/|A^k| ) ≤ min( L(A^1)/L(B^1) , L(B^1)/L(A^1) ) · (1+ε)/(1−ε).

Thus, we can work with a multidimensional condition that operates on unigrams. In our scheme, A is divided into N subdocuments under N sub-dictionaries, producing the unigram signature sets A^1_1, A^1_2, ..., A^1_N respectively. The distribution satisfies the following inequalities:

(1−δ) · N · L(A^1_i) ≤ L(A^1) ≤ (1+δ) · N · L(A^1_i)

and

(1−δ) · N · L(B^1_i) ≤ L(B^1) ≤ (1+δ) · N · L(B^1_i).

From these two inequalities, we have

min( L(A^1)/L(B^1) , L(B^1)/L(A^1) ) ≤ min( L(A^1_i)/L(B^1_i) , L(B^1_i)/L(A^1_i) ) · (1+δ)/(1−δ).

Based on condition (2),

min( L(A^1_i)/L(B^1_i) , L(B^1_i)/L(A^1_i) ) ≤ τ / ρ.

Thus,

min( L(A^1)/L(B^1) , L(B^1)/L(A^1) ) ≤ τ · (1−ε)/(1+ε).
Hence the Jaccard similarity of these two pages with respect to k-grams satisfies Sim(A,B) ≤ τ; namely, A and B cannot be a near-duplicate pair.

The interval enlarging factor ρ is estimated from δ and ε, which are sampled in our implementation with an approximation. Since the above analysis between vectors A and B uses worst-case bounds in the inequality derivation, there is room to tighten ρ while still approximately preserving correctness. We have tested ρ values between 1 and 1.3; the approximation may cause some partitions not to receive pages to compare, which can reduce duplicate recall. We evaluate its impact on accuracy in Section 6.

4.3 Dimension slicing and partition mapping

We discuss how each dimension is sliced. For a 1D size space, we first consider the SpotSigs partitioning strategy [21], which scans the size values of all pages from smallest to biggest and merges consecutive values to produce d intervals with the
left bounds p_1, p_2, ..., p_d, where those values satisfy p_i / p_{i+1} < τ. For example, given the size list 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10 with τ = 0.8, the following intervals are produced: [1,2), [2,3), [3,4), [4,6), [6,8), [8,10]. We call these intervals fine-grain intervals.

Next, we consider the page distribution under the above fine-grain intervals and merge them into coarse-grain intervals so that the number of pages in each coarse-grain interval is about the same. Given k fine-grain intervals, we consecutively consolidate some of these intervals with their neighbors and map them to S target intervals so that each interval has close to M/S pages, where M is the total number of documents to be mapped. For example, assume a page distribution of 2, 3, 4, 5, 2, and 1 for the fine-grain intervals [1,2), [2,3), [3,4), [4,6), [6,8), [8,10], respectively. To produce 4 balanced intervals, we merge them into [1,3), [3,4), [4,6), [6,10], so that 5, 4, 5, and 3 pages are distributed into the intervals, respectively. If the original page distribution is 2, 3, 4, 10, 2, and 1 for the six fine-grain intervals, then the merged intervals have the distribution 5, 4, 10, and 3. This example also illustrates that it is not easy to balance a skewed distribution using a 1D mapping. For an N-dimensional size space, we apply the above 1D slicing and merging method on each dimension to balance the page distribution in each interval. Multidimensional mapping is more effective than 1D mapping in dealing with distribution spikes, as the 1D page size is approximately divided by N in each dimension. When the data set is updated and expanded, the multidimensional mapping is also less sensitive to size variation.

5. DUPLICATE CLUSTERING AND INCREMENTAL UPDATE

Once duplicate pairs are detected, the offline system keeps the winners among the duplicates in the online database and removes the losers. Winners are pages that need to be kept in the final database, while losers are pages that need to be removed.
There are a number of page features that can be used for winner selection. For example, pages with high link popularity or pages frequently clicked in the search logs may be selected. Such features are also used in query result ranking [18, 14, 7, 1]. When the popularity scores of pages are about the same, URLs with a longer length may be less preferred than shorter ones, and a dynamic page may be less preferred than a static URL. When pages are selected for a different market, additional rules may be imposed. For example, a database produced for the United Kingdom may prefer hosts ending with the .uk domain over .com or .org.

It is not easy to apply the above rules to duplicate pairs, for two reasons. 1) We often cannot get a consistent or stable winner selection among duplicate pairs. Some pages initially considered losers can become winners later on. For example, with two pairs of duplicates (x,z) and (x,y), if page x is the winner for the first pair and a loser for the second pair, then both x and y will be kept. 2) When a winner becomes dead, we need to revisit its losers quickly and find a replacement for this winner among them. Storing all duplicate pairs explicitly is expensive, as there are a huge number of them.

Figure 3: Two-tier architecture for incremental duplicate clustering with continuous data update.

Given the above considerations, we choose to cluster duplicates by transitive relationship, which simplifies duplicate data management. Each page either has no duplicates or belongs to one and only one duplicate group. Two duplicate groups are merged if there exists a duplicate pair between one page from the first group and another page from the second group. When answering a query on the duplicate status of a page, if the page is a loser in a group, the system verifies its similarity against the group winner to ensure detection precision. This approach has three advantages.
1) Winner selection is always conducted within a group of pages, which yields a relatively stable selection compared to a pairwise manner. Once duplicate groups are detected, a database generation scheme can conduct winner selection and use the selected winners to produce different online databases for different search markets. 2) The space cost of storing such duplicate groups has an affordable linear complexity, proportional to the number of pages. 3) Clustering also allows a simpler integration with incremental computing. If the content of a page changes, the previously computed duplicate status of this page may no longer be valid; we can revise the previously detected groups incrementally.

5.1 Distributed Group Management and Information Serving

We discuss how duplicate groups are constructed and updated as pages are added or updated in an offline data collection. Given a set of recently updated or added pages, we need to identify whether they belong to existing duplicate groups or form new duplicate groups. As shown in Figure 3, we use a two-tier scheme to detect and maintain duplicate group membership as data is being updated.

The Tier 1 service manages the duplicate groups detected previously from all pages in the database and uses this information, without performing extensive comparisons, to answer whether a page is a duplicate of others or not. There are a large number of duplicate groups to be managed in Tier 1. We can distribute these groups to a large number of machines so that each machine is responsible for a subset of duplicate groups. Such a cluster can quickly answer whether a page belongs to an existing duplicate group. The Tier 1 service examines whether a given page q1 has an existing duplicate group. If q1's content has not changed much, then the system can decide whether this page should stay within this group without going through a full comparison with other pages.
To facilitate this, we use a group representative signature vector R to model the content characteristic of this
group. An updated page q1 stays within this group if F(q1, R) is true. This representative signature is normally the signature of the winner page in the group, because such a page tends to be more stable within the group. When a winner page has a content change, we update the representative signature. We do not use feature aggregation such as a centroid of all signature vectors to represent a group. Since the winner will be in the live index, it is more meaningful to assess the impact on relevancy when using this winner to represent the other losers in the online search results. Besides, it also allows us to impose a strict rule to double check the winner's signature against a loser and reduce the error caused by transitive clustering.

When no existing duplicate group is found for a given page q1, or the content of q1 has changed significantly and it no longer belongs to its current group, then this page needs to be compared with other pages. We defer this operation to the Tier 2 detection service. Tier 2 conducts a sequence of many-to-all comparisons, or all-to-all comparisons if needed, to derive duplicate information that Tier 1 does not have. This process repeats while more data is being sent to the duplicate service for processing. Tier 2 accumulates the new or updated pages sent from Tier 1 when Tier 1 cannot decide which group the pages should belong to, as discussed above. If a duplicate of a page q1 is found through the Tier 2 service, then Tier 1 updates its duplicate groups by letting q1 join the newly-detected group. If no duplicates of q1 are found, q1 forms a new group with only one member. This tier can use the multidimensional partitioning and mapping algorithm discussed in Section 4. The reason that Tier 2 accumulates a set of pages before starting the comparison is to reduce excessive overhead, such as the I/O of loading signatures of other pages from secondary storage into memory.
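The two-tier decision flow described above can be sketched as follows. The data structures (`membership`, `reps`, `groups`) and function names are hypothetical simplifications, and the pairwise test F is passed in as a parameter rather than fixed:

```python
def tier1_lookup(page, signature, membership, reps, F):
    # Tier 1 sketch: answer from previously detected groups.
    # membership maps page -> group id; reps maps group id -> the
    # winner's representative signature. Returns the group id when
    # the page still matches its group representative, or None to
    # signal that Tier 2 comparison is needed.
    gid = membership.get(page)
    if gid is not None and F(signature, reps[gid]):
        return gid
    return None

def apply_tier2_result(page, signature, dup_of, membership, reps, groups):
    # Tier 2 outcome sketch: if a duplicate page dup_of was found,
    # the page joins that page's group; otherwise it forms a new
    # singleton group with itself as winner/representative.
    if dup_of is not None:
        gid = membership[dup_of]
    else:
        gid = f"g{len(groups)}"
        groups[gid] = set()
        reps[gid] = signature
    membership[page] = gid
    groups[gid].add(page)
    return gid
```

Tier 1 answers membership queries from the stored groups without pairwise scans, while `apply_tier2_result` records the outcome of a Tier 2 many-to-all comparison: joining a detected group or forming a new singleton.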
When pages are updated, the duplicate groups among them need to be updated too. We describe the group splitting and merging rules used to maintain content-based and redirection-based duplicates as follows. Given a page q1 crawled and added to the database at time t, there are three cases to handle.

Case 1: q1 is an existing page with updated content, and q1 does not redirect to another page. Let the current duplicate group of q1 be g1, detected before time t. If q1 is found to be a duplicate of another page q2 in group g2 through the Tier 2 service after time t, then q1 needs to be removed from g1 and added to group g2. Other pages in the current group g1 that redirect to q1 should also be moved to g2. If q1 is not a duplicate of any existing page, then q1 forms a new group along with any other known pages that redirect to q1; those pages in g1 that redirect to q1 are moved out of g1 and join this new group.

Case 2: q1 does not redirect to another page, and q1 is a new page. If q1 is found to be a duplicate of an existing page q2 after time t through the Tier 2 service, then q1 is added to the current group of q2. If q1 is not a duplicate of any existing page, then q1 forms a new singleton group.

Case 3: q1 redirects to another page qt, directly or indirectly through multiple levels: q1 → q2 → ... → qt, where qt has the real content. We apply the same rules as described in Cases 1 and 2 to qt to identify the group that qt should join. In addition, any page that redirects to qt (directly or indirectly) needs to be moved to the new group of qt.

5.2 Tradeoff of Performance with Accuracy

Clustering and distributed duplicate data management improve performance, but trade off accuracy. We discuss the impact of these techniques on accuracy and the additional techniques used to reduce error. Two types of error are discussed.
A relative error in precision occurs when a pair of pages is classified as duplicates by our scheme while the baseline disagrees. A relative error in recall occurs when a duplicate pair identified by the baseline algorithm is not detected by the two-tier scheme.

Transitive clustering. Pages clustered by transitive relationship carry an approximation that affects detection precision, while it can improve duplicate recall.

Distributed duplicate group management.

Group splitting. When a duplicate group splits according to the above design, the algorithm does not recompute the duplicate relationship within each divided group, which can sometimes create a false positive. Specifically, when F(q1,q2) is true and F(q2,q3) is true before time t, all three pages are in one group. If the content of q2 changes after t and it is moved out of the group, pages q1 and q3 still stay in one group after time t even though no page in the group connects q1 and q3 transitively.

Similarity with group representatives. We compute the similarity with group representatives to reconfirm the group membership of a page with a small content change. That speeds up processing, but can affect precision and recall. We discuss two cases. 1) There are two pages q1 and q3 such that F(q1,q2) and F(q2,q3) are true before time t. After time t, the new version of q2 is similar to the group representative, but there is no page to bridge the duplicate relation of q1 and q3 transitively. 2) Since the new version of page q2 is found to be close to the group representative signature in Tier 1, the Tier 2 comparison is skipped. However, there can exist pages stored on Tier 2 machines which are near duplicates of this new version of q2.

Content update latency. The detection scheme relies on page features to conduct comparison. Given a distributed architecture, there is a latency in receiving the latest version while the comparison tasks are
conducted concurrently and continuously. Thus the staleness of page content due to update latency causes a certain degree of precision and recall loss.

The loss of duplicate recall only affects the scope of cost reduction in the online engine, because online duplicate removal is still available. It is not critical for our goal as long as the percentage of recall loss is small. The impact of precision loss, however, can be significant to the engine's relevance. We use two strategies to retain detection precision. 1) This duplicate processing service is mainly used when the offline indexing pipeline asks for the duplicate status update of pages after they are recrawled or discovered. The system therefore verifies again the similarity of a queried page with its group winner; if the similarity falls below a threshold, we do not report this page as a duplicate of that winner. 2) When a significant number of reports indicate members' dissimilarity with the winner within a group, the group needs to be repaired; one way is to recompute the duplicate membership of the group. With these strategies, our experience is that the above verification process rejects less than 8% of clustered near-duplicate pairs, and the cost of repairing duplicate groups is small in terms of frequency and time. Thus the clustered near duplicates that satisfy the required similarity threshold compared to winners can be removed offline.

6. EVALUATIONS

Our evaluation has the following objectives. 1) Demonstrate the usefulness of incremental computation and the effectiveness of using representative group signatures in saving computation cost. 2) Examine the impact of load balancing, compare 1D, 2D, and 3D mappings for parallel comparisons, and demonstrate the speedup of the service throughput as the number of machines increases.
3) Assess the accuracy of the two-tier architecture and the overall impact of aggressive offline duplicate elimination on the cost and relevancy of the online search engine.

We have implemented an incremental offline duplicate handling system in C++ on Ask.com's platform and used it for processing billions of web pages. This system uses a pairwise signature comparison algorithm developed internally at Ask.com with multiple heuristics and a high threshold to achieve high precision and recall rates. Because of the high precision and recall requirement for offline duplicate removal, an earlier implementation of all-to-all comparison based on this pairwise algorithm was time consuming and had become a bottleneck for the data processing pipeline. The use of this incremental solution with the optimized mapping and balancing scheme speeds up the overall data processing by one order of magnitude. Clustered duplicate groups are also used for various purposes including page selection for the online database in different markets, bad/spam page removal, and crawling control. We discuss our experience and evaluation results in this section. For demonstration purposes, we have also used a subset of a data collection crawled over three weeks, with a size grown from 10 million URLs to 100 million URLs. Some of those pages were recrawled several times during this period. The experiments are conducted on a cluster of machines with dual-core 3GHz Xeon processors and 6GB of memory.

6.1 Effectiveness of multi-dimensional mapping

Figure 4: Load imbalance factor of 1D, 2D, and 3D mapping when the number of machines varies from 2 to 64.

We assess the effectiveness of load balancing with 3D mapping as a dataset of 100 million pages is being updated. The load imbalance factor is computed as the ratio of the maximum number of pages hosted on a machine over the average number among all machines.
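The load imbalance factor defined above is straightforward to compute; a minimal sketch follows, with page counts that are made up for illustration:

```python
# Load imbalance factor: maximum pages hosted on any machine
# divided by the average number of pages over all machines.
def imbalance_factor(pages_per_machine):
    average = sum(pages_per_machine) / len(pages_per_machine)
    return max(pages_per_machine) / average

# Example: four machines hosting 30M, 25M, 25M, and 20M pages (in millions).
# The average is 25M, so the busiest machine carries 1.2x the average load.
print(imbalance_factor([30, 25, 25, 20]))  # 1.2
```

A perfectly balanced assignment yields a factor of exactly 1.0.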
Figure 4 shows the load imbalance factor of 1D, 2D, and 3D mapping on the Y-axis with different numbers of machines on the X-axis. The reason 1D mapping has higher load imbalance is that the skewed size distribution makes it difficult to assign pages evenly to partitions while satisfying the minimum similarity overlapping condition. Both 2D and 3D mapping improve load balancing, while 3D mapping gives more flexibility and performance gains. As the number of desired partitions (or machines) increases, the imbalance factor for 3D becomes relatively smaller than the others. Increasing the dimension number further does not offer significantly more benefits, and for this reason we have chosen 3D mapping for the production setting.

Figure 5: Percentage of inter-machine communication reduced with 3D mapping (curves shown for similarity thresholds 0.5, 0.6, 0.7, 0.8, and 0.9).

Figure 5 shows the communication saved with 3D mapping. The X-axis is the number of machines and the Y-axis is the percentage of reduction in terms of the number of web pages communicated using the 3D algorithm in Section 4. To estimate the communication saving, we compute the total number of web pages that need to be exchanged among machines and the total number of pages that need not be sent. With more machines, the communication volume among machines increases for exchanging pages. 3D
mapping considers partition dissimilarity and can still effectively avoid unnecessary communication.

Figure 6 shows the throughput speedup ratio of a cluster setting with 2, 4, 8, 16, 32, and 64 machines over a single machine in processing parallel comparisons. In the single-machine setting, we record the maximum number of new or updated pages it can handle per second for Tier 2 duplicate comparisons. For each cluster setting, we record the overall maximum number of new or updated pages it can handle per second. The 3D scheme delivers much more speedup than the 1D scheme as the number of machines increases; the 1D scheme faces more challenges in balancing workload, and its overloaded machines slow down the entire system throughput significantly. For the case of 64 machines, the throughput with 3D mapping is 3.3 times that of 1D mapping.

Figure 6: Speedup of processing throughput when the number of Tier 2 machines increases from 2 to 64 for parallel comparisons.

Figure 7: Ratio of non-incremental duplicate detection time over incremental detection time when the data collection size increases from 0 to 100 million URLs.

3D mapping also has an advantage in adapting to page size variation as the database content changes. Because the mapping of partitions to machines is based on sampled data, we compute the coefficient of variation (CV) of load imbalance factors when datasets change. As an example, we use ten different datasets after the initial mapping is determined through a training dataset. CV is defined as the standard deviation divided by the mean value of load imbalance factors. The CV value is 4.5% for 1D mapping while it is 0.29% for 3D mapping. Thus, in addition to having a better load balance, 3D mapping is less sensitive to size variation when the content of a data collection changes.
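The CV computation used here is simply standard deviation over mean; a small sketch, with factor values invented for illustration:

```python
import statistics

# Coefficient of variation (CV) of load imbalance factors across datasets:
# standard deviation divided by the mean.
def coefficient_of_variation(factors):
    return statistics.stdev(factors) / statistics.mean(factors)

# A mapping whose imbalance factors barely move across ten datasets...
stable = [1.05, 1.05, 1.06, 1.05, 1.05, 1.06, 1.05, 1.05, 1.06, 1.05]
# ...versus one whose factors swing widely as the data changes.
sensitive = [1.0, 1.3, 1.1, 1.6, 1.2, 1.5, 1.0, 1.4, 1.1, 1.5]
print(coefficient_of_variation(stable) < coefficient_of_variation(sensitive))  # True
```

A lower CV means the initial partition-to-machine mapping keeps working well as the collection evolves, which is the property claimed for 3D mapping above.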
6.2 Effectiveness of incremental computation

Figure 7 shows the processing time ratio of a non-incremental detection approach over our incremental detection method on a single server, as the data collection hosted grows from 0 to about 100 million URLs on 50 machines. The X-axis is the time at which the duplicate detection time is recorded for comparing the two approaches. Initially the machine hosts a very small data collection, and the per-page response time of detecting duplicates non-incrementally is smaller than that of the incremental approach, so the ratio on the Y-axis is less than 1. As the database grows, it takes longer to conduct comparisons from scratch, and the per-page response time of the non-incremental method becomes slower than the incremental version. The ratio on the Y-axis then exceeds 1 and reaches a 24-fold improvement at the end.

Figure 8: Percentages of updated pages with signatures similar to representative group signatures.

Figure 8 illustrates the usefulness of incorporating representative group signatures in Tier 1 groups, which avoids certain comparison cost. In this experiment, we start with 30 million URLs crawled in one week, and then examine another 30 million URLs crawled one week later. About 50% of the URLs overlap between these two sets. In processing the second set of URLs, we record the percentage of updated URLs whose signatures are similar to representative group signatures within the defined threshold as near duplicates. Figure 8 shows this percentage when the number of pages processed in the second batch is 2.5, 5, 10, 15, 20, 25, and 30 million, respectively. The result shows that around 30% of updated URLs do not have a significant content change and are still similar to group signatures. The computation for those URLs is localized within Tier 1, and comparisons with Tier 2 datasets are skipped to save cost and improve the Tier 2 service throughput.
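The Tier 1 shortcut can be sketched as a signature distance check. The Hamming test on simhash-style integer signatures below is an illustrative stand-in for the engine's internal pairwise comparison, and `max_bits` is a made-up threshold, not one from the paper:

```python
# Sketch of the Tier 1 shortcut: if an updated page's signature stays close
# to its group's representative signature, the page is handled within Tier 1
# and the expensive Tier 2 re-comparison is skipped.

def hamming(sig_a, sig_b):
    """Number of differing bits between two integer signatures."""
    return bin(sig_a ^ sig_b).count("1")

def needs_tier2(page_sig, group_rep_sig, max_bits=3):
    """True when the content change is large enough to warrant Tier 2 re-comparison."""
    return hamming(page_sig, group_rep_sig) > max_bits

assert not needs_tier2(0b10110101, 0b10110100)  # 1-bit change: stays in Tier 1
assert needs_tier2(0b11110000, 0b00001111)      # 8 bits differ: sent to Tier 2
```

Per the measurement above, roughly 30% of updated pages take the cheap branch, which is where the throughput saving comes from.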
6.3 Accuracy of distributed duplicate clustering and group management

Next we compare the detection error of our offline analysis service, distributed on a cluster of machines, against a single-node non-parallel setting where the proposed approximation techniques are not applied. In this experiment, our data collection was crawled over three weeks, with a size grown from 10 million URLs to 100 million URLs. Some of those pages were recrawled several times during this period. To establish a baseline, we run a centralized comparison among all 100 million pages to establish duplicate groups using Ask.com's pairwise comparison procedure.
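One plausible formalization of the relative-error measurement against this baseline treats each scheme's output as a set of detected duplicate pairs. The exact denominators are not spelled out in the text, so the definitions below are an assumption:

```python
# Relative error in precision (REP): pairs the distributed scheme reports
# but the centralized baseline does not, over all pairs the scheme reports.
# Relative error in recall (RER): baseline pairs the scheme misses, over
# all baseline pairs.

def relative_errors(baseline_pairs, distributed_pairs):
    false_positives = distributed_pairs - baseline_pairs
    missed = baseline_pairs - distributed_pairs
    rep = len(false_positives) / len(distributed_pairs)
    rer = len(missed) / len(baseline_pairs)
    return rep, rer

# Toy example: the baseline finds four pairs, the distributed scheme finds
# three, one of which the baseline disagrees with.
baseline = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")}
distributed = {("a", "b"), ("b", "c"), ("e", "f")}
# relative_errors(baseline, distributed) gives REP = 1/3 and RER = 0.5.
```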
Then we replay the crawling trace to send these 100 million pages gradually to the two-tier system. When the data collection size reaches each milestone of 10 million, 20 million, 30 million, 50 million, 75 million, and 100 million, we measure the relative error in precision and recall compared to the single-machine solution. Table 1 shows the relative error rates in percentages as the dataset grew from 10 million up to 100 million pages. Three configurations are tested, with 12, 24, and 36 machines in Tier 2 respectively. The result shows that the accuracy impact of the approximation techniques in the distributed design is fairly small.

Table 1: Percentage of relative error in precision (REP) and in recall (RER) in duplicate clustering. The number of Tier 2 partitions (machines) is 12, 24, and 36.

Machines          10M     20M     30M     50M     75M     100M
12     REP, %     -       0.85    0.9     0.85    1       1.3
       RER, %     -       1.55    1.6     1.67    1.6     1.65
24     REP, %     -       1.2     1.22    1.29    1.3     1.25
       RER, %     -       1.64    1.59    1.68    1.7     1.74
36     REP, %     -       1.07    1.09    1.1     1.1     1.1
       RER, %     -       1.7     1.65    1.69    1.7     1.7

6.4 Impacts on relevancy and cost

While the use of this offline duplicate removal solution has increased the overall index processing speed by an order of magnitude, this scheme does not affect the freshness of the search engine index. The freshness is measured as the latency from the time a URL is discovered to the time it appears in the live index if it is not a duplicate. This is because the offline pipeline queries the duplicate statuses of newly crawled and updated pages and continues to send URLs to the live search engine without waiting for a response from the duplicate analysis service. The pipeline uses the previously developed results stored in the pipeline for online database updates. Once the new duplicate status is received for a set of pages, the pipeline updates the stored duplicate information.
A number of tests have been conducted to evaluate the accuracy of offline page removal and the relevancy impact on queries. One experiment collected the top 10 results from other search engines such as Google for 400K queries over an extended period. These top results are considered non-duplicates of each other. To assess whether our duplicate handling system removed some of them by mistake, the top URLs from other engines that were classified as losers in our system were identified, and their corresponding winners selected by our system were checked to see whether the duplicate relation holds. Dividing the number of incorrect loser-winner pairs found by the total number of URLs examined gives the error rate of duplicate judgment. The monitoring result showed that the daily error rate of duplicate judgment ranged from 0.017% to 0.028%. Many of the failed cases have Javascript/flash content, which is difficult to handle accurately.

We assess the cost saving of shifting most near-duplicate removal from online to offline. We compare the following two schemes after the exact duplicates are removed first: A) The offline system first uses a conservative scheme to remove about 5% of all pages, and the online scheme removes additional duplicates among the top results when answering a query. B) The offline system with our approach removes near duplicates as much as possible (about 33% of all pages), and the online scheme removes the remaining duplicates on the fly.

Figure 9: The saving ratio of total machine cost with a fixed QPS target when the database size changes.

Figure 9 shows the percentage of total machines reduced by scheme B compared to scheme A in accomplishing the same online traffic capacity, which is measured by the number of user queries that can be handled per second. The X-axis is the database size processed in the offline pipeline, which varies from 3 billion to 8 billion pages. The Y-axis is the percentage of machines reduced: (1 - Cost_B / Cost_A).
Cost A is the online cost in terms of the number of machines. Cost B includes both the online cost and the extra offline cost for the added duplicate analysis service. There are four curves displayed for four different configurations, which we explain in detail as follows. In this evaluation, the replication degree of the online system is chosen as 2. The database for the online system is managed in two tiers. The first online tier contains 1.4 billion pages (marked as T1.4) in one configuration and 2.5 billion (marked as T2.5) in another. The URLs in the first online tier are searched for each query, and the second tier is searched only if the results matched from the first online tier are not satisfactory in reaching an internal relevancy score. The selection of online Tier 1 data is based on an offline assessment of URL quality such as web link popularity (whether other web pages point to a page) and click popularity (whether users have visited a page). The chance that a duplicate appears in the first online tier is smaller than in the second tier; however, there is still a significant percentage of duplicates in the first tier. Figure 9 indicates that the overall cost saving varies from 8.2% to 25.6%. The total saving increases as the replication of online machines increases to handle more traffic. Several factors contribute to this saving under scheme B. 1) Fewer machines are needed for hosting duplicates. Even though the second tier receives about 1/3 to 1/4 of total traffic, the total number of machines needed to host near duplicates in scheme A is still significantly more than in B. 2) More traffic is answered by Tier
1 machines. Because online Tier 1 has fewer near duplicates, Tier 1 machines in scheme B can host more unique content and can answer 5% to 10% more traffic, which in turn saves machine budget in Tier 1. 3) Reduced communication overhead for result collection improves the query handling throughput of the online engine. Since the matched candidates in scheme B for answering a query contain more unique results, scheme B collects fewer matched candidates for ranking from each individual machine compared to scheme A. That significantly reduces communication usage and increases the online engine throughput. It should be noted that one may consider reducing the words or position information indexed to save cost. We do not choose this option because of relevancy concerns.

7. CONCLUDING REMARKS

The contribution of this paper is to present our experience in developing a large-scale offline duplicate removal system for processing billions of pages updated continuously. Our parallel clustering scheme is supported by a two-tier architecture for detecting and managing near-duplicate groups with incremental computing and multidimensional data mapping. Transitive approximation simplifies duplicate information management, reduces storage space, and avoids certain comparisons. Our evaluation shows that removing redundant content in an offline system significantly reduces the overall engine cost while sustaining relevancy quality. Incremental update of duplicate groups in this offline two-tier architecture with several approximation techniques greatly speeds up computing time and processing throughput. Multidimensional mapping offers flexibility in improving load balancing and processing throughput efficiency.

Acknowledgments

We thank Apostolos Gerasoulis, Hong Tang, Rahul Lahiri, Vivek Pathak, Yufan Hu, Honghui Lu, and other Ask.com engineers for their kind support in the earlier studies, and the anonymous referees for their helpful comments.
This work was supported in part by NSF IIS. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

8. REFERENCES

[1] Eugene Agichtein, Eric Brill, and Susan Dumais. Improving web search ranking by incorporating user behavior information. In Proceedings of ACM SIGIR '06, 2006.
[2] Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. Efficient exact set-similarity joins. In Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB '06, 2006.
[3] Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling up all pairs similarity search. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, 2007.
[4] Andrei Z. Broder. Identifying and filtering near-duplicate documents. In Proc. of Combinatorial Pattern Matching, 11th Annual Symposium, pages 1-10, 2000.
[5] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. In Proc. of WWW 1997.
[6] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC '02: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, 2002.
[7] Junghoo Cho, Sourashis Roy, and Robert E. Adams. Page quality: in search of an unbiased web ranking. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD '05, 2005.
[8] Abdur Chowdhury, Ophir Frieder, David A. Grossman, and M. Catherine McCabe. Collection statistics for fast duplicate document detection. ACM Trans. Inf. Syst., 20(2), 2002.
[9] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. In Proc. of OSDI, 2004.
[10] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet Wiener. A large-scale study of the evolution of web pages. In WWW '03: Proceedings of the 12th International Conference on World Wide Web, 2003.
[11] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
[12] Hannaneh Hajishirzi, Wen-tau Yih, and Aleksander Kolcz. Adaptive near-duplicate detection via similarity learning. In Proc. of ACM SIGIR, 2010.
[13] Monika Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proc. of SIGIR '06, 2006.
[14] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, 2002.
[15] Aleksander Kolcz, Abdur Chowdhury, and Joshua Alspector. Improved robustness of signature-based near-replica detection via lexicon randomization. In Proceedings of KDD, 2004.
[16] Jimmy Lin. Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce. In SIGIR, 2009.
[17] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Detecting near-duplicates for web crawling. In Proc. of WWW '07, 2007.
[18] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library project.
[19] Daniel Peng and Frank Dabek. Large-scale incremental processing using distributed transactions and notifications. In Proceedings of OSDI '10, 2010.
[20] Sunita Sarawagi and Alok Kirpal. Efficient set joins on similarity predicates. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD '04, 2004.
[21] Martin Theobald, Jonathan Siddharth, and Andreas Paepcke. SpotSigs: robust and efficient near duplicate detection in large web collections. In SIGIR '08, 2008.
[22] Chaokun Wang, Jianmin Wang, Xuemin Lin, Wei Wang, Haixun Wang, Hongsong Li, Wanpeng Tian, Jun Xu, and Rui Li. MapDupReducer: detecting near duplicates over massive datasets. In SIGMOD '10, 2010.
[23] Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu. Efficient similarity joins for near duplicate detection. In Proceedings of the 17th International Conference on World Wide Web, WWW '08. ACM, 2008.
[24] Qi Zhang, Yue Zhang, Haomin Yu, and Xuanjing Huang. Efficient partial-duplicate detection based on sequence matching. In Proc. of SIGIR, 2010.
Is your database application experiencing poor response time, scalability problems, and too many deadlocks or poor application performance? One or a combination of zparms, database design and application
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute
MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context
MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able
Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing
Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 E-commerce recommendation system on cloud computing
SCHEDULING IN CLOUD COMPUTING
SCHEDULING IN CLOUD COMPUTING Lipsa Tripathy, Rasmi Ranjan Patra CSA,CPGS,OUAT,Bhubaneswar,Odisha Abstract Cloud computing is an emerging technology. It process huge amount of data so scheduling mechanism
Dynamical Clustering of Personalized Web Search Results
Dynamical Clustering of Personalized Web Search Results Xuehua Shen CS Dept, UIUC [email protected] Hong Cheng CS Dept, UIUC [email protected] Abstract Most current search engines present the user a ranked
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
How To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
Application of Predictive Analytics for Better Alignment of Business and IT
Application of Predictive Analytics for Better Alignment of Business and IT Boris Zibitsker, PhD [email protected] July 25, 2014 Big Data Summit - Riga, Latvia About the Presenter Boris Zibitsker
On demand synchronization and load distribution for database grid-based Web applications
Data & Knowledge Engineering 51 (24) 295 323 www.elsevier.com/locate/datak On demand synchronization and load distribution for database grid-based Web applications Wen-Syan Li *,1, Kemal Altintas, Murat
Optimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
Analysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
Research on Job Scheduling Algorithm in Hadoop
Journal of Computational Information Systems 7: 6 () 5769-5775 Available at http://www.jofcis.com Research on Job Scheduling Algorithm in Hadoop Yang XIA, Lei WANG, Qiang ZHAO, Gongxuan ZHANG School of
Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines
, 22-24 October, 2014, San Francisco, USA Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines Baosheng Yin, Wei Wang, Ruixue Lu, Yang Yang Abstract With the increasing
Keywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
Using In-Memory Computing to Simplify Big Data Analytics
SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed
A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce
A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce Dimitrios Siafarikas Argyrios Samourkasidis Avi Arampatzis Department of Electrical and Computer Engineering Democritus University of Thrace
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National
Distributed file system in cloud based on load rebalancing algorithm
Distributed file system in cloud based on load rebalancing algorithm B.Mamatha(M.Tech) Computer Science & Engineering [email protected] K Sandeep(M.Tech) Assistant Professor PRRM Engineering College
A Brief Analysis on Architecture and Reliability of Cloud Based Data Storage
Volume 2, No.4, July August 2013 International Journal of Information Systems and Computer Sciences ISSN 2319 7595 Tejaswini S L Jayanthy et al., Available International Online Journal at http://warse.org/pdfs/ijiscs03242013.pdf
1. Comments on reviews a. Need to avoid just summarizing web page asks you for:
1. Comments on reviews a. Need to avoid just summarizing web page asks you for: i. A one or two sentence summary of the paper ii. A description of the problem they were trying to solve iii. A summary of
Search Result Optimization using Annotators
Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,
Self-Compressive Approach for Distributed System Monitoring
Self-Compressive Approach for Distributed System Monitoring Akshada T Bhondave Dr D.Y Patil COE Computer Department, Pune University, India Santoshkumar Biradar Assistant Prof. Dr D.Y Patil COE, Computer
Optimizing the Performance of Your Longview Application
Optimizing the Performance of Your Longview Application François Lalonde, Director Application Support May 15, 2013 Disclaimer This presentation is provided to you solely for information purposes, is not
MAD2: A Scalable High-Throughput Exact Deduplication Approach for Network Backup Services
MAD2: A Scalable High-Throughput Exact Deduplication Approach for Network Backup Services Jiansheng Wei, Hong Jiang, Ke Zhou, Dan Feng School of Computer, Huazhong University of Science and Technology,
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 349 ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 349 Load Balancing Heterogeneous Request in DHT-based P2P Systems Mrs. Yogita A. Dalvi Dr. R. Shankar Mr. Atesh
Big Table A Distributed Storage System For Data
Big Table A Distributed Storage System For Data OSDI 2006 Fay Chang, Jeffrey Dean, Sanjay Ghemawat et.al. Presented by Rahul Malviya Why BigTable? Lots of (semi-)structured data at Google - - URLs: Contents,
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
Distributed Dynamic Load Balancing for Iterative-Stencil Applications
Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,
A Comparison of General Approaches to Multiprocessor Scheduling
A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA [email protected] Michael A. Palis Department of Computer Science Rutgers University
INTRO TO BIG DATA. Djoerd Hiemstra. http://www.cs.utwente.nl/~hiemstra. Big Data in Clinical Medicinel, 30 June 2014
INTRO TO BIG DATA Big Data in Clinical Medicinel, 30 June 2014 Djoerd Hiemstra http://www.cs.utwente.nl/~hiemstra WHY BIG DATA? 2 Source: http://en.wikipedia.org/wiki/mount_everest 3 19 May 2012: 234 people
Best Practices for Hadoop Data Analysis with Tableau
Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks
Web Database Integration
Web Database Integration Wei Liu School of Information Renmin University of China Beijing, 100872, China [email protected] Xiaofeng Meng School of Information Renmin University of China Beijing, 100872,
ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS
CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant
Fast Matching of Binary Features
Fast Matching of Binary Features Marius Muja and David G. Lowe Laboratory for Computational Intelligence University of British Columbia, Vancouver, Canada {mariusm,lowe}@cs.ubc.ca Abstract There has been
Categorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014
Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview
A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster
, pp.11-20 http://dx.doi.org/10.14257/ ijgdc.2014.7.2.02 A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster Kehe Wu 1, Long Chen 2, Shichao Ye 2 and Yi Li 2 1 Beijing
Efficient Similarity Joins for Near Duplicate Detection
Efficient Similarity Joins for Near Duplicate Detection Chuan Xiao Wei Wang Xuemin Lin School of Computer Science and Engineering University of New South Wales Australia {chuanx, weiw, lxue}@cse.unsw.edu.au
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
Sentiment analysis on tweets in a financial domain
Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International
Introduction to DISC and Hadoop
Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and
PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo.
PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo. VLDB 2009 CS 422 Decision Trees: Main Components Find Best Split Choose split
