SimHash-based Effective and Efficient Detecting of Near-Duplicate Short Messages

Transcription

1 ISBN (Print), (CD-ROM) Proceedings of the Second Symposium International Computer Science and Computational Technology(ISCSCT 09) Huangshan, P. R. China, 26-28,Dec. 2009, pp SimHash-based Effective and Efficient Detecting of Near-Duplicate Short Messages Bingfeng Pi, Shunkai Fu, Weilei Wang, and Song Han Roboo Inc., Suzhou, P.R.China {winter.pi, shunkai.fu, willer.wang, Abstract Detecting near-duplicates within huge repository of short message is known as a challenge due to its short length, frequent happenings of typo when typing on mobile phone, flexibility and diversity nature of Chinese language, and the target we prefer, near-duplicate. In this paper, we discuss the real problem met in real application, and try to look for a suitable technique to solve this problem. We start with the discussion of the seriousness of near-duplicate existing in short messages. Then, we review how SimHash works, and its possible merits for finding near-duplicates. Finally, we demonstrate a series of findings, including the problem itself and the benefits brought by SimHash-based approach, based on experiments with 500 thousands of real short messages crawled from Internet. The discussion here is believed a valuable reference for both researchers and applicants. Index Terms Near-duplicate, SimHash, short text I. INTRODUCTION Duplicate and near-duplicate web documents are posing large problems on Web search engines: They increase the space required to store the index, slow down serving results, and annoy the users [2, 3]. Among the data available on Internet, a large proportion are short texts, such as mobile phone short messages, instant messages, chat log, BBS titles etc [1]. It was reported by Information Industry Ministry of China that more than 1.56 billion mobile phone short messages are sent each day in Mainland China [5]. Being an active and popular mobile search service provider in China, our history query log indicates that the short message search enjoys a similar scale of monthly PVs (Page Visit) as Web page search on Roboo [6]. These two vivid facts motivate us to pay enough attention to the quality of our short message repository since it is the basis for quality search service. Unfortunately, the status of duplicate or nearduplicate messages is very severe, especially nearduplicates. For example, the following are two typical examples near-duplicates (all in Chinese): In the first pair, the one above has 4 more characters (highlighted in gray) than the other one, and the remaining part is exactly the same; And in the second pair, the differences contain one character, and two punctuations (all highlighted in gray). These differences may result from several causes: 1) same contents appearing on different sites are all crawled, processed and indexed; 2) mistake introduced while parsing these loosely structured and noisy text (HTML page may contain ads., and it is known as shorting of 2009 ACADEMY PUBLISHER AP-PROC-CS-09CN semantics useful for parsing); 3) manual typos (all information on Internet are created by people originally) and manual revising while being referred and reused; 4) explicit modification to make the short message suitable for difference usage (for example, replacing 春节 (Spring Festival, Chinese traditional New Year) with 新年 (Near Year), though they are actually similar in meaning. Manual checking may be applicable when the scale of repository is small, e.g. hundreds or thousands of instances. When the amount of instances increases to millions and more, obviously, it becomes impossible for human beings to check them one by one, which is tedious, costly and prone to error. Resorting to computers for such kind of repeatable job is desired, of which the core is an algorithm to measure the difference between any pair of short messages, including duplicated and nearduplicated ones. Manku et al. [3] showed that Charikar s SimHash [4] is practically useful identifying near-duplicates in web documents. SimHash is a fingerprint technique enjoying the property that fingerprints of near-duplicates differ only in a small number of bit positions. A SimHash fingerprint is generated for each object. If the fingerprints of two objects are similar, then they are deemed to be near-duplicates. As for a SimHash fingerprint f, Manku et al. developed a technique for identifying whether an existing fingerprint f differs from f in at most k bits. Their experiments show that for a repository of 8 billion pages, 64-bit SimHash fingerprints and k = 3 are reasonable. Another work by Pi et al. [2] confirmed the effect of SimHash and the work by Manku et al; besides, they proposed to do the detection among the results retrieved by a query, i.e. so-called query-biased approach. It reduces the problem scale via divide-and-conquer, replacing global search with local search, and it is open to more settings possibly met in application, e.g. smaller k to remove fewer documents under some condition, and bigger k to delete more documents under other condition. In this paper, we show that SimHash is indeed effective and efficient in detecting both duplicate (with k = 0 ) and near-duplicate (with k > 0 ) (see the two typical examples in TABLE II. ) among large short message repository. However, we also notice that due to the born feature of short messages, k = 3 may not be an ideal parameter for. For example, as shown in TABLE III., k = 2 is enough to detect the one-character

2 difference, but k has to be 5 to detect the same pair of messages with two-character difference. Besides, with the same one-character difference, short messages require larger k for effective detection (TABLE IV. ). This may be explained by an observation, that the same difference, e.g. having one different character on the same position of two short messages, would be more influential to short text than to long text. This is a paper focusing on discussing practical solution for real application, and our contribution is threefold. Firstly, we demonstrate a series of practical values of SimHash-based approach by experiments and our experience. Secondly, we point out that k = 3 may be suitable for near-duplicated Web page detection, but obviously not suitable for short messages. Thirdly, we propose one empirical choice, k = 5, as applied on our online short message search ( In Section 2, we describe how SimHash works, its advantages and disadvantages. Then in Section 3, we present a series of experiments, and discuss the results. A brief review of conventional work is presented in Section 4, followed by conclusion and future work in Section 5. TABLE I. TYPICAL NEAR-DUPLICATES OF SHORT MESSAGES, WITH DIFFERENCES HIGHLIGHTED IN GRAY (1) 春节搞笑春节搞笑祝福短信新年到了, 事儿多了吧? 招待客人别累着, 狼吞虎咽别撑着, 啤的白的别掺着, 孩子别忘照顾着, 最后我的惦念常带着新年快快乐乐的!! (2) 春节搞笑祝福短信新年到了, 事儿多了吧? 招待客人别累着, 狼吞虎咽别撑着, 啤的白的别掺着, 孩子别忘照顾着, 最后我的惦念常带着新年快快乐乐的!! (1) 又是你的生日了, 虽然残破的爱情让我彼此变得陌生, 然而我从未忘你的生日,happy birthday (2) 又是你的生日了, 虽然残破的爱情让我们彼此变得陌生, 然而我从未忘你的生日 Happy birthday! TABLE II. EXAMPLE: DETECT DUPLICATE WITH k = 0 AND NEAR-DUPLICATE WITH k > 0 (WITH DIFFERENCES HIGHLIGHTED IN GRAY) k = 0 (1) 今生今世, 你是我唯一的选择愿我们好好珍惜缘 k > 0 (1) 今生今世, 你是我唯一的选择愿我们好好珍惜缘分, 也请你答应我, 今生今世只为我守侯 k = 2 k = 5 (1) 今生今世, 你是我唯一的选择愿我们好好珍惜缘 (2) 今生今世, 你是我唯一的选择愿我们好好珍惜缘分, 也请你答应我, 今生今世只为我守侯 (1) 愿我们好好珍惜缘分, 也请你答应我, 今生今世只为我守候 (2) 愿我们好好珍惜缘分, 也请你答应我, 今生今世只为我守侯 I. NEAR-DUPLICATE DETECTION BY SIMHASH A. SimHash and Hamming Distance Charikar s SimHash [4], actually, is a fingerprinting technique that produces a compact sketch of the objects being studied, no matter documents discussed here or images. So, it allows for various processing, once applied to original data sets, to be done on the compact sketches, a much smaller and well formatted (fixed length) space. With documents, SimHash works as follows: a Web document is converted into a set of features, each feature tagged with its weight. Then, we transform such a highdimensional vector into an f bit fingerprint where f is quite small compared with the original dimensionality. An excellent comparison of SimHash and the traditional Broder s shingle-based fingerprints [7] can be found in Henzinger [8]. To make the document self contained, here, we give the algorithm s specification in Figure 1., and explain it with a little more detail. We assume the input, document D, is pre-processed and composed with a series of features (tokens). Firstly, we initialize an f -dimensional vector V with each dimension as zero (line 1). Then, for each feature, it is hashed into an f bit hash value. These f bits increment or decrement the f components of the vector by the weight of that features based on the value of each bit of the hash value calculated (line 4-8). Finally, the signs of the components determine the corresponding bits of the final fingerprint (line 9-11). TABLE III. EXAMPLE: DETECT SAME LONG TEXT BUT MORE DIFFRENCE REQUIRES LARGER k (WITH DIFFERENCES HIGHLIGHTED IN GRAY) k = 2 (1) 今生今世, 你是我唯一的选择愿我们好好珍惜缘分, 也请你答应我, 今生今世只为我守侯 k = 5 (1) 今生今世, 你是我唯一的选择愿我们好好珍惜缘分, 也请你答应我, 今生今世只为我等待 TABLE IV. EXAMPLE: DETECT SAME DIFFRENCE BUT SHORTER TEXT REQUIRES LARGER k (WITH DIFFERENCES HIGHLIGHTED IN GRAY) Figure 1. Algorithm specification of SimHash. 21

3 SimHash has two important but somewhat conflicting properties: (1) The fingerprint of a document is a hash of its features, and (2) Similar documents have similar hash values. The latter property is quite different from traditional hash function, like MD5 or SHA-1 (Secure Hash Algorithm), where the hash-values of two documents may be quite different even they are slightly different. This property makes SimHash an ideal technique for detecting near-duplicate ones, determining two documents are similar if their corresponding hashvalues are close to each other. The closer they are, the more similar are these two documents; when the two hash-values are completely same, we actually find two exact duplicates, as what MD5 can achieve. In this project, we choose to construct a 64 bit fingerprint for each web document because it also works well as shown in [1]. Then the detection of near-duplicate documents becomes the search of hash values with k bit difference, which is also known as searching for nearest neighbors in hamming space [3, 4]. How to realize this goal efficiently? One solution is to directly compare each pair of SimHash codes, and its complexity 2 is ON ( ), where N is the size of document repository and each unit comparison needs to compare 64 bits here. A more efficient method as proposed in [1] is implemented as well in this project. It is composed of two steps. Firstly, all f bit SimHash codes are divided into ( k + 1) block(s), and those codes with one same block, say 1,2,, ( k + 1), are grouped into different list. For example, with k = 3, all the SimHash codes with the same 1st, 2nd, 3rd, or 4th block are clustered together. Secondly, given one SimHash code, we can get its 1st block code easily and use it to retrieve a list of which all codes sharing the same 1st block as the given one. Normally, the length of such list is much smaller than the whole size of repository, N. Besides, given the found list, we need only check whether the remaining blocks of the codes differ with k or fewer bits. The same checking need to be applied to the other 3 lists before we find all SimHash codes, i.e. all near-duplicate documents. This search procedure is referred as hamming distance measure by us in the remaining text. B. Advantages and Disadvantages of SimHash SimHash has several advantages for application based on our experience: 1. Transforming into a standard fingerprint makes it applicable for different media content, no matter text, video or audio; 2. Fingerprinting provides compact representation, which not only reduces the storage space greatly but allows for quicker comparison and search; 3. Similar content has similar SimHash code, which permits easier distance function to be determined for application; 4. It is applicable for both duplicate and nearduplicate detection, with k = 0 and k > 0 respectively; 5. Similar processing time for different setting of k if via the proposed divide-and-search mentioned above, and this is valuable for practice since we are able to detect more nearduplicates with no extra cost; 6. The search procedure of similar encoded objects is easily to be implemented in distributed environment based on our implementation experience; 7. From the point of software engineering view, this procedure may be implemented into standard module and be re-used on similar applications, except that the applicants may determine the related parameters themselves. Standard and aligned encoded output (e.g., 64-bit SimHash code) plus the parameter k make it possible to figure out flexible, re-usable and scalable near-duplicate detecting algorithm, like the one implemented in [1,2] and this project as well. TABLE II., TABLE III. and TABLE IV. demonstrate several near-duplicated pairs detected with SimHash. The difference of each pair of short messages and the corresponding k value required for the detection are listed. As we discussed here, SimHash can be applied to short text without any modification on our previous work on page document, i.e. long text. Besides, it is noticed that k = 0 lets us to find exact duplicates, and larger k allows us to detect more difference. However, SimHash has its weak points as well. The text length has great influence on the effect. For example, k = 2 allows us to find the pairs with one different character in TABLE III., but it requires k = 5 in TABLE IV.. Of course, we are lucky to cost similar computing time with different k, but we have to tradeoff manually on the choice of k since determining whether or not nearduplicated is quite vague especially for those detected with large k. Besides, the size of the target objects being studied has influence on the choice of k. That s why we can t directly apply k = 3 here though it is proved effective in our Web page cleaning project. II. EXPERIMENTAL STUDY This is a project aiming at discussing practical solution for real-world large scale application, so it is believed that experiments with real data are highly desired. In this section, we are going to cover the following aspects: The algorithm is effective to find both duplicates and near-duplicates among short-text repository; The problem of near-duplicate is serious, so it is worthy of our effort; k = 3 is not good choice for detecting nearduplicated short texts; SimHash-based approach is flexible, customizable and scalable. A. Our Data We crawl and parse Web pages, extracting and indexing about 500 thousands of short messages for experimental study. Too short messages are filtered first, and the minimum threshold value is 20 here. Note that 22

4 this choice is arbitrary. TABLE V. summarizes the testing repository. TABLE V. BASIC STATISTICS ABOUT THE SHORT MESSAGE REPOSITORY USED FOR TESTING # of messages with length of ,959 Length of longest message 2,968 Length of shortest message 20 Mean length Standard deviation B. Correctness and Effectiveness To make the following discussion sound, it is necessary to verify the algorithm and our implementation. From the examples shown in TABLE II., we notice that: Duplicate pairs are indeed detected with k = 0 ; Near-duplicated pairs have to be detected with k > 0 ; Larger difference requires larger k ; If one near-duplicate can be detected will smaller k, definitely it can be detected by larger k. The reverse is not true. Therefore, with k > 0, we can find both duplicate and nearduplicates. Other than these sample examples, we further randomly select 1000 messages from the whole repository. With k = 3, 65 near-duplicated pairs are found, and they are checked one by one manually. The conclusion is that all 65 pairs are indeed near-duplicates. C. Seriousness of Near-duplicate Problem To reflect the seriousness of both duplicate and nearduplicate among short message repository, we conduct the search with different k on the same test repository respectively. Figure 2 shows the number of nearduplicated pairs detected given different k, ranging from 0 to 10. It is noticed that there are 87,604 pairs detected with k = 0, i.e. duplicate message. It means about 35% (87604 * 2 / ) of total messages are exactly duplicated (note: our hashing is token-based, and space is ignored.) This rate increases to about 57% when k = 10, that is more than half are duplicated or near-duplicated. With such many near-duplicates existing in the repository, we can image the quality of search result a series of same or similar results are piled together and presented to the users. Because normally there is no extra score,like PageRank score in page search, but the similarity score to consider given short message search application, we have no way to improve the user experience but filtering out those duplicated ones. Figure 2. also confirms the discussion in Section 3.2, i.e. larger k allows us find more near-duplicates. By removing those duplicated and near-duplicated ones, storage space is reduced greatly as well. Besides, by reducing the index scale, the online retrieval response should be quick. Therefore, there are several benefits if we are able to delete those repeating texts k<=0 k<=1 k<=2 k<=3 k<=4 k<=5 k<=6 k<=7 k<=8 k<=9 k<=10 Figure 2. The total number of near-duplicated pairs detected with different k, ranging from 0 to 10. D. k = 3 is not Good Choice k = 3 is demonstrated in [1] and [2] as suitable and practical choice for large scale near-duplicate detection of Web document. However, it seems not appropriate for detecting near-duplicated short messages. As the examples of TABLE III. and TABLE IV. indicate, we may only detect one-character difference even with k = 5. In other words, although the pair of short messages is quite similar one another, differ in only one character, they are not detected as near-duplicated with k = 3. Still with the experiments shown in Figure 2, the ratio of nearduplicates detected to all is about 37% for k = 3. That means that about extra 20% (57% - 37%, and 57% is the ratio when k = 10 ) near-duplicates are left in the short message repository if we apply k = 3. Why the same k = 3 doesn t work well enough on short text? It can be explained in a not so formal way. Given a Web page with 1000 characters, and a short text with 50 characters, the influence of adding or deleting or changing one character will have much less influence on the Web page than on the short text. This is also the feature of fingerprinting technique, and we can improve the sensitivity by using more fits while constructing the fingerprint. However, it is not free lunch since the corresponding computing and storage burden is increased meanwhile. Given 64-bit fingerprint, which k is most appropriate is hard to decide in practice. Though we may do a similar experiments like in [1], asking some persons to check manually the number of true positives, true negatives, false positives and false negatives, it is not employed here due to three causes: It is very costly a procedure, in term of money and time; Whether or not near-duplicate is not easy to determined in many cases, even by human beings, say nothing of machine; It is not our and any similar service providers goal to remove all near-duplicates. A practical goal is to alleviate the influence on users to an affordable level. 23

5 On our online short-message search service ( k = 5 is taken, and the general evaluation by several month-person testing is satisfactory, much better than before when there is no any action is taken on the message repository. Figure 3. is the snapshot of our online search of short message, and the result list appearing on the right screen is clean, no duplicated or near-duplicated. This is meaningful since (1) the small screen is made full use of by only displaying unique results; (2) it saves the communication flow for users by displaying no repeating ones; (3) the user is able to find what s/he like in a quicker manner (fewer times of paging down). Figure 3. The home page of our short-message search (left, accessible via and the result list given query 春节 (Spring Festival, Right). E. SimHash-based Approach is Flexible, Customizable and Scalable From the discussion above, we can see that SimHashbased near-duplicate detection algorithm allows us to find both duplicate and near-duplicate ones, which owes to its most nature similar object has similar SimHash code. It is not only applicable to Web documents, but short messages here. The only necessary adjustment is to find a suitable k. Applicants may customize the choice of k based on their goals, i.e. how strictly we want to control the result. Of course, increasing the value of k also increases the risk of removing false negatives. Our experiments with about 500 thousands of data are done on a common PC machine, with 3.06GHz CPU and 1GB memory. Each experiment only cost us dozens of seconds to do the search, and the time is similar for different k. In real production environment, we implement a Hadoop-based [9] version, which allows us easily scale to millions of cases with few machines. Actually, we notice that the encoding, grouping and the comparison procedure is easy to be programmed with MapReduce [10] framework, one famous divide-andconquer distributed computing model. III. RELATED WORK A variety of techniques have been proposed to identify academic plagiarism [11, 12, 13], Web page duplicates [2,3,8, 14] and duplicate database records [15,16]. However, it is noticed that there are very few works on the discussion of detecting near-duplicates among short text repository until recently, including [1, 17]. Gong et al. [1] proposed the SimFinder which employ three techniques, namely, the ad hoc term weighting, the discriminate-term selection and the optimization techniques. It is a fingerprinting-based method as well, but takes some special processing while choosing features and their corresponding weights. Muthmann et al. [17] discussed the near-duplicate detection for Web forums which is another critical resource of user-generated content (UGC) on Internet. It is also built on the basis of fingerprinting technique. However, there is no article about the related work on mobile search application upon preparing this paper. Though the theoretical basis may be similar, identification of near-duplicate short messages is believed much more difficult considering that: 1) it usually contains less than 200 characters, and there are few effective features to extract; 2) it tends to be informal and error prone; 3) the degree of duplicated and nearduplicated is known as more severe than Web documents. All these can be explained by the fact that short messages are very popular and welcome by mobile users, and they are so short to be distributed easily. IV. CONCLUSION AND FUTURE WORK While providing short message search service, we are short of other reference, like the measures by PageRank, to optimize the ranking of results retrieved, but their relative similarity to the query itself. Based on traditional search model, same or similar short messages may pile together in the result list. Besides, it is noticed and nearduplicates are abundant in short text database. Both facts together motive us to pay enough attention to detecting and eliminating them, to ensure the user experience. We review SimHash, and discuss the application of SimHash in detecting near-duplicated short text. SimHash has several advantages, and we prove them based on a series of experiments with real data. Deleting both duplicated and near-duplicate contents has several benefits, especially, for mobile application like us, including that (1) allow more useful information to present on the small screen; (2) save the time and bandwidth for users by reducing the possible times of paging down operation or asking the server for a new page; (3) reduce the storage requirement; (4) reduce the online retrieval time, so as the waiting time of users. User experience will never be over-emphasized on mobile application considering the small screen, difficult inputting and slow connection speed today. It is believed that our discussion here may be valuable reference for applicants like us since our own product is benefiting from this technique currently online. Although there is no special operation taken to process the features in our system, like those appearing in [1,17], it is observed that the existing framework works well online. However, we also notice that there is space there for improvement. For example, we may for further to study the relationship of text length, ratio of difference and suitable k s option. Besides, some advanced NLP 24

6 (Natural Language Processing) techniques may be applied to improve the outcome. For instance, we may recognize and fix typo first before applying SimHash encoding, which is possible to allow us to find more difference with same k. Of course, all extra finer modeling will be paid with more computing resource. REFERENCES [1] C. Gong., Y. Huang., X. Cheng. and S. Bai., Detecting Near-Duplicates in Large-Scale Short Text Databases, Proc. of PAKDD 2008, LNAI, vol. 5012, pp Springer, Heidelberg [2] B. Pi., S.-K. Fu., G. Zou, J. Guo. and H. Song, Querybiased Near-Duplicate Detection: Effective, Efficient and Customizable, Proc. of 4th International Conference on Data Mining (DMIN), Las Vegas, US., [3] G.S. Manku, A. Jain. and A.D. Sarma, Detecting Near- Duplicates for Web Crawling, Proc. of 16th International World Wide Web Conference (WWW), [4] M. Charikar, Similarity Estimation Techniques from Rounding Algorithm, Proc. of 34th Annual Symposium on Theory of Computing (STOC), 2008, pp [5] Official website of Chinese Information Industry Ministry of China: [6] Roboo mobile search engine: [7] A.Broder,, S.C. Glassman, M. Manasse. and G. Zweig, Syntactic clustering of the web, Computer Networks, vol.29, no.8-13, 1997, pp [8] M.R. Henzinger, Finding near-duplicate web documents: a large-scale evaluation of algorithms, Proc. of ACM SIGIR, 2006, pp [9] Hadoop official site: [10] J. Dean and S.Ghemawat, MapReduce: Simplified data processing on large cluster, Proc. of 6th Symposium on Operating System Design and Implementation (OSDI), [11] S.Brin, J.Davis and H.Garcia-Molina, Copy detection mechanisms for digital documents, Proc. of the ACM SIGMOD Annual Conference, San Francisco, CA, [12] N.Shivakumar and H.Garcia-Molina, SCAM : A copy detection mechanism for digital documents, Proc. of 2 nd International Conference in Theory and Practice of Digital Libraries, Austin, Texas, [13] M.Zini, M.Fabbri and M.Mongelia. Plagiarism detection through multilevel text comparison, Proc. of the 2 nd International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution, Leeds, U.K., [14] N.Shivakumar and H.Garnia-Molina, Finding nearreplicas of documents on the web, Proc. of Workshop on Web Databases, Valencia, Spain, [15] Z.P.Tian, H.J.Lu and W.Y.Ji, An n-gram-based approach for detecting approximately duplicate data records, International Journal on Digital Libraries, 5(3): , [16] M.A. Hernandez and S.J.Stolfo, The merge/purge problem for large databases, Proc. of ACM SIGMOD Annual Conference, San Jose, CA., [17] K.Muthmann, W.M.Barczynski, F.Brauer and A.Loser, Near-duplicate detection for web-forums,, , International Database Engineering and Applications Symposium(IDEAS),