SimHash-based Effective and Efficient Detecting of Near-Duplicate Short Messages


Proceedings of the Second Symposium International Computer Science and Computational Technology (ISCSCT '09), Huangshan, P. R. China, 26-28 Dec. 2009

Bingfeng Pi, Shunkai Fu, Weilei Wang, and Song Han
Roboo Inc., Suzhou, P.R. China
{winter.pi, shunkai.fu, willer.wang,

Abstract

Detecting near-duplicates in a huge repository of short messages is challenging because the messages are short, typos are frequent when typing on a mobile phone, the Chinese language is flexible and diverse, and the target is near-duplicates rather than exact duplicates. In this paper, we discuss a real problem met in a real application and look for a suitable technique to solve it. We start by discussing how serious the near-duplicate problem is among short messages. Then we review how SimHash works and its possible merits for finding near-duplicates. Finally, we present a series of findings, covering both the problem itself and the benefits brought by a SimHash-based approach, based on experiments with 500 thousand real short messages crawled from the Internet. We believe the discussion is a valuable reference for both researchers and practitioners.

Index Terms: Near-duplicate, SimHash, short text

I. INTRODUCTION

Duplicate and near-duplicate web documents pose large problems for Web search engines: they increase the space required to store the index, slow down the serving of results, and annoy users [2, 3]. Among the data available on the Internet, a large proportion are short texts, such as mobile phone short messages, instant messages, chat logs, BBS titles, etc. [1]. The Information Industry Ministry of China reported that more than 1.56 billion mobile phone short messages are sent each day in Mainland China [5].
As an active and popular mobile search service provider in China, we see in our historical query log that short message search enjoys a similar scale of monthly PVs (page visits) as Web page search on Roboo [6]. These two facts motivate us to pay close attention to the quality of our short message repository, since it is the basis of a quality search service. Unfortunately, duplicate and, especially, near-duplicate messages are a severe problem. TABLE I shows two typical pairs of near-duplicates (all in Chinese). In the first pair, the upper message has 4 more characters (highlighted in gray) than the lower one, and the remaining part is exactly the same. In the second pair, the differences are one character and two punctuation marks (all highlighted in gray).

These differences may result from several causes: 1) the same content appears on different sites, and all copies are crawled, processed and indexed; 2) mistakes are introduced while parsing loosely structured and noisy text (an HTML page may contain ads, and HTML is known to lack the semantics useful for parsing); 3) manual typos (all information on the Internet is originally created by people) and manual revision while the text is referred to and reused; 4) explicit modification to make the short message suitable for a different usage, for example replacing 春节 (Spring Festival, the traditional Chinese New Year) with 新年 (New Year), although the two are similar in meaning.

Manual checking may be feasible when the repository is small, e.g. hundreds or thousands of instances. When the number of instances grows to millions or more, it becomes impossible for human beings to check them one by one, a task that is tedious, costly and error-prone.
Resorting to computers for this kind of repeatable job is desirable, and its core is an algorithm that measures the difference between any pair of short messages, including duplicates and near-duplicates. Manku et al. [3] showed that Charikar's SimHash [4] is practically useful for identifying near-duplicates in web documents. SimHash is a fingerprinting technique with the property that fingerprints of near-duplicates differ in only a small number of bit positions. A SimHash fingerprint is generated for each object. If the fingerprints of two objects are similar, the objects are deemed to be near-duplicates. Given a SimHash fingerprint f, Manku et al. developed a technique for identifying whether an existing fingerprint f' differs from f in at most k bits. Their experiments show that for a repository of 8 billion pages, 64-bit SimHash fingerprints and k = 3 are reasonable. Another work by Pi et al. [2] confirmed the effectiveness of SimHash and the work by Manku et al.; in addition, they proposed running the detection among the results retrieved by a query, the so-called query-biased approach. It reduces the problem scale via divide-and-conquer, replacing global search with local search, and it is open to more settings that may be met in applications, e.g. a smaller k to remove fewer documents under some conditions, and a bigger k to delete more documents under others.

In this paper, we show that SimHash is indeed effective and efficient in detecting both duplicates (with k = 0) and near-duplicates (with k > 0) in a large short message repository (see the two typical examples in TABLE II). However, we also notice that, due to the inherent nature of short messages, k = 3 may not be an ideal parameter for them. For example, as shown in TABLE III, k = 2 is enough to detect the one-character

difference, but k has to be 5 to detect the same pair of messages with a two-character difference. Besides, for the same one-character difference, shorter messages require a larger k for effective detection (TABLE IV). This can be explained by the observation that the same difference, e.g. one different character at the same position of two messages, has more influence on short text than on long text.

This paper focuses on a practical solution for a real application, and our contribution is threefold. First, we demonstrate a series of practical values of the SimHash-based approach through experiments and our experience. Second, we point out that k = 3 may be suitable for near-duplicate Web page detection but is not suitable for short messages. Third, we propose one empirical choice, k = 5, as applied in our online short message search.

In Section 2, we describe how SimHash works and its advantages and disadvantages. In Section 3, we present a series of experiments and discuss the results. A brief review of related work is presented in Section 4, followed by the conclusion and future work in Section 5.

TABLE I. TYPICAL NEAR-DUPLICATES OF SHORT MESSAGES, WITH DIFFERENCES HIGHLIGHTED IN GRAY

Pair 1
(1) 春节搞笑春节搞笑祝福短信新年到了,事儿多了吧?招待客人别累着,狼吞虎咽别撑着,啤的白的别掺着,孩子别忘照顾着,最后我的惦念常带着新年快快乐乐的!!
(2) 春节搞笑祝福短信新年到了,事儿多了吧?招待客人别累着,狼吞虎咽别撑着,啤的白的别掺着,孩子别忘照顾着,最后我的惦念常带着新年快快乐乐的!!

Pair 2
(1) 又是你的生日了,虽然残破的爱情让我彼此变得陌生,然而我从未忘你的生日,happy birthday
(2) 又是你的生日了,虽然残破的爱情让我们彼此变得陌生,然而我从未忘你的生日 Happy birthday!

TABLE II. EXAMPLE: DETECT DUPLICATE WITH k = 0 AND NEAR-DUPLICATE WITH k > 0 (WITH DIFFERENCES HIGHLIGHTED IN GRAY)

k = 0
(1) 今生今世,你是我唯一的选择愿我们好好珍惜缘
k > 0
(1) 今生今世,你是我唯一的选择愿我们好好珍惜缘分,也请你答应我,今生今世只为我守侯
k = 2
k = 5
(1) 今生今世,你是我唯一的选择愿我们好好珍惜缘
(2) 今生今世,你是我唯一的选择愿我们好好珍惜缘分,也请你答应我,今生今世只为我守侯
(1) 愿我们好好珍惜缘分,也请你答应我,今生今世只为我守候
(2) 愿我们好好珍惜缘分,也请你答应我,今生今世只为我守侯

II. NEAR-DUPLICATE DETECTION BY SIMHASH

A. SimHash and Hamming Distance

Charikar's SimHash [4] is a fingerprinting technique that produces a compact sketch of the objects being studied, whether documents, as discussed here, or images. Processing that would otherwise be applied to the original data set can therefore be done on the compact sketches, a much smaller and well-formatted (fixed-length) space. With documents, SimHash works as follows: a Web document is converted into a set of features, each tagged with its weight. This high-dimensional vector is then transformed into an f-bit fingerprint, where f is quite small compared with the original dimensionality. An excellent comparison of SimHash and Broder's traditional shingle-based fingerprints [7] can be found in Henzinger [8].

To make the paper self-contained, we give the algorithm's specification in Figure 1 and explain it in a little more detail. We assume the input, document D, is pre-processed and composed of a series of features (tokens). First, we initialize an f-dimensional vector V with every dimension set to zero (line 1). Then each feature is hashed into an f-bit hash value. These f bits increment or decrement the f components of the vector by the weight of that feature, depending on the value of each bit of the hash value (lines 4-8). Finally, the signs of the components determine the corresponding bits of the final fingerprint (lines 9-11).

TABLE III. EXAMPLE: DETECT SAME LONG TEXT BUT MORE DIFFERENCE REQUIRES LARGER k (WITH DIFFERENCES HIGHLIGHTED IN GRAY)

k = 2
(1) 今生今世,你是我唯一的选择愿我们好好珍惜缘分,也请你答应我,今生今世只为我守侯
k = 5
(1) 今生今世,你是我唯一的选择愿我们好好珍惜缘分,也请你答应我,今生今世只为我等待

TABLE IV. EXAMPLE: DETECT SAME DIFFERENCE BUT SHORTER TEXT REQUIRES LARGER k (WITH DIFFERENCES HIGHLIGHTED IN GRAY)

Figure 1. Algorithm specification of SimHash.
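The steps described above (zero vector, per-feature hashing, weighted increment or decrement, sign to bit) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Figure 1 listing: the feature hash (MD5 truncated to f bits), the character-bigram features, the frequency weights, and all function names are hypothetical choices made only for this example.

```python
import hashlib
from collections import Counter


def simhash(features, f=64):
    """Compute an f-bit SimHash fingerprint from a {feature: weight} mapping.

    Assumption: each feature is hashed with MD5 and truncated to f bits;
    the paper does not say which feature hash is used.
    """
    assert f % 8 == 0 and f <= 128, "MD5 yields at most 128 bits"
    v = [0] * f                      # one signed counter per bit position
    for feature, weight in features.items():
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[: f // 8], "big")
        for i in range(f):
            if (h >> i) & 1:         # bit i of the feature hash is 1
                v[i] += weight
            else:                    # bit i of the feature hash is 0
                v[i] -= weight
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:                 # positive component gives a 1 bit
            fingerprint |= 1 << i
    return fingerprint


def char_bigrams(text):
    """Assumed feature extractor: character bigrams weighted by frequency."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


fp1 = simhash(char_bigrams("愿我们好好珍惜缘分,也请你答应我,今生今世只为我守候"))
fp2 = simhash(char_bigrams("愿我们好好珍惜缘分,也请你答应我,今生今世只为我守侯"))
print(hamming_distance(fp1, fp2))    # small for near-duplicates, 0 for duplicates
```

The distance printed for a given pair depends on the feature extractor and hash chosen, so it will not necessarily match the k values reported in the tables.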

SimHash has two important but somewhat conflicting properties: (1) the fingerprint of a document is a hash of its features, and (2) similar documents have similar hash values. The latter property is quite different from traditional hash functions such as MD5 or SHA-1 (Secure Hash Algorithm), where the hash values of two documents may be quite different even if the documents differ only slightly. This property makes SimHash an ideal technique for detecting near-duplicates: two documents are deemed similar if their hash values are close to each other. The closer the values, the more similar the two documents; when the two hash values are exactly the same, we have found two exact duplicates, which is what MD5 can achieve.

In this project, we choose to construct a 64-bit fingerprint for each web document because it works well as shown in [1]. The detection of near-duplicate documents then becomes the search for hash values that differ in at most k bits, which is also known as searching for nearest neighbors in Hamming space [3, 4]. How can this goal be realized efficiently? One solution is to compare each pair of SimHash codes directly; its complexity is O(N^2), where N is the size of the document repository and each unit comparison compares 64 bits.

A more efficient method, as proposed in [1], is also implemented in this project. It is composed of two steps. First, every f-bit SimHash code is divided into (k + 1) blocks, and codes that share the same value in a given block, say block 1, 2, ..., or (k + 1), are grouped into the same list. For example, with k = 3, all SimHash codes with the same 1st, 2nd, 3rd, or 4th block are clustered together. This grouping loses no matches: two codes that differ in at most k bits must agree exactly on at least one of the k + 1 blocks. Second, given one SimHash code, we can easily get its 1st block and use it to retrieve the list of all codes sharing that 1st block. Normally, such a list is much shorter than the whole repository of size N.
Besides, given the retrieved list, we need only check whether the remaining blocks of the codes differ in k or fewer bits. The same check needs to be applied to the other 3 lists (for k = 3) before we have found all matching SimHash codes, i.e. all near-duplicate documents. We refer to this search procedure as the Hamming distance measure in the remaining text.

B. Advantages and Disadvantages of SimHash

Based on our experience, SimHash has several advantages for applications:

1. Transforming content into a standard fingerprint makes it applicable to different media, whether text, video or audio;
2. Fingerprinting provides a compact representation, which not only reduces the storage space greatly but also allows quicker comparison and search;
3. Similar content has similar SimHash codes, which makes it easier to define a distance function for an application;
4. It is applicable to both duplicate and near-duplicate detection, with k = 0 and k > 0 respectively;
5. The processing time is similar for different settings of k when the divide-and-search procedure described above is used, which is valuable in practice since we can detect more near-duplicates at no extra cost;
6. The search procedure for similar encoded objects is easy to implement in a distributed environment, based on our implementation experience;
7. From a software engineering point of view, this procedure may be implemented as a standard module and re-used in similar applications, with practitioners setting the related parameters themselves. A standard and aligned encoded output (e.g., a 64-bit SimHash code) plus the parameter k make it possible to build a flexible, re-usable and scalable near-duplicate detection algorithm, like the one implemented in [1, 2] and in this project.

TABLE II, TABLE III and TABLE IV show several near-duplicate pairs detected with SimHash. The difference within each pair of short messages and the corresponding k value required for the detection are listed.
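The two-step search described in Section A (group fingerprints by block, then verify the candidates that share a block) can be sketched as follows. This is a minimal single-machine illustration under assumptions, not the authors' implementation: the in-memory dictionary index, the contiguous block layout and the function names are hypothetical.

```python
from collections import defaultdict


def split_blocks(fp, k, f=64):
    """Split an f-bit fingerprint into k + 1 contiguous blocks, low bits first."""
    n = k + 1
    base, extra = divmod(f, n)       # the first `extra` blocks get one more bit
    blocks, shift = [], 0
    for i in range(n):
        width = base + (1 if i < extra else 0)
        blocks.append((fp >> shift) & ((1 << width) - 1))
        shift += width
    return blocks


def build_index(fingerprints, k, f=64):
    """Step 1: group fingerprints by (block position, block value)."""
    index = defaultdict(list)
    for fp in fingerprints:
        for i, block in enumerate(split_blocks(fp, k, f)):
            index[(i, block)].append(fp)
    return index


def find_near_duplicates(query, index, k, f=64):
    """Step 2: fetch candidates sharing a block, keep those within k bits."""
    found = set()
    for i, block in enumerate(split_blocks(query, k, f)):
        for candidate in index.get((i, block), ()):
            if bin(candidate ^ query).count("1") <= k:
                found.add(candidate)
    return found


fingerprints = [0, 0b111, 0b1111, (1 << 0) | (1 << 16) | (1 << 32)]
index = build_index(fingerprints, k=3)
print(sorted(find_near_duplicates(0, index, k=3)))  # 0b1111 differs in 4 bits and is excluded
```

Because any two codes within k bits of each other must share at least one of the k + 1 blocks, the candidate lists never miss a true match; the final Hamming check only discards candidates that share a block but differ too much elsewhere.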
As discussed, SimHash can be applied to short text without any modification to our previous implementation for page documents, i.e. long text. Besides, k = 0 lets us find exact duplicates, and a larger k allows us to detect larger differences.

However, SimHash has its weak points as well. The text length has a great influence on the effect. For example, k = 2 allows us to find the pair with one different character in TABLE III, but the same difference requires k = 5 in TABLE IV. We are fortunate that the computing time is similar for different k, but we have to make a manual trade-off when choosing k, since deciding whether two messages are near-duplicates is quite vague, especially for pairs detected with a large k. Besides, the size of the target objects being studied influences the choice of k. That is why we cannot directly apply k = 3 here, although it has proved effective in our Web page cleaning project.

III. EXPERIMENTAL STUDY

This project aims at a practical solution for real-world, large-scale applications, so experiments with real data are highly desirable. In this section, we cover the following aspects:

The algorithm is effective in finding both duplicates and near-duplicates in a short-text repository;
The near-duplicate problem is serious, so it is worth our effort;
k = 3 is not a good choice for detecting near-duplicate short texts;
The SimHash-based approach is flexible, customizable and scalable.

A. Our Data

We crawl and parse Web pages, extracting and indexing about 500 thousand short messages for the experimental study. Messages that are too short are filtered out first, and the minimum length threshold is 20 here. Note that

this choice is arbitrary. TABLE V summarizes the testing repository.

TABLE V. BASIC STATISTICS ABOUT THE SHORT MESSAGE REPOSITORY USED FOR TESTING

# of messages with length of ,959
Length of longest message: 2,968
Length of shortest message: 20
Mean length
Standard deviation

B. Correctness and Effectiveness

To make the following discussion sound, it is necessary to verify the algorithm and our implementation. From the examples shown in TABLE II, we notice that:

Duplicate pairs are indeed detected with k = 0;
Near-duplicate pairs have to be detected with k > 0;
A larger difference requires a larger k;
If a near-duplicate can be detected with a smaller k, it can definitely be detected with a larger k. The reverse is not true. Therefore, with k > 0, we can find both duplicates and near-duplicates.

Beyond these sample examples, we further select 1000 messages at random from the whole repository. With k = 3, 65 near-duplicate pairs are found, and they are checked one by one manually. The conclusion is that all 65 pairs are indeed near-duplicates.

C. Seriousness of Near-duplicate Problem

To reflect the seriousness of both duplicates and near-duplicates in the short message repository, we run the search with different k on the same test repository. Figure 2 shows the number of near-duplicate pairs detected for different k, ranging from 0 to 10. There are 87,604 pairs detected with k = 0, i.e. duplicate messages. This means that about 35% (87,604 * 2 divided by the total number of messages) of all messages are exact duplicates (note: our hashing is token-based, and spaces are ignored). This rate increases to about 57% when k = 10, that is, more than half of the messages are duplicates or near-duplicates. With so many near-duplicates in the repository, we can imagine the quality of the search results: a series of identical or similar results are piled together and presented to the users.
Because short message search normally has no additional score to consider other than the similarity score, unlike the PageRank score in page search, we have no way to improve the user experience except by filtering out the duplicates. Figure 2 also confirms the discussion in Section 3.2, i.e. a larger k allows us to find more near-duplicates. Removing duplicates and near-duplicates also reduces storage space greatly. Besides, with a smaller index, online retrieval should respond more quickly. Therefore, there are several benefits if we are able to delete the repeated texts.

Figure 2. The total number of near-duplicated pairs detected with different k, ranging from 0 to 10 (horizontal axis: k<=0 through k<=10).

D. k = 3 is not Good Choice

k = 3 is demonstrated in [1] and [2] to be a suitable and practical choice for large-scale near-duplicate detection of Web documents. However, it does not seem appropriate for detecting near-duplicate short messages. As the examples in TABLE III and TABLE IV indicate, we may only be able to detect a one-character difference even with k = 5. In other words, although the two short messages in a pair are quite similar to one another, differing in only one character, they are not detected as near-duplicates with k = 3. In the experiments shown in Figure 2, the ratio of near-duplicates detected to all messages is about 37% for k = 3. That means that about 20% more near-duplicates (57% - 37%, where 57% is the ratio when k = 10) are left in the short message repository if we apply k = 3.

Why does the same k = 3 not work well enough on short text? It can be explained informally. Given a Web page with 1000 characters and a short text with 50 characters, adding, deleting or changing one character has much less influence on the Web page than on the short text.
This is also a characteristic of fingerprinting techniques, and we can improve the sensitivity by using more bits when constructing the fingerprint. However, there is no free lunch, since the computing and storage burden increases at the same time.

Given a 64-bit fingerprint, which k is most appropriate is hard to decide in practice. Though we could run an experiment similar to that in [1], asking several people to count manually the true positives, true negatives, false positives and false negatives, we do not do so here for three reasons:

It is a very costly procedure in terms of money and time;
Whether or not two messages are near-duplicates is hard to determine in many cases, even for human beings, let alone machines;
It is not the goal of ours, or of any similar service provider, to remove all near-duplicates. A practical goal is to reduce their influence on users to an acceptable level.
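The informal length argument above can be made concrete with a small calculation. The sketch below assumes character bigrams as features (an assumption for illustration only; the paper's tokenization is not specified) and measures what fraction of a text's features is touched when a single interior character is substituted. It measures the change in features, not the resulting number of flipped fingerprint bits.

```python
def bigrams(text):
    """Character bigrams in order of position (assumed feature type)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def fraction_of_bigrams_changed(length):
    """Substitute the middle character of a text and count affected bigrams."""
    original = "a" * length
    mid = length // 2
    edited = original[:mid] + "b" + original[mid + 1:]
    pairs = list(zip(bigrams(original), bigrams(edited)))
    changed = sum(1 for x, y in pairs if x != y)
    return changed / len(pairs)


for n in (50, 1000):
    print(n, f"{fraction_of_bigrams_changed(n):.2%}")
# 50 4.08%
# 1000 0.20%
```

Under this assumption, one substituted character disturbs about twenty times as large a share of the features in a 50-character message as in a 1000-character page, which is consistent with short messages needing a larger k for the same edit.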

On our online short-message search service, k = 5 is used, and the overall evaluation after several person-months of testing is satisfactory, much better than before, when no action was taken on the message repository. Figure 3 is a snapshot of our online short message search; the result list on the right screen is clean, with no duplicates or near-duplicates. This is meaningful since (1) the small screen is fully used by displaying only unique results; (2) it saves communication traffic for users by not displaying repeated results; (3) users are able to find what they like more quickly (paging down fewer times).

Figure 3. The home page of our short-message search (left) and the result list for the query 春节 (Spring Festival) (right).

E. SimHash-based Approach is Flexible, Customizable and Scalable

From the discussion above, we can see that the SimHash-based near-duplicate detection algorithm allows us to find both duplicates and near-duplicates, owing to its essential property that similar objects have similar SimHash codes. It is applicable not only to Web documents but also to the short messages studied here. The only necessary adjustment is to find a suitable k. Practitioners may customize the choice of k based on their goals, i.e. how strictly they want to control the result. Of course, increasing the value of k also increases the risk of false positives, i.e. removing messages that are not really near-duplicates.

Our experiments with about 500 thousand messages are done on a common PC with a 3.06 GHz CPU and 1 GB of memory. Each experiment takes only dozens of seconds for the search, and the time is similar for different k. In the real production environment, we implement a Hadoop-based [9] version, which allows us to scale easily to millions of cases with a few machines. In fact, we notice that the encoding, grouping and comparison procedures are easy to program with the MapReduce [10] framework, a well-known divide-and-conquer distributed computing model.

IV. RELATED WORK

A variety of techniques have been proposed to identify academic plagiarism [11, 12, 13], Web page duplicates [2, 3, 8, 14] and duplicate database records [15, 16]. However, until recently there have been very few works on detecting near-duplicates in short text repositories; they include [1, 17]. Gong et al. [1] proposed SimFinder, which employs three techniques, namely ad hoc term weighting, discriminative-term selection and optimization techniques. It is a fingerprinting-based method as well, but applies special processing when choosing features and their corresponding weights. Muthmann et al. [17] discussed near-duplicate detection for Web forums, another critical source of user-generated content (UGC) on the Internet. It is also built on fingerprinting. However, at the time of preparing this paper we found no article about related work on mobile search applications. Though the theoretical basis may be similar, identifying near-duplicate short messages is believed to be much more difficult considering that: 1) a short message usually contains fewer than 200 characters, so there are few effective features to extract; 2) it tends to be informal and error-prone; 3) duplication and near-duplication are known to be more severe than in Web documents. All of these can be explained by the fact that short messages are very popular among mobile users and are short enough to be distributed easily.

V. CONCLUSION AND FUTURE WORK

In providing a short message search service, we lack other references, such as PageRank measures, for optimizing the ranking of the retrieved results; we have only their relative similarity to the query itself. Under the traditional search model, identical or similar short messages may pile together in the result list. Besides, near-duplicates are abundant in short text databases.
Both facts together motivate us to pay close attention to detecting and eliminating them in order to ensure the user experience. We review SimHash and discuss its application in detecting near-duplicate short text. SimHash has several advantages, and we demonstrate them with a series of experiments on real data. Deleting both duplicate and near-duplicate content has several benefits, especially for mobile applications like ours: (1) more useful information can be presented on the small screen; (2) users save time and bandwidth, since they page down or ask the server for a new page less often; (3) the storage requirement is reduced; (4) the online retrieval time, and hence the users' waiting time, is reduced. User experience can never be over-emphasized in mobile applications, considering today's small screens, difficult input and slow connection speeds. We believe our discussion may be a valuable reference for practitioners like us, since our own online product currently benefits from this technique.

Although no special operation is taken to process the features in our system, like those appearing in [1, 17], we observe that the existing framework works well online. However, we also notice that there is room for improvement. For example, we may further study the relationship between text length, ratio of difference and a suitable choice of k. Besides, some advanced NLP

(Natural Language Processing) techniques may be applied to improve the outcome. For instance, we may recognize and fix typos before applying SimHash encoding, which may allow us to find more near-duplicates with the same k. Of course, any finer modeling will be paid for with more computing resources.

REFERENCES

[1] C. Gong, Y. Huang, X. Cheng, and S. Bai, Detecting Near-Duplicates in Large-Scale Short Text Databases, Proc. of PAKDD 2008, LNAI, vol. 5012, Springer, Heidelberg.
[2] B. Pi, S.-K. Fu, G. Zou, J. Guo, and H. Song, Query-biased Near-Duplicate Detection: Effective, Efficient and Customizable, Proc. of 4th International Conference on Data Mining (DMIN), Las Vegas, US.
[3] G. S. Manku, A. Jain, and A. Das Sarma, Detecting Near-Duplicates for Web Crawling, Proc. of 16th International World Wide Web Conference (WWW).
[4] M. Charikar, Similarity Estimation Techniques from Rounding Algorithms, Proc. of 34th Annual Symposium on Theory of Computing (STOC), 2002.
[5] Official website of the Information Industry Ministry of China.
[6] Roboo mobile search engine.
[7] A. Broder, S. C. Glassman, M. Manasse, and G. Zweig, Syntactic clustering of the web, Computer Networks, vol. 29, no. 8-13, 1997.
[8] M. R. Henzinger, Finding near-duplicate web documents: a large-scale evaluation of algorithms, Proc. of ACM SIGIR, 2006.
[9] Hadoop official site.
[10] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Proc. of 6th Symposium on Operating System Design and Implementation (OSDI).
[11] S. Brin, J. Davis, and H. Garcia-Molina, Copy detection mechanisms for digital documents, Proc. of the ACM SIGMOD Annual Conference, San Francisco, CA.
[12] N. Shivakumar and H. Garcia-Molina, SCAM: A copy detection mechanism for digital documents, Proc. of 2nd International Conference in Theory and Practice of Digital Libraries, Austin, Texas.
[13] M. Zini, M. Fabbri, and M. Mongelia, Plagiarism detection through multilevel text comparison, Proc. of the 2nd International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution, Leeds, U.K.
[14] N. Shivakumar and H. Garcia-Molina, Finding near-replicas of documents on the web, Proc. of Workshop on Web Databases, Valencia, Spain.
[15] Z. P. Tian, H. J. Lu, and W. Y. Ji, An n-gram-based approach for detecting approximately duplicate data records, International Journal on Digital Libraries, 5(3).
[16] M. A. Hernandez and S. J. Stolfo, The merge/purge problem for large databases, Proc. of ACM SIGMOD Annual Conference, San Jose, CA.
[17] K. Muthmann, W. M. Barczynski, F. Brauer, and A. Löser, Near-duplicate detection for web-forums, International Database Engineering and Applications Symposium (IDEAS).


More information

Pattern Insight Clone Detection

Pattern Insight Clone Detection Pattern Insight Clone Detection TM The fastest, most effective way to discover all similar code segments What is Clone Detection? Pattern Insight Clone Detection is a powerful pattern discovery technology

More information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

More information

The Application Research of Ant Colony Algorithm in Search Engine Jian Lan Liu1, a, Li Zhu2,b

The Application Research of Ant Colony Algorithm in Search Engine Jian Lan Liu1, a, Li Zhu2,b 3rd International Conference on Materials Engineering, Manufacturing Technology and Control (ICMEMTC 2016) The Application Research of Ant Colony Algorithm in Search Engine Jian Lan Liu1, a, Li Zhu2,b

More information

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL Krishna Kiran Kattamuri 1 and Rupa Chiramdasu 2 Department of Computer Science Engineering, VVIT, Guntur, India

More information

Digital Evidence Search Kit

Digital Evidence Search Kit Digital Evidence Search Kit K.P. Chow, C.F. Chong, K.Y. Lai, L.C.K. Hui, K. H. Pun, W.W. Tsang, H.W. Chan Center for Information Security and Cryptography Department of Computer Science The University

More information

An Efficient Load Balancing Technology in CDN

An Efficient Load Balancing Technology in CDN Issue 2, Volume 1, 2007 92 An Efficient Load Balancing Technology in CDN YUN BAI 1, BO JIA 2, JIXIANG ZHANG 3, QIANGGUO PU 1, NIKOS MASTORAKIS 4 1 College of Information and Electronic Engineering, University

More information

Image Search by MapReduce

Image Search by MapReduce Image Search by MapReduce COEN 241 Cloud Computing Term Project Final Report Team #5 Submitted by: Lu Yu Zhe Xu Chengcheng Huang Submitted to: Prof. Ming Hwa Wang 09/01/2015 Preface Currently, there s

More information

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

More information

A Web Service for Scholarly Big Data Information Extraction

A Web Service for Scholarly Big Data Information Extraction 2014 IEEE International Conference on Web Services A Web Service for Scholarly Big Data Information Extraction Kyle Williams, Lichi Li, Madian Khabsa, Jian Wu, Patrick C. Shih and C. Lee Giles Information

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

Near Duplicate Document Detection Survey

Near Duplicate Document Detection Survey Near Duplicate Document Detection Survey Bassma S. Alsulami, Maysoon F. Abulkhair, Fathy E. Eassa Faculty of Computing and Information Technology King AbdulAziz University Jeddah, Saudi Arabia Abstract

More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

Micro blogs Oriented Word Segmentation System

Micro blogs Oriented Word Segmentation System Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,

More information

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02) Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

More information

MapReduce With Columnar Storage

MapReduce With Columnar Storage SEMINAR: COLUMNAR DATABASES 1 MapReduce With Columnar Storage Peitsa Lähteenmäki Abstract The MapReduce programming paradigm has achieved more popularity over the last few years as an option to distributed

More information

Object Request Reduction in Home Nodes and Load Balancing of Object Request in Hybrid Decentralized Web Caching

Object Request Reduction in Home Nodes and Load Balancing of Object Request in Hybrid Decentralized Web Caching 2012 2 nd International Conference on Information Communication and Management (ICICM 2012) IPCSIT vol. 55 (2012) (2012) IACSIT Press, Singapore DOI: 10.7763/IPCSIT.2012.V55.5 Object Request Reduction

More information

Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems

Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems cation systems. For example, NLP could be used in Question Answering (QA) systems to understand users natural

More information

Optimization of Search Results with De-Duplication of Web Pages In a Mobile Web Crawler

Optimization of Search Results with De-Duplication of Web Pages In a Mobile Web Crawler Optimization of Search Results with De-Duplication of Web Pages In a Mobile Web Crawler Monika 1, Mona 2, Prof. Ela Kumar 3 Department of Computer Science and Engineering, Indira Gandhi Delhi Technical

More information

Open Access Research and Realization of the Extensible Data Cleaning Framework EDCF

Open Access Research and Realization of the Extensible Data Cleaning Framework EDCF Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2015, 7, 2039-2043 2039 Open Access Research and Realization of the Extensible Data Cleaning Framework

More information

Fault Analysis in Software with the Data Interaction of Classes

Fault Analysis in Software with the Data Interaction of Classes , pp.189-196 http://dx.doi.org/10.14257/ijsia.2015.9.9.17 Fault Analysis in Software with the Data Interaction of Classes Yan Xiaobo 1 and Wang Yichen 2 1 Science & Technology on Reliability & Environmental

More information

Creating Synthetic Temporal Document Collections for Web Archive Benchmarking

Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Kjetil Nørvåg and Albert Overskeid Nybø Norwegian University of Science and Technology 7491 Trondheim, Norway Abstract. In

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Ranked Keyword Search in Cloud Computing: An Innovative Approach

Ranked Keyword Search in Cloud Computing: An Innovative Approach International Journal of Computational Engineering Research Vol, 03 Issue, 6 Ranked Keyword Search in Cloud Computing: An Innovative Approach 1, Vimmi Makkar 2, Sandeep Dalal 1, (M.Tech) 2,(Assistant professor)

More information

CiteSeer x in the Cloud

CiteSeer x in the Cloud Published in the 2nd USENIX Workshop on Hot Topics in Cloud Computing 2010 CiteSeer x in the Cloud Pradeep B. Teregowda Pennsylvania State University C. Lee Giles Pennsylvania State University Bhuvan Urgaonkar

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Mobile Phone APP Software Browsing Behavior using Clustering Analysis Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis

More information

A Comparative Approach to Search Engine Ranking Strategies

A Comparative Approach to Search Engine Ranking Strategies 26 A Comparative Approach to Search Engine Ranking Strategies Dharminder Singh 1, Ashwani Sethi 2 Guru Gobind Singh Collage of Engineering & Technology Guru Kashi University Talwandi Sabo, Bathinda, Punjab

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of

More information

Intelligent Agents Serving Based On The Society Information

Intelligent Agents Serving Based On The Society Information Intelligent Agents Serving Based On The Society Information Sanem SARIEL Istanbul Technical University, Computer Engineering Department, Istanbul, TURKEY sariel@cs.itu.edu.tr B. Tevfik AKGUN Yildiz Technical

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

Research of Postal Data mining system based on big data

Research of Postal Data mining system based on big data 3rd International Conference on Mechatronics, Robotics and Automation (ICMRA 2015) Research of Postal Data mining system based on big data Xia Hu 1, Yanfeng Jin 1, Fan Wang 1 1 Shi Jiazhuang Post & Telecommunication

More information

Dr. Anuradha et al. / International Journal on Computer Science and Engineering (IJCSE)

Dr. Anuradha et al. / International Journal on Computer Science and Engineering (IJCSE) HIDDEN WEB EXTRACTOR DYNAMIC WAY TO UNCOVER THE DEEP WEB DR. ANURADHA YMCA,CSE, YMCA University Faridabad, Haryana 121006,India anuangra@yahoo.com http://www.ymcaust.ac.in BABITA AHUJA MRCE, IT, MDU University

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

Application and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang 2011-10

Application and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang 2011-10 Application and practice of parallel cloud computing in ISP Guangzhou Institute of China Telecom Zhilan Huang 2011-10 Outline Mass data management problem Applications of parallel cloud computing in ISPs

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Redundant Data Removal Technique for Efficient Big Data Search Processing

Redundant Data Removal Technique for Efficient Big Data Search Processing Redundant Data Removal Technique for Efficient Big Data Search Processing Seungwoo Jeon 1, Bonghee Hong 1, Joonho Kwon 2, Yoon-sik Kwak 3 and Seok-il Song 3 1 Dept. of Computer Engineering, Pusan National

More information

UPS battery remote monitoring system in cloud computing

UPS battery remote monitoring system in cloud computing , pp.11-15 http://dx.doi.org/10.14257/astl.2014.53.03 UPS battery remote monitoring system in cloud computing Shiwei Li, Haiying Wang, Qi Fan School of Automation, Harbin University of Science and Technology

More information

On generating large-scale ground truth datasets for the deduplication of bibliographic records

On generating large-scale ground truth datasets for the deduplication of bibliographic records On generating large-scale ground truth datasets for the deduplication of bibliographic records James A. Hammerton j_hammerton@yahoo.co.uk Michael Granitzer mgrani@know-center.at Maya Hristakeva maya.hristakeva@mendeley.com

More information

Dynamic Data in terms of Data Mining Streams

Dynamic Data in terms of Data Mining Streams International Journal of Computer Science and Software Engineering Volume 2, Number 1 (2015), pp. 1-6 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining

More information

Policy-based Pre-Processing in Hadoop

Policy-based Pre-Processing in Hadoop Policy-based Pre-Processing in Hadoop Yi Cheng, Christian Schaefer Ericsson Research Stockholm, Sweden yi.cheng@ericsson.com, christian.schaefer@ericsson.com Abstract While big data analytics provides

More information

Sentiment analysis on tweets in a financial domain

Sentiment analysis on tweets in a financial domain Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang

More information

Random forest algorithm in big data environment

Random forest algorithm in big data environment Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

An Efficient Hybrid P2P MMOG Cloud Architecture for Dynamic Load Management. Ginhung Wang, Kuochen Wang

An Efficient Hybrid P2P MMOG Cloud Architecture for Dynamic Load Management. Ginhung Wang, Kuochen Wang 1 An Efficient Hybrid MMOG Cloud Architecture for Dynamic Load Management Ginhung Wang, Kuochen Wang Abstract- In recent years, massively multiplayer online games (MMOGs) become more and more popular.

More information

International journal of Engineering Research-Online A Peer Reviewed International Journal Articles available online http://www.ijoer.

International journal of Engineering Research-Online A Peer Reviewed International Journal Articles available online http://www.ijoer. RESEARCH ARTICLE ISSN: 2321-7758 GLOBAL LOAD DISTRIBUTION USING SKIP GRAPH, BATON AND CHORD J.K.JEEVITHA, B.KARTHIKA* Information Technology,PSNA College of Engineering & Technology, Dindigul, India Article

More information

System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks

System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks OnurSoft Onur Tolga Şehitoğlu November 10, 2012 v1.0 Contents 1 Introduction 3 1.1 Purpose..............................

More information

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

More information

CS231M Project Report - Automated Real-Time Face Tracking and Blending

CS231M Project Report - Automated Real-Time Face Tracking and Blending CS231M Project Report - Automated Real-Time Face Tracking and Blending Steven Lee, slee2010@stanford.edu June 6, 2015 1 Introduction Summary statement: The goal of this project is to create an Android

More information

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

More information

Figure 1. The cloud scales: Amazon EC2 growth [2].

Figure 1. The cloud scales: Amazon EC2 growth [2]. - Chung-Cheng Li and Kuochen Wang Department of Computer Science National Chiao Tung University Hsinchu, Taiwan 300 shinji10343@hotmail.com, kwang@cs.nctu.edu.tw Abstract One of the most important issues

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Clustering using Simhash and Locality Sensitive Hashing in Hadoop HDFS : An Infrastructure Extension

Clustering using Simhash and Locality Sensitive Hashing in Hadoop HDFS : An Infrastructure Extension ISTE-ACEEE Int. J. in Computer Science, Vol. 1, No. 1, March 2014 Clustering using Simhash and Locality Sensitive Hashing in Hadoop HDFS : An Infrastructure Extension Kala Karun.A and Chitharanjan. K Sree

More information

SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA

SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA J.RAVI RAJESH PG Scholar Rajalakshmi engineering college Thandalam, Chennai. ravirajesh.j.2013.mecse@rajalakshmi.edu.in Mrs.

More information

Hardware Configuration Guide

Hardware Configuration Guide Hardware Configuration Guide Contents Contents... 1 Annotation... 1 Factors to consider... 2 Machine Count... 2 Data Size... 2 Data Size Total... 2 Daily Backup Data Size... 2 Unique Data Percentage...

More information

Resource Allocation Schemes for Gang Scheduling

Resource Allocation Schemes for Gang Scheduling Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian

More information

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network , pp.273-284 http://dx.doi.org/10.14257/ijdta.2015.8.5.24 Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network Gengxin Sun 1, Sheng Bin 2 and

More information

K-means Clustering Technique on Search Engine Dataset using Data Mining Tool

K-means Clustering Technique on Search Engine Dataset using Data Mining Tool International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 6 (2013), pp. 505-510 International Research Publications House http://www. irphouse.com /ijict.htm K-means

More information

An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus

An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus Tadashi Ogino* Okinawa National College of Technology, Okinawa, Japan. * Corresponding author. Email: ogino@okinawa-ct.ac.jp

More information

Web Document Clustering

Web Document Clustering Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph Janani K 1, Narmatha S 2 Assistant Professor, Department of Computer Science and Engineering, Sri Shakthi Institute of

More information

A Catalogue of the Steiner Triple Systems of Order 19

A Catalogue of the Steiner Triple Systems of Order 19 A Catalogue of the Steiner Triple Systems of Order 19 Petteri Kaski 1, Patric R. J. Östergård 2, Olli Pottonen 2, and Lasse Kiviluoto 3 1 Helsinki Institute for Information Technology HIIT University of

More information

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction Chapter-1 : Introduction 1 CHAPTER - 1 Introduction This thesis presents design of a new Model of the Meta-Search Engine for getting optimized search results. The focus is on new dimension of internet

More information

Security in Android apps

Security in Android apps Security in Android apps Falco Peijnenburg (3749002) August 16, 2013 Abstract Apps can be released on the Google Play store through the Google Developer Console. The Google Play store only allows apps

More information

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2 Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Efficient Search in Gnutella-like Small-World Peerto-Peer

Efficient Search in Gnutella-like Small-World Peerto-Peer Efficient Search in Gnutella-like Small-World Peerto-Peer Systems * Dongsheng Li, Xicheng Lu, Yijie Wang, Nong Xiao School of Computer, National University of Defense Technology, 410073 Changsha, China

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader

A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader A Performance Evaluation of Open Source Graph Databases Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader Overview Motivation Options Evaluation Results Lessons Learned Moving Forward

More information

A Talari Networks White Paper. Turbo Charging WAN Optimization with WAN Virtualization. A Talari White Paper

A Talari Networks White Paper. Turbo Charging WAN Optimization with WAN Virtualization. A Talari White Paper A Talari Networks White Paper Turbo Charging WAN Optimization with WAN Virtualization A Talari White Paper 2 Introduction WAN Virtualization is revolutionizing Enterprise Wide Area Network (WAN) economics,

More information

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Preventing and Detecting Plagiarism in Programming Course

Preventing and Detecting Plagiarism in Programming Course , pp.269-278 http://dx.doi.org/10.14257/ijsia.2013.7.5.25 Preventing and Detecting Plagiarism in Programming Course Wang Chunhui, Liu Zhiguo and Liu Dongsheng Computer & Information Engineering College,

More information

Distance Degree Sequences for Network Analysis

Distance Degree Sequences for Network Analysis Universität Konstanz Computer & Information Science Algorithmics Group 15 Mar 2005 based on Palmer, Gibbons, and Faloutsos: ANF A Fast and Scalable Tool for Data Mining in Massive Graphs, SIGKDD 02. Motivation

More information

A Trust Evaluation Model for QoS Guarantee in Cloud Systems *

A Trust Evaluation Model for QoS Guarantee in Cloud Systems * A Trust Evaluation Model for QoS Guarantee in Cloud Systems * Hyukho Kim, Hana Lee, Woongsup Kim, Yangwoo Kim Dept. of Information and Communication Engineering, Dongguk University Seoul, 100-715, South

More information

Study on Redundant Strategies in Peer to Peer Cloud Storage Systems

Study on Redundant Strategies in Peer to Peer Cloud Storage Systems Applied Mathematics & Information Sciences An International Journal 2011 NSP 5 (2) (2011), 235S-242S Study on Redundant Strategies in Peer to Peer Cloud Storage Systems Wu Ji-yi 1, Zhang Jian-lin 1, Wang

More information