SimHash-based Effective and Efficient Detecting of Near-Duplicate Short Messages
Proceedings of the Second Symposium International Computer Science and Computational Technology (ISCSCT '09), Huangshan, P. R. China, 26-28 Dec. 2009. 2009 ACADEMY PUBLISHER AP-PROC-CS-09CN

SimHash-based Effective and Efficient Detecting of Near-Duplicate Short Messages

Bingfeng Pi, Shunkai Fu, Weilei Wang, and Song Han
Roboo Inc., Suzhou, P. R. China
{winter.pi, shunkai.fu, willer.wang,

Abstract — Detecting near-duplicates within a huge repository of short messages is known to be a challenge, due to the short length of the messages, the frequent typos introduced when typing on mobile phones, the flexibility and diversity of the Chinese language, and the nature of the target itself: near-duplicates. In this paper, we discuss a problem met in a real application and look for a suitable technique to solve it. We start by discussing how serious near-duplication is among short messages. Then we review how SimHash works and its merits for finding near-duplicates. Finally, we present a series of findings, covering both the problem itself and the benefits brought by the SimHash-based approach, based on experiments with 500 thousand real short messages crawled from the Internet. We believe the discussion here is a valuable reference for both researchers and practitioners.

Index Terms — near-duplicate, SimHash, short text

I. INTRODUCTION

Duplicate and near-duplicate web documents pose large problems for Web search engines: they increase the space required to store the index, slow down serving results, and annoy users [2, 3]. Among the data available on the Internet, a large proportion is short text, such as mobile phone short messages, instant messages, chat logs, BBS titles, etc. [1]. The Information Industry Ministry of China reported that more than 1.56 billion mobile phone short messages are sent each day in Mainland China [5].
As an active and popular mobile search service provider in China, our historical query log indicates that short message search enjoys a monthly page-view (PV) scale similar to that of Web page search on Roboo [6]. These two facts motivate us to pay close attention to the quality of our short message repository, since it is the basis of a quality search service. Unfortunately, the prevalence of duplicate and near-duplicate messages is severe, especially near-duplicates. TABLE I shows two typical examples of near-duplicates (all in Chinese). In the first pair, the message above has 4 more characters (highlighted in gray) than the other, and the remaining part is exactly the same; in the second pair, the differences consist of one character and two punctuation marks (all highlighted in gray). These differences may result from several causes: 1) the same content appearing on different sites is crawled, processed, and indexed multiple times; 2) mistakes are introduced while parsing this loosely structured and noisy text (an HTML page may contain ads, and is known to be short of semantics useful for parsing); 3) manual typos (all information on the Internet is originally created by people) and manual revision as content is referred to and reused; 4) explicit modification to make a short message suitable for a different usage, for example, replacing 春节 (Spring Festival, the traditional Chinese New Year) with 新年 (New Year), though they are similar in meaning. Manual checking may be applicable when the repository is small, e.g. hundreds or thousands of instances. When the number of instances grows to millions or more, it obviously becomes impossible for humans to check them one by one; doing so is tedious, costly, and prone to error.
Resorting to computers for such repeatable work is desirable; its core is an algorithm to measure the difference between any pair of short messages, covering both duplicates and near-duplicates. Manku et al. [3] showed that Charikar's SimHash [4] is practically useful for identifying near-duplicates in web documents. SimHash is a fingerprinting technique with the property that fingerprints of near-duplicates differ only in a small number of bit positions. A SimHash fingerprint is generated for each object; if the fingerprints of two objects are similar, the objects are deemed near-duplicates. For a SimHash fingerprint f, Manku et al. developed a technique for identifying whether an existing fingerprint f′ differs from f in at most k bits. Their experiments show that for a repository of 8 billion pages, 64-bit SimHash fingerprints and k = 3 are reasonable. Another work, by Pi et al. [2], confirmed the effectiveness of SimHash and of Manku et al.'s approach; in addition, they proposed performing the detection only among the results retrieved by a query, a so-called query-biased approach. It reduces the problem scale via divide-and-conquer, replacing global search with local search, and it is open to more settings possibly met in applications, e.g. a smaller k to remove fewer documents under some conditions, and a larger k to delete more documents under others. In this paper, we show that SimHash is indeed effective and efficient in detecting both duplicates (with k = 0) and near-duplicates (with k > 0) in a large short message repository (see the two typical examples in TABLE II). However, we also notice that, due to the inherent features of short messages, k = 3 may not be an ideal parameter. For example, as shown in TABLE III, k = 2 is enough to detect a one-character
difference, but k has to be 5 to detect the same pair of messages with a two-character difference. Moreover, with the same one-character difference, shorter messages require a larger k for effective detection (TABLE IV). This may be explained by the observation that the same difference, e.g. one differing character at the same position in two texts, is more influential for short text than for long text. This paper focuses on practical solutions for a real application, and our contribution is threefold. First, we demonstrate a series of practical benefits of the SimHash-based approach through experiments and our own experience. Second, we point out that while k = 3 may be suitable for near-duplicate Web page detection, it is clearly not suitable for short messages. Third, we propose an empirical choice, k = 5, as applied in our online short message search. In Section 2, we describe how SimHash works, along with its advantages and disadvantages. In Section 3, we present a series of experiments and discuss the results. A brief review of related work is given in Section 4, followed by conclusions and future work in Section 5.

TABLE I. TYPICAL NEAR-DUPLICATES OF SHORT MESSAGES, WITH DIFFERENCES HIGHLIGHTED IN GRAY

(1) 春节搞笑 春节搞笑祝福短信 新年到了,事儿多了吧?招待客人别累着,狼吞虎咽别撑着,啤的白的别掺着,孩子别忘照顾着,最后我的惦念常带着 新年快快乐乐的!!
(2) 春节搞笑祝福短信 新年到了,事儿多了吧?招待客人别累着,狼吞虎咽别撑着,啤的白的别掺着,孩子别忘照顾着,最后我的惦念常带着 新年快快乐乐的!!

(1) 又是你的生日了,虽然残破的爱情让我彼此变得陌生,然而我从未忘你的生日,happy birthday
(2) 又是你的生日了,虽然残破的爱情让我们彼此变得陌生,然而我从未忘你的生日 Happy birthday!

TABLE II.
EXAMPLE: DUPLICATES DETECTED WITH k = 0 AND NEAR-DUPLICATES WITH k > 0 (DIFFERENCES HIGHLIGHTED IN GRAY)

k = 0:
(1) 今生今世,你是我唯一的选择 愿我们好好珍惜缘分,也请你答应我,今生今世只为我守侯
(2) 今生今世,你是我唯一的选择 愿我们好好珍惜缘分,也请你答应我,今生今世只为我守侯

k > 0:
(1) 愿我们好好珍惜缘分,也请你答应我,今生今世只为我守候
(2) 愿我们好好珍惜缘分,也请你答应我,今生今世只为我守侯

II. NEAR-DUPLICATE DETECTION BY SIMHASH

A. SimHash and Hamming Distance

Charikar's SimHash [4] is a fingerprinting technique that produces a compact sketch of the objects being studied, whether the documents discussed here or images. It therefore allows various kinds of processing, once applied to the original data sets, to be carried out on the compact sketches, a much smaller and well-formatted (fixed-length) space. For documents, SimHash works as follows: a Web document is converted into a set of features, each tagged with a weight. We then transform this high-dimensional vector into an f-bit fingerprint, where f is small compared with the original dimensionality. An excellent comparison of SimHash with Broder's traditional shingle-based fingerprints [7] can be found in Henzinger [8]. To make this paper self-contained, we give the algorithm's specification in Figure 1 and explain it in a little more detail. We assume the input document D is pre-processed into a series of features (tokens). First, we initialize an f-dimensional vector V with each dimension set to zero (line 1). Then each feature is hashed into an f-bit hash value, and these f bits increment or decrement the f components of the vector by the weight of that feature, according to the value of each bit of the computed hash (lines 4-8). Finally, the signs of the components determine the corresponding bits of the final fingerprint (lines 9-11). TABLE III.
EXAMPLE: THE SAME LONG TEXT, BUT A LARGER DIFFERENCE REQUIRES A LARGER k (DIFFERENCES HIGHLIGHTED IN GRAY)

k = 2: (1) 今生今世,你是我唯一的选择 愿我们好好珍惜缘分,也请你答应我,今生今世只为我守侯
k = 5: (1) 今生今世,你是我唯一的选择 愿我们好好珍惜缘分,也请你答应我,今生今世只为我等待

TABLE IV. EXAMPLE: THE SAME DIFFERENCE, BUT SHORTER TEXT REQUIRES A LARGER k (DIFFERENCES HIGHLIGHTED IN GRAY)

Figure 1. Algorithm specification of SimHash.
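The procedure specified in Figure 1 can be sketched in Python as follows. This is a minimal illustration, assuming the input is already tokenized into (feature, weight) pairs; the per-feature hash function is our assumption (a truncated MD5 digest), since the paper does not name one.

```python
import hashlib

def simhash(features, f=64):
    """Compute an f-bit SimHash fingerprint from (token, weight) pairs."""
    v = [0] * f  # line 1: f-dimensional vector initialized to zero
    for token, weight in features:
        # Hash each feature into an f-bit value (truncated MD5 here).
        h = int.from_bytes(
            hashlib.md5(token.encode("utf-8")).digest()[: f // 8], "big"
        )
        # Lines 4-8: each hash bit increments/decrements a component by the weight.
        for i in range(f):
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    # Lines 9-11: the sign of each component determines the fingerprint bit.
    fp = 0
    for i in range(f):
        if v[i] > 0:
            fp |= 1 << i
    return fp
```

Note that scaling a single feature's weight does not change the sign of its contribution, so the fingerprint depends on the relative balance of features rather than absolute weights.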
SimHash has two important but somewhat conflicting properties: (1) the fingerprint of a document is a hash of its features, and (2) similar documents have similar hash values. The latter property is quite different from traditional hash functions such as MD5 or SHA-1 (Secure Hash Algorithm), where the hash values of two documents may be very different even when the documents differ only slightly. This property makes SimHash an ideal technique for detecting near-duplicates: two documents are deemed similar if their hash values are close to each other. The closer they are, the more similar the two documents; when the two hash values are exactly the same, we have found exact duplicates, as MD5 can achieve. In this project, we construct a 64-bit fingerprint for each document, since this was also shown to work well in [1]. Detecting near-duplicate documents then becomes a search for hash values within k bits of difference, also known as searching for nearest neighbors in Hamming space [3, 4]. How can this be done efficiently? One solution is to directly compare each pair of SimHash codes; its complexity is O(N²), where N is the size of the document repository and each unit comparison involves 64 bits. A more efficient method, proposed in [1], is implemented in this project as well. It consists of two steps. First, every f-bit SimHash code is divided into (k + 1) blocks, and codes sharing the same value for block 1, 2, ..., (k + 1) are grouped into separate lists. For example, with k = 3, all SimHash codes with the same 1st, 2nd, 3rd, or 4th block are clustered together. Second, given a SimHash code, we extract its 1st block and use it to retrieve the list of all codes sharing the same 1st block. Normally, the length of such a list is much smaller than the size of the whole repository, N.
Then, given the retrieved list, we need only check whether the remaining blocks of each code differ in k or fewer bits. The same check must be applied to the other 3 lists before we have found all matching SimHash codes, i.e. all near-duplicate documents. In the remaining text we refer to this search procedure as the Hamming distance measure.

B. Advantages and Disadvantages of SimHash

Based on our experience, SimHash has several advantages in application:
1. Transforming content into a standard fingerprint makes the approach applicable to different media, whether text, video, or audio;
2. Fingerprinting provides a compact representation, which not only greatly reduces storage space but also allows quicker comparison and search;
3. Similar content has similar SimHash codes, which permits a simple distance function to be defined for the application;
4. It is applicable to both duplicate and near-duplicate detection, with k = 0 and k > 0 respectively;
5. Processing time is similar for different settings of k when using the divide-and-search method described above, which is valuable in practice since we can detect more near-duplicates at no extra cost;
6. Based on our implementation experience, the search over encoded objects is easy to implement in a distributed environment;
7. From a software engineering point of view, the procedure can be implemented as a standard, reusable module for similar applications, with applicants choosing the related parameters themselves.

A standard, aligned encoded output (e.g., a 64-bit SimHash code) plus the parameter k make it possible to build a flexible, reusable, and scalable near-duplicate detection algorithm, like the one implemented in [1, 2] and in this project. TABLE II, TABLE III, and TABLE IV show several near-duplicate pairs detected with SimHash; the difference within each pair of short messages and the corresponding k value required for detection are listed.
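The two-step block search described above can be sketched as follows. This is a minimal illustration assuming 64-bit codes stored as Python integers; the function and variable names are ours. The correctness argument is the pigeonhole principle: two codes that differ in at most k bits must agree exactly on at least one of the (k + 1) blocks, so scanning the (k + 1) candidate lists finds every near-duplicate.

```python
from collections import defaultdict

def hamming_distance(a, b):
    """Number of differing bit positions between two integer codes."""
    return bin(a ^ b).count("1")

def build_block_index(codes, f=64, k=3):
    """Group codes into (k + 1) hash tables, one per block of f // (k + 1) bits."""
    block_bits = f // (k + 1)
    index = [defaultdict(list) for _ in range(k + 1)]
    for code in codes:
        for b in range(k + 1):
            key = (code >> (b * block_bits)) & ((1 << block_bits) - 1)
            index[b][key].append(code)
    return index, block_bits

def find_near_duplicates(code, index, block_bits, k=3):
    """Retrieve candidates sharing at least one block, then verify the distance.

    Note: equal codes (exact duplicates) are indistinguishable as integers,
    so self-matches are filtered out by value here.
    """
    candidates = set()
    for b, table in enumerate(index):
        key = (code >> (b * block_bits)) & ((1 << block_bits) - 1)
        candidates.update(table.get(key, []))
    return [c for c in candidates if c != code and hamming_distance(code, c) <= k]
```

Because each of the (k + 1) lookups touches only codes sharing a 16-bit block value (for f = 64, k = 3), the candidate lists are typically far shorter than the full repository, which is what makes the per-query cost nearly independent of k.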
As discussed here, SimHash can be applied to short text without any modification to our previous work on page documents, i.e. long text. Also, k = 0 lets us find exact duplicates, while a larger k allows us to detect greater differences. However, SimHash has weak points as well. Text length greatly influences its effect: for example, k = 2 suffices to find pairs with one differing character in TABLE III, but k = 5 is required in TABLE IV. We are fortunate that computing time is similar for different k, but we must make the trade-off in choosing k manually, since whether a pair is near-duplicated is quite vague, especially for pairs detected with a large k. Moreover, the size of the target objects influences the choice of k; that is why we cannot directly apply k = 3 here, even though it proved effective in our Web page cleaning project.

III. EXPERIMENTAL STUDY

This project aims at a practical solution for a real-world large-scale application, so experiments with real data are highly desirable. In this section, we cover the following points: 1) the algorithm is effective in finding both duplicates and near-duplicates in a short-text repository; 2) the near-duplicate problem is serious, so it is worth our effort; 3) k = 3 is not a good choice for detecting near-duplicate short texts; 4) the SimHash-based approach is flexible, customizable, and scalable.

A. Our Data

We crawled and parsed Web pages, extracting and indexing about 500 thousand short messages for this experimental study. Messages that are too short are filtered out first; the minimum length threshold is 20 here. Note that
this choice is arbitrary. TABLE V summarizes the test repository.

TABLE V. BASIC STATISTICS OF THE SHORT MESSAGE REPOSITORY USED FOR TESTING
# of messages (length ≥ 20): ,959
Length of longest message: 2,968
Length of shortest message: 20
Mean length:
Standard deviation:

B. Correctness and Effectiveness

To make the following discussion sound, it is necessary to verify the algorithm and our implementation. From the examples shown in TABLE II, we observe that: 1) duplicate pairs are indeed detected with k = 0; 2) near-duplicate pairs have to be detected with k > 0; 3) a larger difference requires a larger k; 4) if a near-duplicate can be detected with a smaller k, it can certainly be detected with a larger k, but the reverse is not true. Therefore, with k > 0, we can find both duplicates and near-duplicates. Beyond these examples, we further randomly selected 1,000 messages from the whole repository. With k = 3, 65 near-duplicate pairs were found, and we checked them one by one manually. The conclusion is that all 65 pairs are indeed near-duplicates.

C. Seriousness of the Near-duplicate Problem

To show how serious duplicates and near-duplicates are in the short message repository, we ran the search with different values of k on the same test repository. Figure 2 shows the number of near-duplicate pairs detected for each k, ranging from 0 to 10. There are 87,604 pairs detected with k = 0, i.e. duplicate messages. This means about 35% (87,604 × 2 out of the roughly 500,000 messages) of all messages are exact duplicates (note: our hashing is token-based, and whitespace is ignored). The rate increases to about 57% when k = 10; that is, more than half of the messages are duplicated or near-duplicated. With so many near-duplicates in the repository, we can imagine the quality of the search results: a series of identical or similar results piled together and presented to the users.
Since in short message search there is normally no extra score to consider, such as the PageRank score in page search, beyond the similarity to the query, we have no way to improve the user experience other than filtering out the duplicated results. Figure 2 also confirms the discussion in Section II.B: a larger k lets us find more near-duplicates. By removing the duplicated and near-duplicated messages, storage space is reduced greatly as well; and by reducing the index scale, the online retrieval response should be quicker. There are therefore several benefits to deleting these repeated texts.

Figure 2. The total number of near-duplicate pairs detected with different k, ranging from 0 to 10.

D. k = 3 is Not a Good Choice

k = 3 is demonstrated in [1] and [2] to be a suitable and practical choice for large-scale near-duplicate detection of Web documents. However, it appears inappropriate for detecting near-duplicate short messages. As the examples in TABLE III and TABLE IV indicate, a one-character difference may only be detected with k = 5. In other words, although such a pair of short messages is very similar, differing in only one character, it is not detected as near-duplicate with k = 3. In the experiments of Figure 2, the ratio of near-duplicates detected is about 37% for k = 3. This means that roughly an extra 20% (57% − 37%, where 57% is the ratio at k = 10) of near-duplicates remain in the short message repository if we apply k = 3. Why does the same k = 3 not work well on short text? It can be explained informally: given a Web page with 1,000 characters and a short text with 50 characters, adding, deleting, or changing one character has much less influence on the Web page than on the short text.
This is inherent to the fingerprinting technique; we can improve sensitivity by using more bits when constructing the fingerprint. However, this is no free lunch, since the corresponding computing and storage burden increases accordingly. Given a 64-bit fingerprint, which k is most appropriate is hard to decide in practice. Though we could run an experiment like the one in [1], asking people to manually count true positives, true negatives, false positives, and false negatives, we do not do so here for three reasons: 1) the procedure is very costly in terms of money and time; 2) whether a pair is a near-duplicate is hard to determine in many cases, even for humans, let alone machines; 3) it is not our goal, nor that of any similar service provider, to remove all near-duplicates; a practical goal is to reduce their influence on users to an acceptable level.
On our online short message search service, k = 5 is adopted, and the general evaluation, over several person-months of testing, is satisfactory: much better than before, when no action was taken on the message repository. Figure 3 is a snapshot of our online short message search; the result list shown on the right is clean, with no duplicates or near-duplicates. This is meaningful since (1) the small screen is fully used by displaying only unique results; (2) communication traffic is saved for users by not displaying repeats; (3) users can find what they like more quickly (fewer page-down operations).

Figure 3. The home page of our short message search (left) and the result list for the query 春节 (Spring Festival) (right).

E. The SimHash-based Approach is Flexible, Customizable and Scalable

From the discussion above, we can see that the SimHash-based near-duplicate detection algorithm allows us to find both duplicates and near-duplicates, owing to its key property: similar objects have similar SimHash codes. It applies not only to Web documents but also to the short messages studied here; the only necessary adjustment is to find a suitable k. Applicants may customize the choice of k based on their goals, i.e. how strictly they want to control the results. Of course, increasing k also increases the risk of removing false positives. Our experiments with about 500 thousand messages were run on a common PC with a 3.06 GHz CPU and 1 GB of memory. Each experiment cost only dozens of seconds for the search, and the time is similar for different k. In our production environment, we implemented a Hadoop-based [9] version, which allows us to scale easily to millions of messages with a few machines. In fact, the encoding, grouping, and comparison procedures are easy to program in the MapReduce [10] framework, a well-known divide-and-conquer distributed computing model. IV.
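Putting the pieces together, a batch deduplication pass over a message list might look like the following. This is a self-contained sketch under our own assumptions: single characters with unit weight serve as the features (the paper does not give its exact feature scheme), the per-feature hash is a truncated MD5, and a simple O(N²) pairwise check stands in for the block-index search.

```python
import hashlib

def simhash64(text):
    """64-bit SimHash over single-character features with unit weight."""
    v = [0] * 64
    for ch in text.replace(" ", ""):  # token-based; whitespace ignored
        h = int.from_bytes(hashlib.md5(ch.encode("utf-8")).digest()[:8], "big")
        for i in range(64):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if v[i] > 0)

def dedup(messages, k=5):
    """Keep one representative per (near-)duplicate group.

    A message is kept only if its fingerprint is more than k bits away
    from every fingerprint kept so far.
    """
    kept, fingerprints = [], []
    for msg in messages:
        fp = simhash64(msg)
        if all(bin(fp ^ other).count("1") > k for other in fingerprints):
            kept.append(msg)
            fingerprints.append(fp)
    return kept

msgs = ["happy new year", "happy new year"]
print(dedup(msgs))  # -> ['happy new year']  (the repeated message is dropped)
```

With k = 0 this removes only exact duplicates (identical fingerprints); with k = 5, the value adopted in this paper, pairs whose fingerprints differ in up to 5 bits are also collapsed.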
RELATED WORK

A variety of techniques have been proposed to identify academic plagiarism [11, 12, 13], Web page duplicates [2, 3, 8, 14], and duplicate database records [15, 16]. However, until recently there have been very few works on detecting near-duplicates in short text repositories, among them [1, 17]. Gong et al. [1] proposed SimFinder, which employs three techniques: ad hoc term weighting, discriminative-term selection, and optimization. It is also a fingerprinting-based method, but applies special processing when choosing features and their corresponding weights. Muthmann et al. [17] discussed near-duplicate detection for Web forums, another important source of user-generated content (UGC) on the Internet; it is likewise built on fingerprinting. However, at the time of writing there is no published work on this problem for mobile search applications. Though the theoretical basis may be similar, identifying near-duplicate short messages is believed to be much harder, considering that: 1) a message usually contains fewer than 200 characters, so there are few effective features to extract; 2) messages tend to be informal and error-prone; 3) the degree of duplication and near-duplication is known to be more severe than for Web documents. All of this can be explained by the fact that short messages are very popular with mobile users and are short enough to be spread easily.

V. CONCLUSION AND FUTURE WORK

In providing a short message search service, we lack other ranking signals, such as PageRank in page search, to optimize the ranking of retrieved results beyond their similarity to the query. Under a traditional search model, identical or similar short messages may therefore pile up together in the result list. Moreover, near-duplicates are abundant in short text databases.
Both facts together motivate us to pay close attention to detecting and eliminating them, to protect the user experience. We review SimHash and discuss its application to detecting near-duplicate short text. SimHash has several advantages, which we validate through a series of experiments with real data. Deleting both duplicate and near-duplicate content has several benefits, especially for mobile applications like ours: (1) more useful information can be presented on the small screen; (2) users save time and bandwidth through fewer page-down operations and fewer requests for new pages; (3) the storage requirement is reduced; (4) online retrieval time, and thus user waiting time, is reduced. The importance of user experience can never be over-emphasized in mobile applications, considering the small screen, difficult input, and slow connection speeds of today. We believe our discussion here is a valuable reference for practitioners like us, since our own product currently benefits from this technique online. Although our system takes no special steps to process features, unlike those in [1, 17], the existing framework is observed to work well online. However, there is still room for improvement. For example, we could further study the relationship between text length, the ratio of difference, and the suitable choice of k. In addition, some advanced NLP
(Natural Language Processing) techniques may be applied to improve the outcome. For instance, we could recognize and fix typos before applying SimHash encoding, which might allow us to tolerate more difference with the same k. Of course, any extra, finer modeling must be paid for with more computing resources.

REFERENCES
[1] C. Gong, Y. Huang, X. Cheng, and S. Bai, "Detecting Near-Duplicates in Large-Scale Short Text Databases," Proc. of PAKDD 2008, LNAI, vol. 5012, Springer, Heidelberg.
[2] B. Pi, S.-K. Fu, G. Zou, J. Guo, and H. Song, "Query-biased Near-Duplicate Detection: Effective, Efficient and Customizable," Proc. of the 4th International Conference on Data Mining (DMIN), Las Vegas, US.
[3] G. S. Manku, A. Jain, and A. D. Sarma, "Detecting Near-Duplicates for Web Crawling," Proc. of the 16th International World Wide Web Conference (WWW).
[4] M. Charikar, "Similarity Estimation Techniques from Rounding Algorithms," Proc. of the 34th Annual Symposium on Theory of Computing (STOC), 2002.
[5] Official website of the Information Industry Ministry of China.
[6] Roboo mobile search engine.
[7] A. Broder, S. C. Glassman, M. Manasse, and G. Zweig, "Syntactic clustering of the web," Computer Networks, vol. 29, no. 8-13, 1997.
[8] M. R. Henzinger, "Finding near-duplicate web documents: a large-scale evaluation of algorithms," Proc. of ACM SIGIR, 2006.
[9] Hadoop official site.
[10] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Proc. of the 6th Symposium on Operating System Design and Implementation (OSDI).
[11] S. Brin, J. Davis, and H. Garcia-Molina, "Copy detection mechanisms for digital documents," Proc. of the ACM SIGMOD Annual Conference, San Francisco, CA.
[12] N. Shivakumar and H. Garcia-Molina, "SCAM: A copy detection mechanism for digital documents," Proc. of the 2nd International Conference in Theory and Practice of Digital Libraries, Austin, Texas.
[13] M. Zini, M. Fabbri, and M. Mongelia, "Plagiarism detection through multilevel text comparison," Proc.
of the 2nd International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution, Leeds, U.K.
[14] N. Shivakumar and H. Garcia-Molina, "Finding near-replicas of documents on the web," Proc. of the Workshop on Web Databases, Valencia, Spain.
[15] Z. P. Tian, H. J. Lu, and W. Y. Ji, "An n-gram-based approach for detecting approximately duplicate data records," International Journal on Digital Libraries, 5(3).
[16] M. A. Hernandez and S. J. Stolfo, "The merge/purge problem for large databases," Proc. of the ACM SIGMOD Annual Conference, San Jose, CA.
[17] K. Muthmann, W. M. Barczynski, F. Brauer, and A. Loser, "Near-duplicate detection for web forums," Proc. of the International Database Engineering and Applications Symposium (IDEAS).
Digital Evidence Search Kit K.P. Chow, C.F. Chong, K.Y. Lai, L.C.K. Hui, K. H. Pun, W.W. Tsang, H.W. Chan Center for Information Security and Cryptography Department of Computer Science The University
More informationAn Efficient Load Balancing Technology in CDN
Issue 2, Volume 1, 2007 92 An Efficient Load Balancing Technology in CDN YUN BAI 1, BO JIA 2, JIXIANG ZHANG 3, QIANGGUO PU 1, NIKOS MASTORAKIS 4 1 College of Information and Electronic Engineering, University
More informationImage Search by MapReduce
Image Search by MapReduce COEN 241 Cloud Computing Term Project Final Report Team #5 Submitted by: Lu Yu Zhe Xu Chengcheng Huang Submitted to: Prof. Ming Hwa Wang 09/01/2015 Preface Currently, there s
More informationThe multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of
More informationA Web Service for Scholarly Big Data Information Extraction
2014 IEEE International Conference on Web Services A Web Service for Scholarly Big Data Information Extraction Kyle Williams, Lichi Li, Madian Khabsa, Jian Wu, Patrick C. Shih and C. Lee Giles Information
More informationDistributed Framework for Data Mining As a Service on Private Cloud
RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &
More informationNear Duplicate Document Detection Survey
Near Duplicate Document Detection Survey Bassma S. Alsulami, Maysoon F. Abulkhair, Fathy E. Eassa Faculty of Computing and Information Technology King AbdulAziz University Jeddah, Saudi Arabia Abstract
More informationSearch Result Optimization using Annotators
Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,
More informationMicro blogs Oriented Word Segmentation System
Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,
More informationSo today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we
More informationMapReduce With Columnar Storage
SEMINAR: COLUMNAR DATABASES 1 MapReduce With Columnar Storage Peitsa Lähteenmäki Abstract The MapReduce programming paradigm has achieved more popularity over the last few years as an option to distributed
More informationObject Request Reduction in Home Nodes and Load Balancing of Object Request in Hybrid Decentralized Web Caching
2012 2 nd International Conference on Information Communication and Management (ICICM 2012) IPCSIT vol. 55 (2012) (2012) IACSIT Press, Singapore DOI: 10.7763/IPCSIT.2012.V55.5 Object Request Reduction
More informationAccelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems
Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems cation systems. For example, NLP could be used in Question Answering (QA) systems to understand users natural
More informationOptimization of Search Results with De-Duplication of Web Pages In a Mobile Web Crawler
Optimization of Search Results with De-Duplication of Web Pages In a Mobile Web Crawler Monika 1, Mona 2, Prof. Ela Kumar 3 Department of Computer Science and Engineering, Indira Gandhi Delhi Technical
More informationOpen Access Research and Realization of the Extensible Data Cleaning Framework EDCF
Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2015, 7, 2039-2043 2039 Open Access Research and Realization of the Extensible Data Cleaning Framework
More informationFault Analysis in Software with the Data Interaction of Classes
, pp.189-196 http://dx.doi.org/10.14257/ijsia.2015.9.9.17 Fault Analysis in Software with the Data Interaction of Classes Yan Xiaobo 1 and Wang Yichen 2 1 Science & Technology on Reliability & Environmental
More informationCreating Synthetic Temporal Document Collections for Web Archive Benchmarking
Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Kjetil Nørvåg and Albert Overskeid Nybø Norwegian University of Science and Technology 7491 Trondheim, Norway Abstract. In
More informationEM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
More informationRanked Keyword Search in Cloud Computing: An Innovative Approach
International Journal of Computational Engineering Research Vol, 03 Issue, 6 Ranked Keyword Search in Cloud Computing: An Innovative Approach 1, Vimmi Makkar 2, Sandeep Dalal 1, (M.Tech) 2,(Assistant professor)
More informationCiteSeer x in the Cloud
Published in the 2nd USENIX Workshop on Hot Topics in Cloud Computing 2010 CiteSeer x in the Cloud Pradeep B. Teregowda Pennsylvania State University C. Lee Giles Pennsylvania State University Bhuvan Urgaonkar
More informationMapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
More informationKeywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
More informationMobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
More informationA Comparative Approach to Search Engine Ranking Strategies
26 A Comparative Approach to Search Engine Ranking Strategies Dharminder Singh 1, Ashwani Sethi 2 Guru Gobind Singh Collage of Engineering & Technology Guru Kashi University Talwandi Sabo, Bathinda, Punjab
More informationCategorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
More informationExperiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
More informationCAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance
CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of
More informationIntelligent Agents Serving Based On The Society Information
Intelligent Agents Serving Based On The Society Information Sanem SARIEL Istanbul Technical University, Computer Engineering Department, Istanbul, TURKEY sariel@cs.itu.edu.tr B. Tevfik AKGUN Yildiz Technical
More informationLog Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com
More informationResearch of Postal Data mining system based on big data
3rd International Conference on Mechatronics, Robotics and Automation (ICMRA 2015) Research of Postal Data mining system based on big data Xia Hu 1, Yanfeng Jin 1, Fan Wang 1 1 Shi Jiazhuang Post & Telecommunication
More informationDr. Anuradha et al. / International Journal on Computer Science and Engineering (IJCSE)
HIDDEN WEB EXTRACTOR DYNAMIC WAY TO UNCOVER THE DEEP WEB DR. ANURADHA YMCA,CSE, YMCA University Faridabad, Haryana 121006,India anuangra@yahoo.com http://www.ymcaust.ac.in BABITA AHUJA MRCE, IT, MDU University
More informationLarge-Scale Test Mining
Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding
More informationApplication and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang 2011-10
Application and practice of parallel cloud computing in ISP Guangzhou Institute of China Telecom Zhilan Huang 2011-10 Outline Mass data management problem Applications of parallel cloud computing in ISPs
More informationCLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES
CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,
More informationKnowledge Discovery and Data Mining. Structured vs. Non-Structured Data
Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.
More informationBig Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
More informationRedundant Data Removal Technique for Efficient Big Data Search Processing
Redundant Data Removal Technique for Efficient Big Data Search Processing Seungwoo Jeon 1, Bonghee Hong 1, Joonho Kwon 2, Yoon-sik Kwak 3 and Seok-il Song 3 1 Dept. of Computer Engineering, Pusan National
More informationUPS battery remote monitoring system in cloud computing
, pp.11-15 http://dx.doi.org/10.14257/astl.2014.53.03 UPS battery remote monitoring system in cloud computing Shiwei Li, Haiying Wang, Qi Fan School of Automation, Harbin University of Science and Technology
More informationOn generating large-scale ground truth datasets for the deduplication of bibliographic records
On generating large-scale ground truth datasets for the deduplication of bibliographic records James A. Hammerton j_hammerton@yahoo.co.uk Michael Granitzer mgrani@know-center.at Maya Hristakeva maya.hristakeva@mendeley.com
More informationDynamic Data in terms of Data Mining Streams
International Journal of Computer Science and Software Engineering Volume 2, Number 1 (2015), pp. 1-6 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining
More informationPolicy-based Pre-Processing in Hadoop
Policy-based Pre-Processing in Hadoop Yi Cheng, Christian Schaefer Ericsson Research Stockholm, Sweden yi.cheng@ericsson.com, christian.schaefer@ericsson.com Abstract While big data analytics provides
More informationSentiment analysis on tweets in a financial domain
Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationAn Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset
P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang
More informationRandom forest algorithm in big data environment
Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationAn Efficient Hybrid P2P MMOG Cloud Architecture for Dynamic Load Management. Ginhung Wang, Kuochen Wang
1 An Efficient Hybrid MMOG Cloud Architecture for Dynamic Load Management Ginhung Wang, Kuochen Wang Abstract- In recent years, massively multiplayer online games (MMOGs) become more and more popular.
More informationInternational journal of Engineering Research-Online A Peer Reviewed International Journal Articles available online http://www.ijoer.
RESEARCH ARTICLE ISSN: 2321-7758 GLOBAL LOAD DISTRIBUTION USING SKIP GRAPH, BATON AND CHORD J.K.JEEVITHA, B.KARTHIKA* Information Technology,PSNA College of Engineering & Technology, Dindigul, India Article
More informationSystem Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks
System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks OnurSoft Onur Tolga Şehitoğlu November 10, 2012 v1.0 Contents 1 Introduction 3 1.1 Purpose..............................
More informationScalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
More informationCS231M Project Report - Automated Real-Time Face Tracking and Blending
CS231M Project Report - Automated Real-Time Face Tracking and Blending Steven Lee, slee2010@stanford.edu June 6, 2015 1 Introduction Summary statement: The goal of this project is to create an Android
More informationAnalysis and Optimization of Massive Data Processing on High Performance Computing Architecture
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National
More informationFigure 1. The cloud scales: Amazon EC2 growth [2].
- Chung-Cheng Li and Kuochen Wang Department of Computer Science National Chiao Tung University Hsinchu, Taiwan 300 shinji10343@hotmail.com, kwang@cs.nctu.edu.tw Abstract One of the most important issues
More informationA STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
More informationClustering using Simhash and Locality Sensitive Hashing in Hadoop HDFS : An Infrastructure Extension
ISTE-ACEEE Int. J. in Computer Science, Vol. 1, No. 1, March 2014 Clustering using Simhash and Locality Sensitive Hashing in Hadoop HDFS : An Infrastructure Extension Kala Karun.A and Chitharanjan. K Sree
More informationSEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA
SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA J.RAVI RAJESH PG Scholar Rajalakshmi engineering college Thandalam, Chennai. ravirajesh.j.2013.mecse@rajalakshmi.edu.in Mrs.
More informationHardware Configuration Guide
Hardware Configuration Guide Contents Contents... 1 Annotation... 1 Factors to consider... 2 Machine Count... 2 Data Size... 2 Data Size Total... 2 Daily Backup Data Size... 2 Unique Data Percentage...
More informationResource Allocation Schemes for Gang Scheduling
Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian
More informationBig Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network
, pp.273-284 http://dx.doi.org/10.14257/ijdta.2015.8.5.24 Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network Gengxin Sun 1, Sheng Bin 2 and
More informationK-means Clustering Technique on Search Engine Dataset using Data Mining Tool
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 6 (2013), pp. 505-510 International Research Publications House http://www. irphouse.com /ijict.htm K-means
More informationAn Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus
An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus Tadashi Ogino* Okinawa National College of Technology, Okinawa, Japan. * Corresponding author. Email: ogino@okinawa-ct.ac.jp
More informationWeb Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
More informationAn Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
More informationMALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph
MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph Janani K 1, Narmatha S 2 Assistant Professor, Department of Computer Science and Engineering, Sri Shakthi Institute of
More informationA Catalogue of the Steiner Triple Systems of Order 19
A Catalogue of the Steiner Triple Systems of Order 19 Petteri Kaski 1, Patric R. J. Östergård 2, Olli Pottonen 2, and Lasse Kiviluoto 3 1 Helsinki Institute for Information Technology HIIT University of
More informationChapter-1 : Introduction 1 CHAPTER - 1. Introduction
Chapter-1 : Introduction 1 CHAPTER - 1 Introduction This thesis presents design of a new Model of the Meta-Search Engine for getting optimized search results. The focus is on new dimension of internet
More informationSecurity in Android apps
Security in Android apps Falco Peijnenburg (3749002) August 16, 2013 Abstract Apps can be released on the Google Play store through the Google Developer Console. The Google Play store only allows apps
More informationResearch on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data
More informationInternational Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationEfficient Search in Gnutella-like Small-World Peerto-Peer
Efficient Search in Gnutella-like Small-World Peerto-Peer Systems * Dongsheng Li, Xicheng Lu, Yijie Wang, Nong Xiao School of Computer, National University of Defense Technology, 410073 Changsha, China
More informationComparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,
More informationA Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader
A Performance Evaluation of Open Source Graph Databases Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader Overview Motivation Options Evaluation Results Lessons Learned Moving Forward
More informationA Talari Networks White Paper. Turbo Charging WAN Optimization with WAN Virtualization. A Talari White Paper
A Talari Networks White Paper Turbo Charging WAN Optimization with WAN Virtualization A Talari White Paper 2 Introduction WAN Virtualization is revolutionizing Enterprise Wide Area Network (WAN) economics,
More informationUniversity of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School
More informationEvaluating HDFS I/O Performance on Virtualized Systems
Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing
More informationPreventing and Detecting Plagiarism in Programming Course
, pp.269-278 http://dx.doi.org/10.14257/ijsia.2013.7.5.25 Preventing and Detecting Plagiarism in Programming Course Wang Chunhui, Liu Zhiguo and Liu Dongsheng Computer & Information Engineering College,
More informationDistance Degree Sequences for Network Analysis
Universität Konstanz Computer & Information Science Algorithmics Group 15 Mar 2005 based on Palmer, Gibbons, and Faloutsos: ANF A Fast and Scalable Tool for Data Mining in Massive Graphs, SIGKDD 02. Motivation
More informationA Trust Evaluation Model for QoS Guarantee in Cloud Systems *
A Trust Evaluation Model for QoS Guarantee in Cloud Systems * Hyukho Kim, Hana Lee, Woongsup Kim, Yangwoo Kim Dept. of Information and Communication Engineering, Dongguk University Seoul, 100-715, South
More informationStudy on Redundant Strategies in Peer to Peer Cloud Storage Systems
Applied Mathematics & Information Sciences An International Journal 2011 NSP 5 (2) (2011), 235S-242S Study on Redundant Strategies in Peer to Peer Cloud Storage Systems Wu Ji-yi 1, Zhang Jian-lin 1, Wang
More information