Construction Algorithms for Index Model Based on Web Page Classification


Journal of Computational Information Systems 10: 2 (2014)

Yangjie ZHANG 1,2,*, Chungang YAN 1,2, Pengwei WANG 1,2, Haichun SUN 1,2

1 Department of Computer Science and Technology, Tongji University, Shanghai, China
2 The Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai, China

Project supported by the National Basic Research Program of China 973 (No. 2010CB328101), the International Science & Technology Cooperation Program of China (No. 2013DFM10100), the National Natural Science Funds of China, and the Shanghai Science & Technology Research Plan.
* Corresponding author. Email address: zyj177484@126.com (Yangjie ZHANG).
Copyright 2014 Binary Information Press. DOI: /jcis9017. January 15, 2014.

Abstract

Web pages on the Internet are massive, diverse, heterogeneous and redundant, and organizing and managing them effectively is an urgent problem. In this paper, we propose a method to index web pages and build an index model based on web page classification and hyperlink analysis. First, an initialization algorithm is given to construct an index model for an initial set of web pages. Then, considering the dynamics of web pages on the Internet, we propose an incremental updating algorithm that updates an existing index model incrementally. Theoretical analysis shows that the complexity of the proposed algorithms is linear in the number of web pages and in their growth. The experimental results show that the initialization algorithm can construct an index model for a fixed number of web pages relatively fast, while the incremental updating algorithm can keep up with the update speed of web pages on the Internet. Thus, the proposed algorithms are feasible and effective. The constructed index model can support diversified information service systems, enabling them to make better use of web resources and provide more valuable services to users.

Keywords: Web Page Classification; Index Model; Hyperlinks; Naive Bayes

1 Introduction

With the rapid development of the Internet, network resources have become more and more abundant, and various kinds of information service systems have emerged to facilitate people's daily lives. However, owing to the openness, dynamics and complexity of the Internet, web pages are massive, diverse, heterogeneous and redundant. Existing methods do not give these systems a proper way to organize and manage web pages.

As a popular information service system, the search engine solves the problem of acquiring web pages rapidly through its main technologies, including the inverted index, link analysis, and distributed storage [1]. A search engine helps users find relevant web pages through keyword matching. However, the rapid growth of web pages reduces search efficiency and increases the redundancy and inaccuracy of the returned results, and keyword matching alone cannot meet the personalized demands of information retrieval services.

The recommendation system is another popular representative. Since research on collaborative filtering appeared in the mid-1990s, diversified recommendation algorithms have been proposed and used in many research fields [2-4], including cognitive science [5], information retrieval [6] and management science [7]. Web page recommendation systems recommend similar web pages to users with common preferences and improve the intelligence of Internet information service systems. In general, there are two types of recommender systems in this field: one uses content-based filtering and can only find web pages that are similar to the customer's interests, while the other uses collaborative filtering and needs massive user comments. As the number of web pages grows, the accuracy and speed of collaborative filtering decline.

Because an effective method for organizing and managing web pages is lacking, the performance of web-based information systems is restricted [8]. To this end, this paper presents an index model for web pages, which can organize web pages and find the relationships among them. We assume that hyperlinks among web pages reflect some kinds of business relationships in the real world. Based on web page classification, two construction algorithms for the index model are designed and implemented: one is the initialization algorithm for a given set of web pages, and the other is an updating algorithm for an existing index model and an incremental set of web pages.
Then the index model is completed by analyzing the hyperlinks among web pages after web page classification. Experimental results show that the proposed algorithms are feasible and effective. First, to validate effectiveness, an example index model is constructed from one million web pages crawled from the Internet; it shows that the generated relationships among web page classes can reflect real-world business associations. Second, for a given set of web pages, the initialization algorithm can complete the construction of the index model within a limited time. Last, the incremental updating algorithm can update an index model in step with the actual growth of web pages.

2 Index Model Framework

In this paper, c denotes a web page class; C denotes a set of web page classes; p denotes a web page; D denotes a sample set of web pages; and t denotes a feature, which is a word.

Definition 1. (Index Model) An index model is defined as a weighted directed graph G = (V, E), where

a) $V = \{c_i \mid c_i \in C\}$, and C is the set of all web page classes in the index model;

b) $E = \{\langle c_i, c_j \rangle \mid c_i, c_j \in V \text{ and } c_i \neq c_j\}$;

c) Let $w\langle c_i, c_j \rangle$ be the weight of $\langle c_i, c_j \rangle$; then

$$w\langle c_i, c_j \rangle = \sum_{p_t \in c_j,\; p_k \in c_i,\; u_k = url_t} \big( W_u \cdot P(c_j \mid p_t) \cdot P(c_i \mid p_k) \big), \qquad W_u = \frac{1}{n}$$

where $u_k$ is a hyperlink on web page $p_k$, $url_t$ is the Uniform Resource Locator of web page $p_t$, $n$ is the number of hyperlinks on web page $p_k$, and $P(c_j \mid p_t)$ is the probability that web page $p_t$ belongs to web page class $c_j$.

The entire construction process of the index model is as follows:

1. Web page pretreatment.
   (1) Extract the content enclosed by HTML tags from a web page.
   (2) Extract the content text.
   (3) Combine the contents extracted in steps (1) and (2).
   (4) Segment the web page content.
   (5) Remove stop words.
   (6) Select features to represent each web page.
   (7) Extract the hyperlinks in web pages.
2. Training the classifier and web page classification.
   (1) Generate a classifier by learning a sample set of web pages.
   (2) Classify all preprocessed web pages.
3. Compute link relationships among web page classes.
   (1) Use the index initialization algorithm to initialize an index model.
   (2) Use the index incremental updating algorithm to update an index model.

The building process of the index model is shown in Fig. 1.

[Fig. 1: Building process of index model based on web page classification. Inputs: a sample set of web pages, an initial set of web pages, and updating web pages. Stages: web page pretreatment (extract tags, extract body contents, segmentation, remove stop words, select features, extract hyperlinks); training classifier and web page classification; computing link relationships among web page classes (index initialization algorithm and index incremental updating algorithm).]
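The pretreatment stage listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `PageParser`, `pretreat` and `KEPT_TAGS` are hypothetical, the set of kept tags is an assumption, whitespace splitting stands in for a real word segmenter, and extraction of the body text is left out. It only shows tag-text extraction, stop-word removal, hyperlink extraction, and the uniform link weight $W_u = 1/n$ from Definition 1.

```python
from html.parser import HTMLParser

# Tags whose enclosed text is kept; this particular set is an assumption.
KEPT_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6", "a"}


class PageParser(HTMLParser):
    """Collects text inside kept tags and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.open_kept = []  # stack of kept tags that are currently open
        self.tag_text = []   # text chunks found inside kept tags
        self.links = []      # outgoing hyperlinks, in document order

    def handle_starttag(self, tag, attrs):
        if tag in KEPT_TAGS:
            self.open_kept.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if self.open_kept and self.open_kept[-1] == tag:
            self.open_kept.pop()

    def handle_data(self, data):
        if self.open_kept and data.strip():
            self.tag_text.append(data.strip())


def pretreat(html, stop_words):
    """Return (tokens, link_weights) for one page."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    # Whitespace tokenisation is a stand-in for a real segmentation tool.
    words = [w.lower() for chunk in parser.tag_text for w in chunk.split()]
    tokens = [w for w in words if w not in stop_words]
    n = len(parser.links)
    link_weights = {}
    for url in parser.links:
        # Every hyperlink gets W_u = 1/n; repeated URLs accumulate.
        link_weights[url] = link_weights.get(url, 0.0) + 1.0 / n
    return tokens, link_weights
```

Calling `pretreat(html, stop_words)` on a page with two distinct links yields a weight of 0.5 for each link, matching the assumption that all hyperlinks on a page are equally important.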

3 Algorithms for Constructing Index Model

3.1 Web page preprocessing

A web page contains various HTML tags, and the contents within different tags contribute differently to its subject. In order to classify a web page and obtain the relationships among web classes, we first preprocess each web page, which includes extracting tags and body contents, content segmentation, extracting hyperlinks, etc.

Step 1 Extract the contents enclosed in the Title, META, H1-H6, and A tags of an HTML file.

Step 2 Extract the content text of a web page using a statistical approach [9].

Step 3 Content segmentation. Process the contents extracted in Step 1 and Step 2 using the Chinese word segmentation tool IKAnalyzer [10], and remove stop words.

Step 4 Weight each feature using TF*IDF [11].

Step 5 Extract the feature vector of each web page. Based on the values obtained in Step 4, select the top n feature items to constitute the feature vector (n is an experimental value). The feature vector of a web page is then expressed as $p = (t_1, t_2, \ldots, t_n)$.

Step 6 Extract the hyperlinks of web pages. All hyperlinks extracted from a web page are denoted as a vector $P_{out} = (u_1, u_2, \ldots, u_n)$, where $u_i$ is a hyperlink on this web page.

Step 7 Weight each hyperlink. The weight of a hyperlink u is calculated as $W_u = \frac{1}{n}$, where n is the total number of hyperlinks on this web page. The basic idea derives from the Random Surfer Model [12], which assumes that all the hyperlinks on a web page are equally important.

3.2 Training classifier and web page classification

This paper uses the Naive Bayesian method [13] to train the classifier and classify all the web pages preprocessed above. Compared with KNN [14] and other methods for dealing with massive numbers of web pages, the Naive Bayesian method is simple and can guarantee better classification accuracy and faster classification speed.

Step 1 Generate a classifier through learning a sample set of web pages.
In this process, we calculate the probability that each feature item belongs to each web page class based on the sample pages. Various methods exist for this calculation. McCallum and Nigam proposed a multinomial model [13] whose misclassification rate is lower than that of other models when the feature set is relatively large. In this paper, we use their multinomial model to calculate $P(t_j \mid c)$, which represents the probability of $t_j$ belonging to c. The formula is as follows:

$$P(t_j \mid c) = \frac{TF(t_j, c)}{\sum_{k=1}^{V} TF(t_k, c)}$$

where V represents the total number of feature items in a sample class c, and $TF(t_j, c)$ represents the total frequency with which $t_j$ appears in c.

Step 2 Classify all preprocessed web pages. Calculate the probabilities that p belongs to each web page class; p is then classified into the class $c_i$ with the greatest probability. The probability is calculated as follows:

$$i = \arg\max_{j} \{ P(c_j \mid p) \}, \quad c_j \in C$$

$$P(c_j \mid p) = \frac{P(c_j) \prod_{i=1}^{n} P(t_i \mid c_j)}{P(p)}$$

where $P(c_j)$ is the prior probability, whose value is the proportion of web pages of $c_j$ in the sample set, $P(p)$ is a constant, and n is the number of feature items in p.

3.3 Computing link relationships among web classes

3.3.1 Web page representation

After the preprocessing and classification of a web page, we obtain the elements of a web page listed in Table 1.

Table 1: The representation of a web page p

| Element | Symbol |
|---|---|
| feature vector | $(t_1, t_2, \ldots, t_n)$ |
| web page class | $c$ |
| probability of p belonging to c | $P(c \mid p)$ |
| a set of hyperlinks on p | $P_{out} = (u_1, u_2, \ldots, u_m)$ |
| a set of URLs which have hyperlinks to p | $P_{in} = (u_1, u_2, \ldots, u_o)$ |

Here $P_{in}$ is initialized to empty before computing link relationships.

3.3.2 Computing relationships among web classes

A directed edge $\langle c_i, c_j \rangle$ in an index model represents the direct link relationship from $c_i$ to $c_j$, and its weight $w\langle c_i, c_j \rangle$ is calculated as a conditional probability $P(c_j \mid c_i)$. It indicates the probability of users visiting the web page class $c_j$ after they have browsed $c_i$. In this article, we propose a way to build link relationships among web classes based on the weights of hyperlinks and the probabilities of web pages belonging to classes. We suppose that a web page belongs to only one class. Then $P(c_j \mid c_i)$ is calculated as follows:

$$P(c_j \mid c_i) = \sum_{p_t \in c_j} P(c_j \mid p_t) \cdot P(p_t \mid c_i) \qquad (1)$$

where $P(c_j \mid p_t)$ and $P(p_t \mid c_i)$ are computed as:

$$P(c_j \mid p_t) = \frac{P(C_j \mid p_t)}{1 + \sum_{p_k \in c_j} P(C_j \mid p_k)} \qquad (2)$$

$$P(p_t \mid c_i) = \sum_{p_k \in c_i} \Big( P(c_i \mid p_k) \sum_{u_k = url_t} W_u \Big) \qquad (3)$$

In Eq. (2), $P(c_j \mid p_t)$ is the probability of $p_t$ belonging to $c_j$ after normalization. In Eq. (3), $P(p_t \mid c_i)$ is the probability of users visiting $p_t$ after they have browsed $c_i$, which is calculated by summing over all hyperlinks from web page class $c_i$ to page $p_t$; $u_k$ represents a hyperlink on $p_k$, $url_t$ represents the URL of $p_t$, and $W_u$ is the weight of hyperlink $u_k$. To facilitate the calculation, Eq. (3) can be transformed into the following form:

$$P(p_t \mid c_i) = \sum_{p_k \in c_i,\; u_k = url_t} \big( W_u \cdot P(c_i \mid p_k) \big) \qquad (4)$$

Then $P(c_j \mid c_i)$ is calculated as follows:

$$P(c_j \mid c_i) = \sum_{p_t \in c_j,\; p_k \in c_i,\; u_k = url_t} \big( W_u \cdot P(c_j \mid p_t) \cdot P(c_i \mid p_k) \big) \qquad (5)$$

3.3.3 Two algorithms for constructing index model

For a given set of web pages, Algorithm 1 completes the initial construction of an index model as defined in Def. 1.

Algorithm 1 Index Initialization Algorithm
Input: a set of web classes C, a set of web pages S.
Output: index model matrix IN_CC, hash table HT.
1: for each $c \in C$
2:   for each $p \in c$
3:     calculate $1 + \sum_{p \in c} P(c \mid p)$
4:     HT.put(url, c)   // url is the URL of p
5: for each $p \in S$
6:   for each $u \in P_{out}$   // P_out is the set of hyperlinks on p
7:     if HT.contains(u)
8:       calculate $P(c_j \mid p_k)$ and $P(c_i \mid p)$   // u is the URL of p_k
9:       $P(c_j \mid c_i) = P(c_j \mid c_i) + \frac{1}{|P_{out}|} \cdot P(c_j \mid p_k) \cdot P(c_i \mid p)$
10:    else P_in.put(url)   // P_in is the set of URLs which have hyperlinks to p

Two main factors influence the complexity of Algorithm 1: the number of web pages and the number of hyperlinks on each web page. We assume that the number of web pages is $|S|$ and the average number of hyperlinks on each web page is $|P_{out}|$. The time complexity of Steps 1-4 is $O(|S|)$. Steps 5-10 calculate the link relationships, and their time complexity is $O(|S| \cdot |P_{out}|)$. Thus, the total time complexity of Algorithm 1 is $O(|S| + |S| \cdot |P_{out}|) = O(|S| \cdot |P_{out}|)$.
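A minimal Python sketch of Algorithm 1 may help make the two passes concrete. It is an illustration under stated assumptions rather than the authors' code: the function name `build_index` and the input layout `{url: (class, raw_probability, out_links)}` are hypothetical, the index matrix is stored as a dictionary keyed by class pairs, and edges from a class to itself are skipped, following condition b) of Definition 1.

```python
from collections import defaultdict


def build_index(pages):
    """pages: {url: (cls, raw_prob, out_links)} -> (weights, pending_in).

    weights[(c_i, c_j)] approximates P(c_j | c_i); pending_in[u] lists the
    pages that link to a URL u which is not (yet) in the page set.
    """
    # Steps 1-4: per-class normaliser 1 + sum of raw probabilities,
    # and a hash table from URL to class.
    norm = defaultdict(lambda: 1.0)
    ht = {}
    for url, (cls, prob, _) in pages.items():
        norm[cls] += prob
        ht[url] = cls

    # Steps 5-10: accumulate W_u * P(c_j | p_k) * P(c_i | p) per hyperlink.
    weights = defaultdict(float)
    pending_in = defaultdict(set)
    for url, (c_i, prob_i, out_links) in pages.items():
        if not out_links:
            continue
        w_u = 1.0 / len(out_links)      # uniform hyperlink weight
        p_i = prob_i / norm[c_i]        # normalised P(c_i | p)
        for u in out_links:
            if u in ht:
                c_j, prob_j, _ = pages[u]
                if c_j != c_i:          # no self-loops (Definition 1, b)
                    weights[(c_i, c_j)] += w_u * (prob_j / norm[c_j]) * p_i
            else:
                pending_in[u].add(url)  # target page not known yet
    return dict(weights), dict(pending_in)
```

Each page and each hyperlink is visited once, which is where the $O(|S| \cdot |P_{out}|)$ bound comes from, provided hash-table lookups are treated as constant time.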
Algorithm 1 can complete the initial construction of an index model based on a given set of web pages, but it cannot update the model in real time with the changes. Considering dynamic changes of web pages on the Internet, we further give an index incremental updating algorithm for updating an existing index model.

Algorithm 2 Index Incremental Updating Algorithm
Input: a set of web classes C, web page p, index model matrix IN_CC, hash table HT.
Output: index model matrix IN_CC, hash table HT.
1: update $1 + \sum_{p_k \in c} P(c \mid p_k)$   // p belongs to web page class c
2: HT.put(url, c)   // url is the URL of p
3: for each $c_i \in C$
4:   $P(c_i \mid c) = P(c_i \mid c) \cdot \dfrac{1 + \sum_{p_k \in c} P(c \mid p_k) - P(c \mid p)}{1 + \sum_{p_k \in c} P(c \mid p_k)}$
5:   $P(c \mid c_i) = P(c \mid c_i) \cdot \dfrac{1 + \sum_{p_k \in c} P(c \mid p_k) - P(c \mid p)}{1 + \sum_{p_k \in c} P(c \mid p_k)}$
6: for each $u \in P_{in}$   // P_in is the set of URLs which have hyperlinks to p
7:   if HT.contains(u)
8:     calculate $P(c_j \mid p_k)$ and $P(c \mid p)$   // u is the URL of p_k
9:     $P(c \mid c_j) = P(c \mid c_j) + \frac{1}{|P_{in}|} \cdot P(c_j \mid p_k) \cdot P(c \mid p)$
10: for each $u \in P_{out}$   // P_out is the set of hyperlinks on p
11:   if HT.contains(u)
12:     calculate $P(c_j \mid p_k)$ and $P(c \mid p)$   // u is the URL of p_k
13:     $P(c_j \mid c) = P(c_j \mid c) + \frac{1}{|P_{out}|} \cdot P(c_j \mid p_k) \cdot P(c \mid p)$
14:   else P_in.put(url)   // P_in is the set of URLs which have hyperlinks to u

The main factors affecting the complexity of Algorithm 2 are the number of web classes, the number of hyperlinks on web page p, and the number of web pages that have hyperlinks pointing to p. The time complexity of Steps 3-5 is $O(|C|)$, and that of Steps 6-14 is $O(|P_{in}| + |P_{out}|)$. Thus, the total time complexity of Algorithm 2 is $O(|C| + |P_{in}| + |P_{out}|)$.

4 Experimental Evaluation and Analysis

4.1 Experimental environment

The experiments in this paper are run on nine Sugon servers, each with 4 cores and 8 GB of memory. The experimental set contains one million Chinese web pages crawled from the Internet. Eight servers are used to classify web pages with the classification tool Mahout [15] on the Hadoop distributed system, and one server is used to run the proposed construction algorithms for the index model.
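As an illustration of Algorithm 2, the following Python sketch adds one page to an existing model. The names `new_state` and `add_page` and the dictionary-based state are hypothetical. One deliberate deviation is labeled here and in the code: for in-links, the sketch uses the linking page's own hyperlink weight $W_u = 1/n$ from Definition 1 rather than the factor written in Step 9, which is an assumption made so that the incremental result coincides with a batch build over the same pages.

```python
def new_state():
    """Empty index model: normalisers, URL table, edge weights, pending in-links."""
    return {"norm": {}, "ht": {}, "weights": {}, "pending": {}}


def add_page(state, url, cls, prob, out_links):
    """Add one classified page (class cls, raw probability prob) to the model."""
    norm, ht = state["norm"], state["ht"]
    weights, pending = state["weights"], state["pending"]

    # Step 1: update the normaliser 1 + sum of raw probabilities of cls.
    old = norm.get(cls, 1.0)
    new = old + prob
    norm[cls] = new
    # Step 2: register the page (class, raw probability, out-degree).
    ht[url] = (cls, prob, len(out_links))

    # Steps 3-5: rescale existing edges touching cls by (new - prob) / new.
    for c_i in norm:
        for edge in ((c_i, cls), (cls, c_i)):
            if edge in weights:
                weights[edge] *= old / new

    p_new = prob / new  # normalised P(cls | p)

    # Steps 6-9: pages seen earlier that link to this page.
    for src in pending.pop(url, []):
        c_j, prob_j, n_out = ht[src]
        if c_j != cls:
            # Assumption: W_u of the source page, not 1/|P_in|.
            w = (1.0 / n_out) * p_new * (prob_j / norm[c_j])
            weights[(c_j, cls)] = weights.get((c_j, cls), 0.0) + w

    # Steps 10-14: hyperlinks on this page.
    for u in out_links:
        if u in ht:
            c_j, prob_j, _ = ht[u]
            if c_j != cls:
                w = (1.0 / len(out_links)) * (prob_j / norm[c_j]) * p_new
                weights[(cls, c_j)] = weights.get((cls, c_j), 0.0) + w
        else:
            pending.setdefault(u, []).append(url)
```

The rescaling in Steps 3-5 works because every term of an edge weight that involves class c carries exactly one factor normalised by the old sum; multiplying by old/new replaces that normaliser with the updated one without revisiting any page.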
4.2 Experimental results and analysis

In this paper, the result of web page classification directly affects the construction of the index model, so the accuracy of the classification algorithm should be guaranteed. We use 282 web page classes and their sample web pages from the Open Directory Project dmoz [16] as the training data, and manually select some relevant web pages as a supplement. There are about 500 sample web pages in each web page class.
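To make the classification step of Section 3.2 concrete, here is a minimal multinomial Naive Bayes sketch in Python. It is not the Mahout-based setup used in the experiments. The function names `train` and `classify` are hypothetical, and two choices are assumptions that go beyond the formulas in the text: add-one (Laplace) smoothing is applied so that unseen features do not zero out a class, and scores are summed in log space to avoid underflow.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (class_label, [feature, ...]) -> model tuple."""
    class_pages = Counter()     # number of sample pages per class
    tf = defaultdict(Counter)   # tf[c][t] = total frequency of t in class c
    vocab = set()
    for label, feats in samples:
        class_pages[label] += 1
        tf[label].update(feats)
        vocab.update(feats)
    total = sum(class_pages.values())
    # Prior P(c): share of the sample pages that belong to class c.
    prior = {c: n / total for c, n in class_pages.items()}
    return prior, tf, vocab


def classify(model, feats):
    """Return the class with the largest log P(c) + sum of log P(t_i | c)."""
    prior, tf, vocab = model
    best, best_score = None, -math.inf
    for c in prior:
        # Add-one smoothing (an assumption, not stated in the text).
        denom = sum(tf[c].values()) + len(vocab)
        score = math.log(prior[c])
        for t in feats:
            score += math.log((tf[c][t] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best
```

The constant $P(p)$ is omitted because it is the same for every class and does not change the arg max.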

We randomly select 80% of the web pages in each sample class to train the classifier and use the remaining 20% as the test set. The experimental results show that the classification accuracy is 71.7%.

To evaluate and compare the efficiency of the two proposed algorithms, the one million crawled web pages are classified into 282 web classes and then used to construct an index model with each algorithm separately, as shown in Fig. 2. Specifically, we use the updating algorithm to update an empty model with these web pages. The comparative running time is shown in Fig. 2(a).

[Fig. 2: Time comparisons of two construction algorithms. (a) Building an index model; (b) Updating an index model. Axes: Time (h) against Web Pages (Million); series: Initialization, Updating.]

As can be seen from Fig. 2(a), the time efficiency of the index initialization algorithm is better than that of the index incremental updating algorithm. To build an index model with $|S|$ pages, the index incremental updating algorithm requires $O(|S| \cdot |C| + |S| \cdot (|P_{in}| + |P_{out}|))$ time, while the initialization algorithm requires $O(|S| \cdot |P_{out}|)$ time. As previously mentioned, the time complexity of both algorithms increases linearly with the number of web pages. The experimental set of web pages is not self-contained with respect to hyperlinks, i.e., many web pages that are linked to from this set are not within it, so the growth curves in Fig. 2(a) are not yet strictly linear. When the number of web pages is sufficiently large, the running time of the two algorithms will stabilize and tend to grow linearly.

To evaluate the efficiency of the two proposed algorithms when updating an existing index model, we update an index model built from half a million web pages using another half a million web pages. The comparative running time is shown in Fig. 2(b).
Compared with the index incremental updating algorithm, the index initialization algorithm builds an index model in a relatively short period of time. But as shown in Fig. 2(b), when an index model needs to be updated, the index initialization algorithm has to rebuild the whole index model, while the updating algorithm only needs to add the new pages to the existing index model. Therefore, the two algorithms suit different situations: the initialization algorithm is used to initialize an index model, and the incremental updating algorithm is used to update an existing one. According to the annual statistical report of the China Internet Network Information Center (CNNIC), the number of web pages increases steadily. By increasing the number of devices, the updating algorithm can keep up with the update speed of web pages on the Internet, which illustrates its feasibility.

Further, we analyzed the structure of the index model constructed above. As the edge weights are generally small, they are multiplied by a constant scaling factor, and the statistics of the scaled edge weights are shown in Fig. 3. Since the test data is only one million web pages, many web pages that are linked to from this experimental set are not within it, so edge weights are generally small; edges whose weight is less than 1 may be caused by the heterogeneity of the Internet. The weight distribution shows that the

index model can distinguish the strength of link relationships among web page classes. Therefore, the index model constructed by the algorithm satisfies the basic conditions of validity.

[Fig. 3: Statistics of the index model edge weight. Vertical axis: the percentage of all edges (0% to 50%); horizontal axis: edge weight, in bins up to >100.]

4.3 Exhibition example

The index model we have built is too large to show in full. In this section, we select 4 web page classes and their edges from the index model built above as an exhibition example. As shown in Fig. 4, an index model is a directed graph. Each outgoing edge of a web page class represents the probability of browsing from that class to the class it points to. For example, in Fig. 4, the directed edge from Estate to House Renting means that if users are browsing web pages in Estate now, the probability that they will visit web pages in House Renting in the next step is 22.1%. Based on the index model, we can know which classes are more closely related to a specified class. Estate is more closely related to Retail Market than to House Renting, whereas House Renting is more closely related to Estate than to Retail Market. As the web pages in those classes change, the corresponding edge values change too. When most web pages on the Internet have been assigned to their corresponding classes, the edge values of the index model tend to be stable.

[Fig. 4: An index model instance with four web page classes: Shopping, Estate, Retail Market, and House Renting.]

5 Conclusion

This paper presents an index model based on web page classification and designs its initial construction algorithm by hyperlink analysis. Moreover, in order to reflect the dynamic process of web pages on the

Internet, we give an index incremental updating algorithm; the time complexity of both algorithms grows only linearly. Experimental results show that the index incremental updating algorithm can dynamically update an existing index model and can keep up with the update speed of web pages on the Internet. The index model can organize and manage web pages, and the link relationships among classes can reflect some real business connections. However, this article gives only a preliminary construction algorithm. We will further study how to make better use of an index model to serve users and how to build an index model based on other web classification methods such as KNN.

References

[1] F. S. Hong and H. Kun, Research on search engine technology and service and its enlightenment, Journal of the China Society for Scientific and Technical Information, 19(6) (2002).
[2] U. Shardanand and P. Maes, Social information filtering: algorithms for automating Word of Mouth, in: Proc. on Human Factors in Computing Systems, ACM Press, New York, 1995.
[3] W. Hill, L. Stead, M. Rosenstein, and G. Furnas, Recommending and evaluating choices in a virtual community of use, in: Proc. on Human Factors in Computing Systems, ACM Press, New York, 1995.
[4] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, GroupLens: An open architecture for collaborative filtering of netnews, in: Proc. the Computer Supported Cooperative Work Conf., ACM Press, New York, 1994.
[5] E. Rich, User modeling via stereotypes, Cognitive Science, 3(4) (1979).
[6] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Publishing, New York.
[7] B. P. S. Murthi and S. Sarkar, The role of the management sciences in research on personalization, Management Science, 49(10) (2003).
[8] G. Adomavicius and A. Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, IEEE Trans. on Knowledge and Data Engineering, 17(6) (2005).
[9] S. C. Jie and G. Yi, A statistical approach for content extraction from web page, Journal of Chinese Information Processing, 18(5) (2004).
[10] H. Y. Biao, The comparative study of Chinese word segmentation of Lucene interface, Science & Technology Information, 12 (2012).
[11] Y. Yang and J. O. Pedersen, A comparative study on feature selection in text categorization, in: Proc. the 14th International Conference on Machine Learning (ICML 97), 1997.
[12] L. Page, S. Brin, and R. Motwani, The PageRank citation ranking: Bringing order to the web.
[13] A. McCallum and K. Nigam, A comparison of event models for naive Bayes text classification, in: Proc. AAAI/ICML-98 Workshop on Learning for Text Categorization, AAAI Press, Menlo Park, CA, 1998.
[14] T. Cover and P. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory, 13(1) (1967).
[15] S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout in Action, Manning Publications, Shelter Island.
[16] M. Grobelnik, J. Brank, D. Mladenić, B. Novak, and B. Fortuna, Using dmoz for constructing ontology from data stream, in: Proc. the 28th International Conference on Information Technology Interfaces, Dubrovnik, 2006.

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of

More information

How To Filter Spam Image From A Picture By Color Or Color

How To Filter Spam Image From A Picture By Color Or Color Image Content-Based Email Spam Image Filtering Jianyi Wang and Kazuki Katagishi Abstract With the population of Internet around the world, email has become one of the main methods of communication among

More information

The PageRank Citation Ranking: Bring Order to the Web

The PageRank Citation Ranking: Bring Order to the Web The PageRank Citation Ranking: Bring Order to the Web presented by: Xiaoxi Pang 25.Nov 2010 1 / 20 Outline Introduction A ranking for every page on the Web Implementation Convergence Properties Personalized

More information

A Proposed Algorithm for Spam Filtering Emails by Hash Table Approach

A Proposed Algorithm for Spam Filtering Emails by Hash Table Approach International Research Journal of Applied and Basic Sciences 2013 Available online at www.irjabs.com ISSN 2251-838X / Vol, 4 (9): 2436-2441 Science Explorer Publications A Proposed Algorithm for Spam Filtering

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

III. DATA SETS. Training the Matching Model

III. DATA SETS. Training the Matching Model A Machine-Learning Approach to Discovering Company Home Pages Wojciech Gryc Oxford Internet Institute University of Oxford Oxford, UK OX1 3JS Email: wojciech.gryc@oii.ox.ac.uk Prem Melville IBM T.J. Watson

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

Semantic Concept Based Retrieval of Software Bug Report with Feedback

Semantic Concept Based Retrieval of Software Bug Report with Feedback Semantic Concept Based Retrieval of Software Bug Report with Feedback Tao Zhang, Byungjeong Lee, Hanjoon Kim, Jaeho Lee, Sooyong Kang, and Ilhoon Shin Abstract Mining software bugs provides a way to develop

More information

ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach

ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach Banatus Soiraya Faculty of Technology King Mongkut's

More information

A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters

A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters Wei-Lun Teng, Wei-Chung Teng

More information

Fault Analysis in Software with the Data Interaction of Classes

Fault Analysis in Software with the Data Interaction of Classes , pp.189-196 http://dx.doi.org/10.14257/ijsia.2015.9.9.17 Fault Analysis in Software with the Data Interaction of Classes Yan Xiaobo 1 and Wang Yichen 2 1 Science & Technology on Reliability & Environmental

More information

RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS

RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS ISBN: 978-972-8924-93-5 2009 IADIS RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS Ben Choi & Sumit Tyagi Computer Science, Louisiana Tech University, USA ABSTRACT In this paper we propose new methods for

More information

Bayesian Spam Filtering

Bayesian Spam Filtering Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary amaobied@ucalgary.ca http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating

More information

SEO Techniques for various Applications - A Comparative Analyses and Evaluation

SEO Techniques for various Applications - A Comparative Analyses and Evaluation IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727 PP 20-24 www.iosrjournals.org SEO Techniques for various Applications - A Comparative Analyses and Evaluation Sandhya

More information

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Data Mining in Web Search Engine Optimization and User Assisted Rank Results Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management

More information

A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS

A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS Charanma.P 1, P. Ganesh Kumar 2, 1 PG Scholar, 2 Assistant Professor,Department of Information Technology, Anna University

More information

Web Document Clustering

Web Document Clustering Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

More information

ISSN: 2348 9510. A Review: Image Retrieval Using Web Multimedia Mining

ISSN: 2348 9510. A Review: Image Retrieval Using Web Multimedia Mining A Review: Image Retrieval Using Web Multimedia Satish Bansal*, K K Yadav** *, **Assistant Professor Prestige Institute Of Management, Gwalior (MP), India Abstract Multimedia object include audio, video,

More information

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP 1 KALYANKUMAR B WADDAR, 2 K SRINIVASA 1 P G Student, S.I.T Tumkur, 2 Assistant Professor S.I.T Tumkur Abstract- Product Review System

More information

Search Result Optimization using Annotators

Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

Lasso-based Spam Filtering with Chinese Emails

Journal of Computational Information Systems 8: 8 (2012) 3315-3322 Available at http://www.jofcis.com Zunxiong LIU 1, Xianlong ZHANG 1,, Shujuan ZHENG 2 1

Remote support for lab activities in educational institutions

Marco Mari 1, Agostino Poggi 1, Michele Tomaiuolo 1 1 Università di Parma, Dipartimento di Ingegneria dell'informazione 43100 Parma Italy {poggi,mari,tomamic}@ce.unipr.it,

Bisecting K-Means for Clustering Web Log data

Ruchika R. Patil Department of Computer Technology YCCE Nagpur, India Amreen Khan Department of Computer Technology YCCE Nagpur, India ABSTRACT Web usage mining

1 o Semestre 2007/2008

Departamento de Engenharia Informática Instituto Superior Técnico Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction

Predict Influencers in the Social Network

Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

Filtering Noisy Contents in Online Social Network by using Rule Based Filtering System

Bala Kumari P 1, Bercelin Rose Mary W 2 and Devi Mareeswari M 3 1, 2, 3 M.TECH / IT, Dr.Sivanthi Aditanar College

Index Terms Domain name, Firewall, Packet, Phishing, URL.

BDD for Implementation of Packet Filter Firewall and Detecting Phishing Websites Naresh Shende Vidyalankar Institute of Technology Prof. S. K. Shinde Lokmanya Tilak College of Engineering Abstract Packet

A Content based Spam Filtering Using Optical Back Propagation Technique

Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

Subordinating to the Majority: Factoid Question Answering over CQA Sites

Journal of Computational Information Systems 9: 16 (2013) 6409-6416 Available at http://www.jofcis.com Xin LIAN, Xiaojie YUAN, Haiwei

Medical Image Segmentation of PACS System Image Post-processing *

Lv Jie, Xiong Chun-rong, and Xie Miao Department of Professional Technical Institute, Yulin Normal University, Yulin Guangxi 537000, China

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

AI TERM PROJECT GROUP 14 1 Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach

Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA

Parallel Data Selection Based on Neurodynamic Optimization in the Era of Big Data

Jun Wang Department of Mechanical and Automation Engineering The Chinese University of Hong Kong Shatin, New Territories,

Experiments in Web Page Classification for Semantic Web

Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

Predict the Popularity of YouTube Videos Using Early View Data


Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Part 2: Mining using MapReduce Mining algorithms using MapReduce

Towards Effective Recommendation of Social Data across Social Networking Sites

Yuan Wang 1, Jie Zhang 2, and Julita Vassileva 1 1 Department of Computer Science, University of Saskatchewan, Canada {yuw193,jiv}@cs.usask.ca

Server Load Prediction

Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that

Data Mining - Evaluation of Classifiers

Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) 1 School of

Dr. D. Y. Patil College of Engineering, Ambi,. University of Pune, M.S, India University of Pune, M.S, India

Volume 4, Issue 6, June 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Effective Email

Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

A Comparative Approach to Search Engine Ranking Strategies

Dharminder Singh 1, Ashwani Sethi 2 Guru Gobind Singh Collage of Engineering & Technology Guru Kashi University Talwandi Sabo, Bathinda, Punjab

Chapter 2 The Research on Fault Diagnosis of Building Electrical System Based on RBF Neural Network

Qian Wu, Yahui Wang, Long Zhang and Li Shen Abstract Building electrical system fault diagnosis is the

Monitoring Web Browsing Habits of User Using Web Log Analysis and Role-Based Web Accessing Control. Phudinan Singkhamfu, Parinya Suwanasrikham

Chiang Mai University, Thailand 0659 The Asian Conference on

Search and Information Retrieval

Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

Ranked Keyword Search in Cloud Computing: An Innovative Approach

International Journal of Computational Engineering Research Vol, 03 Issue, 6 1, Vimmi Makkar 2, Sandeep Dalal 1, (M.Tech) 2,(Assistant professor)

Categorical Data Visualization and Clustering Using Subjective Factors

Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2

Department of Computer Engineering, YMCA University of Science & Technology, Faridabad,

Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.

Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.38457 Accuracy Rate of Predictive Models in Credit Screening Anirut Suebsing

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

Random Forest Based Imbalanced Data Cleaning and Classification

Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem

International Journal of Emerging Technology & Research

An Implementation Scheme For Software Project Management With Event-Based Scheduler Using Ant Colony Optimization Roshni Jain 1, Monali Kankariya

Machine Learning Final Project Spam Email Filtering

March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences

Achieve Better Ranking Accuracy Using CloudRank Framework for Cloud Services

Ms. M. Subha #1, Mr. K. Saravanan *2 # Student, * Assistant Professor Department of Computer Science and Engineering Regional

Semantic Search in Portals using Ontologies

Wallace Anacleto Pinheiro Ana Maria de C. Moura Military Institute of Engineering - IME/RJ Department of Computer Engineering - Rio de Janeiro - Brazil [awallace,anamoura]@de9.ime.eb.br

How To Use Neural Networks In Data Mining

International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and

Active Learning SVM for Blogs recommendation

Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

Simple Language Models for Spam Detection

Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year's Spam track we used classifiers based on language models. These models are used to

An Overview of Knowledge Discovery Database and Data mining Techniques

Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

Top Top 10 Algorithms in Data Mining

ICDM 06 Panel on Top Top 10 Algorithms in Data Mining 1. The 3-step identification process 2. The 18 identified candidates 3. Algorithm presentations 4. Top 10 algorithms: summary 5. Open discussions ICDM

Top 10 Algorithms in Data Mining

Xindong Wu (吴信东) Department of Computer Science University of Vermont, USA; 合肥工业大学计算机与信息学院 1 Top 10 Algorithms in Data Mining by the IEEE ICDM Conference

Blog Post Extraction Using Title Finding

Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School

How To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm

IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 10 April 2015 ISSN (online): 2349-784X Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode

Internet Traffic Prediction by W-Boost: Classification and Regression

Hanghang Tong 1, Chongrong Li 2, Jingrui He 1, and Yang Chen 1 1 Department of Automation, Tsinghua University, Beijing 100084, China

Email Spam Detection Using Customized SimHash Function

International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org

Decision Support System For A Customer Relationship Management Case Study

Ozge Kart 1, Alp Kut 1, and Vladimir Radevski 2 1 Dokuz Eylul University, Izmir, Turkey {ozge, alp}@cs.deu.edu.tr 2 SEE University,

Content-Based Recommendation

Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches

Cloud Storage-based Intelligent Document Archiving for the Management of Big Data

Keedong Yoo Dept. of Management Information Systems Dankook University Cheonan, Republic of Korea Abstract : The cloud

The Enron Corpus: A New Dataset for Email Classification Research

Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu

Term extraction for user profiling: evaluation by the user

Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,

GoldenBullet in a Nutshell

Y. Ding, M. Korotkiy, B. Omelayenko, V. Kartseva, V. Zykov, M. Klein, E. Schulten, and D. Fensel Vrije Universiteit Amsterdam, De Boelelaan 1081a, 1081 HV Amsterdam, NL From:

FCE: A Fast Content Expression for Server-based Computing

Qiao Li Mentor Graphics Corporation 11 Ridder Park Drive San Jose, CA 95131, U.S.A. Email: qiao li@mentor.com Fei Li Department of Computer Science

Inner Classification of Clusters for Online News

Harmandeep Kaur 1, Sheenam Malhotra 2 1 (Computer Science and Engineering Department, Shri Guru Granth Sahib World University Fatehgarh Sahib) 2 (Assistant

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it

Bari, 24 April 2003 Overview Introduction Knowledge discovery from text (Web Content

A Network Simulation Experiment of WAN Based on OPNET

1 Yao Lin, 2 Zhang Bo, 3 Liu Puyu 1, Modern Education Technology Center, Liaoning Medical University, Jinzhou, Liaoning, China,yaolin111@sina.com *2

A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster

pp.11-20 http://dx.doi.org/10.14257/ijgdc.2014.7.2.02 Kehe Wu 1, Long Chen 2, Shichao Ye 2 and Yi Li 2 1 Beijing

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

Customer Classification And Prediction Based On Data Mining Technique

Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

A Web Recommender System for Recommending, Predicting and Personalizing Music Playlists

Zeina Chedrawy 1, Syed Sibte Raza Abidi 1 1 Faculty of Computer Science, Dalhousie University, Halifax, Canada {chedrawy,

Software Engineering 4C03

Research Paper: Google TM Servers Researcher: Nathan D. Jory Last Revised: March 29, 2004 Instructor: Kartik Krishnan Introduction The Google TM search engine is a powerful and

Understanding Web personalization with Web Usage Mining and its Application: Recommender System

Manoj Swami 1, Prof. Manasi Kulkarni 2 1 M.Tech (Computer-NIMS), VJTI, Mumbai. 2 Department of Computer Technology,

Optimize Position and Path Planning of Automated Optical Inspection

Journal of Computational Information Systems 8: 7 (2012) 2957-2963 Available at http://www.jofcis.com Hao WU, Yongcong KUANG, Gaofei

Make search become the internal function of Internet

Wang Liang 1, Guo Yi-Ping 2, Fang Ming 3 1, 3 (Department of Control Science and Control Engineer, Huazhong University of Science and Technology, WuHan,

Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines

22-24 October, 2014, San Francisco, USA Baosheng Yin, Wei Wang, Ruixue Lu, Yang Yang Abstract With the increasing

Recommender Systems: Content-based, Knowledge-based, Hybrid. Radek Pelánek

2015 Today lecture, basic principles: content-based knowledge-based hybrid, choice of approach,... critiquing, explanations,...

Bayesian Spam Detection

Scholarly Horizons: University of Minnesota, Morris Undergraduate Journal Volume 2 Issue 1 Article 2 2015 Jeremy J. Eberhardt University of Minnesota, Morris

Exploration of Search Engine Optimization Technology Applied in Internet Marketing

1 Li-Hsing HO, 2 Meng-Huang LU, 3 Chin-Pei LEE, 4 Tien-Fu PENG 1, First Author College of Management, Chung Hua University,

Clustering Technique in Data Mining for Text Documents

Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 2 February, 2014 Page No. 3951-3961

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein Falaki @mhfalaki hossein@databricks.com

Challenges of numerical computation over big data When applying any algorithm to big data

AN APPROACH TO ANTICIPATE MISSING ITEMS IN SHOPPING CARTS

Maddela Pradeep 1, V. Nagi Reddy 2 1 M.Tech Scholar(CSE), 2 Assistant Professor, Nalanda Institute Of Technology(NIT), Siddharth Nagar, Guntur,

Automated Medical Citation Records Creation for Web-Based On-Line Journals

Daniel X. Le, Loc Q. Tran, Joseph Chow Jongwoo Kim, Susan E. Hauser, Chan W. Moon, George R. Thoma National Library of Medicine,

DATA PREPARATION FOR DATA MINING

Applied Artificial Intelligence, 17:375-381, 2003 Copyright © 2003 Taylor & Francis 0883-9514/03 $12.00 +.00 DOI: 10.1080/08839510390219264 SHICHAO ZHANG and CHENGQI

Research of Postal Data mining system based on big data

3rd International Conference on Mechatronics, Robotics and Automation (ICMRA 2015) Xia Hu 1, Yanfeng Jin 1, Fan Wang 1 1 Shi Jiazhuang Post & Telecommunication

A Comparison of General Approaches to Multiprocessor Scheduling

Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University

Less naive Bayes spam detection

Hongming Yang Eindhoven University of Technology Dept. EE, Rm PT 3.27, P.O.Box 53, 5600MB Eindhoven The Netherlands. E-mail:h.m.yang@tue.nl also CoSiNe Connectivity Systems