

User Interest Analysis in Web Filtering

A-Ning DU and Bin-Xing FANG

Abstract: Web filtering can help people find the most interesting and valuable information. However, current web filtering techniques cannot retrieve results that accurately represent the user interest. This paper investigates user interest in web filtering and analyzes the problems of current machine-learning-based web filters. According to the differences in user interest, the task of web filtering is divided into three levels: relativity-filter, similarity-filter and homology-filter. A Biased Support Vector Machine (BSVM) is used to make the filter adaptable to these differences. Experiments show that BSVM can greatly improve web filtering performance.

Index Terms: Web Filtering, User Interest, Biased Support Vector Machine.

I. INTRODUCTION

The growth of Internet resources brings up the problem of information overload and the demand for quality enhancement: people want to read the most interesting messages and avoid having to read low-quality or uninteresting ones. Web filtering is the activity of classifying a stream of incoming web pages dispatched asynchronously by an information producer to an information consumer [1]. It helps people find the most interesting and valuable information and saves Internet users from drowning in the flood of incoming information. In recent years the machine learning (ML) paradigm [2] has become more popular than knowledge engineering by domain experts for solving this problem, because of its automatic learning and relativity-analysis abilities. However, these ML algorithms are insufficiently accurate and do not adapt well to the ever-changing user interest, that is, the appropriateness of a web document to the user. For example, distinguishing Pornography from SexEd may be difficult, and distinguishing Pornography from Erotica is even harder, since the border is extremely subjective.
This paper studies how to adjust web filtering results to fit the user interest better. Based on a careful study of user interest, the web filtering result is divided into three scopes, relativity, similarity and homology, which help describe the user interest more accurately. To make the filtering result more precise, the inductive process is improved so that it achieves better precision and recall with respect to the user interest. The improved machine learning algorithm in this paper is based on the Support Vector Machine (SVM) because, among the generic machine learning algorithms (Decision Tree, Rule Induction, Bayesian algorithm and SVM), SVM has been shown to be superior, with a solid foundation in Statistical Learning Theory (SLT). The improved algorithm is called the Biased Support Vector Machine (BSVM). It introduces a stimulant function that uses the training example distribution $n_+/n_-$ and a user-adaptable parameter $k$ to treat the classes of the pre-assigned pages unequally, so as to adjust the filtering result to fit the user interest best.

The remainder of the paper is organized as follows. Section II introduces web filtering, analyzes the user interest and the corresponding differences in filtering results, and discusses the failure of current machine learning approaches. Section III puts forward the model of the Biased Support Vector Machine and analyzes its efficiency in web filtering. Section IV closes the paper with our conclusions and future work.

A-Ning DU and Bin-Xing FANG are with the Research Center of Computer Network and Information Security Technology, Harbin Institute of Technology, People's Republic of China.

II. WEB FILTERING AND USER INTEREST

A. Web Filtering

Web filtering is the task of assigning a boolean value to each web page vector $d_i \in D$, where $D$ is a domain of web pages.
A value of TRUE assigned to $d_i$ indicates a decision that page $d_i$ is relevant to the user interest, while FALSE indicates that it is not. More formally, the task is to approximate the unknown target function $\Psi : D \to \{\mathrm{TRUE}, \mathrm{FALSE}\}$ (which describes how web pages ought to be assigned) by means of a function $\Phi : D \to \{\mathrm{TRUE}, \mathrm{FALSE}\}$ called the filter. How to improve the precision and recall of the filter $\Phi$ is the core problem of web filtering.

The general process of web filtering includes five steps:

1) User interest acquisition: acquire many user-assigned web pages as the training set.
2) Web page pre-processing: translate the assigned pages into a set of compact representations of page content. Usually a page $d_i$ is represented as a vector of term weights $d_i = \{w_{1i}, w_{2i}, \ldots, w_{|F|i}\}$, where $F$ is the set of features that occur at least once in at least one document of $D$, and $0 < w_{ki} < 1$ represents how much feature $f_k$ contributes to the semantics of page $d_i$.
3) Dimensionality reduction: select the features of high contribution to reduce the size of the feature set $F$.
4) Construction of web filters: build a filter that describes the user interest automatically.
5) Prediction on unfiltered web pages: use the filter to predict whether an unmarked web page is relevant or not.

Representation of web pages is the basic step of the process, while the degree of dimensionality reduction is the key influencing factor. The effectiveness of a web filter is decided by the generalization and description ability of the web filtering algorithm.

Current implementations of web filtering mainly use four techniques: URL blocking, keyword filtering, rating systems, and intelligent content analysis.

URL blocking restricts or allows access by comparing the requested web page's URL (and equivalent IP address) with URLs in a stored list. Its advantages are speed and efficiency, but it requires a URL list, and generating and maintaining that list is quite costly.

Keyword filtering blocks access to web sites on the basis of the occurrence of offensive words and phrases on those sites. However, many web sites that do not contain objectionable content will also be blocked.

Rating systems let web publishers associate labels or metadata with web pages to limit certain web content to target audiences. In general, however, this approach cannot provide a reliable source of information.

Intelligent content analysis systems automatically classify web content by means of ML algorithms. Such learning and adaptation programs can help give semantic meaning to context-dependent words, and they are therefore the dominant approach in web filtering.

Almost all existing filtering software uses URL blocking, while some also provides rating and keyword options. The performance of a filtering system can be measured in terms of the blocking rate, which is the percentage of objectionable web pages that are correctly blocked, and the overblocking rate, which is the percentage of legitimate pages that are blocked. The Netprotect project evaluated 50 commercially available filtering systems using 2,794 URLs with pornographic content and 1,655 URLs with normal content [3]. Their results, reproduced in Table I, show that the accuracy of existing systems is far from satisfactory.

TABLE I
NETPROTECT'S EVALUATION OF WEB FILTERING TOOLS [3]

Filtering Tool             Blocking Efficiency   Overblocking Rate
BizGuard                   55%                   10%
Cyber Patrol               52%                   2%
CYBERsitter                46%                   3%
CYBER Snoop                65%                   23%
Internet Watcher           %                     0%
Net Nanny                  20%                   5%
Norton Internet Security   45%                   6%
Optenet                    79%                   25%
SurfMonkey                 65%                   11%
X-Stop                     65%                   4%

B. Analysis of User Interest

In practical web filtering applications, the set of web pages related to the user interest may be considerably large. However, what the user desires may be just several homologous pages.
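As a brief aside on the two measures reported in Table I, the blocking rate and the overblocking rate can be computed directly from a filter's decisions. The following is a minimal sketch, not the Netprotect procedure: the function name `blocking_rates` and the toy decisions are hypothetical and serve only to make the two definitions concrete.

```python
def blocking_rates(decisions, labels):
    """Return (blocking_rate, overblocking_rate) as fractions.

    decisions[i] is True when the filter blocked page i.
    labels[i] is True when page i is objectionable (should be blocked).
    """
    if len(decisions) != len(labels):
        raise ValueError("decisions and labels must have the same length")
    bad_total = sum(1 for lab in labels if lab)
    good_total = len(labels) - bad_total
    if bad_total == 0 or good_total == 0:
        raise ValueError("need at least one objectionable and one legitimate page")
    # Objectionable pages that were blocked (correct blocks).
    bad_blocked = sum(1 for d, lab in zip(decisions, labels) if d and lab)
    # Legitimate pages that were blocked (overblocking).
    good_blocked = sum(1 for d, lab in zip(decisions, labels) if d and not lab)
    return bad_blocked / bad_total, good_blocked / good_total


# Hypothetical toy run: 4 objectionable pages followed by 5 legitimate pages.
labels = [True, True, True, True, False, False, False, False, False]
decisions = [True, True, True, False, True, False, False, False, False]
blocking, overblocking = blocking_rates(decisions, labels)
print(f"blocking rate {blocking:.0%}, overblocking rate {overblocking:.0%}")
# prints: blocking rate 75%, overblocking rate 20%
```

A good filter pushes the first number up while keeping the second one down; Table I shows that the tools with the highest blocking efficiency also tend to overblock the most.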
In order to show the differences in user interest, we first give some examples and analyze the user's true requirements.

Example 1: Problems in Pornographic Page Filtering

The Internet has become an important source of information. However, it also hosts pornographic, violent and other content that is inappropriate for most viewers. Web filtering can be used to block access to pages that violate a defined policy. If a page contains a certain number of forbidden keywords, it is considered undesirable. The problem is that the meanings of words depend on the context.

Different Page Subjects: Sites about breast cancer research or sexual harassment, or even the home page of someone named Sexton, could be blocked as forbidden pages of the Pornographic class.

Different Writer's Viewpoint: Articles on combating pornographic pages are harmless.

Different Expression Orientation: Pornographic pages also fall into many sub-classes, such as gambling, nudity, violence, drugs, alcohol and so on. For example, Itzin [4] classified pornography into three sets: the sexually explicit and violent; the sexually explicit and nonviolent, but subordinating and dehumanizing; and the sexually explicit, nonviolent and non-subordinating, based upon mutuality. Research consistently shows that harmful effects are associated with the first two, but that the third is usually harmless.

Example 2: Problems in Personal Information Filtering

Information filtering deals with the timely delivery of information that is relevant to the user. An information filtering system assists users by filtering the data stream and delivering the relevant information to them. The system selects the articles deemed to be interesting to the user and eliminates the rest. However, a filtering system might not be able to perfectly differentiate the articles that are actually relevant to the user from the ones that are not.
The proportion of irrelevant articles delivered to the user should be as low as possible, and so should the proportion of relevant articles eliminated.

Different Page Subjects: An information filtering agent assists the user with the task of finding interesting news articles, and the articles may belong to one particular domain or to many domains, such as academics, entertainment, migration, sports, etc.

Different Writer's Viewpoint: The user's task of finding interesting news articles may include only the articles supporting an event, or all the articles about the event.

Different Expression Orientation: For example, the user's task of finding news articles about a disaster may include articles about bailout, damage, etc., or only the articles about one aspect.

As shown in the examples above, the user may be interested in only a portion of the web filtering result, depending on the page subjects, writers' viewpoints and expression orientations. We can therefore divide web filtering tasks into three levels according to the user interest:

relativity-filter: the filtering result contains all the web pages with the same key phrases or key sentences. These web pages express the same subject, but may not be consistent in viewpoint or orientation. Typical applications of relativity-filtering include erotic web page filtering and hot topic tracing, which expect to collect all the web pages related to the topic, regardless of whether they approve of it or not.

similarity-filter: the filtering result contains all the web pages that hold the same subject, viewpoint and orientation as the user. Typical applications of similarity-filtering include filtering of web pages on racialism or splittism. Similarity-filtering is stricter than relativity-filtering, as not only key words or sentences but also orientation is taken into consideration.

homology-filter: the filtering result contains only the web pages that share a large number of sentences or paragraphs. The filtering results are almost identical to the user interest, usually because articles from an official or authoritative website are redistributed by other websites with little modification. An example of homology-filtering is counting which article is the most reprinted on Bulletin Board Systems.

We can define all the filtering results acquired by ML algorithms as relative results ($R_1$), and the filtering results to which the ML algorithms assign TRUE with probability close to 1 as homologous results ($R_k$). The results of similarity-filtering then satisfy $R_i \in \{R_i \mid R_k \subseteq R_i \subseteq R_1\}$. As illustrated on the left of Fig. 1, most filtering tasks can be described as applications of similarity-filtering with different degrees of similarity between the acquired web pages and the user interest.

[Figure 1: nested regions labelled Internet (U), General filtering result (R1), Adaptable filtering result (Ri), User interest of high similarity (Rk) and User interest (Uk).]

Fig. 1. Analysis and demonstration of filtering result estimation. The area outside the biggest circle is the filtering scope $U$; the smallest circle is the user interest $U_k$; the biggest circle $R_1$ is the filtering result of general ML algorithms (content relativity); the smaller circle $R_k$ is the filtering result for content homology. The middle circle $R_i$ is the biased filtering result according to user demand (content similarity).

C. Current Machine Learning Approaches and Their Failure

Web filtering by ML techniques is widely discussed in the literature. A few major ML algorithms are often chosen to construct web filters because of their simplicity, flexibility and robustness:

Decision Trees are an ML approach to the automatic induction of filtering trees from training data [5], [6], [7]. A decision tree is a graph of nodes connected by arcs, with each internal node corresponding to a feature and each arc to a possible value of that feature.
A decision tree is easily interpretable by humans and has low computational complexity, which makes it a simple and practical idea in the field of ML.

Rule Induction methods [8], [9] try to find a proper set of DNF rules for the filtering task such that the error rate on the training set is minimal. By using local optimization techniques, rule induction methods dynamically evaluate rules and revise the covering rule set.

K-Nearest Neighbor (KNN) [10], [11] selects the k most similar documents from the training set and uses the categories of these documents to determine the category of the document being classified. Documents are represented by vectors of words, and the similarity between two documents is measured using the Euclidean distance or another function of these vectors.

Naïve Bayes has been applied to web page filtering in [12], [13], [10], [14], [15]. It uses the joint probabilities of words and categories to estimate the probabilities of categories given a document. Documents with a probability above a certain threshold are considered relevant.

Artificial Neural Networks were applied by Lee et al. [2] to identify members of the forbidden class. The network learns patterns by modifying the weights among nodes based on learning examples.

Support Vector Machines (SVM) [16], [17], [18] are also a major statistical method. SVM finds the surface that separates the positives from the negatives with the widest possible margin among all the surfaces in the $|F|$-dimensional space. SVM copes well with large-scale training sets and does not need human or machine effort for parameter tuning.

As compared in [19], [20], [21], SVM achieved the best performance on different filtering corpora, with strong robustness and acceptable efficiency, whereas the Naïve Bayes precondition of ignoring feature dependence reduces its ability to analyze web content.
Artificial Neural Networks are computationally expensive, and the over-fitting of Decision Trees and Rule Induction during user interest description makes them unsatisfactory.

However, as shown above, web filters based on ML algorithms cannot achieve satisfactory results, because it is difficult to understand and express the true meaning of the user interest. Current ML algorithms acquire the user interest only by analyzing the arrangement patterns of words and expressions in the training examples. They neglect much information hidden in the training set, such as the ratio of positive to negative examples, the maximum distribution radius of the positives, the maximum distribution radius of the negatives, and so on. Such hidden information is quite valuable for expressing which portions of the web filtering result the user may be interested in. This paper therefore tries to give ML algorithms the ability to analyze this information. The improved ML algorithm is based on SVM because of its strong robustness and acceptable efficiency.

III. BIASED SUPPORT VECTOR MACHINE FOR WEB FILTERING

A. Biased Support Vector Machine Algorithm

To fit the user interest better, we must give the ML algorithm the ability to adjust. The approach proposed in this paper introduces a stimulant function that uses the training example distribution $n_+/n_-$ and a user-adaptable parameter $k$ to treat the classes of the pre-assigned pages unequally, so as to fit the user interest best. The approach is called the Biased Support Vector Machine; a detailed description and analysis are given in [22].

In the classical SVM, a penalty function $F = C \sum_i \xi_i$ is introduced as an additional capacity control function, where the non-negative variable $\xi_i$ measures the misclassification error of example $i$ and the coefficient $C$ sets the degree of tolerance for misclassification errors. Consequently, the width of the margin decreases as $C$ increases.
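To make the penalty term concrete, the following minimal sketch evaluates the slack variables and the classical penalty for a fixed hyperplane, and then applies the per-class weights $C + C_1$ and $C - C_2$ that Equation 2 of this section defines with $C_1 = C(k-1)n_-/n$ and $C_2 = C n_+/n$. It is an illustration under stated assumptions, not the authors' implementation: all function names and the toy data are hypothetical, $n = n_+ + n_-$ is assumed (as the last step of Equation 1 requires), and no optimization is performed; the sketch only scores a given $w$ and $b$.

```python
def slacks(w, b, xs, ys):
    """Slack xi_i = max(0, 1 - y_i * (w . x_i - b)) for each example."""
    out = []
    for x, y in zip(xs, ys):
        score = sum(wj * xj for wj, xj in zip(w, x)) - b
        out.append(max(0.0, 1.0 - y * score))
    return out


def classical_penalty(C, xi):
    """Classical SVM penalty F = C * sum(xi_i)."""
    return C * sum(xi)


def bsvm_coefficients(C, k, n_pos, n_neg):
    """Per-class slack weights (C + C1, C - C2).

    C1 = C*(k-1)*n_neg/n and C2 = C*n_pos/n, assuming n = n_pos + n_neg.
    """
    n = n_pos + n_neg
    c1 = C * (k - 1) * n_neg / n
    c2 = C * n_pos / n
    return C + c1, C - c2


def bsvm_penalty(C, k, xi, ys):
    """Biased penalty: positives weighted by C + C1, negatives by C - C2."""
    n_pos = sum(1 for y in ys if y == 1)
    n_neg = sum(1 for y in ys if y == -1)
    w_pos, w_neg = bsvm_coefficients(C, k, n_pos, n_neg)
    return sum((w_pos if y == 1 else w_neg) * s for s, y in zip(xi, ys))


# Hypothetical 2-D toy data and hyperplane, chosen only for illustration.
xs = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (0.5, 0.0)]
ys = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

xi = slacks(w, b, xs, ys)             # [0.0, 0.5, 0.0, 1.5]
print(classical_penalty(1.0, xi))     # 2.0
w_pos, w_neg = bsvm_coefficients(1.0, 5, 2, 2)
print(w_pos / w_neg)                  # 6.0, i.e. k + n_pos/n_neg
print(bsvm_penalty(1.0, 5, xi, ys))   # 2.25
```

With $k = 5$ and balanced classes, a slack on a positive example costs six times as much as the same slack on a negative one, which is the ratio $k + n_+/n_-$ given in Equation 1.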

BSVM introduces a stimulant function,

$$F = \frac{C}{n}\Big[(k-1)\,n_- \sum_{y_i=+1}\xi_i \;-\; n_+ \sum_{y_i=-1}\xi_i\Big],$$

as an extension of the penalty function. In BSVM, the positives are the examples with $y_i = +1$ and the negatives are the examples with $y_i = -1$; we therefore define $n_+ = |\{i : y_i = +1\}|$ and $n_- = |\{i : y_i = -1\}|$, with $n = n_+ + n_-$ the total number of training examples. The stimulant function uses both the training example distribution $n_+/n_-$ and a user-adaptable parameter $k$ to express the degree of user bias toward the different classes. Together with the effect of the penalty function, the bias is described by Equation 1. The width of the margin on the positive side decreases as $n_+/n_-$ or $k$ increases. BSVM can thus find a proper separating hyperplane whose filtering result $R_i$ lies between $R_1$ and $R_k$.

$$\mathrm{bias} = \frac{C + C(k-1)\,n_-/n}{C - C\,n_+/n} = \frac{1 + (k-1)\,n_-/n}{1 - n_+/n} = \frac{n_+/n + k\,n_-/n}{n_-/n} = k + \frac{n_+}{n_-} \tag{1}$$

BSVM is formulated as follows. The generalized optimal separating hyperplane is determined by the vector $w$ that minimizes the functional

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i}\xi_i + C_1\sum_{y_i=+1}\xi_i - C_2\sum_{y_i=-1}\xi_i, \quad \text{where } C_1 = C(k-1)\,n_-/n,\ C_2 = C\,n_+/n,\ k \ge 0 \tag{2}$$

subject to the constraints

$$y_i\,(w^T x_i - b) \ge 1 - \xi_i, \quad \text{where } \xi_i \ge 0,\ \forall i \tag{3}$$

Here $C_1$ and $C_2$ are the stimulant coefficients for classification errors, and $k \ge 0$ is an adaptable parameter. The solution to the optimization problem of Equation 2 under the constraints of Equation 3 is given by the saddle point of the Lagrangian

$$L(w,b,\xi,\alpha,\beta) = \frac{1}{2}\|w\|^2 + (C + C_1)\sum_{y_i=+1}\xi_i + (C - C_2)\sum_{y_i=-1}\xi_i - \sum_{i}\alpha_i\big(y_i[w^T x_i - b] - 1 + \xi_i\big) - \sum_{i}\beta_i\,\xi_i \tag{4}$$

where $\alpha$ and $\beta$ are the Lagrange multipliers. The Lagrangian has to be minimized with respect to $w$, $b$, $\xi$ and maximized with respect to $\alpha$, $\beta$. Hence the solution to the problem is given by

$$\min_{\alpha}\ Q(\alpha) = \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\,\alpha_j\,y_i\,y_j\,K(x_i, x_j) - \sum_{i=1}^{n}\alpha_i \tag{5}$$

with the constraints

$$\sum_{i=1}^{n} y_i\,\alpha_i = 0 \tag{6}$$

and

$$0 \le \alpha_i \le C + C_1 \ \text{ if } y_i = +1, \qquad 0 \le \alpha_i \le C - C_2 \ \text{ if } y_i = -1 \tag{7}$$

B. Experiments and Analysis

In our experiment, the forbidden pages belong to the category of adult content. We collected a total of 500 web pages by searching with the keyword porn. The corpus was reviewed by human editors and classified according to whether it contains adult content; it includes 100 non-pornographic web pages and 400 pornographic web pages. After taking 1/5 of each class as training examples, we measured the training accuracy for SVM and BSVM, shown in Table II.

TABLE II
TRAINING ACCURACY OF SVM AND BIASED SVM (k=5)

Algorithm   Web Page           Correct       Incorrect    Total
SVM         Pornographic       378 (94.5%)   22 (5.5%)    400
            Non-pornographic   69 (69.0%)    31 (31.0%)   100
            Total              447 (89.4%)   53 (10.6%)   500
BSVM        Pornographic       396 (99.0%)   4 (1.0%)     400
            Non-pornographic   78 (78.0%)    22 (22.0%)   100
            Total              474 (94.8%)   26 (5.2%)    500

To show the impact of the adaptable parameters on BSVM, we experiment on the benchmark collections of Chinese web pages (footnote 1) prepared by FuDan University. The collections include 9804 training examples and 9833 evaluation documents, which consist of a set of Chinese newswire stories classified under 20 categories. In this paper, we experiment on a document set made of two related categories (history and politics) of the benchmark. The document set contains 2800 web pages in total (2000 pages about politics as positives, 800 pages about history as negatives, and 1/10 of each as training examples). We compute the filtering precision for positive examples under different values of $C$, and show the influence of $d = n_+/n_-$ and $k$ in Fig. 2. The results show that the filtering precision for positive examples increases as $n_+/n_-$ and $k$ increase.

Fig. 2. BSVM filtering efficiency for different $k$ and $n_+/n_-$. The left figure shows the influence of the parameter $d = n_+/n_-$ on the filtering precision for positive examples ($k = 1$). The right figure shows the influence of the parameter $k$ on the filtering precision for positive examples ($n_+/n_- = 1$).

Footnote 1: The benchmark and a detailed description (in Chinese) are available at 16\&type=

IV. CONCLUSION AND FUTURE WORK

In this paper, we study the different scopes of the filtering result that arise from different filtering tasks and user interests. We find that the web filtering result can be divided into three sets, the relative page set ($R_1$), the similar page set ($R_i$) and the homologous page set ($R_k$), with the relationship $R_k \subseteq R_i \subseteq R_1$. To adjust the web filtering result to fit the user interest better, a Biased Support Vector Machine (BSVM) algorithm is introduced, which uses a stimulant function, the training example distribution $n_+/n_-$ and a user-adaptable parameter $k$ to treat the classes of the pre-assigned pages unequally. Experiments show that BSVM can greatly improve web filtering performance: in Table II, overall accuracy rises from 89.4% with SVM to 94.8% with BSVM. However, the problems of describing user bias and of self-adapting parameters remain open, and we leave them as future work.

REFERENCES

[1] N. J. Belkin and W. B. Croft, "Information Filtering and Information Retrieval: Two Sides of the Same Coin?" Communications of the ACM, vol. 35, no. 12, pp , Dec
[2] P. Y. Lee, S. C. Hui, and A. C. M. Fong, "Neural networks for web content filtering," IEEE Intelligent Systems, vol. 17, no. 5, pp ,
[3] N. Project, "Report on currently available COTS filtering tools," Technical report,
[4] O. B. Longe and F. A. Longe, "The Nigerian web content: Combating pornographic using content filters," Journal of Information Technology Impact, vol. 5, no. 2, pp ,
[5] J. R. Quinlan, "Discovering rules by induction from large collections of examples," Expert Systems in the Micro-Electronic Age, pp ,
[6] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp ,
[7] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.,
[8] F. D. Chid Apte and S. Weiss, "Text mining with decision rules and decision trees," in Proceedings of the Conference on Automated Learning and Discovery, CMU, June
[9] P. Clark and T. Niblett, "The CN2 induction algorithm," Mach. Learn., vol. 3, no. 4, pp ,
[10] M. Iwayama and T. Tokunaga, "Cluster-based text categorization: a comparison of category search strategies," in Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval, E. A. Fox, P. Ingwersen, and R. Fidel, Eds. Seattle, US: ACM Press, New York, US, 1995, pp
[11] B. Masand, G. Linoff, and D. Waltz, "Classifying news stories using memory based reasoning," in SIGIR 92: Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM Press, 1992, pp
[12] S. Chakrabarti, B. E. Dom, and P. Indyk, "Enhanced hypertext categorization using hyperlinks," in Proceedings of SIGMOD-98, ACM International Conference on Management of Data, L. M. Haas and A. Tiwary, Eds. Seattle, US: ACM Press, New York, US, 1998, pp
[13] K. M. A. Chai, H. L. Chieu, and H. T. Ng, "Bayesian online classifiers for text classification and filtering," in SIGIR 02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM Press, 2002, pp
[14] A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," in AAAI-98 Workshop on Learning for Text Categorization,
[15] A. McCallum, K. Nigam, J. Rennie, and K. Seymore, "A machine learning approach to building domain-specific search engines," in The Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99),
[16] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in Proceedings of the European Conference on Machine Learning. Berlin, Germany: Springer, 1998, pp
[17] T. Joachims, N. Cristianini, and J. Shawe-Taylor, "Composite kernels for hypertext categorisation," in Proceedings of ICML-01, 18th International Conference on Machine Learning, C. Brodley and A. Danyluk, Eds. Williams College, US: Morgan Kaufmann Publishers, San Francisco, US, 2001, pp
[18] V. Vapnik, Statistical Learning Theory. New York: John Wiley & Sons,
[19] A. Du and B. Fang, "Comparison of machine learning algorithms in Chinese web filtering," in Proceedings of the Third International Conference on Machine Learning and Cybernetics. Shanghai, China: IEEE Press, 2004, pp
[20] F. Sebastiani, "Machine learning in automated text categorization," ACM Comput. Surv., vol. 34, no. 1, pp. 1-47,
[21] Y. Yang and X. Liu, "A re-examination of text categorization methods," in SIGIR 99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM Press, 1999, pp
[22] A. Du and B. Fang, "A biased support vector machine approach to web filtering," in ICAPR 05: Proceedings of the Third International Conference on Advances in Pattern Recognition, C. A. P. P. Sameer Singh, Maneesha Singh, Ed. Springer Verlag, Heidelberg, D-69121, Germany, 2005, pp


More information

A fast multi-class SVM learning method for huge databases

A fast multi-class SVM learning method for huge databases www.ijcsi.org 544 A fast multi-class SVM learning method for huge databases Djeffal Abdelhamid 1, Babahenini Mohamed Chaouki 2 and Taleb-Ahmed Abdelmalik 3 1,2 Computer science department, LESIA Laboratory,

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM

AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM ABSTRACT Luis Alexandre Rodrigues and Nizam Omar Department of Electrical Engineering, Mackenzie Presbiterian University, Brazil, São Paulo 71251911@mackenzie.br,nizam.omar@mackenzie.br

More information

WE DEFINE spam as an e-mail message that is unwanted basically

WE DEFINE spam as an e-mail message that is unwanted basically 1048 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999 Support Vector Machines for Spam Categorization Harris Drucker, Senior Member, IEEE, Donghui Wu, Student Member, IEEE, and Vladimir

More information

Spam detection with data mining method:

Spam detection with data mining method: Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors Classification k-nearest neighbors Data Mining Dr. Engin YILDIZTEPE Reference Books Han, J., Kamber, M., Pei, J., (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann

More information

Investigation of Support Vector Machines for Email Classification

Investigation of Support Vector Machines for Email Classification Investigation of Support Vector Machines for Email Classification by Andrew Farrugia Thesis Submitted by Andrew Farrugia in partial fulfillment of the Requirements for the Degree of Bachelor of Software

More information

IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION

IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION http:// IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION Harinder Kaur 1, Raveen Bajwa 2 1 PG Student., CSE., Baba Banda Singh Bahadur Engg. College, Fatehgarh Sahib, (India) 2 Asstt. Prof.,

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2

A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2 UDC 004.75 A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2 I. Mashechkin, M. Petrovskiy, A. Rozinkin, S. Gerasimov Computer Science Department, Lomonosov Moscow State University,

More information

Modeling Suspicious Email Detection Using Enhanced Feature Selection

Modeling Suspicious Email Detection Using Enhanced Feature Selection Modeling Suspicious Email Detection Using Enhanced Feature Selection Sarwat Nizamani, Nasrullah Memon, Uffe Kock Wiil, and Panagiotis Karampelas Abstract The paper presents a suspicious email detection

More information

An Imbalanced Spam Mail Filtering Method

An Imbalanced Spam Mail Filtering Method , pp. 119-126 http://dx.doi.org/10.14257/ijmue.2015.10.3.12 An Imbalanced Spam Mail Filtering Method Zhiqiang Ma, Rui Yan, Donghong Yuan and Limin Liu (College of Information Engineering, Inner Mongolia

More information

Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

More information

Customer Classification And Prediction Based On Data Mining Technique

Customer Classification And Prediction Based On Data Mining Technique Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

More information

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next

More information

Comparison of machine learning methods for intelligent tutoring systems

Comparison of machine learning methods for intelligent tutoring systems Comparison of machine learning methods for intelligent tutoring systems Wilhelmiina Hämäläinen 1 and Mikko Vinni 1 Department of Computer Science, University of Joensuu, P.O. Box 111, FI-80101 Joensuu

More information

A Proposed Algorithm for Spam Filtering Emails by Hash Table Approach

A Proposed Algorithm for Spam Filtering Emails by Hash Table Approach International Research Journal of Applied and Basic Sciences 2013 Available online at www.irjabs.com ISSN 2251-838X / Vol, 4 (9): 2436-2441 Science Explorer Publications A Proposed Algorithm for Spam Filtering

More information

Filtering Noisy Contents in Online Social Network by using Rule Based Filtering System

Filtering Noisy Contents in Online Social Network by using Rule Based Filtering System Filtering Noisy Contents in Online Social Network by using Rule Based Filtering System Bala Kumari P 1, Bercelin Rose Mary W 2 and Devi Mareeswari M 3 1, 2, 3 M.TECH / IT, Dr.Sivanthi Aditanar College

More information

Support Vector Machine (SVM)

Support Vector Machine (SVM) Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Neural Networks for Sentiment Detection in Financial Text

Neural Networks for Sentiment Detection in Financial Text Neural Networks for Sentiment Detection in Financial Text Caslav Bozic* and Detlef Seese* With a rise of algorithmic trading volume in recent years, the need for automatic analysis of financial news emerged.

More information

Theme-based Retrieval of Web News

Theme-based Retrieval of Web News Theme-based Retrieval of Web Nuno Maria, Mário J. Silva DI/FCUL Faculdade de Ciências Universidade de Lisboa Campo Grande, Lisboa Portugal {nmsm, mjs}@di.fc.ul.pt ABSTRACT We introduce an information system

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

More information

An Efficient Two-phase Spam Filtering Method Based on E-mails Categorization

An Efficient Two-phase Spam Filtering Method Based on E-mails Categorization International Journal of Network Security, Vol.9, No., PP.34 43, July 29 34 An Efficient Two-phase Spam Filtering Method Based on E-mails Categorization Jyh-Jian Sheu Department of Information Management,

More information

A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS

A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS Charanma.P 1, P. Ganesh Kumar 2, 1 PG Scholar, 2 Assistant Professor,Department of Information Technology, Anna University

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type. Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada

More information

A Study on the Comparison of Electricity Forecasting Models: Korea and China

A Study on the Comparison of Electricity Forecasting Models: Korea and China Communications for Statistical Applications and Methods 2015, Vol. 22, No. 6, 675 683 DOI: http://dx.doi.org/10.5351/csam.2015.22.6.675 Print ISSN 2287-7843 / Online ISSN 2383-4757 A Study on the Comparison

More information

AUTOMATIC CLASSIFICATION OF QUESTIONS INTO BLOOM'S COGNITIVE LEVELS USING SUPPORT VECTOR MACHINES

AUTOMATIC CLASSIFICATION OF QUESTIONS INTO BLOOM'S COGNITIVE LEVELS USING SUPPORT VECTOR MACHINES AUTOMATIC CLASSIFICATION OF QUESTIONS INTO BLOOM'S COGNITIVE LEVELS USING SUPPORT VECTOR MACHINES Anwar Ali Yahya *, Addin Osman * * Faculty of Computer Science and Information Systems, Najran University,

More information

A Game Theoretical Framework for Adversarial Learning

A Game Theoretical Framework for Adversarial Learning A Game Theoretical Framework for Adversarial Learning Murat Kantarcioglu University of Texas at Dallas Richardson, TX 75083, USA muratk@utdallas Chris Clifton Purdue University West Lafayette, IN 47907,

More information

Financial Trading System using Combination of Textual and Numerical Data

Financial Trading System using Combination of Textual and Numerical Data Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,

More information

Blog Post Extraction Using Title Finding

Blog Post Extraction Using Title Finding Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School

More information

Lasso-based Spam Filtering with Chinese Emails

Lasso-based Spam Filtering with Chinese Emails Journal of Computational Information Systems 8: 8 (2012) 3315 3322 Available at http://www.jofcis.com Lasso-based Spam Filtering with Chinese Emails Zunxiong LIU 1, Xianlong ZHANG 1,, Shujuan ZHENG 2 1

More information

Operations Research and Knowledge Modeling in Data Mining

Operations Research and Knowledge Modeling in Data Mining Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 305-8573 koda@sk.tsukuba.ac.jp

More information

Web Mining as a Tool for Understanding Online Learning

Web Mining as a Tool for Understanding Online Learning Web Mining as a Tool for Understanding Online Learning Jiye Ai University of Missouri Columbia Columbia, MO USA jadb3@mizzou.edu James Laffey University of Missouri Columbia Columbia, MO USA LaffeyJ@missouri.edu

More information

Inner Classification of Clusters for Online News

Inner Classification of Clusters for Online News Inner Classification of Clusters for Online News Harmandeep Kaur 1, Sheenam Malhotra 2 1 (Computer Science and Engineering Department, Shri Guru Granth Sahib World University Fatehgarh Sahib) 2 (Assistant

More information

Prediction of Heart Disease Using Naïve Bayes Algorithm

Prediction of Heart Disease Using Naïve Bayes Algorithm Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,

More information

Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

More information

Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University)

Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University) 260 IJCSNS International Journal of Computer Science and Network Security, VOL.11 No.6, June 2011 Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case

More information

Class-specific Sparse Coding for Learning of Object Representations

Class-specific Sparse Coding for Learning of Object Representations Class-specific Sparse Coding for Learning of Object Representations Stephan Hasler, Heiko Wersing, and Edgar Körner Honda Research Institute Europe GmbH Carl-Legien-Str. 30, 63073 Offenbach am Main, Germany

More information

SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING

SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING I J I T E ISSN: 2229-7367 3(1-2), 2012, pp. 233-237 SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING K. SARULADHA 1 AND L. SASIREKA 2 1 Assistant Professor, Department of Computer Science and

More information

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Data Mining: A Preprocessing Engine

Data Mining: A Preprocessing Engine Journal of Computer Science 2 (9): 735-739, 2006 ISSN 1549-3636 2005 Science Publications Data Mining: A Preprocessing Engine Luai Al Shalabi, Zyad Shaaban and Basel Kasasbeh Applied Science University,

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

Method of Fault Detection in Cloud Computing Systems

Method of Fault Detection in Cloud Computing Systems , pp.205-212 http://dx.doi.org/10.14257/ijgdc.2014.7.3.21 Method of Fault Detection in Cloud Computing Systems Ying Jiang, Jie Huang, Jiaman Ding and Yingli Liu Yunnan Key Lab of Computer Technology Application,

More information

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences

More information

Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Analytics for Business Intelligence and Decision Support Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing

More information

Support Vector Machine. Tutorial. (and Statistical Learning Theory)

Support Vector Machine. Tutorial. (and Statistical Learning Theory) Support Vector Machine (and Statistical Learning Theory) Tutorial Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. jasonw@nec-labs.com 1 Support Vector Machines: history SVMs introduced

More information

Learning with Local and Global Consistency

Learning with Local and Global Consistency Learning with Local and Global Consistency Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 7276 Tuebingen, Germany

More information

Learning with Local and Global Consistency

Learning with Local and Global Consistency Learning with Local and Global Consistency Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 7276 Tuebingen, Germany

More information

Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

More information

Email Classification Using Data Reduction Method

Email Classification Using Data Reduction Method Email Classification Using Data Reduction Method Rafiqul Islam and Yang Xiang, member IEEE School of Information Technology Deakin University, Burwood 3125, Victoria, Australia Abstract Classifying user

More information

Projektgruppe. Categorization of text documents via classification

Projektgruppe. Categorization of text documents via classification Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction

More information

IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT

IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT M.SHESHIKALA Assistant Professor, SREC Engineering College,Warangal Email: marthakala08@gmail.com, Abstract- Unethical

More information

Facilitating Business Process Discovery using Email Analysis

Facilitating Business Process Discovery using Email Analysis Facilitating Business Process Discovery using Email Analysis Matin Mavaddat Matin.Mavaddat@live.uwe.ac.uk Stewart Green Stewart.Green Ian Beeson Ian.Beeson Jin Sa Jin.Sa Abstract Extracting business process

More information

Dissecting the Learning Behaviors in Hacker Forums

Dissecting the Learning Behaviors in Hacker Forums Dissecting the Learning Behaviors in Hacker Forums Alex Tsang Xiong Zhang Wei Thoo Yue Department of Information Systems, City University of Hong Kong, Hong Kong inuki.zx@gmail.com, xionzhang3@student.cityu.edu.hk,

More information

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,

More information

A Health Degree Evaluation Algorithm for Equipment Based on Fuzzy Sets and the Improved SVM

A Health Degree Evaluation Algorithm for Equipment Based on Fuzzy Sets and the Improved SVM Journal of Computational Information Systems 10: 17 (2014) 7629 7635 Available at http://www.jofcis.com A Health Degree Evaluation Algorithm for Equipment Based on Fuzzy Sets and the Improved SVM Tian

More information

How To Solve The Kd Cup 2010 Challenge

How To Solve The Kd Cup 2010 Challenge A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China catch0327@yahoo.com yanxing@gdut.edu.cn

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

Flexible Neural Trees Ensemble for Stock Index Modeling

Flexible Neural Trees Ensemble for Stock Index Modeling Flexible Neural Trees Ensemble for Stock Index Modeling Yuehui Chen 1, Ju Yang 1, Bo Yang 1 and Ajith Abraham 2 1 School of Information Science and Engineering Jinan University, Jinan 250022, P.R.China

More information

Business Lead Generation for Online Real Estate Services: A Case Study

Business Lead Generation for Online Real Estate Services: A Case Study Business Lead Generation for Online Real Estate Services: A Case Study Md. Abdur Rahman, Xinghui Zhao, Maria Gabriella Mosquera, Qigang Gao and Vlado Keselj Faculty Of Computer Science Dalhousie University

More information

Random forest algorithm in big data environment

Random forest algorithm in big data environment Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

More information

Naive Bayes Spam Filtering Using Word-Position-Based Attributes

Naive Bayes Spam Filtering Using Word-Position-Based Attributes Naive Bayes Spam Filtering Using Word-Position-Based Attributes Johan Hovold Department of Computer Science Lund University Box 118, 221 00 Lund, Sweden johan.hovold.363@student.lu.se Abstract This paper

More information

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE Kasra Madadipouya 1 1 Department of Computing and Science, Asia Pacific University of Technology & Innovation ABSTRACT Today, enormous amount of data

More information

Bayesian Spam Filtering

Bayesian Spam Filtering Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary amaobied@ucalgary.ca http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

Model Trees for Classification of Hybrid Data Types

Model Trees for Classification of Hybrid Data Types Model Trees for Classification of Hybrid Data Types Hsing-Kuo Pao, Shou-Chih Chang, and Yuh-Jye Lee Dept. of Computer Science & Information Engineering, National Taiwan University of Science & Technology,

More information

Tracking and Recognition in Sports Videos

Tracking and Recognition in Sports Videos Tracking and Recognition in Sports Videos Mustafa Teke a, Masoud Sattari b a Graduate School of Informatics, Middle East Technical University, Ankara, Turkey mustafa.teke@gmail.com b Department of Computer

More information

To improve the problems mentioned above, Chen et al. [2-5] proposed and employed a novel type of approach, i.e., PA, to prevent fraud.

To improve the problems mentioned above, Chen et al. [2-5] proposed and employed a novel type of approach, i.e., PA, to prevent fraud. Proceedings of the 5th WSEAS Int. Conference on Information Security and Privacy, Venice, Italy, November 20-22, 2006 46 Back Propagation Networks for Credit Card Fraud Prediction Using Stratified Personalized

More information

Addressing the Class Imbalance Problem in Medical Datasets

Addressing the Class Imbalance Problem in Medical Datasets Addressing the Class Imbalance Problem in Medical Datasets M. Mostafizur Rahman and D. N. Davis the size of the training set is significantly increased [5]. If the time taken to resample is not considered,

More information

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET DATA MINING TECHNIQUES AND STOCK MARKET Mr. Rahul Thakkar, Lecturer and HOD, Naran Lala College of Professional & Applied Sciences, Navsari ABSTRACT Without trading in a stock market we can t understand

More information

Whitepaper: Understanding Web Filtering Technologies ABSTRACT

Whitepaper: Understanding Web Filtering Technologies ABSTRACT Whitepaper: Understanding Web Filtering Technologies ABSTRACT The Internet is now a huge resource of information and plays an increasingly important role in business and education. However, without adequate

More information

Term extraction for user profiling: evaluation by the user

Term extraction for user profiling: evaluation by the user Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,

More information

Application of Event Based Decision Tree and Ensemble of Data Driven Methods for Maintenance Action Recommendation

Application of Event Based Decision Tree and Ensemble of Data Driven Methods for Maintenance Action Recommendation Application of Event Based Decision Tree and Ensemble of Data Driven Methods for Maintenance Action Recommendation James K. Kimotho, Christoph Sondermann-Woelke, Tobias Meyer, and Walter Sextro Department

More information