Automatic Text Categorization using NTC
Taeho Jo
School of Computer and Information Engineering, Inha University

Abstract

In this research, we propose NTC (Neural Text Categorizer) as an approach to text categorization. Traditional approaches to text categorization require encoding documents into numerical vectors, which leads to two main problems: huge dimensionality and sparse distribution in each numerical vector. In this research, documents are encoded into string vectors instead of numerical vectors, and a new neural network called NTC, which receives a string vector as its input vector, is used for text categorization. The goal of this research is to avoid the two main problems by encoding documents into structured data other than numerical vectors. We validate the performance of NTC by comparing it with other machine learning algorithms on the standard test bed, Reuters-21578.

1 Introduction

Text categorization refers to the process of assigning one or more of the predefined categories to each document. Two preliminary tasks precede text categorization: the predefinition of categories and the preparation of sample labeled documents. After the preliminary tasks, a text categorization system learns from the sample labeled documents in order to build its classification capacity. Using this capacity, the system classifies unseen documents. Therefore, the system performs text categorization in three steps: the preliminary tasks, the learning, and the classification.

Using one of the traditional approaches to text categorization requires encoding documents into numerical vectors. By indexing a corpus which consists of sample labeled documents, words called feature candidates are generated. Among them, only some are selected as features, and various criteria for doing so have already been defined [7].
A value is assigned to each feature when representing a document as a numerical vector whose attributes are the selected features. Note that encoding documents in this way causes two main problems: huge dimensionality and sparse distribution.

The first problem caused by encoding documents into numerical vectors is huge dimensionality. It means that many features are required to avoid representing a document as a zero vector. The number of features is usually several hundred according to the previous literature [13][15][1]. Although the number of features is small compared with the number of feature candidates, it is still large as an input size. This problem costs much time and many system resources for processing each document, and it demands a number of sample documents proportional to the number of features.

The second problem caused by encoding documents in this way is sparse distribution. It refers to the phenomenon in which zero values occupy almost all positions of each numerical vector. Under this problem, complete zero vectors may exist among the representations of sample labeled documents, regardless of their categories. Unseen documents may also be encoded into zero vectors. The problem makes the learning and the classification of approaches to text categorization far less robust.

The fundamental idea of this research is to encode documents into string vectors instead of numerical vectors. A string, as an element of a string vector, refers to a sequence of characters of a language that serves as a basic semantic unit. A string vector refers to a finite ordered set of strings. Note that string vectors with identical elements but different orders are treated as different ones. The intention of encoding documents in this way is to avoid the two main problems which are inherent in encoding documents into numerical vectors.
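To make the contrast between the two encodings concrete, the following minimal Python sketch encodes one short document both ways. It is only an illustration under stated assumptions: the toy feature list, the helper names `to_numerical_vector` and `to_string_vector`, and the choice of "the d most frequent words, in descending order of frequency" as the features of the string vector are hypothetical and are not taken from the paper.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def to_numerical_vector(text, features):
    """Bag-of-words encoding: one frequency count per selected feature word."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in features]


def to_string_vector(text, d, pad=""):
    """Hypothetical string-vector encoding (an assumption, not the paper's):
    the d most frequent words, in descending frequency order, with ties
    broken alphabetically and short documents padded with `pad`."""
    counts = Counter(tokenize(text))
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return (ranked + [pad] * d)[:d]


# Toy feature set standing in for the several hundred features of a real corpus.
features = ["bank", "crude", "grain", "interest", "oil",
            "price", "rate", "ship", "trade", "wheat"]
doc = "oil price rises crude oil exports fall"

numerical = to_numerical_vector(doc, features)
string_vec = to_string_vector(doc, d=3)

# Fraction of zero entries in the numerical vector (the sparse distribution).
sparsity = numerical.count(0) / len(numerical)

print(numerical)
print(sparsity)
print(string_vec)
```

Even with only ten features, most entries of the numerical vector are zero, whereas every position of the string vector carries a word of the document.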
In order to solve the two main problems, a neural network model called NTC (Neural Text Categorizer) is proposed as the approach to text categorization. The neural network model receives string vectors as its input vectors and initializes its weights with the frequencies of elements within each class. The weights are optimized during the learning process of the neural network model to minimize its classification error on training examples. When an unseen string vector is given as the representation of an unseen document, the value of each output node is computed by summing the weights connected to it, and the unseen document is classified into the category corresponding to the output node with the maximum value. Since the proposed neural network model uses string vectors as its input vectors, it is free from the two main problems.

/09/$ IEEE

The proposed neural network model is expected to improve the performance of text categorization by avoiding the two main problems. In order to use the proposed neural network model, documents are encoded into string vectors, as mentioned above. No particular value dominates a string vector, so no problem similar to sparse distribution arises in string vectors. A much smaller dimension than that of numerical vectors is sufficient for robust learning and classification. Therefore, the two main problems do not arise when using the proposed neural network model for text categorization.

This paper consists of six sections, including this section. In section 2, we explore the relevant literature. In sections 3 and 4, we describe the definition of string vectors and the learning algorithm and classification of the proposed neural network model, respectively. In section 5, we empirically validate the performance of the proposed neural network model by comparing it with other machine learning algorithms on the test bed, Reuters-21578. In section 6, we mention the significance of this research and further research as the conclusion of this article.

2 Previous Works

This section is concerned with previous works relevant to this research.
Although many approaches to text categorization have already been proposed, we mention four representative and popular approaches: KNN (K Nearest Neighbor), NB (Naive Bayes), SVM (Support Vector Machine), and Neural Networks. Using any of them for text categorization requires encoding documents into numerical vectors, so the two main problems are caused. The string kernel was proposed for using the SVM for text categorization as a solution to the two main problems, but it failed to improve the performance. In this section, we explore the previous works on previous approaches to text categorization and the previous solution to the two main problems.

The KNN may be considered as a typical and popular approach to text categorization [1]. The KNN was initially created by Cover and Hart in 1967 as a generic classification algorithm [2]. It was initially applied to text categorization by Massand et al in 1992 [3]. KNN was recommended by Yang in 1999 [4] and by Sebastiani in 2002 [1] as a practical approach to text categorization. Therefore, the KNN has been adopted as the base approach in other literature [1].

The Naive Bayes may be considered as another approach to text categorization. It was initially created by Kononenko in 1989, based on Bayes rule [5]. Its application to text categorization was mentioned in the textbook by Mitchell in 1997 [6]. Assuming that the Naive Bayes is a popular approach, Mladenic and Grobelnik proposed and evaluated feature selection methods in 1999 [7]. The Naive Bayes has been compared with other subsequent approaches in text categorization [9][10].

Recently, the SVM was recommended as a practical approach to text categorization [9][10]. It was initially introduced by Hearst in her magazine article in 1998 [8]. In the same year, it was applied to text categorization by Joachims [9]. It was adopted as an approach to spam mail filtering, a practical instance of text categorization, in 1999 by Drucker et al [10].
Furthermore, the SVM is popularly used not only for text categorization tasks but also for other pattern classification tasks [11].

Neural Networks may be considered as an approach to text categorization, and among them, the MLP (Multiple Layer Perceptron) with back propagation is the most popular model. The neural network model was initially created in 1986 by McClelland and Rumelhart, and it was intended for performing tasks of pattern classification and nonlinear regression as a supervised learning algorithm [12]. It was initially applied to text categorization in 1995 by Wiener [13]. Its performance was validated by comparing it with KNN in his master thesis on the test bed, Reuter [13]. Although the neural network classifies documents more accurately, it takes much time to learn the training documents.

The string kernel was proposed as a solution to the two main problems which are inherent in encoding documents into numerical vectors. It was initially proposed by Lodhi et al in 2002 as a kernel function of the SVM [15]. The string kernel receives two raw texts as its inputs and computes the syntactical similarity between them. Since documents don't need to be encoded into numerical vectors, the two main problems are naturally avoided. However, it took much time to compute the similarity and failed to improve the performance of text categorization.

This research has three advantages, as mentioned in section 1. The first advantage is that it avoids the two main problems by encoding documents into structured data other than numerical vectors. The second advantage is that string vectors are more transparent than numerical vectors with respect to the content of the full text; it is easier to guess the content of a document from its string vector than from its numerical vector. The third advantage, derived from the second, is that it is potentially easier to trace why each document is classified as it is. Therefore, this research proposes a neural network which receives string vectors as its input data because of the three advantages.

3 String Vector

This section is concerned with the general aspects of string vectors. A string vector refers to an ordered set of strings. In the context of a natural language, a string indicates a word or a vocabulary item. In other words, words or vocabulary items are given as elements of string vectors. Therefore, this section describes in detail string vectors as the alternative representation of documents to numerical vectors.

A string vector is defined as a set of words which is ordered and has a fixed size. It is denoted by [s_1, s_2, ..., s_d], where s_i denotes a string, and there are d elements. When representing documents as string vectors, their sizes are fixed at d, which is called the dimension of string vectors. Since the elements are ordered in each string vector, two string vectors with identical elements but different orders are treated as different ones. The reason is that each position of an element corresponds to its own feature.

Table 1 illustrates differences between string vectors and numerical vectors. The first difference is that numerical values are given as elements in numerical vectors, while strings are given as elements in string vectors.
The second difference is that the similarity measure between two numerical vectors is the cosine similarity or the Euclidean distance, while that between two string vectors is the average semantic similarity (see footnote 1). The third difference between the two types of structured data is that the features for encoding documents into numerical vectors are words, while those for encoding them into string vectors are statistical, linguistic, and posting properties of words. Therefore, a string vector is a vector in which the numerical values of a numerical vector are replaced by strings.

Footnote 1: The average semantic similarity is not described here, since it is not involved in executing the proposed neural network.

Table 1. The Comparison of Numerical Vectors and String Vectors

                     Numerical Vector                     String Vector
Element              Numerical Value                      String
Similarity Measure   Inner Products, Euclidean Distance   Semantic Similarity
Attributes           Words                                Property of Words

The differences between string vectors and bags of words are illustrated in table 2. Both types of structured data have strings as their elements. As the similarity measure, the cardinality of the intersection of two bags of words is used, while the average semantic similarity is used for string vectors. A bag of words is defined as an unordered infinite set of words, while a string vector is defined as an ordered finite set of words. Although a bag of words and a string vector look similar to each other, they should be distinguished from each other, based on table 2.

Table 2. The Comparison of Bag of Words and String Vectors

                     Bag of Words             String Vector
Element              String                   String
Similarity Measure   Number of Shared Words   Semantic Similarity
Set                  Unordered Infinite Set   Ordered Finite Set

There are three advantages in representing documents as string vectors. The first advantage is that the two main problems, huge dimensionality and sparse distribution, are avoided completely.
The second advantage is that string vectors are more transparent representations of documents than numerical vectors; it is easier to guess the content of documents from their representations alone. The third advantage is that it is potentially easier to trace why documents are classified as they are. However, in order to use string vectors more freely, it is necessary to establish their mathematical foundations.

4 Neural Text Categorizer

This section is concerned with the architecture, learning algorithm, and classification of the proposed neural network called NTC. Documents are encoded into string vectors for using the NTC. The neural network has a special layer called the learning layer. With respect to the architecture of the NTC, the input nodes are directly connected to the output nodes, and the weights between the input and output nodes are decided by the learning nodes. In this section, we describe the architecture and the learning process of the NTC.

The architecture of the NTC is illustrated in figure 1. The input layer receives an input vector given as a string vector, and the number of nodes in the layer is equal to the dimension of the string vector. The output layer generates categorical scores which indicate the likelihood that the given input vector belongs to each category, and the number of nodes in the layer is equal to the number of the predefined categories or classes. The learning layer decides the weights between the input and output layers differently depending on the given input vector, and the number of nodes in the layer is also equal to the number of categories or classes. Therefore, with respect to its architecture, the NTC has three layers, as shown in figure 1.

Figure 1. Overall Architecture of the NTC

In the context of the learning process, the first step of the NTC is to initialize the weights between the input and output layers. Let's assume that the NTC is applied to text categorization without decomposing the task into binary classification tasks. The set of training string vectors is partitioned category by category. Each learning node has its own table, which consists of words and their weights. The frequencies of elements of string vectors within each category are assigned in the table as the initial weights. Therefore, the initial step is to set up the tables in the learning nodes.

The learning process of the NTC refers to the process of optimizing the weights in the tables of the learning nodes. Each training example is classified by summing the initial weights and selecting the category corresponding to the maximal sum. If the training example is classified correctly, the weights are not updated. Otherwise, the weights for the target category are incremented and those for the classified category are decremented. The optimized weights are generated as the output of this process.

In the NTC, each example is classified by summing the optimized weights, whether it is a training or an unseen example. Each output node generates the summation of the weights connected to it from the input nodes as its categorical score. The weights are decided by referring to the table which is owned by the corresponding learning node. The category corresponding to the output node which generates the maximum categorical score is decided as the category of the given example. Therefore, the output of this process is one of the predefined categories, assuming that the NTC is applied to text categorization without the decomposition.

One property characterizes the NTC: the learning layer exists inherently in the NTC, and each learning node has its own table, consisting of words and their weights, as the reference for deciding weights. Each input node receives a string as an element of a string vector, and the learning nodes decide the weights connected from the input node by referring to their own tables. In the current version of the NTC, if a word is not registered in a table, its weight is set to zero. This issue will be considered in further research.

5 Empirical Results

This section is concerned with the empirical validation of the performance of NTC. We use the collection of news articles called Reuters-21578 as the test bed. This set of experiments involves five approaches: KNN, NB, SVM, NNBP, and NTC. The F1 measure is used for evaluating the performance of each approach to text categorization. In this section, the test bed and configurations of the approaches involved in the set of experiments are described, and the results of the set of experiments are presented and discussed.
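Before the experiments are described, the learning and classification procedure of Section 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the class name `NTCSketch`, the additive update by a fixed learning rate, the early stop when an epoch produces no errors, and the toy data are all hypothetical. The paper states only that weights are incremented toward the target category and decremented toward the classified category. The sketch keeps one table of word weights per category, lets unregistered words contribute zero, and ignores the position of a word within the string vector for simplicity.

```python
from collections import defaultdict


class NTCSketch:
    """Illustrative sketch of the NTC idea; names and update rule are assumed."""

    def __init__(self, categories, learning_rate=0.3):
        self.categories = list(categories)
        self.learning_rate = learning_rate
        # One table (word -> weight) per learning node, i.e. per category.
        self.tables = {c: defaultdict(float) for c in self.categories}

    def initialize(self, examples):
        """Initial weights: frequency of each word within each category."""
        for string_vector, category in examples:
            for word in string_vector:
                self.tables[category][word] += 1.0

    def scores(self, string_vector):
        """Categorical score: sum of the weights of the input words.
        A word that is not registered in a table contributes zero."""
        return {c: sum(self.tables[c].get(w, 0.0) for w in string_vector)
                for c in self.categories}

    def classify(self, string_vector):
        """Return the category whose output node has the maximum score."""
        s = self.scores(string_vector)
        return max(self.categories, key=lambda c: s[c])

    def train(self, examples, epochs=100):
        self.initialize(examples)
        for _ in range(epochs):
            errors = 0
            for string_vector, target in examples:
                predicted = self.classify(string_vector)
                if predicted == target:
                    continue  # correctly classified: weights are not updated
                errors += 1
                for word in string_vector:
                    # Assumed update rule: additive step of size learning_rate.
                    self.tables[target][word] += self.learning_rate
                    self.tables[predicted][word] -= self.learning_rate
            if errors == 0:
                break  # assumed stopping criterion


# Toy training set of 3-dimensional string vectors (hypothetical data).
training = [
    (["oil", "crude", "barrel"], "crude"),
    (["oil", "price", "opec"], "crude"),
    (["wheat", "grain", "harvest"], "grain"),
    (["grain", "price", "export"], "grain"),
]

ntc = NTCSketch(["crude", "grain"])
ntc.train(training)
label = ntc.classify(["oil", "export", "unknown"])
print(label)
```

Because the tables are plain word-to-weight mappings, the words that contributed most to a decision can be read directly from the table of the winning category, which is the kind of traceability the paper claims for string vectors.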
Table 3 illustrates the predefined categories and the number of news articles per category in the test bed, Reuters-21578. The ten most frequent categories are selected among more than 100 categories in this set of experiments. As illustrated in table 3, the collection of news articles is partitioned into two sets: the training set and the test set. The selection of the ten most frequent categories and the partition follow the literature [16]. Reuters-21578 is popularly used as the standard test bed for evaluating approaches to text categorization [1].

Table 3. The Most Frequent Categories in Reuters-21578

Category Name   #Training   #Test   #Total
Earn
Acq
Money-Fx
Crude
Grain
Trade
Interest
Ship
Wheat
Corn

The configurations of the involved approaches are illustrated in table 4. The parameters of the SVM and the KNN, the capacity and the number of nearest neighbors, are set to four and three, respectively, while the NB has no parameter. The parameters of the NNBP, such as the number of hidden nodes and the learning rate, are set arbitrarily as shown in table 4. News articles are encoded into 500 dimensional numerical vectors and 50 dimensional string vectors. Therefore, the configurations of the involved approaches are set as shown in table 4.

Table 4. The Configurations of Participating Approaches

Approaches    Parameter Configurations
SVM           Capacity = 4.0
KNN           #nearest neighbors = 3
Naive Bayes   N/A
NNBP          Hidden Layer: 10 hidden nodes; Learning rate: 0.3; #Training Epochs: 1000
NTC           Learning rate: 0.3; #Training Epochs: 100

Figure 2. The Results of This Set of Experiments

The results of comparing the involved approaches with each other are presented in figure 2. Among the five bars, the black bar indicates the performance of the proposed approach. The left group indicates the micro-averaged F1 measure of the five approaches, and the right group indicates the macro-averaged F1 measure. The proposed approach, NTC, shows the best performance among the five approaches.

We need to discuss the results illustrated in figure 2. Note that documents are encoded into string vectors of a smaller dimension than that of numerical vectors. The NTC shows the best performance among the five approaches involved in this set of experiments. Even if the NTC had only comparable performance to the others, it would be the more practical approach, since its input size is much smaller. Therefore, this set of experiments shows that the proposed approach is the most practical with respect to both its better performance and its more compact input size.

6 Conclusion

Four contributions are considered as the significance of this research. First, this research proposes a practical approach, according to the results of the set of experiments. Second, it solves the two main problems, huge dimensionality and sparse distribution, which are inherent in encoding documents into numerical vectors. Third, it creates a new neural network, called NTC, which receives string vectors, unlike previous neural networks.
Last, it provides the potential for tracing more easily why each document is classified as it is.

Let's consider four remaining tasks as further research. The first task is to apply the proposed neural network to the categorization of documents within a specific domain such as medicine, law, or engineering. The second task is to modify it into a regression version. The third task is to create an unsupervised neural network based on the proposed neural network. The last task is to develop a text categorization system where the approach is adopted as its categorization engine.

References

[1] F. Sebastiani, Machine Learning in Automated Text Categorization, pp1-47, ACM Computing Surveys, Vol 34, No 1, 2002.
[2] T.M. Cover and P.E. Hart, Nearest Neighbor Pattern Classification, pp21-27, IEEE Transactions on Information Theory, Vol 13, 1967.
[3] B. Massand, G. Linoff, and D. Waltz, Classifying News Stories using Memory based Reasoning, pp59-65, The Proceedings of 15th ACM International Conference on Research and Development in Information Retrieval, 1992.
[4] Y. Yang, An evaluation of statistical approaches to text categorization, pp67-88, Information Retrieval, Vol 1, No 1-2, 1999.
[5] I. Kononenko, ID3, sequential Bayes, naive Bayes and Bayesian neural networks, pp91-98, The Proceedings of 4th European Working Session on Learning, Montpellier, 1989.
[6] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[7] D. Mladenic and M. Grobelnik, Feature Selection for unbalanced class distribution and Naïve Bayes, The Proceedings of International Conference on Machine Learning, 1999.
[8] M. Hearst, Support Vector Machines, pp18-28, IEEE Intelligent Systems, Vol 13, No 4, 1998.
[9] T. Joachims, Text Categorization with Support Vector Machines: Learning with many Relevant Features, The Proceedings of 10th European Conference on Machine Learning, 1998.
[10] H. Drucker, D. Wu, and V. N. Vapnik, Support Vector Machines for Spam Categorization, IEEE Transactions on Neural Networks, Vol 10, No 5, 1999.
[11] N. Cristianini and J. Shawe-Taylor, Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press.
[12] J. McClelland and D. Rumelhart, Parallel Distributed Processing, Vol 1 and 2, MIT Press, 1986.
[13] E. D. Wiener, A Neural Network Approach to Topic Spotting in Text, The Thesis of Master of University of Colorado, 1995.
[14] M. E. Ruiz and P. Srinivasan, Hierarchical Text Categorization Using Neural Networks, pp87-118, Information Retrieval, Vol 5, No 1.
[15] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, Text Classification with String Kernels, Journal of Machine Learning Research, Vol 2, No 2, 2002.
[16] A. Estabrooks, T. Jo, and N. Japkowicz, A Multiple Resampling Method for Learning from Imbalanced Data Sets, pp18-36, Computational Intelligence, Vol 28, No 1.
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
More informationMining a Corpus of Job Ads
Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department
More informationInternational Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015