Automatic Text Categorization using NTC

Taeho Jo
School of Computer and Information Engineering, Inha University

Abstract

In this research, we propose NTC (Neural Text Categorizer) as an approach to text categorization. Traditional approaches to text categorization require encoding documents into numerical vectors, which leads to two main problems: huge dimensionality and a sparse distribution within each numerical vector. In this research, documents are encoded into string vectors instead of numerical vectors, and a new neural network called NTC, which receives a string vector as its input vector, is used for text categorization. The goal of this research is to avoid the two main problems by encoding documents into an alternative structured form of data. We validate the performance of NTC by comparing it with other machine learning algorithms on the standard test bed, Reuters-21578.

1 Introduction

Text categorization refers to the process of assigning one or more of the predefined categories to each document. Two preliminary tasks precede text categorization: the predefinition of categories and the preparation of sample labeled documents. After the preliminary tasks, a text categorization system learns from the sample labeled documents in order to build its classification capacity, and it then uses that capacity to classify unseen documents. The system therefore performs text categorization in three steps: the preliminary tasks, the learning, and the classification.

Using any of the traditional approaches to text categorization requires encoding documents into numerical vectors. By indexing a corpus of sample labeled documents, words called feature candidates are generated. Only some of them are selected as features, and various criteria for doing so have already been defined [7]. A document is then represented as a numerical vector whose attributes are the selected features, with a value assigned to each feature. Encoding documents in this way causes the two main problems: the huge dimensionality and the sparse distribution.

The first problem is the huge dimensionality: many features are required to avoid representing documents as zero vectors. The number of features is usually several hundred according to the previous literature [13][15][1]. Although this is small compared with the number of feature candidates, it is still large as an input size. The problem costs much time and many system resources for processing each document, and it demands a number of sample documents proportional to the number of features.

The second problem is the sparse distribution: zero values occupy almost all positions of each numerical vector. Under this problem, complete zero vectors may occur among the representations of the sample labeled documents regardless of their categories, and unseen documents may also be encoded into zero vectors. The problem makes the learning and the classification of text categorization approaches very fragile.
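To make the two problems concrete, the following minimal sketch (not taken from the paper; the corpus, feature count, and function names are illustrative) shows the conventional term-frequency encoding under criticism: documents become long vectors over a fixed feature set, and a short document fills almost none of the positions.

```python
# Illustrative sketch of the conventional numerical-vector encoding:
# documents are mapped to term-frequency vectors over a fixed feature set.
from collections import Counter

def build_features(corpus, num_features=500):
    """Select the most frequent words in the corpus as features."""
    counts = Counter(word for doc in corpus for word in doc.lower().split())
    return [word for word, _ in counts.most_common(num_features)]

def encode_numeric(doc, features):
    """Encode a document as a term-frequency vector over the feature list."""
    counts = Counter(doc.lower().split())
    return [counts.get(word, 0) for word in features]

corpus = ["grain prices rose sharply", "the central bank cut interest rates"]
features = build_features(corpus, num_features=500)
vector = encode_numeric("interest rates fell", features)
print(len(vector), sum(1 for v in vector if v == 0))  # most entries are zero
```

With a realistic corpus and several hundred selected features, almost every entry of such a vector is zero, which is exactly the sparse distribution described above.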
The fundamental idea of this research is to encode documents into string vectors instead of numerical vectors. A string, as an element of a string vector, is a combination of characters of a language and serves as the basic semantic unit. A string vector is a finite ordered set of strings; string vectors with identical elements in different orders are treated as different vectors. The intention of encoding documents in this way is to avoid the two main problems that characterize the encoding of documents into numerical vectors. To solve them, a neural network model called NTC (Neural Text Categorizer) is proposed as the approach to text categorization. The model receives string vectors as its input vectors and initializes its weights with the frequencies of elements within each class.

The weights are optimized during the learning process of the neural network to minimize its classification error on the training examples. When an unseen string vector, representing an unseen document, is given, the value of each output node is computed by summing the weights connected to it, and the unseen document is classified into the category corresponding to the output node with the maximum value. Since the proposed neural network model uses string vectors as its input vectors, it is free from the two main problems and is expected to improve the performance of text categorization by avoiding them.

In order to use the proposed neural network model, documents are encoded into string vectors, as mentioned above. Dominance of a particular value never happens in string vectors, so nothing analogous to the sparse distribution arises in a string vector. A much smaller dimension than that of numerical vectors is sufficient for robust learning and classification. Therefore, the two main problems never arise when the proposed neural network model is used for text categorization.

This paper consists of six sections, including this one. In section 2, we explore the relevant literature. In sections 3 and 4, we describe the definition of string vectors and the learning and classification algorithms of the proposed neural network model, respectively. In section 5, we empirically validate the performance of the proposed neural network model by comparing it with other machine learning algorithms on the test bed, Reuters-21578. In section 6, we mention the significance of this research and further research as the conclusion of this article.

2 Previous Works

This section is concerned with previous work relevant to this research. Although many approaches to text categorization have already been proposed, we mention the four representative and popular ones: KNN (K Nearest Neighbor), NB (Naive Bayes), SVM (Support Vector Machine), and neural networks. Using any of them for text categorization requires encoding documents into numerical vectors, which causes the two main problems. The string kernel was proposed as a solution to the two main problems when using the SVM for text categorization, but it failed to improve the performance. In this section, we explore the previous approaches to text categorization and the previous solution to the two main problems.

The KNN may be considered a typical and popular approach to text categorization [1]. It was initially created by Cover and Hart in 1967 as a generic classification algorithm [2]. It was first applied to text categorization by Massand et al. in 1992 [3], and it was recommended by Yang in 1999 [4] and by Sebastiani in 2002 [1] as a practical approach to text categorization. The KNN has therefore served as a baseline approach in other literature [1].

The Naive Bayes may be considered another approach to text categorization. It was initially created by Kononenko in 1989, based on Bayes' rule [5]. Its application to text categorization was mentioned in the textbook by Mitchell in 1997 [6]. Taking the Naive Bayes as a popular approach, Mladenic and Grobelnik proposed and evaluated feature selection methods for it in 1999 [7]. The Naive Bayes has been compared with other, subsequent approaches to text categorization [9][10].
Recently, the SVM has been recommended as the practical approach to text categorization [9][10]. It was introduced in a magazine article by Hearst in 1998 [8], and in the same year it was applied to text categorization by Joachims [9]. It was adopted as the approach to spam mail filtering, a practical instance of text categorization, in 1999 by Drucker et al. [10]. The SVM is now popularly used not only for text categorization but also for many other pattern classification tasks [11].

Neural networks may also be considered an approach to text categorization; among them, the MLP (Multi-Layer Perceptron) with back propagation is the most popular model. The model was initially created in 1986 by McClelland and Rumelhart and was intended for pattern classification and nonlinear regression as a supervised learning algorithm [12]. It was first applied to text categorization in 1995 by Wiener, whose master's thesis validated its performance by comparing it with the KNN on the Reuters test bed [13]. Although the neural network classifies documents more accurately, it takes a very long time to learn the training documents.

The string kernel was proposed as a solution to the two main problems inherent in encoding documents into numerical vectors. It was initially proposed by Lodhi et al. in 2002 as a kernel function of the SVM [15]. The string kernel receives two raw texts as its inputs and computes the syntactical similarity between them. Since documents do not need to be encoded into numerical vectors, the two main problems are naturally avoided. However, computing the similarity is very expensive, and the string kernel failed to improve the performance of text categorization.
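For intuition only, the sketch below measures the similarity of two raw texts by counting shared character k-grams. This is a deliberate simplification and not the string kernel of Lodhi et al., which weights non-contiguous subsequences and is far more expensive to compute; the function names and the choice k = 3 are illustrative.

```python
# Simplified kernel-style similarity on raw text: count shared character
# k-grams. Unlike the actual string kernel, no gapped subsequences are used.
from collections import Counter

def kgram_counts(text, k=3):
    text = text.lower()
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

def kgram_similarity(a, b, k=3):
    """Unnormalized inner product of the two k-gram count vectors."""
    ca, cb = kgram_counts(a, k), kgram_counts(b, k)
    return sum(ca[g] * cb[g] for g in ca if g in cb)

print(kgram_similarity("oil prices rise", "crude oil price rises"))
```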

This research has three advantages, as mentioned in section 1. The first advantage is that it avoids the two main problems by encoding documents into an alternative structured form of data. The second advantage is that string vectors are more transparent than numerical vectors with respect to the content of the full text; it is easier to guess the content of a document from its string vector than from its numerical vector. The third advantage, derived from the second, is that it is potentially easier to trace why each document is classified as it is. Because of these three advantages, this research proposes a neural network that receives string vectors as its input data.

3 String Vector

This section is concerned with the general aspects of string vectors. A string vector is an ordered set of strings. In the context of a natural language, a string indicates a word or a vocabulary item; in other words, words are the elements of string vectors. This section therefore describes string vectors in detail as the alternative representation of documents to numerical vectors.

A string vector is defined as an ordered set of words with a fixed size. It is denoted by [s_1, s_2, ..., s_d], where s_i denotes a string and there are d elements. When documents are represented as string vectors, their size is fixed at d, which is called the dimension of the string vectors. Since the elements of a string vector are ordered, two string vectors with identical elements in different orders are treated as different vectors; the reason is that each position corresponds to its own feature.

Table 1 illustrates the differences between string vectors and numerical vectors. The first difference is that numerical values are the elements of numerical vectors, while strings are the elements of string vectors. The second difference is that the similarity measure between two numerical vectors is the cosine similarity or the Euclidean distance, while that between two string vectors is the average semantic similarity (not described here, since it is not involved in executing the proposed neural network). The third difference is that the features for encoding documents into numerical vectors are words, while those for encoding them into string vectors are statistical, linguistic, and posting properties of words. Therefore, a string vector is a vector in which the numerical values of a numerical vector are replaced by strings.

Table 1. The Comparison of Numerical Vectors and String Vectors

                         Numerical Vector                      String Vector
    Element              Numerical value                       String
    Similarity measure   Inner product, Euclidean distance     Semantic similarity
    Attributes           Words                                 Properties of words

The differences between string vectors and bags of words are illustrated in Table 2. Both types of structured data have strings as their elements. As the similarity measure, the cardinality of the intersection of two bags of words is used, while the average semantic similarity is used for string vectors. A bag of words is defined as an unordered infinite set of words, while a string vector is defined as an ordered finite set of words. Although a bag of words and a string vector look similar, they should be distinguished from each other, as summarized in Table 2.

Table 2. The Comparison of Bags of Words and String Vectors

                         Bag of Words               String Vector
    Element              String                     String
    Similarity measure   Number of shared words     Semantic similarity
    Set                  Unordered infinite set     Ordered finite set
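The paper does not spell out the exact encoding scheme in this section, so the following sketch is purely hypothetical: it fills the i-th position of a d-dimensional string vector with the i-th most frequent word of the document, which matches the idea that each position carries its own word property.

```python
# Hypothetical illustration of a fixed-size, ordered string-vector encoding.
from collections import Counter

def encode_string_vector(doc, d=10):
    """Return an ordered, fixed-size list of strings representing the document."""
    ranked = [word for word, _ in Counter(doc.lower().split()).most_common()]
    ranked += [""] * d                      # pad short documents
    return ranked[:d]

print(encode_string_vector("grain exports rose as grain prices fell", d=5))
# e.g. ['grain', 'exports', 'rose', 'as', 'prices']
```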
There are three advantages to representing documents as string vectors. The first is that the two main problems, the huge dimensionality and the sparse distribution, are avoided completely. The second is that string vectors are more transparent representations of documents than numerical vectors; it is easier to guess the content of a document from its representation alone. The third is the potential for tracing more easily why documents are classified as they are. However, in order to use string vectors more freely, it is necessary to establish mathematical foundations for them.

4 Neural Text Categorizer

This section is concerned with the architecture, learning algorithm, and classification procedure of the proposed neural network, called NTC. Documents are encoded into string vectors for use with the NTC. The network contains a special layer, called the learning layer. With respect to the architecture of the NTC, the input nodes are directly connected to the output nodes, and the weights between the input and output nodes are decided by the learning nodes.

In this section, we describe the architecture and the learning process of the NTC.

The architecture of the NTC is illustrated in Figure 1. The input layer receives an input vector given as a string vector, and the number of nodes in this layer matches the dimension of the string vector. The output layer generates categorical scores, which indicate the likelihood of the given input vector belonging to each category, and the number of nodes in this layer matches the number of predefined categories or classes. The learning layer decides the weights between the input and output layers differently depending on the given input vector, and the number of nodes in this layer also matches the number of categories or classes. Therefore, with respect to its architecture, the NTC has the three layers shown in Figure 1.

Figure 1. Overall Architecture of the NTC

In the context of the learning process, the first step of the NTC is to initialize the weights between the input and output layers. Let us assume that the NTC is applied to text categorization without decomposing the task into binary classification tasks. The set of training string vectors is partitioned category by category, and each learning node has its own table consisting of words and their weights. The frequencies of the elements of the string vectors within each category are assigned in the table as the initial weights. Therefore, the initial step is to set up the tables in the learning nodes.

The learning process of the NTC is the process of optimizing the weights in the tables of the learning nodes. Each training example is classified by summing the initial weights and selecting the category corresponding to the maximal sum. If the training example is classified correctly, the weights are not updated; otherwise, the weights toward the target category are incremented and those toward the classified category are decremented. The optimized weights are the output of this process.

In the NTC, each example, whether a training or an unseen example, is classified by summing the optimized weights. Each output node generates the summation of the weights connected to it from the input nodes as its categorical score. The weights are decided by referring to the table owned by the corresponding learning node. The category corresponding to the output node that generates the maximum categorical score is decided as the category of the given example. Therefore, the output of this process is one of the predefined categories, assuming that the NTC is applied to text categorization without the decomposition.

A particular property characterizes the NTC: the learning layer exists inherently in the NTC, and each learning node has its own table of words and their weights as the reference for deciding weights. Each input node receives a string as an element of a string vector, and the learning nodes decide the weights connected from that input node by referring to their own tables. In the current version of the NTC, if a word is not registered in a table, its weight is set to zero. This issue will be considered in further research.
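The procedure just described can be condensed into the following sketch, assuming that a string vector is simply an ordered list of words. The class name, the data layout, and the use of the learning rate as the increment and decrement size are illustrative readings rather than details confirmed by the paper.

```python
# Condensed sketch of NTC: per-category weight tables, frequency
# initialization, error-driven updates, and argmax classification.
from collections import defaultdict

class NTCSketch:
    def __init__(self, categories, learning_rate=0.3):
        self.learning_rate = learning_rate
        # One weight table per category (learning node): word -> weight.
        self.tables = {c: defaultdict(float) for c in categories}

    def initialize(self, training_data):
        """Initial weights are the word frequencies within each category."""
        for string_vector, category in training_data:
            for word in string_vector:
                self.tables[category][word] += 1.0

    def score(self, string_vector, category):
        """Categorical score: sum of weights for the words of the input;
        unregistered words contribute zero."""
        table = self.tables[category]
        return sum(table[w] for w in string_vector if w in table)

    def classify(self, string_vector):
        return max(self.tables, key=lambda c: self.score(string_vector, c))

    def train(self, training_data, epochs=100):
        self.initialize(training_data)
        for _ in range(epochs):
            for string_vector, target in training_data:
                predicted = self.classify(string_vector)
                if predicted != target:
                    # Increment toward the target, decrement toward the
                    # wrongly chosen category.
                    for word in string_vector:
                        self.tables[target][word] += self.learning_rate
                        self.tables[predicted][word] -= self.learning_rate

model = NTCSketch(categories=["earn", "acq"])
model.train([(["profit", "rose", "quarter"], "earn"),
             (["merger", "bid", "stake"], "acq")], epochs=10)
print(model.classify(["quarterly", "profit", "rose"]))  # -> "earn"
```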
5 Empirical Results

This section is concerned with the empirical validation of the performance of the NTC. We use the collection of news articles called Reuters-21578 as the test bed. This set of experiments involves five approaches: KNN, NB, SVM, NNBP, and NTC. The F1 measure is used for evaluating the performance of each approach to text categorization. In this section, the test bed and the configurations of the approaches involved in the experiments are described, and the results are presented and discussed.

Table 3 lists the predefined categories and the number of news articles per category in the test bed, Reuters-21578. The ten most frequent categories are selected among the more than 100 categories for this set of experiments. As illustrated in Table 3, the collection of news articles is partitioned into two sets: the training set and the test set. The selection of the ten most frequent categories and the partition follow the literature [16]. Reuters-21578 is popularly used as the standard test bed for evaluating approaches to text categorization [1].

The configurations of the involved approaches are illustrated in Table 4. The parameters of the SVM and the KNN, the capacity and the number of nearest neighbors, are set to 4.0 and 3, respectively, while the NB has no parameter. The parameters of the NNBP, such as the number of hidden nodes and the learning rate, are set arbitrarily as shown in Table 4. News articles are encoded into 500-dimensional numerical vectors and 50-dimensional string vectors. Therefore, the configurations of the involved approaches are as shown in Table 4.
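Since the results below are reported as micro- and macro-averaged F1 measures, here is a brief sketch of how the two averages are computed; the per-category counts in the example are purely illustrative.

```python
# Micro-averaged F1 pools counts over categories; macro-averaged F1 averages
# the per-category F1 scores.
def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def micro_macro_f1(per_category_counts):
    """per_category_counts: list of (tp, fp, fn) tuples, one per category."""
    macro = sum(f1(*c) for c in per_category_counts) / len(per_category_counts)
    tp, fp, fn = (sum(col) for col in zip(*per_category_counts))
    return f1(tp, fp, fn), macro

micro, macro = micro_macro_f1([(80, 10, 20), (15, 5, 25)])
print(round(micro, 3), round(macro, 3))
```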

Table 3. The Most Frequent Categories in Reuters-21578

    Category Name    #Training    #Test    #Total
    Earn
    Acq
    Money-Fx
    Crude
    Grain
    Trade
    Interest
    Ship
    Wheat
    Corn

Table 4. The Configurations of Participating Approaches

    Approach       Parameter Configuration
    SVM            Capacity = 4.0
    KNN            #nearest neighbors = 3
    Naive Bayes    N/A
    NNBP           Hidden layer: 10 hidden nodes; learning rate: 0.3; #training epochs: 1000
    NTC            Learning rate: 0.3; #training epochs: 100

The results of comparing the involved approaches with each other are presented in Figure 2. Among the five bars in each group, the black bar indicates the performance of the proposed approach. The left group shows the micro-averaged F1 measure of the five approaches, and the right group shows the macro-averaged F1 measure. The proposed approach, NTC, shows the best performance among the five approaches.

Figure 2. The Results of This Set of Experiments

We now discuss the results illustrated in Figure 2. Note that documents are encoded into string vectors of smaller dimension than the numerical vectors. The NTC shows the best performance among the five approaches involved in this set of experiments. Even if the NTC only had performance comparable to the others, it would be the more practical approach, since its input size is much smaller. Therefore, this set of experiments shows that the proposed approach is the most practical with respect to both the better performance and the more compact input size.

6 Conclusion

Four contributions constitute the significance of this research. First, this research proposes a practical approach, according to the results of the experiments. Second, it solves the two main problems, the huge dimensionality and the sparse distribution, which are inherent in encoding documents into numerical vectors. Third, it creates a new neural network, called NTC, which, unlike previous neural networks, receives string vectors. Last, it provides the potential for easily tracing why each document is classified as it is.

Let us consider four remaining tasks as further research. The first task is to apply the proposed neural network to the categorization of documents within a specific domain such as medicine, law, or engineering. The second task is to modify it into a regression version. The third task is to create an unsupervised neural network based on the proposed neural network. The last task is to develop a text categorization system in which the approach is adopted as its categorization engine.

References

[1] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47, 2002.
[2] T. M. Cover and P. E. Hart, "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory, Vol. 13, pp. 21-27, 1967.
[3] B. Massand, G. Linoff, and D. Waltz, "Classifying News Stories using Memory Based Reasoning," Proceedings of the 15th ACM International Conference on Research and Development in Information Retrieval, pp. 59-65, 1992.

[4] Y. Yang, "An Evaluation of Statistical Approaches to Text Categorization," Information Retrieval, Vol. 1, No. 1-2, pp. 67-88, 1999.
[5] I. Kononenko, "ID3, Sequential Bayes, Naive Bayes and Bayesian Neural Networks," Proceedings of the 4th European Working Session on Learning, Montpellier, pp. 91-98, 1989.
[6] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[7] D. Mladenic and M. Grobelnik, "Feature Selection for Unbalanced Class Distribution and Naive Bayes," Proceedings of the International Conference on Machine Learning, 1999.
[8] M. Hearst, "Support Vector Machines," IEEE Intelligent Systems, Vol. 13, No. 4, pp. 18-28, 1998.
[9] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proceedings of the 10th European Conference on Machine Learning, 1998.
[10] H. Drucker, D. Wu, and V. N. Vapnik, "Support Vector Machines for Spam Categorization," IEEE Transactions on Neural Networks, Vol. 10, No. 5, 1999.
[11] N. Cristianini and J. Shawe-Taylor, Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press.
[12] J. McClelland and D. Rumelhart, Parallel Distributed Processing, Vols. 1 and 2, MIT Press, 1986.
[13] E. D. Wiener, A Neural Network Approach to Topic Spotting in Text, Master's thesis, University of Colorado, 1995.
[14] M. E. Ruiz and P. Srinivasan, "Hierarchical Text Categorization Using Neural Networks," Information Retrieval, Vol. 5, No. 1, pp. 87-118.
[15] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, "Text Classification with String Kernels," Journal of Machine Learning Research, Vol. 2, 2002.
[16] A. Estabrooks, T. Jo, and N. Japkowicz, "A Multiple Resampling Method for Learning from Imbalanced Data Sets," Computational Intelligence, Vol. 28, No. 1, pp. 18-36.
