# Impact of Boolean factorization as preprocessing methods for classification of Boolean data


Radim Belohlavek, Jan Outrata, Martin Trnecka

Data Analysis and Modeling Lab (DAMOL), Dept. Computer Science, Palacky University, Olomouc

**Abstract.** The paper explores the use of Boolean factorization as a method for data preprocessing in classification of Boolean data. In previous papers, we demonstrated that data preprocessing consisting in replacing the original Boolean attributes by factors, i.e. new Boolean attributes obtained from the original ones by Boolean factorization, improves the quality of classification. The aim of this paper is to explore how the various Boolean factorization methods proposed in the literature impact the quality of classification. In particular, we compare three factorization methods, present experimental results, and outline issues for future research.

## 1 Problem Setting

In classification of Boolean data, the objects to classify are described by Boolean (binary, yes-no) attributes. As with other classification problems, one may be interested in preprocessing the input attributes to improve the quality of classification. With Boolean input attributes, we might want to limit ourselves to preprocessing with a clear semantics. Namely, as is known (see e.g. [2, 8]), applying to Boolean data methods designed originally for real-valued data distorts the meaning of the data and generally leads to results that are difficult to interpret. In [9, 10], we proposed a method for preprocessing Boolean data based on the Boolean matrix factorization (BMF) method, i.e. a decomposition method for Boolean matrices, developed in [2]. The method consists in using new Boolean attributes for the classification of objects. The new attributes are the factors computed from the original attributes.
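To make the preprocessing idea concrete before the formal development, here is a tiny worked example (the data and the factorization are ours, chosen purely for illustration): four objects described by four Boolean attributes can be described exactly by two factors.

```python
import numpy as np

# Toy Boolean data: 4 objects (rows) x 4 original attributes (columns).
I = np.array([[1, 1, 0, 0],
              [1, 1, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0]])

# One exact factorization with k = 2 factors:
#   A (object-factor matrix): which factors apply to which objects,
#   B (factor-attribute matrix): which attributes each factor manifests as.
A = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [1, 0]])
B = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])

# The Boolean matrix product of A and B reproduces I exactly, so the two
# factor attributes (the columns of A) can replace the four original ones.
reconstructed = (A @ B > 0).astype(int)
assert (reconstructed == I).all()
```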
Since the factors are essentially (some of the) formal concepts [4] associated to the input data, they have a clear meaning and are easy to interpret [2]. Moreover, there exists a natural transformation of objects between the space of the original attributes and the space of factors [2], which is conveniently utilized by the method. It has been demonstrated in [9, 10] that such preprocessing makes it possible to classify using a smaller number of input variables (factors instead of the original attributes) and yet improve the quality of classification.

*We acknowledge support by the ESF project No. CZ.1.07/2.3.00/ , co-financed by the European Social Fund and the state budget of the Czech Republic (R. Belohlavek); Grant No. 202/10/P3 of the Czech Science Foundation (J. Outrata); and by the IGA of Palacky University, No. PrF (M. Trnecka).*

*© Laszlo Szathmary, Uta Priss (Eds.): CLA 2012, Universidad de Málaga (Dept. Matemática Aplicada).*

In addition to the method from [2], several other BMF methods have been described in the literature. In the present paper, we therefore look at the question of how these methods influence the quality of classification. In particular, we focus on three such methods and provide an experimental evaluation using basically the same scenario as in [9, 10]. In doing so, we emphasize the need to consider not only coverage and the number of extracted factors, but also additional criteria regarding the quality of the proposed BMF methods.

We use the following notation. We denote by X = {1, ..., n} a set of objects given along with their input Boolean attributes, which form the set Y = {1, ..., m}, and a class attribute c. The input attributes are described by an n × m Boolean matrix I with entries I_ij (the entry at row i and column j), i.e. I_ij ∈ {0, 1} for every i, j. Alternatively, I may be considered a representation of a binary relation between X and Y, and hence we may speak of a formal context ⟨X, Y, I⟩, etc. Since there is no danger of confusion, we conveniently switch between the matrix and relational ways of looking at things. The class attribute c may be conceived as a mapping c : X → C assigning to every object i ∈ X its class label c(i) in the set C of all class labels (note that C may contain more than two labels).

The preprocessing method, along with the three particular methods of Boolean matrix factorization, is described in Section 2. Section 3 describes the experiments and provides their results. In Section 4 we conclude the paper and outline directions for future research.

## 2 Boolean Matrix Factorization and Its Utilization

### 2.1 General BMF Problem

We denote by {0, 1}^(n×m) the set of all n × m Boolean matrices and by I_i and I_j the i-th row and j-th column, respectively, of matrix I.
In BMF, the general aim is to find, for a given I ∈ {0, 1}^(n×m) (and possibly other parameters, see Problem 1 and Problem 2), matrices A ∈ {0, 1}^(n×k) and B ∈ {0, 1}^(k×m) for which I is (approximately) equal to

$$A \circ B, \tag{1}$$

with ∘ being the Boolean matrix product given by

$$(A \circ B)_{ij} = \bigvee_{l=1}^{k} A_{il} \cdot B_{lj}, \tag{2}$$

where ∨ denotes maximum and · the ordinary product. Such an exact or approximate decomposition of I into A ∘ B corresponds to the discovery of k factors (new Boolean variables) that exactly or approximately explain the data. Namely, factor l = 1, ..., k may be represented by A_l (column l of A) and B_l (row l of B): A_il = 1 indicates that factor l applies to object i, while B_lj = 1 indicates that attribute j is a particular manifestation of factor l (think of person A as object, being fluent in English as attribute, and having good education as factor). The least k for which an exact decomposition I = A ∘ B exists is called the Boolean (or Schein) rank of I [2, 5, 8]. According to (2), the factor model then reads: object i has attribute j if and only if there exists a factor l such that l applies to i and j is a particular manifestation of l. The matrices I, A, and B are usually called the object-attribute matrix, the object-factor (or usage) matrix, and the factor-attribute (or basis vector) matrix [2, 8].

The methods described in the literature are usually designed for two particular problems. Consider the matrix metric [5, 8] (arising from the L_1-norm of matrices, or the Hamming weight in the case of Boolean matrices) given by

$$E(C, D) = \|C - D\| = \sum_{i=1}^{n} \sum_{j=1}^{m} |C_{ij} - D_{ij}|. \tag{3}$$

E(I, A ∘ B) may be used to assess how well the product A ∘ B approximates the input matrix I.

**Problem 1**
*Input:* I ∈ {0, 1}^(n×m), a positive integer k.
*Output:* A ∈ {0, 1}^(n×k) and B ∈ {0, 1}^(k×m) minimizing ‖I − A ∘ B‖.

This problem is called the discrete basis problem (DBP) in [8]. In [2], the following problem is considered:

**Problem 2**
*Input:* I ∈ {0, 1}^(n×m), a non-negative integer ε.
*Output:* A ∈ {0, 1}^(n×k) and B ∈ {0, 1}^(k×m) with k as small as possible such that ‖I − A ∘ B‖ ≤ ε.

The two problems reflect two important views on BMF: the first emphasizes the importance of the first k (presumably most important) factors, while the second emphasizes the need to account for (and thus to explain) a prescribed portion of the data. Note that the problem of finding an exact decomposition of I with the least possible number k of factors is a particular instance of Problem 2 (put ε = 0). Note also that it follows from known results that both Problem 1 and Problem 2 are NP-hard optimization problems, see e.g. [2, 8], and hence approximation algorithms are needed to obtain (suboptimal) solutions.

### 2.2 Use of BMF in Preprocessing of Boolean Data

The idea may be described as follows. For a given set X of objects, set Y of attributes, Boolean matrix I, and class attribute c, we compute n × k and k × m Boolean matrices A and B, respectively, for which A ∘ B approximates I reasonably well (according to the scenario of either Problem 1 or Problem 2). Then, instead of the original instance ⟨X, Y, I, c⟩ of the classification problem, we consider a new instance ⟨X, F, A, c⟩, with F = {1, ..., k} denoting the factors, i.e. the new Boolean attributes. Any classification model developed for ⟨X, F, A, c⟩ may then be used to classify the objects described by the original Boolean attributes from Y. Namely, one may utilize the natural transformations g : {0, 1}^m → {0, 1}^k and h : {0, 1}^k → {0, 1}^m between the space of the original attributes and the space of factors, given by

$$(g(P))_l = \bigwedge_{j=1}^{m} (B_{lj} \rightarrow P_j) \qquad\text{and}\qquad (h(Q))_j = \bigvee_{l=1}^{k} (Q_l \wedge B_{lj})$$

for P ∈ {0, 1}^m and Q ∈ {0, 1}^k (∧ and → denote minimum and implication, respectively). These transformations are described in [2], to which we refer for more information. In particular, given an object represented by P ∈ {0, 1}^m (the vector of values of the m input attributes), we apply the classification method developed for ⟨X, F, A, c⟩ to g(P), i.e. to the object's representation in the space of factors. Any classification model M_F : {0, 1}^k → C for ⟨X, F, A, c⟩ therefore induces a classification model M_Y : {0, 1}^m → C by M_Y(P) = M_F(g(P)) for any P ∈ {0, 1}^m.

Note that the number k of factors of I is usually smaller than the number m of attributes (see [2]), which means a reduction of the dimensionality of the data, and that the transformation of objects from the attribute space to the factor space is not an injective mapping. We therefore need to solve the problem of assigning a class label to objects in ⟨X, F, A, c⟩ with equal g(P) representations transformed from objects in ⟨X, Y, I, c⟩ with different P representations and different assigned class labels. We adopt the common solution of assigning to such objects in ⟨X, F, A, c⟩ the majority class label of the class labels assigned to the objects in ⟨X, Y, I, c⟩.

### 2.3 Three Methods for Boolean Matrix Factorization Used in Our Experiments

**Asso** [8] works as follows. From the input n × m matrix I, the required number k of factors, and parameters τ, w⁺, and w⁻, the algorithm computes an m × m matrix C in which C_ij = 1 if the confidence of the association rule {i} ⇒ {j} is at least τ. The rows of C are then the candidate rows for matrix B. The actual k rows of B are selected from the rows of C in a greedy manner using the parameters w⁺ and w⁻.
During the greedy selection, the k columns of A are selected along with the k rows of B. This way, one obtains from I two matrices A and B such that A ∘ B approximates I. Asso is designed for Problem 1. There is no guarantee that Asso computes an exact factorization of I even for k = m, see [3]. In our experiments, we used τ = 1 and w⁺ = w⁻ = 1 because this choice guarantees that for k = m all the 1s in I will be covered by the computed factors.

**GreConD** This algorithm, described in [2] where it is called Algorithm 2, utilizes formal concepts of I as factors. Namely, the algorithm selects formal concepts of I, one by one, until a decomposition of I into A ∘ B is obtained. The selected formal concepts are utilized in a simple way: the (characteristic vectors of the) extents and intents of the concepts form the columns of A and the rows of B, respectively. The algorithm may be stopped after computing the first k concepts or whenever ‖I − A ∘ B‖ ≤ ε, i.e. it may be used for solving Problem 1 as well as Problem 2. The formal concepts are selected in a greedy manner, in an on-demand way, to maximize the drop of the error function, whence the name Gre(edy)Con(cepts on)D(emand).
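A minimal NumPy sketch of the Boolean product (2), the transformation g from Section 2.2, and a GreConD-style greedy selection of formal concepts. All names are ours, and the factorization routine is a deliberately simplified illustration of the "concepts on demand" idea, not the algorithm from [2] verbatim:

```python
import numpy as np

def bool_product(A, B):
    """Boolean matrix product (2): (A o B)_ij = max_l min(A_il, B_lj)."""
    return (A @ B > 0).astype(int)

def g(P, B):
    """Transformation g: factor l applies to an object with attribute
    vector P iff P has every attribute of the factor (row B_l)."""
    return np.all(P >= B, axis=1).astype(int)

def concept(I, attrs):
    """Formal concept generated by an attribute set: the extent is the set
    of objects having every attribute in attrs, the intent is the set of
    attributes shared by the whole extent."""
    extent = np.all(I[:, attrs] == 1, axis=1)
    intent = np.all(I[extent], axis=0)
    return extent, intent

def grecond_sketch(I):
    """Greedy 'concepts on demand' exact factorization: grow an intent one
    attribute at a time, always adopting the concept that covers the most
    still-uncovered 1s of I, until every 1 is covered."""
    I = np.asarray(I)
    uncovered = I.astype(bool).copy()
    A_cols, B_rows = [], []
    while uncovered.any():
        attrs, best, best_cover = [], None, 0
        while True:
            candidate = None
            for j in range(I.shape[1]):
                if j in attrs:
                    continue
                ext, itt = concept(I, attrs + [j])
                cover = uncovered[np.ix_(ext, itt)].sum()
                if cover > best_cover:
                    best, best_cover, candidate = (ext, itt), cover, j
            if candidate is None:                  # no attribute improves coverage
                break
            attrs = list(np.flatnonzero(best[1]))  # adopt the whole closed intent
        ext, itt = best
        A_cols.append(ext.astype(int))             # extent -> column of A
        B_rows.append(itt.astype(int))             # intent -> row of B
        uncovered &= ~np.outer(ext, itt)           # concept's rectangle is covered
    return np.column_stack(A_cols), np.vstack(B_rows)
```

Because the sketch computes an exact from-below decomposition, `bool_product(A, B)` reproduces the input matrix, and applying g to a row of I followed by the Boolean product with B recovers that row.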

**GreEssQ** This algorithm [3] utilizes formal concepts of I in the same way as GreConD. The concepts are selected in a greedy manner but, contrary to GreConD, using a particular heuristic based on the information provided by certain intervals in the concept lattice of I. As with GreConD, GreEssQ may be used to solve both Problem 1 and Problem 2.

## 3 Experiments

We performed a series of experiments to evaluate the impact of the three Boolean matrix factorization methods described in Section 2.3 on classification of Boolean data when using factors as new attributes. The experiments consisted in comparing the classification accuracy of learning models created by selected machine learning (ML) algorithms from the data with the original attributes replaced by factors. The factors are computed from the input data by the three selected factorization methods. The ML algorithms used in the comparison are: the reference decision tree algorithms ID3 and C4.5 (entropy and information gain based), an instance-based learning method (Nearest Neighbor, NN), Naive Bayes learning (NB), and a multilayer perceptron neural network trained by back propagation (MLP) [7, 11]. The algorithms were borrowed and run from Weka¹, a software package that contains implementations of machine learning and data mining algorithms in Java. Weka's default parameters were used for the algorithms.

**Table 1.** Characteristics of datasets used in experiments

| Dataset | No. of attributes (binary) | No. of objects | Class distribution |
|---|---|---|---|
| breast-cancer | 9 (51) | | /81 |
| kr-vs-kp | 36 (74) | | /1527 |
| mushroom | 22 (125) | | /95 |
| vote | 16 (32) | | /108 |
| zoo | 15 (30) | | /20/5/13/4/8/10 |

The experiments were done on selected public real-world datasets from the UCI Machine Learning Repository [1]. The selected datasets are from different areas (medicine, biology, zoology, politics, games).
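The binary attribute counts in Table 1 arise from nominal scaling, which turns each categorical attribute into one Boolean attribute per value. A minimal sketch (the function and the toy data are ours, for illustration only):

```python
def nominal_scale(rows, attribute_names):
    """One-hot ('nominal') scaling: each categorical attribute becomes one
    Boolean attribute per value occurring in the data."""
    n_attrs = len(attribute_names)
    # fixed, sorted value domain for every categorical attribute
    domains = [sorted({row[a] for row in rows}) for a in range(n_attrs)]
    names = [f"{attribute_names[a]}={v}"
             for a in range(n_attrs) for v in domains[a]]
    matrix = [[int(row[a] == v)
               for a in range(n_attrs) for v in domains[a]]
              for row in rows]
    return names, matrix

# e.g. two categorical attributes yield four Boolean ones:
names, matrix = nominal_scale([("red", "s"), ("blue", "s"), ("red", "m")],
                              ("colour", "size"))
# names  -> ['colour=blue', 'colour=red', 'size=m', 'size=s']
# matrix -> [[0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]]
```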
All the datasets contain only categorical attributes with one class attribute, and the datasets were cleared of objects containing missing values. Basic characteristics of the datasets are depicted in Table 1 (note that the mushroom dataset was shrunk in the number of objects for reasons of computation time). Note that 9 (51) means 9 categorical and 51 binary attributes obtained by nominal scaling. The classification accuracy is evaluated using the 10-fold stratified cross-validation test [6], and the following results are based on averaging 10 execution runs on each dataset with randomly ordered objects.

¹ *Waikato Environment for Knowledge Analysis.*

The results are depicted in Figures 1 to 5. Each figure contains five graphs, one for each of the five ML algorithms used. The graphs show the average percentage rates of correct classifications on the preprocessed data, i.e. the data ⟨X, F, A, c⟩ described by factors instead of ⟨X, Y, I, c⟩ described by the original attributes (cf. Sections 2.2 and 2.3), for each of the three Boolean matrix factorization methods. The percentage rates of GreEssQ, GreConD, and Asso are depicted by the dashed, dot-and-dashed, and dotted lines, respectively. The x-axis corresponds to the factor decompositions obtained by the algorithms and, in particular, measures the quantity

$$\frac{|\{\langle i, j\rangle \,;\, I_{ij} = 1 \text{ and } (A \circ B)_{ij} = 0\}|}{|\{\langle i, j\rangle \,;\, I_{ij} = 1\}|},$$

i.e. the relative error w.r.t. the 1s of the input matrix I that are left uncovered in A ∘ B for the computed factorization given by A and B. The values on the x-axis range from 0.9 (corresponding to a factorization with a small number of factors that leaves 90 % of the 1s in I uncovered) to 0 (corresponding to the number of factors which decompose I exactly, i.e. I = A ∘ B). The average percentage rate of correct classification for the original data ⟨X, Y, I, c⟩ is depicted in each graph by a constant solid line. All the graphs are computed on the testing parts of the datasets used in the evaluation of classification only.

We can clearly see from the graphs, for all datasets but breast-cancer, that the best results (average percentage rates of correct classifications) for the preprocessed data are obtained, for all ML algorithms used, by the GreEssQ algorithm, outperforming both GreConD and, quite significantly, the Asso algorithm. GreConD outperforms the Asso algorithm, again for all ML algorithms used, for the kr-vs-kp and mushroom datasets, but not for the vote dataset.
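The metric (3) and the x-axis quantity above can be computed directly; a small sketch (function names are ours):

```python
import numpy as np

def error(C, D):
    """Matrix metric (3): number of entries in which C and D differ."""
    return int(np.abs(np.asarray(C) - np.asarray(D)).sum())

def uncovered_ratio(I, A, B):
    """The x-axis quantity: fraction of the 1s of I left uncovered by
    A o B (0 corresponds to an exact decomposition I = A o B)."""
    I = np.asarray(I, dtype=bool)
    covered = (np.asarray(A) @ np.asarray(B)) > 0
    return (I & ~covered).sum() / I.sum()
```

For example, if I has three 1s and the product A ∘ B leaves one of them uncovered, `uncovered_ratio` returns 1/3.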
We can also see from the graphs that sometimes the preprocessed data lead to a better classification accuracy than the original data even with a few factors covering less than % of the input data. This can be seen, for instance, for the kr-vs-kp dataset with Nearest Neighbor and MLP, or the mushroom dataset with ID3, Naive Bayes, and MLP. See [9, 10] for indications of when, i.e. for which datasets and ML algorithms, the data with the original attributes replaced by factors (computed by GreConD) covering % of the input data lead to a better classification accuracy compared to the original data.

Particularly interesting are the results for the breast-cancer dataset. As we can see, the preprocessed data with factors instead of the original attributes are classified (much) better than the original data, and this is observable for all ML algorithms used except Naive Bayes. Furthermore, the number of factors leading to the best average percentage rates of correct classifications is such that the factors cover just 40 % (which corresponds to 0.6 on the x-axis) of the input data! This indicates either many superfluous attributes or large noise in the input data that is overcome by using the factors. The GreEssQ and GreConD algorithms are of comparable performance here, both outperforming the Asso algorithm.

**Fig. 1.** Classification accuracy (correct classification, %) for the breast-cancer dataset, for (from top to bottom) ML algorithms ID3, C4.5, NN, NB and MLP.

**Fig. 2.** Classification accuracy (correct classification, %) for the kr-vs-kp dataset, for (from top to bottom) ML algorithms ID3, C4.5, NN, NB and MLP.

**Fig. 3.** Classification accuracy (correct classification, %) for the mushroom dataset, for (from top to bottom) ML algorithms ID3, C4.5, NN, NB and MLP.

**Fig. 4.** Classification accuracy (correct classification, %) for the vote dataset, for (from top to bottom) ML algorithms ID3, C4.5, NN, NB and MLP.

**Fig. 5.** Classification accuracy (correct classification, %) for the zoo dataset, for (from top to bottom) ML algorithms ID3, C4.5, NN, NB and MLP.

## 4 Conclusions

We presented an experimental study showing that when Boolean matrix factorization is used as a preprocessing technique in Boolean data classification, in the scenario proposed in [9, 10], the particular factorization algorithms impact the accuracy of classification in a significant way. For this purpose, we compared three such algorithms from the literature. In addition to further demonstrating the usefulness of Boolean factorization for classification of Boolean data, the paper emphasizes Boolean factorization as a data dimensionality reduction technique that may be utilized in a similar way as the matrix-decomposition-based methods designed for real-valued data.

An extended version of this paper will include further factorization algorithms in the experimental comparison (let us note in this respect that a technical problem with some such algorithms is that they are poorly described in the literature). Furthermore, we intend to investigate and utilize further appropriate transformation functions between the attribute and factor spaces, in particular those suitable for approximate factorizations. A comparison with other data dimensionality reduction techniques, see e.g. the references in [12], in the presented scenario is also an important topic for future research. In this respect, both the impact on the classification accuracy and the transparency of the resulting classification model are important aspects to be evaluated in such a comparison.

## References

1. Asuncion A., Newman D. J.: UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences.
2. Belohlavek R., Vychodil V.: Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Computer and System Sci. 76(1) (2010).
3. Belohlavek R., Trnecka M.: From-below approximations in Boolean matrix factorization: geometry, heuristics, and new BMF algorithm (submitted).
4. Ganter B., Wille R.: Formal Concept Analysis. Mathematical Foundations. Springer, Berlin.
5. Kim K. H.: Boolean Matrix Theory and Applications. M. Dekker.
6. Kohavi R.: A study on cross-validation and bootstrap for accuracy estimation and model selection. Proc. IJCAI 1995.
7. Mitchell T. M.: Machine Learning. McGraw-Hill.
8. Miettinen P., Mielikäinen T., Gionis A., Das G., Mannila H.: The discrete basis problem. IEEE Trans. Knowledge and Data Eng. 20(10) (2008) (preliminary version in PKDD 2006).
9. Outrata J.: Preprocessing input data for machine learning by FCA. Proc. CLA 2010, Sevilla, Spain.
10. Outrata J.: Boolean factor analysis for data preprocessing in machine learning. Proc. ICML 2010, Washington, D.C., USA.
11. Quinlan J. R.: C4.5: Programs for Machine Learning. Morgan Kaufmann.
12. Tatti N., Mielikäinen T., Gionis A., Mannila H.: What is the dimension of your binary data? Proc. ICDM 2006.


### Evaluation and Comparison of Data Mining Techniques Over Bank Direct Marketing

Evaluation and Comparison of Data Mining Techniques Over Bank Direct Marketing Niharika Sharma 1, Arvinder Kaur 2, Sheetal Gandotra 3, Dr Bhawna Sharma 4 B.E. Final Year Student, Department of Computer

### ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS Abstract D.Lavanya * Department of Computer Science, Sri Padmavathi Mahila University Tirupati, Andhra Pradesh, 517501, India lav_dlr@yahoo.com

### Decision Tree Learning on Very Large Data Sets

Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa

### 4. MATRICES Matrices

4. MATRICES 170 4. Matrices 4.1. Definitions. Definition 4.1.1. A matrix is a rectangular array of numbers. A matrix with m rows and n columns is said to have dimension m n and may be represented as follows:

### Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a

### Comparative Analysis of Classification Algorithms on Different Datasets using WEKA

Volume 54 No13, September 2012 Comparative Analysis of Classification Algorithms on Different Datasets using WEKA Rohit Arora MTech CSE Deptt Hindu College of Engineering Sonepat, Haryana, India Suman

### Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

### Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection

### A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE Kasra Madadipouya 1 1 Department of Computing and Science, Asia Pacific University of Technology & Innovation ABSTRACT Today, enormous amount of data

### 1. Classification problems

Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification

### Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

### Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.

Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.38457 Accuracy Rate of Predictive Models in Credit Screening Anirut Suebsing

### Customer Classification And Prediction Based On Data Mining Technique

Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

### Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

### Feature Subset Selection in E-mail Spam Detection

Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature

### SVM Ensemble Model for Investment Prediction

19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

### CS 6220: Data Mining Techniques Course Project Description

CS 6220: Data Mining Techniques Course Project Description College of Computer and Information Science Northeastern University Spring 2013 General Goal In this project, you will have an opportunity to

### DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

### Mining Life Insurance Data for Customer Attrition Analysis

Mining Life Insurance Data for Customer Attrition Analysis T. L. Oshini Goonetilleke Informatics Institute of Technology/Department of Computing, Colombo, Sri Lanka Email: oshini.g@iit.ac.lk H. A. Caldera

### Spam Detection System Combining Cellular Automata and Naive Bayes Classifier

Spam Detection System Combining Cellular Automata and Naive Bayes Classifier F. Barigou*, N. Barigou**, B. Atmani*** Computer Science Department, Faculty of Sciences, University of Oran BP 1524, El M Naouer,

### FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

### Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

### Benchmarking Open-Source Tree Learners in R/RWeka

Benchmarking Open-Source Tree Learners in R/RWeka Michael Schauerhuber 1, Achim Zeileis 1, David Meyer 2, Kurt Hornik 1 Department of Statistics and Mathematics 1 Institute for Management Information Systems

### An exploratory neural network model for predicting disability severity from road traffic accidents in Thailand

An exploratory neural network model for predicting disability severity from road traffic accidents in Thailand Jaratsri Rungrattanaubol 1, Anamai Na-udom 2 and Antony Harfield 1* 1 Department of Computer

### Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

### A SECURE DECISION SUPPORT ESTIMATION USING GAUSSIAN BAYES CLASSIFICATION IN HEALTH CARE SERVICES

A SECURE DECISION SUPPORT ESTIMATION USING GAUSSIAN BAYES CLASSIFICATION IN HEALTH CARE SERVICES K.M.Ruba Malini #1 and R.Lakshmi *2 # P.G.Scholar, Computer Science and Engineering, K. L. N College Of

### Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ

Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ David Cieslak and Nitesh Chawla University of Notre Dame, Notre Dame IN 46556, USA {dcieslak,nchawla}@cse.nd.edu

### Big Data Analytics CSCI 4030

High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

### First Semester Computer Science Students Academic Performances Analysis by Using Data Mining Classification Algorithms

First Semester Computer Science Students Academic Performances Analysis by Using Data Mining Classification Algorithms Azwa Abdul Aziz, Nor Hafieza IsmailandFadhilah Ahmad Faculty Informatics & Computing

### D A T A M I N I N G C L A S S I F I C A T I O N

D A T A M I N I N G C L A S S I F I C A T I O N FABRICIO VOZNIKA LEO NARDO VIA NA INTRODUCTION Nowadays there is huge amount of data being collected and stored in databases everywhere across the globe.

### EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH SANGITA GUPTA 1, SUMA. V. 2 1 Jain University, Bangalore 2 Dayanada Sagar Institute, Bangalore, India Abstract- One

### Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

### In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

### Car Insurance. Jan Tomášek Štěpán Havránek Michal Pokorný

Car Insurance Jan Tomášek Štěpán Havránek Michal Pokorný Competition details Jan Tomášek Official text As a customer shops an insurance policy, he/she will receive a number of quotes with different coverage

### Model-Based Cluster Analysis for Web Users Sessions

Model-Based Cluster Analysis for Web Users Sessions George Pallis, Lefteris Angelis, and Athena Vakali Department of Informatics, Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece gpallis@ccf.auth.gr

### The Role Mining Problem: Finding a Minimal Descriptive Set of Roles

The Role Mining Problem: Finding a Minimal Descriptive Set of Roles Jaideep Vaidya MSIS Department and CIMIC Rutgers University 8 University Ave, Newark, NJ 72 jsvaidya@rbs.rutgers.edu Vijayalakshmi Atluri

### Application of Data Mining in Medical Decision Support System

Application of Data Mining in Medical Decision Support System Habib Shariff Mahmud School of Engineering & Computing Sciences University of East London - FTMS College Technology Park Malaysia Bukit Jalil,

### Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Martin Hlosta, Rostislav Stríž, Jan Kupčík, Jaroslav Zendulka, and Tomáš Hruška A. Imbalanced Data Classification

### Real-Life Industrial Process Data Mining RISC-PROS. Activities of the RISC-PROS project

Real-Life Industrial Process Data Mining Activities of the RISC-PROS project Department of Mathematical Information Technology University of Jyväskylä Finland Postgraduate Seminar in Information Technology

### Standardization and Its Effects on K-Means Clustering Algorithm

Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03

### D-optimal plans in observational studies

D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

### Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms

Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms ESSLLI 2015 Barcelona, Spain http://ufal.mff.cuni.cz/esslli2015 Barbora Hladká hladka@ufal.mff.cuni.cz

### Predicting Students Final GPA Using Decision Trees: A Case Study

Predicting Students Final GPA Using Decision Trees: A Case Study Mashael A. Al-Barrak and Muna Al-Razgan Abstract Educational data mining is the process of applying data mining tools and techniques to

### Practical Data Mining COMP-321B. Tutorial 4: Preprocessing

Practical Data Mining COMP-321B Tutorial 4: Preprocessing Shevaun Ryan Mark Hall August 1, 2006 c 2006 University of Waikato 1 Introduction For this tutorial we will be using the Preprocess panel, the

### Content-Based Recommendation

Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches

### Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,

### Meta-Ensemble Classification Modeling for Concept Drift

, pp. 231-244 http://dx.doi.org/10.14257/ijmue.2015.10.3.22 Meta-Ensemble Classification Modeling for Concept Drift Joung Woo Ryu 1 and Jin-Hee Song 2 1 Technical Research Center, Safetia Ltd. Co., South

### Visualization of large data sets using MDS combined with LVQ.

Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland. www.phys.uni.torun.pl/kmk

### A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery

A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery Alex A. Freitas Postgraduate Program in Computer Science, Pontificia Universidade Catolica do Parana Rua Imaculada Conceicao,

### DATA MINING APPROACH FOR PREDICTING STUDENT PERFORMANCE

. Economic Review Journal of Economics and Business, Vol. X, Issue 1, May 2012 /// DATA MINING APPROACH FOR PREDICTING STUDENT PERFORMANCE Edin Osmanbegović *, Mirza Suljić ** ABSTRACT Although data mining

### A Dynamic Integration Algorithm with Ensemble of Classifiers

1 A Dynamic Integration Algorithm with Ensemble of Classifiers Seppo Puuronen 1, Vagan Terziyan 2, Alexey Tsymbal 2 1 University of Jyvaskyla, P.O.Box 35, FIN-40351 Jyvaskyla, Finland sepi@jytko.jyu.fi

### T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or