# Equational Reasoning as a Tool for Data Analysis

Austrian Journal of Statistics, Volume 31 (2002), Number 2&3

Michael Bulmer, University of Queensland, Brisbane, Australia

Abstract: A combination of deductive reasoning, clustering, and inductive learning is given as an example of a hybrid system for exploratory data analysis. Visualization is replaced by a dialogue with the data.

Keywords: Automated Deduction, Clustering, Induction.

## 1 Introduction

The aim of this paper is to show how data analysis may be enriched by the use of an automated deduction system, and simultaneously how a system for automated deduction may be extended to carry out data analysis. Of course there are many ways that this could be done and many frameworks on which it could be built. Here we will use equational reasoning and show how its discrete algebraic language may be used to analyze data with discrete and continuous variables.

In Section 2 we introduce the basics of equational reasoning using a term algebra and show how information is represented in this language and how deductions and inductive conjectures can be made. The data analysis will begin with some form of inductive learning and then use deduction to make predictions from the conjectures and existing knowledge. Section 3 describes the discretization process that is necessary for analyzing continuous data, with Section 4 giving a second method of learning information from a database of observations.

This discrete form of reasoning fits well with a simplified natural language, allowing a user to have a dialogue with the software rather than using visualization to analyze data. Example dialogues are given in Section 5. Finally, in Section 6 we summarize the potentials of this approach and detail some of its shortcomings.

## 2 Equational Reasoning

We begin by giving an introduction to the equational approach to deduction and a simple functional learning method that follows from it.
The language of our equational reasoning is that of a term algebra generated by a signature. Figure 1 gives an example signature for representing knowledge about family relationships. Here Person and Sex are sorts while father and Alice are words. Terms of the algebra are generated by making paths in the signature, which we will write in reverse order. For example, Figure 1 gives rise to the terms father Alice : Ground → Person and sex mother : Person → Sex, as well as the singleton Clare : Ground → Person. For a term f : σ → τ we call σ the domain of f, dom(f), and τ the codomain of f, cod(f). Terms whose domain is Ground are the elements of the language and we will usually write, for example, Clare ∈ Person to reflect this. Terms with domains other than Ground are functions.

Figure 1: A signature for describing family relationships

All sorts σ have the erasing term ! : σ → Ground. That is, if g ∈ σ then the term f! g is equal to f. Thus f! is a constant function on the elements of σ. For example, Male! John = Male.

Knowledge is captured by giving equations between terms in the language. For example, the ground equation father Alice = John represents the statement that Alice's father is John. Functional equations, such as father sister = father ("the father of a person's sister is their father") and sex father = Male! ("the sex of any person's father is male"), capture universal statements since they hold for any element of the domain. For example, if we believe the three equations

father sister = father, father Alice = John, sister Kate = Alice

then we may deduce

father Kate = father sister Kate = father Alice = John.

This equational deduction, which can be done mechanically using rewrite rules and the Knuth-Bendix algorithm, has a rich history in automated theorem proving (Hsiang et al., 1992). However, little work has been done on using equational reasoning for the inductive learning required by data analysis. Yet equational induction is straightforward in the functional setting. Consider again the above deduction. Suppose instead we knew

father Alice = John, sister Kate = Alice, father Kate = John.

Then we would observe that

father sister Kate = father Kate,

and so might conjecture the functional equation

father sister = father

since the two functions have the same output when applied to the singleton Kate.

In general, suppose we have a database of ground equations. If it is observed that for all singletons a ∈ dom(f) there is a singleton b ∈ cod(f) such that f a = b and g a = b, then we conjecture that f = g. This is a summative conjecture since complete information about the function values exists in the database. A weaker form of induction, allowing predictions, is to conjecture f = g if there is at least one singleton a ∈ dom(f) such that f a = b and g a = b and there is no a ∈ dom(f) such that f a = b and g a = c, where b ≠ c. For example, given the data

D = {father Alice = John, mother Alice = Clare, wife John = Clare, father Paul = Peter, wife Peter = Mary}

we could conjecture that wife father = mother. We could then predict from D that

mother Paul = wife father Paul = wife Peter = Mary.

This method has been shown to be successful in certain problem domains (Bulmer, 1996a). In particular, the bidirectional nature of equations makes this representation very effective. In the above example, if we observed mother Paul = Mary instead of wife Peter = Mary, we could use the same conjecture to predict wife Peter = Mary.

We find that D also gives rise to the erasing conjecture mother = Clare!, since the single observation suggests that mother is a constant function. Such conjectures are sometimes important, such as sex father = Male!, but can also arise in this trivial manner; as the database grows, the trivial ones will usually evaporate. However, such erasing conjectures will be the main focus of the use of subsets in Section 4. Note that in the context of D we face the problem of deciding between two predictions for mother Paul, either Mary (from wife father = mother) or Clare (from mother = Clare!).
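
The weaker form of induction above, and the prediction that follows from it, can be illustrated with a minimal Python sketch. It is not the author's implementation. The dictionary encoding of ground equations, the function names `singletons`, `evaluate`, `conjectures` and `predict`, and the restriction to compositions of at most two functions are all assumptions made for illustration. Erasing terms such as Clare! are left out for brevity.

```python
from itertools import product

# Ground equations of the data set D, encoded as (function, argument) -> value.
D = {
    ("father", "Alice"): "John",
    ("mother", "Alice"): "Clare",
    ("wife", "John"): "Clare",
    ("father", "Paul"): "Peter",
    ("wife", "Peter"): "Mary",
}


def singletons(facts):
    """All elements mentioned in the database, as arguments or as values."""
    return sorted({a for _, a in facts} | set(facts.values()))


def evaluate(term, x, facts):
    """Apply a term to x. A term is a tuple of function names written in the
    paper's order, so ("wife", "father") applies father first, then wife.
    Returns None if some step is not recorded in the database."""
    for f in reversed(term):
        x = facts.get((f, x))
        if x is None:
            return None
    return x


def conjectures(facts, max_len=2):
    """Weak induction: conjecture s = t if the two terms agree on at least one
    singleton and never disagree where both are known."""
    funcs = sorted({f for f, _ in facts})
    terms = [t for n in range(1, max_len + 1) for t in product(funcs, repeat=n)]
    found = []
    for i, s in enumerate(terms):
        for t in terms[i + 1:]:
            pairs = [(evaluate(s, a, facts), evaluate(t, a, facts))
                     for a in singletons(facts)]
            known = [(u, v) for u, v in pairs if u is not None and v is not None]
            if known and all(u == v for u, v in known):
                found.append((s, t))
    return found


def predict(facts, conjs):
    """Use each conjecture to fill in a missing value of a single function
    (a simplification: only one-function sides are completed)."""
    new = {}
    for s, t in conjs:
        for a in singletons(facts):
            u, v = evaluate(s, a, facts), evaluate(t, a, facts)
            if u is None and v is not None and len(s) == 1:
                new[(s[0], a)] = v
            elif v is None and u is not None and len(t) == 1:
                new[(t[0], a)] = u
    return new


print(conjectures(D))
print(predict(D, conjectures(D)))
```

On D this sketch finds the single conjecture mother = wife father and predicts mother Paul = Mary. Because erasing terms are not generated here, the competing conjecture mother = Clare! does not appear.
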
Such issues have been addressed by looking at consistent theories for the data (Bulmer, 1996b); the implicit requirement above, that the induction algorithm generates only conjectures that are individually consistent with the data, is there to allow such analysis of consistency amongst sets of conjectures. However, this leads to poor results for data sets that have overlapping classes or which are noisy. Error reduction methods, such as bootstrap aggregating (Breiman, 1996), can help overcome this in practice.

## 3 Data Analysis and Discretization

So far we have dealt only with discrete data. For practical data analysis we need to be able to work with continuous observations as well. The approach we will use here is to preprocess continuous variables by collapsing them into a collection of discrete groups.

This discretization has an important role in many data mining methods, and indeed is an intrinsic part of all decision trees. Methods use some criterion, such as maximizing information gain or minimizing the description length (see Fayyad and Irani, 1993), to find the best splitting of a continuous attribute with respect to a target attribute. However, here our aim is not necessarily the same as that of a classification problem. Instead we would like to discretize variables individually, without reference to any existing groups.

Thus discretization in this paper is simply a univariate clustering problem, to which we can apply many existing tools. One of the simplest and fastest is the k-means algorithm (see, for example, Berry and Linoff, 1997), and we have used this for our examples. Choosing the number of discrete groups in clustering is a difficult problem in general and in our case the variables we are discretizing may be quite homogeneous. We have taken our cue from natural language and have arbitrarily split all continuous variables into five groups, since five is an easy number of groups to give names to. For example, heights will be split into very short, short, medium, tall, very tall.

## 4 Subset Induction

Consider an alternative set of observations about the family in Figure 1:

father Alice = John, mother Alice = Clare, father Kate = John.

Here it is plausible that Kate's mother is also Clare, since her father is the same as Alice's father and so she might belong to the same class as Alice. This is precisely the setting of many classification problems. However, the simple functional induction of Section 2 fails to make this prediction since there are no functions which have the same outputs (except for the trivial conjectures father = John! and mother = Clare!). To overcome this we need to extend our language and equation processing to capture such reasoning.

Any functional equation can be used as a characteristic function to describe a subset of a sort. (The use here is related to subobjects in category theory; see Goldblatt, 1983, for details.) For example, from the equation father Alice = John we could create the subset fatherjohn from the classifier father = John!, as illustrated in Figure 2.

Figure 2: Signature for families with a subset
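
As a rough illustration of the subset idea, the Python sketch below groups singletons by a classifier such as father = John! and then looks for other functions that are constant on each group. It is a sketch under assumptions, not the paper's equation processing system. The dictionary encoding and the names `OBSERVATIONS`, `subsets`, `subset_conjectures` and `subset_predictions` are hypothetical.

```python
from collections import defaultdict

# The observations about Alice and Kate, encoded as (function, argument) -> value.
OBSERVATIONS = {
    ("father", "Alice"): "John",
    ("mother", "Alice"): "Clare",
    ("father", "Kate"): "John",
}


def subsets(facts):
    """Group singletons by each classifier f = c!, keyed as (f, c).
    The key ("father", "John") plays the role of the subset fatherjohn."""
    groups = defaultdict(set)
    for (f, a), c in facts.items():
        groups[(f, c)].add(a)
    return dict(groups)


def subset_conjectures(facts):
    """Conjecture that g is constant on a subset when every known value of g
    on that subset is the same. Keys are (g, f, c); values are the constant."""
    funcs = sorted({f for f, _ in facts})
    found = {}
    for (f, c), members in sorted(subsets(facts).items()):
        for g in funcs:
            if g == f:
                continue
            values = {facts[(g, a)] for a in members if (g, a) in facts}
            if len(values) == 1:
                found[(g, f, c)] = values.pop()
    return found


def subset_predictions(facts):
    """Fill in unknown values of g for members of a subset on which g is
    conjectured to be constant. Later conjectures overwrite earlier ones if
    they disagree; resolving such conflicts is outside this sketch."""
    groups = subsets(facts)
    preds = {}
    for (g, f, c), value in subset_conjectures(facts).items():
        for a in sorted(groups[(f, c)]):
            if (g, a) not in facts:
                preds[(g, a)] = value
    return preds


print(subset_conjectures(OBSERVATIONS))
print(subset_predictions(OBSERVATIONS))
```

On these three observations the sketch conjectures that mother is constant (Clare) on the subset of people whose father is John, and so predicts that Kate's mother is Clare.
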

The equation processing system is augmented to treat Alice and fatherjohn Alice interchangeably, a process that is similar to the handling of associative and commutative domains (Dershowitz, 1989). With subsets present, the learning algorithm can then conjecture mother fatherjohn = Clare!. From this it can make the prediction

mother Kate = mother fatherjohn Kate = Clare! Kate = Clare.

The main advantage of this system is that it fits into the broader scope of equational reasoning and induction. For example, consider the database

father Alice = John, mother Alice = Clare, wife John = Clare, father Kate = John, father Paul = Peter, mother Paul = Mary.

This database gives conjectures under both the functional induction described in Section 2 and the subset induction described here. We obtain the conjectures

{wife father = mother, mother fatherjohn = Clare!}.

From these conjectures we can predict mother Kate = Clare in two different ways, and wife Peter = Mary, and also make a range of other deductions.

## 5 Dialogues

The use of equational reasoning moves the emphasis of data analysis from visualisation to a kind of dialogue with the data. In this section we present two such dialogues, generated automatically by an implementation of the system described here.

### 5.1 Heights

The first data set is a simple one obtained from student responses to a survey used in the teaching of an introductory statistics subject (see Bulmer, 1999, for the data). Among other variables, each student gave their sex and height. After loading the data from the web page, the dialogue in Table 1 was recorded.

The reasoner begins with 89 observations of heights, which are automatically split into the 5 groups. There was no measurement for Alice, so no deductive consequences about Alice's height can be made. Once the reasoner is told Alice's height this is discretized and can be used to make deductions.
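
The discretization step can be sketched in Python as a one-dimensional k-means with five named groups. This is only an illustration under stated assumptions. The heights in `HEIGHTS` are invented and are not the survey data, the evenly spaced initial centres are a choice made here for determinism, and the names `kmeans_1d` and `discretize` are hypothetical. The group names follow the list given for heights in Section 3.

```python
# Names for the five groups, as listed for heights in Section 3.
LABELS = ["very short", "short", "medium", "tall", "very tall"]

# Hypothetical heights in cm, invented for illustration (NOT the survey data).
HEIGHTS = [150, 152, 155, 158, 160, 162, 163, 165, 166, 168,
           170, 171, 173, 175, 178, 180, 183, 185, 190, 193]


def kmeans_1d(values, k=5, max_iter=100):
    """Plain k-means (Lloyd's algorithm) on one variable; returns sorted centres."""
    data = sorted(values)
    n = len(data)
    # Assumed initialisation: k evenly spaced order statistics (deterministic).
    centres = [float(data[(2 * i + 1) * n // (2 * k)]) for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centres[j]))
            clusters[nearest].append(x)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        updated = [sum(c) / len(c) if c else centres[j]
                   for j, c in enumerate(clusters)]
        if updated == centres:
            break
        centres = updated
    return sorted(centres)


def discretize(x, centres, labels=LABELS):
    """Name the group whose centre is closest to x."""
    nearest = min(range(len(centres)), key=lambda j: abs(x - centres[j]))
    return labels[nearest]


centres = kmeans_1d(HEIGHTS)
print(centres)
print(discretize(156, centres))
```

A new observation, such as a height of 156, is mapped to the name of its nearest cluster centre, after which it can take part in equational deduction like any other discrete value. The group it lands in depends entirely on the data being clustered.
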
(Any statement made by the reasoner is a deductive truth unless it is preceded by "I think that".) When asked about sex, no split is given since its value is already discrete. However, based on observations about height and sex, the reasoner has formed two subset conjectures about sex. These can be used to predict Alice's sex.

Table 1: A dialogue about height and sex

| User | Reasoner |
| --- | --- |
| Tell me about height. | I have 89 observations about height. The values are split into very short, short, average height, tall, and very tall. |
| How tall is Alice? | I don't know. |
| Alice's height is 156. | Okay. |
| How tall is Alice? | Alice is very short. |
| Tell me about sex. | I have 89 observations about sex. I think that if a person's height is very tall then that person is male. I think that if a person's height is very short then that person is female. |
| What sex is Alice? | I think that Alice is female. |

It is worth noting here that the subsets are crisp; the induction could be extended to use fuzzy clustering and fuzzy subsets. For example, most of the people clustered in Short by the reasoner are also female, and so the reasoner could make predictions with varying degrees of certainty.

### 5.2 Irises

The second data set is Fisher's famous iris data (Fisher, 1936). The data includes measurements of the petals and sepals of 50 specimens each of 3 varieties of iris. Figure 3 gives part of the signature for representing this data.

Figure 3: Signature for iris data

Table 2 gives a dialogue with the iris data set, showing the discretization of petal length and the resulting conjectures about determining petal lengths and iris varieties. Again these can be used to make predictions about a new plant.

Table 2: A dialogue about irises

| User | Reasoner |
| --- | --- |
| Tell me about petal length. | I have 150 observations about petal length. The values are split into very short, short, medium length, long, and very long. I think that if a plant's sepal is very long then that plant's petal is very long. I think that if a plant's sepal is very wide then that plant's petal is very short. I think that if a plant's petal is very narrow then that plant's petal is very short. I think that if a plant is Setosa then that plant's petal is very short. |
| Tell me about iris. | I have 150 observations about iris. I think that if a plant's sepal length is very long then that plant is Virginica. I think that if a plant's sepal width is very wide then that plant is Setosa. I think that if a plant's petal length is very short then that plant is Setosa. I think that if a plant's petal length is short then that plant is Versicolor. I think that if a plant's petal length is very long then that plant is Virginica. I think that if a plant's petal width is very narrow then that plant is Setosa. I think that if a plant's petal width is narrow then that plant is Versicolor. I think that if a plant's petal width is very wide then that plant is Virginica. |
| The petal length of Plant151 is 5.8. | Okay. |
| What iris is Plant151? | I think that Plant151 is Virginica. |

## 6 Conclusions

The approach described here has been applied to a variety of data settings in the UCI Repository (Blake and Merz, 1998). As a classification tool it does not perform as well as directed methods such as discriminant analysis and decision tools, but it does do reasonably well on most problems, as suggested by Holte (1993). In addition, the equational framework allows the easy inclusion of background functional knowledge, and can also be used to generate abductive hypotheses.

However, the aim of this paper has been mainly to show the potential of embedding data analysis within a framework for automated deduction. As there is a broad range of approaches to both tasks, there is naturally a large number of possible systems that could be developed in this way. We have demonstrated how one method, equational reasoning, can be used to enrich the analysis of data.

## References

M.J.A. Berry and G. Linoff. Data mining techniques for marketing, sales, and customer support. John Wiley & Sons, 1997.

C.L. Blake and C.J. Merz. UCI repository of machine learning databases, mlearn/mlrepository.html, 1998.

L. Breiman. Bagging predictors. Machine Learning, 24, 1996.

M. Bulmer. Inductive equational reasoning. In Pacific Rim International Conference on Artificial Intelligence (PRICAI '96): Topics in Artificial Intelligence, number 1114 in Lecture Notes in Artificial Intelligence. Springer-Verlag, 1996a.

M. Bulmer. Inductive theories from equational systems. In D.L. Dowe, K.B. Korb, and J.J. Oliver, editors, Information, Statistics and Induction in Science (ISIS '96). World Scientific, 1996b.

M. Bulmer. Survey data from students in MS113 Elementary Statistics 1, 1999.

N. Dershowitz. Completion and its applications. In H. Aït-Kaci and M. Nivat, editors, Resolution of Equations in Algebraic Structures, volume 2. Academic Press, 1989.

U. Fayyad and K. Irani. Multi-interval discretization of continuous valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, 1993.

R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 1936.

R. Goldblatt. Topoi, the Categorical Analysis of Logic. North-Holland, 1983.

R.C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11:63–91, 1993.

J. Hsiang, H. Kirchner, P. Lescanne, and M. Rusinowitch. The term rewriting approach to automated theorem proving. Journal of Logic Programming, 14:71–99, 1992.

Author's address: Dr. Michael Bulmer, Department of Mathematics, University of Queensland, Queensland 4072, Australia

### Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.

### Data Mining. Practical Machine Learning Tools and Techniques. Classification, association, clustering, numeric prediction

Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 2 of Data Mining by I. H. Witten and E. Frank Input: Concepts, instances, attributes Terminology What s a concept? Classification,

### Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 8/05/2005 1 What is data exploration? A preliminary

### Introduction to Machine Learning Using Python. Vikram Kamath

Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression

### Data Clustering for Forecasting

Data Clustering for Forecasting James B. Orlin MIT Sloan School and OR Center Mahesh Kumar MIT OR Center Nitin Patel Visiting Professor Jonathan Woo ProfitLogic Inc. 1 Overview of Talk Overview of Clustering

### COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

COMP 5318 Data Exploration and Analysis Chapter 3 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping

### Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Data Mining: Exploring Data Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Topics Exploratory Data Analysis Summary Statistics Visualization What is data exploration?

### Decision Tree Learning on Very Large Data Sets

Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa

### Data Mining Practical Machine Learning Tools and Techniques

Data ining Practical achine Learning Tools and Techniques Slides for Chapter 2 of Data ining by I. H. Witten and E. rank Outline Terminology What s a concept Classification, association, clustering, numeric

### A K-means-like Algorithm for K-medoids Clustering and Its Performance

A K-means-like Algorithm for K-medoids Clustering and Its Performance Hae-Sang Park*, Jong-Seok Lee and Chi-Hyuck Jun Department of Industrial and Management Engineering, POSTECH San 31 Hyoja-dong, Pohang

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### High-dimensional labeled data analysis with Gabriel graphs

High-dimensional labeled data analysis with Gabriel graphs Michaël Aupetit CEA - DAM Département Analyse Surveillance Environnement BP 12-91680 - Bruyères-Le-Châtel, France Abstract. We propose the use

### Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge

### Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

### COC131 Data Mining - Clustering

COC131 Data Mining - Clustering Martin D. Sykora m.d.sykora@lboro.ac.uk Tutorial 05, Friday 20th March 2009 1. Fire up Weka (Waikako Environment for Knowledge Analysis) software, launch the explorer window

### Connecting Segments for Visual Data Exploration and Interactive Mining of Decision Rules

Journal of Universal Computer Science, vol. 11, no. 11(2005), 1835-1848 submitted: 1/9/05, accepted: 1/10/05, appeared: 28/11/05 J.UCS Connecting Segments for Visual Data Exploration and Interactive Mining

### Roulette Sampling for Cost-Sensitive Learning

Roulette Sampling for Cost-Sensitive Learning Victor S. Sheng and Charles X. Ling Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 5B7 {ssheng,cling}@csd.uwo.ca

### Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University

### Data Exploration Data Visualization

Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select

### T3: A Classification Algorithm for Data Mining

T3: A Classification Algorithm for Data Mining Christos Tjortjis and John Keane Department of Computation, UMIST, P.O. Box 88, Manchester, M60 1QD, UK {christos, jak}@co.umist.ac.uk Abstract. This paper

### A Comparison of Variable Selection Techniques for Credit Scoring

1 A Comparison of Variable Selection Techniques for Credit Scoring K. Leung and F. Cheong and C. Cheong School of Business Information Technology, RMIT University, Melbourne, Victoria, Australia E-mail:

### Standardization of Components, Products and Processes with Data Mining

B. Agard and A. Kusiak, Standardization of Components, Products and Processes with Data Mining, International Conference on Production Research Americas 2004, Santiago, Chile, August 1-4, 2004. Standardization

### A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms

A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms Ian Davidson 1st author's affiliation 1st line of address 2nd line of address Telephone number, incl country code 1st

### Study and Analysis of Data Mining Concepts

Study and Analysis of Data Mining Concepts M.Parvathi Head/Department of Computer Applications Senthamarai college of Arts and Science,Madurai,TamilNadu,India/ Dr. S.Thabasu Kannan Principal Pannai College

### Representing Reversible Cellular Automata with Reversible Block Cellular Automata

Discrete Mathematics and Theoretical Computer Science Proceedings AA (DM-CCG), 2001, 145 154 Representing Reversible Cellular Automata with Reversible Block Cellular Automata Jérôme Durand-Lose Laboratoire

### Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

Iris Sample Data Set Basic Visualization Techniques: Charts, Graphs and Maps CS598 Information Visualization Spring 2010 Many of the exploratory data techniques are illustrated with the Iris Plant data

### STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

### A Lightweight Solution to the Educational Data Mining Challenge

A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China catch0327@yahoo.com yanxing@gdut.edu.cn

### Analysis Tools and Libraries for BigData

+ Analysis Tools and Libraries for BigData Lecture 02 Abhijit Bendale + Office Hours 2 n Terry Boult (Waiting to Confirm) n Abhijit Bendale (Tue 2:45 to 4:45 pm). Best if you email me in advance, but I

### [Refer Slide Time: 05:10]

Principles of Programming Languages Prof: S. Arun Kumar Department of Computer Science and Engineering Indian Institute of Technology Delhi Lecture no 7 Lecture Title: Syntactic Classes Welcome to lecture

### Model Combination. 24 Novembre 2009

Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

### Visualizing class probability estimators

Visualizing class probability estimators Eibe Frank and Mark Hall Department of Computer Science University of Waikato Hamilton, New Zealand {eibe, mhall}@cs.waikato.ac.nz Abstract. Inducing classifiers

### Jieh Hsiang Department of Computer Science State University of New York Stony brook, NY 11794

ASSOCIATIVE-COMMUTATIVE REWRITING* Nachum Dershowitz University of Illinois Urbana, IL 61801 N. Alan Josephson University of Illinois Urbana, IL 01801 Jieh Hsiang State University of New York Stony brook,

### A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer

### Lecture 3. Mathematical Induction

Lecture 3 Mathematical Induction Induction is a fundamental reasoning process in which general conclusion is based on particular cases It contrasts with deduction, the reasoning process in which conclusion

### ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS

ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS Michael Affenzeller (a), Stephan M. Winkler (b), Stefan Forstenlechner (c), Gabriel Kronberger (d), Michael Kommenda (e), Stefan

### Data Mining Applications in Higher Education

Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

### ISSUES IN RULE BASED KNOWLEDGE DISCOVERING PROCESS

Advances and Applications in Statistical Sciences Proceedings of The IV Meeting on Dynamics of Social and Economic Systems Volume 2, Issue 2, 2010, Pages 303-314 2010 Mili Publications ISSUES IN RULE BASED

### International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

### FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

### DATA ANALYTICS USING R

DATA ANALYTICS USING R Duration: 90 Hours Intended audience and scope: The course is targeted at fresh engineers, practicing engineers and scientists who are interested in learning and understanding data

### COMPARISON OF DISCRETIZATION METHODS FOR PREPROCESSING DATA FOR PYRAMIDAL GROWING NETWORK CLASSIFICATION METHOD

COMPARISON OF DISCRETIZATION METHODS FOR PREPROCESSING DATA FOR PYRAMIDAL GROWING NETWORK CLASSIFICATION METHOD Ilia Mitov 1, Krassimira Ivanova 1, Krassimir Markov 1, Vitalii Velychko 2, Peter Stanchev

### Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

### Building an Iris Plant Data Classifier Using Neural Network Associative Classification

Building an Iris Plant Data Classifier Using Neural Network Associative Classification Ms.Prachitee Shekhawat 1, Prof. Sheetal S. Dhande 2 1,2 Sipna s College of Engineering and Technology, Amravati, Maharashtra,

### Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

Data Mining for Customer Service Support Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin Traditional Hotline Services Problem Traditional Customer Service Support (manufacturing)

### CHAPTER 3 DATA MINING AND CLUSTERING

CHAPTER 3 DATA MINING AND CLUSTERING 3.1 Introduction Nowadays, large quantities of data are being accumulated. The amount of data collected is said to be almost doubled every 9 months. Seeking knowledge

### Model Trees for Classification of Hybrid Data Types

Model Trees for Classification of Hybrid Data Types Hsing-Kuo Pao, Shou-Chih Chang, and Yuh-Jye Lee Dept. of Computer Science & Information Engineering, National Taiwan University of Science & Technology,

### CHI-Squared Test of Independence

CHI-Squared Test of Independence Minhaz Fahim Zibran Department of Computer Science University of Calgary, Alberta, Canada. Email: mfzibran@ucalgary.ca Abstract Chi-square (X 2 ) test is a nonparametric

### Fig. 1 A typical Knowledge Discovery process [2]

Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering

### Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control

Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control Andre BERGMANN Salzgitter Mannesmann Forschung GmbH; Duisburg, Germany Phone: +49 203 9993154, Fax: +49 203 9993234;

### Action Reducts. Computer Science Department, University of Pittsburgh at Johnstown, Johnstown, PA 15904, USA 2

Action Reducts Seunghyun Im 1, Zbigniew Ras 2,3, Li-Shiang Tsay 4 1 Computer Science Department, University of Pittsburgh at Johnstown, Johnstown, PA 15904, USA sim@pitt.edu 2 Computer Science Department,

### Discretization and grouping: preprocessing steps for Data Mining

Discretization and grouping: preprocessing steps for Data Mining PetrBerka 1 andivanbruha 2 1 LaboratoryofIntelligentSystems Prague University of Economic W. Churchill Sq. 4, Prague CZ 13067, Czech Republic

### Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

### BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

### International Journal of Electronics and Computer Science Engineering 1449

International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and

### An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang

### Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing

### Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,

### Healthcare Measurement Analysis Using Data mining Techniques

www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 03 Issue 07 July, 2014 Page No. 7058-7064 Healthcare Measurement Analysis Using Data mining Techniques 1 Dr.A.Shaik

### Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1

Data Mining 1 Introduction 2 Data Mining methods Alfred Holl Data Mining 1 1 Introduction 1.1 Motivation 1.2 Goals and problems 1.3 Definitions 1.4 Roots 1.5 Data Mining process 1.6 Epistemological constraints

### Mining Educational Data to Improve Students Performance: A Case Study

Mining Educational Data to Improve Students Performance: A Case Study Mohammed M. Abu Tair, Alaa M. El-Halees Faculty of Information Technology Islamic University of Gaza Gaza, Palestine ABSTRACT Educational

### Rule based Classification of BSE Stock Data with Data Mining

International Journal of Information Sciences and Application. ISSN 0974-2255 Volume 4, Number 1 (2012), pp. 1-9 International Research Publication House http://www.irphouse.com Rule based Classification

### Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Jing Wang Computer Science Department, The University of Iowa jing-wang-1@uiowa.edu W. Nick Street Management Sciences Department,

### Triangular Visualization

Triangular Visualization Tomasz Maszczyk and Włodzisław Duch Department of Informatics, Nicolaus Copernicus University, Toruń, Poland tmaszczyk@is.umk.pl;google:w.duch http://www.is.umk.pl Abstract.

### SEQUENCES, MATHEMATICAL INDUCTION, AND RECURSION

CHAPTER 5 SEQUENCES, MATHEMATICAL INDUCTION, AND RECURSION Alessandro Artale UniBZ - http://www.inf.unibz.it/ artale/ SECTION 5.2 Mathematical Induction I Copyright Cengage Learning. All rights reserved.

### On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

### Classification and Regression Trees

Classification and Regression Trees Bob Stine Dept of Statistics, School University of Pennsylvania Trees Familiar metaphor Biology Decision tree Medical diagnosis Org chart Properties Recursive, partitioning

### CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

CS Master Level Courses and Areas The graduate courses offered may change over time, in response to new developments in computer science and the interests of faculty and students; the list of graduate

### NEURAL NETWORKS IN DATA MINING

NEURAL NETWORKS IN DATA MINING 1 DR. YASHPAL SINGH, 2 ALOK SINGH CHAUHAN 1 Reader, Bundelkhand Institute of Engineering & Technology, Jhansi, India 2 Lecturer, United Institute of Management, Allahabad,

### Data Mining Part 5. Prediction

Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

### Data mining knowledge representation

Data mining knowledge representation 1 What Defines a Data Mining Task? Task relevant data: where and how to retrieve the data to be used for mining Background knowledge: Concept hierarchies Interestingness

### Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analysis What is Cluster Analysis? K-Means Clustering Hierarchical Clustering Cluster Validity Example: Microarray data analysis 6 Summary

### Data Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Data Exploration and Preprocessing Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

### International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET

DATA MINING TECHNIQUES AND STOCK MARKET Mr. Rahul Thakkar, Lecturer and HOD, Naran Lala College of Professional & Applied Sciences, Navsari ABSTRACT Without trading in a stock market we can't understand

### Identification of User Patterns in Social Networks by Data Mining Techniques: Facebook Case

Identification of User Patterns in Social Networks by Data Mining Techniques: Facebook Case A. Selman Bozkır 1, S. Güzin Mazman 2, and Ebru Akçapınar Sezer 1 1 Hacettepe University, Department of Computer

### Incorporating Evidence in Bayesian networks with the Select Operator

Incorporating Evidence in Bayesian networks with the Select Operator C.J. Butz and F. Fang Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 {butz, fang11fa}@cs.uregina.ca

### Egyptian fraction representations of 1 with odd denominators

Egyptian fraction representations of 1 with odd denominators P. Shiu Department of Mathematical Sciences Loughborough University Leicestershire LE 3TU United Kingdom Email: P.Shiu@lboro.ac.uk Abstract The

### STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

### Explanation-Oriented Association Mining Using a Combination of Unsupervised and Supervised Learning Algorithms

Explanation-Oriented Association Mining Using a Combination of Unsupervised and Supervised Learning Algorithms Y.Y. Yao, Y. Zhao, R.B. Maguire Department of Computer Science, University of Regina Regina,

### DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM

INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM M. Mayilvaganan 1, S. Aparna 2 1 Associate

### New Axioms for Rigorous Bayesian Probability

Bayesian Analysis (2009) 4, Number 3, pp. 599-606 New Axioms for Rigorous Bayesian Probability Maurice J. Dupré and Frank J. Tipler Abstract. By basing Bayesian probability theory on five axioms, we can

### SPATIAL DATA CLASSIFICATION AND DATA MINING

, pp.-40-44. Available online at http://www.bioinfo.in/contents.php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

### Data Mining Solutions for the Business Environment

Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania ruxandra_stefania.petre@yahoo.com Over

### Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD Presenter Proven Performance Since 1995 TDWI helps business and IT professionals gain insight about data warehousing,

### Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt's (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

### ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical

### Decision Support System on Prediction of Heart Disease Using Data Mining Techniques

International Journal of Engineering Research and General Science Volume 3, Issue, March-April, 015 ISSN 091-730 Decision Support System on Prediction of Heart Disease Using Data Mining Techniques Ms.

### Missing is Useful : Missing Values in Cost-sensitive Decision Trees 1

Missing is Useful : Missing Values in Cost-sensitive Decision Trees 1 Shichao Zhang, senior Member, IEEE Zhenxing Qin, Charles X. Ling, Shengli Sheng Abstract. Many real-world datasets for machine learning

### Data Mining Practical Machine Learning Tools and Techniques

Some Core Learning Representations Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter of Data Mining by I. H. Witten and E. Frank Decision trees Learning Rules Association rules

Computing and Informatics, Vol. 20, 2001,????, V 2006-Nov-6 UPDATES OF LOGIC PROGRAMS Ján Šefránek Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University,

### Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

### Key words. cluster analysis, k-means, eigen decomposition, Laplacian matrix, data visualization, Fisher's Iris data set

SPECTRAL CLUSTERING AND VISUALIZATION: A NOVEL CLUSTERING OF FISHER'S IRIS DATA SET DAVID BENSON-PUTNINS, MARGARET BONFARDIN, MEAGAN E. MAGNONI, AND DANIEL MARTIN Advisors: Carl D. Meyer and Charles D.

### Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

### Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.

Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.38457 Accuracy Rate of Predictive Models in Credit Screening Anirut Suebsing

### White Paper. Data Mining for Business

White Paper Data Mining for Business January 2010 Contents 1. INTRODUCTION... 3 2. WHY IS DATA MINING IMPORTANT?... 3 FUNDAMENTALS... 3 Example 1...3 Example 2...3 3. OPERATIONAL CONSIDERATIONS... 4 ORGANISATIONAL

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unsupervised Identification of right

### WEB PAGE CATEGORISATION BASED ON NEURONS

WEB PAGE CATEGORISATION BASED ON NEURONS Shikha Batra Abstract: Contemporary web is comprised of trillions of pages and everyday tremendous amount of requests are made to put more web pages on the WWW.