Discretization and grouping: preprocessing steps for Data Mining
Petr Berka¹ and Ivan Bruha²

¹ Laboratory of Intelligent Systems, Prague University of Economics, W. Churchill Sq. 4, Prague CZ 13067, Czech Republic, [email protected]
² Department of Computer Science and Systems, McMaster University, Hamilton, Ont., Canada L8S 4K1, [email protected]

Abstract. Unlike the on-line discretization performed by a number of machine learning (ML) algorithms for building decision trees or decision rules, we propose off-line algorithms for discretizing numerical attributes and grouping values of nominal attributes. The number of resulting intervals obtained by discretization depends only on the data; the number of groups corresponds to the number of classes. Since both discretization and grouping are done with respect to the goal classes, the algorithms are suitable only for classification/prediction tasks. As a side effect of the off-line processing, the number of objects in the datasets and the number of attributes may be reduced. It should also be mentioned that although the original idea of the discretization procedure was proposed for the Kex system, the algorithms show good performance together with other machine learning algorithms.
1 Introduction

The Knowledge Discovery in Databases (KDD) process can involve significant iteration and may contain loops among data selection, data preprocessing, data transformation, data mining, and interpretation of mined patterns. The most complex steps in this process are data preprocessing and data transformation. The result of these steps should be data in a form suitable for the data mining algorithms used¹. Thus, the necessary preprocessing and transformation operations are e.g. merging data to create a single data matrix from a number of related data tables, creating new attributes, excluding unnecessary attributes, discretization, or grouping. We present here algorithms for the last two mentioned operations: off-line discretization of numerical attributes and off-line grouping of values of nominal attributes. Such transformations can help in better understanding the data that enter the data mining step. Both algorithms are class-sensitive, so they can be used only for classification/prediction tasks. The described algorithms were developed within the Kex framework [2] but can be used as a preprocessing tool for other ML algorithms as well.

2 Discretization

The task of discretization of numerical variables is well known to statisticians. Different approaches are used, such as discretization into a given number of categories using equidistant cutpoints, or discretization based on mean and standard deviation. All these approaches are class-blind, since they do not take into account that objects belong to different classes. Hence they can be used if the data mining algorithms perform unsupervised learning². If the task is classification/prediction, the information about the class label can be used during discretization. In the TDIDT family, the algorithms for discretization are based mostly on binarization within a subset of training data created during tree generation [6]. The well-known algorithm by Fayyad and Irani extended this idea to multi-interval discretization [8].
KnowledgeSeeker, a commercial system of the TDIDT family, looks for the best k-way split of each variable using the F statistic for numerical attributes and the χ² statistic for nominal attributes; so the system performs both discretization and grouping [4]. Discretization in the set covering algorithm CN4 (a large extension of CN2) is based on entropy or the Laplacian estimate [5]. All these systems discretize numerical attributes on-line, during learning. The algorithms described in this paper preprocess the data off-line, before starting the machine learning step.

¹ If we insist on interpretability of the discovered knowledge, we should prefer symbolic machine learning algorithms.
² This is the case when the data mining task is to find some interesting groups of objects with similar characteristics or to find associations.
3 Discretization and grouping for KEX

3.1 A survey of the system

Kex performs symbolic empirical multiple concept learning from examples, where the induced concept description is represented as weighted decision rules in the form

    Ant ⇒ C (weight)

where Ant is a combination (conjunction) of attribute-value pairs, C is a single category (class), and the weight from the interval [0, 1] expresses the uncertainty of the rule. Kex [2] works in an iterative way, in each iteration testing and expanding an implication Ant ⇒ C. This process starts with an empty rule weighted with the relative frequency of the class C and stops after testing all implications created according to the user-defined criteria. During testing, the validity (conditional probability P(C|Ant)) of an implication is computed³. If this validity significantly differs from the composed weight (the value obtained when composing the weights of all subrules of the implication Ant ⇒ C), then this implication is added to the knowledge base. The weight of this new rule is computed from the validity and from the composed weight using the inverse composing function. For composing weights we use a pseudobayesian (Prospector-like) combination function

    x ⊕ y = xy / (xy + (1 − x)(1 − y)).

During expanding, new implications are created by adding single categories to Ant.

3.2 Algorithm for discretization

We treat each numerical attribute separately. The basic idea is to create intervals for which the aposteriori distribution of classes P(C|interval) significantly differs from the apriori distribution of classes P(C) in the whole training data. This can be achieved by simply merging such values for which most objects belong to the same class. Within the Kex knowledge acquisition approach, this will lead to rules of the form interval ⇒ C, but this approach can be used for other learning algorithms, too.
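For concreteness, the composing function ⊕ can be written out directly. The following is a minimal Python sketch; the function names are ours, not taken from the Kex implementation:

```python
from functools import reduce

def combine(x, y):
    """Pseudobayesian (Prospector-like) combination of two rule weights:
    x (+) y = x*y / (x*y + (1 - x)*(1 - y))."""
    return x * y / (x * y + (1 - x) * (1 - y))

def composed_weight(weights):
    """Compose the weights of all subrules of an implication."""
    return reduce(combine, weights)
```

Note that a weight of 0.5 is neutral for this function (combine(0.5, w) == w), which is why a rule whose validity does not differ from 0.5 carries no information of its own.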
The number of resulting intervals is controlled by giving a threshold for the minimal number of objects within one interval; less frequent intervals are labeled UNKNOWN in step 3.1 (the algorithm is shown in Figure 1).

³ The validity P(C|Ant) is computed from the data as n(Ant ∧ C) / n(Ant).
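The per-value labelling step (procedure ASSIGN in Figure 1) can be sketched as follows. This is an illustrative Python rendition, not the Kex code: the χ² test is replaced by the simpler relative-frequency criterion, and the threshold name `min_ratio` is our own assumption:

```python
from collections import Counter

def assign_labels(values, classes, min_ratio=0.8):
    """ASSIGN: give each distinct attribute value a class label.

    `values` and `classes` are parallel lists: the attribute value and the
    goal class of each training object.  A value gets the label of its
    majority class when that class clearly dominates (relative-frequency
    criterion standing in for the chi-squared test); otherwise 'UNKNOWN'.
    """
    per_value = {}
    for v, c in zip(values, classes):
        per_value.setdefault(v, Counter())[c] += 1
    labels = {}
    for v, counts in per_value.items():
        cls, freq = counts.most_common(1)[0]
        total = sum(counts.values())
        if freq == total:                 # all objects in the same class
            labels[v] = cls
        elif freq / total >= min_ratio:   # distribution differs enough
            labels[v] = cls
        else:
            labels[v] = 'UNKNOWN'
    return labels
```

The INTERVAL step then merges runs of equally-labelled values, absorbs UNKNOWN intervals into their neighbours, and finally places cutpoints halfway between adjacent intervals.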
MAIN LOOP:
1 create an ordered list of values of the attribute;
2 for each value do
  2.1 compute frequencies of occurrence of objects with respect to each class;
  2.2 assign the class label to every value using procedure ASSIGN;
  enddo
3 create the intervals from values using procedure INTERVAL;

ASSIGN:
if for the given value all objects belong to the same class, then assign the value to that class
else if for the given value the distribution of objects with respect to class membership significantly differs (according to the χ² or relative frequency criterion) from the frequencies of the goal classes, then assign that value to the most frequent class
else assign the value to the class UNKNOWN;

INTERVAL:
3.1 if a sequence of values belongs to the same class, then create the interval INT_i = [LBound_i, UBound_i] from these values;
3.2 if the interval INT_i belongs to the class UNKNOWN, then
    if its neighbouring intervals INT_{i−1}, INT_{i+1} belong to the same class, then create the interval by joining INT_{i−1} ∪ INT_i ∪ INT_{i+1}
    else create the interval either by joining INT_{i−1} ∪ INT_i or by joining INT_i ∪ INT_{i+1} according to the given criterion (for the χ² test, join the intervals according to the higher value of χ²; for the frequency criterion, join the intervals according to the higher relative frequency of the majority class);
3.3 create a continuous coverage of the attribute by assigning LBound_i := (LBound_i + UBound_{i−1})/2 and UBound_{i−1} := LBound_i;

Fig. 1. Algorithm for discretization

3.3 Algorithm for grouping

Grouping the values of a nominal attribute becomes important if the number of these values is too large (e.g. hundreds of zip codes or profession codes). Dealing with each value separately can bring problems both during computation (e.g. branching a node) and interpretation. The grouping algorithm for Kex is based on the same idea as the discretization described above. Again, we create groups of values such that the aposteriori distribution P(C|group) significantly differs from the apriori distribution P(C).
The main difference from the discretization algorithm is that we create a single group for each value of the class attribute and an additional group labeled UNKNOWN; so the number of groups is fixed. The grouping algorithm is shown in Figure 2.

MAIN LOOP:
1 for each value do
  1.1 compute frequencies of occurrence of objects with respect to each class;
  1.2 assign the class label to every value using procedure ASSIGN;
  enddo
2 create the groups from values using procedure GROUP;

ASSIGN:
if for the given value all objects belong to the same class, then assign the value to that class
else if for the given value the distribution of objects with respect to class membership significantly differs (according to the χ² or relative frequency criterion) from the frequencies of the goal classes, then assign that value to the most frequent class
else assign the value to the class UNKNOWN;

GROUP:
create groups for values with the same class label;

Fig. 2. Algorithm for grouping

3.4 Evaluation of the algorithms

During discretization/grouping of an attribute, we can lose some information hidden in the data. We can measure this loss by the number of contradictions before and after the preprocessing. A contradiction here means a situation when objects described by the same attribute values belong to different classes. Any learning algorithm will classify such objects as belonging to the same (usually the majority) class, and the objects belonging to other classes will be classified wrongly. We can count such errors and thus estimate the maximal possible accuracy as

    1 − no. of errors / no. of objects.

As a side effect of the off-line processing, the number of objects in the datasets and the number of attributes may be reduced. This reduction can give us some information about regular patterns in the data. Another useful piece of information is the number of intervals or groups. If the preprocessing results in a single interval/group, or in one interval/group with very high frequency in the data, we can ignore the corresponding attribute in the subsequent machine learning step.

4 Empirical results

We have tested our algorithm on different datasets taken from the UCI Machine Learning Repository [10]. The results are summarized in Table 1. This table gives for the used datasets⁴ the number of (different) objects and the maximal possible accuracy, both before and after discretization.
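The maximal possible accuracy reported below can be computed directly from the data. The following is a small sketch (a helper of our own, assuming objects are given as attribute-value tuples plus a class label):

```python
from collections import Counter

def max_possible_accuracy(rows, classes):
    """Estimate the maximal accuracy reachable on this data.

    Objects with identical attribute values (`rows`) but different classes
    are contradictions; at best a classifier gets the majority class of
    each such group right, so accuracy <= 1 - errors / no. of objects.
    """
    groups = {}
    for row, c in zip(rows, classes):
        groups.setdefault(tuple(row), Counter())[c] += 1
    errors = sum(sum(cnt.values()) - cnt.most_common(1)[0][1]
                 for cnt in groups.values())
    return 1 - errors / len(rows)
```

Running it before and after discretization/grouping measures the information lost by the preprocessing, as described above.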
We also present the results of classification accuracy for different systems, to show that there is no significant

⁴ Credit stands for the Japanese credit data, iris for the Iris data, diab for the Pima Indian diabetes data, and aust for the Australian credit data.
difference between Kex (the system that motivated the discretization algorithm) and the other systems (CN4 ordered mode, CN4 unordered mode, and C4.5). Some values in the tables were lost in transcription and are marked "–".

data            accuracy  # diff.   classification accuracy
                in data   objects   Kex    CN4 ord.  CN4 unord.  C4.5
credit (orig.)  100%      –         –      –         81%         83%
credit (disc.)  100%      –         –      95%       88%         76%
iris (orig.)    100%      –         –      –         97%         98%
iris (disc.)    98%       35        95%    98%       95%         96%
diab. (orig.)   100%      –         –      –         79%         90%
diab. (disc.)   95%       –         –      84%       81%         75%
aust. (orig.)   100%      –         –      –         87%         91%
aust. (disc.)   99%       –         –      92%       89%         89%

Table 1. Discretization of some UCI data

We also used our discretization/grouping algorithm to preprocess more complex data [3]. The KDD Sisyphus data is an excerpt from the Swiss Life data warehouse system MASY [9]. These data consist of several relations describing relationships between Swiss Life partners, insurance contracts, and components of insurance tariffs. The goal of KDD on these data is to find a concept description for a given class (TaskA, TaskB). After creating a single data matrix for each task, we ran the discretization and grouping algorithms on the TaskA and TaskB data. The results (in terms of maximal possible accuracy before and after this preprocessing) are shown in Table 2. Discretization and grouping also helped us to find unimportant attributes; the lines TaskX (disc.+red.) in the table show the accuracy and number of objects after removing such attributes.

data                accuracy in data  # diff. objects  # attributes
TaskA (orig.)       100%              –                –
TaskA (disc.)       98%               –                –
TaskA (disc.+red.)  98%               –                –
TaskB (orig.)       100%              –                –
TaskB (disc.)       100%              –                –
TaskB (disc.+red.)  100%              –                –

Table 2. Discretization and grouping of KDD Sisyphus data
5 Conclusion

Unlike the on-line discretization performed by a number of ML algorithms during building decision trees or decision rules, we propose off-line algorithms for discretizing numerical attributes and grouping values of nominal attributes. The number of resulting intervals obtained during discretization depends only on the data; the number of groups corresponds to the number of classes. Since both discretization and grouping are done with respect to the goal classes, the algorithms are suitable only for classification/prediction tasks. The obtained intervals/groups can be interpreted as potential knowledge. The results of the off-line discretization and grouping can also help during further preprocessing of the data. If we obtain a single interval, we can conclude that the discretized attribute is irrelevant for the classification task. We can draw the same conclusion if, after discretization or grouping, almost all objects fall into one interval/group. It should also be mentioned that although the original idea of the discretization procedure is related to Kex, the algorithms show good performance together with other machine learning algorithms.

References

1. Berka, P., Bruha, I.: Various discretization procedures of numerical attributes: Empirical comparisons. In: Kodratoff, Nakhaeizadeh, Taylor (eds.) Proc. MLNet Familiarization Workshop on Statistics, Machine Learning and Knowledge Discovery in Databases, Herakleion, 1995.
2. Berka, P., Ivánek, J.: Automated Knowledge Acquisition for PROSPECTOR-like Expert Systems. In: Bergadano, de Raedt (eds.) Proc. ECML'94, Springer.
3. Berka, P., Sochorová, M., Rauch, J.: Using GUHA and KEX for Knowledge Discovery in Databases; the KDD Sisyphus Experience. In: Proc. Poster Session ECML'98, TU Chemnitz.
4. Biggs, D., de Ville, B., Suen, E.: A method of choosing multiway partitions for classification and decision trees. Journal of Applied Statistics, Vol. 18, No. 1, 1991.
5. Bruha, I., Kočková, S.: A covering learning algorithm for cost-sensitive and noisy environments. In: Proc. ECML'93 Workshop on Learning Robots.
6. Catlett, J.: On changing continuous attributes into ordered discrete attributes. In: Kodratoff, Y. (ed.) Machine Learning: EWSL-91, Springer-Verlag, 1991.
7. Clark, P., Niblett, T.: The CN2 induction algorithm. Machine Learning, 3 (1989).
8. Fayyad, U.M., Irani, K.B.: Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In: Proc. IJCAI'93.
9. Kietz, J.U., Reimer, U., Staudt, M.: Mining Insurance Data at Swiss Life. In: Proc. 23rd VLDB Conference, Athens.
10. Merz, C.J., Murphy, P.M.: UCI Repository of Machine Learning Databases. Irvine, University of California, Dept. of Information and Computer Science.

This article was processed using the LaTeX macro package with LLNCS style.
Explanation-Oriented Association Mining Using a Combination of Unsupervised and Supervised Learning Algorithms Y.Y. Yao, Y. Zhao, R.B. Maguire Department of Computer Science, University of Regina Regina,
Decision Trees. JERZY STEFANOWSKI Institute of Computing Science Poznań University of Technology. Doctoral School, Catania-Troina, April, 2008
Decision Trees JERZY STEFANOWSKI Institute of Computing Science Poznań University of Technology Doctoral School, Catania-Troina, April, 2008 Aims of this module The decision tree representation. The basic
Distributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
Chapter 20: Data Analysis
Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification
Efficient Integration of Data Mining Techniques in Database Management Systems
Efficient Integration of Data Mining Techniques in Database Management Systems Fadila Bentayeb Jérôme Darmont Cédric Udréa ERIC, University of Lyon 2 5 avenue Pierre Mendès-France 69676 Bron Cedex France
Université de Montpellier 2 Hugo Alatrista-Salas : [email protected]
Université de Montpellier 2 Hugo Alatrista-Salas : [email protected] WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection
Unit 4 DECISION ANALYSIS. Lesson 37. Decision Theory and Decision Trees. Learning objectives:
Unit 4 DECISION ANALYSIS Lesson 37 Learning objectives: To learn how to use decision trees. To structure complex decision making problems. To analyze the above problems. To find out limitations & advantages
2 Decision tree + Cross-validation with R (package rpart)
1 Subject Using cross-validation for the performance evaluation of decision trees with R, KNIME and RAPIDMINER. This paper takes one of our old study on the implementation of cross-validation for assessing
Data Mining in the Application of Criminal Cases Based on Decision Tree
8 Journal of Computer Science and Information Technology, Vol. 1 No. 2, December 2013 Data Mining in the Application of Criminal Cases Based on Decision Tree Ruijuan Hu 1 Abstract A briefing on data mining
How To Use Neural Networks In Data Mining
International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and
IMPROVISATION OF STUDYING COMPUTER BY CLUSTER STRATEGIES
INTERNATIONAL JOURNAL OF ADVANCED RESEARCH IN ENGINEERING AND SCIENCE IMPROVISATION OF STUDYING COMPUTER BY CLUSTER STRATEGIES C.Priyanka 1, T.Giri Babu 2 1 M.Tech Student, Dept of CSE, Malla Reddy Engineering
Specific Usage of Visual Data Analysis Techniques
Specific Usage of Visual Data Analysis Techniques Snezana Savoska 1 and Suzana Loskovska 2 1 Faculty of Administration and Management of Information systems, Partizanska bb, 7000, Bitola, Republic of Macedonia
A Review of Data Mining Techniques
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
Performance Analysis of Decision Trees
Performance Analysis of Decision Trees Manpreet Singh Department of Information Technology, Guru Nanak Dev Engineering College, Ludhiana, Punjab, India Sonam Sharma CBS Group of Institutions, New Delhi,India
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
COMBINED METHODOLOGY of the CLASSIFICATION RULES for MEDICAL DATA-SETS
COMBINED METHODOLOGY of the CLASSIFICATION RULES for MEDICAL DATA-SETS V.Sneha Latha#, P.Y.L.Swetha#, M.Bhavya#, G. Geetha#, D. K.Suhasini# # Dept. of Computer Science& Engineering K.L.C.E, GreenFields-522502,
An Analysis of Four Missing Data Treatment Methods for Supervised Learning
An Analysis of Four Missing Data Treatment Methods for Supervised Learning Gustavo E. A. P. A. Batista and Maria Carolina Monard University of São Paulo - USP Institute of Mathematics and Computer Science
