Practical Applications of Data Mining
Sang C. Suh
Texas A&M University-Commerce
Jones & Bartlett Learning
Contents
Preface xi
Foreword by Murat M. Tanik xvii
Foreword by John Kocur xix
Chapter 1 Introduction to Data Mining 1
  1.1 Traditional Database Management Systems 1
  1.2 Knowledge Discovery in Databases 3
    1.2.1 Pre-Processing 5
    1.2.2 Data Warehousing 6
    1.2.3 Post-Processing 6
  1.3 Data-Mining Methods 6
    1.3.1 Association Rules 7
    1.3.2 Classification Learning 8
    1.3.3 Statistical Data Mining 10
    1.3.4 Rough Sets for Data Mining 11
    1.3.5 Neural Networks for Data Mining 12
    1.3.6 Clustering for Data Mining 14
    1.3.7 Fuzzy Sets for Data Mining 16
  1.4 Integrated Framework for Intelligent Databases 17
  1.5 Practical Applications of Data Mining 20
    1.5.1 Healthcare Services 20
    1.5.2 Banking 22
    1.5.3 Supermarket Applications 23
    1.5.4 Medical Image Classification 25
  1.6 Chapter Summary 27
Chapter 2 Association Rules 29
  2.1 Introduction 29
  2.2 Mining of Association Rules in Market Basket Data 29
    2.2.1 Apriori Algorithm 30
    2.2.2 Apriori-gen( ) Function 32
    2.2.3 Apriori Example 32
    2.2.4 AprioriTid Algorithm 33
  2.3 Attribute-Oriented Rule Generalization 35
    2.3.1 Concept Hierarchies 36
    2.3.2 Basic Strategies for Attribute-Oriented Induction 38
    2.3.3 Basic Attribute-Oriented Induction Algorithm 42
    2.3.4 Generation of Discrimination Rules through Attribute-Oriented Induction 43
  2.4 Association Rules in Hypertext Databases 46
    2.4.1 Formal Model 47
    2.4.2 Algorithms for Generating Composite Association Rules 49
  2.5 Quantitative Association Rules 53
    2.5.1 Mapping of Quantitative Association Rules 53
    2.5.2 Problem Decomposition 55
    2.5.3 Partitioning of Quantitative Attributes 56
  2.6 Mining of Compact Rules 59
    2.6.1 Semantic Association Relationships 59
    2.6.2 Generalization Algorithm 60
    2.6.3 Learning Process 61
    2.6.4 Learning Algorithm 63
  2.7 Mining of Time-Constrained Association Rules 67
    2.7.1 Time-Constrained Association Rules 67
    2.7.2 Properties of Time Constraints 69
    2.7.3 Potential Applications 70
  2.8 Chapter Summary 70
  2.9 Exercises 71
  2.10 Selected Bibliographic Notes 74
  2.11 Chapter Bibliography 75
Chapter 3 Classification Learning 79
  3.1 Introduction 79
  3.2 Knowledge Representation 81
    3.2.1 Classification Rules 81
    3.2.2 Decision Trees 81
  3.3 Separate-and-Conquer Approach 82
    3.3.1 Prism 83
    3.3.2 Induct 86
    3.3.3 REP, IREP, RIPPER 97
  3.4 Divide-and-Conquer Approach 99
    3.4.1 ID3 100
    3.4.2 C4.5 and C5.0 106
  3.5 Partial Decision Tree 123
  3.6 Chapter Summary 129
  3.7 Exercises 129
  3.8 Selected Bibliographic Notes 137
  3.9 Chapter Bibliography 138
Chapter 4 Statistics for Data Mining 143
  4.1 Introduction 143
  4.2 House Sales Data 145
  4.3 Conditional Probability 146
  4.4 Equality Tests 148
  4.5 Correlation Coefficient 152
  4.6 Contingency Table and the χ² Test 157
  4.7 Linear Regression 164
  4.8 House Sales Database Revisited 172
  4.9 Chapter Summary 175
  4.10 Exercises 175
  4.11 Selected Bibliographic Notes 178
  4.12 Chapter Bibliography 178
Chapter 5 Rough Sets and Bayes' Theories 181
  5.1 Introduction 181
  5.2 Bayes' Theorem 183
  5.3 Rough Sets 184
    5.3.1 Data Analysis and Representation 184
    5.3.2 Reduction of Condition Attributes and Generation of Decision Rules 188
  5.4 Applications Based on Bayes' and Rough Sets 190
    5.4.1 Customer Tendency Analysis Using Bayes' Theory 190
    5.4.2 Contact Lens Prescription Using Rough Set Theory 190
    5.4.3 Welding Procedure Using Rough-Set Theory 195
    5.4.4 Classification of Automobiles Using Both Bayes' and Rough Set Theory 202
  5.5 Chapter Summary 212
  5.6 Exercises 213
  5.7 Selected Bibliographic Notes 220
  5.8 Chapter Bibliography 221
Chapter 6 Neural Networks 225
  6.1 Introduction 225
  6.2 Neural Computing and Databases 226
  6.3 Network Classification 228
    6.3.1 Unsupervised Learning Models 228
    6.3.2 Supervised Learning Models 230
  6.4 Parameters of the Learning Process 231
    6.4.1 Number of Hidden Layers 231
    6.4.2 Number of Hidden Nodes 232
    6.4.3 Early Stopping 232
    6.4.4 Convergence Curve (Back-Propagation Neural Network) 233
  6.5 Network Structures 234
    6.5.1 Neural Net and Traditional Classifiers 235
  6.6 Knowledge Discovery in Databases 235
    6.6.1 Normalization 236
  6.7 Backpropagation Neural Network (BPNN) Model 239
    6.7.1 Network Architecture 239
    6.7.2 Algorithm 240
    6.7.3 Example I 242
    6.7.4 Example II (Retrieval of Data Using the BPNN Model) 243
  6.8 Bidirectional Associative Memory (BAM) Model 246
    6.8.1 Network Architecture 247
    6.8.2 Algorithm 247
    6.8.3 Example with Four Training Vectors 248
  6.9 Learning Vector Quantization (LVQ) Model 250
    6.9.1 Network Architecture 251
    6.9.2 Algorithm 252
    6.9.3 Example 253
  6.10 Probabilistic Neural Network (PNN) Model 255
    6.10.1 Network Architecture 256
    6.10.2 Algorithm 259
    6.10.3 Example 260
    6.10.4 Parameter Adjustment Using a Smoothing Factor 265
  6.11 Chapter Summary 267
  6.12 Exercises 268
  6.13 Selected Bibliographic Notes 274
  6.14 Chapter Bibliography 275
Chapter 7 Clustering 279
  7.1 Introduction 279
  7.2 Definition of Clusters and Clustering 280
  7.3 Clustering Procedures 283
  7.4 Clustering Concepts 284
    7.4.1 Choosing Variables 284
    7.4.2 Similarity and Dissimilarity Measurement 285
    7.4.3 Standardization of Variables 287
    7.4.4 Weights and Threshold Values 288
    7.4.5 Association Rules 289
  7.5 Clustering Algorithms 290
    7.5.1 Hierarchical Algorithms 291
    7.5.2 Graph Theory Algorithm with the Single-link Method 304
    7.5.3 Partition Algorithms: K-means Algorithm 307
    7.5.4 Density-Search Algorithms 310
    7.5.5 Association Rule Algorithms 313
  7.6 Chapter Summary 329
  7.7 Exercises 329
  7.8 Selected Bibliographic Notes 333
  7.9 Chapter Bibliography 335
Chapter 8 Fuzzy Information Retrieval 339
  8.1 Introduction 339
  8.2 Fuzzy Set Basics 340
  8.3 Fuzzy Set Applications 341
    8.3.1 Project Management 342
    8.3.2 Data Analysis 342
    8.3.3 Nuanced Information Systems 346
  8.4 Linguistic Variables 347
  8.5 Fuzzy Query Processing 348
  8.6 Fuzzy Query Processing Using Fuzzy Tables 363
    8.6.1 Convert Raw Data to Fuzzy Member Functions 363
    8.6.2 Fuzzy Table 368
    8.6.3 Fuzzy Search Engine 369
    8.6.4 Fuzzy Table Construction 370
    8.6.5 Fuzzy Query Processing 371
  8.7 Role of Relational Division for Information Retrieval 374
    8.7.1 Information Retrieval through Relational Division 375
    8.7.2 Information Retrieval through Fuzzy Relational Division 376
  8.8 Alpha-Cut Thresholds 379
  8.9 Chapter Summary 383
  8.10 Exercises 384
  8.11 Selected Bibliographic Notes 391
  8.12 Chapter Bibliography 392
Appendix 395
Index 409