Practical Data Mining Monte F. Hancock, Jr. Chief Scientist, Celestech, Inc. CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint of the Taylor Ei Francis Group, an Informs business AN AUERBACH BOOK
Contents Dedication v Preface xv About the Author xxi Acknowledgments xxiii Chapter 1 What Is Data Mining and What Can It Do? 1 Purpose 1 Goals 1 1.1 Introduction 1 1.2 A Brief Philosophical Discussion 2 1.3 The Most Important Attribute of the Successful Data Miner: Integrity 3 1.4 What Does Data Mining Do? 4 1.5 What Do We Mean By Data? 6 1.5.1 Nominal Data vs. Numeric Data 7 1.5.2 Discrete Data vs. Continuous Data 7 1.5.3 Coding and Quantization as Inverse Processes 8 1.5.4 A Crucial Distinction: Data and Information Are Not the Same Thing 9 1.5.5 The Parity Problem 11 1.5.6 Five Riddles about Information 11 1.5.7 Seven Riddles about Meaning 13 1.6 Data Complexity 14 1.7 Computational Complexity 15
viii Practical Data Mining 1.7.1 Some NP-Hard Problems 17 1.7.2 Some Worst-Case Computational Complexities 17 1.8 Summary 17 Chapter 2 The Data Mining Process 19 Purpose 19 Goals 19 2.1 Introduction 19 2.2 Discovery and Exploitation 20 2.3 Eleven Key Principles of Information Driven Data Mining 23 2.4 Key Principles Expanded 24 2.5 Type of Models: Descriptive, Predictive, Forensic 30 2.5.1 Domain Ontologies as Models 30 2.5.2 Descriptive Models 32 2.5.3 Predictive Models 32 2.5.4 Forensic Models 32 2.6 Data Mining Methodologies 32 2.6.1 Conventional System Development: Waterfall Process 33 2.6.2 Data Mining as Rapid Prototyping 34 2.7 A Generic Data Mining Process 34 2.8 RAD Skill Set Designators 35 2.9 Summary 36 Chapter 3 Problem Definition (Step 1) 37 Purpose 37 Goals 37 3.1 Introduction 37 3.2 Problem Definition Task 1: Characterize Your Problem 38 3.3 Problem Definition Checklist 38 3.3.1 Identify Previous Work 43 3.3.2 Data Demographics 45 3.3.3 User Interface 47 3.3.4 Covering Blind Spots 50 3.3.5 Evaluating Domain Expertise 51 3.3.6 Tools 53 3.3.7 Methodology 54 3.3.8 Needs 54
Contents ix 3.4 Candidate Solution Checklist 56 3.4.1 What Type of Data Mining Must the System Perform? 56 3.4.2 Multifaceted Problems Demand Multifaceted Solutions 57 3.4.3 The Nature of the Data 58 3.5 Problem Definition Task 2: Characterizing Your Solution 62 3.5.1 Candidate Solution Checklist 62 3.6 Problem Definition Case Study 64 3.6.1 Predictive Attrition Model: Summary Description 64 3.6.2 Glossary 64 3.6.3 The ATM Concept 65 3.6.4 Operational Functions 65 3.6.5 Predictive Modeling and ATM 67 3.6.6 Cognitive Systems and Predictive Modeling 68 3.6.7 The ATM Hybrid Cognitive Engine 68 3.6.8 Testing and Validation of Cognitive Systems 69 3.6.9 Spiral Development Methodology 69 3.7 Summary 70 Chapter 4 Data Evaluation (Step 2) 71 Purpose 71 Goals 71 4.1 Introduction 71 4.2 Data Accessibility Checklist 72 4.3 How Much Data Do You Need? 74 4.4 Data Staging 75 4.5 Methods Used for Data Evaluation 76 4.6 Data Evaluation Case Study: Estimating the Information Content Features 77 4.7 Some Simple Data Evaluation Methods 81 4.8 Data Quality Checklist 85 4.9 Summary 87 Chapter 5 Feature Extraction and Enhancement (Step 3) 89 Purpose 89 Goals 89 5.1 Introduction: A Quick Tutorial on Feature Space 89 5.1.1 Data Preparation Guidelines 90
x Practical Data Mining 5.1.2 General Techniques for Feature Selection and Enhancement 91 5.2 Characterizing and Resolving Data Problems 93 5.2.1 Outlier Case Study 95 5.2.2 Winnowing Case Study: Principal Component Analysis for Feature Extraction 95 5.3 Principal Component Analysis 96 5.3.1 Feature Winnowing and Dimension Reduction Checklist 102 5.3.2 Checklist for Characterizing and Resolving Data Problems 107 5.4 Synthesis of Features 108 5.4.1 Feature Synthesis Case Study 108 5.4.2 Synthesis of Features Checklist 111 5.5 Degapping 112 5.5.1 Degapping Case Study 114 5.5.2 Feature Selection Checklist 117 5.6 Summary 119 Chapter 6 Prototyping Plan and Model Development (Step 4) 121 Purpose 121 Goals 121 6.1 Introduction 121 6.2 Step 4A: Prototyping Plan 122 6.2.1 Prototype Planning as Part of a Data Mining Project 122 6.3 Prototyping Plan Case Study 124 6.4 Step 4B: Prototyping/Model Development 133 6.5 Model Development Case Study 135 6.6 Summary 141 Chapter 7 Model Evaluation (Step 5) 143 Purpose 143 Goals 143 7.1 Introduction 143 7.2 Evaluation Goals and Methods 144 7.2.1 Performance Evaluation Components 144 7.2.2 Stability Evaluation Components 144
Contents xi 7.3 What Does Accuracy Mean? 146 7.3.1 Confusion Matrix Example 146 7.3.2 Other Metrics Derived from the Confusion Matrix 150 7.3.3 Model Evaluation Case Study: Addressing Queuing Problems by Simulation 150 7.3.4 Model Evaluation Checklist 152 7.4 Summary 155 Chapter 8 Implementation (Step 6) 157 Purpose 157 Goals 157 8.1 Introduction 157 8.1.1 Implementation Checklist 158 8.2 Quantifying the Benefits of Data Mining 160 8.2.1 ROI Case Study 160 8.2.2 ROI Checklist 162 8.3 Tutorial on Ensemble Methods 164 8.3.1 Many Predictive Modeling Paradigms Are Available 165 8.3.2 Adaptive Training 167 8.4 Getting It Wrong: Mistakes Every Data Miner Has Made 169 8.5 Summary 176 Chapter 9 Supervised Learning Genre Section 1 Detecting and Characterizing Known Patterns 179 Purpose 179 Goals 179 9.1 Introduction 180 9.2 Representative Example of Supervised Learning: Building a Classifier 180 9.2.1 Problem Description 180 9.2.2 Data Description: Background Research/ Planning 181 9.2.3 Descriptive Modeling of Data: Preprocessing and Data Conditioning 182 9.2.4 Data Exploitation: Feature Extraction and Enhancement 185
xii Practical Data Mining 9.2.5 Model Selection and Development 187 9.2.6 Model Training 189 9.2.7 Model Evaluation 189 9.3 Specific Challenges, Problems, and Pitfalls of Supervised Learning 190 9.3.1 High-Dimensional Feature Vectors (PCA, Winnowing) 190 9.3.2 Not Enough Data 191 9.3.3 Too Much Data 192 9.3.4 Unbalanced Data 192 9.3.5 Overtraining 193 9.3.6 Noncommensurable Data: Outliers 193 9.3.7 Missing Features 195 9.3.8 Missing Ground Truth 195 9.4 Recommended Data Mining Architectures for Supervised Learning 195 9.5 Descriptive Analysis 198 9.5.1 Technical Component: Problem Definition 198 9.5.2 Technical Component: Data Selection and Preparation 200 9.5.3 Technical Component: Data Representation 200 9.6 Predictive Modeling 201 9.6.1 Technical Component: Paradigm Selection 201 9.6.2 Technical Component: Model Construction and Validation 202 9.6.3 Technical Component: Model Evaluation (Functional and Performance Metrics) 202 9.6.4 Technical Component: Model Deployment 202 9.6.5 Technical Component: Model Maintenance 202 9.7 Summary 204 Chapter 10 Forensic Analysis Genre Section 2 Detecting, Characterizing, and Exploiting Hidden Patterns 205 Purpose 205 Goals 205 10.1 Introduction 206 10.2 Genre Overview 207 10.3 Recommended Data Mining Architectures for Unsupervised Learning 207
Contents xiii 10.4 Examples and Case Studies for Unsupervised Learning 209 10.4.1 Case Study: Reducing Cost by Optimizing a System Configuration 212 10.4.2 Case Study: Stacking Multiple Pattern Processors for Broad Functionality 214 10.4.3 Multiparadigm Engine for Cognitive Intrusion Detection 215 10.5 Tutorial on Neural Networks 217 10.5.1 The Neural Analogy 217 10.5.2 Artificial Neurons: Their Form and Function 218 10.5.3 Using Neural Networks to Learn Complex Patterns 219 10.6 Making Syntactic Methods Smarter: The Search Engine Problem 222 10.6.1 A Submetric for Sensitivity 224 10.6.2 A Submetric for Specificity 224 10.6.3 Combining the Submetrics to Obtain a Single Score 225 10.6.4 Putting It All Together: Building a Simple Search Engine 226 10.6.5 The Objective Function for This Search Engine and How to Use It 231 10.7 Summary 231 Chapter 11 Purpose Genre Section 3 Knowledge: Its Acquisition, Representation, and Use 233 233 Goals 233 11.1 Introduction to Knowledge Engineering 233 11.1.1 The Prototypical Example: Knowledge-Based Expert Systems (KBES) 234 11.1.2 Inference Engines Implement Inferencing Strategies 236 11.2 Computing with Knowledge 237 11.2.1 Graph Methods: Decision Trees, Forward/ Backward Chaining, Belief Nets 238 11.2.2 Bayesian Belief Networks 243 11.2.3 Non-Graph Methods: Belief Accumulation 245 11.3 Inferring Knowledge from Data: Machine Learning 246 11.3.1 Learning Machines 247
xiv Practical Data Mining 11.3.2 Using Modeling Techniques to Infer Knowledge from History 248 11.3.3 Domain Knowledge the Learner Will Use 250 11.3.4 Inferring Domain Knowledge from Human Experts 251 11.3.5 Writing on a Blank Slate 255 11.3.6 Mathematizing Human Reasoning 256 11.3.7 Using Facts in Rules 256 11.3.8 Problems and Properties 258 11.4 Summary 259 References 261 Glossary 263 Index 269