CONTENTS





PREFACE
1 INTRODUCTION
  1.1 Overview
  1.2 Definition
  1.3 Preparation
    1.3.1 Overview
    1.3.2 Accessing Tabular Data
    1.3.3 Accessing Unstructured Data
    1.3.4 Understanding the Variables and Observations
    1.3.5 Data Cleaning
    1.3.6 Transformation
    1.3.7 Variable Reduction
    1.3.8 Segmentation
    1.3.9 Preparing Data to Apply
  1.4 Analysis
    1.4.1 Data Mining Tasks
    1.4.2 Optimization
    1.4.3 Evaluation
    1.4.4 Model Forensics
  1.5 Deployment
  1.6 Outline of Book
    1.6.1 Overview
    1.6.2 Data Visualization
    1.6.3 Clustering
    1.6.4 Predictive Analytics
    1.6.5 Applications
    1.6.6 Software
  1.7 Summary
  1.8 Further Reading
2 DATA VISUALIZATION
  2.1 Overview
  2.2 Visualization Design Principles
    2.2.1 General Principles
    2.2.2 Graphics Design
    2.2.3 Anatomy of a Graph

  2.3 Tables
    2.3.1 Simple Tables
    2.3.2 Summary Tables
    2.3.3 Two-Way Contingency Tables
    2.3.4 Supertables
  2.4 Univariate Data Visualization
    2.4.1 Bar Chart
    2.4.2 Histograms
    2.4.3 Frequency Polygram
    2.4.4 Box Plots
    2.4.5 Dot Plot
    2.4.6 Stem-and-Leaf Plot
    2.4.7 Quantile Plot
    2.4.8 Quantile-Quantile Plot
  2.5 Bivariate Data Visualization
    2.5.1 Scatterplot
  2.6 Multivariate Data Visualization
    2.6.1 Histogram Matrix
    2.6.2 Scatterplot Matrix
    2.6.3 Multiple Box Plot
    2.6.4 Trellis Plot
  2.7 Visualizing Groups
    2.7.1 Dendrograms
    2.7.2 Decision Trees
    2.7.3 Cluster Image Maps
  2.8 Dynamic Techniques
    2.8.1 Overview
    2.8.2 Data Brushing
    2.8.3 Nearness Selection
    2.8.4 Sorting and Rearranging
    2.8.5 Searching and Filtering
  2.9 Summary
  2.10 Further Reading
3 CLUSTERING
  3.1 Overview
  3.2 Distance Measures
    3.2.1 Overview
    3.2.2 Numeric Distance Measures
    3.2.3 Binary Distance Measures
    3.2.4 Mixed Variables
    3.2.5 Other Measures
  3.3 Agglomerative Hierarchical Clustering
    3.3.1 Overview
    3.3.2 Single Linkage
    3.3.3 Complete Linkage
    3.3.4 Average Linkage
    3.3.5 Other Methods
    3.3.6 Selecting Groups

  3.4 Partitioned-Based Clustering
    3.4.1 Overview
    3.4.2 k-means
    3.4.3 Worked Example
    3.4.4 Miscellaneous Partitioned-Based Clustering
  3.5 Fuzzy Clustering
    3.5.1 Overview
    3.5.2 Fuzzy k-means
    3.5.3 Worked Examples
  3.6 Summary
  3.7 Further Reading
4 PREDICTIVE ANALYTICS
  4.1 Overview
    4.1.1 Predictive Modeling
    4.1.2 Testing Model Accuracy
    4.1.3 Evaluating Regression Models' Predictive Accuracy
    4.1.4 Evaluating Classification Models' Predictive Accuracy
    4.1.5 Evaluating Binary Models' Predictive Accuracy
    4.1.6 ROC Charts
    4.1.7 Lift Chart
  4.2 Principal Component Analysis
    4.2.1 Overview
    4.2.2 Principal Components
    4.2.3 Generating Principal Components
    4.2.4 Interpretation of Principal Components
  4.3 Multiple Linear Regression
    4.3.1 Overview
    4.3.2 Generating Models
    4.3.3 Prediction
    4.3.4 Analysis of Residuals
    4.3.5 Standard Error
    4.3.6 Coefficient of Multiple Determination
    4.3.7 Testing the Model Significance
    4.3.8 Selecting and Transforming Variables
  4.4 Discriminant Analysis
    4.4.1 Overview
    4.4.2 Discriminant Function
    4.4.3 Discriminant Analysis Example
  4.5 Logistic Regression
    4.5.1 Overview
    4.5.2 Logistic Regression Formula
    4.5.3 Estimating Coefficients
    4.5.4 Assessing and Optimizing Results
  4.6 Naive Bayes Classifiers
    4.6.1 Overview
    4.6.2 Bayes Theorem and the Independence Assumption
    4.6.3 Independence Assumption
    4.6.4 Classification Process

  4.7 Summary
  4.8 Further Reading
5 APPLICATIONS
  5.1 Overview
  5.2 Sales and Marketing
  5.3 Industry-Specific Data Mining
    5.3.1 Finance
    5.3.2 Insurance
    5.3.3 Retail
    5.3.4 Telecommunications
    5.3.5 Manufacturing
    5.3.6 Entertainment
    5.3.7 Government
    5.3.8 Pharmaceuticals
    5.3.9 Healthcare
  5.4 MicroRNA Data Analysis Case Study
    5.4.1 Defining the Problem
    5.4.2 Preparing the Data
    5.4.3 Analysis
  5.5 Credit Scoring Case Study
    5.5.1 Defining the Problem
    5.5.2 Preparing the Data
    5.5.3 Analysis
    5.5.4 Deployment
  5.6 Data Mining Nontabular Data
    5.6.1 Overview
    5.6.2 Data Mining Chemical Data
    5.6.3 Data Mining Text
  5.7 Further Reading
APPENDIX A MATRICES
  A.1 Overview of Matrices
  A.2 Matrix Addition
  A.3 Matrix Multiplication
  A.4 Transpose of a Matrix
  A.5 Inverse of a Matrix
APPENDIX B SOFTWARE
  B.1 Software Overview
    B.1.1 Software Objectives
    B.1.2 Access and Installation
    B.1.3 User Interface Overview
  B.2 Data Preparation
    B.2.1 Overview
    B.2.2 Reading in Data
    B.2.3 Searching the Data

    B.2.4 Variable Characterization
    B.2.5 Removing Observations and Variables
    B.2.6 Cleaning the Data
    B.2.7 Transforming the Data
    B.2.8 Segmentation
    B.2.9 Principal Component Analysis
  B.3 Tables and Graphs
    B.3.1 Overview
    B.3.2 Contingency Tables
    B.3.3 Summary Tables
    B.3.4 Graphs
    B.3.5 Graph Matrices
  B.4 Statistics
    B.4.1 Overview
    B.4.2 Descriptive Statistics
    B.4.3 Confidence Intervals
    B.4.4 Hypothesis Tests
    B.4.5 Chi-Square Test
    B.4.6 ANOVA
    B.4.7 Comparative Statistics
  B.5 Grouping
    B.5.1 Overview
    B.5.2 Clustering
    B.5.3 Associative Rules
    B.5.4 Decision Trees
  B.6 Prediction
    B.6.1 Overview
    B.6.2 Linear Regression
    B.6.3 Discriminant Analysis
    B.6.4 Logistic Regression
    B.6.5 Naive Bayes
    B.6.6 kNN
    B.6.7 CART
    B.6.8 Neural Networks
    B.6.9 Apply Model
BIBLIOGRAPHY
INDEX