Predictive Analytics: Modeling the World. Richard D. De Veaux Professor of Statistics, Williams College January 28, 2005 OR/MS Seminar


2 Getting to Know Your Customers: 50 years ago this was easy. The customer data base could fit in one person's head, and retention of customers depended on the ability to do so.
3 21st Century Data Bases: The ability to anticipate customers' needs is crucial for retention. Even Sam Walton didn't know all his customers' preferences. Amazon.com, "Earth's biggest selection": $390,000 diamond necklace, world's biggest book, yak cheese from Tibet. No one can do this without help. Well, almost no one!
4 Direct Marketing Example: Paralyzed Veterans of America (PVA), KDD 1998 cup. Mailing list of 3.5 million potential donors. Lapsed donors: made their last donation to PVA 13 to 24 months prior to June ,000 (training and test sets). Who should get the current mailing? Cost-effective strategy.
5 Why is this Hard? Amount of information: 481 predictors, 2 responses. Cross tabs / OLAP: how many combinations? What to focus on? Data preparation: this alone can be 60-95% of the effort. Categorical vs. quantitative.
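To put a number on "how many combinations?", here is a minimal sketch that counts the distinct two-way and three-way cross tabs available among the 481 predictors. The function name `crosstab_count` is made up for this illustration.

```python
from math import comb

N_PREDICTORS = 481  # number of predictors quoted on the slide


def crosstab_count(n_vars: int, k: int) -> int:
    """Number of distinct k-way cross tabs among n_vars variables."""
    return comb(n_vars, k)


pairs = crosstab_count(N_PREDICTORS, 2)
triples = crosstab_count(N_PREDICTORS, 3)
print(f"2-way cross tabs: {pairs:,}")
print(f"3-way cross tabs: {triples:,}")
```

Even restricted to pairs of variables there are more tables than anyone could inspect by hand, which is the point of the slide.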
6 What's Hard? Example
7 TCode
8 So, what does it mean? TCode / Title: 0 _; 16 DEAN; 48 CORPORAL; LIC.; 1 MR.; 17 JUDGE; 50 ELDER; SA; MESSRS; JUDGE & MRS.; 56 MAYOR; DA; MR. & MRS.; 18 MAJOR; LIEUTENANT & MRS; SR.; 2 MRS; MAJOR & MRS.; 62 LORD; SRA; MESDAMES; 19 SENATOR; 63 CARDINAL; 118 SRTA.; 3 MISS; 20 GOVERNOR; 64 FRIEND; YOUR MAJESTY; MISSES; SERGEANT & MRS.; 65 FRIENDS; HIS HIGHNESS; 4 DR; COLNEL & MRS.; 68 ARCHDEACON; 123 HER HIGHNESS; 4002 DR. & MRS.; 24 LIEUTENANT; 69 CANON; 124 COUNT; DOCTORS; 26 MONSIGNOR; 70 BISHOP; LADY; 5 MADAME; 27 REVEREND; REVEREND & MRS.; 126 PRINCE; 6 SERGEANT; 28 MS.; 73 PASTOR; PRINCESS; 9 RABBI; MSS.; 75 ARCHBISHOP; 128 CHIEF; 10 PROFESSOR; 29 BISHOP; 85 SPECIALIST; BARON; PROFESSOR & MRS.; 31 AMBASSADOR; 87 PRIVATE; SHEIK; PROFESSORS; AMBASSADOR & MRS; 89 SEAMAN; PRINCE AND PRINCESS; 11 ADMIRAL; 33 CANTOR; 90 AIRMAN; YOUR IMPERIAL MAJEST; ADMIRAL & MRS.; 36 BROTHER; 91 JUSTICE; M. ET MME.; 12 GENERAL; 37 SIR; 92 MR. JUSTICE; PROF; GENERAL & MRS.; 38 COMMODORE; M.; 13 COLONEL; 40 FATHER; MLLE; COLONEL & MRS.; 42 SISTER; 104 CHANCELLOR; 14 CAPTAIN; 43 PRESIDENT; REPRESENTATIVE; CAPTAIN & MRS.; 44 MASTER; SECRETARY; 15 COMMANDER; 46 MOTHER; LT. GOVERNOR; COMMANDER & MRS.; 47 CHAPLAIN
9 Results for PVA Data Set: If the entire list (100,000 donors) is mailed, the net donation is $10,500. Using data mining techniques, this was increased 41.37%.
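As a quick worked check of what a 41.37% increase means in dollars, the sketch below applies the quoted lift to the quoted baseline. It assumes the percentage is measured relative to the $10,500 net donation from mailing everyone; the function name is hypothetical.

```python
BASELINE_NET = 10_500.00  # net donation when the entire list is mailed (from the slide)
LIFT = 0.4137             # 41.37% improvement (from the slide)


def improved_net(baseline: float, lift: float) -> float:
    """Net donation after applying a relative improvement to the baseline."""
    return baseline * (1.0 + lift)


net = improved_net(BASELINE_NET, LIFT)
gain = net - BASELINE_NET
print(f"net donation with model: ${net:,.2f} (gain ${gain:,.2f})")
```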
10 KDD CUP 98 Results
11 KDD CUP 98 Results 2
12 Data Mining Is: "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad); "finding interesting structure (patterns, statistical models, relationships) in data bases" (Fayyad, Chaudhuri and Bradley); "a knowledge discovery process of extracting previously unknown, actionable information from very large data bases" (Zornes); "a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions" (Edelstein).
13 Data Mining Is
14 Case Study I, Ingot Cracking: ,000 lb. ingots. 20% cracking rate. $30,000 per recast. 90 potential explanatory variables: water composition, metal composition, process variables, other environmental variables. Can we predict under what conditions ingots will crack?
15 Case Study II, Car Insurance: mature policies. 65 potential predictors. Can we find a pattern for the unprofitable policies?
16 Case Study III, Breast Cancer Diagnosis: Mammograms used as screening instrument. Expensive radiologist read. Inaccurate: false positive and negative rates over 25%. Over a decade, nearly 100% false positive rate. Can we do better? Automatically read by a scanning algorithm. Automatically diagnosed by a model.
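The "nearly 100% over a decade" figure can be roughly reproduced with a short calculation. This is only a sketch under stated assumptions that are not in the slides: one screening per year for ten years, independent readings, and a false positive rate of exactly 25% per reading.

```python
def prob_at_least_one_false_positive(rate_per_screen: float, n_screens: int) -> float:
    """Chance of one or more false positives in n independent screenings."""
    return 1.0 - (1.0 - rate_per_screen) ** n_screens


RATE_PER_SCREEN = 0.25  # slide says "over 25%"; 25% is used as a round figure
N_SCREENS = 10          # assumption: one mammogram per year for a decade

decade_risk = prob_at_least_one_false_positive(RATE_PER_SCREEN, N_SCREENS)
print(f"chance of at least one false positive in a decade: {decade_risk:.1%}")
```

Under these assumptions the result is about 94%, in line with the slide's "nearly 100%".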
17 Why not Queries? Queries describe; models promote understanding. Models can be assessed both by their understanding and their predictions. "It's difficult to predict, especially the future." Queries are event driven; models are phenomenon driven. Queries are reactive; models are proactive.
18 What Happened on the Titanic? [Table by class: Crew, First, Second, Third]
19 Mosaic Plot [Figure: mosaic plot]
20 Models: Powerful predictors for optimizing performance. Powerful summaries for understanding. Used to explore data set. Are not perfect. "All models are wrong, but some are useful." "Statisticians, like artists, have the bad habit of falling in love with their models."
21 Tree Diagram [Figure: tree splitting on sex (M, F), age (Adult, Child) and class (1, 2, 3, Crew); node percentages shown: 46%, 93%, 14%, 27%, 100%, 23%, 33%]
22 Why Models? What's interesting? Most associated variables in the census. What's associated with shampoo purchases? Beer and diapers: "In the convenience stores we looked at, on Friday nights, purchases of beer and purchases of diapers are highly associated." Conclusions? Actions?
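One common way to quantify "highly associated" purchases is with support, confidence and lift. The sketch below computes them on a handful of invented baskets; the data and function names are hypothetical and are not taken from the study the slide quotes.

```python
# Hypothetical Friday-night baskets, invented for illustration only.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
    {"milk"},
]


def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in BASKETS) / len(BASKETS)


def confidence(antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence relative to the consequent's base rate; above 1 means positive association."""
    return confidence(antecedent, consequent) / support(consequent)


conf_bd = confidence({"beer"}, {"diapers"})
lift_bd = lift({"beer"}, {"diapers"})
print(f"confidence(beer -> diapers) = {conf_bd:.2f}, lift = {lift_bd:.2f}")
```

A lift above 1 says the items co-occur more often than independence would predict. It does not answer the slide's "Conclusions? Actions?", which is where a model comes in.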
23 Beer and Diapers [Picture from Tandem™ ad]
24 Toy Problem [Figure: scatterplots of train2$y against each predictor train2[, i]]
25 Familiar Models: Linear Regression
26 Logistic Regression
27 Linear Regression [Table: Term, Estimate, Std Error, t Ratio, Prob>|t| for the intercept and the x terms] R-squared: 76.1% train, 73.3% test
28 Stepwise Regression [Table: Term, Estimate, Std Error, t Ratio, Prob>|t| for the intercept and the retained x terms] R-squared: 76.0% train, 73.4% test
29 Stepwise 2nd Order Model [Table: Term, Estimate, Std Error, t Ratio, Prob>|t| for the intercept, x terms and (x)*(x) product terms] R-squared: 90.0% train, 88.5% test
30 Next Steps: Higher order terms? When to stop? Transformations? Too simple: underfitting, bias. Too complex: inconsistent predictions, overfitting, high variance. Selecting models is Occam's razor. Keep goals of interpretation vs. prediction in mind.
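The trade-off between "too simple" and "too complex" can be seen by fitting models of increasing complexity and comparing R-squared on training and test data. The sketch below is not the talk's example: it uses synthetic data (a noisy sine curve) and a piecewise-constant model whose complexity is the number of bins, both chosen only so the code stays short.

```python
import math
import random


def make_data(n, rng):
    """Synthetic data: y = sin(2*pi*x) plus Gaussian noise, x uniform on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


def fit_bins(xs, ys, n_bins):
    """Piecewise-constant fit: mean of y in each of n_bins equal-width bins."""
    overall = sum(ys) / len(ys)
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # Empty bins fall back to the overall mean.
    return [sums[b] / counts[b] if counts[b] else overall for b in range(n_bins)]


def predict(means, x):
    return means[min(int(x * len(means)), len(means) - 1)]


def r_squared(means, xs, ys):
    mean_y = sum(ys) / len(ys)
    sse = sum((y - predict(means, x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - sse / sst


rng = random.Random(0)
train_x, train_y = make_data(60, rng)
test_x, test_y = make_data(1000, rng)

results = []
for n_bins in (1, 2, 4, 8, 16, 32, 64):
    means = fit_bins(train_x, train_y, n_bins)
    results.append((n_bins,
                    r_squared(means, train_x, train_y),
                    r_squared(means, test_x, test_y)))

for n_bins, r2_train, r2_test in results:
    print(f"bins={n_bins:3d}  train R2={r2_train:6.3f}  test R2={r2_test:6.3f}")
```

Because each bin count refines the previous one, the training R-squared can only go up as bins are added. The test R-squared is the number to watch when deciding when to stop.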
31 Tree Model [Figure: regression tree with splits on x1, x2, x3, x4, x5 and x8] R-squared: 82.3% train, 67.2% test
32 Feature Creation: New predictor based on original predictors. Often linear: $z = \alpha + b_1 x_1 + \dots + b_p x_p$. Principal components, factor analysis, multidimensional scaling.
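As an illustration of a linear created feature, the sketch below computes a first principal component by power iteration and uses its weights as the b coefficients of a new predictor z. The data are invented, and power iteration is just one simple way to get the leading eigenvector. It is not claimed to be the method used in the talk.

```python
import math


def column_means(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of rows."""
    n, p = len(rows), len(rows[0])
    means = column_means(rows)
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=200):
    """Leading eigenvector of cov (unit length), found by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical data: two predictors, the second roughly twice the first.
ROWS = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]

b = first_component(covariance_matrix(ROWS))
means = column_means(ROWS)
alpha = -sum(bj * mj for bj, mj in zip(b, means))  # centers the new feature at zero

# New predictor: z = alpha + b_1 * x_1 + ... + b_p * x_p
z = [alpha + sum(bj * xj for bj, xj in zip(b, row)) for row in ROWS]
print("weights b:", [round(bj, 3) for bj in b])
print("new feature z:", [round(zi, 3) for zi in z])
```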
33 Neural Nets: Don't resemble the brain. Are just a statistical model.
34 A Single Neuron [Figure: inputs x0 to x5 feed a weighted sum z1 (the input), which passes through s(z1) to give the output] z1 = x1 + .7x2 - .2x3 + .4x4 - .5x5
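A single neuron of this kind is a weighted sum passed through a squashing function. The sketch below makes several assumptions that the transcript does not confirm: s is taken to be the logistic function, the weights on x3 and x5 are taken as negative, the weight on x1 is taken as 1, and the bias term attached to x0 is set to zero.

```python
import math


def logistic(z: float) -> float:
    """Assumed form of the activation s(z); the slide only names it s."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by the activation function."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return logistic(z)


# Weights for x1..x5, read off the slide with the signs of .2 and .5 assumed negative.
WEIGHTS = [1.0, 0.7, -0.2, 0.4, -0.5]

out_zero = neuron([0.0] * 5, WEIGHTS)
out_ones = neuron([1.0] * 5, WEIGHTS)
print(f"output for all-zero inputs: {out_zero:.3f}")
print(f"output for all-one inputs:  {out_ones:.3f}")
```

With a logistic activation and no hidden layer, this is the same functional form as logistic regression, which fits the earlier remark that neural nets "are just a statistical model".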
35 More exotic Neural networks [Figure: input layer (x1, x2), hidden layer (z1, z2, z3), output layer (y)]
36 Running a Neural Net
37 Predictions for Example: R-squared 92.7% train, 90.6% test
38 What Does This Get Us? Enormous flexibility. Ability to fit anything, including noise. Interpretation?
39 Case Study, Warranty Data: A new backpack inkjet printer is showing higher than expected warranty claims. What are the important variables? What's going on? A neural network shows that Zipcode is the most important predictor.
40 Spatial Analysis: Warranty data showing problem with inkjet printer. Use the model as a black box for variable selection.
41 MARS (Multivariate Adaptive Regression Splines): What do they do? Replace each step function in a tree model by a pair of linear functions.
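The "pair of linear functions" is usually written as two hinge functions that meet at a knot. The sketch below contrasts a tree-style step with a hinge pair; the knot, slopes and function names are invented for illustration.

```python
def tree_step(x, knot, left_value, right_value):
    """Tree-style split: a constant on each side of the knot."""
    return left_value if x < knot else right_value


def hinge_pair(x, knot):
    """MARS-style basis pair: max(0, x - knot) and max(0, knot - x)."""
    return max(0.0, x - knot), max(0.0, knot - x)


def mars_term(x, knot, intercept, slope_right, slope_left):
    """Continuous piecewise-linear term built from one hinge pair."""
    right, left = hinge_pair(x, knot)
    return intercept + slope_right * right + slope_left * left


for x in (0.0, 1.0, 2.0, 3.0, 4.0):
    step = tree_step(x, knot=2.0, left_value=-1.0, right_value=2.0)
    smooth = mars_term(x, knot=2.0, intercept=1.0, slope_right=0.5, slope_left=-2.0)
    print(f"x={x:.1f}  step={step:5.2f}  hinge pair={smooth:5.2f}")
```

Unlike the step, the hinge-pair term is continuous at the knot and can have a different slope on each side.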
42 MARS Variable Importance: R-squared 95.0% train, 94.3% test (96.3%) (95.8%)
43 MARS Function Output
44 Collaborative Filtering: Goal: predict what movies people will like. Data: list of movies each person has watched. Lyle: Andre, Starwars. Ellen: Andre, Starwars, Coeur en Hiver. Fred: Starwars, Batman. Dean: Starwars, Batman, Rambo. Jason: Coeur en Hiver, Chocolat.
45 Data Base: Data can be represented as a sparse matrix. Columns: Andre, Starwars, Batman, Rambo, Coeur, Chocolat. Rows: Lyle y y; Ellen y y y; Fred y y; Dean y y y; Jason y y y; Karen y ? ? ? ? ?. Karen likes Andre. What else might she like? CDNow doubled responses.
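A very simple way to answer "what else might Karen like?" is to score unseen movies by how many like-minded viewers watched them. The sketch below uses the viewing lists from slide 44. The overlap-count scoring rule is an assumption made for illustration, not the method CDNow or the speaker used.

```python
from collections import Counter

# Viewing lists as given on slide 44.
WATCHED = {
    "Lyle": {"Andre", "Starwars"},
    "Ellen": {"Andre", "Starwars", "Coeur en Hiver"},
    "Fred": {"Starwars", "Batman"},
    "Dean": {"Starwars", "Batman", "Rambo"},
    "Jason": {"Coeur en Hiver", "Chocolat"},
}


def recommend(liked, watched):
    """Rank unseen movies by the summed overlap with viewers who share a movie."""
    scores = Counter()
    for movies in watched.values():
        overlap = len(liked & movies)
        if overlap == 0:
            continue  # this viewer shares nothing with the target
        for movie in movies - liked:
            scores[movie] += overlap
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


karen = recommend({"Andre"}, WATCHED)
print(karen)
```

Storing each person's list as a set is also a natural sparse representation of the matrix on slide 45: only the "y" cells are kept.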
46 How Do We Really Start? Life is not so kind: categorical variables, missing data, 500 variables, not variables. Where to start?
47 Where to Start? EDM: Use a tree to find a smaller subset of variables to investigate. Explore this set graphically. Start the modeling process over. Build model. Compare model on small subset with full predictive model.
48 Start With a Simple Model: Maybe a tree [Figure: tree with splits on x1, x2, x4 and x5]
49 Automatic Models: KXEN
50 PVA Results from KXEN
51 Combining Models: Bagging (Bootstrap Aggregation). Bootstrap a data set repeatedly. Take many versions of the same model (e.g. tree). Form a committee of models. Take majority rule of predictions.
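The four bagging steps can be sketched in a few lines. The example below uses a one-split tree (a stump) as the base model and a tiny invented one-dimensional data set. The names and data are hypothetical, and a real application would use full trees on real predictors.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-split tree to (x, label) pairs with labels 0/1.

    Returns (threshold, left_label, right_label) with the fewest training errors.
    """
    xs = sorted({x for x, _ in points})
    # Candidate thresholds: midpoints between neighbouring distinct x values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or [xs[0]]
    best = None
    for t in candidates:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    return best[1:]


def predict_stump(stump, x):
    t, left, right = stump
    return left if x <= t else right


def bagged_stumps(points, n_models, rng):
    """Fit one stump per bootstrap resample of the data."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]  # draw n points with replacement
        models.append(fit_stump(sample))
    return models


def predict_committee(models, x):
    """Majority rule over the committee's predictions."""
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical data: label 0 for x = 0..4, label 1 for x = 5..9.
POINTS = [(float(x), 0) for x in range(5)] + [(float(x), 1) for x in range(5, 10)]

committee = bagged_stumps(POINTS, n_models=25, rng=random.Random(0))
print([predict_committee(committee, x) for x in (1.0, 8.0)])
```

An odd number of models avoids tied votes in a two-class problem.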
52 Combining Models: Boosting. Take the data and apply a simple classifier. Reweight the data, weighting the misclassified data much higher. Reapply the classifier. Repeat over and over. The final prediction is a combination of the output of each classifier, weighted by the overall misclassification rate. Details in Freund, Y., "Boosting a weak learning algorithm by majority," Information and Computation 121(2),
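The reweight-and-repeat loop can be sketched with an AdaBoost-style update. This is a swapped-in, widely used variant and not necessarily the exact algorithm of the paper cited on the slide. The stump learner and the ten-point data set are invented for illustration.

```python
import math


def stump_predict(threshold, polarity, x):
    """Predict `polarity` (+1 or -1) when x > threshold, the opposite otherwise."""
    return polarity if x > threshold else -polarity


def fit_weighted_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) minimizing weighted error."""
    distinct = sorted(set(xs))
    candidates = [distinct[0] - 1.0] + [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for t in candidates:
        for polarity in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(t, polarity, x) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def boost(xs, ys, rounds):
    """AdaBoost-style loop: fit, upweight the mistakes, refit."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, polarity = fit_weighted_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)  # classifier weight from its error rate
        ensemble.append((alpha, t, polarity))
        # Misclassified points (y * prediction = -1) get multiplied by exp(+alpha).
        weights = [w * math.exp(-alpha * y * stump_predict(t, polarity, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical data that no single stump can classify perfectly.
XS = [float(i) for i in range(10)]
YS = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = boost(XS, YS, rounds=3)
print([predict(ensemble, x) for x in XS])
```

The best single stump misclassifies three of the ten points, while the weighted combination of three stumps classifies all ten training points correctly.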
53 Breast Cancer Diagnosis
54 Results from Random Forest: Results from 1000 splits of training and test data.
Method           False Positive Rate   False Negative Rate
Tree             32.20%                33.70%
Boosted Trees    24.90%                32.50%
Random Forest    19.30%                28.80%
Neural Network   25.50%                31.70%
Radiologists     22.40%                35.80%
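For reference, the two error rates in the table are computed from a confusion matrix as follows. The counts below are invented for illustration and are unrelated to the study's data; averaging over splits mirrors the "1000 splits" idea on a much smaller scale.

```python
def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate) from confusion counts."""
    fpr = fp / (fp + tn)  # share of truly negative cases flagged as positive
    fnr = fn / (fn + tp)  # share of truly positive cases that were missed
    return fpr, fnr


# Hypothetical confusion counts (tp, fp, tn, fn) from three train/test splits.
SPLITS = [
    (64, 45, 155, 36),
    (70, 40, 160, 30),
    (60, 50, 150, 40),
]

rates = [error_rates(*counts) for counts in SPLITS]
mean_fpr = sum(r[0] for r in rates) / len(rates)
mean_fnr = sum(r[1] for r in rates) / len(rates)
print(f"mean false positive rate: {mean_fpr:.1%}")
print(f"mean false negative rate: {mean_fnr:.1%}")
```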
55 Case Study, Ingot Failures: Ingot cracking. ,000 lb. ingots. 20% cracking rate. $30,000 per recast. 90 potential explanatory variables: water composition (reduced), metal composition, process variables, other environmental variables.
56 Model building process: Model building, Train, Test, Evaluate.
57 Most Important Variable: Take one: here we started with trees. Alloy. We know that. OK, take two: Yttrium. What do you think is in the alloy? Third time's the charm? Selenium! OH!
58 Case Study, Car Insurance: Now that we have mature policies, can we find other factors to price policies better? 65 potential predictors: industry, vehicle age, color, number of vehicles, usage, location, etc.
59 Fast Fail: Not every modeling effort is a success. A model search can save lots of queries. Data took 8 months to get ready. Analyst spent 2 months exploring it. A new model search program (KXEN) running for several hours found no out-of-sample predictive ability. Tree model gave similar results.
60 PVA Recap: Remember predictor variables. Need a way to trim this down. Need an exploratory model. Neural network? Tree?
61 Students in Data Mining Class: Student #1 $15,024; Student #2 $14,695; Student #3 $14,345
62 Take Home Messages: What a great time to be a statistician! Problems are exciting. Research is exciting. Success in data mining requires teamwork, requires flexibility in modeling, means that you act on your results, and depends much more on the way you mine the data than on the specific model or tool that you use. Which method to use? Yes!! Have fun!
63 Thank you!