Applied Data Mining Analysis: A Guide Via Examples Dan Steinberg, Mikhail Golovnya, N Scott Cardell July 2013 Salford Systems http://www.salford-systems.com
Modern Analytics Interest and research in what we now think of as data mining and machine learning goes back to at least the 1960s The Perceptron (pre-neural network) introduced in 1957 IEEE Transactions on Pattern Analysis and Machine Intelligence began January 1979 (vol. 1 no. 1) The ACM KDD (Knowledge Discovery in Databases) series of conferences began informally in 1989, formally in 1995 The field is now in a stage of extraordinary growth and brings together concepts and techniques from statistics and computer science Recent extension of topics to Big Data fueled by Google's MapReduce, Yahoo!'s development of Hadoop, and Amazon EC2: easy, massively parallel data processing
KDD Conference 1995
IEEE Pattern Analysis 1979 Vol 1, No. 1 Copyright Salford Systems 2013
Data Mining Key concepts differentiating data mining from traditional statistics Very few assumptions about the data or about the models to be built Emphasis on learning as much as possible from the data Emphasis on a fair degree of automation and search Allowing for a far larger space of possible models Typically computer intensive methods (a simple MARS spline regression might fit the equivalent of 70,000 models) Some early definitions of data mining emphasized the volume of data being analyzed (data mining=lots of data) but can use these techniques with very few data records Essentially data mining (or machine learning) is defined by the tools we use to analyze the data
Challenges for Data Miners Conceptually same as for statistician Understand problem Acquire appropriate data Define unit of observation and what is being predicted or explained Select appropriate methodology (classification, regression, survival, clustering) and tools Construct useful predictors if not present in data (feature extraction) Some differences in next steps as Data Mining uses much more flexible and adaptive learning algorithms Predictor selection Choice of learning algorithm Avoid overfitting (model too flexible, memorizes train data) Avoid underfitting (model not flexible enough, too few features)
Bias/Variance TradeOff Underfitting Vs Overfitting Rigid models (eg linear regression) when inappropriate have high bias and relatively low variance New training samples tend to yield similar results Overly flexible models can reach the extreme of memorizing the learn data Low bias but high variance Data miner needs to be alert to signals of over- or underfitting and strike the right balance
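The tradeoff can be seen in a tiny sketch (an illustration in Python, not part of the SPM workflow): a rigid degree-1 polynomial versus an overly flexible degree-9 polynomial fit to the same noisy sample. The flexible fit always wins on training error, which is exactly why training error alone cannot signal overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)  # noisy sample

# Rigid model (high bias, low variance) vs overly flexible model (low bias, high variance)
rigid = np.polyval(np.polyfit(x, y, 1), x)
flexible = np.polyval(np.polyfit(x, y, 9), x)

mse_rigid = float(np.mean((y - rigid) ** 2))
mse_flexible = float(np.mean((y - flexible) ** 2))
# The flexible fit's training MSE is necessarily no worse (nested least squares),
# yet on a fresh sample drawn the same way it would typically do worse: the tradeoff.
```

On a new training sample the rigid line changes little (low variance) while the degree-9 curve swings with the noise; the balance point between the two is what the data miner must find.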
Decision Trees A major advance in analytical technology in which many important concepts of modern analytics were first clearly spelled out Learning machine is actually quite simple conceptually (but the details in making a successful implementation are challenging) Abandons the worlds of hypothesis testing and estimation of parameters (as conventionally understood) Several early versions which did not function well (ID3, AID) were followed by CART which perfected the methodology Paper by Jerome H. Friedman in 1975
Tools Used In This Introduction Decision Tree (single CART tree) MARS Adaptive Regression Splines Gradient Boosting (TreeNet Boosted Trees) RandomForests (Ensembles of CART trees) Regularized Regression (GPS Generalized PathSeeker) These tools can get you very far and cover classification, regression, and unsupervised learning (clustering) Neural Networks have cycled in and out of favor (can require considerable experience to learn to use well)
CAR_CLAIM DataSet Insurance related data focused on FRAUD 15,420 records 923 records labeled FRAUD Should keep in mind that not all FRAUD is caught so there may be a few FRAUD cases lurking among the so-called good records Have a classification problem with a minority of the data in the class of interest Data published on the CD ROM included with Dorian Pyle's (1999) Data Preparation for Data Mining http://hfs1.duytan.edu.vn/upload/ebooks/3836.pdf Data is on separate CD ROM Some used copies appear to be available on Amazon Book is focused principally on the specifics of data preparation for Neural Networks which have their own unique requirements
Salford Predictive Modeler Predictive Modeling package used for the examples Core Statistics Linear Regression Logistic Regression CART Decision Tree (original, by Jerome Friedman) MARS Spline Regression (original, by Jerome Friedman) GPS regularized regression (extended elastic net, Jerome Friedman) TreeNet gradient boosting machine (original, by Jerome Friedman) RandomForests (original, Breiman and Cutler) Automation and model acceleration
Open Raw Data: CarClaim.CSV Basic peek at data set to obtain main table dimensions (rows and columns)
Too many Character Variables We see that variables intended to capture numeric measures have been coded and imported as text
Request Descriptive Statistics Icon on toolbar for statistics. Here we request just numeric variables
Basic Stats We always check for prevalence of missing data Always review the number of distinct values (too few? too many?)
Detailed Stats and Tables
Data Prep/Data Repair No different from what any statistician would do in the earliest stages of data cleaning Remove inconsistent coding that varies across records Enforce consistent spelling of character values Check for missing value codes Some coding uses NULL for a valid value (e.g. NO or 0) In our case many of the character variables are intended to encode numeric information AGE of VEHICLE coded as text: new, 2 years, 3 years, 4 years, 5 years
Straightforward Recoding Here we use the built-in BASIC language and command builder/assistant
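The deck performs the recode in SPM's built-in BASIC; a hypothetical Python equivalent of the same idea is sketched below. The mapping covers only the codes shown on the previous slide; anything else is treated as missing, and the function name is illustrative.

```python
# Hypothetical Python version of the vehicle-age recode: turn the text-coded
# AGE of VEHICLE into a numeric variable. Only the codes listed in the slides
# are mapped; unrecognized codes are treated as missing (None).
VEHICLE_AGE_MAP = {
    "new": 0,
    "2 years": 2,
    "3 years": 3,
    "4 years": 4,
    "5 years": 5,
}

def recode_vehicle_age(text):
    """Return the numeric vehicle age, or None for an unrecognized code."""
    return VEHICLE_AGE_MAP.get(text.strip().lower())
```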
Prepped Data Set has 17 Numeric Variables Previously we had only 8 numeric variables Also requested generation of a SAMPLE$ variable (a random flag partitioning off the TEST data)
CORR for NUMERIC Vars
MDS Scaling of CORR Matrix Positions Variables Quick check for anything bizarre NDAYS_POL_ACCIDENT is at upper right far from other variables AGE and CAR_AGE are on the left
Build CART Model Select TARGET (Dependent variable) Avoid clearly inappropriate predictors (RECORDID), clones of TARGET
Test (or Validation) Method Normally reserve some data for testing (typically called validation data)
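A minimal sketch of such a reservation, assuming a SAMPLE$-style flag and a 20% test share (the function name and seed are illustrative, not SPM syntax):

```python
import random

def make_sample_flag(n_records, test_fraction=0.2, seed=13):
    """Randomly assign each record to LEARN or TEST (a SAMPLE$-style flag)."""
    rng = random.Random(seed)
    n_test = round(n_records * test_fraction)
    test_rows = set(rng.sample(range(n_records), n_test))
    return ["TEST" if i in test_rows else "LEARN" for i in range(n_records)]

# Flag the 15,420 CAR_CLAIM records
flags = make_sample_flag(15420)
```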
CART Model Learn and Test results respectable and also very close to each other Smallest node has 45 records, reasonable size (we can control this if we want) FOCUS class is YES for FRAUD (Blue= Not Fraud, Red= Fraud)
Quick Overview of Main Tree Logic: Variables Driving Model
Root Node Split Very Effective Very low FRAUD rate among insureds with just Liability coverage Need to grow a Probability Tree to make progress on left
Detailed Inspection of an Interesting Split Test data confirms relatively higher FRAUD risk for Chevy, Toyota, Acura But still low risk
Train vs Test Lift By Node: 9 Nodes versus 4 Nodes Simpler tree generalizes better at the node level If we were to deploy the larger tree we might wish to remap node 4 to its parent (a form of shrinking)
Tweaking the Tree BATTERY: Experimental Parameter Variation 35 pre-packaged experiments we might consider running to tweak model
BATTERY ROOT: Limited Look Ahead Dictate which variable splits the ROOT node YEAR in the root appears to give better performance but not ideal for prediction
BATTERY ONEOFF Build Trees on One Predictor Only Best predictors in isolation do make sense Allows for nonlinear relationship for continuous variables
BATTERY BOOTSTRAP: Bootstrap Resample Training Data/Test Data Fixed Assess performance of CART Tree via Bootstrap Resampling (100 times)
Variable Importance Averaging Over Trees Unambiguous ranking of importance of predictors
BATTERY PARTITION: Repartition Data Into Learn/Test (Sample Sizes Fixed) Performance evaluation over 100 splits of the data into learn and test (all same size)
Logistic Regression: All Variables in Model 83 coefficients estimated due to dummy variable expansion of categoricals. Test ROC is .775
Regularized Logistic Regression via GPS: 100 replications on different test partitions Median TEST ROC=.785, 5th pctile=.774, 95th pctile=.794, median coefs=19 Optimal models were LASSO or near-LASSO Slightly better performance than conventional logistic but a much smaller model
Generalized PathSeeker Generalized Elastic Net Even a 3-coefficient model can reach a test partition ROC of .764
LASSO Variable Importance Ranking
YES/NO Odds Graph Nice monotonic pattern
RandomForests: BATTERY NPREDS Here BAGGER works just as well or better than true RF. ROC=.804
RandomForests Performance Curve ROC test partition=.804
YES/NO Odds Graph
RandomForests Variable Importance Measure Forest is a collection of trees (for post-processing reasons we recommend at least 500 trees) Normal scoring: drop a record down each tree and compute number of votes for each possible target outcome Instead of normal scoring do the following for each tree and each variable in the model: Randomly scramble the values of a specified variable in place Summary statistics for that variable are unchanged but the values of that variable are now located on the wrong row Scrambling is repeated anew for each tree Compute number of votes for each target outcome for each record Compute deterioration of overall sample performance Most important variable would be hurt most by this scrambling
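The scrambling idea can be sketched on toy data (a fixed decision rule plays the role of one tree's vote; the data and names are illustrative, not the Breiman-Cutler implementation). Scrambling the variable the model relies on hurts accuracy; scrambling an ignored variable costs nothing.

```python
import random

# Toy data: the class depends only on X1; X2 is pure noise.
rng = random.Random(7)
x1 = list(range(100))
x2 = [rng.random() for _ in range(100)]
y = [1 if v >= 50 else 0 for v in x1]

def vote(a, b):
    """Stand-in for dropping a record down a tree: this rule only looks at X1."""
    return 1 if a >= 50 else 0

def accuracy(col1, col2):
    return sum(vote(a, b) == t for a, b, t in zip(col1, col2, y)) / len(y)

base = accuracy(x1, x2)                      # perfect on this toy data

shuffled_x1 = x1[:]
rng.shuffle(shuffled_x1)                     # values now sit on the wrong rows,
shuffled_x2 = x2[:]                          # but the marginals are unchanged
rng.shuffle(shuffled_x2)

drop_x1 = base - accuracy(shuffled_x1, x2)   # large deterioration: X1 matters
drop_x2 = base - accuracy(x1, shuffled_x2)   # zero: the rule ignores X2
```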
RandomForests Variable Importance Ranking Based on variable scrambling to measure loss of accuracy Scrambling an important variable should hurt accuracy more
Summary Data Preparation was essential to make data fully usable Series of models developed rapidly using automated search tools Judgment assisted perfection of model Variety of well performing models
KDDCup 1998 http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html Page contains data, documentation, description of the challenge (all freely downloadable) KDD conferences began in 1995 and hosted first data mining competition in 1997 KDDCup 1998 used exactly the same data as for 1997 but provided fuller documentation Top performers in 1998 had already analyzed data in 1997 Results of the 1998 competition posted at http://www.kdnuggets.com/meetings/kdd98/kdd-cup-98-results.html
KDDCup 1998 Data Set and Challenge Raw data contains 481 variables
            Learn sample                      Validation sample
N           95,412                            96,367
TARGET_B=0  90,569 (94.92%)                   91,494 (94.94%)
TARGET_B=1  4,843 (5.08%)                     4,873 (5.06%)
TARGET_D    Mean 15.62, IQR 10.00, SD 12.45   Mean 15.61, IQR 10.00, SD 15.51
Objective: Create the net-revenue-maximizing mailing list if each mail piece costs $0.68 to send Optimal decision rule: mail if expected value of donation > 0.68, defined as Prob(Respond)*E(gift|Respond)
Modeling Strategies Require two models RESPONSE to mailing BINARY YES/NO GIFTAMOUNT if responded REGRESSION conditional on response Naïve models ignore sample selection process Model each part separately and combine for final scoring Two stage model: Weight records in regression by inverse probability of sample inclusion Upweight records representative of those excluded Model each historical campaign Every list member responded with a gift at least once Every mailing represents a new opportunity to respond Have gift amount at least once for everybody Start with naïve two separate models for simplicity
KDDCup98 Possible Outcomes Perfect targeting would mail only to the 4,873 respondents: Total gifts $76,090, Mailing Cost $3,314, Net Revenue $72,776 Mail everyone: net revenue $10,560 Winner in 1998: $14,712, mailing 56,330 people (58.5% of list) Using routine methods in SPM (train just on learn sample): net revenue of $15,596, mailing 58,525 people Should beat the winner using SPM out of the box (train on learn)
Check Randomness of Data Partition Pool data into one file containing all data Create a LABEL variable, 1 for the learn sample and 0 otherwise Check if LABEL is predictable using any kind of model We just use all variables available Could of course run t-tests or single variable models looking for any problematic difference We found nothing of concern here But we have seen major problems in other partitioned data sets Something that was supposed to have been randomly determined was clearly not
Nature of Data-I Elementary Demographics AGE GENDER Number of children by age group (0-3, 4-7, 8-12, 13-18) STATE of Residence, ZIPCODE Homeowner/renter Household Income Salutation (Dr., Mr., Mrs., Admiral, etc) Census Tract Level Data 286 socioeconomic and demographic indicators covering ethnicity, occupation, industry of employment, type of housing
Nature of Data-II Behavioral Data on prior gifts and response patterns RFA style of data, details and coded into groups Recency of response -how recently from a given date Frequency of response -how frequently in previous 12-24 months Amount of gift when responded -dollar value of gift Campaign characteristics Offer type Calendars Stickers Christmas Cards Other types of cards, such as birthday, condolence, blank Notepad Thank You printed outside Date of mailing
Activity Window: Useful Launchpad to Next Actions RAW DATA: Select next action such as histograms, View Data, Summary Stats
Stats: Brief View Essential to review data for unexpected coding, quirks, special handling needed
Sort By %Missing Descending Might want to drop variables that are genuinely missing at some ultra-high rate But for variables with only 1 good level the missing usually means something -- Often presence/absence or for continuous measures might really signify a 0 Also useful to sort ascending to note vars which have no or very few missings
Diagnostic CART Run DEPTH=1 LIMIT command: ATOM=50 MINCHILD=30 DEPTH=1 Missing Values Controls: dummies for continuous, extra level for categorical Want to examine power in the ROOT node (nonlinear) Should run very fast Using the suggested controls yields very useful diagnostics: informative missingness, nonlinear relationships
Partition Test=.2 Some exploratory work needs to cover all the learn data Here we allocate a 20% random subset for testing
Ranking ROOT Node Splitters Top splitters are mostly missing. Need to consider why and maybe drop variables RDATE is date of response and missing if donor did not respond to that mailing
Model Setup PENALTY Tab Second stage model Penalize for missingness; penalize for high cardinality categorical
Key Points to Remember Working with RAW DATA Specify automatic creation of missing value dummy vars Dummying repairs all binary variables coded BLANK=0; for all other variables it tests for informative missingness Use a modeling tool that can handle missings CART, MARS, TreeNet have built-in missing handling CART methods most sophisticated and most flexible as can score any future pattern of missings MARS and TreeNet only handle missing patterns seen in learn data RandomForests uses imputation for missing value handling GPS is a regression/logistic regression tool and will listwise delete
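A minimal sketch of the missing-value dummy creation, assuming records stored as Python dicts with None or an empty string marking a missing value (the `_MIS` suffix is illustrative):

```python
def add_missing_dummies(records, variables):
    """For each variable, add VAR_MIS = 1 when the value is absent, else 0.
    This lets any learner test for informative missingness explicitly."""
    for row in records:
        for var in variables:
            val = row.get(var)
            row[var + "_MIS"] = 1 if val in (None, "") else 0
    return records

# Two toy records: a blank RDATE-style field and a missing AGE
rows = [{"RDATE_3": "", "AGE": 54}, {"RDATE_3": "9706", "AGE": None}]
rows = add_missing_dummies(rows, ["RDATE_3", "AGE"])
```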
ROOT Node Splitter Rankings (penalties active) Missing is now a predictive value. RFA_2$ has only 14 levels and so the penalty is inactive (NLEARN > 2^(K-1), where K=14) PEPSTRFL$ blank is predictive Best splitters have 0 missing and all look reasonable RDATE variables now appear only in the form: Missing or Not Missing (gave or not)
CART Run on All Variables Test Partition ROC=.5822 Kitchen sink model with PENALTIES on missing and HLCs Performance for a single tree is very good and hard to beat
BATTERY BOOTSTRAP and PARTITION 30 resamples, test partition fixed (shown below) 30 reruns, test partition varying Can we have confidence in our main CART result ROC=.5822? BATTERY BOOTSTRAP: Median ROC=.5761, 3rd rank=.5644, 27th rank=.5824 BATTERY PARTITION: Median ROC=.5911, 3rd rank=.5767, 27th rank=.6009
Data Prep-1 Here we report on the data prep we did to facilitate subsequent analysis Conversion of dates to days since January 1, 1960 to facilitate date arithmetic (aka SAS dates) Create REGION$ variable from STATE$ (South, West, Northeast, Midwest, Other) Conversion of ZIP to a number and extraction of ZIP1 (first digit) and ZIP3 (first 3 digits) Dummy (0/1) recoding of variables that use 1 vs BLANK coding Break various RFA codes into 3 separate variables (R version, F version, A version)
Exploding RFA L3F becomes three separate variables
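A sketch of the explode step, assuming the three positions of an RFA code are the Recency, Frequency, and Amount components described earlier (the function name is illustrative):

```python
def explode_rfa(code):
    """Split an RFA code such as 'L3F' into its Recency, Frequency,
    and Amount components (one character each)."""
    code = code.strip().upper()
    if len(code) != 3:
        raise ValueError("expected a 3-character RFA code, got %r" % code)
    return code[0], code[1], code[2]

r, f, a = explode_rfa("L3F")   # three separate variables from one code
```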
Data Prep -2 Extract separate Socioeconomic Status (SES) and Urbanicity from the combined variable DOMAIN Calculate average, min, max solicited donations Check for unsolicited donations, count, stats of gift amounts Create trend variables in donation amount, frequency Create one principal component per group of Census vars Create TREND vars for Recency, Frequency, and Amount variables Create overall summaries of RFA dimensions limited to last 24 months
Census Variables Examples Neighborhood level variables POP901 Number of Persons ETH1 Percent White AGE901 Median Age of Population CHIL1 Percent Children Under Age 7 HHN1 Percent 1 Person Households MARR1 Percent Married DW1 Percent Single Unit Structure HV1 Median Home Value in hundreds HVP1 Percent Home Value >= $200,000 RP1 Percent Renters Paying >= $500 per Month IC1 Median Household Income in hundreds TPE1 Percent Driving to Work Alone Car/Truck/Van LFC1 Percent Adults in Labor Force OCC1 Percent Professional EIC1 Percent Employed in Agriculture EC1 Median Years of School Completed by Adults 25+ VC1 Percent Vietnam Veterans Age 16+ ANC1 Percent Dutch Ancestry LSC1 Percent English Only Speaking
Principal Components Created Census variables only Overall Group-specific for 17 subgroups of census variables Also, judgmental selection of a few solo variables PRIN1 PRIN5 over all census variables PRINxx1 group-specific first principal component only
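A sketch of computing one group-specific first principal component with plain numpy (standardize the block of census variables, then project onto the top eigenvector of its covariance); the function name and toy data are illustrative:

```python
import numpy as np

def first_principal_component(X):
    """Score each row on the first principal component of a standardized
    block of variables (e.g. one subgroup of census vars)."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    pc1 = eigvecs[:, -1]                        # eigh returns ascending order
    return Z @ pc1                              # one score per record

rng = np.random.default_rng(1)
block = rng.normal(size=(50, 4))               # toy "census subgroup" of 4 vars
scores = first_principal_component(block)
```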
CART on Census Principal Component Vars Ranks: Ethnicity, Home Costs, Education, Income, Omnibus. ROC=.5288
TreeNet Variable Importance Rankings: ROC=.5520 Ranking: Income, Housing, Transportation Ancestry, Labor Force, Household, Ethnicity, Interests Compared to using all raw variables these are quite effective
Data Prep-3 Create a new data set which contains one record per mailing per donor Donor who was mailed 5 times would have 5 rows of data By definition each donor responded at least once to a campaign Can build donation models using donation data from every person in data set In original data format TARGET_D is available only for the subset of responders to the most recent campaign Alternative approach did not improve test sample performance and so was abandoned But modeling experiments on this data helped refine predictor list
TreeNet Prepped Data: ROC=.6256 Some selection of variables, dropping of redundant variables Very difficult to beat this model (3 rd decimal place ROC only) But can make model performance more stable by trimming predictor list
Prepped TreeNet Variable Importances Large predictor list
Wrapper Vs Filter Feature Selection Filters treat each predictor in isolation and are thus vulnerable to missing variables important only via interactions Wrapper method uses a model to select variables, typically repeatedly and recursively Build model with many variables and rank all by importance Remove some variables from bottom of list and repeat We call this variable shaving and often run process removing one variable at a time until only one left Recursive Feature Elimination Judgment to select smallest defensible model (tradeoff accuracy for substantial simplification)
BATTERY SHAVING: Wrapper Method for Variable Selection Recursive Feature Elimination phrase also seen in literature For rapid scan of data we removed 5 predictors at each step Shaving from the BOTTOM of the ranked list of predictors Shaving from the TOP can also be helpful and enlightening Sometimes removing most important predictors improves generalization error
Tabular Display of Backward Shaving Five variables eliminated in each step Eliminating more than one variable per back step is a rough-and-ready way to make rapid progress Could drop many more when working with tens of thousands of predictors Final refinements are better done dropping ONE variable per back step
SHAVE ERROR and LOVO Instead of dropping the least important predictor we could TEST which variable to drop by running a LOVO experiment Leave One Variable Out (LOVO) With 20 predictors: Shave via Importance Ranking: 20 models required Shave via LOVO: (20*21)/2 = 210 models required SHAVE ERROR number of models is quadratic in K SHAVE ERROR is repeated LOVO We generally shave from the bottom to reach a reduced set of predictors and use SHAVE ERROR for the final refinement
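The two procedures and their model counts can be sketched as follows (the `rank_importance` callback stands in for refitting the model and ranking predictors at each step; all names are illustrative):

```python
def shave_by_importance(variables, rank_importance, steps):
    """Backward shaving: at each step, fit one model, rank the predictors,
    and drop the one at the bottom of the importance list."""
    models_fit = 0
    for _ in range(steps):
        ranked = rank_importance(variables)   # one model fit per step
        models_fit += 1
        variables = ranked[:-1]               # shave the least important
    return variables, models_fit

def lovo_model_count(k):
    """SHAVE ERROR (repeated LOVO): with j variables left, fit j candidate
    models (leave each out in turn), so k + (k-1) + ... + 1 = k(k+1)/2."""
    return k * (k + 1) // 2

# Toy run: alphabetical "importance", shave two steps
kept, models = shave_by_importance(list("abcde"), sorted, 2)
```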
Best Performing Model TARGET_B: ROC=.6293 More stable than previous model using many more predictors Can cut back to 12 predictors and still maintain ROC=.6228
12 variable RESPONSE Model
TreeNet Partial Dependency Plots
Dependency Plots-2
Dependency Plots-3
KDD98 Objective: Maximum Net Revenue Objective is not simply to maximize response rate Want to select mailing list based on expected donation Cost of mailing is $0.68 so maximizing net revenue means mailing to those whose expected donation>$0.68 Have moderately good response model after much experimentation Now need donation model which will be a regression Simple approach is to construct Prob(response) * E(Gift | Predictors, Response=1) Might want to factor in sample selection bias as the regression on TARGET_D is fit to the 5.08% of the prospect list who actually responded
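The naive two-model combination and the $0.68 decision rule can be written out directly (the cost constant comes from the challenge description; function names are illustrative):

```python
MAIL_COST = 0.68  # cost per mail piece, from the KDDCup 1998 challenge

def expected_net_revenue(p_respond, expected_gift_if_respond):
    """Naive two-model combination: E[donation] = Prob(respond) * E(gift | respond)."""
    return p_respond * expected_gift_if_respond - MAIL_COST

def should_mail(p_respond, expected_gift_if_respond):
    """Mail exactly when the expected donation exceeds the cost of the piece."""
    return p_respond * expected_gift_if_respond > MAIL_COST
```

For example, at the base response rate of about 5% and the mean gift of $15.62, the expected donation is $0.78, so mailing the average list member is (barely) profitable, which is why the "mail all" strategy still nets a positive $10,560.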
TARGET_D Amount of Gift to Campaign 97 Learn Data N=4,843
Percentile        Gift Amount
100% (Max)        200
99%               50
97.5%             46
95%               32
90%               25
75% (Q3)          20
50% (Median)      13
25% (Q1)          10
10%               5
5%                5
2.5%              4
1%                3
0% (Min)          1
80% of all gifts between $5 and $25 Smallest gift $1.00, so if we knew for sure someone will give we should mail 2nd model required to perfect the targeted mailing list Sample size much smaller, so a simpler model is probably required
Kitchen Sink Model Prepped Data Exclude only Raw variables with processed versions TreeNet Test Sample MSE=77.229 CART MSE=94.292 No point in pursuing a CART model here
Outlier Analysis: Lift (Percent of Prediction Error/Percent of Data) N Test sample=964 Just 10 records account for 58.33% of SSE (sum of squared prediction errors)
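The lift statistic used here, share of squared error divided by share of the data, can be sketched as follows (the residuals in the example are illustrative, not from the KDD data):

```python
def error_lift(residuals, top_n):
    """Lift = (share of total squared error carried by the top_n
    worst-predicted records) / (share of the records they represent)."""
    sq = sorted((r * r for r in residuals), reverse=True)
    share_sse = sum(sq[:top_n]) / sum(sq)
    share_data = top_n / len(sq)
    return share_sse / share_data

# One big-miss record among five: it alone carries most of the SSE
lift = error_lift([10, 1, 1, 1, 1], top_n=1)
```

A lift far above 1 for a handful of records, as on this slide, flags those records as outliers worth individual inspection.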
Largest Positive Residuals Larger Than Negative
Shaving Five Variables Every Step: For a quick search for a smaller, better model Test sample MSE=76.598 with 50 predictors
Imposing Additivity Constraints: TARGET_D Backwards stepwise imposition of additivity. Slight improvement obtained when two variables are constrained. Fully additive model is only slightly worse
Sequence of Models Built Raw Data Largest plausible KEEP list on Prepared Data Shaving (Backwards Feature Elimination) to Moderate number of variables Judgmentally selected smallest plausible model Less likely to be overfit and likely to have smallest prediction variance Raw data models were inferior to refined models Looked at Validation results only after Moderate models constructed
Results of Several Models Results of models following recommended procedures BATTERY RELATED is intended to correct for sampling bias and uses 114 variables in the KEEP list
Model                                         # sent (validate)   profit (validate)
Combination                                   58,525              $15,596.95
BATTERY RELATED Inverse                       57,682              $15,622.19
BATTERY RELATED no weight                     58,145              $15,848.17
RA Best                                       54,739              $14,781.25
Smallest Keeplist tgtb18tlud17th              56,613              $15,318.11
Smallest Keeplist tgtb18tlud17tlad            53,983              $15,037.01
Smallest Keeplist tgtb18tlud156th             53,823              $15,095.81
Observe that every model reported does better than the original winners ($14,712) Any user modeling with TreeNet and everyday data prep and model selection should beat the winners in 2-3 rounds of model refinement