Ensemble Modeling with R

Transcription

1 Doctoral Candidate/Merchandise Data Scientist MatthewALanham.com Virginia Tech Department of Business Information Technology Advance Auto Parts, Inc.

2 Outline
- My Background and Research
- Pros and Cons of R for Data Science
- Modeling Using the CRISP-DM Framework
- What is Ensemble Modeling?
- Fitting Models
- Bagging a Decision Tree
- Optimal Decision Cut Points for Binary Classification
SEPTEMBER 15,

3 Background
(2005) B.A. Economics/Mathematics, Indiana University-Bloomington
( ) Genscape, Inc., Louisville, KY: energy transparency start-up
( ) M.S. Biostatistics-Decision Science, University of Louisville
( ) M.S. Statistics, Virginia Tech
( Current) Ph.D. Business Information Technology, Virginia Tech
( Current) Advance Auto Parts, Inc.: Fortune 500 retailer (#402.. for now)
Research Focus
How can we build better predictive models that are empirically sound (stats) as input parameters to prescriptive models that are process representative (optimization), in order to provide the best (maintainable, timely, scalable, KPI fused) decision support for a retailer's assortment plan?
- Why is assortment planning so important?
- Why is the assortment planning problem so challenging?
- Where does Data Science & Big Data Analytics (BDA) come into play?
- Where does Information Technology (IT) come into play?
- Where does Business come into play?

4 Predictive and Prescriptive Analytics
INTEGRATING PREDICTIVE AND PRESCRIPTIVE ANALYTICS
[Diagram labels, in slide order]
- Determine Optimal Solution (loop); Search Algorithm
- Prescriptive Model: Optimality Conditions; Decision Criteria: Max Profit
- Decision Variables: 1) Assortment 2) Prices 3) Promotion 4) Shelf Space*
- Decision Model; Sales Model
- Performance Measures: 1) Obj. Function - Revenue 2) Constraints - Budget(s)
- Estimation Model; Data
- Predictive Model(s): Market Specs; Demand Model; Preference Structure; Similarity Measures; Utility Model; Demand Forecast Parameter(s)
- Time Summary (TS), e.g., Sum, Avg.: time summary performance measures (TSPM)
- Scenario Summary (SS), e.g., Sum, Avg.: scenario summary performance measures (SSPM)
Performance measures are functions over a time horizon, or random variables that must be summarized over their distributions.

5 Oracle + SPSS Modeler
WHAT I'M WORKING WITH CURRENTLY
My opinion: IBM SPSS Modeler and SAS Enterprise Miner are:
1) Great for teaching
2) Great for stand-alone data mining projects
3) Visually appealing to management
4) Not great for real-time production analytics
5) Not great for customized solutions
6) Not designed for prescriptive analytics
[Figure: An example of an IBM SPSS Modeler stream building predictive models]
SEPTEMBER 17,

6 Data Mining, Data Science, and Predictive Modeling with R
DATA SCIENCE WITH R
R is an open-source, freely available language for statistical and mathematical computing, released under the GNU General Public License, version 2 (Ihaka & Gentleman, 1996).
R runs on many operating systems, including Windows, Macintosh, Unix, and Linux.
According to Eric Siegel, author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, R is "the leading free, open-source software tool for PA (Predictive Analytics)" and "has a rapidly expanding base of users as well as enthusiastic volunteer developers who add to and support its functionalities" (Siegel, 2013).
Today there are several thousand user-developed packages (also referred to as libraries). Packages are collections of R functions, compiled code, and data put together in a specific format following CRAN's guidelines.
You can search for packages by application area. As of July 2014, there are 33 different application areas. The Machine Learning application area alone lists 72 packages, with functions for nearly any methodology. Many newer techniques are available here that are not available in commercial software packages.
Cons
- Memory, memory, memory! See memory_example.r

7 Data Mining, Data Science, and Predictive Modeling with R
WHAT IS ENSEMBLE MODELING?
Ensemble methods train multiple predictive models and then combine their predictions to achieve higher overall performance and stability.
Pros
- Ensemble methods require little tuning
- Ensemble methods operate on a variety of input types (categorical variables, integers, and real numbers)
- Ensemble methods can be used on a variety of problems (binary and multi-class classification and rankings, regression, etc.)
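To make "combine the predictions" concrete, here is a minimal sketch in base R. The three probability vectors are made-up numbers standing in for the outputs of three already-fitted models (the names p_c50, p_logit, and p_cart are hypothetical); the sketch only illustrates two common combination rules, averaging and majority voting, and is not the presenter's code.

```r
# Hypothetical predicted P(sold = 1) for five store-SKU cases from three models
p_c50   <- c(0.9, 0.4, 0.2, 0.6, 0.7)
p_logit <- c(0.8, 0.6, 0.1, 0.4, 0.3)
p_cart  <- c(0.7, 0.7, 0.3, 0.4, 0.8)
probs <- cbind(p_c50, p_logit, p_cart)

# Rule 1: average the probabilities, then apply a 0.50 cutoff
avg_prob  <- rowMeans(probs)
avg_class <- as.integer(avg_prob > 0.5)

# Rule 2: each model votes with its own 0.50 cutoff; the majority wins
votes      <- rowSums(probs > 0.5)
vote_class <- as.integer(votes > ncol(probs) / 2)

print(data.frame(avg_prob = round(avg_prob, 3), avg_class, vote_class))
```

In this toy case both rules agree on every row, but they can disagree when one model is very confident and the other two are only mildly on the opposite side.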

8 CRISP-DM
DATA MINING FRAMEWORK
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a general data mining process model that can be applied to solve any business problem.
There are other popular data mining and analytics process models, such as Sample-Explore-Modify-Model-Assess (SEMMA), but in my opinion CRISP-DM is more structured and detailed.
CRISP-DM was created and modified over time by leading practitioners and researchers in the data mining field and has been shown to lead to analytical results that align with business objectives.
The CRISP-DM process model and techniques primarily fall under the predictive analytics domain of business analytics, where the objective is to help organizations predict future events and proactively act upon such insights in a systematic fashion to drive better business outcomes (Provost & Fawcett, 2013). However, this process could be extended to prescriptive (i.e., optimization) analytics endeavors as well.

9 CRISP-DM
CRISP-DM DETAILED VIEW
[Figure: CRISP-DM Model Phases and Tasks. Source: Modified from ( )]

10 Business Understanding
BUSINESS UNDERSTANDING
Business Objectives
- Retail assortment planning, at the most basic level, asks which products to offer and how many (Mantrala et al., 2009).
- Assortment planning is one of the most important decisions faced by retailers (Sauré & Zeevi, 2013).
- Because of financial and physical capacity constraints, a retailer does not have the operational ability to stock, let alone hold in store, every possible product a consumer may desire (Sauré & Zeevi, 2013).
- You must get the project sponsor to detail the business success criteria. It's not some predictive model accuracy statistic. Examples: increased sales of X% at Stores Y and Z; reduced non-working inventory of W% at Stores Y and Z.
Assess Situation
- May use R and any of its available packages
- Deadline is September 17th at Meetup
- Competition winner gets $100, losers learn something, speaker gets feedback
Data Mining Goals
- Determine the best overall test accuracy on a 10% out-of-training set
- To qualify, neither sensitivity nor specificity may fall below 0.70 on the out-of-training set.
Project Plan
- Lay out your expected work schedule, breaks, etc.
- Will vary depending on your experience using R
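The data mining goals above can be checked mechanically. The following base-R sketch computes accuracy, sensitivity, and specificity from actual and predicted classes and applies the 0.70 floor. The function names (classification_metrics, qualifies) and the ten example labels are illustrative assumptions, not part of the competition code.

```r
# Accuracy, sensitivity, and specificity for a binary target coded 1/0
classification_metrics <- function(actual, predicted) {
  tp <- sum(actual == 1 & predicted == 1)  # true positives
  tn <- sum(actual == 0 & predicted == 0)  # true negatives
  fp <- sum(actual == 0 & predicted == 1)  # false positives
  fn <- sum(actual == 1 & predicted == 0)  # false negatives
  c(accuracy    = (tp + tn) / length(actual),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# An entry qualifies only if both rates reach the floor (0.70 in the slide)
qualifies <- function(metrics, floor = 0.70) {
  metrics[["sensitivity"]] >= floor && metrics[["specificity"]] >= floor
}

# Made-up labels for ten held-out cases
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)

m <- classification_metrics(actual, predicted)
print(round(m, 3))
print(qualifies(m))
```

Here accuracy is 0.70 and sensitivity is 0.75, but specificity is 4/6, so this hypothetical entry would not qualify even though its accuracy looks acceptable.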

11 Data Understanding
DATA UNDERSTANDING
Collect Data
Describe Data
Y/X | Variable | Description
    | store_number | A unique store identifier
    | sku_number | A unique SKU identifier
Y   | SOLD | Whether the SKU sold in the respective store (1=yes, 0=no) in the last 13 periods after it was replenished/maxied.
    | NUM_SOLD | The number of realized unit SKU sales for a respective store over the past 1-13 periods.
X   | NUM_SOLD_LAST | The number of realized unit SKU sales for a respective store over the past periods.
X   | application_count | The total number of different year-make-model vehicle options that the respective SKU could be used for.
X   | projected_growth_pct | The projected percentage growth for this SKU in the next 13 periods based on financial experts.
X   | offset | For each store-SKU, the positive deviation based on unit sales from the center of the part-type-specific distribution.
X   | adjusted_offset | For each store-SKU, the positive deviation based on unit sales from the center of the part-type-specific distribution, adjusted based on an ad-hoc calculation.
X   | unit_sales_py | The total number of units sold for this particular SKU over all stores between the past 27 and 39 periods.
X   | unit_sales_cy | The total number of units sold for this particular SKU over all stores between the past 14 and 26 periods.
X   | unit_sales_fy | The total number of units sold for this particular SKU over all stores over the past 13 periods.
X   | total_vio | The total number of "estimated" vehicles in operation associated with a particular store based on an ad-hoc calculation.
X   | adjusted_total_vio | The total number of "estimated" vehicles in operation associated with a particular store based on an ad-hoc calculation.
X   | vio_compared_to_cluster | The percentage of vehicles in operation (VIO) for a respective store compared to the total number of VIO for all stores associated with a cluster over the past 14 to 26 periods.
X   | avg_cluster_cy_unit_sales | The average number of SKUs sold based on a clustering of all stores over the past 13 to 26 periods.
X   | avg_cluster_cy_total_sales | The average number of total sales, which is a combination of unit and lost sales, based on store clusters over the past 14 to 26 periods.
X   | avg_cluster_cy_lost_sales | The average number of lost sales, clustered by all stores over the past 14 to 26 periods.
X   | pop_est_cy | Estimated number of persons in the population where the store is located based on the latest period.
X   | pop_density_cy | Estimated density (a percentage) of the population where the store is located based on the latest period.
X   | pct_white | Estimated number of Caucasian-identified persons where the store is located based on the latest period.
X   | age | Estimated median person-age where the store is located based on the latest period.
X   | pct_college | Estimated percentage of college-educated persons where the store is located based on the latest period.
X   | pct_blue_collar | Estimated percentage of blue-collar type workers where the store is located based on the latest period.
X   | median_household_income | Estimated median household income where the store is located based on the latest period.
X   | establishments | Estimated number of physical locations where business is conducted or where services or industrial operations are performed where the store is located based on the latest period.
X   | road_quality_index | A measure of the quality of the roads in the area where the store is located.
Usually you will create such a table yourself, but make it more descriptive. The data scientist will ask the domain expert(s) questions such as:
- What are the variables' units of measure?
- Where does the data come from?
- When is it updated?
- How and why was clustering performed a particular way?
- How and why was a variable adjusted?

12 Data Understanding
DATA UNDERSTANDING (CONT.)
Explore Data
Matt's source code: Find the main.r
Data Quality
    DataQualityReport(skus)
    DataQualityReportOverall(dataSetName=skus)
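The transcript does not show what DataQualityReport() returns. As a rough idea of what a per-variable quality report might contain, here is a base-R sketch; the function quality_report, its columns, and the toy data frame are all assumptions made for illustration, not the presenter's implementation.

```r
# Toy stand-in for the skus data set (values are made up)
toy <- data.frame(
  store_number = c(1, 2, 3, 4),
  num_sold     = c(5, NA, 0, 2),
  region       = c("a", "b", NA, "b"),
  stringsAsFactors = FALSE
)

# Hypothetical per-variable report: type, missing count, completeness, unique values
quality_report <- function(df) {
  data.frame(
    variable     = names(df),
    type         = vapply(df, function(col) class(col)[1], character(1)),
    n_missing    = vapply(df, function(col) sum(is.na(col)), integer(1)),
    pct_complete = vapply(df, function(col) 100 * mean(!is.na(col)), numeric(1)),
    n_unique     = vapply(df, function(col) length(unique(col[!is.na(col)])), integer(1)),
    row.names    = NULL
  )
}

report <- quality_report(toy)
print(report)
```

A table like this makes it easy to spot variables with many missing values before deciding whether to drop incomplete rows.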

13 Data Preparation
DATA PREPARATION
Data Description
Y/X | Variable | Description
    | store_number | A unique store identifier
    | sku_number | A unique SKU identifier
Y   | SOLD | Whether the SKU sold in the respective store (1=yes, 0=no) in the last 13 periods after it was replenished/maxied.
    | NUM_SOLD | The number of realized unit SKU sales for a respective store over the past 1-13 periods.
X   | NUM_SOLD_LAST | The number of realized unit SKU sales for a respective store over the past periods.
X   | application_count | The total number of different year-make-model vehicle options that the respective SKU could be used for.
X   | projected_growth_pct | The projected percentage growth for this SKU in the next 13 periods based on financial experts.
X   | offset | For each store-SKU, the positive deviation based on unit sales from the center of the part-type-specific distribution.
X   | adjusted_offset | For each store-SKU, the positive deviation based on unit sales from the center of the part-type-specific distribution, adjusted based on an ad-hoc calculation.
X   | unit_sales_py | The total number of units sold for this particular SKU over all stores between the past 27 and 39 periods.
X   | unit_sales_cy | The total number of units sold for this particular SKU over all stores between the past 14 and 26 periods.
    | unit_sales_fy | The total number of units sold for this particular SKU over all stores over the past 13 periods.
X   | total_vio | The total number of "estimated" vehicles in operation associated with a particular store based on an ad-hoc calculation.
X   | adjusted_total_vio | The total number of "estimated" vehicles in operation associated with a particular store based on an ad-hoc calculation.
X   | vio_compared_to_cluster | The percentage of vehicles in operation (VIO) for a respective store compared to the total number of VIO for all stores associated with a cluster over the past 14 to 26 periods.
X   | avg_cluster_cy_unit_sales | The average number of SKUs sold based on a clustering of all stores over the past 13 to 26 periods.
X   | avg_cluster_cy_total_sales | The average number of total sales, which is a combination of unit and lost sales, based on store clusters over the past 14 to 26 periods.
X   | avg_cluster_cy_lost_sales | The average number of lost sales, clustered by all stores over the past 14 to 26 periods.
X   | pop_est_cy | Estimated number of persons in the population where the store is located based on the latest period.
X   | pop_density_cy | Estimated density (a percentage) of the population where the store is located based on the latest period.
X   | pct_white | Estimated number of Caucasian-identified persons where the store is located based on the latest period.
X   | age | Estimated median person-age where the store is located based on the latest period.
X   | pct_college | Estimated percentage of college-educated persons where the store is located based on the latest period.
X   | pct_blue_collar | Estimated percentage of blue-collar type workers where the store is located based on the latest period.
X   | median_household_income | Estimated median household income where the store is located based on the latest period.
X   | establishments | Estimated number of physical locations where business is conducted or where services or industrial operations are performed where the store is located based on the latest period.
X   | road_quality_index | A measure of the quality of the roads in the area where the store is located.
    ## keep only rows with no missing values, then re-run the overall report
    skus = skus[which(complete.cases(skus)),]
    DataQualityReportOverall(dataSetName=skus)

14 Modeling
MODELING
Modeling Techniques
- C5.0 Decision tree
- Logistic Regression
- CART Decision tree

15 Modeling
MODELING (CONT.)
Design
When building and testing predictive models using observational data (i.e., data that is not controlled as in laboratory experimentation), the question that must be answered is: how valid is my model with regard to what will happen next?
In a properly designed and controlled experiment, the data (samples) used in the experiment are used to make inferences about the population. Regardless of how large or small the sample is compared to the true size of the population, this single randomly selected subset of the population allows generalization to the remaining data not used in the study.
Cross-validation is the most practical and cost-effective means of obtaining a proxy for truth in predictive analytics.
    ## Randomly partition data into training and test sets
    my_seed =
    skus = GenerateTTV(dataSetName=skus, response='sold', trainpct=.90, testpct=.10, my_seed)
    GeneratePartitionPcts(dataSetName=skus)
Using the training-data error rate as a proxy for a model's generalization error is not wise, especially when the training error is low to almost perfect. Most likely the model has been overfit (or overtrained) and will not perform as well when new examples are fed through and evaluated from a validation data set (Zhou, 2012).
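GenerateTTV() is the presenter's own helper and its internals are not shown. As a sketch of what a 90/10 random partition can look like in base R, the hypothetical function below assigns a train/test label within each class of the response, so both partitions keep roughly the same share of sold SKUs. Whether the original helper stratifies this way is an assumption; the seed and toy data are arbitrary.

```r
set.seed(2014)  # arbitrary seed, chosen only for this illustration

# Toy data: 60 unsold and 40 sold store-SKU rows
toy <- data.frame(sold = rep(c(0, 1), times = c(60, 40)), x = rnorm(100))

# Hypothetical helper: label ~train_pct of each class "train", the rest "test"
assign_partition <- function(response, train_pct = 0.90) {
  partition <- rep("test", length(response))
  for (cls in unique(response)) {
    idx <- which(response == cls)
    n_train <- round(train_pct * length(idx))
    partition[idx[sample.int(length(idx), n_train)]] <- "train"
  }
  partition
}

toy$spss_partition <- assign_partition(toy$sold)
print(table(sold = toy$sold, partition = toy$spss_partition))
```

The column name spss_partition mirrors the one used in the slides' code so the later subsetting expressions would apply unchanged.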

16 Modeling
MODELING (CONT.)
Design
    ## Percentage of Target is 1 (or Y='SOLD') in total data set
    dim(skus[which(skus$sold==1),])[[1]] /
      dim(skus[which(skus$sold==0 | skus$sold==1),])[[1]]
    ## Percentage of Target is 1 (or Y='SOLD') in training set
    dim(skus[which(skus$sold==1 & skus$spss_partition=='train'),])[[1]] /
      dim(skus[which((skus$sold==1 | skus$sold==0) & skus$spss_partition=='train'),])[[1]]
    ## Percentage of Target is 1 (or Y='SOLD') in test set
    dim(skus[which(skus$sold==1 & skus$spss_partition=='test'),])[[1]] /
      dim(skus[which((skus$sold==1 | skus$sold==0) & skus$spss_partition=='test'),])[[1]]
    ## remove independent variables that you don't want to use
    names(skus)
    skus2 = skus[,c(3,5:19,21:28)]
    head(skus2)
    skus2$sold = as.factor(skus2$sold)
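The same class-balance check can be written more compactly. The sketch below uses a made-up ten-row data frame with the slide's column names (sold, spss_partition) and assumes sold has no missing values; it is offered as an alternative phrasing, not as the presenter's code.

```r
# Toy data with the same column names as the slides (values are made up)
toy <- data.frame(
  sold           = c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  spss_partition = rep(c("train", "test"), times = c(8, 2))
)

# Share of rows with sold == 1 in the whole data set
overall_rate <- mean(toy$sold == 1)

# Share of rows with sold == 1 within each partition
rate_by_partition <- tapply(toy$sold == 1, toy$spss_partition, mean)

print(overall_rate)
print(rate_by_partition)
```

Comparing the per-partition rates against the overall rate is a quick way to confirm that the random split did not distort the target distribution.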

17 Modeling
MODELING (CONT.)
    ## set up data for algorithms
    trainx = skus2[which(skus2$spss_partition=='train'),2:(length(skus2)-1)]
    trainy = skus2[which(skus2$spss_partition=='train'),'sold']
    train = cbind(trainy,trainx)
    testx = skus2[which(skus2$spss_partition=='test'),2:(length(skus2)-1)]
    testy = skus2[which(skus2$spss_partition=='test'),'sold']
    test = cbind(testy,testx)

18 Modeling
MODELING (CONT.)
Build Models: C5.0 Decision tree
    require(C50)
    # Fit classification tree models or rule-based models using Quinlan's C5.0 algorithm
    C5Params = C5.0Control( )
    C5 = C5.0(x=trainx, y=trainy
              ,control=C5Params  # control parameters defined above
              ,trials=1          # number of boosting iterations; 1 implies a single model is used
              )
    summary(C5)
Changing the trials from 1 to 1000 doesn't change the result in this case.
[Callouts on the summary(C5) output]
- Overall error rate
- The confusion matrix is based on a decision cutoff threshold of 0.50
- Variables that were used to create the tree

19 Modeling
MODELING (CONT.)
Build Models: C5.0 Decision tree
    ## training probabilities and predicted classes
    C5trainp = predict(C5, newdata = trainx, trials = C5$trials["Actual"]
                       ,type = "prob"   # either "class" for the predicted class or "prob" for model confidence values
                       ,na.action = na.pass)[,2]
    C5trainc = predict(C5, newdata = trainx, trials = C5$trials["Actual"]
                       ,type = "class"  # either "class" for the predicted class or "prob" for model confidence values
                       ,na.action = na.pass)
    ## testing probabilities and predicted classes
    C5testp = predict(C5, newdata = testx, trials = C5$trials["Actual"]
                      ,type = "prob"
                      ,na.action = na.pass)[,2]
    C5testc = predict(C5, newdata = testx, trials = C5$trials["Actual"]
                      ,type = "class"
                      ,na.action = na.pass)

20 Modeling
MODELING (CONT.)
Build Models: Logistic Regression
    logit.fit = glm(trainy ~ NUM_SOLD_LAST+TOTAL_VIO+ADJ_TOTAL_VIO+VIO_COMPARED_TO_CLUSTER
                    +POP_EST_CY+POP_DENSITY_CY+PCT_WHITE+AGE+PCT_COLLEGE+PCT_BLUE_COLLAR
                    +MEDIAN_HOUSEHOLD_INCOME+ESTABLISHMENTS+ROAD_QUALITY_INDEX
                    +APPLICATION_COUNT+PROJECTED_GROWTH_PCT+UNIT_SALES_CY+UNIT_SALES_PY
                    +OFFSET+ADJUSTED_OFFSET+AVG_CLUSTER_CY_UNIT_SALES+AVG_CLUSTER_CY_TOTAL_SALES
                    #+AVG_CLUSTER_CY_LOST_SALES
                    ,family = binomial
                    ,data = train)
    summary(logit.fit)

21 Modeling
MODELING (CONT.)
Build Models: CART
    require(rpart)
    set.seed( )
    tree = rpart(trainy ~ ., data = train, method = "class", cp = 0,
                 minsplit = 4, minbucket = 2, parms = list(prior=c(0.5, 0.5)))
    #summary(tree) <- this will take awhile
    ## find the best pruned tree
    i.min = which.min(tree$cptable[,"xerror"])
    i.se = which.min(abs(tree$cptable[,"xerror"] -
                         (tree$cptable[i.min,"xerror"] + tree$cptable[i.min,"xstd"])))
    alpha.best = tree$cptable[i.se, "CP"]
    tree.p = prune(tree, cp=alpha.best)
    ## obtain predictions
    treetrainp = predict(tree.p, train)[,2]
    treetrainc = treetrainp
    treetrainc[which(treetrainc>.5)] = 1
    treetrainc[which(treetrainc<=.5)] = 0
    treetestp = predict(tree.p, test)[,2]
    treetestc = treetestp
    treetestc[which(treetestc>.5)] = 1
    treetestc[which(treetestc<=.5)] = 0
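The pruning step above selects the row of the complexity table whose cross-validated error is closest to the minimum error plus one standard error. The sketch below replays that selection on a small, made-up matrix with the same column names as rpart's cptable, and compares it with the more commonly described one-standard-error rule (the smallest tree whose error is within one standard error of the minimum). The numbers are hypothetical and the comparison is an illustration, not a statement about the presenter's intent.

```r
# Hypothetical complexity table, laid out like rpart's tree$cptable
cptable <- cbind(
  CP     = c(0.30, 0.10, 0.05, 0.01, 0.00),
  nsplit = c(0, 1, 2, 4, 9),
  xerror = c(1.00, 0.72, 0.60, 0.55, 0.62),
  xstd   = c(0.05, 0.045, 0.04, 0.04, 0.04)
)

# Row with the smallest cross-validated error
i.min  <- which.min(cptable[, "xerror"])
target <- cptable[i.min, "xerror"] + cptable[i.min, "xstd"]

# Selection used in the slide: row whose xerror is closest to min + 1 SE
i.se <- which.min(abs(cptable[, "xerror"] - target))

# Common one-standard-error rule: first (smallest) tree with xerror <= min + 1 SE
i.1se <- min(which(cptable[, "xerror"] <= target))

print(c(closest_rule_cp = cptable[i.se, "CP"], one_se_rule_cp = cptable[i.1se, "CP"]))
```

With these numbers the two rules pick different rows (CP 0.05 versus CP 0.01), because the "closest" rule can select a row whose error sits slightly above the threshold.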

22 Modeling
ENSEMBLING VIA BAGGING
Bootstrap aggregation (Bagging)
Bagging is a simple way to increase the predictive power of a model.
Pros
- Useful when the predictors (models) are unstable: the more the fitted models vary from sample to sample, the more bagging helps
Cons
- Using smaller samples will yield more instability, and samples that are too small yield poor models
How
- Take several random samples with replacement from the training data set
- Use each sample to construct a separate predictive model and obtain its predictions for the testing data set
- Average the predictions to come up with one final predicted value
    my_list = getbs_samples(seed=my_seed, Ntrees=100, SampleSize=1000)
    bagged_tree = BS_Trees(response=train[,1], datasetname=train, samplelist=my_list, Ntrees=100)
    bagged_probs = getfinalpredictions(bagged_tree, datasetname=train, Ntrees=100)
What R packages are available for bagging?
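The helpers getbs_samples(), BS_Trees(), and getfinalpredictions() are the presenter's own and are not shown in the transcript. The three "How" steps can be sketched with rpart as follows; the synthetic data, the seed, and the choice of 25 trees on samples of 200 rows are assumptions made to keep the example small.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(42)  # arbitrary seed for this illustration
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$sold <- factor(as.integer(toy$x1 + toy$x2 + rnorm(n) > 0))

n_trees     <- 25   # number of bootstrap samples / trees
sample_size <- 200  # rows drawn per bootstrap sample

# Step 1: draw bootstrap samples (row indices sampled with replacement)
boot_idx <- lapply(seq_len(n_trees), function(i) sample(n, sample_size, replace = TRUE))

# Step 2: fit one classification tree per bootstrap sample
trees <- lapply(boot_idx, function(idx) {
  rpart(sold ~ x1 + x2, data = toy[idx, ], method = "class")
})

# Step 3: average each tree's predicted P(sold = 1) into one bagged probability
prob_matrix  <- sapply(trees, function(fit) predict(fit, newdata = toy, type = "prob")[, "1"])
bagged_probs <- rowMeans(prob_matrix)

summary(bagged_probs)
```

Keeping the fitted trees in a plain list, as done here, is one simple way to hold a bagged model in memory; the list can be saved with saveRDS() and reused for new predictions.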

23 Modeling
MODEL ASSESSMENT - TRAINING
[Figure panels: C5.0, Bagged CART, Logit, CART]
Why does the bagged tree perform worse?

24 Modeling
MODEL ASSESSMENT - TESTING
[Figure panels: C5.0, Bagged CART, Logit, CART]
How do we store a Bagged Model in R?

25 Modeling
OPTIMAL DECISION CUTPOINTS
Why use a decision cutoff threshold of 0.50?
[Figure: Example of an ROC curve]
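An ROC curve is just the true positive rate plotted against the false positive rate as the cutoff sweeps over the predicted probabilities. The base-R sketch below computes those points for eight made-up cases; the function name roc_points and the data are hypothetical, and a case is classified as positive when its probability is at or above the cutoff.

```r
# Made-up actual classes and predicted probabilities for eight cases
actual <- c(1, 1, 1, 0, 0, 1, 0, 0)
prob   <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)

# One (threshold, tpr, fpr) row per distinct predicted probability
roc_points <- function(actual, prob) {
  thresholds <- sort(unique(prob), decreasing = TRUE)
  rates <- t(sapply(thresholds, function(cut) {
    predicted <- as.integer(prob >= cut)
    c(tpr = sum(predicted == 1 & actual == 1) / sum(actual == 1),
      fpr = sum(predicted == 1 & actual == 0) / sum(actual == 0))
  }))
  data.frame(threshold = thresholds, rates)
}

roc <- roc_points(actual, prob)
print(roc)
```

Each row is one candidate cutoff, which makes it clear that 0.50 is only one of many possible operating points on the curve.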

26 Modeling
OPTIMAL DECISION CUTPOINTS
    require(OptimalCutpoints)
    ## Define my methods list
    methodlist = list(
      "Youden"             #1 (Youden Index);
      ,"ROC01"             #2 (minimizes distance between ROC plot and point (0,1));
      ,"PROC01"            #3 (minimizes distance between PROC plot and point (0,1));
      #,"MaxAccuracyArea"  #4 (maximizes Accuracy Area);
      #,"AUC"              #5 (maximizes concordance which is a function of AUC);
      ,"MaxEfficiency"     #6 (maximizes Efficiency or Accuracy);
      ,"MaxKappa"          #7 (maximizes Kappa Index);
      #,"MinErrorRate"     #8 (minimizes Error Rate);
    )
[Figure: C5.0 training results using the Youden cutoff method]
[Figure: Using other cutoff methods..]
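To show what the Youden criterion optimizes, the sketch below computes the Youden index J = sensitivity + specificity - 1 at every candidate cutoff and keeps the cutoff with the largest J. It uses base R only and made-up data; the function youden_cutoff is hypothetical and is not a call into the OptimalCutpoints package, whose own tie-breaking and options may differ.

```r
# Made-up actual classes and predicted probabilities for eight cases
actual <- c(1, 1, 1, 0, 0, 1, 0, 0)
prob   <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)

# Evaluate J at each distinct probability; classify as 1 when prob >= cutoff
youden_cutoff <- function(actual, prob) {
  thresholds <- sort(unique(prob))
  j <- sapply(thresholds, function(cut) {
    predicted   <- as.integer(prob >= cut)
    sensitivity <- sum(predicted == 1 & actual == 1) / sum(actual == 1)
    specificity <- sum(predicted == 0 & actual == 0) / sum(actual == 0)
    sensitivity + specificity - 1
  })
  list(cutoff = thresholds[which.max(j)], youden = max(j))
}

best <- youden_cutoff(actual, prob)
print(best)
```

For these eight cases the best cutoff is 0.7 rather than 0.50, which illustrates why the default threshold is worth questioning.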