Ensemble Modeling with R



Doctoral Candidate / Merchandise Data Scientist, MatthewALanham.com
Virginia Tech, Department of Business Information Technology
Advance Auto Parts, Inc.

Outline
- My Background and Research
- Pros and Cons of R for Data Science
- Modeling Using the CRISP-DM Framework
- What is Ensemble Modeling?
- Fitting Models
- Bagging a Decision Tree
- Optimal Decision Cut Points for Binary Classification
SEPTEMBER 15, 2014

Background
- (2005) B.A. Economics/Mathematics, Indiana University-Bloomington
- (2005-2010) Genscape, Inc., Louisville, KY: energy-transparency start-up
- (2008-2010) M.S. Biostatistics-Decision Science, University of Louisville
- (2010-2012) M.S. Statistics, Virginia Tech
- (2011-Current) Ph.D. Business Information Technology, Virginia Tech
- (2014-Current) Advance Auto Parts, Inc., Fortune 500 retailer (#402.. for now)

Research Focus
How can we build better predictive models that are empirically sound (statistics), used as input parameters to prescriptive models that are process representative (optimization), to provide the best (maintainable, timely, scalable, KPI-fused) decision support for a retailer's assortment plan?
- Why is assortment planning so important?
- Why is the assortment planning problem so challenging?
- Where do Data Science & Big Data Analytics (BDA) come into play?
- Where does Information Technology (IT) come into play?
- Where does Business come into play?

Predictive and Prescriptive Analytics
INTEGRATING PREDICTIVE AND PRESCRIPTIVE ANALYTICS
[Diagram: a prescriptive decision model (search algorithm, optimality conditions, decision criteria such as max profit; decision variables: 1) assortment, 2) prices, 3) promotion, 4) shelf space*; objective function: revenue; constraints: budget(s)) loops to determine the optimal solution over a predictive estimation layer (data, predictive model(s), market specs, demand model, preference structure, similarity measures, utility model, demand forecast parameters). Performance measures that are functions over a time horizon, or random variables that must be summarized over their distributions, are rolled up as time summary performance measures (TSPM) and scenario summary performance measures (SSPM), e.g. sums and averages.]

Oracle + SPSS Modeler
WHAT I'M WORKING WITH CURRENTLY
My opinion: IBM SPSS Modeler and SAS Enterprise Miner are:
1) Great for teaching
2) Great for stand-alone data mining projects
3) Visually appealing to management
4) Not great for real-time production analytics
5) Not great for customized solutions
6) Not designed for prescriptive analytics
[An example of an IBM SPSS Modeler stream building predictive models.]

Data Mining, Data Science, and Predictive Modeling with R
DATA SCIENCE WITH R
R is an open-source, freely accessible software language for statistical and mathematical computing, released under version 2 of the GNU General Public License (Ihaka & Gentleman, 1996). R runs on many operating systems, including Windows, Macintosh, Unix, and Linux. According to Eric Siegel, author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, R is "the leading free, open-source software tool for PA (Predictive Analytics), has a rapidly expanding base of users as well as enthusiastic volunteer developers who add to and support its functionalities" (Siegel, 2013).

Today there are several thousand user-developed packages (also referred to as libraries). Packages are collections of R functions, compiled code, and data put together in a specific format following CRAN's guidelines. You can search for packages by application area at http://cran.r-project.org/web/views/. As of July 2014, there are 33 different application areas. The Machine Learning application area alone offers 72 packages with functions for nearly any methodology, including many newer techniques that are not available in commercial software packages.

Cons: memory, memory, memory! See memory_example.r.
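The memory_example.r script is not reproduced in this transcript; as a small stand-in, base R can report how much memory a single object consumes and ask the garbage collector to release what was freed:

```r
# Stand-in sketch for the memory discussion (not the original memory_example.r):
# R keeps whole objects in RAM, so it helps to measure them.
x <- matrix(rnorm(1e6), ncol = 100)   # one million doubles, roughly 8 MB
print(object.size(x), units = "MB")   # report the object's size in megabytes
rm(x)                                 # drop the reference
invisible(gc())                       # prompt R to return the freed memory
```

On large retail data sets like the SKU file used later, measuring objects this way makes it clear when a workflow is about to exhaust available RAM.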

Data Mining, Data Science, and Predictive Modeling with R
WHAT IS ENSEMBLE MODELING?
Ensemble methods train multiple predictive models and then combine their predictions to achieve higher overall performance and stability.
Pros:
- Ensemble methods require little tuning.
- Ensemble methods operate on a variety of input types (categorical variables, integers, and real numbers).
- Ensemble methods can be used on a variety of problems (binary and multi-class classification, ranking, regression, etc.).
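The core idea of combining predictions can be shown in a few lines; here the two "models" are just stand-in probability vectors rather than fitted models:

```r
# Minimal sketch of the ensemble idea: average the predicted class
# probabilities from two models instead of trusting either one alone.
set.seed(1)
p_model_a <- runif(5)                      # stand-in for model A's P(y = 1)
p_model_b <- runif(5)                      # stand-in for model B's P(y = 1)
p_ensemble <- (p_model_a + p_model_b) / 2  # simple unweighted average
class_pred <- ifelse(p_ensemble > 0.5, 1, 0)
```

Real ensembles (bagging, boosting, stacking) differ in how the member models are trained and weighted, but the final step is always some combination rule like this averaging.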

CRISP-DM
DATA MINING FRAMEWORK
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a general data mining process model that can be applied to solve any business problem. There are other popular data mining and analytics process models, such as Sample-Explore-Modify-Model-Assess (SEMMA), but in my opinion CRISP-DM is more structured and detailed. CRISP-DM was created and refined over time by leading practitioners and researchers in the data mining field and has been shown to lead to analytical results that align with business objectives. The CRISP-DM process model and techniques fall primarily under the predictive analytics domain of business analytics, where the objective is to help organizations predict future events and proactively act upon such insights in a systematic fashion to drive better business outcomes (Provost & Fawcett, 2013). However, this process could be extended to prescriptive (i.e., optimization) analytics endeavors as well.

CRISP-DM
CRISP-DM DETAILED VIEW
[Figure: CRISP-DM model phases and tasks. Source: modified from www.crisp-dm.org]

Business Understanding
BUSINESS UNDERSTANDING
Business Objectives
- Retail assortment planning, at the most basic level, asks which products to offer and how many (Mantrala et al., 2009).
- Assortment planning is one of the most important decisions faced by retailers (Sauré & Zeevi, 2013). Because of financial and physical capacity constraints, a retailer operationally does not have the ability to stock, let alone hold in store, every possible product a consumer may desire (Sauré & Zeevi, 2013).
- You must get the project sponsor to detail the business success criteria. It's not some predictive-model accuracy statistic. Examples: increased sales of X% at stores Y and Z; reduced non-working inventory of W% at stores Y and Z.
Assess Situation
- You may use R and any of its available packages.
- Deadline is September 17th at the Meetup.
- The competition winner gets $100, losers learn something, and the speaker gets feedback.
Data Mining Goals
- Determine the best overall test accuracy on a 10% out-of-training set.
- To qualify, neither the sensitivity nor the specificity may fall below 0.70 on the out-of-training set.
Project Plan
- Lay out your expected work schedule, breaks, etc. It will vary depending on your experience using R.
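The qualifying rule in the data mining goals can be computed directly from a confusion matrix; a base-R sketch on a tiny made-up set of labels:

```r
# Sketch of the competition's qualifying check: overall accuracy, with
# sensitivity and specificity both required to stay at or above 0.70.
actual    <- c(1, 1, 1, 0, 0, 0, 1, 0)   # toy ground-truth labels
predicted <- c(1, 1, 0, 0, 0, 1, 1, 0)   # toy model predictions
cm <- table(actual, predicted)           # 2x2 confusion matrix

accuracy    <- sum(diag(cm)) / sum(cm)        # -> 0.75 on this toy data
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # true-positive rate -> 0.75
specificity <- cm["0", "0"] / sum(cm["0", ])  # true-negative rate -> 0.75
qualifies   <- sensitivity >= 0.70 & specificity >= 0.70  # TRUE here
```

On the real test partition, `actual` and `predicted` would come from the held-out 10% and a fitted model's class predictions.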

Data Understanding
DATA UNDERSTANDING
Collect Data
http://www.matthewalanham.com/presentations/skus.xlsx
Describe Data

   store_number: A unique store identifier.
   sku_number: A unique SKU identifier.
Y  SOLD: Whether the SKU sold in a respective store (1=yes, 0=no) in the last 13 periods after it was replenished/maxied.
   NUM_SOLD: The number of realized unit SKU sales for a respective store over the past 1-13 periods.
X  NUM_SOLD_LAST: The number of realized unit SKU sales for a respective store over the past 14-26 periods.
X  application_count: The total number of different year-make-model vehicle options that the respective SKU could be used for.
X  projected_growth_pct: The projected percentage growth for this SKU in the next 13 periods, based on financial experts.
X  offset: For each store-SKU, the positive deviation based on unit sales from the center of the part-type-specific distribution.
X  adjusted_offset: For each store-SKU, the positive deviation based on unit sales from the center of the part-type-specific distribution, adjusted based on an ad-hoc calculation.
X  unit_sales_py: The total number of units sold for this particular SKU over all stores between the past 27 and 39 periods.
X  unit_sales_cy: The total number of units sold for this particular SKU over all stores between the past 14 and 26 periods.
X  unit_sales_fy: The total number of units sold for this particular SKU over all stores over the past 13 periods.
X  total_vio: The total number of "estimated" vehicles in operation associated with a particular store, based on an ad-hoc calculation.
X  adjusted_total_vio: The total number of "estimated" vehicles in operation associated with a particular store, based on an ad-hoc calculation.
X  vio_compared_to_cluster: The percentage of vehicles in operation (VIO) for a respective store compared to the total VIO for all stores associated with its cluster over the past 14 to 26 periods.
X  avg_cluster_cy_unit_sales: The average number of SKUs sold based on a clustering of all stores over the past 13 to 26 periods.
X  avg_cluster_cy_total_sales: The average number of total sales (a combination of unit and lost sales) based on store clusters over the past 14 to 26 periods.
X  avg_cluster_cy_lost_sales: The average number of lost sales, clustered by all stores, over the past 14 to 26 periods.
X  pop_est_cy: Estimated number of persons in the population where the store is located, based on the latest period.
X  pop_density_cy: Estimated density (a percentage) of the population where the store is located, based on the latest period.
X  pct_white: Estimated percentage of Caucasian-identified persons where the store is located, based on the latest period.
X  age: Estimated median person-age where the store is located, based on the latest period.
X  pct_college: Estimated percentage of college-educated persons where the store is located, based on the latest period.
X  pct_blue_collar: Estimated percentage of blue-collar-type workers where the store is located, based on the latest period.
X  median_household_income: Estimated median household income where the store is located, based on the latest period.
X  establishments: Estimated number of physical locations where business is conducted or where services or industrial operations are performed, in the area where the store is located, based on the latest period.
X  road_quality_index: A measure of the quality of the roads in the area where the store is located.

Usually you will create such a table yourself, and make it more descriptive. The data scientist will ask the domain expert(s) questions such as:
- What are the variables' units of measure?
- Where does the data come from? When is it updated?
- How and why was clustering performed a particular way?
- How and why was a variable adjusted?

Data Understanding
DATA UNDERSTANDING (CONT.)
Explore Data
Matt's source code: https://github.com/malanham/datathon.git (see main.r)
Data Quality
DataQualityReport(skus)
DataQualityReportOverall(dataSetName=skus)

Data Preparation
DATA PREPARATION
Data Description: the data dictionary shown on the Data Understanding slide is repeated here.

## keep only the complete cases
skus = skus[which(complete.cases(skus)),]
DataQualityReportOverall(dataSetName=skus)

Modeling
MODELING
Modeling Techniques
- C5.0 decision tree
- Logistic regression
- CART decision tree

Modeling
MODELING (CONT.)
Design
When building and testing predictive models using observational data (i.e., data that is not controlled as in laboratory experimentation), the question that must be answered is: how valid is my model with regard to what will happen next? In a properly designed and controlled experiment, the data (samples) used in the experiment are used to make inferences about the population. Regardless of how large or small the sample is compared to the true size of the population, a single randomly selected subset of the population allows for generalizability to the remaining subset of data not used in the study. Cross-validation is the most practical and cost-effective means of obtaining a proxy for truth in predictive analytics.

## Randomly partition data into training and test sets
my_seed = 1234567
skus = GenerateTTV(dataSetName=skus, response='sold', trainpct=.90, testpct=.10, my_seed)
GeneratePartitionPcts(dataSetName=skus)

Using the training-data error rate as a proxy for a model's generalization error is not wise, especially when the training error is low to almost perfect. Most likely the model has been overfit (over-trained) and will not perform as well when new examples are fed through and evaluated on a validation data set (Zhou, 2012).
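GenerateTTV() is a helper written for this talk; in base R, an equivalent 90/10 random train/test partition can be sketched as:

```r
# Base-R stand-in for the custom GenerateTTV() helper: randomly assign
# each row to a 90% training or 10% test partition, reproducibly.
set.seed(1234567)
n         <- 1000                                     # pretend row count
test_idx  <- sample(seq_len(n), size = round(0.10 * n))
partition <- ifelse(seq_len(n) %in% test_idx, "test", "train")
table(partition)                                      # 900 train, 100 test
```

In the real workflow `partition` would be attached to the skus data frame as a column (the slides use spss_partition) so later code can filter on it.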

Modeling
MODELING (CONT.)
Design

## Percentage of target = 1 (Y = 'SOLD') in the total data set
dim(skus[which(skus$sold==1),])[[1]] /
  dim(skus[which(skus$sold==0 | skus$sold==1),])[[1]]

## Percentage of target = 1 in the training set
dim(skus[which(skus$sold==1 & skus$spss_partition=='train'),])[[1]] /
  dim(skus[which((skus$sold==1 | skus$sold==0) & skus$spss_partition=='train'),])[[1]]

## Percentage of target = 1 in the test set
dim(skus[which(skus$sold==1 & skus$spss_partition=='test'),])[[1]] /
  dim(skus[which((skus$sold==1 | skus$sold==0) & skus$spss_partition=='test'),])[[1]]

## Remove independent variables that you don't want to use
names(skus)
skus2 = skus[,c(3,5:19,21:28)]
head(skus2)
skus2$sold = as.factor(skus2$sold)

Modeling
MODELING (CONT.)

## Set up data for the algorithms
trainx = skus2[which(skus2$spss_partition=='train'), 2:(length(skus2)-1)]
trainy = skus2[which(skus2$spss_partition=='train'), 'sold']
train  = cbind(trainy, trainx)
testx  = skus2[which(skus2$spss_partition=='test'), 2:(length(skus2)-1)]
testy  = skus2[which(skus2$spss_partition=='test'), 'sold']
test   = cbind(testy, testx)

Modeling
MODELING (CONT.)
Build Models: C5.0 Decision Tree

require(C50)  # fit classification-tree or rule-based models using Quinlan's C5.0 algorithm
C5Params = C5.0Control()
C5 = C5.0(x = trainx, y = trainy
         ,control = C5Params  # control parameters defined above
         ,trials = 1)         # number of boosting iterations; 1 implies a single model is used
summary(C5)

Notes:
- Changing trials from 1 to 1000 doesn't change the result in this case.
- summary(C5) reports the overall error rate, a confusion matrix based on a decision cutoff threshold of 0.50, and the variables that were used to create the tree.

Modeling
MODELING (CONT.)
Build Models: C5.0 Decision Tree

## Training probabilities and predicted classes
C5trainp = predict(C5, newdata = trainx, trials = C5$trials["Actual"]
                  ,type = "prob"   # "class" for the predicted class, "prob" for model confidence values
                  ,na.action = na.pass)[,2]
C5trainc = predict(C5, newdata = trainx, trials = C5$trials["Actual"]
                  ,type = "class"
                  ,na.action = na.pass)

## Testing probabilities and predicted classes
C5testp = predict(C5, newdata = testx, trials = C5$trials["Actual"]
                 ,type = "prob"
                 ,na.action = na.pass)[,2]
C5testc = predict(C5, newdata = testx, trials = C5$trials["Actual"]
                 ,type = "class"
                 ,na.action = na.pass)

Modeling
MODELING (CONT.)
Build Models: Logistic Regression

logit.fit = glm(trainy ~ NUM_SOLD_LAST + TOTAL_VIO + ADJ_TOTAL_VIO
                + VIO_COMPARED_TO_CLUSTER + POP_EST_CY + POP_DENSITY_CY
                + PCT_WHITE + AGE + PCT_COLLEGE + PCT_BLUE_COLLAR
                + MEDIAN_HOUSEHOLD_INCOME + ESTABLISHMENTS + ROAD_QUALITY_INDEX
                + APPLICATION_COUNT + PROJECTED_GROWTH_PCT + UNIT_SALES_CY
                + UNIT_SALES_PY + OFFSET + ADJUSTED_OFFSET
                + AVG_CLUSTER_CY_UNIT_SALES + AVG_CLUSTER_CY_TOTAL_SALES
                #+ AVG_CLUSTER_CY_LOST_SALES
               ,family = binomial
               ,data = train)
summary(logit.fit)
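The slide only fits the model; scoring it is not shown. For a binomial glm, predict(..., type = "response") returns P(y = 1), which a cutoff then turns into classes. A self-contained sketch on toy data (the real fit would use logit.fit and the skus partitions):

```r
# Sketch of scoring a fitted logistic regression (toy data, not the skus set).
set.seed(1)
toy <- data.frame(y = rbinom(50, 1, 0.5), x = rnorm(50))
fit <- glm(y ~ x, family = binomial, data = toy)

p   <- predict(fit, newdata = toy, type = "response")  # probabilities in (0, 1)
cls <- ifelse(p > 0.5, 1, 0)                           # classes at a 0.50 cutoff
```

The 0.50 cutoff is only a default; the optimal-cutpoint slides later revisit that choice.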

Modeling
MODELING (CONT.)
Build Models: CART

require(rpart)
set.seed(1234567)
tree = rpart(trainy ~ .
            ,data = train
            ,method = "class"
            ,cp = 0
            ,minsplit = 4
            ,minbucket = 2
            ,parms = list(prior=c(0.5, 0.5)))
#summary(tree)  # <- this will take a while

## Find the best pruned tree (1-SE rule on the cross-validated error)
i.min = which.min(tree$cptable[,"xerror"])
i.se  = which.min(abs(tree$cptable[,"xerror"] -
                      (tree$cptable[i.min,"xerror"] + tree$cptable[i.min,"xstd"])))
alpha.best = tree$cptable[i.se, "CP"]
tree.p = prune(tree, cp=alpha.best)

## Obtain predictions (0.50 cutoff)
treetrainp = predict(tree.p, train)[,2]
treetrainc = ifelse(treetrainp > .5, 1, 0)
treetestp  = predict(tree.p, test)[,2]
treetestc  = ifelse(treetestp > .5, 1, 0)

Modeling
ENSEMBLING VIA BAGGING
Bootstrap aggregation (bagging) is a simple way to increase the predictive power of a model.
Pros:
- Useful when the individual predictors are unstable, i.e., when their predictions show high variance.
Cons:
- Smaller bootstrap samples yield more instability, and samples that are too small yield poor models.
How:
1. Take several random samples, with replacement, from the training data set.
2. Use each sample to construct a separate predictive model and generate predictions for the test data set.
3. Average the predictions to come up with one final predicted value.

my_list = getbs_samples(seed=my_seed, Ntrees=100, SampleSize=1000)
bagged_tree = BS_Trees(response=train[,1], datasetname=train, samplelist=my_list, Ntrees=100)
bagged_probs = getfinalpredictions(bagged_tree, datasetname=train, Ntrees=100)

What R packages are available for bagging?
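The getbs_samples()/BS_Trees()/getfinalpredictions() helpers above are custom to this talk; the three steps they implement can be sketched self-contained with rpart on synthetic data:

```r
# Self-contained bagging sketch (synthetic data, not the skus set):
# bootstrap the rows, fit one tree per resample, average the probabilities.
library(rpart)
set.seed(42)
dat <- data.frame(y  = factor(rbinom(200, 1, 0.5)),
                  x1 = rnorm(200), x2 = rnorm(200))

n_trees <- 25
probs <- sapply(seq_len(n_trees), function(i) {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]  # step 1: bootstrap sample
  fit  <- rpart(y ~ ., data = boot, method = "class")  # step 2: one tree per sample
  predict(fit, newdata = dat)[, "1"]                # that tree's P(y = 1)
})

bagged_prob  <- rowMeans(probs)                     # step 3: average predictions
bagged_class <- ifelse(bagged_prob > 0.5, 1, 0)
```

To answer the slide's question: ready-made implementations include ipred (bagging()), adabag, and randomForest, which combines bagged trees with random feature selection at each split.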

Modeling
MODEL ASSESSMENT - TRAINING
[Results shown for C5.0, Bagged CART, Logit, and CART.]
Why does the bagged tree perform worse?

Modeling
MODEL ASSESSMENT - TESTING
[Results shown for C5.0, Bagged CART, Logit, and CART.]
How do we store a bagged model in R?

Modeling
OPTIMAL DECISION CUTPOINTS
Why use a decision cutoff threshold of 0.50?
[Example of an ROC curve.]
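An ROC curve is traced by sweeping the cutoff rather than fixing it at 0.50; each candidate cutoff yields one (sensitivity, specificity) pair. A base-R sketch on toy scores:

```r
# Sweep decision cutoffs over toy predicted scores and compute the
# sensitivity/specificity pair at each one (the points of an ROC curve).
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)  # toy model probabilities
truth  <- c(1,   1,   0,   1,   0,   1,   0,   0)    # toy true classes

cuts <- seq(0.1, 0.9, by = 0.2)
roc <- t(sapply(cuts, function(ct) {
  pred <- as.integer(scores >= ct)                  # classify at this cutoff
  c(cutoff = ct,
    sens = sum(pred == 1 & truth == 1) / sum(truth == 1),
    spec = sum(pred == 0 & truth == 0) / sum(truth == 0))
}))
roc   # one row per candidate cutoff
```

Packages such as pROC or ROCR automate this sweep and the AUC computation; the point here is only that 0.50 is one arbitrary stop along the curve.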

Modeling
OPTIMAL DECISION CUTPOINTS (CONT.)

require(OptimalCutpoints)

## Define my methods list
methodlist = list(
   "Youden"            #1 (Youden Index)
  ,"ROC01"             #2 (minimizes distance between the ROC plot and the point (0,1))
  ,"PROC01"            #3 (minimizes distance between the PROC plot and the point (0,1))
 #,"MaxAccuracyArea"   #4 (maximizes Accuracy Area)
 #,"AUC"               #5 (maximizes concordance, which is a function of AUC)
  ,"MaxEfficiency"     #6 (maximizes Efficiency, or Accuracy)
  ,"MaxKappa"          #7 (maximizes Kappa Index)
 #,"MinErrorRate"      #8 (minimizes Error Rate)
)

C5.0 training using the Youden cutoff method. Using other cutoff methods..
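The Youden method picks the cutoff that maximizes J = sensitivity + specificity - 1. What OptimalCutpoints automates can be sketched manually on toy scores:

```r
# Manual sketch of the Youden index: evaluate J = sens + spec - 1 at each
# observed score and keep the cutoff where J is largest (toy data).
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)  # toy model probabilities
truth  <- c(1,   1,   0,   1,   0,   1,   0,   0)    # toy true classes

cuts <- sort(unique(scores))
J <- sapply(cuts, function(ct) {
  pred <- as.integer(scores >= ct)
  sens <- sum(pred == 1 & truth == 1) / sum(truth == 1)
  spec <- sum(pred == 0 & truth == 0) / sum(truth == 0)
  sens + spec - 1                                   # Youden's J at this cutoff
})
best_cut <- cuts[which.max(J)]   # cutoff maximizing Youden's J
```

On real model output (e.g. the C5.0 test probabilities), this chosen cutoff typically differs from 0.50, which is exactly why the slide questions that default.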