ECLT5810 E-Commerce Data Mining Technique
SAS Enterprise Miner -- Regression Model

I. Regression Node

1. Some background:

Linear Regression:    attempts to predict the value of a continuous target as a linear function of one or more independent inputs
Logistic Regression:  attempts to predict the probability that a binary or ordinal target will acquire the event of interest as a function of one or more independent inputs

N.B.: Regression cannot handle a nominal target.

Suppose there are three variables: A, B and C.

Effect:               Variable used to model the value / probability
  Main input / effect:                        A, B and C
  Multiplication effect / Interaction terms:  e.g. AB, ABC
  Polynomial effect:                          e.g. A**2, B**3
Selection Method:     Method to select effects (e.g. starting from all, starting from zero)
Selection Criteria:   Criteria used to evaluate the effects of a model on the target
Optimization Method:  Method used to optimize the selection function among a set of candidate effects
Linking Function:     Function used to link the response to the linear predictor, e.g. from logistic to linear

Example

Data: SAMPSIO.DMAGECR
Variable: GOOD_BAD (Model: use)
GOOD_BAD: edit target profile > Assessment information > Add matrix
  - Accept Good (i.e. true positive): -1, Accept Bad (i.e. false positive): 5, Others: 0
  - Edit Decision: Minimize Loss
Data Partition: 70% Training, 30% Validation
  - With stratification, to keep Good and Bad in proportion.

Model Options Tab
  - lists details about the target variable and the regression process and enables you to specify options for both

Target Definition Subtab
  - lists the name, measurement level, and event level of the target variable

Regression Subtab

Type
  Binary or ordinal targets:  logistic (default)
  Interval targets:           linear (default)

Link Function (for logistic regression)
  - logit (default)
  - cloglog (complementary log-log)
  - probit

Input Coding - converts categorical inputs to discrete integer values
  Deviation:  use middle level as reference level
  GLM:        use highest / lowest (descending / ascending) level as reference level

Selection Method Tab

Method
  Backward:  begins with all candidate effects, then removes effects
  Forward:   begins with no candidate effects, then adds effects
  Stepwise:  begins with no candidate effects, then adds and removes effects
  None:      all candidate effects are included

Criteria
  AIC:                                 Akaike's Information Criterion (smallest)
  SBC:                                 Schwarz's Bayesian Criterion (smallest)
  Validation Error:                    smallest error rate for the validation data set
  Validation Misclassification:        smallest misclassification rate for the validation data set
  Cross-Validation Error:              smallest cross validation error rate for the training data set
  Cross Validation Misclassification:  smallest cross validation misclassification rate for the training data set
  Profit/Loss:                         maximizes the profit or minimizes the loss for the cases in the validation data set
  Cross Validation Profit/Loss:        maximizes the cross validation profit or minimizes the cross validation loss
  None:                                last model produced by the effects selection method
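To make the Profit/Loss criterion more concrete, the sketch below applies the loss matrix from the example (Accept Good: -1, Accept Bad: 5, Others: 0) to a predicted probability of Bad and picks the decision with the smaller expected loss. It is plain Python, not Enterprise Miner code; the names LOSS, expected_loss, best_decision and average_loss are illustrative assumptions and do not come from the SAS product.

```python
# Loss matrix from the example: outer key is the actual target level,
# inner key is the decision. A negative loss is a profit.
# All names in this block are hypothetical, chosen for illustration only.
LOSS = {
    "good": {"accept": -1.0, "reject": 0.0},
    "bad": {"accept": 5.0, "reject": 0.0},
}


def expected_loss(p_bad, decision):
    """Expected loss of a decision given the predicted probability of Bad."""
    return (1.0 - p_bad) * LOSS["good"][decision] + p_bad * LOSS["bad"][decision]


def best_decision(p_bad):
    """Pick the decision with the smaller expected loss (ties go to reject)."""
    if expected_loss(p_bad, "accept") < expected_loss(p_bad, "reject"):
        return "accept"
    return "reject"


def average_loss(p_bads, actuals):
    """Average realized loss over scored cases, e.g. a validation data set."""
    total = sum(LOSS[actual][best_decision(p)] for p, actual in zip(p_bads, actuals))
    return total / len(p_bads)


if __name__ == "__main__":
    for p in (0.05, 0.10, 0.20, 0.50):
        print(p, best_decision(p), round(expected_loss(p, "accept"), 3))
```

With this matrix the expected loss of accepting is 6p - 1 for a predicted probability p of Bad, so accepting pays off only when p is below 1/6. A model that lowers average_loss on the validation cases is preferred under a minimize-loss decision.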

Selection Method: Number of Variables

Start
  - number of effects to use in the first model
  - list of candidate effects can be seen in the Tools > Model Ordering window
  - first n effects will be selected in the first model

Stop
  - Forward method: maximum number of effects to appear in the final model
  - Backward method: minimum number of effects to appear in the final model
  - effect selection method may terminate for other reasons before the Stop criterion is applied

Force
  - force specific effects into the final models
  - set force no. and arrange effects in the Tools > Model Ordering window

Initialization Tab

You can set one of the following options in the Initialization tab:
  (default):          Do not use initial parameter estimates
  Current estimates:  Use the current parameter estimates from an initial run of the Regression node as starting values
  Selected data set:  Specify a data set that contains starting values for the parameter estimates

Advanced Tab
  - set the optimization method, iteration controls, and convergence criteria in the Advanced tab.
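As a rough illustration of what starting values, an iteration limit and a convergence criterion control, here is a minimal Newton-Raphson fit of a one-input logistic regression in plain Python. It is a sketch under stated assumptions, not the algorithm Enterprise Miner runs: the function name fit_logistic_newton, its parameters (start, max_iter, tol) and the toy data are hypothetical, and there is no line search or ridging.

```python
import math


def fit_logistic_newton(xs, ys, start=(0.0, 0.0), max_iter=50, tol=1e-8):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by plain Newton-Raphson.

    start    : initial parameter estimates (hypothetical analogue of starting values)
    max_iter : iteration limit
    tol      : stop once the largest absolute gradient component is below tol
    Returns (b0, b1, iterations_used, converged).
    """
    b0, b1 = start
    for iteration in range(1, max_iter + 1):
        # Accumulate the log-likelihood gradient (g) and information matrix (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        if max(abs(g0), abs(g1)) < tol:
            return b0, b1, iteration - 1, True
        # Newton step: solve the 2x2 system H * delta = g by Cramer's rule.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # The limit was reached before the gradient check passed.
    return b0, b1, max_iter, False


if __name__ == "__main__":
    xs = [0, 1, 2, 3, 4, 5, 6, 7]   # toy input (made up for illustration)
    ys = [0, 0, 1, 0, 1, 1, 0, 1]   # toy binary target, not separable
    b0, b1, used, ok = fit_logistic_newton(xs, ys)
    print("cold start:", used, "iterations, converged =", ok)
    # Restarting from the fitted values needs no further Newton steps.
    print("warm start:", fit_logistic_newton(xs, ys, start=(b0, b1))[2], "iterations")
```

Restarting from previously fitted estimates is similar in spirit to supplying current estimates as starting values: the optimizer begins at or near the solution and stops almost immediately. A too-small iteration limit returns the latest estimates with converged set to False.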

Optimization Method               Max Iterations   Max Function Calls   No. of variables (n)
Conjugate Gradient                400              1000                 n > 400
Double Dogleg                     200              500                  40 < n < 400
Newton-Raphson with Line Search   50               125                  n < 40
Newton-Raphson with Ridging       50               125                  n < 40
Quasi-Newton                      200              500                  40 < n < 400
Trust-Region                      50               125                  n < 40

Note: To learn about these optimization methods, see SAS/OR Technical Report: The NLP Procedure.

Running the Regression Node

Regression Results Browser

The Regression Results Browser helps you interpret the regression analysis of your data. It provides a graphic display of parameter estimates, statistics of fit, and a full listing of the regression output, log, and code.

Estimates Tab
  - T-scores: the larger the absolute value, the stronger the effect on the target

Plot Tab
  - The taller the bar, the higher the agreement between the predicted (the into variable) and the actual (the from variable) target values. The higher the agreement, the more useful the model.
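The agreement between actual and predicted target values can also be tabulated by hand as counts of each (from, into) pair. The sketch below is plain Python; the function names and the tiny data set are hypothetical, and "from" and "into" are used only in the sense given above (actual and predicted target values).

```python
from collections import Counter


def from_into_counts(actual, predicted):
    """Count cases for each (from, into) pair, i.e. (actual, predicted) levels."""
    return Counter(zip(actual, predicted))


def agreement_rate(actual, predicted):
    """Share of cases whose predicted level equals the actual level."""
    counts = from_into_counts(actual, predicted)
    agree = sum(n for (frm, into), n in counts.items() if frm == into)
    return agree / len(actual)


if __name__ == "__main__":
    # Made-up labels for illustration only.
    actual = ["good", "good", "bad", "bad", "good"]
    predicted = ["good", "bad", "bad", "good", "good"]
    for pair, n in sorted(from_into_counts(actual, predicted).items()):
        print(pair, n)
    print("agreement:", agreement_rate(actual, predicted))
```

Pairs where both levels match correspond to correct classifications; the remaining pairs are the misclassified cases.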

Statistics Tab
  - fit statistics, in alphabetical order, for the training data, validation data, and test data analyzed with the regression model
  - the fit statistics show how good the trained model is under different assessment methods

To learn about these statistics, read either the LOGISTIC procedure or the REG procedure documentation in the SAS/STAT User's Guide, Version 6, Volume 2.
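A few fit statistics for a binary target can be reproduced by hand, which may help when reading the table. The sketch below uses the textbook definitions AIC = -2 log L + 2k and SBC = -2 log L + k ln(n), together with average squared error and misclassification rate. Whether Enterprise Miner computes its reported values with exactly these formulas is an assumption here, and all function names, the 0.5 cutoff and the sample numbers are hypothetical.

```python
import math


def log_likelihood(ys, ps):
    """Bernoulli log-likelihood for 0/1 targets ys and predicted probabilities ps."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y, p in zip(ys, ps))


def aic(log_l, k):
    """Akaike's Information Criterion for k estimated parameters (smaller is better)."""
    return -2.0 * log_l + 2.0 * k


def sbc(log_l, k, n):
    """Schwarz's Bayesian Criterion for k parameters and n cases (smaller is better)."""
    return -2.0 * log_l + k * math.log(n)


def average_squared_error(ys, ps):
    """Mean squared difference between the 0/1 target and the predicted probability."""
    return sum((y - p) ** 2 for y, p in zip(ys, ps)) / len(ys)


def misclassification_rate(ys, ps, cutoff=0.5):
    """Share of cases whose predicted class (p >= cutoff) differs from the target."""
    wrong = sum(1 for y, p in zip(ys, ps) if (p >= cutoff) != (y == 1))
    return wrong / len(ys)


if __name__ == "__main__":
    ys = [1, 0, 1, 0]            # made-up targets
    ps = [0.8, 0.3, 0.4, 0.1]    # made-up predicted probabilities of the event
    ll = log_likelihood(ys, ps)
    print("AIC:", aic(ll, k=2), "SBC:", sbc(ll, k=2, n=len(ys)))
    print("ASE:", average_squared_error(ys, ps))
    print("Misclassification:", misclassification_rate(ys, ps))
```

Both AIC and SBC reward fit and penalize the number of parameters; SBC charges more per parameter than AIC whenever ln(n) exceeds 2, that is, for n of 8 or more.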