Statistics & Analysis


NESUG

How to Increase Sales of Orthopedic Equipment in United States: Factor and Cluster Analysis using SAS and R

George Obsekov, American College of Radiology Research Center, Philadelphia, PA

INTRODUCTION

This paper analyzes the sales of orthopedic equipment to United States hospitals. The purpose of the analysis was to find a way to increase the company's sales to hospitals and to identify the hospitals where sales gains could be maximized. To construct such a list, I created a subset of hospitals based on geographical location (Southern USA). For each hospital I analyzed descriptive variables such as the number of beds, the number of outpatient visits, the number of certain types of operations, and administrative cost. My response variable was the sales of rehabilitation equipment.

After examining scatter plots of each explanatory variable against the response variable, I applied either a log or a square root transformation so that each scatter plot displayed a roughly linear trend. The next step was to use factor analysis to split the variables into main groups that describe different aspects of the hospitals. Based on the rotated factor pattern, the first factor covered the number of operations, the second covered size, and the third covered rehab. After defining the factors, I applied cluster analysis to group hospitals with similar characteristics. Reviewing the Ward's minimum variance table, I examined the gaps between SPRSQ values to choose the number of clusters at an appropriate cutoff. I then chose a cluster with high average sales that contained a few hospitals with very low or no sales. A subsequent regression analysis helped me determine the list of hospitals with the highest potential sales gains.

Finally, I applied the methods for robust clustering (PAM) and for classification and regression trees (rpart) using R. The clusters selected by my cluster analysis were well supported by the PAM method.

MARKET SEGMENT SELECTION

In order to increase the sales of orthopedic materials to US-based hospitals, I set out to create a subset of hospitals out of the total given. Hospitals from the following states were chosen based on geographical selection (Southern states): California, Texas, Louisiana, Alabama, Georgia, Florida, South Carolina, and North Carolina. This gave me the final market segment for analysis, with variables describing the main characteristics of each hospital in the subset. All major variables are presented in the table below.

Variables considered in dataset

Variable           Description of variable in data set                      Comments
Y                  Sales of Rehabilitation Equipment Jan - July             Response variable. Zero means missing.
                   Sales of Rehabilitation Equipment for previous months
ZIP                US Postal Code
HID                Hospital ID
CITY               City Name
STATE              State Name
BEDS               Number of Hospital Beds
RBEDS              Number of Rehab Beds
OUT-V              Number of Outpatient Visits
ADM*               Administrative Cost                                      In $ s per year.
SIR                Revenue from Inpatient
HIP9               Number of HIP Operations for 99
KNEE9              Number of KNEE Operations for 99
TH (binary)*       Teaching hospital                                        = teaching, = non-teaching.
TRAUMA (binary)*   Do They Have a Trauma Unit?                              =Yes, =No.
REHAB (binary)*    Do They Have a Rehab Unit?                               =Yes, =No.
HIP9               Number of HIP Operations for 99
KNEE9              Number of KNEE Operations for 99
FEMUR9             Number of FEMUR Operations for 99

Table : Demographic and Operational Variables used in the prediction of sales

TRANSFORMATION

For the selected subset of hospitals, I reviewed scatter plots of each explanatory variable against my response variable. On close inspection it seemed clear that my variables required transformation in order to appear close to linear. Figures A and B show that the variable BEDS is better served by a square root transformation than by a log transformation.

Figure A: vs. BEDS, Transformation Selection in SQRT

Figure B: vs. BEDS, Transformation Selection in LOG

The number of rehab beds (RBEDS) and the operational variables (HIP9, KNEE9, HIP9, KNEE9 and FEMUR9) were transformed to log (+.xi), while the OUT-V, ADM and SIR variables appeared closer to linear when log (+.xi) was applied. None of the binary variables (TH, TRAUMA and REHAB) required a transformation. Finally, my response variable was also transformed from y to log (+y), where y was a combination of all sales for rehabilitation equipment. The final scatter plots after the transformations appear in Figures A and B.

Figure A: vs. BEDS, RBEDS, HIP9, KNEE9, HIP9, and KNEE9 after transformation
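The choice between a square root and a log transformation can also be checked numerically rather than only by eye. The following is a minimal sketch on simulated data, not the hospital data set: the names `beds` and `sales`, the simulated relationship, and the log(1 + x) form are all assumptions made for illustration.

```r
# Simulated stand-in for the hospital data (hypothetical values).
set.seed(1)
beds  <- round(runif(200, min = 20, max = 900))
sales <- pmax(0, round(exp(0.1 * sqrt(beds) + rnorm(200, sd = 0.5)) - 1))

# Response on the log(1 + y) scale, so zero sales remain defined.
log_sales <- log1p(sales)

# Candidate transformations of the explanatory variable.
candidates <- list(
  raw  = beds,
  sqrt = sqrt(beds),
  log  = log1p(beds)
)

# Absolute Pearson correlation with the transformed response:
# a larger value indicates a more nearly linear scatter plot.
linearity <- sapply(candidates, function(x) abs(cor(x, log_sales)))
print(round(linearity, 3))

best <- names(which.max(linearity))
print(best)
```

A correlation summary of this kind complements the scatter plots; it does not replace them, since a single number can hide curvature or outliers.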

Figure B: vs. FEMUR9, OUTV, ADM, and SIR after transformation

DIMENSION REDUCTION

Dimension reduction was performed using factor analysis to summarize the operational and demographic variables in the selected subset. Using the FACTOR procedure, three factors were constructed for further analysis: an operational factor (HIP9, KNEE9, HIP9, KNEE9, and FEMUR9), a size factor (BEDS, OUTV, ADM, SIR, TH, and TRAUMA) and a rehab factor (RBEDS and REHAB).

After an initial factor analysis of all variables in a single stage, I decided to run the factor analysis in two stages to obtain factors that were easier to interpret. This meant splitting the variables into two subgroups: one with the operational variables only, and another with the size and rehab variables. As the eigenvalues in Table A show, for the operational variables the eigenvalue for Factor has a proportion of 9.%, while the eigenvalues for the other factors have a proportion of more than % according to Table C. The factor pattern for stage two divided its variables into groups: a SIZE group (BEDS, OUTV, ADM, SIR, TH and TRAUMA) and a REHAB group (RBEDS and REHAB).

Stage One: NFACT=

Eigenvalue   Difference   Proportion   Cumulative

factor will be retained by the NFACTOR criterion.

Table A: Eigenvalues of the Correlation Matrix in Stage One

Variable   Description                        Factor
HIP9       NUMBER OF HIP OPERATIONS FOR 99    .9
KNEE9      NUMBER OF KNEE OPERATIONS FOR 99   .9
HIP9       NUMBER HIP OPERATIONS FOR 99       .9
KNEE9      NUMBER KNEE OPERATIONS FOR 99      .9
FEMUR9     NUMBER FEMUR OPERATIONS FOR 99     .9

Table B: Factor Pattern in Stage One including Number of Operations

Stage Two: NFACT=

Eigenvalue   Difference   Proportion   Cumulative

factors will be retained by the NFACTOR criterion.

Table C: Eigenvalues of the Correlation Matrix in Stage Two

Variable   Description                    Factor   Factor
BEDS       NUMBER OF HOSPITAL BEDS        .9       .
RBEDS      NUMBER OF REHAB BEDS           -.       .9
OUTV       NUMBER OF OUTPATIENT VISITS
ADM        ADMINISTRATIVE COST            .9       -.
SIR        REVENUE FROM INPATIENT         .9       -.
TH         TEACHING HOSPITAL?             .        .
TRAUMA     DO THEY HAVE A TRAUMA UNIT?    .        .
REHAB      DO THEY HAVE A REHAB UNIT?     .        .9

Table D: Rotated Factor Pattern for Two Factors (SIZE and REHAB)

The figure presents the eigenvalue distribution for stage one, using one factor, and stage two, using two factors.
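For readers working in R, the eigenvalue table and the varimax-rotated factor pattern can be approximated with base R alone. This is a minimal sketch on simulated data, assuming principal-component extraction from the correlation matrix (as METHOD=PRIN does) followed by `varimax` from the stats package. The variable names are borrowed from the paper for readability, but the values are simulated, and REHAB is treated as continuous here purely for simplicity.

```r
set.seed(2)
n <- 300
size  <- rnorm(n)   # latent "size" driver (hypothetical)
rehab <- rnorm(n)   # latent "rehab" driver (hypothetical)
X <- cbind(
  BEDS  = size  + rnorm(n, sd = 0.4),
  ADM   = size  + rnorm(n, sd = 0.4),
  SIR   = size  + rnorm(n, sd = 0.4),
  RBEDS = rehab + rnorm(n, sd = 0.4),
  REHAB = rehab + rnorm(n, sd = 0.4)
)

# Principal-component extraction from the correlation matrix.
eig <- eigen(cor(X), symmetric = TRUE)
proportion <- eig$values / sum(eig$values)
cumulative <- cumsum(proportion)
print(round(cbind(eigenvalue = eig$values, proportion, cumulative), 3))

# Unrotated loadings for the first two factors, then varimax rotation.
k <- 2
loadings_raw <- eig$vectors[, 1:k] %*% diag(sqrt(eig$values[1:k]))
rownames(loadings_raw) <- colnames(X)
rotated <- unclass(varimax(loadings_raw)$loadings)
print(round(rotated, 2))
```

In the rotated pattern, each variable should load strongly on one factor and weakly on the other, which is the property used in the paper to label the SIZE and REHAB groups.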

[Figure: two panels, "Eigen values for one factor" (Stage one) and "Eigen values for two factors" (Stage two); vertical axis: eigenvalue]

Figure : Eigenvalues for the operational (left) and size/rehab (right) factors

The final distribution of the analyzed variables is shown in the table below.

Variable Description               Variable Name   Stage One Factor   Stage Two Factor   Factor
Number of hospital beds            BEDS
Number of rehab beds               RBEDS
Number of outpatient visits        OUT-V
Administrative Cost                ADM
Revenue from inpatient             SIR
Number of HIP operation for 99     HIP9
Number of KNEE operation for 99    KNEE9
Teaching hospital                  TH
Do they have a trauma unit?        TRAUMA
Do they have a rehab unit?         REHAB
Number of HIP operation for 99     HIP9
Number of KNEE operation for 99    KNEE9
Number of FEMUR operation for 99   FEMUR9

Table : Final Distribution of Variables in Factors

CLUSTER ANALYSIS

I used cluster analysis to determine the best cluster to concentrate on for improving our sales. The table demonstrates Ward's analysis and presents the biggest jump between cluster and with % difference. Therefore, I chose clusters for my subsequent analysis.

NCL   Clusters Joined   SPRSQ   Difference

Table : Cluster selection based on cluster history using WARD variance table

Next, I created a box plot of sales against the clusters (Figure ). Based on this graph, cluster had the highest mean sales and included some hospitals that had no sales at all.

Figure : Box plot of sales per CLUSTER
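The idea of cutting the cluster history at the largest jump can be sketched in R with `hclust`. This is an illustration under stated assumptions: the factor scores are simulated, the column names are hypothetical, and the merge heights of `hclust(method = "ward.D2")` are used as a stand-in for the SPRSQ column of PROC CLUSTER (the two are related measures of merge cost, not identical quantities).

```r
set.seed(3)
# Three well-separated groups of simulated factor scores (hypothetical).
centers <- rbind(c(0, 0, 0), c(4, 4, 0), c(0, 4, 4))
scores <- do.call(rbind, lapply(1:3, function(g) {
  matrix(rnorm(40 * 3, sd = 0.6), ncol = 3) +
    matrix(centers[g, ], nrow = 40, ncol = 3, byrow = TRUE)
}))
colnames(scores) <- c("factor1", "factor2", "factor3")

# Ward's minimum-variance hierarchical clustering.
tree <- hclust(dist(scores), method = "ward.D2")

# Merge heights for the last joins, indexed by the number of clusters
# that remain after each join.
n <- nrow(scores)
history <- data.frame(
  ncl    = 9:1,
  height = tree$height[(n - 9):(n - 1)]
)
# jump = extra cost of going from ncl clusters down to ncl - 1.
history$jump <- c(diff(history$height), NA)
print(history)

# Keep the number of clusters just before the most expensive join.
k <- history$ncl[which.max(history$jump)]
groups <- cutree(tree, k = k)
print(table(groups))
```

On real data the largest jump is rarely this clear-cut, so the cutoff remains a judgment call, as it was in the Ward table above.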

Examination of the table of mean sales (Table ) confirmed that the chosen cluster has the highest mean sales. Because the chosen cluster is large, its hospitals cannot be assumed to be homogeneous, so I used a regression estimate for the subsequent analysis.

CLUSTER   FREQ   msales   mf   mf   mf

Table : Sales and factors per cluster ( clusters)

REGRESSION ANALYSIS

In my regression analysis I used the backward elimination procedure to determine which of the factors are significant and must be retained in the model. The elimination removed two factors and retained the operational factor as significant for our model. Next, I considered which hospitals have no sales. Once they were identified, I analyzed the gain for each hospital and found that, in order to increase the sales of orthopedic equipment, we should concentrate on six hospitals for a potential gain of $9,9. Here is the list of the hospitals with their hospital IDs: HID (Galveston, TX), HID (Thomasville, GA), HID 9 (Los Angeles, CA), HID 9 (Valdosta, GA), HID 9 (Fort Myers, FL), and HID 999 (Downey, CA).

PAM ANALYSIS IN R

I used R to apply the method for robust clustering (PAM) in order to identify the best cluster. As Figure shows, k= generates the highest average silhouette width (.9).

Figure : Average Silhouette Width Comparison based on Clusters

The table shows that the robust clustering method used in R matches the market segment selection obtained with SAS.

[Table: cross-tabulation of clpam (PAM Cluster Selection) against SAS Cluster Selection]

Table : Cluster Selection Table using SAS and R software
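The silhouette comparison and the cross-tabulation against the Ward clusters can be sketched as follows. This is a minimal example on simulated factor scores, assuming the `cluster` package that ships with R; the data, the range of k values tried, and the object names other than `clpam` are hypothetical.

```r
library(cluster)

set.seed(4)
# Three well-separated groups of simulated factor scores (hypothetical).
centers <- rbind(c(0, 0, 0), c(4, 4, 0), c(0, 4, 4))
scores <- do.call(rbind, lapply(1:3, function(g) {
  matrix(rnorm(40 * 3, sd = 0.6), ncol = 3) +
    matrix(centers[g, ], nrow = 40, ncol = 3, byrow = TRUE)
}))

# Average silhouette width for each candidate number of clusters.
ks <- 2:6
avg_width <- sapply(ks, function(k) pam(scores, k = k)$silinfo$avg.width)
names(avg_width) <- ks
print(round(avg_width, 3))

best_k <- ks[which.max(avg_width)]

# Compare the PAM assignment with a Ward solution of the same size.
clpam <- pam(scores, k = best_k)$clustering
ward  <- cutree(hclust(dist(scores), method = "ward.D2"), k = best_k)
tab <- table(clpam, ward)
print(tab)
```

When the two methods agree, each row of the cross-tabulation has a single non-zero cell, although the cluster labels themselves need not match.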

RPART ANALYSIS IN R

Using RPART analysis, I found that the segment with the highest potential gain (.) contained missing from 9 total observations (Figure ).

Figure : Final Regression Tree with following number of observations: (n=, n=, n=, n=9, n=)

I then identified the hospitals in this segment to see whether any of them had previously been chosen for increasing our sales gain. As shown in the table, the hospital with HID 9 (Fort Myers, FL) had already been selected by the regression analysis.

Observation   CITY          STATE   HID   CLUSTER
9             Hemet         CA      99    NA
              Cape Coral    FL      9     NA
              Hawaiian      CA      99    NA
              Oakland       CA      9     NA
              Greensboro    NC            NA
9             Melbourne     FL      9     NA
              Sacramento    CA      99    NA
              Fort Myers    FL      9     NA

Table : Cluster Selection Table using RPART Method
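The segment selection described above can be sketched with `rpart` on simulated data. This is an illustration only: the data frame `hh`, the response name `logsales`, the simulated relationship, and the complexity parameter `cp = 0.01` are assumptions, and the back-transformation assumes a log(1 + y) response.

```r
library(rpart)

set.seed(5)
n <- 400
# Simulated factor scores and a log-scale sales response (hypothetical).
hh <- data.frame(FACTOR1 = rnorm(n), FACTOR2 = rnorm(n), FACTOR3 = rnorm(n))
hh$logsales <- ifelse(hh$FACTOR1 > 0.5, 6, 2) + rnorm(n, sd = 0.5)

# Regression tree on the three factors.
hh.rp <- rpart(logsales ~ FACTOR1 + FACTOR2 + FACTOR3, data = hh,
               control = rpart.control(cp = 0.01))

# Every observation is predicted by the mean of its leaf, so the distinct
# predicted values identify the segments (numbered from lowest to highest).
predv   <- predict(hh.rp, newdata = hh)
segment <- as.numeric(factor(predv))
print(table(segment))

# Hospitals in the segment with the highest predicted value.
top <- hh[segment == max(segment), ]

# Leaf mean back on the original scale, assuming a log(1 + y) response.
top_value <- exp(max(predv)) - 1
print(top_value)
```

The rows of `top` play the role of the hospital list in the table above: they are the members of the highest-valued leaf, which can then be compared with the hospitals chosen by the regression analysis.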

RANDOM FOREST METHOD

Using the random forest method, I found for the Fort Myers hospital (HID 9) that the more accurate number for analysis is .9, and exp (.9) can generate about $, sales gain.

CONCLUSIONS

In this project I analyzed a subset of hospitals, chosen by geographical selection, and used cluster analysis to group them into market segments whose members closely resemble one another. By finding the best cluster, the one with the highest mean sales that also contained hospitals with no sales, I was able to estimate a potential sales gain above $, for the following hospitals: HID (Galveston, TX), HID (Thomasville, GA), HID 9 (Los Angeles, CA), HID 9 (Valdosta, GA), HID 9 (Fort Myers, FL), and HID 999 (Downey, CA). The results of the PAM, RPART and Random Forest methods provide supporting evidence that these hospitals are strong candidates for improving sales of orthopedic equipment as a short-term solution.

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Thanks also to Stan Legum, co-chair of section, whose feedback has proved invaluable to the writing of this paper.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

George Obsekov
American College of Radiology Research Center
Market Street
Philadelphia, PA

APPENDIX

The following code was used to perform the full analysis of the selected dataset:

DATA sasuser.hospital;
  INFILE 'hospital.txt' DELIMITER=',';
  INPUT ZIP $ HID $ CITY $ STATE $ BEDS RBEDS OUTV ADM SIR Y
        HIP9 KNEE9 TH TRAUMA REHAB HIP9 KNEE9 FEMUR9;

DATA hospital;
  SET sasuser.hospital;
  label ZIP = US POSTAL CODE
        HID = HOSPITAL ID
        CITY = CITY NAME
        STATE = STATE NAME
        BEDS = NUMBER OF HOSPITAL BEDS
        RBEDS = NUMBER OF REHAB BEDS
        OUTV = NUMBER OF OUTPATIENT VISITS
        ADM = ADMINISTRATIVE COST
        SIR = REVENUE FROM INPATIENT
        Y = OF REHABILITATION EQUIPMENT SINCE "JAN -JULY "
        = OF REHAB EQUIP FOR THE PREVIOUS "" MO
        HIP9 = NUMBER OF HIP OPERATIONS FOR "99"
        KNEE9 = NUMBER OF KNEE OPERATIONS FOR "99"
        TH = TEACHING HOSPITAL?
        TRAUMA = DO THEY HAVE A TRAUMA UNIT?
        REHAB = DO THEY HAVE A REHAB UNIT?
        HIP9 = NUMBER HIP OPERATIONS FOR "99"
        KNEE9 = NUMBER KNEE OPERATIONS FOR "99"
        FEMUR9 = NUMBER FEMUR OPERATIONS FOR "99";

  /* new response variable - */
  = log(+ +Y);
  IF = THEN =.;

  /* code for selecting subsets based on hospital location -south STATES*/
  IF STATE EQ 'CA' OR STATE EQ 'FL' or state='tx' or state='sc'
     or state='la' or state='ga' or state='nc' or state='al';

  ARRAY X {} BEDS RBEDS HIP9 KNEE9 HIP9 KNEE9 FEMUR9 OUTV ADM SIR;

  /* STEP TRANSFORMATIONS */
  DO I= TO ; X{I} = SQRT(X{I}); END;
  DO i= to ; X{I} = LOG(+.*X{I}); END;
  DO I= TO ; X{I} = LOG(+.*X{I}); END;

/* factor analysis in two stages, grouping the variables in subgroups */
PROC FACTOR data=hospital METHOD=PRIN NFACT= out=z;
  VAR HIP9 KNEE9 HIP9 KNEE9 FEMUR9;

PROC FACTOR data=hospital METHOD=PRIN NFACT= ROTATE=VARIMAX out=z;
  VAR BEDS RBEDS OUTV ADM SIR TH trauma rehab;

DATA z;
  set z;
  factor = factor;
  keep factor factor;

DATA hospout;
  merge z z;

/*cluster analysis using WARD */
PROC CLUSTER data=hospout METHOD=WARD;
  VAR factor-factor;
  COPY ZIP CITY STATE HID BEDS RBEDS OUTV ADM SIR HIP9 KNEE9 TH TRAUMA REHAB
       HIP9 KNEE9 FEMUR9 factor-factor;

PROC TREE NOPRINT NCL= OUT=TXCLUST;
  COPY ZIP CITY STATE HID BEDS RBEDS OUTV ADM SIR HIP9 KNEE9 TH TRAUMA REHAB
       HIP9 KNEE9 FEMUR9 factor-factor;

/* produce the cluster summary and pick the best cluster*/
PROC sort data= TXCLUST;
  by cluster;

PROC means noprint;
  BY cluster;
  VAR factor-factor;
  OUTPUT out=c mean= msales mf-mf;

PROC boxplot data= TXCLUST;
  plot *cluster;

SELECT TXCLUST;

DATA cl;
  set TXCLUST;
  if cluster=;

PROC REG DATA=cl;
  MODEL sales = Factor-factor/ P R selection=b;
  OUTPUT OUT=C P=PRED R=RESID STDP=STDP;

/* finally undo the clusters and calculate the potential gain */
DATA C;
  SET C;
  rowp = exp(pred+.*stdp*stdp)-;
  epred = exp(pred)-;
  sales = exp(sales) -;
  gain = rowp - sales;

PROC sort;
  by gain;

PROC print;

/* code for the special case when the cluster size is very small*/
DATA cl;
  SET TXCLUST;
  IF cluster=;
  sales = exp(sales) -;

PROC print;

PROC means data=cl;
  VAR sales;

endsas;

/* suppose the mean of sales is. */
DATA cl;
  set cl;
  gain =. - sales;

PROC sort;
  by gain;

PROC print;
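The back-transformation step near the end of the SAS code above (`rowp`, `gain`) can be illustrated in R. The numeric constants in that step did not survive in this copy of the paper, so the sketch assumes a log(1 + y) response and the usual lognormal-style correction exp(pred + 0.5 * stdp^2) - 1; both are assumptions, as are the simulated data and the single predictor `factor1`.

```r
set.seed(6)
n <- 60
cl <- data.frame(factor1 = rnorm(n))

# Simulated sales in dollars; the first five hospitals have no sales.
dollars <- exp(5 + 0.8 * cl$factor1 + rnorm(n, sd = 0.3))
dollars[1:5] <- 0

# Response on the log(1 + y) scale, with zero sales treated as missing.
cl$sales <- log1p(dollars)
cl$sales[dollars == 0] <- NA

# Fit on hospitals with sales (lm drops the NA rows), predict for everyone.
fit <- lm(sales ~ factor1, data = cl)
p <- predict(fit, newdata = cl, se.fit = TRUE)
cl$pred <- p$fit      # predicted value on the log scale
cl$stdp <- p$se.fit   # standard error of the mean prediction

# Back-transform with the assumed correction term and compute the gain.
cl$rowp <- exp(cl$pred + 0.5 * cl$stdp^2) - 1
actual  <- ifelse(is.na(cl$sales), 0, expm1(cl$sales))
cl$gain <- cl$rowp - actual

# Hospitals with no sales, ordered by potential gain (largest first).
targets <- cl[is.na(cl$sales), ]
targets <- targets[order(-targets$gain), ]
print(targets[, c("factor1", "rowp", "gain")])
```

For a hospital with no recorded sales, the gain equals the back-transformed prediction itself, which is why such hospitals rank at the top of the target list.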

# ***** FACTOR ANALYSIS USING R-SOFTWARE
library(foreign)  # provides read.xport
hh = read.xport("hosp.xpt")
dim(hh)
hh[,]
library(cluster)
plot(silhouette(pam(hh[,:], k=)), main = paste("k = ",), do.n.k=FALSE)
plot(silhouette(pam(hh[,:], k=)), main = paste("k = ",), do.n.k=FALSE)
plot(silhouette(pam(hh[,:], k=9)), main = paste("k = ",), do.n.k=FALSE)
plot(silhouette(pam(hh[,:], k=)), main = paste("k = ",), do.n.k=FALSE)
clpam = pam(hh[,:], k=)$cluster
table(clpam)
table(hh[,])
table(clpam,hh[,])
library(rpart)
rpart( ~FACTOR+FACTOR+FACTOR, data=hh)
predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh))
table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh)))
length(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh)))
length(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh),newdata=hh))
table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh),newdata=hh))
plot(rpart( ~FACTOR+FACTOR+FACTOR, data=hh))
text(rpart( ~FACTOR+FACTOR+FACTOR, data=hh))

# ***** RPART ANALYSIS USING R-SOFTWARE
library(rpart)
hh[,]
hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh)
plot(hh.rp)
text(hh.rp)
hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh,control=rpart.control(cp=.))
hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh,control=rpart.control(cp=.))
plot(hh.rp)
text(hh.rp)
plot(hh.rp, uni=T)
text(hh.rp,use.n=TRUE,cex=.)
hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh,control=rpart.control(cp=.))
plot(hh.rp, uni=T)
text(hh.rp,use.n=TRUE,cex=.)
hh.rp
table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh)))
table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh),newdata=hh))
predv = (predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh),newdata=hh))
factor(predv)[:]
as.numeric(factor(predv))[:]
table(as.numeric(factor(predv)))
cluster = (as.numeric(factor(predv)))
hh[ cluster==,]
factor(predv)[:]
exp(.) -
factor(predv)[:]


Use Data Mining Techniques to Assist Institutions in Achieving Enrollment Goals: A Case Study

Use Data Mining Techniques to Assist Institutions in Achieving Enrollment Goals: A Case Study Use Data Mining Techniques to Assist Institutions in Achieving Enrollment Goals: A Case Study Tongshan Chang The University of California Office of the President CAIR Conference in Pasadena 11/13/2008

More information

Predicting Customer Churn in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS

Predicting Customer Churn in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS Paper 114-27 Predicting Customer in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS Junxiang Lu, Ph.D. Sprint Communications Company Overland Park, Kansas ABSTRACT

More information

Data analysis process

Data analysis process Data analysis process Data collection and preparation Collect data Prepare codebook Set up structure of data Enter data Screen data for errors Exploration of data Descriptive Statistics Graphs Analysis

More information

Can SAS Enterprise Guide do all of that, with no programming required? Yes, it can.

Can SAS Enterprise Guide do all of that, with no programming required? Yes, it can. SAS Enterprise Guide for Educational Researchers: Data Import to Publication without Programming AnnMaria De Mars, University of Southern California, Los Angeles, CA ABSTRACT In this workshop, participants

More information

New Tricks for an Old Tool: Using Custom Formats for Data Validation and Program Efficiency

New Tricks for an Old Tool: Using Custom Formats for Data Validation and Program Efficiency New Tricks for an Old Tool: Using Custom Formats for Data Validation and Program Efficiency S. David Riba, JADE Tech, Inc., Clearwater, FL ABSTRACT PROC FORMAT is one of the old standards among SAS Procedures,

More information

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4. Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics

More information

Christianna S. Williams, University of North Carolina at Chapel Hill, Chapel Hill, NC

Christianna S. Williams, University of North Carolina at Chapel Hill, Chapel Hill, NC Christianna S. Williams, University of North Carolina at Chapel Hill, Chapel Hill, NC ABSTRACT Have you used PROC MEANS or PROC SUMMARY and wished there was something intermediate between the NWAY option

More information

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise

More information

CC03 PRODUCING SIMPLE AND QUICK GRAPHS WITH PROC GPLOT

CC03 PRODUCING SIMPLE AND QUICK GRAPHS WITH PROC GPLOT 1 CC03 PRODUCING SIMPLE AND QUICK GRAPHS WITH PROC GPLOT Sheng Zhang, Xingshu Zhu, Shuping Zhang, Weifeng Xu, Jane Liao, and Amy Gillespie Merck and Co. Inc, Upper Gwynedd, PA Abstract PROC GPLOT is a

More information

SAS Add-In 2.1 for Microsoft Office: Getting Started with Data Analysis

SAS Add-In 2.1 for Microsoft Office: Getting Started with Data Analysis SAS Add-In 2.1 for Microsoft Office: Getting Started with Data Analysis The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2007. SAS Add-In 2.1 for Microsoft Office: Getting

More information

SAS CLINICAL TRAINING

SAS CLINICAL TRAINING SAS CLINICAL TRAINING Presented By 3S Business Corporation Inc www.3sbc.com Call us at : 281-823-9222 Mail us at : info@3sbc.com Table of Contents S.No TOPICS 1 Introduction to Clinical Trials 2 Introduction

More information

Chapter 27 Using Predictor Variables. Chapter Table of Contents

Chapter 27 Using Predictor Variables. Chapter Table of Contents Chapter 27 Using Predictor Variables Chapter Table of Contents LINEAR TREND...1329 TIME TREND CURVES...1330 REGRESSORS...1332 ADJUSTMENTS...1334 DYNAMIC REGRESSOR...1335 INTERVENTIONS...1339 TheInterventionSpecificationWindow...1339

More information

Teaching Multivariate Analysis to Business-Major Students

Teaching Multivariate Analysis to Business-Major Students Teaching Multivariate Analysis to Business-Major Students Wing-Keung Wong and Teck-Wong Soon - Kent Ridge, Singapore 1. Introduction During the last two or three decades, multivariate statistical analysis

More information

Exploratory Analysis of Marketing Data: Trees vs. Regression

Exploratory Analysis of Marketing Data: Trees vs. Regression Exploratory Analysis of Marketing Data: Trees vs. Regression J. Scott Armstrong Assistant Professor of Marketing, The Wharton School and James G. Andress Consultant at Booz, Allen, and Hamilton, Inc.,

More information

Cluster this! June 2011

Cluster this! June 2011 Cluster this! June 2011 Agenda On the agenda today: SAS Enterprise Miner (some of the pros and cons of using) How multivariate statistics can be applied to a business problem using clustering Some cool

More information

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1. 15.7 Analytics and Data Mining 1

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1. 15.7 Analytics and Data Mining 1 M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1 15.7 Analytics and Data Mining 15.7 Analytics and Data Mining 1 Section 1.5 noted that advances in computing processing during the past 40 years have

More information

EXPLORATORY DATA ANALYSIS: GETTING TO KNOW YOUR DATA

EXPLORATORY DATA ANALYSIS: GETTING TO KNOW YOUR DATA EXPLORATORY DATA ANALYSIS: GETTING TO KNOW YOUR DATA Michael A. Walega Covance, Inc. INTRODUCTION In broad terms, Exploratory Data Analysis (EDA) can be defined as the numerical and graphical examination

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

A fast, powerful data mining workbench designed for small to midsize organizations

A fast, powerful data mining workbench designed for small to midsize organizations FACT SHEET SAS Desktop Data Mining for Midsize Business A fast, powerful data mining workbench designed for small to midsize organizations What does SAS Desktop Data Mining for Midsize Business do? Business

More information

An Exploratory Data Analysis Approach to Qualitative Response Modelling Using SAS/IML(R) and SAS/GRAPH(R) Software

An Exploratory Data Analysis Approach to Qualitative Response Modelling Using SAS/IML(R) and SAS/GRAPH(R) Software An Exploratory Data Analysis Approach to Qualitative Response Modelling Using SAS/IML(R) and SAS/GRAPH(R) Software Merwyn L. Elliott Ross Hightower Caleb Chan Statistical Services Laboratory Georgia State

More information

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

More information

Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

More information

Effective Use of SQL in SAS Programming

Effective Use of SQL in SAS Programming INTRODUCTION Effective Use of SQL in SAS Programming Yi Zhao Merck & Co. Inc., Upper Gwynedd, Pennsylvania Structured Query Language (SQL) is a data manipulation tool of which many SAS programmers are

More information

Factor Analysis. Chapter 420. Introduction

Factor Analysis. Chapter 420. Introduction Chapter 420 Introduction (FA) is an exploratory technique applied to a set of observed variables that seeks to find underlying factors (subsets of variables) from which the observed variables were generated.

More information

METHODS FAIR HEALTH FH NPIC DATA

METHODS FAIR HEALTH FH NPIC DATA Understanding Patterns in the Utilization and Cost of Elbow Reconstruction Surgeries: A Healthcare Procedure that is Common among Baseball Pitchers Eric Okurowski, MBA; Jeff Dang, PhD, FAIR Health Inc.,

More information

USING DATA SCIENCE TO DISCOVE INSIGHT OF MEDICAL PROVIDERS CHARGE FOR COMMON SERVICES

USING DATA SCIENCE TO DISCOVE INSIGHT OF MEDICAL PROVIDERS CHARGE FOR COMMON SERVICES USING DATA SCIENCE TO DISCOVE INSIGHT OF MEDICAL PROVIDERS CHARGE FOR COMMON SERVICES Irron Williams Northwestern University IrronWilliams2015@u.northwestern.edu Abstract--Data science is evolving. In

More information

Essential Project Management Reports in Clinical Development Nalin Tikoo, BioMarin Pharmaceutical Inc., Novato, CA

Essential Project Management Reports in Clinical Development Nalin Tikoo, BioMarin Pharmaceutical Inc., Novato, CA Essential Project Management Reports in Clinical Development Nalin Tikoo, BioMarin Pharmaceutical Inc., Novato, CA ABSTRACT Throughout the course of a clinical trial the Statistical Programming group is

More information

A Basic Guide to Modeling Techniques for All Direct Marketing Challenges

A Basic Guide to Modeling Techniques for All Direct Marketing Challenges A Basic Guide to Modeling Techniques for All Direct Marketing Challenges Allison Cornia Database Marketing Manager Microsoft Corporation C. Olivia Rud Executive Vice President Data Square, LLC Overview

More information

PRINCIPAL COMPONENT ANALYSIS

PRINCIPAL COMPONENT ANALYSIS 1 Chapter 1 PRINCIPAL COMPONENT ANALYSIS Introduction: The Basics of Principal Component Analysis........................... 2 A Variable Reduction Procedure.......................................... 2

More information

Data mining and statistical models in marketing campaigns of BT Retail

Data mining and statistical models in marketing campaigns of BT Retail Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120

More information

Data Exploration Data Visualization

Data Exploration Data Visualization Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

Paper D10 2009. Ranking Predictors in Logistic Regression. Doug Thompson, Assurant Health, Milwaukee, WI

Paper D10 2009. Ranking Predictors in Logistic Regression. Doug Thompson, Assurant Health, Milwaukee, WI Paper D10 2009 Ranking Predictors in Logistic Regression Doug Thompson, Assurant Health, Milwaukee, WI ABSTRACT There is little consensus on how best to rank predictors in logistic regression. This paper

More information

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501 CLUSTER ANALYSIS Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 30602-2501 January 2006 Introduction Cluster analysis includes a broad suite of techniques designed to find groups

More information

EXST SAS Lab Lab #4: Data input and dataset modifications

EXST SAS Lab Lab #4: Data input and dataset modifications EXST SAS Lab Lab #4: Data input and dataset modifications Objectives 1. Import an EXCEL dataset. 2. Infile an external dataset (CSV file) 3. Concatenate two datasets into one 4. The PLOT statement will

More information

The Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA

The Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA Paper 156-2010 The Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA Abstract JMP has a rich set of visual displays that can help you see the information

More information

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge

More information

2009 CREDO Center for Research on Education Outcomes (CREDO) Stanford University Stanford, CA http://credo.stanford.edu June 2009

2009 CREDO Center for Research on Education Outcomes (CREDO) Stanford University Stanford, CA http://credo.stanford.edu June 2009 Technical Appendix 2009 CREDO Center for Research on Education Outcomes (CREDO) Stanford University Stanford, CA http://credo.stanford.edu June 2009 CREDO gratefully acknowledges the support of the State

More information

ABSTRACT INTRODUCTION %CODE MACRO DEFINITION

ABSTRACT INTRODUCTION %CODE MACRO DEFINITION Generating Web Application Code for Existing HTML Forms Don Boudreaux, PhD, SAS Institute Inc., Austin, TX Keith Cranford, Office of the Attorney General, Austin, TX ABSTRACT SAS Web Applications typically

More information

ln(p/(1-p)) = α +β*age35plus, where p is the probability or odds of drinking

ln(p/(1-p)) = α +β*age35plus, where p is the probability or odds of drinking Dummy Coding for Dummies Kathryn Martin, Maternal, Child and Adolescent Health Program, California Department of Public Health ABSTRACT There are a number of ways to incorporate categorical variables into

More information

Chapter 25 Specifying Forecasting Models

Chapter 25 Specifying Forecasting Models Chapter 25 Specifying Forecasting Models Chapter Table of Contents SERIES DIAGNOSTICS...1281 MODELS TO FIT WINDOW...1283 AUTOMATIC MODEL SELECTION...1285 SMOOTHING MODEL SPECIFICATION WINDOW...1287 ARIMA

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Doing Quantitative Research 26E02900, 6 ECTS Lecture 2: Measurement Scales. Olli-Pekka Kauppila Rilana Riikkinen

Doing Quantitative Research 26E02900, 6 ECTS Lecture 2: Measurement Scales. Olli-Pekka Kauppila Rilana Riikkinen Doing Quantitative Research 26E02900, 6 ECTS Lecture 2: Measurement Scales Olli-Pekka Kauppila Rilana Riikkinen Learning Objectives 1. Develop the ability to assess a quality of measurement instruments

More information

Identification of noisy variables for nonmetric and symbolic data in cluster analysis

Identification of noisy variables for nonmetric and symbolic data in cluster analysis Identification of noisy variables for nonmetric and symbolic data in cluster analysis Marek Walesiak and Andrzej Dudek Wroclaw University of Economics, Department of Econometrics and Computer Science,

More information

A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND

A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression

More information

Joseph Twagilimana, University of Louisville, Louisville, KY

Joseph Twagilimana, University of Louisville, Louisville, KY ST14 Comparing Time series, Generalized Linear Models and Artificial Neural Network Models for Transactional Data analysis Joseph Twagilimana, University of Louisville, Louisville, KY ABSTRACT The aim

More information

Statistical Discovery

Statistical Discovery SCSUG 2014 JMP Visual Statistics Charles Edwin Shipp, Consider Consulting Corp, Los Angeles, CA ABSTRACT For beginners, we review the continuing merging of statistics and graphics. Statistical graphics

More information

An Overview and Evaluation of Decision Tree Methodology

An Overview and Evaluation of Decision Tree Methodology An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com

More information

SAS R IML (Introduction at the Master s Level)

SAS R IML (Introduction at the Master s Level) SAS R IML (Introduction at the Master s Level) Anton Bekkerman, Ph.D., Montana State University, Bozeman, MT ABSTRACT Most graduate-level statistics and econometrics programs require a more advanced knowledge

More information

Sun Li Centre for Academic Computing lsun@smu.edu.sg

Sun Li Centre for Academic Computing lsun@smu.edu.sg Sun Li Centre for Academic Computing lsun@smu.edu.sg Elementary Data Analysis Group Comparison & One-way ANOVA Non-parametric Tests Correlations General Linear Regression Logistic Models Binary Logistic

More information

Let SAS Modify Your Excel File Nelson Lee, Genentech, South San Francisco, CA

Let SAS Modify Your Excel File Nelson Lee, Genentech, South San Francisco, CA ABSTRACT PharmaSUG 2015 - Paper QT12 Let SAS Modify Your Excel File Nelson Lee, Genentech, South San Francisco, CA It is common to export SAS data to Excel by creating a new Excel file. However, there

More information

Assessing Model Fit and Finding a Fit Model

Assessing Model Fit and Finding a Fit Model Paper 214-29 Assessing Model Fit and Finding a Fit Model Pippa Simpson, University of Arkansas for Medical Sciences, Little Rock, AR Robert Hamer, University of North Carolina, Chapel Hill, NC ChanHee

More information

Didacticiel - Études de cas

Didacticiel - Études de cas 1 Topic Linear Discriminant Analysis Data Mining Tools Comparison (Tanagra, R, SAS and SPSS). Linear discriminant analysis is a popular method in domains of statistics, machine learning and pattern recognition.

More information

Performance Test Suite Results for SAS 9.1 Foundation on the IBM zseries Mainframe

Performance Test Suite Results for SAS 9.1 Foundation on the IBM zseries Mainframe Performance Test Suite Results for SAS 9.1 Foundation on the IBM zseries Mainframe A SAS White Paper Table of Contents The SAS and IBM Relationship... 1 Introduction...1 Customer Jobs Test Suite... 1

More information