Statistics & Analysis


NESUG

How to Increase Sales of Orthopedic Equipment in United States: Factor and Cluster Analysis using SAS and R

George Obsekov, American College of Radiology Research Center, Philadelphia, PA

INTRODUCTION

This paper analyzes the sales of orthopedic equipment to United States hospitals. The purpose of the analysis was to find a way to increase the company's sales to hospitals and to define the list of hospitals where sales gains could be maximized. To construct such a list, I created a subset of hospitals based on geographical location (Southern USA). For each hospital I analyzed descriptive variables such as the number of beds, the number of outpatient visits, the number of certain types of operations, and administrative cost. My response variable was the sales of rehabilitation equipment.

After examining scatter plots of each explanatory variable against the response variable, I applied either a log or a square root transformation so that the scatter plots displayed a trend close to linear. The next step was factor analysis, used to split the variables into main groups that describe the different aspects of the hospitals. Based on the rotated factor pattern, the first factor covered the number of operations, the second covered size, and the third covered rehab.

After defining the factors I applied cluster analysis to group hospitals with similar characteristics. Reviewing the cluster history from Ward's minimum variance method, I examined the gaps between SPRSQ values to choose the cutoff for the number of clusters. I then chose a cluster with high average sales that contained a few hospitals with very low or no sales. A subsequent regression analysis helped me determine the list of hospitals with the highest potential sales gains.

Finally, I applied the methods for robust clustering (PAM) and for classification and regression trees (rpart) using R. The clusters selected by my cluster analysis were well supported by the PAM method.

MARKET SEGMENT SELECTION

To increase the sales of orthopedic materials to US-based hospitals, I created a subset of the hospitals in the full data set. Hospitals from the following states were chosen based on geographical selection (Southern states): California, Texas, Louisiana, Alabama, Georgia, Florida, South Carolina, and North Carolina. This gave the final market-segment subset, with variables describing the main characteristics of each hospital in it. All major variables are presented in the table of variables below.

Variable          Description                                              Comments

Response variable: Y :
                  Sales of Rehabilitation Equipment Jan - July             Zero means missing.
                  Sales of Rehabilitation Equipment for previous months

ZIP               US Postal Code
HID               Hospital ID
CITY              City Name
STATE             State Name
BEDS              Number of Hospital Beds
RBEDS             Number of Rehab Beds
OUT-V             Number of Outpatient Visits
ADM*              Administrative Cost                                      In \$ s per year.
SIR               Revenue from Inpatient
HIP9              Number of HIP Operations for 99
KNEE9             Number of KNEE Operations for 99
TH (binary)*      Teaching hospital                                        = teaching, = non-teaching.
TRAUMA (binary)*  Do They Have a Trauma Unit?                              =Yes, =No.
REHAB (binary)*   Do They Have a Rehab Unit?                               =Yes, =No.
HIP9              Number of HIP Operations for 99
KNEE9             Number of KNEE Operations for 99
FEMUR9            Number of FEMUR Operations for 99

Table: Demographic and operational variables used in the prediction to maximize sales

TRANSFORMATION

For the selected subset of hospitals I reviewed scatter plots of each explanatory variable against my response variable. On close inspection it was clear that all of my variables required transformation in order to appear close to linear. Figure A and B show that the variable BEDS is better served by a square root transformation than by a log transformation.
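One simple way to back up a visual choice between transformations is to compare how linearly each candidate relates to the transformed response. The following is a minimal sketch in R on synthetic data, not the hospital data set; the variable names `beds` and `log_sales`, the simulated relationship, and the use of the correlation coefficient as a linearity measure are all assumptions made for illustration.

```r
# Synthetic stand-in for the hospital data (all values are hypothetical).
set.seed(1)
beds <- round(runif(200, min = 20, max = 900))

# Simulated response on the log scale, built to be linear in sqrt(beds).
log_sales <- 2 + 0.25 * sqrt(beds) + rnorm(200, sd = 0.5)

# Candidate transformations of the explanatory variable.
candidates <- list(raw = beds, sqrt = sqrt(beds), log = log(1 + beds))

# Linear correlation of each candidate with the response.
linearity <- sapply(candidates, function(x) cor(x, log_sales))
print(round(linearity, 3))

# The candidate with the largest absolute correlation is the most linear one.
best <- names(which.max(abs(linearity)))
print(best)
```

The correlation is only a numeric companion to the scatter plots; it does not replace looking at them, since a high correlation can still hide curvature at the extremes.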

Figure A: Sales vs. BEDS, transformation selection in SQRT
Figure B: Sales vs. BEDS, transformation selection in LOG

The number of rehab beds (RBEDS) and the operational variables (HIP9, KNEE9, HIP9, KNEE9 and FEMUR9) were transformed to log (+.xi), while the OUT-V, ADM and SIR variables appeared closer to linear when log (+.xi) was applied. The binary variables (TH, TRAUMA and REHAB) did not require any transformation. Finally, my response variable was transformed from y to log (+y), where y is the combination of all sales of rehabilitation equipment. The scatter plots after transformation appear in Figure A and B.

Figure A: Sales vs. BEDS, RBEDS, HIP9, KNEE9, HIP9, and KNEE9 after transformation

Figure B: Sales vs. FEMUR9, OUTV, ADM, and SIR after transformation

DIMENSION REDUCTION

Dimension reduction was carried out with factor analysis, which summarizes the operational and demographic variables in the selected subset. Using the FACTOR procedure, three factors were constructed for further analysis: an operational factor (HIP9, KNEE9, HIP9, KNEE9, and FEMUR9), a size factor (BEDS, OUTV, ADM, SIR, TH, and TRAUMA) and a rehab factor (RBEDS and REHAB).

After an initial factor analysis of all variables in one stage, I decided to run the factor analysis in two stages to obtain a better interpretation of the factors. This meant splitting the variables into two subgroups: one with the operational variables only, and another with the size and rehab variables. As the eigenvalues in Table A show, for the operational variables the eigenvalue for Factor has a proportion of 9.%, while the eigenvalues for the other factors have a proportion of more than %, according to Table C. The factor pattern for stage two divided the remaining variables into two groups: a SIZE group (BEDS, OUTV, ADM, SIR, TH and TRAUMA) and a REHAB group (RBEDS and REHAB).
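The second stage described above (principal-component extraction followed by a varimax rotation) can be sketched in base R. This is a minimal illustration on synthetic data, assuming two hypothetical latent traits named `size` and `rehab`; the variable names are borrowed from the paper, but the values, the noise level, and the choice of five variables are invented for the example and do not reproduce the SAS output.

```r
# Synthetic data driven by two hypothetical latent traits.
set.seed(2)
n     <- 300
size  <- rnorm(n)
rehab <- rnorm(n)
X <- cbind(
  BEDS  = size  + rnorm(n, sd = 0.4),
  OUTV  = size  + rnorm(n, sd = 0.4),
  ADM   = size  + rnorm(n, sd = 0.4),
  RBEDS = rehab + rnorm(n, sd = 0.4),
  REHAB = rehab + rnorm(n, sd = 0.4)
)

# Principal-component extraction from the correlation matrix.
e <- eigen(cor(X))
proportion <- e$values / sum(e$values)   # share of variance per component
print(round(cbind(Eigenvalue = e$values, Proportion = proportion,
                  Cumulative = cumsum(proportion)), 3))

# Keep two factors: loadings are eigenvectors scaled by sqrt(eigenvalue).
k <- 2
loadings <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))
rownames(loadings) <- colnames(X)

# Varimax rotation, then assign each variable to its dominant factor.
rotated <- unclass(varimax(loadings)$loadings)
print(round(rotated, 2))
group <- apply(abs(rotated), 1, which.max)
print(group)
```

In this sketch the three size-like variables end up on one rotated factor and the two rehab-like variables on the other, which is the kind of separation the rotated factor pattern is meant to reveal.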

Stage One

One factor will be retained by the NFACTOR criterion.

Table A: Eigenvalues of the Correlation Matrix in Stage One (columns: Eigenvalue, Difference, Proportion, Cumulative)

Variable  Description                         Factor
HIP9      NUMBER OF HIP OPERATIONS FOR 99     .9
KNEE9     NUMBER OF KNEE OPERATIONS FOR 99    .9
HIP9      NUMBER HIP OPERATIONS FOR 99        .9
KNEE9     NUMBER KNEE OPERATIONS FOR 99       .9
FEMUR9    NUMBER FEMUR OPERATIONS FOR 99      .9

Table B: Factor Pattern in Stage One including Number of Operations

Stage Two

Two factors will be retained by the NFACTOR criterion.

Table C: Eigenvalues of the Correlation Matrix in Stage Two (columns: Eigenvalue, Difference, Proportion, Cumulative)

Variable  Description                     Factor  Factor
BEDS      NUMBER OF HOSPITAL BEDS         .9      .
RBEDS     NUMBER OF REHAB BEDS            -.      .9
OUTV      NUMBER OF OUTPATIENT VISITS
ADM       ADMINISTRATIVE COST             .9      -.
SIR       REVENUE FROM INPATIENT          .9      -.
TH        TEACHING HOSPITAL?              .       .
TRAUMA    DO THEY HAVE A TRAUMA UNIT?     .       .
REHAB     DO THEY HAVE A REHAB UNIT?      .       .9

Table D: Rotated Factor Pattern for Two Factors (SIZE and REHAB)

The figure below presents the eigenvalue distribution for stage one, using one factor, and for stage two, using two factors.

Figure: Eigenvalues for the operational factor (left panel, stage one, one factor) and the size/rehab factors (right panel, stage two, two factors)

The final distribution of the analyzed variables across factors is shown in the table below.

Variable Description               Variable Name  Stage      Factor
Number of hospital beds            BEDS           Stage Two  Size
Number of rehab beds               RBEDS          Stage Two  Rehab
Number of outpatient visits        OUT-V          Stage Two  Size
Administrative Cost                ADM            Stage Two  Size
Revenue from inpatient             SIR            Stage Two  Size
Number of HIP operation for 99     HIP9           Stage One  Operational
Number of KNEE operation for 99    KNEE9          Stage One  Operational
Teaching hospital                  TH             Stage Two  Size
Do they have a trauma unit?        TRAUMA         Stage Two  Size
Do they have a rehab unit?         REHAB          Stage Two  Rehab
Number of HIP operation for 99     HIP9           Stage One  Operational
Number of KNEE operation for 99    KNEE9          Stage One  Operational
Number of FEMUR operation for 99   FEMUR9         Stage One  Operational

Table: Final Distribution of Variables in Factors

CLUSTER ANALYSIS

I used cluster analysis to determine the best cluster to concentrate on for improving our sales. The cluster history from Ward's method shows where the biggest jump in SPRSQ occurs between successive numbers of clusters, and I chose the number of clusters for further analysis at that jump.
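The idea of reading a cutoff from the gaps between SPRSQ values can be sketched in R with `hclust`. This is an illustration on synthetic factor scores (three well-separated groups), not the hospital data. It relies on the fact that, for `method = "ward.D2"` on Euclidean distances, half the squared merge height equals the increase in within-cluster sum of squares, so dividing by the total sum of squares gives a semipartial R-squared per merge. The names `scores`, `history`, and `n_clusters`, and the choice to search only the first ten rows, are assumptions for the example.

```r
# Hypothetical factor scores: three separated groups in two dimensions.
set.seed(3)
centers <- rbind(c(0, 0), c(6, 0), c(0, 6))
scores  <- centers[rep(1:3, each = 30), ] +
  matrix(rnorm(180, sd = 0.7), ncol = 2)

# Ward's minimum variance clustering.
hc <- hclust(dist(scores), method = "ward.D2")

# Semipartial R-squared per merge: (height^2 / 2) / total sum of squares.
tss   <- sum(scale(scores, scale = FALSE)^2)
sprsq <- rev(hc$height^2 / 2) / tss      # row i = merge that leaves i clusters

# Cluster history in the same orientation as a Ward table (NCL ascending).
history <- data.frame(NCL = seq_along(sprsq), SPRSQ = sprsq)
history$Difference <- c(-diff(sprsq), NA)  # SPRSQ(NCL) - SPRSQ(NCL + 1)
print(head(history, 6), digits = 3)

# The largest drop marks the last "expensive" merge; cut just below it.
n_clusters <- which.max(history$Difference[1:10]) + 1
clusters   <- cutree(hc, k = n_clusters)
print(table(clusters))
```

With real data the largest gap is rarely this clean, so the table is usually read together with the sizes and interpretability of the resulting clusters.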

Table: Cluster selection based on the cluster history from Ward's minimum variance method (columns: NCL, Clusters Joined, SPRSQ, Difference)

Next, I created a box plot of sales against the clusters. Based on this graph, one cluster had the highest mean sales and also contained some hospitals that didn't have any sales at all.

Figure: Box plot of sales per CLUSTER

Examination of the table of mean sales per cluster confirmed that the chosen cluster has the highest mean sales. The chosen cluster contained a large number of hospitals, so they cannot be assumed to be homogeneous. Since the sample size is large, I used a regression estimate for the subsequent analysis.

Table: Sales and factors per cluster (columns: CLUSTER, FREQ, msales, mf, mf, mf)

REGRESSION ANALYSIS

In my regression analysis I used stepwise backward elimination to determine which factors are significant and must be retained in the model. The elimination removed two factors and retained the operational factor as significant for the model. Next, I identified the hospitals with no sales. I then analyzed the potential gain for each of them and found that, to increase the sales of orthopedic equipment, we should concentrate on six hospitals for a potential gain of \$9,9. The hospitals, with their hospital IDs, are: HID (Galveston, TX), HID (Thomasville, GA), HID 9 (Los Angeles, CA), HID 9 (Valdosta, GA), HID 9 (Fort Myers, FL), and HID 999 (Downey, CA).

PAM ANALYSIS IN R

I used R to apply the method for robust clustering (PAM) in order to identify the best number of clusters. As the comparison of average silhouette widths shows, k= generates the highest average silhouette width (.9).
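Choosing k by average silhouette width, and then checking the PAM result against a Ward solution, can be sketched as follows. The sketch uses the `cluster` package that the paper's R code also loads, but runs on synthetic factor scores; the range of k values tried and the names `scores`, `avg_width`, `best_k`, and `ward` are assumptions for illustration.

```r
library(cluster)

# Hypothetical factor scores: three separated groups in two dimensions.
set.seed(4)
centers <- rbind(c(0, 0), c(6, 0), c(0, 6))
scores  <- centers[rep(1:3, each = 30), ] +
  matrix(rnorm(180, sd = 0.7), ncol = 2)

# Average silhouette width for each candidate number of clusters.
ks <- 2:6
avg_width <- sapply(ks, function(k) pam(scores, k = k)$silinfo$avg.width)
names(avg_width) <- ks
print(round(avg_width, 3))

# Keep the k with the widest average silhouette.
best_k <- ks[which.max(avg_width)]
clpam  <- pam(scores, k = best_k)$clustering

# Cross-tabulate PAM clusters against a Ward solution with the same k.
ward <- cutree(hclust(dist(scores), method = "ward.D2"), k = best_k)
tab  <- table(clpam, ward)
print(tab)
```

When the two methods agree, each row of the cross-tabulation has a single non-zero cell; cluster labels themselves are arbitrary, so agreement shows up as a permutation rather than a diagonal.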

Figure: Average silhouette width comparison based on the number of clusters

The cross-tabulation of the two solutions shows that the robust clustering method used in R matches the SAS selection of market segments well.

Table: Cluster selection using SAS and R software (PAM cluster selection, clpam, against SAS cluster selection)

RPART ANALYSIS IN R

Using the rpart analysis, I found the segment with the highest predicted potential gain and the observations in it that had missing sales (see the regression tree figure).

Figure: Final regression tree, showing the number of observations in each of its five terminal nodes

I then identified the hospitals in this segment to determine whether any of them had previously been chosen for increasing our sales gain. As the table below shows, the hospital with HID 9 (Fort Myers, FL) had already been selected by the regression analysis.

Observation  CITY        STATE  HID  CLUSTER
9            Hemet       CA     99   NA
             Cape Coral  FL     9    NA
             Hawaiian    CA     99   NA
             Oakland     CA     9    NA
             Greensboro  NC          NA
9            Melbourne   FL     9    NA
             Sacramento  CA     99   NA
             Fort Myers  FL     9    NA

Table: Cluster selection table using the RPART method
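The way a regression tree turns into segments can be sketched with `rpart`. Each terminal node predicts one value, so grouping observations by predicted value recovers the segments, and the segment with the largest prediction is the one to inspect. The data below are synthetic; the column names `FACTOR1` to `FACTOR3` and `SALES`, the simulated effect sizes, and the assumption that the response is on a log(1 + y) scale are all hypothetical.

```r
library(rpart)

# Synthetic factor scores and a log-scale response (all hypothetical).
set.seed(5)
n  <- 400
hh <- data.frame(FACTOR1 = rnorm(n), FACTOR2 = rnorm(n), FACTOR3 = rnorm(n))
hh$SALES <- 3 + 2 * (hh$FACTOR1 > 0) + 1.5 * (hh$FACTOR2 > 0.5) +
  rnorm(n, sd = 0.5)

# Regression tree on the three factors.
hh.rp <- rpart(SALES ~ FACTOR1 + FACTOR2 + FACTOR3, data = hh,
               control = rpart.control(cp = 0.01))

# One predicted value per terminal node, so predictions define segments.
predv   <- predict(hh.rp, newdata = hh)
segment <- as.numeric(factor(predv))     # 1 = lowest prediction
print(table(segment))

# Observations in the segment with the highest predicted response.
top <- hh[segment == max(segment), ]
print(nrow(top))

# Back-transform the leaf prediction, assuming a log(1 + y) response.
print(expm1(max(predv)))
```

Hospitals in the top segment whose recorded sales are missing or far below the leaf prediction are the natural candidates to compare against the list from the regression analysis.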

RANDOM FOREST METHOD

Using the random forest method, I found that for the Fort Myers hospital (HID 9) the more accurate predicted value is .9, and exp (.9) corresponds to a sales gain of about \$,.

CONCLUSIONS

In this project I analyzed a subset of hospitals, selected geographically, and used cluster analysis to place them into market segments of closely similar hospitals. By finding the best cluster, the one with the highest mean sales that also contained hospitals with no sales, I was able to estimate a potential sales gain above \$, for the following hospitals: HID (Galveston, TX), HID (Thomasville, GA), HID 9 (Los Angeles, CA), HID 9 (Valdosta, GA), HID 9 (Fort Myers, FL), and HID 999 (Downey, CA). The results from the PAM, RPART and random forest methods provide strong supporting evidence that these hospitals are good candidates for improving sales of orthopedic equipment as a short-term solution.

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Thanks also to Stan Legum, co-chair of section, whose feedback has proved invaluable to the writing of this paper.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

George Obsekov
American College of Radiology Research Center
Market Street
Philadelphia, PA

APPENDIX

The following code was used to carry out the full analysis of the selected dataset:

DATA sasuser.hospital;
  INFILE 'hospital.txt' DELIMITER=',';
  INPUT ZIP \$ HID \$ CITY \$ STATE \$ BEDS RBEDS OUTV ADM SIR Y
        HIP9 KNEE9 TH TRAUMA REHAB HIP9 KNEE9 FEMUR9;

DATA hospital;
  SET sasuser.hospital;
  label ZIP = US POSTAL CODE
        HID = HOSPITAL ID
        CITY = CITY NAME
        STATE = STATE NAME
        BEDS = NUMBER OF HOSPITAL BEDS
        RBEDS = NUMBER OF REHAB BEDS
        OUTV = NUMBER OF OUTPATIENT VISITS
        ADM = ADMINISTRATIVE COST
        SIR = REVENUE FROM INPATIENT
        Y = OF REHABILITATION EQUIPMENT SINCE "JAN -JULY "
        = OF REHAB EQUIP FOR THE PREVIOUS "" MO
        HIP9 = NUMBER OF HIP OPERATIONS FOR "99"
        KNEE9 = NUMBER OF KNEE OPERATIONS FOR "99"
        TH = TEACHING HOSPITAL?
        TRAUMA = DO THEY HAVE A TRAUMA UNIT?
        REHAB = DO THEY HAVE A REHAB UNIT?
        HIP9 = NUMBER HIP OPERATIONS FOR "99"
        KNEE9 = NUMBER KNEE OPERATIONS FOR "99"
        FEMUR9 = NUMBER FEMUR OPERATIONS FOR "99";

  /* new response variable - */
  = log(+ +Y);
  IF = THEN =.;

  /* select the subset based on hospital location - southern states */
  /* state codes are compared in upper case, since SAS string comparison is case-sensitive */
  IF STATE IN ('CA', 'FL', 'TX', 'SC', 'LA', 'GA', 'NC', 'AL');

  ARRAY X {} BEDS RBEDS HIP9 KNEE9 HIP9 KNEE9 FEMUR9 OUTV ADM SIR;

  /* STEP TRANSFORMATIONS */
  DO I= TO ; X{I} = SQRT(X{I}); END;
  DO I= TO ; X{I} = LOG(+.*X{I}); END;
  DO I= TO ; X{I} = LOG(+.*X{I}); END;

/* factor analysis in two stages, grouping the variables in subgroups */
PROC FACTOR data=hospital METHOD=PRIN NFACT= out=z;
  VAR HIP9 KNEE9 HIP9 KNEE9 FEMUR9;

PROC FACTOR data=hospital METHOD=PRIN NFACT= ROTATE=VARIMAX out=z;
  VAR BEDS RBEDS OUTV ADM SIR TH trauma rehab;

DATA z;
  set z;
  factor = factor;
  keep factor factor;

DATA hospout;
  merge z z;

/* cluster analysis using WARD */
PROC CLUSTER data=hospout METHOD=WARD;
  VAR factor-factor;
  COPY ZIP CITY STATE HID BEDS RBEDS OUTV ADM SIR HIP9 KNEE9 TH TRAUMA REHAB
       HIP9 KNEE9 FEMUR9 factor-factor;

PROC TREE NOPRINT NCL= OUT=TXCLUST;
  COPY ZIP CITY STATE HID BEDS RBEDS OUTV ADM SIR HIP9 KNEE9 TH TRAUMA REHAB
       HIP9 KNEE9 FEMUR9 factor-factor;

/* produce the cluster summary and pick the best cluster */
PROC sort data= TXCLUST;
  by cluster;

PROC means noprint;
  BY cluster;
  VAR factor-factor;
  OUTPUT out=c mean= msales mf-mf;

PROC boxplot data= TXCLUST;
  plot *cluster;

SELECT TXCLUST;

DATA cl;
  set TXCLUST;
  if cluster=;

PROC REG DATA=cl;
  MODEL sales = Factor-factor/ P R selection=b;
  OUTPUT OUT=C P=PRED R=RESID STDP=STDP;

/* finally undo the clusters and calculate the potential gain */
DATA C;
  SET C;
  rowp = exp(pred+.*stdp*stdp)-;
  epred = exp(pred)-;
  sales = exp(sales) -;
  gain = rowp - sales;

PROC sort;
  by gain;
PROC print;

/* code for the special case when the cluster size is very small */
DATA cl;
  SET TXCLUST;
  IF cluster=;
  sales = exp(sales) -;

PROC print;
PROC means data=cl;
  VAR sales;
endsas;

/* suppose the mean of sales is. */
DATA cl;
  set cl;
  gain =. - sales;

PROC sort;
  by gain;
PROC print;
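The last SAS steps above fit a regression inside the chosen cluster, undo the log transformation, and rank hospitals by gain. A rough R counterpart is sketched below on synthetic data. Several choices are assumptions and not taken from the paper: the single predictor `factor1`, the `log1p`/`expm1` pair as the response transformation, the 0.5 multiplier on the squared standard error in the back-transformation, and treating missing sales as zero when computing the gain. Backward elimination is omitted to keep the sketch short.

```r
# Synthetic cluster of hospitals (all values are hypothetical).
set.seed(6)
n  <- 60
cl <- data.frame(factor1 = rnorm(n))           # stand-in operational factor
true_sales <- exp(4 + 0.8 * cl$factor1 + rnorm(n, sd = 0.4))
true_sales[1:5] <- 0                           # hospitals with no sales

# Response on the log(1 + y) scale; zero is treated as missing.
cl$sales <- log1p(true_sales)
cl$sales[cl$sales == 0] <- NA

# Regression within the cluster; rows with a missing response are dropped.
fit <- lm(sales ~ factor1, data = cl)
p   <- predict(fit, newdata = cl, se.fit = TRUE)

# Back-transform the prediction (assumed correction: 0.5 * se^2).
cl$rowp <- exp(p$fit + 0.5 * p$se.fit^2) - 1

# Gain = predicted sales minus observed sales (missing counted as zero).
observed <- ifelse(is.na(cl$sales), 0, expm1(cl$sales))
cl$gain  <- cl$rowp - observed

# Hospitals with the largest potential gain first.
print(head(cl[order(-cl$gain), c("factor1", "rowp", "gain")]))
```

Hospitals with no recorded sales get a gain equal to their full predicted sales, which is why they tend to rise to the top of the ranking.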

/***** FACTOR ANALYSIS USING R-SOFTWARE
library(foreign)   # provides read.xport
hh = read.xport("hosp.xpt")
dim(hh)
hh[,]
library(cluster)
plot(silhouette(pam(hh[,:], k=)), main = paste("k = ",), do.n.k=FALSE)
plot(silhouette(pam(hh[,:], k=)), main = paste("k = ",), do.n.k=FALSE)
plot(silhouette(pam(hh[,:], k=9)), main = paste("k = ",), do.n.k=FALSE)
plot(silhouette(pam(hh[,:], k=)), main = paste("k = ",), do.n.k=FALSE)
clpam = pam(hh[,:], k=)\$cluster
table(clpam)
table(hh[,])
table(clpam,hh[,])
library(rpart)
rpart( ~FACTOR+FACTOR+FACTOR, data=hh)
predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh))
table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh)))
length(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh)))
length(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh),newdata=hh))
table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh),newdata=hh))
plot(rpart( ~FACTOR+FACTOR+FACTOR, data=hh))
text(rpart( ~FACTOR+FACTOR+FACTOR, data=hh))

/***** RPART ANALYSIS USING R-SOFTWARE
library(rpart)
hh[,]
hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh)
plot(hh.rp)
text(hh.rp)
hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh,control=rpart.control(cp=.))
hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh,control=rpart.control(cp=.))
plot(hh.rp)
text(hh.rp)
plot(hh.rp, uni=TRUE)
text(hh.rp,use.n=TRUE,cex=.)
hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh,control=rpart.control(cp=.))
plot(hh.rp, uni=TRUE)
text(hh.rp,use.n=TRUE,cex=.)
hh.rp
table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh)))
table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh),newdata=hh))
predv = (predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh),newdata=hh))
factor(predv)[:]
as.numeric(factor(predv))[:]
table(as.numeric(factor(predv)))
cluster = (as.numeric(factor(predv)))
hh[ cluster==,]
factor(predv)[:]
exp(.) -
factor(predv)[:]


Exploratory Analysis of Marketing Data: Trees vs. Regression

Exploratory Analysis of Marketing Data: Trees vs. Regression J. Scott Armstrong Assistant Professor of Marketing, The Wharton School and James G. Andress Consultant at Booz, Allen, and Hamilton, Inc.,

SAS CLINICAL TRAINING

SAS CLINICAL TRAINING Presented By 3S Business Corporation Inc www.3sbc.com Call us at : 281-823-9222 Mail us at : info@3sbc.com Table of Contents S.No TOPICS 1 Introduction to Clinical Trials 2 Introduction

Teaching Multivariate Analysis to Business-Major Students

Teaching Multivariate Analysis to Business-Major Students Wing-Keung Wong and Teck-Wong Soon - Kent Ridge, Singapore 1. Introduction During the last two or three decades, multivariate statistical analysis

Statistical Discovery

SCSUG 2014 JMP Visual Statistics Charles Edwin Shipp, Consider Consulting Corp, Los Angeles, CA ABSTRACT For beginners, we review the continuing merging of statistics and graphics. Statistical graphics

Cluster this! June 2011

Cluster this! June 2011 Agenda On the agenda today: SAS Enterprise Miner (some of the pros and cons of using) How multivariate statistics can be applied to a business problem using clustering Some cool

Doing Quantitative Research 26E02900, 6 ECTS Lecture 2: Measurement Scales. Olli-Pekka Kauppila Rilana Riikkinen

Doing Quantitative Research 26E02900, 6 ECTS Lecture 2: Measurement Scales Olli-Pekka Kauppila Rilana Riikkinen Learning Objectives 1. Develop the ability to assess a quality of measurement instruments

METHODS FAIR HEALTH FH NPIC DATA

Understanding Patterns in the Utilization and Cost of Elbow Reconstruction Surgeries: A Healthcare Procedure that is Common among Baseball Pitchers Eric Okurowski, MBA; Jeff Dang, PhD, FAIR Health Inc.,

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

SAS Add-In 2.1 for Microsoft Office: Getting Started with Data Analysis

SAS Add-In 2.1 for Microsoft Office: Getting Started with Data Analysis The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2007. SAS Add-In 2.1 for Microsoft Office: Getting

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1. 15.7 Analytics and Data Mining 1

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1 15.7 Analytics and Data Mining 15.7 Analytics and Data Mining 1 Section 1.5 noted that advances in computing processing during the past 40 years have

An Exploratory Data Analysis Approach to Qualitative Response Modelling Using SAS/IML(R) and SAS/GRAPH(R) Software

An Exploratory Data Analysis Approach to Qualitative Response Modelling Using SAS/IML(R) and SAS/GRAPH(R) Software Merwyn L. Elliott Ross Hightower Caleb Chan Statistical Services Laboratory Georgia State

Factor Analysis. Chapter 420. Introduction

Chapter 420 Introduction (FA) is an exploratory technique applied to a set of observed variables that seeks to find underlying factors (subsets of variables) from which the observed variables were generated.

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

Performance Test Suite Results for SAS 9.1 Foundation on the IBM zseries Mainframe

Performance Test Suite Results for SAS 9.1 Foundation on the IBM zseries Mainframe A SAS White Paper Table of Contents The SAS and IBM Relationship... 1 Introduction...1 Customer Jobs Test Suite... 1

Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

Implementation in Enterprise Miner: Decision Tree with Binary Response

Implementation in Enterprise Miner: Decision Tree with Binary Response Outline 8.1 Example 8.2 The Options in Tree Node 8.3 Tree Results 8.4 Example Continued Appendix A: Tree and Missing Values - 1 -

A fast, powerful data mining workbench designed for small to midsize organizations

FACT SHEET SAS Desktop Data Mining for Midsize Business A fast, powerful data mining workbench designed for small to midsize organizations What does SAS Desktop Data Mining for Midsize Business do? Business

USING DATA SCIENCE TO DISCOVE INSIGHT OF MEDICAL PROVIDERS CHARGE FOR COMMON SERVICES

USING DATA SCIENCE TO DISCOVE INSIGHT OF MEDICAL PROVIDERS CHARGE FOR COMMON SERVICES Irron Williams Northwestern University IrronWilliams2015@u.northwestern.edu Abstract--Data science is evolving. In

Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

Data Exploration Data Visualization

Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select

Essential Project Management Reports in Clinical Development Nalin Tikoo, BioMarin Pharmaceutical Inc., Novato, CA

Essential Project Management Reports in Clinical Development Nalin Tikoo, BioMarin Pharmaceutical Inc., Novato, CA ABSTRACT Throughout the course of a clinical trial the Statistical Programming group is

EXST SAS Lab Lab #4: Data input and dataset modifications

EXST SAS Lab Lab #4: Data input and dataset modifications Objectives 1. Import an EXCEL dataset. 2. Infile an external dataset (CSV file) 3. Concatenate two datasets into one 4. The PLOT statement will

ln(p/(1-p)) = α +β*age35plus, where p is the probability or odds of drinking

Dummy Coding for Dummies Kathryn Martin, Maternal, Child and Adolescent Health Program, California Department of Public Health ABSTRACT There are a number of ways to incorporate categorical variables into

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501

CLUSTER ANALYSIS Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 30602-2501 January 2006 Introduction Cluster analysis includes a broad suite of techniques designed to find groups

Effective Use of SQL in SAS Programming

INTRODUCTION Effective Use of SQL in SAS Programming Yi Zhao Merck & Co. Inc., Upper Gwynedd, Pennsylvania Structured Query Language (SQL) is a data manipulation tool of which many SAS programmers are

A Basic Guide to Modeling Techniques for All Direct Marketing Challenges

A Basic Guide to Modeling Techniques for All Direct Marketing Challenges Allison Cornia Database Marketing Manager Microsoft Corporation C. Olivia Rud Executive Vice President Data Square, LLC Overview

The Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA

Paper 156-2010 The Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA Abstract JMP has a rich set of visual displays that can help you see the information

Paper D10 2009. Ranking Predictors in Logistic Regression. Doug Thompson, Assurant Health, Milwaukee, WI

Paper D10 2009 Ranking Predictors in Logistic Regression Doug Thompson, Assurant Health, Milwaukee, WI ABSTRACT There is little consensus on how best to rank predictors in logistic regression. This paper

ABSTRACT INTRODUCTION %CODE MACRO DEFINITION

Generating Web Application Code for Existing HTML Forms Don Boudreaux, PhD, SAS Institute Inc., Austin, TX Keith Cranford, Office of the Attorney General, Austin, TX ABSTRACT SAS Web Applications typically

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge

Data mining and statistical models in marketing campaigns of BT Retail

Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120

Simple Predictive Analytics Curtis Seare

Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

Chapter 25 Specifying Forecasting Models

Chapter 25 Specifying Forecasting Models Chapter Table of Contents SERIES DIAGNOSTICS...1281 MODELS TO FIT WINDOW...1283 AUTOMATIC MODEL SELECTION...1285 SMOOTHING MODEL SPECIFICATION WINDOW...1287 ARIMA

PRINCIPAL COMPONENT ANALYSIS

1 Chapter 1 PRINCIPAL COMPONENT ANALYSIS Introduction: The Basics of Principal Component Analysis........................... 2 A Variable Reduction Procedure.......................................... 2

User-friendly SAS Macro Application for performing all possible mixed model selection - An update

User-friendly SAS Macro Application for performing all possible mixed model selection - An update ABSTRACT George Fernandez, University of Nevada - Reno, Reno NV 89557 A user-friendly SAS macro application

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

Identification of noisy variables for nonmetric and symbolic data in cluster analysis

Identification of noisy variables for nonmetric and symbolic data in cluster analysis Marek Walesiak and Andrzej Dudek Wroclaw University of Economics, Department of Econometrics and Computer Science,

An Overview and Evaluation of Decision Tree Methodology

An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com