Statistics & Analysis


NESUG

How to Increase Sales of Orthopedic Equipment in United States: Factor and Cluster Analysis using SAS and R

George Obsekov, American College of Radiology Research Center, Philadelphia, PA

INTRODUCTION

This paper analyzes the sales of orthopedic equipment to United States hospitals. The purpose of the analysis was to find a way to increase the company's sales to hospitals and to identify the hospitals where sales gains could be maximized. To construct such a list, I created a subset of hospitals based on geographical location (Southern USA). For each hospital I analyzed descriptive variables such as the number of beds, the number of outpatient visits, the number of certain types of operations, and administrative cost. My response variable was the sales of rehabilitation equipment.

After examining scatter plots of each explanatory variable against the response variable, I applied either a log or a square root transformation so that each plot showed an approximately linear trend. The next step was to use factor analysis to split the variables into main groups that describe different aspects of the hospitals. Based on the rotated factor pattern, the first factor captured the number of operations, the second captured size, and the third captured rehab.

After defining the factors, I applied cluster analysis to group hospitals with similar characteristics. Reviewing the Ward's minimum variance table, I investigated the gaps between SPRSQ values and used the largest gap as the cutoff for the number of clusters. I then chose a cluster with high average sales that contained a few hospitals with very low or no sales. A subsequent regression analysis helped me identify the hospitals with the highest potential sales gains.
Finally, I applied robust clustering (PAM) and classification and regression trees (rpart) using R. The clusters selected in my cluster analysis were well supported by the PAM method.

MARKET SEGMENT SELECTION

To increase the sales of orthopedic materials to US-based hospitals, I created a subset of hospitals out of the total number of hospitals given. Hospitals from the following states were chosen based on geographical selection (Southern states): California, Texas, Louisiana, Alabama, Georgia, Florida, South Carolina, and North Carolina. This gave me the final market segment for analysis, with variables describing the main characteristics of each hospital in the subset. All major variables are presented in the table below.

Variables considered in dataset

Variable              Description of variable in data set                     Comments
Response variable: Y  Sales of Rehabilitation Equipment Jan - July            Zero means missing.
                      Sales of Rehabilitation Equipment for previous months
ZIP                   US Postal Code
HID                   Hospital ID
CITY                  City Name
STATE                 State Name
BEDS                  Number of Hospital Beds
RBEDS                 Number of Rehab Beds
OUT-V                 Number of Outpatient Visits
ADM*                  Administrative Cost                                     In $ s per year.
SIR                   Revenue from Inpatient
HIP9                  Number of HIP Operations for 99
KNEE9                 Number of KNEE Operations for 99
TH (binary)*          Teaching hospital                                       = teaching, = non-teaching.
TRAUMA (binary)*      Do They Have a Trauma Unit?                             =Yes, =No.
REHAB (binary)*       Do They Have a Rehab Unit?                              =Yes, =No.
HIP9                  Number of HIP Operations for 99
KNEE9                 Number of KNEE Operations for 99
FEMUR9                Number of FEMUR Operations for 99

Table : Demographic and Operational Variables used in predicting sales

TRANSFORMATION

For the selected subset of hospitals, I reviewed scatter plots of each explanatory variable against my response variable. On close inspection it was clear that my variables required transformation in order to appear close to linear. Figures A and B show that the variable BEDS is better served by a square root transformation than by a log transformation.

Figure A: Response variable vs. BEDS, Transformation Selection in SQRT

Figure B: Response variable vs. BEDS, Transformation Selection in LOG

The number of rehab beds (RBEDS) and the operational variables (HIP9, KNEE9, HIP9, KNEE9 and FEMUR9) were transformed to log (+.xi), while the OUT-V, ADM and SIR variables appeared closer to linear when log (+.xi) was applied. The binary variables (TH, TRAUMA and REHAB) did not require any transformation. Finally, my response variable was also transformed from y to log (+y), where y was a combination of all sales of rehabilitation equipment. The final scatter plots after transformation appear in Figures A and B.

Figure A: Response variable vs. BEDS, RBEDS, HIP9, KNEE9, HIP9, and KNEE9 after transformation
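The transformations described above can be sketched as follows. This is a minimal illustration in Python rather than the SAS used for the analysis; the function names, the sample values, and the default `scale` and `offset` constants are assumptions made for the example, not the constants used in the paper.

```python
import math


def sqrt_transform(x):
    """Square-root transform, the form chosen for BEDS."""
    return math.sqrt(x)


def log_transform(x, scale=1.0, offset=1.0):
    """Shifted log transform log(offset + scale * x).

    scale and offset are illustrative assumptions; a positive offset
    keeps the transform defined when x is zero.
    """
    return math.log(offset + scale * x)


# Hypothetical values, not taken from the hospital data set.
beds = [25, 100, 400]
sales = [0, 1500, 20000]

beds_t = [sqrt_transform(b) for b in beds]
sales_t = [log_transform(y) for y in sales]

print(beds_t)
print(sales_t)
```

Both transforms compress large values, which is why heavily right-skewed counts and dollar amounts tend to look more linear against a log-scaled response afterwards.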

Figure B: Response variable vs. FEMUR9, OUTV, ADM, and SIR after transformation

DIMENSION REDUCTION

Dimension reduction was performed using factor analysis to summarize the operational and demographic variables in the selected subset. Using the FACTOR procedure, three factors were constructed for later analysis: an operational factor (HIP9, KNEE9, HIP9, KNEE9, and FEMUR9), a size factor (BEDS, OUTV, ADM, SIR, TH, and TRAUMA) and a rehab factor (RBEDS and REHAB).

After an initial factor analysis of all variables in one stage, I decided to run the factor analysis in two stages to obtain factors that are easier to interpret. The two-stage approach required splitting the variables into two subgroups: one with the operational variables only, and another with the size and rehab variables. As Table A shows for the operational variables, the eigenvalue for Factor has a proportion of 9.%, while according to Table C the eigenvalues for the other factors have a proportion of more than %. The factor pattern for stage two divided the variables into groups: a SIZE group (BEDS, OUTV, ADM, SIR, TH and TRAUMA) and a REHAB group (RBEDS and REHAB).
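To make the Difference, Proportion and Cumulative columns of an eigenvalue table concrete, here is a small Python sketch. The eigenvalues are hypothetical and the function name is an assumption; the only fact relied on is that the eigenvalues of a correlation matrix sum to the number of variables.

```python
def eigenvalue_summary(eigenvalues):
    """Return (difference, proportion, cumulative) lists for eigenvalues
    given in decreasing order, as in a factor-analysis eigenvalue table."""
    total = sum(eigenvalues)
    # Difference between each eigenvalue and the next one (none for the last).
    difference = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]
    proportion = [ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in proportion:
        running += p
        cumulative.append(running)
    return difference, proportion, cumulative


# Hypothetical eigenvalues for five standardized variables (they sum to 5).
eigs = [4.0, 0.5, 0.3, 0.15, 0.05]
diff, prop, cum = eigenvalue_summary(eigs)

for ev, p, c in zip(eigs, prop, cum):
    print(f"{ev:6.2f} {p:8.3f} {c:8.3f}")
```

A single dominant proportion, as in this made-up example, is the pattern that would justify retaining one factor for a group of variables.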

Stage One: NFACT=

Eigenvalue   Difference   Proportion   Cumulative

factor will be retained by the NFACTOR criterion.

Table A: Eigenvalues of the Correlation Matrix in Stage One

Variable   Description                         Factor
HIP9       NUMBER OF HIP OPERATIONS FOR 99     .9
KNEE9      NUMBER OF KNEE OPERATIONS FOR 99    .9
HIP9       NUMBER HIP OPERATIONS FOR 99        .9
KNEE9      NUMBER KNEE OPERATIONS FOR 99       .9
FEMUR9     NUMBER FEMUR OPERATIONS FOR 99      .9

Table B: Factor Pattern in Stage One including Number of Operations

Stage Two: NFACT=

Eigenvalue   Difference   Proportion   Cumulative

factors will be retained by the NFACTOR criterion.

Table C: Eigenvalues of the Correlation Matrix in Stage Two

Variable   Description                     Factor   Factor
BEDS       NUMBER OF HOSPITAL BEDS         .9       .
RBEDS      NUMBER OF REHAB BEDS            -.       .9
OUTV       NUMBER OF OUTPATIENT VISITS
ADM        ADMINISTRATIVE COST             .9       -.
SIR        REVENUE FROM INPATIENT          .9       -.
TH         TEACHING HOSPITAL?              .        .
TRAUMA     DO THEY HAVE A TRAUMA UNIT?     .        .
REHAB      DO THEY HAVE A REHAB UNIT?      .        .9

Table D: Rotated Factor Pattern for Two Factors (SIZE and REHAB)

The following figure presents the eigenvalue distribution for stage one, using one factor, and for stage two, using two factors.

Figure : Eigenvalues for the operational factor (left, stage one) and the size/rehab factors (right, stage two)

The final distribution of the analyzed variables across the factors is shown in the table below.

Variable Description               Variable Name   Stage One: Factor   Stage Two: Factor, Factor
Number of hospital beds            BEDS
Number of rehab beds               RBEDS
Number of outpatient visits        OUT-V
Administrative Cost                ADM
Revenue from inpatient             SIR
Number of HIP operation for 99     HIP9
Number of KNEE operation for 99    KNEE9
Teaching hospital                  TH
Do they have a trauma unit?        TRAUMA
Do they have a rehab unit?         REHAB
Number of HIP operation for 99     HIP9
Number of KNEE operation for 99    KNEE9
Number of FEMUR operation for 99   FEMUR9

Table : Final Distribution of Variables in Factors

CLUSTER ANALYSIS

I used cluster analysis to determine the best cluster to concentrate on for improving our sales. The table below shows the cluster history from Ward's method. I located the biggest jump in SPRSQ between consecutive cluster counts and chose the number of clusters for my further analysis at that point.

NCL   Clusters Joined   SPRSQ   Difference
      CL     CL9        .9
9     CL     CL         .
      CL     CL         .
      CL     CL         .
      CL     CL         .
      CL9    OB         .
      CL     CL         .
      CL     CL         .
      CL     CL         .9
      CL     CL         .99
      CL     CL         .
9     CL     CL         .9
      CL9    CL         .       %
      CL     CL         .
      CL     CL         .
      CL9    CL         .

Table : Cluster selection based on cluster history using WARD variance table

Next, I created a box plot of sales against the clusters. Based on this graph, the chosen cluster had the highest mean sales and also contained some hospitals that had no sales at all.

Figure : Box plot of sales per CLUSTER
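The cutoff rule used with the Ward cluster history above can be sketched in Python as follows. The history values are hypothetical, the function name is an assumption, and measuring the jump as a relative increase in SPRSQ between consecutive cluster counts is one possible convention rather than necessarily the exact difference reported in the table.

```python
def largest_sprsq_jump(history):
    """Find the largest relative jump in SPRSQ between consecutive merges.

    history: list of (ncl, sprsq) pairs, where ncl is the number of
    clusters remaining after the merge. Returns (ncl_before, ncl_after,
    relative_jump) for the most expensive step.
    """
    ordered = sorted(history, key=lambda row: row[0], reverse=True)
    best = None
    for (ncl_hi, s_hi), (ncl_lo, s_lo) in zip(ordered, ordered[1:]):
        jump = (s_lo - s_hi) / s_hi
        if best is None or jump > best[2]:
            best = (ncl_hi, ncl_lo, jump)
    return best


# Hypothetical cluster history, not the values from the paper.
history = [(6, 0.010), (5, 0.012), (4, 0.015), (3, 0.045), (2, 0.060), (1, 0.100)]
ncl_before, ncl_after, jump = largest_sprsq_jump(history)
print(f"largest jump going from {ncl_before} to {ncl_after} clusters: {jump:.0%}")
```

In this made-up history the merge from four to three clusters is by far the most costly, so one would stop at four clusters.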

Examination of the table of mean sales below confirmed that the chosen cluster has the highest mean sales. Given the number of hospitals in this cluster, it cannot be assumed that they are homogeneous. Since our sample size is large, I applied a regression estimate for further analysis.

CLUSTER   FREQ   msales   mf   mf   mf

Table : Sales and factors per cluster

REGRESSION ANALYSIS

In my regression analysis I used stepwise backward elimination to determine which factors are significant and must be retained in the model. The elimination removed two factors and left the operational factor as the significant one for our model. Next, I identified the hospitals that have no sales. I then analyzed the gain for each hospital and found that, in order to increase the sales of orthopedic equipment, we should concentrate on six hospitals for a potential gain of $9,9. Here is the list of these hospitals with their hospital IDs: HID (Galveston, TX), HID (Thomasville, GA), HID 9 (Los Angeles, CA), HID 9 (Valdosta, GA), HID 9 (Fort Myers, FL), and HID 999 (Downey, CA).

PAM ANALYSIS IN R

I used R to apply robust clustering (PAM) in order to identify the best cluster. As the following figure shows, k= generates the highest average silhouette width (.9).

Figure : Average Silhouette Width Comparison based on Clusters

The table below shows that the robust clustering method used in R agrees well with the market segments selected using SAS.

Table : Cluster Selection Table using SAS and R software (SAS Cluster Selection against PAM Cluster Selection, clpam)
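The average silhouette width used above to compare values of k can be sketched in Python as follows. This is not the PAM algorithm itself, only the criterion used to judge a given cluster assignment; the points, labels and function name are illustrative assumptions, and the sketch expects at least two clusters.

```python
import math


def average_silhouette_width(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points, where a is the
    mean distance to the other members of the point's own cluster and b is
    the smallest mean distance to the members of any other cluster.
    Requires at least two distinct labels."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            widths.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical two-dimensional factor scores forming two tight groups.
scores = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = average_silhouette_width(scores, [0, 0, 1, 1])
bad = average_silhouette_width(scores, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

Values near 1 indicate compact, well-separated clusters, so the assignment (or the value of k) with the highest average width is preferred.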

RPART ANALYSIS IN R

Using RPART, I found that the segment with the highest predicted potential gain included hospitals with missing sales among its observations, as shown in the regression tree figure below.

Figure : Final Regression Tree with following number of observations: (n=, n=, n=, n=9, n=)

I identified the hospitals in this segment to determine whether any of them had previously been chosen for increasing our sales gain. As the table below shows, the hospital with HID 9 (Fort Myers, FL) had already been selected by the regression analysis.

Observation  CITY  STATE  HID  CLUSTER
9  Hemet  CA  99  NA
   Cape Coral  FL  9  NA
   Hawaiian  CA  99  NA
   Oakland  CA  9  NA
   Greensboro  NC  NA
9  Melbourne  FL  9  NA
   Sacramento  CA  99  NA
   Fort Myers  FL  9  NA

Table : Cluster Selection Table using RPART Method
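For readers unfamiliar with regression trees, the core step is choosing the split that most reduces the within-node sum of squared errors. The Python sketch below shows that step for a single predictor; it is a simplified illustration, not the rpart implementation, and the data and function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the split of x that minimizes
    the summed SSE of y in the left and right child nodes."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical factor scores and log-sales values.
factor_score = [-1.0, -0.5, 0.0, 1.0, 1.5, 2.0]
log_sales = [2.0, 2.0, 2.0, 8.0, 8.0, 8.0]
print(best_split(factor_score, log_sales))
```

A full tree repeats this search over every predictor and then recursively within each child node, with a complexity parameter controlling how far the splitting continues.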

RANDOM FOREST METHOD

Using the random forest method, I found that for the Fort Myers hospital (HID 9) the more accurate value for the analysis is .9, and exp (.9) corresponds to about $, in sales gain.

CONCLUSIONS

In this project I analyzed a subset of hospitals, selected by geography, and used cluster analysis to place them into market segments of closely similar hospitals. By finding the best cluster, the one with the highest mean sales that also contained hospitals with no sales, I was able to estimate a potential sales gain above $, for the following hospitals: HID (Galveston, TX), HID (Thomasville, GA), HID 9 (Los Angeles, CA), HID 9 (Valdosta, GA), HID 9 (Fort Myers, FL), and HID 999 (Downey, CA). The results from the PAM, RPART and random forest methods provide strong evidence that these hospitals are good candidates for improving sales of orthopedic equipment as a short-term solution.

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Thanks also to Stan Legum, co-chair of section, whose feedback has proved invaluable to the writing of this paper.

CONTACT INFORMATION

Your comments and questions are valued and encouraged.
Contact the author at:
George Obsekov
American College of Radiology Research Center
Market Street
Philadelphia, PA
E-mail: gobsekov@acr.org

APPENDIX

The following code was used for the full analysis of the selected dataset:

    DATA sasuser.hospital;
      INFILE 'hospital.txt' DELIMITER=',';
      INPUT ZIP $ HID $ CITY $ STATE $ BEDS RBEDS OUTV ADM SIR Y
            HIP9 KNEE9 TH TRAUMA REHAB HIP9 KNEE9 FEMUR9;

    DATA hospital;
      SET sasuser.hospital;
      label ZIP = US POSTAL CODE
            HID = HOSPITAL ID
            CITY = CITY NAME
            STATE = STATE NAME
            BEDS = NUMBER OF HOSPITAL BEDS
            RBEDS = NUMBER OF REHAB BEDS
            OUTV = NUMBER OF OUTPATIENT VISITS
            ADM = ADMINISTRATIVE COST
            SIR = REVENUE FROM INPATIENT
            Y = OF REHABILITATION EQUIPMENT SINCE "JAN -JULY "
            = OF REHAB EQUIP FOR THE PREVIOUS "" MO
            HIP9 = NUMBER OF HIP OPERATIONS FOR "99"
            KNEE9 = NUMBER OF KNEE OPERATIONS FOR "99"
            TH = TEACHING HOSPITAL?
            TRAUMA = DO THEY HAVE A TRAUMA UNIT?
            REHAB = DO THEY HAVE A REHAB UNIT?
            HIP9 = NUMBER HIP OPERATIONS FOR "99"
            KNEE9 = NUMBER KNEE OPERATIONS FOR "99"
            FEMUR9 = NUMBER FEMUR OPERATIONS FOR "99";

      /* new response variable - */
      = log(+ +Y);
      IF = THEN =.;

      /* select the subset of hospitals located in the southern states */
      /* state codes are upper case in the data, so compare in upper case */
      IF STATE EQ 'CA' OR STATE EQ 'FL' OR STATE EQ 'TX' OR STATE EQ 'SC'
         OR STATE EQ 'LA' OR STATE EQ 'GA' OR STATE EQ 'NC' OR STATE EQ 'AL';

      ARRAY X {} BEDS RBEDS HIP9 KNEE9 HIP9 KNEE9 FEMUR9 OUTV ADM SIR;

      /* STEP TRANSFORMATIONS */
      DO I= TO ; X{I} = SQRT(X{I}); END;
      DO I= TO ; X{I} = LOG(+.*X{I}); END;
      DO I= TO ; X{I} = LOG(+.*X{I}); END;

    /* factor analysis in two stages, grouping the variables in subgroups */
    PROC FACTOR data=hospital METHOD=PRIN NFACT= out=z;
      VAR HIP9 KNEE9 HIP9 KNEE9 FEMUR9;

    PROC FACTOR data=hospital METHOD=PRIN NFACT= ROTATE=VARIMAX out=z;
      VAR BEDS RBEDS OUTV ADM SIR TH trauma rehab;

    DATA z;
      set z;
      factor = factor;
      keep factor factor;

    DATA hospout;
      merge z z;

    /* cluster analysis using WARD */
    PROC CLUSTER data=hospout METHOD=WARD;
      VAR factor-factor;
      COPY ZIP CITY STATE HID BEDS RBEDS OUTV ADM SIR HIP9 KNEE9 TH TRAUMA
           REHAB HIP9 KNEE9 FEMUR9 factor-factor;

    PROC TREE NOPRINT NCL= OUT=TXCLUST;
      COPY ZIP CITY STATE HID BEDS RBEDS OUTV ADM SIR HIP9 KNEE9 TH TRAUMA
           REHAB HIP9 KNEE9 FEMUR9 factor-factor;

    /* produce the cluster summary and pick the best cluster */
    PROC sort data= TXCLUST;
      by cluster;

    PROC means noprint;
      BY cluster;
      VAR factor-factor;
      OUTPUT out=c mean= msales mf-mf;

    PROC boxplot data= TXCLUST;
      plot *cluster;

    SELECT TXCLUST;

    DATA cl;
      set TXCLUST;
      if cluster=;

    PROC REG DATA=cl;
      MODEL sales = Factor-factor/ P R selection=b;
      OUTPUT OUT=C P=PRED R=RESID STDP=STDP;

    /* finally undo the clusters and calculate the potential gain */
    DATA C;
      SET C;
      rowp = exp(pred+.*stdp*stdp)-;
      epred = exp(pred)-;
      sales = exp(sales) -;
      gain = rowp - sales;

    PROC sort;
      by gain;

    PROC print;

    /* code for the special case when the cluster size is very small */
    DATA cl;
      SET TXCLUST;
      IF cluster=;
      sales = exp(sales) -;

    PROC print;

    PROC means data=cl;
      VAR sales;

    endsas;

    /* suppose the mean of sales is. */
    DATA cl;
      set cl;
      gain = . - sales;

    PROC sort;
      by gain;

    PROC print;
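The potential-gain step at the end of the SAS code above can be sketched in Python as follows. The constants in the listing are not legible, so `variance_weight=0.5` (the usual correction for the mean of a lognormal variable) and `offset=1.0` (undoing a log(1 + y) response) are assumptions, as are the function name and the example numbers.

```python
import math


def potential_gain(pred, stdp, log_sales, variance_weight=0.5, offset=1.0):
    """Back-transform a log-scale prediction and compare it with actual sales.

    pred:       predicted value on the log scale
    stdp:       standard error of the prediction on the log scale
    log_sales:  observed response on the log scale
    variance_weight and offset are assumed constants (see the text above).
    """
    expected = math.exp(pred + variance_weight * stdp * stdp) - offset
    actual = math.exp(log_sales) - offset
    return expected - actual


# Hypothetical hospital: predicted log-sales of log(1001), no recorded sales.
gain = potential_gain(pred=math.log(1001.0), stdp=0.0, log_sales=0.0)
print(round(gain, 2))

# A larger prediction error raises the back-transformed expectation.
gain_uncertain = potential_gain(pred=math.log(1001.0), stdp=0.5, log_sales=0.0)
print(round(gain_uncertain, 2))
```

Sorting hospitals by this gain, as the SAS code does with PROC SORT, puts the hospitals with the largest shortfall relative to their predicted sales at one end of the list.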

    ##### FACTOR ANALYSIS USING R-SOFTWARE
    hh = read.xport("hosp.xpt")
    dim(hh)
    hh[,]

    library(cluster)
    plot(silhouette(pam(hh[,:], k=)), main = paste("k = ",), do.n.k=FALSE)
    plot(silhouette(pam(hh[,:], k=)), main = paste("k = ",), do.n.k=FALSE)
    plot(silhouette(pam(hh[,:], k=9)), main = paste("k = ",), do.n.k=FALSE)
    plot(silhouette(pam(hh[,:], k=)), main = paste("k = ",), do.n.k=FALSE)

    # cross-tabulate the PAM clusters against the SAS cluster selection
    clpam = pam(hh[,:], k=)$cluster
    table(clpam)
    table(hh[,])
    table(clpam,hh[,])

    library(rpart)
    rpart( ~FACTOR+FACTOR+FACTOR, data=hh)
    predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh))
    table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh)))
    length(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh)))
    length(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh),newdata=hh))
    table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh),newdata=hh))
    plot(rpart( ~FACTOR+FACTOR+FACTOR, data=hh))
    text(rpart( ~FACTOR+FACTOR+FACTOR, data=hh))

    ##### RPART ANALYSIS USING R-SOFTWARE
    library(rpart)
    hh[,]
    hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh)
    plot(hh.rp)
    text(hh.rp)

    # refit with different complexity parameters (cp)
    hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh,control=rpart.control(cp=.))
    hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh,control=rpart.control(cp=.))
    plot(hh.rp)
    text(hh.rp)
    plot(hh.rp, uniform=TRUE)
    text(hh.rp,use.n=TRUE,cex=.)
    hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh,control=rpart.control(cp=.))
    plot(hh.rp, uniform=TRUE)
    text(hh.rp,use.n=TRUE,cex=.)
    hh.rp

    # use the predicted leaf values to define segments
    table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh)))
    table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh),newdata=hh))
    predv = (predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh),newdata=hh))
    factor(predv)[:]
    as.numeric(factor(predv))[:]
    table(as.numeric(factor(predv)))
    cluster = (as.numeric(factor(predv)))
    hh[ cluster==,]
    factor(predv)[:]
    exp(.) - factor(predv)[:]