Statistics & Analysis


NESUG

How to Increase Sales of Orthopedic Equipment in United States: Factor and Cluster Analysis using SAS and R

George Obsekov, American College of Radiology Research Center, Philadelphia, PA

INTRODUCTION

This paper analyzes the sales of orthopedic equipment to United States hospitals. The purpose of the analysis was to find a way to increase the company's sales to hospitals and to identify the hospitals where sales gains could be maximized. To construct such a list, I created a subset of hospitals based on geographical location (Southern USA). For each hospital I analyzed descriptive variables such as the number of beds, the number of outpatient visits, the number of certain types of operations, and administrative cost. My response variable was the sales of rehabilitation equipment.

After examining scatter plots of each explanatory variable against the response variable, I applied either a log or a square root transformation so that each scatter plot displayed a trend close to linear. The next step was factor analysis, used to split the variables into main groups that describe different aspects of the hospitals. Based on the rotated factor pattern, the first factor covered the number of operations, the second covered size, and the third covered rehab.

After defining the factors I applied cluster analysis to group hospitals with similar characteristics. Reviewing the cluster history from Ward's minimum variance method, I examined the gaps between successive SPRSQ values to choose the cutoff for the number of clusters. I then chose a cluster with high average sales that also contained a few hospitals with very low or no sales. A subsequent regression analysis helped me determine the list of hospitals with the highest potential sales gains.
Finally, I applied robust clustering (PAM) and classification and regression trees (rpart) using R software. The clusters selected in my cluster analysis were well supported by the PAM method.

MARKET SEGMENT SELECTION

To increase the sales of orthopedic materials to US-based hospitals, I created a subset of the hospitals in the full data set. Hospitals from the following states were chosen based on geographical selection (Southern states): California, Texas, Louisiana, Alabama, Georgia, Florida, South Carolina, and North Carolina. This gave me the final market segment for analysis, with variables describing the main characteristics of each hospital in the subset. All major variables are presented in the table of variables.
Variables considered in dataset   Description of variable in data set                      Comments
Response variable: Y              Sales of Rehabilitation Equipment, Jan - July
                                  Sales of Rehabilitation Equipment for previous months    Zero means missing.
ZIP                               US Postal Code
HID                               Hospital ID
CITY                              City Name
STATE                             State Name
BEDS                              Number of Hospital Beds
RBEDS                             Number of Rehab Beds
OUTV                              Number of Outpatient Visits
ADM*                              Administrative Cost                                      Dollar amount per year.
SIR                               Revenue from Inpatient
HIP9                              Number of HIP Operations for 99
KNEE9                             Number of KNEE Operations for 99
TH (binary)*                      Teaching hospital                                        Teaching or nonteaching.
TRAUMA (binary)*                  Do They Have a Trauma Unit?                              Yes or No.
REHAB (binary)*                   Do They Have a Rehab Unit?                               Yes or No.
HIP9                              Number of HIP Operations for 99
KNEE9                             Number of KNEE Operations for 99
FEMUR9                            Number of FEMUR Operations for 99

Table: Demographic and operational variables used to predict sales

TRANSFORMATION

Analyzing the selected subset of hospitals, I reviewed scatter plots of each explanatory variable against my response variable. After close analysis it seemed clear that all of my variables required transformation in order to appear close to linear. Figures A and B show that the variable BEDS is better served by a square root transformation than by a log transformation.
Figure A: Response variable vs. BEDS, transformation selection in SQRT

Figure B: Response variable vs. BEDS, transformation selection in LOG

The number of rehab beds (RBEDS) and the operational variables (HIP9, KNEE9, HIP9, KNEE9 and FEMUR9) were log-transformed after scaling and adding a constant, while the OUTV, ADM and SIR variables appeared closer to linear under a log transformation with a different scaling. The binary variables (TH, TRAUMA and REHAB) did not require any transformation. Finally, my response variable was also log-transformed after adding a constant, where y was the combination of all sales of rehabilitation equipment. The final scatter plots after transformation appear in Figures A and B.

Figure A: Response variable vs. BEDS, RBEDS, HIP9, KNEE9, HIP9, and KNEE9 after transformation
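The choice between a square root and a log transformation described above can be illustrated with a small sketch. It is not the author's procedure (the paper judged the scatter plots by eye); it simply compares how linear each transformed predictor is against the response using the Pearson correlation. The data, the function names and the `scale` constant in log(1 + scale * x) are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def compare_transforms(x, y, scale=1.0):
    """Return {name: |correlation with y|} for each candidate transform.

    `scale` is a hypothetical constant used in log(1 + scale * x).
    """
    candidates = {
        "identity": list(x),
        "sqrt": [math.sqrt(v) for v in x],
        "log": [math.log(1.0 + scale * v) for v in x],
    }
    return {name: abs(pearson(vals, y)) for name, vals in candidates.items()}

# Made-up data in which the response grows like the square root of x.
beds = [25, 100, 225, 400, 625, 900]
response = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0]

scores = compare_transforms(beds, response)
best = max(scores, key=scores.get)
print(best, scores)
```

A correlation close to 1 only says the trend is close to linear; it does not replace looking at the scatter plot for outliers or curvature.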
Figure B: Response variable vs. FEMUR9, OUTV, ADM, and SIR after transformation

DIMENSION REDUCTION

Dimension reduction was carried out with factor analysis to summarize the operational and demographic variables in the selected subset. Using the FACTOR procedure, three factors were constructed for the subsequent analysis: an operational factor (HIP9, KNEE9, HIP9, KNEE9, and FEMUR9), a size factor (BEDS, OUTV, ADM, SIR, TH, and TRAUMA) and a rehab factor (RBEDS and REHAB).

After an initial factor analysis of all variables in one stage, I decided to run the factor analysis in two stages to obtain a better interpretation of the factors. This meant breaking the variables into two subgroups: one with the operational variables only, and another with the size and rehab variables. Table A gives the proportion of variance explained by the single factor retained for the operational variables in stage one, and Table C gives the proportions for the factors in stage two. The factor pattern for stage two divided the remaining variables into two groups: a SIZE group (BEDS, OUTV, ADM, SIR, TH and TRAUMA) and a REHAB group (RBEDS and REHAB).
Stage One: one factor is retained by the NFACTOR criterion.

Table A: Eigenvalues of the Correlation Matrix in Stage One (columns: Eigenvalue, Difference, Proportion, Cumulative)

Variable   Description
HIP9       NUMBER OF HIP OPERATIONS FOR 99
KNEE9      NUMBER OF KNEE OPERATIONS FOR 99
HIP9       NUMBER HIP OPERATIONS FOR 99
KNEE9      NUMBER KNEE OPERATIONS FOR 99
FEMUR9     NUMBER FEMUR OPERATIONS FOR 99

Table B: Factor Pattern in Stage One including Number of Operations

Stage Two: two factors are retained by the NFACTOR criterion.

Table C: Eigenvalues of the Correlation Matrix in Stage Two (columns: Eigenvalue, Difference, Proportion, Cumulative)

Variable   Description
BEDS       NUMBER OF HOSPITAL BEDS
RBEDS      NUMBER OF REHAB BEDS
OUTV       NUMBER OF OUTPATIENT VISITS
ADM        ADMINISTRATIVE COST
SIR        REVENUE FROM INPATIENT
TH         TEACHING HOSPITAL?
TRAUMA     DO THEY HAVE A TRAUMA UNIT?
REHAB      DO THEY HAVE A REHAB UNIT?

Table D: Rotated Factor Pattern for Two Factors (SIZE and REHAB)

The eigenvalue figure presents the eigenvalue distributions for stage one, using one factor, and stage two, using two factors.
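The Eigenvalue, Difference, Proportion and Cumulative columns of an eigenvalue table like Tables A and C follow directly from the list of eigenvalues. The sketch below shows that arithmetic. The eigenvalues are hypothetical (chosen to sum to 5, as they would for a correlation matrix of five variables) and are not the paper's values.

```python
def variance_proportions(eigenvalues):
    """Build (eigenvalue, difference, proportion, cumulative) rows.

    `difference` is the gap to the next eigenvalue (None for the last row),
    `proportion` is the share of total variance, `cumulative` its running sum.
    """
    total = sum(eigenvalues)
    rows = []
    cumulative = 0.0
    for i, ev in enumerate(eigenvalues):
        diff = ev - eigenvalues[i + 1] if i + 1 < len(eigenvalues) else None
        prop = ev / total
        cumulative += prop
        rows.append((ev, diff, prop, cumulative))
    return rows

# Hypothetical eigenvalues for five standardized variables.
eigs = [4.0, 0.5, 0.25, 0.15, 0.1]
rows = variance_proportions(eigs)

print("Eigenvalue  Difference  Proportion  Cumulative")
for ev, diff, prop, cum in rows:
    diff_txt = "" if diff is None else f"{diff:.3f}"
    print(f"{ev:10.3f}  {diff_txt:>10}  {prop:10.3f}  {cum:10.3f}")
```

When the first proportion dominates, as in this made-up example, retaining a single factor for that group of variables is easy to justify.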
Figure: Eigenvalues for the operational factor (stage one, left) and the size/rehab factors (stage two, right)

The final assignment of the analyzed variables to factors is shown in the following table.

Variable Description               Variable Name   Stage and Factor
Number of hospital beds            BEDS            Stage two, size
Number of rehab beds               RBEDS           Stage two, rehab
Number of outpatient visits        OUTV            Stage two, size
Administrative cost                ADM             Stage two, size
Revenue from inpatient             SIR             Stage two, size
Number of HIP operations for 99    HIP9            Stage one, operational
Number of KNEE operations for 99   KNEE9           Stage one, operational
Teaching hospital                  TH              Stage two, size
Do they have a trauma unit?        TRAUMA          Stage two, size
Do they have a rehab unit?         REHAB           Stage two, rehab
Number of HIP operations for 99    HIP9            Stage one, operational
Number of KNEE operations for 99   KNEE9           Stage one, operational
Number of FEMUR operations for 99  FEMUR9          Stage one, operational

Table: Final Distribution of Variables in Factors

CLUSTER ANALYSIS

I used cluster analysis to determine the best cluster to concentrate on for improving sales. The cluster history from Ward's method shows where the biggest jump in SPRSQ occurs between two successive numbers of clusters, and I chose the number of clusters at that jump for the subsequent analysis.
Table: Cluster selection based on the cluster history from Ward's minimum variance method (columns: NCL, Clusters Joined, SPRSQ, Difference)

Next, I created a box plot of sales against the clusters. Based on this graph, the chosen cluster had the highest mean sales and included some hospitals that did not have any sales at all.

Figure: Box plot of sales per cluster
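The cutoff rule described above, looking for the largest jump in SPRSQ in Ward's cluster history, can be sketched as follows. This is one common reading of the rule and an assumption on my part: each row holds the SPRSQ of the merge that leaves NCL clusters, and the analysis stops just before the merge with the largest relative increase. The history values are hypothetical.

```python
def choose_ncl(history):
    """Pick the number of clusters from a Ward cluster history.

    `history` is a list of (ncl, sprsq) rows, where sprsq is the semipartial
    R-squared of the merge that leaves `ncl` clusters. Returns the number of
    clusters just before the merge with the largest relative jump in SPRSQ,
    together with that jump.
    """
    # Order rows from many clusters to few, the order in which merges happen.
    rows = sorted(history, key=lambda r: r[0], reverse=True)
    best_ncl, best_jump = None, float("-inf")
    for (ncl_before, s_before), (_, s_after) in zip(rows, rows[1:]):
        jump = (s_after - s_before) / s_before  # relative increase in SPRSQ
        if jump > best_jump:
            best_ncl, best_jump = ncl_before, jump
    return best_ncl, best_jump

# Hypothetical cluster history: merging from 4 to 3 clusters is costly.
history = [(6, 0.010), (5, 0.012), (4, 0.015), (3, 0.060), (2, 0.100), (1, 0.300)]
ncl, jump = choose_ncl(history)
print(ncl, round(jump, 2))
```

The sketch assumes every SPRSQ value is positive. In practice the jump is usually read together with the box plot and the cluster means rather than applied mechanically.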
Examining the table of mean sales per cluster confirmed that the chosen cluster has the highest mean sales. The cluster contains too many hospitals to be assumed homogeneous. Since the sample size is large, I used a regression estimate for the subsequent analysis.

Table: Sales and factors per cluster (columns: CLUSTER, FREQ, msales, and the three factor means mf)

REGRESSION ANALYSIS

In my regression analysis I used stepwise backward elimination to determine which factors are significant and must be retained in the model. The elimination removed two factors and kept the operational factor as significant for the model. Next, I considered which hospitals have no sales. Once they were identified, I analyzed the potential gain for each hospital and found that, to increase the sales of orthopedic equipment, we should concentrate on six hospitals: Galveston, TX; Thomasville, GA; Los Angeles, CA; Valdosta, GA; Fort Myers, FL; and Downey, CA.

PAM ANALYSIS IN R

I used R software to apply robust clustering (PAM) in order to identify the best number of clusters. The silhouette comparison figure shows which value of k generates the highest average silhouette width.
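The average silhouette width used above to compare values of k can be computed by hand for any cluster assignment. The sketch below does only that; it does not implement PAM itself (in R that is `pam` and `silhouette` from the cluster package). The points and labels are made up for illustration.

```python
import math

def average_silhouette(points, labels):
    """Average silhouette width of a cluster assignment (needs >= 2 clusters).

    For each point, a is the mean distance to the other members of its own
    cluster and b is the smallest mean distance to any other cluster; the
    silhouette width is (b - a) / max(a, b). Singleton clusters get width 0.
    """
    n = len(points)
    widths = []
    for i in range(n):
        sums, counts = {}, {}
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            sums[labels[j]] = sums.get(labels[j], 0.0) + d
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:  # point is alone in its cluster
            widths.append(0.0)
            continue
        a = sums[own] / counts[own]
        b = min(sums[c] / counts[c] for c in sums if c != own)
        m = max(a, b)
        widths.append((b - a) / m if m > 0 else 0.0)
    return sum(widths) / n

# Two well-separated groups of made-up points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = average_silhouette(points, [0, 0, 0, 1, 1, 1])
bad = average_silhouette(points, [0, 1, 0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

Running this for the assignment produced at each candidate k and keeping the k with the largest average width mirrors the comparison described in the text.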
Figure: Average silhouette width comparison based on the number of clusters

The cross-tabulation of the two cluster assignments shows that the robust clustering method used in R matches the market segments selected with SAS well.

Table: Cluster selection using SAS and R software (rows: SAS cluster selection; columns: clpam, the PAM cluster selection)
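Comparing two cluster assignments, as the SAS versus PAM table does, amounts to a cross-tabulation; in R this is `table(clpam, cluster)`. The sketch below builds the same counts and adds a simple agreement share that is my own illustration, not a statistic reported in the paper. The labels are hypothetical, and cluster numbers need not match between the two methods.

```python
from collections import Counter

def cross_tab(a, b):
    """Count how often each pair of labels (a_i, b_i) occurs."""
    return Counter(zip(a, b))

def agreement(a, b):
    """Share of observations that fall in the dominant cell of each row.

    For every label in `a`, take the largest cell in its row of the
    cross-tabulation; a value of 1.0 means each cluster in `a` maps
    entirely onto a single cluster in `b`.
    """
    best_per_row = {}
    for (row_label, _), count in cross_tab(a, b).items():
        best_per_row[row_label] = max(best_per_row.get(row_label, 0), count)
    return sum(best_per_row.values()) / len(a)

# Hypothetical assignments for eight hospitals.
sas_cluster = [1, 1, 1, 2, 2, 3, 3, 3]
pam_cluster = [2, 2, 2, 1, 1, 3, 3, 1]

table = cross_tab(sas_cluster, pam_cluster)
share = agreement(sas_cluster, pam_cluster)
print(dict(table), share)
```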
RPART ANALYSIS IN R

Using rpart, I found that the segment with the highest potential gain contained observations with missing sales (see the regression tree figure).

Figure: Final regression tree with the number of observations in each node

I identified the hospitals in this segment to check whether any of them had already been chosen for increasing sales. As the table shows, the Fort Myers, FL hospital had already been selected by the regression analysis.

CITY         STATE   CLUSTER
Hemet        CA      NA
Cape Coral   FL      NA
Hawaiian     CA      NA
Oakland      CA      NA
Greensboro   NC      NA
Melbourne    FL      NA
Sacramento   CA      NA
Fort Myers   FL      NA

Table: Cluster selection using the rpart method
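A regression tree such as the one fitted with rpart is built by repeatedly choosing the split that most reduces the within-node sum of squares of the response. The sketch below shows a single such split on one predictor. It is a simplified illustration and not rpart itself, which also handles several predictors, complexity pruning (`cp`) and missing values. The factor scores and log sales are made up.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Find the threshold on x that minimizes the SSE of y in both children.

    Returns (threshold, total_sse, left_mean, right_mean), or None when all
    x values are equal and no split is possible.
    """
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total, sum(left) / len(left), sum(right) / len(right))
    return best

# Hypothetical factor scores and log sales for six hospitals.
factor_score = [-1.5, -1.0, -0.5, 0.5, 1.0, 1.5]
log_sales = [2.0, 2.2, 2.1, 6.0, 6.3, 6.1]

threshold, total_sse, left_mean, right_mean = best_split(factor_score, log_sales)
print(threshold, round(left_mean, 2), round(right_mean, 2))
```

Each leaf predicts the mean response of its observations, which is why the fitted tree can be read as a set of segments with a predicted sales level each.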
RANDOM FOREST METHOD

Using the random forest method, I obtained a more accurate prediction for the Fort Myers hospital on the log scale; exponentiating that prediction gives the estimated sales gain in dollars.

CONCLUSIONS

In this project I analyzed a subset of hospitals, chosen by geographical selection, and used cluster analysis to group them into market segments of closely similar hospitals. By finding the best cluster, the one with the highest mean sales that also contained hospitals with no sales, I was able to estimate the potential sales gain for the following hospitals: Galveston, TX; Thomasville, GA; Los Angeles, CA; Valdosta, GA; Fort Myers, FL; and Downey, CA. The PAM, rpart and random forest results support the selection of these hospitals as strong candidates for improving sales of orthopedic equipment in the short term.

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Thanks also to Stan Legum, co-chair of the section, whose feedback has proved invaluable to the writing of this paper.

CONTACT INFORMATION

Your comments and questions are valued and encouraged.
Contact the author at:
George Obsekov
American College of Radiology Research Center
Market Street
Philadelphia, PA
APPENDIX

The following code was used to run the full analysis of the selected data set:

DATA sasuser.hospital;
  INFILE 'hospital.txt' DELIMITER=',';
  INPUT ZIP $ HID $ CITY $ STATE $ BEDS RBEDS OUTV ADM SIR Y HIP9 KNEE9
        TH TRAUMA REHAB HIP9 KNEE9 FEMUR9;

DATA hospital;
  SET sasuser.hospital;
  label ZIP    = 'US POSTAL CODE'
        HID    = 'HOSPITAL ID'
        CITY   = 'CITY NAME'
        STATE  = 'STATE NAME'
        BEDS   = 'NUMBER OF HOSPITAL BEDS'
        RBEDS  = 'NUMBER OF REHAB BEDS'
        OUTV   = 'NUMBER OF OUTPATIENT VISITS'
        ADM    = 'ADMINISTRATIVE COST'
        SIR    = 'REVENUE FROM INPATIENT'
        Y      = 'OF REHABILITATION EQUIPMENT SINCE "JAN JULY "'
               = 'OF REHAB EQUIP FOR THE PREVIOUS "" MO'
        HIP9   = 'NUMBER OF HIP OPERATIONS FOR "99"'
        KNEE9  = 'NUMBER OF KNEE OPERATIONS FOR "99"'
        TH     = 'TEACHING HOSPITAL?'
        TRAUMA = 'DO THEY HAVE A TRAUMA UNIT?'
        REHAB  = 'DO THEY HAVE A REHAB UNIT?'
        HIP9   = 'NUMBER HIP OPERATIONS FOR "99"'
        KNEE9  = 'NUMBER KNEE OPERATIONS FOR "99"'
        FEMUR9 = 'NUMBER FEMUR OPERATIONS FOR "99"';

  /* new response variable */
  = log(+ +Y);
  IF = THEN =.;

  /* select the subset of hospitals located in the Southern states */
  IF STATE EQ 'CA' OR STATE EQ 'FL' or state='tx' or state='sc'
     or state='la' or state='ga' or state='nc' or state='al';

  ARRAY X {} BEDS RBEDS HIP9 KNEE9 HIP9 KNEE9 FEMUR9 OUTV ADM SIR;

  /* STEP TRANSFORMATIONS */
  DO I= TO ; X{I} = SQRT(X{I}); END;
  DO i= to ; X{I} = LOG(+.*X{I}); END;
  DO I= TO ; X{I} = LOG(+.*X{I}); END;

/* factor analysis in two stages, grouping the variables in subgroups */
PROC FACTOR data=hospital METHOD=PRIN NFACT= out=z;
  VAR HIP9 KNEE9 HIP9 KNEE9 FEMUR9;

PROC FACTOR data=hospital METHOD=PRIN NFACT= ROTATE=VARIMAX out=z;
  VAR BEDS RBEDS OUTV ADM SIR TH trauma rehab;

DATA z;
  set z;
  factor = factor;
  keep factor factor;

DATA hospout;
  merge z z;
/* cluster analysis using WARD */
PROC CLUSTER data=hospout METHOD=WARD;
  VAR factorfactor;
  COPY ZIP CITY STATE HID BEDS RBEDS OUTV ADM SIR HIP9 KNEE9 TH TRAUMA REHAB
       HIP9 KNEE9 FEMUR9 factorfactor;

PROC TREE NOPRINT NCL= OUT=TXCLUST;
  COPY ZIP CITY STATE HID BEDS RBEDS OUTV ADM SIR HIP9 KNEE9 TH TRAUMA REHAB
       HIP9 KNEE9 FEMUR9 factorfactor;

/* produce the cluster summary and pick the best cluster */
PROC sort data=TXCLUST;
  by cluster;

PROC means noprint;
  BY cluster;
  VAR factorfactor;
  OUTPUT out=c mean= msales mfmf;

PROC boxplot data=TXCLUST;
  plot *cluster;

SELECT TXCLUST;

DATA cl;
  set TXCLUST;
  if cluster=;

PROC REG DATA=cl;
  MODEL sales = Factorfactor / P R selection=b;
  OUTPUT OUT=C P=PRED R=RESID STDP=STDP;

/* finally undo the clusters and calculate the potential gain */
DATA C;
  SET C;
  rowp = exp(pred+.*stdp*stdp);
  epred = exp(pred);
  sales = exp(sales) ;
  gain = rowp - sales;

PROC sort;
  by gain;

PROC print;

/* code for the special case when the cluster size is very small */
DATA cl;
  SET TXCLUST;
  IF cluster=;
  sales = exp(sales) ;

PROC print;

PROC means data=cl;
  VAR sales;

endsas;

/* suppose the mean of sales is. */
DATA cl;
  set cl;
  gain =.  sales;

PROC sort;
  by gain;

PROC print;
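The back-transformation at the end of the SAS listing (rowp, sales, gain) takes a prediction on the log scale back to dollars and subtracts the current sales. The sketch below mirrors those three statements under two stated assumptions, since the constants are not legible in the listing: the multiplier on stdp squared is taken to be 0.5, the usual lognormal mean correction, and the response is taken to be log(1 + y), so the shift is 1. The hospital names and values are hypothetical.

```python
import math

def potential_gain(pred, stdp, log_sales, correction=0.5, shift=1.0):
    """Potential sales gain on the original scale.

    pred       predicted log sales from the regression
    stdp       standard error of the predicted value
    log_sales  observed response on the log scale
    correction multiplier on stdp**2 (0.5 is an assumption)
    shift      constant added before taking logs (1.0 is an assumption)
    """
    rowp = math.exp(pred + correction * stdp * stdp)  # bias-corrected prediction
    sales = math.exp(log_sales) - shift               # undo log(shift + y)
    return rowp - sales

# Hypothetical hospitals: (name, pred, stdp, observed log sales).
hospitals = [
    ("A", 10.0, 0.2, 0.0),            # no current sales: log(1 + 0) = 0
    ("B", 9.0, 0.3, math.log(5001)),  # current sales of 5000
    ("C", 8.0, 0.1, math.log(2981)),  # already close to its prediction
]

# Rank hospitals by gain, largest first.
ranked = sorted(
    ((name, potential_gain(p, s, y)) for name, p, s, y in hospitals),
    key=lambda item: item[1],
    reverse=True,
)
for name, gain in ranked:
    print(name, round(gain, 2))
```

Hospitals with no current sales and a high predicted value rise to the top of the ranking, which is the pattern the regression analysis section relies on.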
# FACTOR ANALYSIS USING R SOFTWARE
library(foreign)   # provides read.xport
hh = read.xport("hosp.xpt")
dim(hh)
hh[,]
library(cluster)
plot(silhouette(pam(hh[,:], k=)), main = paste("k = ",), do.n.k=FALSE)
plot(silhouette(pam(hh[,:], k=)), main = paste("k = ",), do.n.k=FALSE)
plot(silhouette(pam(hh[,:], k=9)), main = paste("k = ",), do.n.k=FALSE)
plot(silhouette(pam(hh[,:], k=)), main = paste("k = ",), do.n.k=FALSE)
clpam = pam(hh[,:], k=)$cluster
table(clpam)
table(hh[,])
table(clpam, hh[,])
library(rpart)
rpart( ~FACTOR+FACTOR+FACTOR, data=hh)
predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh))
table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh)))
length(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh)))
length(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh), newdata=hh))
table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh), newdata=hh))
plot(rpart( ~FACTOR+FACTOR+FACTOR, data=hh))
text(rpart( ~FACTOR+FACTOR+FACTOR, data=hh))

# RPART ANALYSIS USING R SOFTWARE
library(rpart)
hh[,]
hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh)
plot(hh.rp)
text(hh.rp)
hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh, control=rpart.control(cp=.))
hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh, control=rpart.control(cp=.))
plot(hh.rp)
text(hh.rp)
plot(hh.rp, uni=T)
text(hh.rp, use.n=TRUE, cex=.)
hh.rp = rpart( ~FACTOR+FACTOR+FACTOR, data=hh, control=rpart.control(cp=.))
plot(hh.rp, uni=T)
text(hh.rp, use.n=TRUE, cex=.)
hh.rp
table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh)))
table(predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh), newdata=hh))
predv = (predict(rpart( ~FACTOR+FACTOR+FACTOR, data=hh), newdata=hh))
factor(predv)[:]
as.numeric(factor(predv))[:]
table(as.numeric(factor(predv)))
cluster = (as.numeric(factor(predv)))
hh[ cluster==,]
factor(predv)[:]
exp(.)  factor(predv)[:]
Chapter 27 Using Predictor Variables Chapter Table of Contents LINEAR TREND...1329 TIME TREND CURVES...1330 REGRESSORS...1332 ADJUSTMENTS...1334 DYNAMIC REGRESSOR...1335 INTERVENTIONS...1339 TheInterventionSpecificationWindow...1339
More informationTeaching Multivariate Analysis to BusinessMajor Students
Teaching Multivariate Analysis to BusinessMajor Students WingKeung Wong and TeckWong Soon  Kent Ridge, Singapore 1. Introduction During the last two or three decades, multivariate statistical analysis
More informationExploratory Analysis of Marketing Data: Trees vs. Regression
Exploratory Analysis of Marketing Data: Trees vs. Regression J. Scott Armstrong Assistant Professor of Marketing, The Wharton School and James G. Andress Consultant at Booz, Allen, and Hamilton, Inc.,
More informationCluster this! June 2011
Cluster this! June 2011 Agenda On the agenda today: SAS Enterprise Miner (some of the pros and cons of using) How multivariate statistics can be applied to a business problem using clustering Some cool
More informationM15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1. 15.7 Analytics and Data Mining 1
M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1 15.7 Analytics and Data Mining 15.7 Analytics and Data Mining 1 Section 1.5 noted that advances in computing processing during the past 40 years have
More informationEXPLORATORY DATA ANALYSIS: GETTING TO KNOW YOUR DATA
EXPLORATORY DATA ANALYSIS: GETTING TO KNOW YOUR DATA Michael A. Walega Covance, Inc. INTRODUCTION In broad terms, Exploratory Data Analysis (EDA) can be defined as the numerical and graphical examination
More informationNCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationA fast, powerful data mining workbench designed for small to midsize organizations
FACT SHEET SAS Desktop Data Mining for Midsize Business A fast, powerful data mining workbench designed for small to midsize organizations What does SAS Desktop Data Mining for Midsize Business do? Business
More informationAn Exploratory Data Analysis Approach to Qualitative Response Modelling Using SAS/IML(R) and SAS/GRAPH(R) Software
An Exploratory Data Analysis Approach to Qualitative Response Modelling Using SAS/IML(R) and SAS/GRAPH(R) Software Merwyn L. Elliott Ross Hightower Caleb Chan Statistical Services Laboratory Georgia State
More informationData Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression
Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction
More informationQuestion 2 Naïve Bayes (16 points)
Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the
More informationEffective Use of SQL in SAS Programming
INTRODUCTION Effective Use of SQL in SAS Programming Yi Zhao Merck & Co. Inc., Upper Gwynedd, Pennsylvania Structured Query Language (SQL) is a data manipulation tool of which many SAS programmers are
More informationFactor Analysis. Chapter 420. Introduction
Chapter 420 Introduction (FA) is an exploratory technique applied to a set of observed variables that seeks to find underlying factors (subsets of variables) from which the observed variables were generated.
More informationMETHODS FAIR HEALTH FH NPIC DATA
Understanding Patterns in the Utilization and Cost of Elbow Reconstruction Surgeries: A Healthcare Procedure that is Common among Baseball Pitchers Eric Okurowski, MBA; Jeff Dang, PhD, FAIR Health Inc.,
More informationUSING DATA SCIENCE TO DISCOVE INSIGHT OF MEDICAL PROVIDERS CHARGE FOR COMMON SERVICES
USING DATA SCIENCE TO DISCOVE INSIGHT OF MEDICAL PROVIDERS CHARGE FOR COMMON SERVICES Irron Williams Northwestern University IrronWilliams2015@u.northwestern.edu AbstractData science is evolving. In
More informationEssential Project Management Reports in Clinical Development Nalin Tikoo, BioMarin Pharmaceutical Inc., Novato, CA
Essential Project Management Reports in Clinical Development Nalin Tikoo, BioMarin Pharmaceutical Inc., Novato, CA ABSTRACT Throughout the course of a clinical trial the Statistical Programming group is
More informationA Basic Guide to Modeling Techniques for All Direct Marketing Challenges
A Basic Guide to Modeling Techniques for All Direct Marketing Challenges Allison Cornia Database Marketing Manager Microsoft Corporation C. Olivia Rud Executive Vice President Data Square, LLC Overview
More informationPRINCIPAL COMPONENT ANALYSIS
1 Chapter 1 PRINCIPAL COMPONENT ANALYSIS Introduction: The Basics of Principal Component Analysis........................... 2 A Variable Reduction Procedure.......................................... 2
More informationData mining and statistical models in marketing campaigns of BT Retail
Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120
More informationData Exploration Data Visualization
Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select
More informationSimple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
More informationPaper D10 2009. Ranking Predictors in Logistic Regression. Doug Thompson, Assurant Health, Milwaukee, WI
Paper D10 2009 Ranking Predictors in Logistic Regression Doug Thompson, Assurant Health, Milwaukee, WI ABSTRACT There is little consensus on how best to rank predictors in logistic regression. This paper
More informationSteven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 306022501
CLUSTER ANALYSIS Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 306022501 January 2006 Introduction Cluster analysis includes a broad suite of techniques designed to find groups
More informationEXST SAS Lab Lab #4: Data input and dataset modifications
EXST SAS Lab Lab #4: Data input and dataset modifications Objectives 1. Import an EXCEL dataset. 2. Infile an external dataset (CSV file) 3. Concatenate two datasets into one 4. The PLOT statement will
More informationThe Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA
Paper 1562010 The Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA Abstract JMP has a rich set of visual displays that can help you see the information
More informationData Mining with R. Decision Trees and Random Forests. Hugh Murrell
Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge
More information2009 CREDO Center for Research on Education Outcomes (CREDO) Stanford University Stanford, CA http://credo.stanford.edu June 2009
Technical Appendix 2009 CREDO Center for Research on Education Outcomes (CREDO) Stanford University Stanford, CA http://credo.stanford.edu June 2009 CREDO gratefully acknowledges the support of the State
More informationABSTRACT INTRODUCTION %CODE MACRO DEFINITION
Generating Web Application Code for Existing HTML Forms Don Boudreaux, PhD, SAS Institute Inc., Austin, TX Keith Cranford, Office of the Attorney General, Austin, TX ABSTRACT SAS Web Applications typically
More informationln(p/(1p)) = α +β*age35plus, where p is the probability or odds of drinking
Dummy Coding for Dummies Kathryn Martin, Maternal, Child and Adolescent Health Program, California Department of Public Health ABSTRACT There are a number of ways to incorporate categorical variables into
More informationChapter 25 Specifying Forecasting Models
Chapter 25 Specifying Forecasting Models Chapter Table of Contents SERIES DIAGNOSTICS...1281 MODELS TO FIT WINDOW...1283 AUTOMATIC MODEL SELECTION...1285 SMOOTHING MODEL SPECIFICATION WINDOW...1287 ARIMA
More informationData Preprocessing. Week 2
Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.
More informationMedical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu
Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
More informationDoing Quantitative Research 26E02900, 6 ECTS Lecture 2: Measurement Scales. OlliPekka Kauppila Rilana Riikkinen
Doing Quantitative Research 26E02900, 6 ECTS Lecture 2: Measurement Scales OlliPekka Kauppila Rilana Riikkinen Learning Objectives 1. Develop the ability to assess a quality of measurement instruments
More informationIdentification of noisy variables for nonmetric and symbolic data in cluster analysis
Identification of noisy variables for nonmetric and symbolic data in cluster analysis Marek Walesiak and Andrzej Dudek Wroclaw University of Economics, Department of Econometrics and Computer Science,
More informationA Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND
Paper D022009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression
More informationJoseph Twagilimana, University of Louisville, Louisville, KY
ST14 Comparing Time series, Generalized Linear Models and Artificial Neural Network Models for Transactional Data analysis Joseph Twagilimana, University of Louisville, Louisville, KY ABSTRACT The aim
More informationStatistical Discovery
SCSUG 2014 JMP Visual Statistics Charles Edwin Shipp, Consider Consulting Corp, Los Angeles, CA ABSTRACT For beginners, we review the continuing merging of statistics and graphics. Statistical graphics
More informationAn Overview and Evaluation of Decision Tree Methodology
An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com
More informationSAS R IML (Introduction at the Master s Level)
SAS R IML (Introduction at the Master s Level) Anton Bekkerman, Ph.D., Montana State University, Bozeman, MT ABSTRACT Most graduatelevel statistics and econometrics programs require a more advanced knowledge
More informationSun Li Centre for Academic Computing lsun@smu.edu.sg
Sun Li Centre for Academic Computing lsun@smu.edu.sg Elementary Data Analysis Group Comparison & Oneway ANOVA Nonparametric Tests Correlations General Linear Regression Logistic Models Binary Logistic
More informationLet SAS Modify Your Excel File Nelson Lee, Genentech, South San Francisco, CA
ABSTRACT PharmaSUG 2015  Paper QT12 Let SAS Modify Your Excel File Nelson Lee, Genentech, South San Francisco, CA It is common to export SAS data to Excel by creating a new Excel file. However, there
More informationAssessing Model Fit and Finding a Fit Model
Paper 21429 Assessing Model Fit and Finding a Fit Model Pippa Simpson, University of Arkansas for Medical Sciences, Little Rock, AR Robert Hamer, University of North Carolina, Chapel Hill, NC ChanHee
More informationDidacticiel  Études de cas
1 Topic Linear Discriminant Analysis Data Mining Tools Comparison (Tanagra, R, SAS and SPSS). Linear discriminant analysis is a popular method in domains of statistics, machine learning and pattern recognition.
More informationPerformance Test Suite Results for SAS 9.1 Foundation on the IBM zseries Mainframe
Performance Test Suite Results for SAS 9.1 Foundation on the IBM zseries Mainframe A SAS White Paper Table of Contents The SAS and IBM Relationship... 1 Introduction...1 Customer Jobs Test Suite... 1
More information