Applications of R Software in Bayesian Data Analysis


Article

International Journal of Information Science and System, 2012, 1(1): 7-23
Journal homepage: ISSN: Florida, USA

Nageena Nazir*, Athar Ali Khan, A. H. Mir and Showkat Maqbool

Division of Agricultural Statistics, Sher-e-Kashmir University of Agricultural Sciences & Technology Kashmir, Shalimar, Srinagar

* To whom correspondence should be addressed: nazir.nageena@gmail.com

Article history: Received 15 May 2012, Received in revised form 29 May 2012, Accepted 29 May 2012, Published 30 May

Abstract: Bayesian statistics is an approach to statistics that formally combines prior information with the data, and Bayes' theorem provides the formal basis for using both sources of information. Bayesian analysis is the study of different features of the posterior density. R software is used to explore these features from a numeric as well as a graphic viewpoint, with emphasis on graphical features throughout. In this study, Bayesian analyses are covered for linear regression, analysis of designed experiments, analysis of mixed effects models and logistic regression. The simulation approach to Bayesian analysis was found to be the most useful one.

Keywords: R software, Bayesian Data Analysis

1. Introduction

Bayesian statistics is an approach to statistics that formally seeks to use prior information, and Bayes' theorem provides the basis for using this information in a formal manner. When significant prior information is available, the Bayesian approach shows how to utilize it sensibly. This is not possible with most non-Bayesian approaches. In the Bayesian approach the parameter of interest is treated as random and the data as fixed, in contrast to the frequentist approach, where the parameter is treated as fixed and the data as random.
The business of statistics is to provide information or conclusions about uncertain quantities. The language of uncertainty is probability, and the Bayesian approach consistently uses this language, in the form of conditional probability, to address uncertainty. For a parameter $\theta$ and data $y$, Bayes' theorem states that

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$$

or equivalently

$$\underbrace{p(\theta \mid y)}_{\text{posterior}} \;\propto\; \underbrace{p(y \mid \theta)}_{\text{likelihood}} \; \underbrace{p(\theta)}_{\text{prior}}$$

Bayesian statistics is an excellent alternative, and is often more reasonable, for moderate and especially for small sample sizes, where non-Bayesian procedures may not work (e.g., Berger 1985, page 125).

Data analysis is indispensable in any agricultural research. A large number of software packages have been developed, the most common among them being SAS, SPSS, Minitab, S-PLUS and R. In the present study, R software was used for statistical and graphical analyses. It is an integrated suite of software for data manipulation, calculation and graphical display. It has a large number of functions for data analysis and its own programming language, which is effective and simple. In this study, Bayesian analyses are covered for linear regression, analysis of designed experiments, analysis of mixed effects models and logistic regression. The simulation approach to Bayesian analysis was found to be the most useful one.

2. Material and Methods

In the present paper, R software is applied to study Bayesian methods of agricultural data analysis. This includes summary features of the posterior, that is, the empirical mean, standard deviation, standard error of the mean and quantiles; the posterior density of each variable is also plotted. Functions available in R and in the MCMCpack package are used to illustrate the analytical as well as the graphical viewpoint. Existing data are used for the purpose of illustration. Concepts of Bayesian methods and their R implementations are addressed in each section.

3. Bayesian Analysis of Linear Regression Model

The analysis of a simple regression model is illustrated here; multiple regression models can be handled along similar lines.

Example: wormy fruits

The percentage of wormy fruits attacked by codling moth larvae is greater on apple trees bearing a small crop.
The regressor x is the size of the crop (hundreds of fruits) and the response variable y is the percentage of wormy fruits (e.g., Snedecor and Cochran 1989, page 162). The data frame wormyfruits consists of 12 rows and 2 columns, with column names fruitsize and wormypercent for x and y, respectively.

[Data listing with columns fruitsize and wormypercent; the values are not preserved in this transcription.]

Fit a Bayesian linear model for the data.

# Look into the data graphically
> x11(width=4, height=4)   # To define height and width of the figure
> plot(wormypercent~fruitsize, data=wormyfruits)   # Output is reported in Figure 1.

[Figure 1: scatter plot of wormypercent (vertical axis) against fruitsize (horizontal axis).]

Figure 1: This plot clearly suggests that a simple linear regression model can be fitted.

We shall use MCMCregress of MCMCpack to analyze this model.

> library(MCMCpack)
> M6 <- MCMCregress(wormypercent~fruitsize, data=wormyfruits)
> summary(M6)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000

(1). Empirical mean and standard deviation for each variable, plus standard error of the mean (columns: Mean, SD, Naive SE, Time-series SE; rows: (Intercept), fruitsize, sigma2; the numeric values are not preserved in this transcription).

(2). Quantiles for each variable (columns: 2.5%, 25%, 50%, 75%, 97.5%; rows: (Intercept), fruitsize, sigma2; the numeric values are not preserved in this transcription).

This numeric summary clearly shows that both the intercept and the regression coefficient are statistically significant. A graphic summary can be obtained as well. To plot the posterior densities of the regression coefficients, we use the function plot as:

> plot(M6, trace=FALSE)

Output is reported in Figure 2.
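To make concrete what the posterior draws summarized above represent, here is a minimal base-R sketch. It is not the Gibbs sampler that MCMCregress runs, and its prior may differ from MCMCpack's defaults: it assumes the standard noninformative prior $p(\beta, \sigma^2) \propto 1/\sigma^2$, under which the posterior can be sampled directly. The data are simulated stand-ins (hypothetical), since the wormyfruits values are not available here, and all object names are illustrative.

```r
# Direct posterior sampling for simple linear regression under the
# noninformative prior p(beta, sigma^2) proportional to 1/sigma^2.
# The data below are SIMULATED (hypothetical), not the wormyfruits data.
set.seed(123)
n <- 12
fruitsize    <- runif(n, min = 5, max = 40)
wormypercent <- 65 - 1.0 * fruitsize + rnorm(n, sd = 6)

X <- cbind("(Intercept)" = 1, fruitsize = fruitsize)
k <- ncol(X)
XtX_inv  <- solve(crossprod(X))
beta_hat <- drop(XtX_inv %*% crossprod(X, wormypercent))   # least-squares fit
s2 <- sum((wormypercent - X %*% beta_hat)^2) / (n - k)     # residual variance

ndraw <- 10000
# sigma^2 | y is scaled inverse chi-square with n - k degrees of freedom
sigma2 <- (n - k) * s2 / rchisq(ndraw, df = n - k)
# beta | sigma^2, y is Normal(beta_hat, sigma^2 * (X'X)^{-1})
L <- t(chol(XtX_inv))                      # L %*% t(L) equals (X'X)^{-1}
z <- matrix(rnorm(k * ndraw), nrow = k)
beta <- t(beta_hat + (L %*% z) * rep(sqrt(sigma2), each = k))
colnames(beta) <- colnames(X)

# Numeric summaries of the same kind as those printed by summary()
draws <- cbind(beta, sigma2 = sigma2)
post_summary <- cbind(
  Mean = colMeans(draws),
  SD   = apply(draws, 2, sd),
  t(apply(draws, 2, quantile, probs = c(0.025, 0.5, 0.975)))
)
print(round(post_summary, 3))
```

A coefficient whose 2.5% and 97.5% quantiles lie on the same side of zero is the Bayesian analogue of the "significant" coefficients referred to in the text.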

[Figure 2: posterior density panels for (Intercept), fruitsize and sigma2.]

Figure 2: It is evident from this figure that all the required information is contained in the posterior densities of the parameters $\beta_0$, $\beta_1$ and $\sigma^2$ of the model

$$\text{wormypercent} = \beta_0 + \beta_1 \cdot \text{fruitsize} + \text{error}$$

It may be noted that the likelihood is Normal and the prior is non-informative.

4. Bayesian Analysis of Designed Experiments

4.1. Bayesian Analysis of One Way Data

The analysis of variance technique is commonly used to analyze data generated in an experiment. Its Bayesian parallel is discussed here.

Example: fat data

In the fat absorption data, 4 types of fat are used to study fat absorption patterns, and each fat was replicated 6 times. The purpose of the study was to see the absorption of different fats in doughnuts. Details of the data are available in Snedecor and Cochran (1989), page 218.

[Table: absorption for Fat1 to Fat4 (rows) by replication R1 to R6 (columns); the values are not preserved in this transcription.]

A data frame fatdata has been created for Bayesian modeling. Fit the model as:

> M7 <- MCMCregress(absorption~Fat, data=fatdata)

Print the summary of results as:

> summary(M7)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000

(1). Empirical mean and standard deviation for each variable, plus standard error of the mean (columns: Mean, SD, Naive SE, Time-series SE; rows: (Intercept), FatFat2, FatFat3, FatFat4, sigma2; the numeric values are not preserved in this transcription).

(2). Quantiles for each variable (columns: 2.5%, 25%, 50%, 75%, 97.5%; rows: (Intercept), FatFat2, FatFat3, FatFat4, sigma2; the numeric values are not preserved in this transcription).

It is evident from this output that, keeping Fat1 as the baseline, Fat2 differs significantly from Fat1, whereas Fat3 and Fat4 do not differ significantly from Fat1. This is also evident in the graphic features of the Bayesian analysis; the graphic output is reported in Figure 3.
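The "baseline" interpretation above follows from R's default treatment contrasts for factors. As a minimal sketch, the design matrix can be inspected without any response values; the data frame fat_layout below is a hypothetical stand-in that reproduces only the 4 fats by 6 replications layout described in the text.

```r
# Layout only: 4 fats x 6 replications, as described in the text.
# No absorption values are used; this shows how the factor Fat is coded.
fat_layout <- data.frame(
  Fat = factor(rep(paste0("Fat", 1:4), each = 6))
)

# Default treatment contrasts: the first level (Fat1) is the baseline
X <- model.matrix(~ Fat, data = fat_layout)
print(colnames(X))
print(unique(X))   # one distinct design row per fat
```

The columns are (Intercept), FatFat2, FatFat3 and FatFat4, so the intercept is the Fat1 mean and each FatFatj coefficient is the difference between the mean of fat j and the mean of Fat1. This is why a posterior for FatFat2 concentrated away from zero is read as Fat2 differing from Fat1.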

> plot(M7, trace=FALSE)

[Figure 3: posterior density panels for (Intercept), FatFat2, FatFat3, FatFat4 and sigma2.]

Figure 3: Posterior summaries of MCMCregress for fatdata. This is the Bayesian counterpart of the analysis of variance for one way data.

4.2. Bayesian Analysis of Factorial Experiments

Example: cowpea data

Snedecor and Cochran (1989), page 308, report data in which 3 levels of Variety and 3 levels of Spacing are the two factors, with 4 Replications. The response is the Yield of cowpea hay (lb/100 morgen plot). The design is a factorial Randomized Block Design (RBD). Details of the data are as under:

Table 1: Data on yield of cowpea

[Table: yield for each combination of Variety (V1, V2, V3) and Spacing (S1, S2, S3) by Replication (R1 to R4); the values are not preserved in this transcription.]

To obtain the Bayesian analysis of these data we use the function MCMCregress of MCMCpack. A data frame cowpea is constructed for Bayesian modeling. This data frame contains 36 rows and 4 columns: Replication, Spacing, Variety and yield. The model is fitted as:

> M8 <- MCMCregress(yield~Variety*Spacing, data=cowpea)
> summary(M8)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000

(1). Empirical mean and standard deviation for each variable, plus standard error of the mean (columns: Mean, SD, Naive SE, Time-series SE; rows: (Intercept), VarietyV2, VarietyV3, SpacingS2, SpacingS3, VarietyV2:SpacingS2, VarietyV3:SpacingS2, VarietyV2:SpacingS3, VarietyV3:SpacingS3, sigma2; the numeric values are not preserved in this transcription).

(2). Quantiles for each variable (columns: 2.5%, 25%, 50%, 75%, 97.5%; same rows as above; the numeric values are not preserved in this transcription).
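The coefficient names in this summary come from the way R expands Variety*Spacing into main effects and interaction columns. The sketch below rebuilds only the layout of the experiment (3 varieties, 3 spacings, 4 replications); cowpea_layout is a hypothetical stand-in with no yield values, used to show which coefficients make up the mean of a given cell.

```r
# Layout only: 3 varieties x 3 spacings x 4 replications (36 rows).
# No yield values are used; this shows how Variety*Spacing is coded.
cowpea_layout <- expand.grid(
  Replication = factor(paste0("R", 1:4)),
  Spacing     = factor(paste0("S", 1:3)),
  Variety     = factor(paste0("V", 1:3))
)

X <- model.matrix(~ Variety * Spacing, data = cowpea_layout)
print(colnames(X))

# Coefficients that add up to the mean of the cell (V2, S3)
in_cell <- cowpea_layout$Variety == "V2" & cowpea_layout$Spacing == "S3"
cell <- X[in_cell, , drop = FALSE][1, ]
print(names(cell)[cell == 1])
```

The cell (V1, S1) is represented by the intercept alone, so each interaction coefficient measures a departure from additivity relative to that baseline cell.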

[Figure 4: posterior density panels for (Intercept), VarietyV2, VarietyV3, SpacingS2, SpacingS3 and VarietyV2:SpacingS2.]

Figure 4: Posterior summaries of cowpea data generated in a factorial experiment.

It is evident from these outputs that if V1 and S1 are kept as the baseline, then varieties V2 and V3 differ significantly from V1. Similarly, S3 differs significantly from S1, whereas S2 does not differ significantly from S1. The interaction V1S1 is the baseline for testing interactions, and it is evident that only V2S3 differs significantly from V1S1, whereas V2S2, V3S2 and V3S3 do not differ significantly from V1S1. Posterior densities of the interactions V3S2, V2S3 and V3S3 are not reported here.

5. Bayesian Analysis of Logistic Regression Model

Example: radiotherapy data

The data object radiotherapy consists of data taken from Mendenhall et al. (1989): Radiotherapy and Oncology 16, (see also Tanner 1996, page 28). The radiotherapy data frame contains radiotherapy data on 24 patients, in which rows represent patients and the columns are Days, the number of days of radiotherapy received by each patient, and Response, the absence (1) or presence (0) of disease at a site 3 years after treatment. These data do not come from the agricultural sciences; however, data of this type are quite common in agricultural sciences too. They are introduced here only to illustrate Bayesian logistic regression.

[Data listing with columns Days and Response for the 24 patients; the values are not preserved in this transcription.]

The model for the data is the logistic regression model

$$\log\left(\frac{p_i}{1 - p_i}\right) = \alpha + \beta x_i \qquad (1)$$

where $x_i$ represents the covariate for the ith patient and $p_i$ represents the corresponding probability of success (no disease).

This model specifies that the log-odds of success is linearly related to the number of days the subject received radiotherapy. The intercept $\alpha$ represents the log-odds of success for 0 days, while the slope $\beta$ represents the change in the log-odds of success for every unit increase in the covariate. Thus, from model (1), the probability of success $p_i$ can be written as

$$p_i(x_i) = \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}$$

The logit model is fitted to the radiotherapy data using the function MCMClogit of MCMCpack.

> M9 <- MCMClogit(Response~Days, data=radiotherapy)
The Metropolis acceptance rate for beta was [value not preserved in this transcription]
> summary(M9)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000

(1). Empirical mean and standard deviation for each variable, plus standard error of the mean (columns: Mean, SD, Naive SE, Time-series SE; rows: (Intercept), Days; the numeric values are not preserved in this transcription).

(2). Quantiles for each variable (columns: 2.5%, 25%, 50%, 75%, 97.5%; rows: (Intercept), Days; the numeric values are not preserved in this transcription).

To get the graphic summary of the Bayesian analysis:

> plot(M9, trace=FALSE)   # Output is reported in Figure 5.
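The mapping from log-odds to probability in the formula above can be checked numerically. The following is a small sketch; the values of alpha and beta are hypothetical, chosen only for illustration, and are not estimates from the radiotherapy data.

```r
# Hypothetical coefficient values chosen only for illustration;
# they are NOT estimates from the radiotherapy data.
alpha <- 3.5
beta  <- -0.08
days  <- c(20, 30, 40, 50, 60)

eta      <- alpha + beta * days             # log-odds of success
p_manual <- exp(eta) / (1 + exp(eta))       # formula from the text
p_plogis <- plogis(eta)                     # built-in inverse logit

# Both routes give the same probabilities
stopifnot(isTRUE(all.equal(p_manual, p_plogis)))

# exp(beta) is the multiplicative change in the odds per additional day
odds_ratio_per_day <- exp(beta)

print(data.frame(days, log_odds = eta, p_success = p_plogis))
```

Applying plogis to each posterior draw of (alpha, beta) in the same way would give a posterior distribution for the probability of success at any chosen number of days.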

[Figure 5: posterior density panels for (Intercept) and Days.]

Figure 5: Posterior summary of the logistic regression model fitted to the radiotherapy data discussed above. This figure clearly indicates that Days of therapy are significantly related to the probability of emergence of disease.

6. Bayesian Analysis of Mixed Effects Model (Hierarchical Bayes analysis)

It is often argued that mixed effects models lack theoretical foundations and that the Bayesian approach provides them (see, e.g., Lindley and Smith, 1972, for a detailed discussion). Kass and Steffey (1989) use the terms common effects and unit-specific effects for fixed and random effects, respectively. In terms of priors, non-informative priors stand for fixed effects and informative priors for random effects. However, in the Bayesian spirit every effect is random. A practical implementation of this analysis is available in the lme4 package of R.

Example: coagulation

Effect of diet on coagulation time (seconds) for blood drawn from 24 animals randomly allocated to four different diets (Gelman et al., 1995, page 274; Box, Hunter and Hunter, 1978).

[Table: coagulation times and number of observations for diets A, B, C and D; the values are not preserved in this transcription.]

A data frame coagulation contains the information needed for the analysis. This data frame contains 24 rows and two columns, diet and coagulation time. Bayesian analysis of the data can be carried out in R in the same spirit as in the earlier examples.

> library(lattice)   # provides dotplot
> print(dotplot(diet~coag.time, data=coagulation, xlab="Coagulation time (seconds)", ylab="Diet"))

[Figure 6: dot plot with Diet (A, B, C, D) on the vertical axis and Coagulation time (seconds) on the horizontal axis.]

Figure 6: Dot plot of coagulation data. This figure suggests a random effect for the intercept.

Fitting the model using the lmer function of the lme4 package:

> library(lme4)
> M10 <- lmer(coag.time~1+(1|diet), data=coagulation)
> summary(M10)
Linear mixed-effects model fit by REML
Formula: coag.time ~ 1 + (1 | diet)
Data: coagulation
AIC, BIC, logLik, MLdeviance, REMLdeviance: [values not preserved in this transcription]
Random effects (columns: Groups, Name, Variance, Std.Dev.; rows: diet (Intercept), Residual; values not preserved)
number of obs: 24, groups: diet, 4
Fixed effects (columns: Estimate, Std. Error, t value; row: (Intercept); values not preserved)

6.1. Simulations from the Posterior of M10 Fitted by lmer

An in-depth Bayesian analysis of these data can be made using the simulation tools available in R. For example, to draw 2000 samples from the posterior of the fitted object M10 we use the function mcmcsamp as:

> M10.mcmc <- mcmcsamp(M10, n=2000, deviance=TRUE)   # mcmcsamp is available in lme4 versions before 1.0
> summary(M10.mcmc)
Iterations = 1:2000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 2000

(1). Empirical mean and standard deviation for each variable, plus standard error of the mean (columns: Mean, SD, Naive SE, Time-series SE; rows: (Intercept), log(sigma^2), log(diet.(in)), deviance; the numeric values are not preserved in this transcription).

(2). Quantiles for each variable (columns: 2.5%, 25%, 50%, 75%, 97.5%; same rows as above; the numeric values are not preserved in this transcription).

> plot(M10.mcmc)   # To get the graphic summaries reported in Figure 7.

[Figure 7: trace and posterior density panels (N = 2000) for (Intercept), log(sigma^2), log(diet.(in)) and deviance.]

Figure 7: It is evident from the above plots of posterior densities that, except for the Intercept, none of the posterior densities can be well approximated by a Normal distribution, a common approach used by non-Bayesians.

7. Conclusion

It is clear from this study that the Bayesian approach to agricultural data analysis is a very rich and useful tool. It provides an in-depth study of different features of the data which are otherwise hidden and cannot be explored using other techniques. Moreover, R software has the power and efficiency to deal with the numeric as well as the graphic features of agricultural data. Its simulation tools are more powerful than those of any other statistical package. The future of data analysis lies with the Bayesian approach and R.

References

[1] Box, G. E. P., Hunter, W. G. and Hunter, J. S. (1978): Statistics for Experimenters. John Wiley.
[2] Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (1995): Bayesian Data Analysis. Chapman and Hall.
[3] Kass, R. E. and Steffey, D. (1989): Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). J. Amer. Statist. Assoc., 84:
[4] Lindley, D. V. and Smith, A. F. M. (1972): Bayes estimates for the linear model (with discussion). J. R. Statist. Soc. Ser. B, 34:
[5] R Development Core Team (2007): R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN , URL
[6] Snedecor, G. W. and Cochran, W. G. (1989): Statistical Methods, 8th edition. Iowa State University Press, Ames, Iowa.
[7] Tanner, M. A. (1996): Tools for Statistical Inference. Springer-Verlag.
[8] Venables, W. N. and Ripley, B. D. (2002): Modern Applied Statistics with S-PLUS. Springer, New York.