Applying MCMC Methods to Multi-level Models

submitted by William J. Browne for the degree of PhD of the University of Bath, 1998

COPYRIGHT

Attention is drawn to the fact that copyright of this thesis rests with its author. This copy of the thesis has been supplied on the condition that anyone who consults it is understood to recognise that its copyright rests with its author and that no quotation from the thesis and no information derived from it may be published without the prior written consent of the author. This thesis may be made available for consultation within the University Library and may be photocopied or lent to other libraries for the purposes of consultation.

Signature of Author: William J. Browne

To Health, Happiness and Honesty

Summary

Multi-level modelling and Markov chain Monte Carlo methods are two areas of statistics that have become increasingly popular recently due to improvements in computer capabilities, both in storage and speed of operation. The aim of this thesis is to combine the two areas by fitting multi-level models using Markov chain Monte Carlo (MCMC) methods. This task has been split into three parts in this thesis. Firstly the types of problems that are fitted in multi-level modelling are identified and the existing maximum likelihood methods are investigated. Secondly MCMC algorithms for these models are derived, and finally these methods are compared to the maximum likelihood based methods both in terms of estimate bias and interval coverage properties. Three main groups of multi-level models are considered: firstly N level Gaussian models, secondly binary response multi-level logistic regression models, and finally Gaussian models with complex variation at level 1. Two simple 2 level Gaussian models are firstly considered and it is shown how to fit these models using the Gibbs sampler. Then extensive simulation studies are carried out to compare the Gibbs sampler method with maximum likelihood methods on these two models. For the general N level Gaussian models, algorithms for the Gibbs sampler and two alternative hybrid Metropolis Gibbs methods are given, and these three methods are then compared with each other. One of the hybrid Metropolis Gibbs methods is adapted to fit binary response multi-level models. This method is then compared with two quasi-likelihood methods via a simulation study on one binary response model where the quasi-likelihood methods perform particularly badly. All of the above models can also be fitted using the Gibbs sampling method with the adaptive rejection algorithm in the BUGS package (Spiegelhalter et al. 1994). Finally, Gaussian models with complex variation at level 1, which cannot be fitted in BUGS, are considered. Two methods based on Hastings update steps are given and are tested on some simple examples. The MCMC methods in this thesis have been added to the multi-level modelling package MLwiN (Goldstein et al. 1998) as a by-product of this research.

Acknowledgements

I would firstly like to thank my supervisor, Dr David Draper, whose research in the fields of hierarchical modelling and Bayesian statistics motivated this PhD. I would also like to thank him for his advice and assistance throughout both my MSc and PhD. I would like to thank my parents for supporting me both financially and emotionally through my first degree and beyond. I would like to thank the multilevel models project team at the Institute of Education, in particular Jon Rasbash and Professor Harvey Goldstein, for allowing me to work with them on the MLwiN package. I would also like to thank them for their advice and assistance while I have been working on the package. I would like to thank my brother Edward and his fiancée Meriel for arranging their wedding a month before I am scheduled to finish this thesis. This way I can spread my worries between my PhD and my best man's speech. I would like to thank my girlfriends over the last three years for helping me through various parts of my PhD. Thanks for giving me love and support when I needed it and making my life both happy and interesting. I would like to thank the other members of the statistics group at Bath for teaching me all I know about statistics today. I would like to thank my fellow office mates, past and present, for their humour, conversation and friendship and for joining me in my many pointless conversations. Thanks to family and friends both in Bath and elsewhere. Special thanks are due to the EPSRC for their financial support.

"The only thing I know is that I don't know anything" (Socrates)

Contents

1 Introduction
   Objectives
   Summary of Thesis

2 Multi Level Models and MLn
   Introduction
      JSP dataset
   Analysing Redhill school data
      Linear regression
      Linear models
   Analysing data on the four schools in the borough of Blackbridge
      ANOVA model
      ANCOVA model
      Combined regression
   Two level modelling
      Iterative generalised least squares
      Restricted iterative generalised least squares
      Fitting variance components models to the Blackbridge dataset
      Fitting variance components models to the JSP dataset
      Random slopes model
   Fitting models to pass/fail data
      Extending to multi-level modelling
   Summary

3 Markov Chain Monte Carlo Methods
   Background
   Bayesian inference
   Metropolis sampling
      Proposal distributions
   Metropolis-Hastings sampling
   Gibbs sampling
      Rejection sampling
      Adaptive rejection sampling
      Gibbs sampler as a special case of the Metropolis-Hastings algorithm
   Data summaries
      Measures of location
      Measures of spread
      Plots
   Convergence issues
      Length of burn-in
      Mixing properties of Markov chains
      Multi-modal models
      Summary
   Use of MCMC methods in multi-level modelling
   Example - Bivariate normal distribution
      Metropolis sampling
      Metropolis-Hastings sampling
      Gibbs sampling
      Results
   Summary

4 Gaussian Models 1 - Introduction
   Introduction
   Prior distributions
      Informative priors
      Non-informative priors
      Priors for fixed effects
      Priors for single variances
      Priors for variance matrices
   Two level variance components model
      Gibbs sampling algorithm
      Simulation method
      Results: Bias
      Results: Coverage probabilities and interval widths
      Improving maximum likelihood method interval estimates for $\sigma^2_u$
      Summary of results
   Random slopes regression model
      Gibbs sampling algorithm
      Simulation method
      Results
   Conclusions
      Simulation results
      Priors in MLwiN

5 Gaussian Models 2 - General Models
   General N level Gaussian hierarchical linear models
   Gibbs sampling approach
      Generalising to N levels
      Algorithm
      Computational considerations
   Method 2: Metropolis Gibbs hybrid method with univariate updates
      Algorithm
      Choosing proposal distribution variances
      Adaptive Metropolis univariate normal proposals
   Method 3: Metropolis Gibbs hybrid method with block updates
      Algorithm
      Choosing proposal distribution variances
      Adaptive multivariate normal proposal distributions
   Summary
   Timing considerations

6 Logistic Regression Models
   Introduction
   Multi-level binary response logistic regression models
      Metropolis Gibbs hybrid method with univariate updates
      Other existing methods
   Example 1: Voting intentions dataset
      Background
      Model
      Results
      Substantive Conclusions
      Optimum proposal distributions
   Example 2: Guatemalan child health dataset
      Background
      Model
      Original 25 datasets
      Simulating more datasets
      Conclusions
   Summary

7 Gaussian Models 3 - Complex Variation at level 1
   Model definition
   Updating methods for a scalar variance
      Metropolis algorithm for $\log \sigma^2$
      Hastings algorithm for $\sigma^2$
      Example: Normal observations with an unknown variance
      Results
   Updating methods for a variance matrix
      Hastings algorithm with an inverse Wishart proposal
      Example: Bivariate normal observations with an unknown variance matrix
      Results
   Applying inverse Wishart updates to complex variation at level 1
      MCMC algorithm
      Example
      Conclusions
   Method 2: Using truncated normal Hastings update steps
      Update steps at level 1 for the JSP example
      Proposal distributions
      Example 2: Non-positive definite and incomplete variance matrices at level 1
      General algorithm for truncated normal proposal method
   Summary

8 Conclusions and Further Work
   Conclusions
   MCMC options in the MLwiN package
   Further work
      Binomial responses
      Multinomial models
      Poisson responses for count data
      Extensions to complex variation at level 1
      Multivariate response models

List of Figures

2-1 Plot of the regression lines for the four schools in the Borough of Blackbridge
2-2 Tree diagram for the Borough of Blackbridge
3-1 Histogram of parameter 1 using the Gibbs sampling method
3-2 Kernel density plot of parameter 1 using the Gibbs sampling method and a Gaussian kernel with a large value of the window width h
3-3 Traces of parameter 1 and the running mean of parameter 1 for a Metropolis run that converges after about 50 iterations; the upper solid line in the lower panel is the running mean with the first 50 iterations discarded
3-4 ACF and PACF for parameter 1 for a Gibbs sampling run of length 5000 that is mixing well and a Metropolis run that is not mixing very well
3-5 Kernel density plot of parameter 2 using the Gibbs sampling method and a Gaussian kernel
3-6 Plots of the Raftery Lewis $\hat{N}$ values for various values of the proposal distribution standard deviation
3-7 Plot of the MCMC diagnostic window in the package MLwiN for parameter 1 from a random slopes regression model
4-1 Plot of normal prior distributions over the range (-5, 5) with mean 0 and variances 1, 2, 5, 10 and 50 respectively
4-2 Plots of biases obtained for the various methods against study design and parameter settings
4-3 Trajectories plot of IGLS estimates for a run of the random slopes regression model where convergence is not achieved
4-4 Plots of biases obtained for the various methods fitting the random slopes regression model against the value of u01 (fixed effects parameters and level 1 variance parameter)
4-5 Plots of biases obtained for the various methods fitting the random slopes regression model against the value of u01 (level 2 variance parameters)
4-6 Plots of biases obtained for the various methods fitting the random slopes regression model against study design (fixed effects parameters and level 1 variance parameter)
4-7 Plots of biases obtained for the various methods fitting the random slopes regression model against study design (level 2 variance parameters)
5-1 Plots of the effect of varying the scale factor for the proposal variance, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the $\beta_0$ parameter in the variance components model on the JSP dataset
5-2 Plots of the effect of varying the scale factor for the proposal variance, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the $\beta_0$ parameter in the random slopes regression model on the JSP dataset
5-3 Plots of the effect of varying the scale factor for the proposal variance, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the $\beta_1$ parameter in the random slopes regression model on the JSP dataset
5-4 Plots of the effect of varying the scale factor for the multivariate normal proposal distribution, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the $\beta_0$ parameter in the random slopes regression model on the JSP dataset
5-5 Plots of the effect of varying the scale factor for the multivariate normal proposal distribution, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the $\beta_1$ parameter in the random slopes regression model on the JSP dataset
6-1 Plot of the effect of varying the scale factor for the univariate normal proposal distribution on the Raftery Lewis diagnostic for the $\sigma^2_u$ parameter in the voting intentions dataset
6-2 Plots comparing the actual coverage of the four estimation methods with their nominal coverage for the parameters $\beta_0$, $\beta_1$ and $\beta_2$
6-3 Plots comparing the actual coverage of the four estimation methods with their nominal coverage for the parameters $\beta_3$, $\sigma^2_v$ and $\sigma^2_u$
7-1 Plots of truncated univariate normal proposal distributions for a parameter: A is the current value and B is the proposed new value; M and m are the truncation points (the maximum and minimum); the distributions in (i) and (iii) have mean equal to the current value, while the distributions in (ii) and (iv) have mean equal to the proposed value

List of Tables

2.1 Summary of Redhill primary school results from JSP dataset
2.2 Parameter estimates for model including Sex and Non-Manual covariates for Redhill primary school
2.3 Summary of schools in the borough of Blackbridge
2.4 Parameter estimates for ANOVA and ANCOVA models for the borough of Blackbridge dataset
2.5 Parameter estimates for two variance components models using both IGLS and RIGLS for the Borough of Blackbridge dataset
2.6 Parameter estimates for two variance components models using both IGLS and RIGLS for all schools in the JSP dataset
2.7 Comparison between fitted values using the ANOVA model and the variance components model 1 using RIGLS
2.8 Parameter estimates for the random slopes model using both IGLS and RIGLS for all schools in the JSP dataset
2.9 Comparison between fitted regression lines produced by separate regressions and the random slopes model
2.10 Parameter estimates for the two logistic regression models fitted to the Blackbridge dataset
2.11 Parameter estimates for the two-level logistic regression models fitted to the JSP dataset
3.1 Comparison between MCMC methods for fitting a bivariate normal model with unknown mean vector
3.2 Comparison between 95% confidence intervals and Bayesian credible intervals in the bivariate normal model
4.1 Summary of study designs for the variance components model simulation
4.2 Summary of times for Gibbs sampling in the variance components model with different study designs for 50,000 iterations
4.3 Summary of Raftery Lewis convergence times (thousands of iterations) for various studies
4.4 Summary of simulation lengths for Gibbs sampling the variance components model with different study designs
4.5 Estimates of relative bias for the variance parameters using different methods and different studies; true level 2/level 1 variance values are 10 and 40
4.6 Estimates of relative bias for the variance parameters using different methods and different true values; all runs use study design 7
4.7 Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the fixed effect parameter using different methods and different studies; true values for the variance parameters are 10 and 40; approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates
4.8 Average 90%/95% interval widths for the fixed effect parameter using different studies; true values for the variance parameters are 10 and 40
4.9 Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the fixed effect parameter using different methods and different true values; all runs use study design 7; approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates
4.10 Average 90%/95% interval widths for the fixed effect parameter using different true parameter values; all runs use study design 7
4.11 Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 2 variance parameter using different methods and different studies; true values of the variance parameters are 10 and 40; approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates
4.12 Average 90%/95% interval widths for the level 2 variance parameter using different studies; true values of the variance parameters are 10 and 40
4.13 Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 2 variance parameter using different methods and different true values; all runs use study design 7; approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates
4.14 Average 90%/95% interval widths for the level 2 variance parameter using different true parameter values; all runs use study design 7
4.15 Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 1 variance parameter using different methods and different studies; true values of the variance parameters are 10 and 40; approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates
4.16 Average 90%/95% interval widths for the level 1 variance parameter using different studies; true values of the variance parameters are 10 and 40
4.17 Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 1 variance parameter using different methods and different true values; all runs use study design 7; approximate MCSEs are 0.28%/0.15% for 90%/95% coverage estimates
4.18 Average 90%/95% interval widths for the level 1 variance parameter using different true parameter values; all runs use study design 7
4.19 Summary of results for the level 2 variance parameter $\sigma^2_u$ using the RIGLS method and inverse gamma intervals
4.20 Summary of the convergence for the random slopes regression with the maximum likelihood based methods (IGLS/RIGLS); the study design is given in terms of the number of level 2 units and whether the study is balanced (B) or unbalanced (U)
4.21 Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values u00 = 5, u01 = 0 and u11 = 0.5; all 1000 runs
4.22 Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values u00 = 5, u01 = 1.4 and u11 = 0.5; only 982 runs
4.23 Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values u00 = 5, u01 = -1.4 and u11 = 0.5; only 984 runs
4.24 Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values u00 = 5, u01 = 0.5 and u11 = 0.5; all 1000 runs
4.25 Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values u00 = 5, u01 = -0.5 and u11 = 0.5; only 998 runs
4.26 Summary of results for the random slopes regression with the 48 schools balanced design with parameter values u00 = 5, u01 = 0.0 and u11 = 0.5; all 1000 runs
4.27 Summary of results for the random slopes regression with the 12 schools unbalanced design with parameter values u00 = 5, u01 = 0.0 and u11 = 0.5; only 877 runs
4.28 Summary of results for the random slopes regression with the 12 schools balanced design with parameter values u00 = 5, u01 = 0.0 and u11 = 0.5; only 990 runs
5.1 Optimal scale factors for proposal variances and best acceptance rates for several models
5.2 Demonstration of Adaptive Method 1 for parameters $\beta_0$ and $\beta_1$ using arbitrary (1000) starting values
5.3 Comparison of results for the random slopes regression model on the JSP dataset using uniform priors for the variances, and different MCMC methods; each method was run for 50,000 iterations after a burn-in
5.4 Demonstration of Adaptive Method 2 for parameters $\beta_0$ and $\beta_1$ using arbitrary (1000) starting values
5.5 Demonstration of Adaptive Method 3 for the parameter vector using RIGLS starting values
5.6 Comparison of results for the random slopes regression model on the JSP dataset using uniform priors for the variances, and different block updating MCMC methods; each method was run for 50,000 iterations after a burn-in
6.1 Comparison of results from the quasi-likelihood methods and the MCMC methods for the voting intention dataset; the MCMC method is based on a run of 50,000 iterations after a burn-in of 500 and an adapting period
6.2 Optimal scale factors for proposal variances and best acceptance rates for the voting intentions model
6.3 Summary of results (with Monte Carlo standard errors) for the first 25 datasets of the Rodriguez Goldman example
6.4 Summary of results (with Monte Carlo standard errors) for the Rodriguez Goldman example with 500 generated datasets
7.1 Comparison between three MCMC methods for a univariate normal model with unknown variance
7.2 Comparison between two MCMC methods for a bivariate normal model with unknown variance matrix
7.3 Comparison between IGLS/RIGLS and the MCMC method on a simulated dataset with the layout of the JSP dataset
7.4 Comparison between RIGLS and MCMC method 2 on three models with complex variation fitted to the JSP dataset

Chapter 1

Introduction

1.1 Objectives

Multi-level modelling has recently become an increasingly interesting and applicable statistical tool. Many areas of application fit readily into a multi-level structure. Goldstein and Spiegelhalter (1996) illustrate the use of multi-level modelling in two leading application areas, health and education; other application areas include household surveys and animal growth studies. Several packages have been written to fit multi-level models. MLn (Rasbash and Woodhouse (1995)), HLM (Bryk et al. (1988), Bryk and Raudenbush (1992)) and VARCL (Longford (1987), Longford (1988)) are all packages which use maximum likelihood or empirical Bayes methodology as their fitting mechanisms. These methods are used to find estimates of parameters of interest in complicated models where exact methods would involve intractable integrations. Another technique that has come to the forefront of statistical research over the last decade or so is the use of Markov chain Monte Carlo (MCMC) simulation methods (Gelfand and Smith 1990). With the increase in computer power, both in speed of computation and in memory capacity, techniques that were theoretical ideas thirty years ago are now practical reality. The structure of the multi-level model, with its interdependence between variables, makes it an ideal area of application for MCMC techniques. Draper (1995) describes the use of multi-level modelling in the social sciences and recommends greater use of MCMC methods in this field.

When MCMC methods were first introduced, if statisticians wanted to fit a complicated model they would program up their own MCMC sampler for the problem they were considering and use it to solve that problem. More recently a general purpose MCMC sampler, BUGS (Spiegelhalter et al. 1995), has been produced that will fit a wide range of models in many application areas. BUGS uses a technique called Gibbs sampling to fit its models, using an adaptive rejection algorithm described in Gilks and Wild (1992). In this thesis I am interested in studying multi-level models, and comparing the maximum likelihood based methods in the package MLn with MCMC methods. I will parallel the work of BUGS and consider fitting various families of multi-level models using both Gibbs sampling and Metropolis-Hastings sampling methods. I will also consider how the maximum likelihood methods can be used to give the MCMC methods good starting values and suitable proposal distributions for Metropolis-Hastings sampling. The package MLwiN (Goldstein et al. 1998) is the new version of MLn and some of its new features are a result of the work contained in this thesis. MLwiN contains for the first time MCMC methodology as well as the existing maximum likelihood based methods.

1.2 Summary of Thesis

In the next chapter I will discuss some of the background to multi-level modelling, using an educational dataset as an example. I will introduce multi-level modelling as an extension to linear modelling and explain briefly how the existing maximum likelihood methods in MLn fit multi-level models. In Chapter 3 I will consider MCMC simulation techniques and summarise the main techniques: Metropolis sampling, Gibbs sampling and Hastings sampling. I will explain how such techniques are used and how to get estimates from the chains they produce. I will also consider convergence issues when using Markov chains and motivate all the methods with a simple example. In Chapter 4 I will consider two very simple multi-level models, the two-level variance components model and the random slopes regression model, both introduced in Chapter 2. I will use these models to illustrate the important issue of choosing general `diffuse' prior distributions when using MCMC methods.

The chapter will consist of two large simulation experiments to compare and contrast the IGLS and RIGLS maximum likelihood methods with MCMC methods using various prior distributions under different scenarios. In Chapter 5 I will discuss some more general algorithms that will fit N level Gaussian multi-level models. I will give three algorithms: firstly Gibbs sampling, and then two hybrid Gibbs Metropolis samplers, the first containing univariate updating steps and the second block updating steps. For each hybrid sampler I will also describe an adaptive Metropolis technique to improve its mixing. I will then compare all the samplers through some simple examples. In Chapter 6 I will discuss multi-level logistic regression models. I will consider one of the hybrid samplers introduced in the previous chapter and show how it can be modified to fit these new models. These models are a family on which maximum likelihood techniques perform particularly badly. I will therefore compare the maximum likelihood based methods with the new hybrid sampler via another simulation experiment. In Chapter 7 I will introduce a complex variation structure at level 1 as a generalisation of the Gaussian models introduced in Chapter 5. I will then implement two Hastings updating techniques for the level 1 variance parameters that aim to sample from such models: firstly a technique based on an inverse Wishart proposal distribution, and secondly a technique based on a truncated normal proposal distribution. I will then compare the results of both methods to the maximum likelihood methods. In Chapter 8 I will discuss other multi-level models that have not been fitted in the previous chapters and add some general conclusions about the thesis as a whole.

Chapter 2

Multi Level Models and MLn

2.1 Introduction

In the introduction I mentioned several applications that contain datasets where a multi-level structure is appropriate. The package MLn (Rasbash and Woodhouse 1995) was written at the Institute of Education primarily to fit models in the area of education, although it can be used in many of the other applications of multi-level modelling. In this chapter I intend to consider, through examples, some statistical problems that arise in the field of education. These problems will increase in complexity to incorporate multi-level modelling. I will explain how the maximum likelihood methods in MLn can be used to fit the models as each new model is introduced. The dataset used in this chapter is the Junior School Project (JSP) dataset analysed in Woodhouse et al. (1995).

2.1.1 JSP dataset

The JSP is a longitudinal study of approximately 2000 pupils who entered junior school in 1980. Woodhouse et al. (1995) analyse a subset of the data containing 887 pupils from 48 primary schools taken from the Inner London Education Authority (ILEA). For each child they consider his/her Maths scores in two tests, marked out of 40, taken in years 3 and 5, along with other variables that measure the child's background. I will consider smaller subsets of this subset in the models considered in this chapter. Any names used in the examples are fictitious, and are simply used to aid my descriptions.

I will now consider as my first dataset the sample of pupils from one school participating in the Junior School Project, and consider how to statistically describe information on an individual pupil.

2.2 Analysing Redhill school data

Redhill Primary school is the 5th school in the JSP dataset and the sample of pupils participating in the JSP has 25 pupils who sat Maths tests in years 3 and 5. I will denote the Maths scores, out of a possible 40, in years 3 and 5 as M3 and M5 respectively. When considering the data from one school, it is the individual pupils' marks that are of interest. The data for Redhill school are given in Table 2.1. The individual pupils and their parents will be interested in how they, or their children, are doing in terms of what marks were achieved, and how these marks compare with the other pupils in the school. Consider John Smith, pupil 10, who achieved 30 in both his M3 and M5 test scores, or equivalently 75% in each test. If this is the only information available then John Smith appears to have made steady progress in mathematics. If instead the marks for the whole class are available then each child could be given a ranking to indicate where he/she finished in the class. It can now be seen that John Smith ranked equal eighth in the third year test but only equal eighteenth in the second test, so although he got the same mark in each test, compared to the rest of the class he has done worse in the second test. This is because, although his marks have stayed constant, the mean mark for the class has risen from 27.8 to 32. This may be because the second test is in fact comparatively easier than the first test, or the teaching between the two tests has improved the children's average performance. With only the data given it is impossible to distinguish between these two reasons for the improved average mark.

2.2.1 Linear regression

A better way to compare John Smith's two marks is to perform a regression of the M5 marks on the M3 marks.

Table 2.1: Summary of Redhill primary school results from JSP dataset (columns: Pupil, M3 score, M3 rank, M5 score, M5 rank, Manual/Non-manual background, Sex).

This will be the first model to be fitted to the dataset and can be written as:

$$M5_i = \beta_0 + \beta_1 M3_i + e_i$$

where the $e_i$ are the residuals and are assumed to be distributed normally with mean zero and variance $\sigma^2$. The linear regression problem is studied in great detail in basic statistics courses and the least squares estimates are as follows:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

where in this example, $x = M3$ and $y = M5$. Fitting the above model gives estimates $\hat{\beta}_0 = 15.96$ and $\hat{\beta}_1 = 0.575$, and from these estimates expected values and residuals can be calculated. Consequently John Smith, with his mark of 30 for year 3, would be expected to get 33.22 for year 5, and the residual for John Smith is then $30 - 33.22 = -3.22$. This means that, using this model, John Smith got 3.22 marks less than would be expected of the average pupil (at Redhill school) with a mark of 30 in year 3.

2.2.2 Linear models

The simple linear regression model is a member of a larger family of models known as normal linear models (McCullagh and Nelder 1983). Two other covariates, the sex of each pupil and whether their parent's occupation was manual or non-manual, were collected. The simple linear regression model can be expanded to include these two covariates as follows:

$$M5_i = \beta_0 + \beta_1 M3_i + \beta_2 SEX_i + \beta_3 NONMAN_i + e_i = X_i\beta + e_i$$

where $SEX_i = 0$ for girls, 1 for boys, and $NONMAN_i = 0$ for manual work and 1 for non-manual work. The formula for the least squares estimates for a normal linear model is similar to the formula for simple linear regression, except with matrices replacing vectors. The estimate for the parameter vector $\beta$ is

$$\hat{\beta} = (X^T X)^{-1} X^T y.$$

For our model the least squares estimates are given in Table 2.2. From Table 2.2 it can be seen that, on average, pupils from a non-manual background do better than pupils from a manual background and that boys do slightly better than girls.
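To make these least squares formulas concrete, here is a minimal sketch in Python (using numpy); the score arrays are invented placeholders rather than the actual Redhill data, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical scores standing in for the JSP data (marks out of 40).
m3 = np.array([25.0, 30.0, 18.0, 35.0, 28.0])   # year 3 marks
m5 = np.array([29.0, 33.0, 24.0, 38.0, 31.0])   # year 5 marks
sex = np.array([0, 1, 0, 1, 1])                 # 0 = girl, 1 = boy
nonman = np.array([1, 0, 0, 1, 1])              # 1 = non-manual background

# Simple linear regression: beta1_hat = S_xy / S_xx, beta0_hat = ybar - beta1_hat * xbar.
b1 = np.sum((m3 - m3.mean()) * (m5 - m5.mean())) / np.sum((m3 - m3.mean()) ** 2)
b0 = m5.mean() - b1 * m3.mean()
print("simple regression:", b0, b1)

# Normal linear model with the extra covariates: beta_hat = (X'X)^{-1} X'y.
X = np.column_stack([np.ones_like(m3), m3, sex, nonman])
beta_hat = np.linalg.solve(X.T @ X, X.T @ m5)
residuals = m5 - X @ beta_hat                   # observed minus expected marks
print("linear model:", beta_hat)
```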

Table 2.2: Parameter estimates for model including Sex and Non-Manual covariates for Redhill primary school

Parameter    Variable      Estimate (SE)
$\beta_0$    Intercept     18.09 (7.73)
$\beta_1$    M3            0.43 (0.28)
$\beta_2$    Sex           0.15 (2.20)
$\beta_3$    Non Manual    2.70 (2.20)

Considering John Smith again, his expected M5 mark under this model is now 31.27 and so is closer to his actual mark. However, in this model neither of the additional covariates has an effect that is significantly different from zero, and so the standard procedure would be to remove them from the model and revert to the simple linear regression model. The purpose of reviewing normal linear models is to show how they are related to hierarchical models. I will now expand the dataset to include all the pupils sampled in four of the schools in the JSP.

2.3 Analysing data on the four schools in the borough of Blackbridge

I am now going to consider a new educational situation involving the JSP dataset. A family has moved into the borough of Blackbridge, which has four primary schools in its area, and they want to choose a school for their children, who have sat M3 tests. The four schools I have selected for the fictional borough are Bluebell (School 2), Redhill (School 5), Greenacres (School 9) and Greyfriars (School 13). The four schools are summarised in Table 2.3.

Table 2.3: Summary of schools in the borough of Blackbridge (columns: School Name, Pupils Sampled, M3 Mean, M5 Mean, Male (%), NonMan (%); the non-manual percentages for Bluebell, Redhill, Greenacres, Greyfriars and all schools combined are 30%, 64%, 67%, 0% and 48%).

From the table we can see that Redhill school had the best M5 average results. Bluebell school had the best average improvement in Maths from years 3 to 5. Greenacres school had the second highest M5 average but had far less improvement than the other schools. Greyfriars, although having the worst M5 average of the four schools, had 100% of pupils from a manual background. If the simple linear regression model regressing M5 on M3 is fitted to the four schools separately, the resulting regression lines can be seen in Figure 2-1.

Figure 2-1: Plot of the regression lines for the four schools in the Borough of Blackbridge (Math 3 score on the horizontal axis, Math 5 score on the vertical axis; one line each for Bluebell, Redhill, Greenacres and Greyfriars).

From these regression lines it can be seen that if we were to choose the best school for a particular child by using their M3 mark then no one school dominates, and our choice would depend on the M3 mark for the particular child.

In comparing the 4 regression lines to consider which school is best, the same model has effectively been fitted four times, and the results for each school are only influenced by the other schools by comparing the four graphs. It would be better if the data from all four schools could be incorporated in one model. I will now introduce some more models that attempt to do this.

2.3.1 ANOVA model

The Analysis of Variance (ANOVA) model consists of fitting factor variables to a single response variable, for example, for the current dataset,

$$M5_{ij} = \beta_0 + SCHOOL_j + e_{ij}.$$

One way of fitting the above model is to constrain $SCHOOL_4$ to be zero to avoid co-linearity amongst the predictor variables. The parameter estimates are given in the second column of Table 2.4.

Table 2.4: Parameter estimates for ANOVA and ANCOVA models for the borough of Blackbridge dataset

Parameter     ANOVA Est (SE)   ANCOVA Est (SE)
$\beta_0$     -- (1.57)        12.92 (3.28)
$\beta_1$                      -- (0.13)
$SCHOOL_1$    -- (2.38)        1.20 (2.02)
$SCHOOL_2$    -- (1.94)        1.00 (1.72)
$SCHOOL_3$    -- (2.00)        -1.67 (1.94)

From the parameter estimates, expected values for pupils in all four schools can be found and these are the actual school means. This shows that the ANOVA model does not actually combine the data for the four schools but instead can be used to compare the four schools to check if the schools are significantly different.

2.3.2 ANCOVA model

The Analysis of Covariance (ANCOVA) model is a combination of an ANOVA model and a linear regression model. In its simplest form there are two predictors, a regression variable and a factor variable, for example, for the current dataset:

28 M5 ij = M3 ij + SCHOOL j + e ij : Again to t the model co-linearity amongst the predictor variables needs to be avoided and so SCHOOL 4 can be constrained to be zero The parameter estimates are given in the third column of Table 24 The ANCOVA model ts parallel regression lines, one for each school, when regressing M 5 on M 3 Here the data for one school is dependent on the other schools due to the common slope parameter 1 The intercepts for each school are independent given the common slope and so are the least squares estimates for the data assuming the given slope 233 Combined regression An easy way to combine the information from the four schools into one model is to simply ignore the fact that the children attend dierent schools This results in the following regression model : M5 i = M3 i + e i for all 69 pupils Fitting this model gives parameter estimates ^ 0 = 11:82 and ^ 1 = 0:52 While the ANOVA model considers the school means as a one level dataset, here we consider all 69 pupils as a one level dataset, so neither of these approaches exploits the structure of the problem, illustrated in Figure 2-2 An alternative model that aims to t this structure is illustrated in the next section 24 Two level modelling To adapt the problem to the structure illustrated in Figure 2-2, in a way that permits generalising outward from the four schools, I must not only consider the pupils from a school to be a random sample from that school, but also assume the schools are a sample from a population of schools In this way the ANOVA model can be modied as follows : 11

$$M5_{ij} = \beta_0 + School_j + e_{ij}, \quad School_j \sim N(0, \sigma^2_s), \quad e_{ij} \sim N(0, \sigma^2_e).$$

Similarly the ANCOVA model becomes

$$M5_{ij} = \beta_0 + \beta_1 M3_{ij} + School_j + e_{ij}, \quad School_j \sim N(0, \sigma^2_s), \quad e_{ij} \sim N(0, \sigma^2_e).$$

Figure 2-2: Tree diagram for the Borough of Blackbridge (the borough at the top, the four schools Bluebell, Redhill, Greenacres and Greyfriars below it, and the sampled pupils of each school at the bottom: 10, 25, 21 and 13 pupils respectively).

In these models the predictor variables can be split into two parts, the fixed part and the random part. In the above models the fixed part consists of the $\beta$'s, and the product of the $\beta$ vector and the X matrix is known as the fixed predictor. These models are known as variance components models because the variance of the response variable M5 about the fixed predictor is

$$\mbox{var}(M5_{ij} \mid \beta) = \mbox{var}(School_j + e_{ij}) = \sigma^2_s + \sigma^2_e,$$

the sum of the level 1 (pupil) and level 2 (school) variances. Both models described above are variance components models but with different fixed predictors. Variance components models cannot be fitted by ordinary least squares as no closed form least squares solutions exist; instead an alternative approach is required. There are many such approaches and in this chapter I will consider the techniques already used in the package MLn for fitting these models. These techniques are based on iterative procedures to give estimates and are a type of maximum likelihood estimation. The other multi-level modelling computer packages use slightly different approaches: HLM (Bryk et al. 1988) uses empirical Bayes techniques based on the EM algorithm (Dempster, Laird, and Rubin 1977), while VARCL uses a fast Fisher scoring algorithm (Longford 1987). In the next chapter I will introduce MCMC techniques that use simulation methods to calculate estimates, and these methods will be used throughout this thesis.

2.4.1 Iterative generalised least squares

Iterative generalised least squares (IGLS, Goldstein (1986)) is an iterative procedure based on generalised least squares estimation. Consider the variance components model that corresponds to the ANCOVA model; here estimates need to be found for both the fixed effects, $\beta_0$ and $\beta_1$, and the random parameters $\sigma^2_s$ and $\sigma^2_e$. If the values of the two variances are known then the variance matrix V for the response variable can be calculated, and this leads to estimates for the fixed effects,

$$\hat{\beta} = (X^T V^{-1} X)^{-1} X^T V^{-1} Y$$

where in this example, $Y_1 = M5_{11}$ and $X_1 = (1, M3_{11})$, etc. Considering the earlier dataset with 69 pupils in 4 schools, the variance matrix V will be 69 by 69 with a block diagonal structure as follows:

$$V_{ij} = \begin{cases} \sigma^2_s + \sigma^2_e & \mbox{if } i = j \\ \sigma^2_s & \mbox{if } i \neq j \mbox{ but } School[i] = School[j] \\ 0 & \mbox{otherwise.} \end{cases}$$

From the estimates $\hat{\beta}$, the raw residuals can then be calculated as follows:

$$\tilde{r}_{ij} = M5_{ij} - \hat{\beta}_0 - \hat{\beta}_1 M3_{ij}.$$

If the vector of these raw residuals, $\tilde{R}$, is formed into the cross product matrix $R^+ = \tilde{R}\tilde{R}^T$, then this matrix has expected value V, the variance matrix defined above. This can be used to create a linear model with predictors $\sigma^2_s$ and $\sigma^2_e$ as follows:

$$R^+_{ij} = A_{ij}\,\sigma^2_s + B_{ij}\,\sigma^2_e + \epsilon_{ij}$$

where $A_{ij}$ and $B_{ij}$ take values of 0 or 1, $A_{ij} = 1$ when $School[i] = School[j]$ and $B_{ij} = 1$ if $i = j$. Using the block diagonal structure of V, $\mbox{vec}(R^+)$ can be constructed so that it only contains the elements of $R^+$ that do not have expected value zero. This means in the example with 4 schools, the vector will be of length $10^2 + 25^2 + 21^2 + 13^2 = 1335$, and just these terms need be included in the linear model above, as opposed to all $69^2 = 4761$ terms. After applying regression to this model, new estimates are obtained for the two variance parameters $\sigma^2_s$ and $\sigma^2_e$. These new estimates can now be used instead of our initial estimates and the whole procedure repeated until convergence is obtained.

Convergence

To calculate when the method has converged relies on setting a tolerance level. Estimates from consecutive iterations are compared and if the difference between the two estimates is within the tolerance boundaries then convergence of that parameter is said to have been achieved. The method finishes at the first iteration at which all parameters have converged. To start the IGLS procedure requires initial values, which are normally obtained by finding the ordinary least squares estimates of the fixed effects.
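The sketch below is a minimal, illustrative implementation of this IGLS cycle for the two-level variance components model; the `school` argument is assumed to be an integer array of school identifiers, the tolerance is arbitrary, and a production routine (such as the one in MLn) would exploit the block structure of V rather than invert it directly.

```python
import numpy as np

def igls(y, X, school, tol=1e-6, max_iter=100):
    """Illustrative IGLS loop for a 2-level variance components model."""
    n = len(y)
    A = (school[:, None] == school[None, :]).astype(float)   # A_ij = 1 if same school
    B = np.eye(n)                                             # B_ij = 1 if i == j
    beta = np.linalg.solve(X.T @ X, X.T @ y)                  # OLS starting values
    sigma2_s, sigma2_e = 1.0, float(np.var(y - X @ beta))
    for _ in range(max_iter):
        V = sigma2_s * A + sigma2_e * B                       # block diagonal variance matrix
        Vinv = np.linalg.inv(V)
        beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
        r = y - X @ beta                                      # raw residuals
        Rplus = np.outer(r, r)                                # cross-product matrix, E(Rplus) = V
        mask = A == 1                                         # keep elements with non-zero expectation
        Z = np.column_stack([A[mask], B[mask]])               # predictors for sigma2_s and sigma2_e
        new_s, new_e = np.linalg.lstsq(Z, Rplus[mask], rcond=None)[0]
        if abs(new_s - sigma2_s) < tol and abs(new_e - sigma2_e) < tol:
            return beta, new_s, new_e
        sigma2_s, sigma2_e = new_s, new_e
    return beta, sigma2_s, sigma2_e
```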

The IGLS method is explained more fully in Goldstein (1995).

2.4.2 Restricted iterative generalised least squares

The IGLS procedure can produce biased estimates of the random parameters. This is due to the sampling variation of the fixed parameters, which is not accounted for in this method. A modification to IGLS known as restricted iterative generalised least squares (RIGLS, see Goldstein (1989)) can be used to produce unbiased estimates. This modification works by incorporating a bias correction term into the equation when updating the variance parameters. The RIGLS method then gives estimates that are equivalent to restricted maximum likelihood (REML) estimates.

2.4.3 Fitting variance components models to the Blackbridge dataset

The IGLS and RIGLS methods were used to fit the two variance components models described earlier, and the results can be found in Table 2.5. Two important points can be drawn from this table. Firstly, a deficiency in the IGLS method can be seen for model 1: the school level variance parameter, $\sigma^2_s$, has been estimated as negative and has consequently been set to zero. This odd behaviour occurs when a variance parameter is very small compared to its standard error, and sometimes in the iterative procedure the maximum likelihood estimate becomes negative. In this example the behaviour is mainly due to trying to fit a two level model to a dataset with only four schools.

Table 2.5: Parameter estimates for two variance components models using both IGLS and RIGLS for the Borough of Blackbridge dataset

Parameter      IGLS Model 1   RIGLS Model 1   IGLS Model 2   RIGLS Model 2
$\beta_0$      -- (0.68)      30.87 (0.78)    15.05 (3.02)   14.56 (3.15)
$\beta_1$                                     -- (0.11)      0.59 (0.11)
$\sigma^2_s$   -- (0.00)      0.54 (1.70)     0.03 (0.92)    0.54 (1.33)
$\sigma^2_e$   -- (5.46)      32.12 (5.62)    22.58 (3.95)   22.90 (4.01)

Secondly, for both variance components models the variance at school level is very small, and not significantly different from zero. This means that it may be better, in this example with only four schools, to stick to single level modelling and use the combined regression model.

2.4.4 Fitting variance components models to the JSP dataset

The same two variance components models as considered in the previous section were fitted to the whole JSP dataset and the results can be seen in Table 2.6. When all 48 schools are considered, the school level variance $\sigma^2_s$ is significant and so it makes sense to use a two level model. Table 2.7 shows the different estimates obtained for the four Blackbridge schools using ANOVA and RIGLS. As described earlier, the ANOVA estimates are simply the school means, and the RIGLS estimates are shrinkage estimates. This means that the RIGLS estimates are a weighted combination of the mean of all 887 results, 30.57, and the individual school means.

Table 2.6: Parameter estimates for two variance components models using both IGLS and RIGLS for all schools in the JSP dataset

Parameter      IGLS Model 1   RIGLS Model 1   IGLS Model 2   RIGLS Model 2
$\beta_0$      -- (0.40)      30.60 (0.40)    15.15 (0.90)   15.14 (0.90)
$\beta_1$                                     -- (0.03)      0.61 (0.03)
$\sigma^2_s$   -- (1.55)      5.32 (1.59)     4.03 (1.18)    4.16 (1.21)
$\sigma^2_e$   -- (1.92)      39.28 (1.92)    28.13 (1.37)   28.16 (1.37)

Table 2.7: Comparison between fitted values using the ANOVA model and the variance components model 1 using RIGLS (columns: School, ANOVA, RIGLS; rows for Bluebell, Redhill, Greenacres and Greyfriars).
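For intuition about this shrinkage, the standard variance components formula pulls each school mean towards the overall mean with a weight that grows with the school size and the level 2 variance; the short sketch below shows that formula with a made-up school, not the actual JSP values for any particular school.

```python
# Shrinkage of a school mean in a variance components model: a standard formula,
# shown here with illustrative numbers rather than results taken from this thesis.
def shrunken_mean(school_mean, n_j, overall_mean, sigma2_s, sigma2_e):
    # Weight on the school's own mean grows with school size and level 2 variance.
    w = (n_j * sigma2_s) / (n_j * sigma2_s + sigma2_e)
    return w * school_mean + (1.0 - w) * overall_mean

# Example: a school of 25 pupils with raw mean 32, overall mean 30.57,
# and the Table 2.6 model 1 RIGLS variances (5.32 at school level, 39.28 at pupil level).
print(shrunken_mean(32.0, 25, 30.57, 5.32, 39.28))   # pulled part of the way towards 30.57
```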

There are analogous results for the ANCOVA model and the second variance components model. Here the ANCOVA model fits the least squares regression intercepts given the fixed slope, and the RIGLS method will shrink these parallel lines towards a global average line, given the fixed slope.

2.4.5 Random slopes model

The two level models considered so far have all been variance components models, that is, the random variance structure at each level is simply a constant variance. Earlier we considered fitting separate regression lines to each school, and this model can also be expanded into a two level model known as a random slopes regression,

$$M5_{ij} = \beta_0 + \beta_1 M3_{ij} + u_{0j} + u_{1j} M3_{ij} + e_{ij}, \quad u_j = \begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim \mbox{MVN}(0, \Omega_s), \quad e_{ij} \sim N(0, \sigma^2_e).$$

In this notation, the $\beta$s are fixed effects and represent an average regression for all schools. The $u_j$s are the school level residuals and $\Omega_s$ is the school level variance, which is now a matrix as the slopes now also vary at the school level. As the earlier variance components models suggested that the dataset with four schools should not be treated as a two level model, we will only consider fitting the random slopes model to the full JSP dataset with 48 schools, as shown in Table 2.8.

Table 2.8: Parameter estimates for the random slopes model using both IGLS and RIGLS for all schools in the JSP dataset

Parameter        IGLS           RIGLS
$\beta_0$        -- (1.32)      15.03 (1.34)
$\beta_1$        -- (0.04)      0.613 (0.04)
$\Omega_{s00}$   -- (16.37)     47.04 (16.75)
$\Omega_{s01}$   -1.23 (0.52)   -1.30 (0.53)
$\Omega_{s11}$   -- (0.017)     0.036 (0.017)
$\sigma^2_e$     -- (1.34)      26.96 (1.34)
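To make the random slopes structure concrete, the sketch below simulates data from this model; the parameter values are loosely based on the RIGLS column of Table 2.8, and the design (48 schools of 18 pupils each) is an invented approximation of the JSP layout rather than the real unbalanced design.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed effects and (co)variances, loosely based on Table 2.8.
beta0, beta1 = 15.0, 0.61
omega_s = np.array([[47.0, -1.2],
                    [-1.2, 0.036]])   # school level variance matrix
sigma2_e = 27.0                       # pupil level variance

n_schools, pupils_per_school = 48, 18
u = rng.multivariate_normal(np.zeros(2), omega_s, size=n_schools)   # (u_0j, u_1j) per school

schools = []
for j in range(n_schools):
    m3 = rng.uniform(5, 40, size=pupils_per_school)                 # year 3 marks
    e = rng.normal(0.0, np.sqrt(sigma2_e), size=pupils_per_school)
    m5 = beta0 + beta1 * m3 + u[j, 0] + u[j, 1] * m3 + e            # random slopes model
    schools.append((m3, m5))
# Each school now has its own intercept beta0 + u_0j and its own slope beta1 + u_1j.
```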

Table 2.9 gives as a comparison the regression lines fitted using separate regressions and the random slopes two level model, for the four schools in Blackbridge. Here shrinkage can be seen towards the average line $M5 = 15.03 + 0.613\,M3$.

Table 2.9: Comparison between fitted regression lines produced by separate regressions and the random slopes model

School Name   Separate Regression    Random Slopes Regression
Bluebell      $22.44 + 0.317\,M3$    $17.16 + 0.548\,M3$
Redhill       $15.96 + 0.575\,M3$    $15.06 + 0.610\,M3$
Greenacres    $-7.40 + 1.244\,M3$    $5.28 + 0.868\,M3$
Greyfriars    $12.52 + 0.665\,M3$    $12.56 + 0.678\,M3$

Both the variance components model and the random slopes regression will be discussed in greater detail in Chapter 4.

2.5 Fitting models to pass/fail data

Often when considering exam datasets the main objective is to classify students into grade bands, or at the simplest level, to pass or fail pupils. In this thesis I will only be studying the simpler case where the response variable can be thought of as a 0 or a 1, depending on whether the event of interest occurs or not. The interest now lies in estimating the probability of the response variable being a 1, rather than the value of the response variable, as this is constrained to be either 0 or 1. The normal linear models used in the earlier sections can still be used when the response is binary, but they assume that the response variable can take any real values, and so this leads to probability estimates that lie outside [0, 1]. The more common approach to fitting binary data is to assume that the response variable has a binomial distribution. Then the model produced can be fitted by the technique of generalized linear models, also described in McCullagh and Nelder (1983). A link function, $g()$, is required to transform the response variable from the [0, 1] scale to the whole real line. I will only consider the logistic link, $\log(p_{ij}/(1-p_{ij}))$, as it is the most frequently used link function for binary data, although there are other alternatives.

The model then becomes $g(p) = X\beta$, where $X\beta$ is the linear predictor. When a binary response model is fitted using the logistic link, the technique is known as logistic regression. Although the response variable is a binary outcome, the predictor variables can still be continuous variables. For example, if a student is due to take an exam that will be externally examined, the result he/she obtains will generally be a grade and the exact mark will not be given. However he/she will typically be given a mock exam at some time before the exam, and the exact mark of this mock will be known. I will try and mimic this scenario by using the 4 schools considered in the earlier examples and assume that the predictor M3 score is known. However the response of interest, M5, will now be converted to a pass/fail response, Mp5, depending on whether the student gets at least 30 out of 40 in the test. Considering the 69 students in the 4 schools, 46 of them got at least 30 on the year 5 test, so two-thirds of the students actually pass. Interest will now lie in whether the school of each pupil has an effect and whether the `mock' exam mark, M3, is a good indicator of whether a student will pass or fail. I will consider the following two models that are analogous to the ANOVA and ANCOVA models for continuous responses:

$$Mp5_{ij} \sim \mbox{Bernoulli}(p_{ij})$$
$$\log(p_{ij}/(1-p_{ij})) = \beta_0 + SCHOOL_j \quad (2.1)$$
$$\log(p_{ij}/(1-p_{ij})) = \beta_0 + \beta_1 M3_{ij} + SCHOOL_j \quad (2.2)$$

The estimates obtained by fitting these models are given in Table 2.10. The parameter estimates obtained can be transformed back to give estimates for the $p_{ij}$,

$$\hat{p}_{ij} = \frac{\exp(X_{ij}\hat{\beta})}{1 + \exp(X_{ij}\hat{\beta})}.$$

Looking more closely at model 2.1 and using the transform, the estimated probability of passing for a pupil in school i is equal to the proportion of pupils passing in school i.
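A small sketch of this back-transformation: the inverse logit turns any fitted linear predictor into a probability. The coefficient values below are placeholders chosen for illustration, not the Table 2.10 estimates.

```python
import numpy as np

def inv_logit(eta):
    """Map a linear predictor back to a probability: exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical model 2.2-style coefficients: intercept, M3 slope, school effect.
beta0, beta1, school_effect = -4.0, 0.19, 0.3
m3 = 30.0
eta = beta0 + beta1 * m3 + school_effect   # linear predictor X * beta
print(inv_logit(eta))                      # estimated probability of passing
```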

If we consider once again John Smith, who got 30 in his year 3 maths exam, then using model 2.2 we see that a pupil from Redhill school with 30 in the M3 exam has an estimated 85.8% chance of passing. In fact, John Smith just scraped a pass, scoring 30 in year 5.

Table 2.10: Parameter estimates for the two logistic regression models fitted to the Blackbridge dataset (columns: Parameter, Model 2.1, Model 2.2; rows for the intercept, the M3 slope and the school effects).

2.5.1 Extending to multi-level modelling

The above models that have been fitted using logistic regression techniques are single level models, but in an analogous way to the earlier Gaussian models these models can be extended to two or more levels. As with the Gaussian models there are iterative procedures to fit the multi-level logistic models, and MLn uses two such methods. Both techniques are quasi-likelihood based methods which can be used on all non-linear multi-level models. They use a Taylor series expansion to linearize the model. The first technique is Marginal Quasi-likelihood (MQL), proposed by Goldstein (1991). MQL tends to underestimate fixed and random parameters, particularly with small datasets. The second technique is Predictive or Penalised Quasi-likelihood (PQL), proposed by Laird (1978) and also Stiratelli, Laird, and Ware (1984). PQL is more accurate than MQL but is not guaranteed to converge. The distinction between the methods is that when forming the Taylor expansion, in order to linearize the model, the higher level residuals are added to the linear component of the nonlinear function in PQL, and not in MQL. The order of the estimation procedure is the order of the Taylor series expansion. I will now consider fitting several two level logistic regression models to the JSP dataset with 48 schools, with the response variable being a pass/fail indicator for year 5 scores as before. The models are described below and the parameter estimates obtained are given in Table 2.11.

$$Mp5_{ij} \sim \mbox{Bernoulli}(p_{ij}), \quad \log(p_{ij}/(1-p_{ij})) = \beta_0 + SCHOOL_j, \quad SCHOOL_j \sim N(0, \sigma^2_s) \quad (2.3)$$

$$Mp5_{ij} \sim \mbox{Bernoulli}(p_{ij}), \quad \log(p_{ij}/(1-p_{ij})) = \beta_0 + \beta_1 M3_{ij} + SCHOOL_j, \quad SCHOOL_j \sim N(0, \sigma^2_s) \quad (2.4)$$

Table 2.11: Parameter estimates for the two-level logistic regression models fitted to the JSP dataset

Parameter      Model 2.3 (MQL1)   Model 2.3 (PQL2)   Model 2.4 (MQL1)   Model 2.4 (PQL2)
$\beta_0$      -- (0.120)         0.769 (0.136)      -3.703 (0.420)     -4.262 (0.464)
$\beta_1$                                            -- (0.016)         0.205 (0.018)
$\sigma^2_s$   -- (0.138)         0.570 (0.179)      0.633 (0.198)      0.874 (0.258)

From Table 2.11 we can see that the estimates using PQL are larger, both for the random and fixed parameters. There is a significant positive effect of M3 score on the probability of passing, and there is significant variability between the different schools using both MQL and PQL.

2.6 Summary

The multi-level logistic regression model ends our discussion of models that can be fitted to the JSP dataset. I will return to problems involving binary data in Chapter 6, where the MQL and PQL methods described here will be compared to MCMC alternatives. In this chapter I have highlighted some examples of simple models that arise when using data from education. I have then shown how the existing methods in the MLn package can fit such models.

In the next chapter I will introduce MCMC methods, which will then be used to fit the types of models introduced here in later chapters.

Chapter 3

Markov Chain Monte Carlo Methods

In the previous chapter I concentrated on maximum likelihood based techniques for fitting multi-level models. The field of Bayesian statistics has grown in importance as computer power has increased, and techniques that would previously have been impossible to implement can now be performed efficiently. MCMC methods are one group of Bayesian methods that can be used to fit multi-level models. In this chapter I will describe the various MCMC methods in common usage today and how they work. I will explain how to use the chains that the methods produce to get answers to problems and how to tell when a chain has reached its equilibrium distribution. I will then illustrate all these points in a simple example. I will begin by explaining why such methods are used.

3.1 Background

Consider a sequence of n observations, $y_i$, that have been generated from a normal distribution with unknown mean and variance, so that $y_i \sim N(\mu, \sigma^2)$. Then $\mu$ and $\sigma^2$ have standard (unbiased) estimates given the observations $y_i$,

$$\hat{\mu} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \quad \mbox{and} \quad \hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}.$$

Consider instead the situation in reverse: if $\mu$ and $\sigma^2$ were known, I could generate a sample of observations from the normal distribution by simulation, see for example Box and Muller (1958). Then if I drew a large enough sample, the mean and variance of the sample should be approximately equal to the mean and variance of the underlying distribution. Similarly, consider a gamma distribution with parameters $\alpha$ and $\beta$. If, after finding a suitable simulation algorithm (see Ripley (1987)), a large sample from the gamma distribution is drawn, then it can be verified that the mean of the distribution is $\alpha/\beta$ and the variance is $\alpha/\beta^2$. Both of these examples are trivial, but given any parameter of interest, if I can simulate from its distribution for long enough I can calculate estimates for the parameter and any functionals of the parameter. Multi-level models are much more complicated than these two examples, and it is rare that samples from a parameter's distribution can be obtained directly from simulation. The difference between multi-level models and our simple examples is that the parameter of interest will depend on several other parameters with unknown values. Bayesian estimation methods involve integrating over these other parameters, but this becomes infeasible as the model becomes more complicated. The methods detailed in the last chapter involve the use of an iterative procedure that leads to an approximation to the actual parameter value. The methods in this chapter involve the generation, via simulation, of Markov chains that will, given time, converge to the posterior distribution of the parameter of interest. Before going on to describe the various MCMC techniques I first need to cover some of the basic ideas of Bayesian inference.
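Before moving on, here is a minimal sketch of the simulation idea described above, with arbitrary parameter values chosen purely for illustration: drawing a large sample and summarising it recovers the known mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal example: known mean and variance, recovered from a large simulated sample.
mu, sigma2 = 3.0, 4.0
y = rng.normal(mu, np.sqrt(sigma2), size=100_000)
print(y.mean(), y.var(ddof=1))          # close to 3.0 and 4.0

# Gamma example with shape alpha and rate beta: mean alpha/beta, variance alpha/beta^2.
alpha, beta = 2.5, 0.5
g = rng.gamma(shape=alpha, scale=1.0 / beta, size=100_000)
print(g.mean(), g.var(ddof=1))          # close to 5.0 and 10.0
```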

3.2 Bayesian inference

In frequentist inference the data, whose distribution across hypothetical repetitions of data-gathering is assumed to depend on a parameter vector $\theta$, are regarded as random, with $\theta$ fixed. In Bayesian inference the data are regarded as fixed (at their observed values) and $\theta$ is treated as random as a means of quantifying uncertainty about it. In this formulation $\theta$ possesses a prior distribution and a posterior distribution linked by Bayes' theorem. The posterior distribution of $\theta$ is defined from Bayes' theorem as

$$p(\theta \mid \mbox{data}) \propto p(\mbox{data} \mid \theta)\, p(\theta).$$

Here $p(\theta)$ is the prior distribution for the parameter vector $\theta$ and should represent all knowledge we have about $\theta$ prior to obtaining the data. Prior distributions can be split into two types: informative prior distributions, which contain information that has been obtained from previous experiments, and "non-informative" or diffuse priors, which aim to express that we have little or no prior knowledge about the parameter. In frequentist inference prior distributions are not used in fitting models, and so "non-informative" priors are widely used in Bayesian inference to compare with the frequentist procedures. The posterior distribution for $\theta$ is therefore proportional to the likelihood, $p(\mbox{data} \mid \theta)$, multiplied by the prior distribution. The proportionality constant is such that the posterior distribution is a valid probability distribution. One principal difficulty in Bayesian inference problems is calculating the proportionality constant, as this involves integration and does not always produce a posterior distribution that can be written in closed form. Also, finding marginal, posterior and predictive distributions involves (high-dimensional) integrations, which MCMC methods can instead perform by simulation. How do MCMC methods fit in? MCMC methods generate samples from Markov chains which converge to the posterior distribution of $\theta$, without having to calculate the constant of proportionality. From these samples, summary statistics of the posterior distribution can be calculated.
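For a single parameter the proportionality constant can be approximated directly on a grid, which makes the roles of prior and likelihood concrete; the sketch below does this for a normal mean with known variance and invented data (in higher dimensions this brute-force normalisation is exactly what becomes infeasible, which is what motivates MCMC).

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(3.0, 2.0, size=20)                    # invented data, sigma known (= 2)

theta = np.linspace(-5, 10, 2001)                    # grid of candidate values for the mean
log_prior = -0.5 * (theta / 10.0) ** 2               # N(0, 10^2) prior, up to a constant
log_lik = np.array([np.sum(-0.5 * ((y - t) / 2.0) ** 2) for t in theta])

log_post = log_prior + log_lik                       # posterior proportional to prior x likelihood
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, theta)                        # normalise numerically on the grid

print(np.trapz(theta * post, theta))                 # posterior mean, close to the sample mean here
```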

of all other previous values of $\theta$, to obey the Markov property. The method works by generating new values at each time step from the current proposal distribution, but only accepting the values if they pass a criterion. In this way the estimates of $\theta$ are improved at each time step and the Markov chain reaches its equilibrium or stationary distribution, which is the posterior distribution of interest by construction.

The Metropolis algorithm for an unknown parameter $\theta$ is as follows. Select a starting value $\theta_0$ for $\theta$ which is feasible. For each time step $t$ sample a point $\theta^*$ from the current proposal distribution $p_t(\theta^* \mid \theta_{t-1})$. The proposal distribution must be symmetric in $\theta^*$ and $\theta_{t-1}$, that is $p_t(\theta^* \mid \theta_{t-1}) = p_t(\theta_{t-1} \mid \theta^*)$ for all $t$. Let $r_t = p(\theta^* \mid y)/p(\theta_{t-1} \mid y)$ be the posterior ratio and $a_t = \min(1, r_t)$ be the acceptance probability. Accept the new value, $\theta_t = \theta^*$, with probability $a_t$; otherwise let $\theta_t = \theta_{t-1}$.

In multi-level models there are many parameters of interest and the above algorithm can be used in several ways. Firstly, $\theta$ could be considered as a vector containing all the parameters of interest and a multivariate proposal distribution could be used. Secondly, the above algorithm could be used separately for each unknown parameter, $\theta_i$. If this is done, it is generally done sequentially, that is, at step $t$ generate a new $\theta_{1t}$, then a new $\theta_{2t}$ and so on until all parameters have been updated, then continue with step $t+1$. Thirdly, a combination method, where parameters are updated in suitable blocks, some multivariately and some univariately, could be used.
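As an illustration only (this is not code from the thesis or from MLwiN), the following is a minimal sketch in Python of the single-parameter random-walk Metropolis algorithm just described. The function name log_post stands for any routine returning the log of an unnormalised posterior, and prop_sd is the proposal standard deviation discussed in the next subsection; both names are hypothetical.

```python
import numpy as np

def metropolis(log_post, theta0, prop_sd, n_iter, rng=None):
    """Random-walk Metropolis sampler for a single parameter theta."""
    rng = np.random.default_rng() if rng is None else rng
    chain = np.empty(n_iter)
    theta, lp = theta0, log_post(theta0)
    accepted = 0
    for t in range(n_iter):
        # propose from a symmetric normal centred at the current value
        theta_star = theta + prop_sd * rng.standard_normal()
        lp_star = log_post(theta_star)
        # accept with probability min(1, r_t), r_t being the posterior ratio
        if np.log(rng.uniform()) < lp_star - lp:
            theta, lp = theta_star, lp_star
            accepted += 1
        chain[t] = theta
    return chain, accepted / n_iter

# Example: a N(3, 2^2) target, supplied as an unnormalised log density.
chain, rate = metropolis(lambda th: -0.5 * ((th - 3.0) / 2.0) ** 2,
                         theta0=0.0, prop_sd=2.0, n_iter=5000)
```

Working on the log scale in the acceptance step avoids numerical underflow when the posterior ratio is very small.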

3.3.1 Proposal distributions

To perform Metropolis sampling, symmetric proposal distributions must be chosen for all parameters. There are at least two distinct types of proposal distributions that can be used.

Firstly, the simplest type of proposal is the independence proposal. This proposal generates a new value from the same proposal distribution regardless of the current value. If a parameter is restricted to a range of values, for example a correlation parameter must lie in the range $[-1, 1]$, then an independence proposal could consist of generating a new value from a uniform $[-1, 1]$ distribution. Independence proposals are somewhat limited in that, if parameters are defined on $\Re$, then it is difficult, if not impossible, to find a proposal distribution that will sample from the whole range of the parameter in an effective manner.

A second type of proposal that is popular is the random-walk proposal. Here $p_t(\theta^* \mid \theta_{t-1}) = p_t(\theta^* - \theta_{t-1})$, so the proposal is centred at the current value of the parameter, $\theta_{t-1}$. Common examples are both uniform and normal distributions centred at the current parameter value. Both these proposal distributions will then have a free parameter: in the case of the uniform, the width of the interval, and in the normal, the variance of the distribution. The values given to these parameters will affect how well the simulation performs. If the variance parameter is too small then the sampler will end up making lots of little jumps and will take a long time to reach all parts of the sample space. If the variance is too big there will be a lower acceptance rate and the sampler will end up staying at particular parameter values for long periods, and again the chain will take a long time to give good estimates. Convergence rates will be dealt with later in this chapter. The common method used for selecting parameter values for proposal distributions is to try several values for the variance until the chain "mixes well". Adaptive methods which modify the proposal distribution to improve convergence are also discussed in later chapters.

3.4 Metropolis-Hastings sampling

Hastings (1970) generalized the Metropolis algorithm to allow proposal distributions that are not symmetric. To correct for this, the ratio of the posterior probabilities $r_t$ is replaced by a ratio of importance ratios:
$$r_t = \frac{p(\theta^* \mid y)/p_t(\theta^* \mid \theta_{t-1})}{p(\theta_{t-1} \mid y)/p_t(\theta_{t-1} \mid \theta^*)}.$$
The Metropolis algorithm is just a special case of this algorithm where $p_t(\theta^* \mid \theta_{t-1}) = p_t(\theta_{t-1} \mid \theta^*)$ and these terms cancel out. In later chapters it will be seen that it is not always easy to find a symmetric proposal distribution for parameters with restricted ranges, for example variances.

Also, asymmetric proposal distributions may sometimes assist in increasing the rate of convergence. As with the Metropolis algorithm, the Metropolis-Hastings algorithm has proposal distributions with parameters that can be modified to speed up convergence.

3.5 Gibbs sampling

The Gibbs sampler is a special case of the Metropolis-Hastings algorithm. Geman and Geman (1984) used an approach in their work on image analysis based on the Gibbs distribution; they consequently named this method Gibbs sampling. Gelfand and Smith (1990) applied the Gibbs sampler to several statistical problems, bringing it to the attention of the statistical community.

The Gibbs sampler is best applied on problems where the marginal distributions of the parameters of interest are difficult to calculate, but the conditional distributions of each parameter given all the other parameters and the data have nice forms. For example, suppose the marginal posterior $p(\theta \mid y)$ cannot be obtained from the joint posterior $p(\theta, z \mid y)$ analytically. However, suppose that the conditional posteriors $p(\theta \mid y, z)$ and $p(z \mid y, \theta)$ have forms that are known and are easy to sample from, for example normal or gamma distributions. Gibbs sampling can then be used to sample indirectly from the marginal posterior.

The Gibbs sampler works on the above problem as follows: firstly choose a starting value for $z$, say $z^{(0)}$, and then generate via random sampling a single value, $\theta^{(1)}$, from the conditional distribution $p(\theta \mid y, z = z^{(0)})$. Next generate $z^{(1)}$ from the conditional distribution $p(z \mid y, \theta = \theta^{(1)})$. Then start cycling through the algorithm generating $\theta^{(2)}$ and $z^{(2)}$ and so on. If the conditional distributions of the parameters have standard forms, then they can be simulated from easily. If this is not the case and a conditional distribution does not have a standard form, then a different method must be used. Two such methods will now be described.

3.5.1 Rejection sampling

Rejection sampling is described in Ripley (1987). It is used when the distribution of interest $f(x)$ cannot be easily sampled from, but there exists a distribution $g(x)$ such that $f(x) < Mg(x)\ \forall x$, where $M$ is a positive number, and $g(x)$ can be sampled from without difficulty. $g(x)$ can be thought of as an envelope function that completely bounds the required distribution from above. To generate a value from $f(x)$, use the following algorithm (a short sketch of this basic algorithm is given after the next subsection):

Repeat
    Generate $Y$ from $g(y)$,
    Generate $U$ from U(0, 1),
until $U < f(Y)/(M\,g(Y))$.
Return $X = Y$.

The efficiency of this method depends on the enveloping function $g(x)$. There are two major aims: firstly to find a function that satisfies $f(x) < Mg(x)\ \forall x$, and secondly to find a $g(x)$ similar enough to $f(x)$ to have a high acceptance rate. The second method, which will now be discussed briefly, tries to automatically satisfy both these aims.

3.5.2 Adaptive rejection sampling

Adaptive rejection sampling (Gilks and Wild 1992) works when the conditional distribution of interest is log concave. It starts by considering a small number of points from the distribution of interest $f$ and evaluating the tangents to $\log f$ at these points. Joining up these tangents will construct an envelope function, $g$, for $f$. Then proceed as in rejection sampling, except that when a point $x_g$ is chosen from $g$, as well as evaluating $f(x_g)$, also evaluate the tangent to $\log f(x_g)$ and modify the envelope accordingly. Then, as more points are sampled, $g(x)$ becomes more and more like $f(x)$; hence the rejection sampling is adaptive.
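The following minimal Python sketch (illustrative only, not the BUGS or MLwiN implementation) shows the basic, non-adaptive rejection sampler of Section 3.5.1. The target f, envelope g and bound M in the example are arbitrary choices made here to demonstrate the algorithm.

```python
import numpy as np

def rejection_sample(f, g_pdf, g_sample, M, n, rng=None):
    """Draw n values from f using the envelope M * g."""
    rng = np.random.default_rng() if rng is None else rng
    out = []
    while len(out) < n:
        y = g_sample(rng)                  # Generate Y from g
        u = rng.uniform()                  # Generate U from U(0, 1)
        if u < f(y) / (M * g_pdf(y)):      # accept when U < f(Y)/(M g(Y))
            out.append(y)                  # Return X = Y
    return np.array(out)

# Example: half-normal target with an Exp(1) envelope.
f = lambda x: np.sqrt(2.0 / np.pi) * np.exp(-0.5 * x * x)   # density on [0, inf)
g_pdf = lambda x: np.exp(-x)
g_sample = lambda rng: rng.exponential(1.0)
M = np.sqrt(2.0 * np.e / np.pi)                              # sup of f/g
draws = rejection_sample(f, g_pdf, g_sample, M, n=1000)
```

The closer M g(x) hugs f(x), the higher the acceptance rate, which is exactly the motivation for the adaptive version above.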

3.5.3 Gibbs sampler as a special case of the Metropolis-Hastings algorithm

Looking at the Gibbs sampling algorithm as written above, it is not immediately obvious that the Gibbs sampler is a special case of the Metropolis-Hastings algorithm. If I consider the proposal distribution for a particular member of $\theta$, $\theta_{(i)}$, as
$$p_t(\theta^* \mid \theta_{t-1}) = \begin{cases} p(\theta^*_{(i)} \mid \theta_{t-1(-i)}, y) & \text{if } \theta^*_{(-i)} = \theta_{t-1(-i)}, \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta_{(-i)}$ is the vector $\theta$ with element $i$ removed, then in other words the only possible proposals involve holding all components of $\theta$ constant except the $i$th. Then the ratio $r_t$ of importance ratios is
$$r_t = \frac{p(\theta^* \mid y)/p_t(\theta^* \mid \theta_{t-1})}{p(\theta_{t-1} \mid y)/p_t(\theta_{t-1} \mid \theta^*)} = \frac{p(\theta^* \mid y)/p(\theta^*_{(i)} \mid \theta_{t-1(-i)}, y)}{p(\theta_{t-1} \mid y)/p(\theta_{t-1(i)} \mid \theta_{t-1(-i)}, y)} = \frac{p(\theta_{t-1(-i)} \mid y)}{p(\theta_{t-1(-i)} \mid y)} = 1,$$
and all proposals are accepted. This is the same Gibbs sampler algorithm as described earlier, but this time written in the Metropolis-Hastings format.

3.6 Data summaries

Once a Markov chain has been run, the outputs produced are sequences of values, one sequence for each parameter, assumed to be from the desired joint posterior distribution. Each individual sequence can be thought of as a sample from the marginal distribution of the individual parameter. From each sequence of values we hope to describe the parameter it represents via summary statistics.

In the last chapter I reviewed the IGLS and RIGLS methods for multi-level modelling. For each parameter of interest $\theta$, these methods calculated a maximum likelihood based estimate $\hat\theta$ for $\theta$ and a standard error for this estimate. If confidence intervals are required, then if you are prepared to assume the parameter is normally distributed, you can generate central 95% confidence intervals for $\theta$,
$$\left(\hat\theta - 1.96\,SE(\hat\theta),\ \hat\theta + 1.96\,SE(\hat\theta)\right).$$
Markov chains can also be used to calculate these same summary statistics, but they can also produce other summaries for the parameter. We will now describe how to

calculate the various summary statistics from Markov chains.

3.6.1 Measures of location

There are three main estimates that can be used for the parameter of interest, $\theta$.

1. Sample mean. If I consider the chain values as a sample from the posterior distribution of $\theta$, then I can calculate their mean in the usual way:
$$\hat\theta = \frac{1}{N}\sum_{i=1}^{N}\theta_i.$$

2. Sample median. The median can be found by finding the $\frac{N+1}{2}$th sorted chain value. Computationally it is quicker to calculate the median via a `binary chop' algorithm rather than actually sorting the chain. The `binary chop' algorithm consists of taking the first (unsorted) chain value and dividing the other values into two groups depending on whether they are larger or smaller than the first value. Then, depending on the number of values bigger than this first value, the median will be in one of the two groups. Discard the group that does not contain the median and repeat the procedure on the other group. Repeat this procedure recursively until the median is found. This is an $N\log_2 N$ algorithm, as opposed to the $N^2$ algorithm that a simple sort would be.

3. Sample mode. This statistic is equivalent to the estimate given by the IGLS and RIGLS methods when the prior distribution is flat, and is also known as the maximum likelihood estimate (MLE) in that case. It is not calculated directly from the Markov chain but is instead calculated from the kernel density plot described later.

3.6.2 Measures of spread

There are two main groups of summary statistics for the spread of a set of data.

1. Variance and standard deviation. The variance and the standard deviation of the data are both summary statistics associated with the mean. In a similar way to the mean formula, consider the chain values as a sample from the posterior distribution of $\theta$; then the variance has the usual formula
$$\widehat{\mathrm{var}}(\theta) = \frac{1}{N-1}\left(\sum_{i=1}^{N}\theta_i^2 - \frac{\left(\sum_{i=1}^{N}\theta_i\right)^2}{N}\right).$$
The standard deviation is the square root of the variance.

2. Quantile based estimates. There are several measures of spread that are calculated from the quantiles of a distribution. In Bayesian statistics confidence intervals are replaced by credible intervals, which have a different interpretation to the frequentist confidence interval. A frequentist $100(1-\alpha)\%$ confidence interval for $\theta$ is defined as an interval calculated from the data such that $100(1-\alpha)\%$ of such intervals contain $\theta$. In Bayesian statistics the data is thought of as fixed and the parameter, $\theta$, variable, and so a $100(1-\alpha)\%$ credible interval, $C$, is such that $\int_C p(\theta \mid \text{data})\,d\theta = 1-\alpha$ (Bernardo and Smith 1994). The quantiles are used to produce credible intervals; for example a 95% central Bayesian credible interval is $(Q_{0.025}, Q_{0.975})$, where $Q_i$ is the $i$th quantile. The interquartile range $= Q_{0.75} - Q_{0.25}$ can also be calculated from the quantiles and is an alternative measure of spread. The `binary chop' algorithm used for the median can also be used to calculate the quantiles rather than sorting.
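A minimal illustrative Python sketch of these location and spread summaries is given below; it is not the MLwiN code, and it uses numpy's partition-based selection in place of the `binary chop' routine. The chain name mu1_chain in the usage comment is hypothetical.

```python
import numpy as np

def chain_summaries(chain, alpha=0.05):
    """Location, spread and a central credible interval from one chain."""
    chain = np.asarray(chain, dtype=float)
    n = chain.size
    mean = chain.sum() / n
    sd = np.sqrt((chain ** 2).sum() - chain.sum() ** 2 / n) / np.sqrt(n - 1)
    # median via a selection algorithm rather than a full sort
    # (for even n this returns the upper of the two middle values)
    median = np.partition(chain, n // 2)[n // 2]
    lower, upper = np.quantile(chain, [alpha / 2, 1 - alpha / 2])
    iqr = np.quantile(chain, 0.75) - np.quantile(chain, 0.25)
    return {"mean": mean, "sd": sd, "median": median,
            "credible interval": (lower, upper), "IQR": iqr}

# Example use on a stored chain:
# print(chain_summaries(mu1_chain))
```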

3.6.3 Plots

Given that the sequence of values obtained for a parameter $\theta$ can be thought of as a sample of $n$ points from the marginal posterior distribution of $\theta$, I can use plots to show the shape of this distribution.

The simplest density plot is the histogram. Given the range of the parameter values in the sequence, the range can be split into $M$ contiguous intervals, not necessarily of the same length, which are commonly known as bins. The numbers of values that fall in each bin are then counted and the histogram estimate at a point $\theta$ is defined by
$$\hat p(\theta) = \frac{\text{number of } \theta_i \text{ in the same bin as } \theta}{n \times \text{width of the bin containing } \theta}.$$
An example of a histogram for the parameter $\mu_1$ from the example later in this chapter can be seen in Figure 3-1. Histograms give a rather `blocky' approximation to the posterior distribution of interest. The approximation is improved, up to a point, by increasing the number of bins, $M$, but this also depends on the number of points $n$ being large.

Figure 3-1: Histogram of $\mu_1$ using the Gibbs sampling method.

The kernel density estimator improves on the histogram by giving a smoother estimate of the posterior distribution.

The histogram can be thought of as taking each point $\theta_i$ and spreading its contribution to the posterior distribution uniformly over the bin containing $\theta_i$. The kernel estimator, on the other hand, spreads the contribution of each point conditional on a kernel function $K$ around the point, where $K$ satisfies
$$\int_{-\infty}^{\infty} K(\theta)\,d\theta = 1.$$
The kernel estimator with kernel $K$ can then be defined by
$$\hat p(\theta) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{\theta - \theta_i}{h}\right),$$
where $h$ is a parameter known as the window width, which governs the smoothness of the estimate. For a more detailed description of choosing the kernel function $K$ and the window width see Silverman (1986). An example of a kernel density plot for the same data as the earlier histogram can be seen in Figure 3-2.
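The following is a minimal illustrative Python sketch of this kernel density estimator with a Gaussian kernel; the window width h must be supplied by the user, and the chain and grid names in the usage comment are hypothetical.

```python
import numpy as np

def kernel_density(chain, grid, h):
    """Gaussian-kernel density estimate of a posterior from chain values.

    chain : sampled values theta_1, ..., theta_n
    grid  : points at which to evaluate the estimate
    h     : window width governing the smoothness
    """
    chain = np.asarray(chain, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = chain.size
    # evaluate K((theta - theta_i)/h) for every grid point and chain value
    z = (grid[:, None] - chain[None, :]) / h
    K = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (n * h)

# Example: evaluate the estimate for mu_1 on an equally spaced grid.
# grid = np.linspace(mu1_chain.min(), mu1_chain.max(), 200)
# density = kernel_density(mu1_chain, grid, h=0.05)
```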

Figure 3-2: Kernel density plot of $\mu_1$ using the Gibbs sampling method and a Gaussian kernel with a large value of the window width $h$.

3.7 Convergence issues

The maximum likelihood based estimation procedures described in the last chapter are iterative routines and consequently converge to an answer. The convergence depends on a tolerance factor, that is, how different the current estimate is from the last estimate. Here convergence is easy to establish. The convergence of a Markov chain is different from the convergence of these techniques. In Markov chain methods we are not interested in convergence to an estimate but instead in convergence to a distribution, namely the joint posterior distribution of interest. There are many points to be addressed when considering convergence to a distribution. Firstly, when has the chain moved away from its starting value and started sampling from its stationary distribution? Secondly, how large a sample is required to give estimates to a given accuracy? And finally, is the stationary distribution the required posterior distribution?

3.7.1 Length of burn-in

It is usual in Markov chains to ignore the first $B$ values while the chain converges to the posterior distribution. These $B$ values are known as the `burn-in' period and there are many methods to estimate $B$. The easiest method is to look at a trace for each parameter of interest. If a parameter $\theta$ is considered, then when convergence has been attained at $B$ the observations $\theta_i$, $i > B$, should all come from the same distribution. An equivalent approach is to consider the trace of the mean of the parameter of interest against time. This trace should become approximately constant when convergence has been reached. Examples of both of these traces can be seen in Figure 3-3, where convergence is reached after about 50 iterations. The upper solid line in the bottom graph is the running mean after discarding the first 50 iterations.

There are many convergence diagnostics that can be used to estimate whether a chain has converged. The Raftery and Lewis diagnostic (Raftery and Lewis 1992) can also be used to estimate the chain length required for a given estimator accuracy and is mentioned later. The Gelman and Rubin diagnostic (Gelman and Rubin 1992) uses multiple chains and will be described when I consider multi-modal models.

Geweke diagnostic

Geweke (1992) assumes that a burn-in of length $B$ has been chosen and these $B$ iterations have been discarded. The method has its origin in the field of spectral analysis and compares the trace of $\theta$ over two distinct parts, the first $n_A$ and the last $n_B$ iterations, typically the first tenth and the last half of the data. The following statistic,
$$\frac{\bar\theta_A - \bar\theta_B}{\left(n_A^{-1}\hat S^{(0)}_A + n_B^{-1}\hat S^{(0)}_B\right)^{1/2}},$$
tends to a standard normal distribution as $n \to \infty$ if the chain has converged. Here $\bar\theta$ is the sample mean of $\theta$ over the relevant part of the chain and $\hat S^{(0)}$ is the corresponding consistent spectral density estimate. If the above statistic gives large absolute values for a chain, then convergence has not occurred.

In the models I am considering in this thesis, convergence to the stationary distribution is quick when using the IGLS or RIGLS starting values, and an arbitrary `burn-in' period will be used, for example 500 iterations. If the chain has not converged by this point I can observe this from its trace and amend the `burn-in' accordingly.
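A minimal illustrative Python sketch of the Geweke statistic is given below. For simplicity the spectral density estimates $\hat S^{(0)}$ are replaced by plain sample variances, which is only adequate when the two chain segments are close to uncorrelated; a proper spectral estimator would be used in practice.

```python
import numpy as np

def geweke(chain, first=0.1, last=0.5):
    """Geweke convergence statistic for a single (post burn-in) chain.

    Compares the means of the first `first` and last `last` fractions of
    the chain.  Sample variances stand in for the spectral estimates S(0),
    so the result is only a rough guide for autocorrelated chains.
    """
    chain = np.asarray(chain, dtype=float)
    n = chain.size
    a = chain[: int(first * n)]
    b = chain[-int(last * n):]
    s_a = a.var(ddof=1)    # stands in for S_A(0)
    s_b = b.var(ddof=1)    # stands in for S_B(0)
    return (a.mean() - b.mean()) / np.sqrt(s_a / a.size + s_b / b.size)

# Values far outside (-2, 2) suggest the chain has not yet converged.
```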

Figure 3-3: Traces of parameter $\mu_1$ and the running mean of $\mu_1$ for a Metropolis run that converges after about 50 iterations. The upper solid line in the lower panel is the running mean with the first 50 iterations discarded.

3.7.2 Mixing properties of Markov chains

After a Markov chain has converged, the next consideration is how long to run the chain to get accurate enough estimates. For some samplers, such as the independence sampler, it is possible to calculate the number of iterations required to calculate particular summary statistics to a given accuracy. This is because the independence sampler by definition should give uncorrelated values.

Auto-correlation is an important issue when considering the chain length, as a chain that is mixing badly, that is, has a high auto-correlation, will need to be run for longer to give estimates to the required accuracy. Two useful plots that come from the time series literature (Chatfield 1989) are the autocorrelation function (ACF) and the partial autocorrelation function (PACF). The ACF is defined by
$$\rho(\tau) = \frac{\mathrm{Cov}[\theta(t), \theta(t+\tau)]}{\mathrm{Var}(\theta(t))}$$
and describes correlations between the chain itself and a chain produced by moving the start of the chain forward $\tau$ iterations. The chain is mixing well if these values are all small.

The $p$th partial autocorrelation (PAC) is the excess auto-correlation at lag $p$ when fitting an AR($p$) process that is not accounted for by an AR($p-1$) model. The first PAC will be equal to the first autocorrelation, as this describes the correlation in the chain. For the chain to obey the (first-order) Markov property all other PACs should be near zero. A large $p$th PAC would indicate that the next value is dependent on past values and not just the current value. The ACF and PACF for one Gibbs sampling run and one Metropolis sampling run with $\sigma_p = 0.05$, for our example in the later section, are shown in Figure 3-4. The ACFs in Figure 3-4 include the auto-correlation at lag 0, which is always 1. Here it can be seen that the Gibbs run is mixing well and the auto-correlations are all small, whereas the Metropolis run is highly correlated.

There are many ways to improve the mixing of a Markov chain. The simplest way would be to thin the chain by using only every $k$th observation from the chain. Thinning a chain will give a new chain that has less autocorrelation, but it can be shown (MacEachern and Berliner 1994) that the thinned chain gives less accurate estimates than the complete chain. Thinning is still a useful technique, as longer runs need greater storage capacity, and although the thinned chain is not as useful as the full chain, it will generally be better than a section of the complete chain of the same length.
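A minimal illustrative Python sketch of estimating the ACF directly from a stored chain is given below; lag 0, which always equals 1, is included as in Figure 3-4, and the chain name in the usage comment is hypothetical.

```python
import numpy as np

def acf(chain, max_lag=20):
    """Sample autocorrelation function of a chain up to max_lag."""
    chain = np.asarray(chain, dtype=float)
    x = chain - chain.mean()
    denom = (x ** 2).sum()                 # proportional to Var(theta(t))
    return np.array([(x[: x.size - k] * x[k:]).sum() / denom
                     for k in range(max_lag + 1)])

# Small values at all lags > 0 indicate a well mixing chain.
# rho = acf(mu1_chain, max_lag=30)
```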

Figure 3-4: ACF and PACF for parameter $\mu_1$ for a Gibbs sampling run of length 5000 that is mixing well and a Metropolis run that is not mixing very well.

When considering Gibbs sampling methods, there are several ways of improving the mixing of the chain by actually altering the form of the model that is being fitted. Hills and Smith (1992) explore re-parameterising the vector of variables, so that the new variables correspond to the principal axes of the posterior distribution. This is done by transforming the data to the new axes. When considering multi-level models, techniques such as hierarchical centring (Gelfand, Sahu, and Carlin 1995), where variables that appear at lower levels are also included at higher levels, will improve the mixing of the sampler.

The mixing of a Metropolis-Hastings chain will depend greatly on the proposal distribution used. I will discuss the effect of the proposal distribution in greater detail in the example at the end of this chapter. Most of the techniques used to improve mixing in the Gibbs sampling algorithms, which involve changing the structure of the model, can also be used with Metropolis-Hastings algorithms.

Raftery & Lewis diagnostic

The Raftery and Lewis diagnostic (Raftery and Lewis 1992) considers the convergence of a run based on estimating a quantile, $q$, of a function of the parameters, $g(\theta)$, to within a given accuracy. The method works by firstly finding the estimated $q$th quantile of $g(\theta)$, $\hat g_q$, from the chain and then creating a chain of binary values, $Z_t$, defined by $Z_t = 1$ if $g(\theta_t) > \hat g_q$ and $Z_t = 0$ if $g(\theta_t) \le \hat g_q$. This binary sequence, or a thinned version of the binary sequence, can then be thought of as a one step Markov chain with transition matrix
$$P = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}.$$
Using results from Markov chain theory and estimates for $\alpha$ and $\beta$ from the chain, estimates for the length of `burn-in' required, $B$, and the minimum number of iterations to run the chain for, $N$, can be calculated. $N$ is defined as the minimum chain length to obtain estimates for the $q$th quantile to within $r$ (on the probability scale) with probability $s$, such that the $n$ step transition probabilities of the Markov chain are within $\epsilon$ of its equilibrium distribution. The estimates are

$$B = \frac{\log\left(\dfrac{\epsilon(\alpha+\beta)}{\max(\alpha,\beta)}\right)}{\log(1-\alpha-\beta)} \qquad \text{and} \qquad N = B + \frac{\alpha\beta(2-\alpha-\beta)}{(\alpha+\beta)^3}\left(\frac{\Phi^{-1}\!\left(\tfrac{1}{2}(1+s)\right)}{r}\right)^2.$$

The Raftery Lewis diagnostic can also be used to assess the mixing of the Markov chain by comparing the value $N$ with the value $N_{\min}$ obtained if the chain values were an independent sample. The statistic $I_{RL} = N/N_{\min}$ can then be used to describe the efficiency of the sampler. The default settings for the Raftery Lewis diagnostic, as used in Raftery and Lewis (1992), are $q = 0.025$, $r = 0.005$ and $s = 0.95$, and these will be used in MLwiN.
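As a purely numerical illustration (assuming the transition probabilities $\alpha$ and $\beta$ of the binary chain have already been estimated), the following Python sketch evaluates the burn-in and run-length formulae above. The scipy call supplies $\Phi^{-1}$, and the independent-sample run length used for $I_{RL}$ is taken to be $\Phi^{-1}(\tfrac{1}{2}(1+s))^2\,q(1-q)/r^2$, which is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np
from scipy.stats import norm

def raftery_lewis(alpha, beta, q=0.025, r=0.005, s=0.95, eps=0.001):
    """Burn-in B and run length N from the two-state chain for Z_t."""
    phi = norm.ppf(0.5 * (1.0 + s))                    # Phi^{-1}((1+s)/2)
    B = np.log(eps * (alpha + beta) / max(alpha, beta)) / np.log(1.0 - alpha - beta)
    N = B + alpha * beta * (2.0 - alpha - beta) / (alpha + beta) ** 3 * (phi / r) ** 2
    n_min = (phi / r) ** 2 * q * (1.0 - q)             # length if draws were independent
    return B, N, N / n_min                             # the last value is I_RL

# Example: a slowly mixing binary chain with alpha = beta = 0.05.
# print(raftery_lewis(0.05, 0.05))
```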

3.7.3 Multi-modal models

If the joint posterior distribution of interest is multi-modal, then when an MCMC sampler is used to simulate from the distribution it is possible, particularly if the modes are distinct, that the sampler will simulate from one of the modes and not the whole distribution. To get around this problem it is always useful to run several chains in parallel, with different starting values spread around the expected posterior distribution, and compare the estimates that are obtained from each chain. If chains give widely differing estimates then the posterior is likely to be multi-modal and the different chains are sampling from distinct modes of the distribution. There are many convergence diagnostics that rely on running several chains from different starting points. One of the more popular will now be described.

Gelman & Rubin diagnostic

Gelman and Rubin (1992) assume that $m$ runs of the same model, each of length $2n$ and starting from dispersed starting points, are run, and the first $n$ iterations of each run have been discarded to allow each sequence to move away from its starting point. Then the between run, $B$, and within run, $W$, variances are calculated:
$$B/n = \sum_{i=1}^{m}(\bar\theta_{i\cdot} - \bar\theta_{\cdot\cdot})^2/(m-1) \qquad \text{and} \qquad W = \sum_{i=1}^{m} s_i^2/m,$$
where $\bar\theta_{i\cdot}$ is the mean of the $n$ values for run $i$, $s_i^2$ is the variance and $\bar\theta_{\cdot\cdot}$ is the overall mean. The variance of the parameter of interest, $\sigma^2$, can be estimated by a weighted average of $W$ and $B$,
$$\hat\sigma^2 = \frac{n-1}{n}W + \frac{1}{n}B.$$
This, along with $\hat\mu = \bar\theta_{\cdot\cdot}$, gives a normal estimate for the target distribution. If the dispersed starting points are still influencing the runs then the estimate $\hat\sigma^2$ will be an overestimate of $\sigma^2$. The potential scale reduction as $n \to \infty$, that is the overestimation factor for $\hat\sigma^2$, can be estimated by
$$\hat R = \left(\frac{n-1}{n} + \frac{m+1}{mn}\,\frac{B}{W}\right)\frac{\mathrm{df}}{\mathrm{df}-2},$$
where $\mathrm{df} = 2(\hat\sigma^2)^2/\widehat{\mathrm{var}}(\hat\sigma^2)$ and
$$\widehat{\mathrm{var}}(\hat\sigma^2) = \left(\frac{n-1}{n}\right)^2\frac{1}{m}\widehat{\mathrm{var}}(s_i^2) + \left(\frac{m+1}{mn}\right)^2\frac{2}{m-1}B^2 + 2\,\frac{(m+1)(n-1)}{mn^2}\,\frac{n}{m}\left[\widehat{\mathrm{cov}}(s_i^2, \bar\theta_{i\cdot}^2) - 2\bar\theta_{\cdot\cdot}\,\widehat{\mathrm{cov}}(s_i^2, \bar\theta_{i\cdot})\right],$$
in which the estimated variances and covariances are obtained from the sample means and variances of the $m$ runs. If $\hat R$ is near 1 for all parameters of interest then there is little evidence of multi-modality. If $\hat R$ is significantly bigger than 1 then at least one of the $m$ runs has not converged, or the runs have converged to different modes in the distribution.
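A minimal illustrative Python sketch of the core of this diagnostic is given below; it computes $B$, $W$, $\hat\sigma^2$ and a simplified $\hat R$ in which the df/(df-2) correction factor is omitted for brevity.

```python
import numpy as np

def gelman_rubin(chains):
    """B, W, sigma^2-hat and a simplified R-hat from m chains of length n.

    chains : array of shape (m, n), e.g. the second halves of m parallel runs.
    The df/(df - 2) correction factor is omitted in this sketch.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    run_means = chains.mean(axis=1)
    B = n * run_means.var(ddof=1)              # between-run variance B
    W = chains.var(axis=1, ddof=1).mean()      # within-run variance W
    sigma2_hat = (n - 1) / n * W + B / n       # weighted-average estimate of sigma^2
    R_hat = (n - 1) / n + (m + 1) / (m * n) * B / W
    return B, W, sigma2_hat, R_hat

# R-hat values near 1 suggest the m runs are sampling the same distribution.
```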

The majority of models I will study in this thesis will be unimodal. I am aiming to use MLn's maximum likelihood techniques to give good starting values for the parameters of interest, and consequently will not use widely different starting values and so generally choose not to use the Gelman and Rubin diagnostic. I will, however, run several chains with different starting seeds for the random number generator. This will work in a similar way to the different starting values. A chain may get stuck in a local mode using one set of random number seeds, whereas another chain starting from the same starting values but with different random numbers may get stuck in a different mode. However, if the model does have multiple modes, this procedure will not find them as well as the Gelman-Rubin sampling strategy.

3.7.4 Summary

Convergence diagnostics for MCMC methods have become a large field of statistical research and the three diagnostics described here are simply the tip of the iceberg. Both Cowles and Carlin (1996) and Brooks and Roberts (1997) review larger groups of convergence diagnostics and are recommended for further reading on this subject.

3.8 Use of MCMC methods in multi-level modelling

Following the introduction of Gibbs sampling in Gelfand and Smith (1990), Gelfand et al. (1990) applied the Gibbs sampling algorithm to many problems, including variance components models and a simple hierarchical model. Seltzer (1993) considers using Gibbs sampling on a two level hierarchical model with a scalar random regression parameter. The algorithm used is fully generalized in Seltzer, Wong, and Bryk (1996) to allow vectors of random regression parameters. Zeger and Karim (1991) consider using Gibbs sampling for generalized linear models with random effects, which are two level multi-level models. They concentrate mainly on the logistic normal model, which I will investigate in Chapter 6. The package BUGS (Spiegelhalter et al. 1994) is a general purpose Gibbs sampling package using the adaptive rejection method (Gilks and Wild 1992) that can be used to fit many models, including multi-level models. They have concentrated mainly on models with univariate parameter distributions, although in BUGS versions 0.5 and later they include multivariate distributions.

It can be seen that most research in the use of MCMC methods in the field of multi-level modelling has concentrated on Gibbs sampling. This is primarily because of its ease of programming. In MLwiN I will start by using Gibbs sampling for the simplest models. Then, when the conditional distributions do not have standard forms, for example logistic regression models, where Zeger and Karim (1991) use rejection sampling and BUGS uses adaptive rejection sampling, I will instead consider using Metropolis and Metropolis-Hastings sampling.

I will also consider using these methods as an alternative to Gibbs sampling in the less complex Gaussian models. Before looking at multi-level models, I will end this chapter with an example that will illustrate the three MCMC methods and the other issues described in this chapter.

3.9 Example - Bivariate normal distribution

Gelman et al. (1995) considered taking one observation from a bivariate normal distribution to illustrate the use of the Gibbs sampler. I will consider the more general case of a sequence of $n$ pairs of observations $(y_{1i}, y_{2i})$ from a bivariate normal distribution with unknown mean $\mu = (\mu_1, \mu_2)$ and known variance matrix $\Sigma$. Assume that $\mu$ has a non-informative uniform prior distribution; then the posterior distribution has a known form:
$$\mu \mid y \sim N\!\left(\begin{pmatrix}\bar y_1 \\ \bar y_2\end{pmatrix}, \frac{\Sigma}{n}\right).$$
I can verify the use of the MCMC techniques in this chapter by comparing the answers they produce with the correct posterior distribution. I will consider a set of 100 draws generated from a bivariate normal distribution with mean vector $\mu = (4, 2)$ and a known variance matrix $\Sigma$. I will assume that $\Sigma$ is known and that I want to estimate $\mu$. In the test data set, $\bar y_1 = 4.0154$ and $\bar y_2 = 2.0013$, so the posterior distribution is as follows:
$$\mu \mid y \sim N\!\left(\begin{pmatrix}4.0154 \\ 2.0013\end{pmatrix}, \frac{\Sigma}{100}\right).$$
I will now explain briefly how to use the various techniques on this problem.

3.9.1 Metropolis sampling

There are two parameters, $\mu_1$ and $\mu_2$, for which posterior distributions are required. As uniform priors are being used for $\mu_1$ and $\mu_2$, the conditional posterior distributions are simply determined by the likelihood:

$$p(\mu_1 \mid \mu_2, y) \propto p(y \mid \mu)\,p(\mu_1) \propto \exp\!\left(-\frac{1}{2}\sum_{i=1}^{N}(y_i - \mu)^T\Sigma^{-1}(y_i - \mu)\right).$$
Similarly, for $\mu_2$,
$$p(\mu_2 \mid \mu_1, y) \propto \exp\!\left(-\frac{1}{2}\sum_{i=1}^{N}(y_i - \mu)^T\Sigma^{-1}(y_i - \mu)\right).$$
I will use the normal proposal distribution
$$\mu_{i(t+1)} \sim N(\mu_{i(t)}, \sigma^2_p)$$
for both $\mu_1$ and $\mu_2$. I will consider several values for $\sigma^2_p$ to show the effect of the proposal variance on the acceptance rate and convergence of the chain.

3.9.2 Metropolis-Hastings sampling

As an example of a Metropolis-Hastings sampler I will consider the following normal proposal distribution:
$$\mu_{i(t+1)} \sim N(\mu_{i(t)} + \tfrac{1}{2}, \sigma^2_p).$$
This proposal distribution has two differences from the earlier Metropolis proposal distribution. Firstly, it is biased, which in this example induces slow mixing; generally it is preferable to have an unbiased proposal distribution. Secondly, it is not symmetric, so it does not have the Metropolis property, $p(\theta_{t+1} = a \mid \theta_t = b) = p(\theta_{t+1} = b \mid \theta_t = a)$. Consequently the ratio of the proposal distributions has to be worked out, that is
$$r = \frac{p(\theta_{t+1} = a \mid \theta_t = b)}{p(\theta_{t+1} = b \mid \theta_t = a)}.$$
For this proposal distribution,

$$r = \frac{p(\theta_{t+1} = a \mid \theta_{t+1} \sim N(b + \tfrac{1}{2}, \sigma^2_p))}{p(\theta_{t+1} = b \mid \theta_{t+1} \sim N(a + \tfrac{1}{2}, \sigma^2_p))} = \frac{\exp\!\left(-\frac{1}{2\sigma^2_p}(a - b - \tfrac{1}{2})^2\right)}{\exp\!\left(-\frac{1}{2\sigma^2_p}(b - a - \tfrac{1}{2})^2\right)} = \exp\!\left(-\frac{1}{2\sigma^2_p}\left[(a - b - \tfrac{1}{2})^2 - (b - a - \tfrac{1}{2})^2\right]\right) = \exp\!\left(\frac{a - b}{\sigma^2_p}\right).$$

So when choosing to accept or reject each new value, the Hastings ratio is used as a multiplying factor. Again I will consider using several different proposal variances, $\sigma^2_p$, to improve the acceptance rate and convergence time.

3.9.3 Gibbs sampling

To use Gibbs sampling on this model I will consider updating the two parameters, $\mu_1$ and $\mu_2$, separately. It would be pointless, as an illustration of the Gibbs sampler, to update the parameters together using a multivariate updating step, as this would be generating from the conditional distribution $p(\mu \mid y)$, which is the joint posterior distribution of interest, and I could find its mean and variance directly. To use Gibbs sampling I need to find the two conditional distributions, $p(\mu_1 \mid \mu_2, y)$ and $p(\mu_2 \mid \mu_1, y)$. I am using uniform priors for $\mu_1$ and $\mu_2$, and so the posterior distribution is simply the normalised likelihood. The conditional distributions are found as follows. Let
$$D = \begin{pmatrix} d_{11} & d_{12} \\ d_{12} & d_{22} \end{pmatrix} = \Sigma^{-1};$$
then
$$p(\mu_1 \mid \mu_2, y) \propto p(y \mid \mu)\,p(\mu_1) \propto \exp\!\left(-\frac{1}{2}\sum_{i=1}^{n}(y_i - \mu)^T\Sigma^{-1}(y_i - \mu)\right),$$
and expanding in terms of $\mu_1$,
$$p(\mu_1 \mid \mu_2, y) \propto \exp\!\left(-\frac{1}{2}\sum_{i=1}^{n}\left[(y_{i1} - \mu_1)^2 d_{11} + 2(y_{i1} - \mu_1)(y_{i2} - \mu_2)d_{12} + (y_{i2} - \mu_2)^2 d_{22}\right]\right).$$

Then, assuming that $\mu_1$ has a normal distribution, $\mu_1 \sim N(\mu_c, \sigma^2_c)$, and equating powers of $\mu_1$ gives
$$\frac{\mu_1^2}{\sigma^2_c} = n d_{11}\mu_1^2 \;\Longrightarrow\; \sigma^2_c = \frac{1}{n d_{11}}$$
and
$$-\frac{2\mu_c\mu_1}{\sigma^2_c} = -2\sum_{i=1}^{n}\left(y_{i1}\mu_1 d_{11} + y_{i2}\mu_1 d_{12} - \mu_2\mu_1 d_{12}\right),$$
so I have
$$\mu_c = \bar y_1 + \frac{d_{12}}{d_{11}}(\bar y_2 - \mu_2).$$
Hence, and similarly for $\mu_2$,
$$p(\mu_1 \mid \mu_2, y) \sim N\!\left(\bar y_1 + \frac{d_{12}}{d_{11}}(\bar y_2 - \mu_2),\ \frac{1}{n d_{11}}\right), \qquad p(\mu_2 \mid \mu_1, y) \sim N\!\left(\bar y_2 + \frac{d_{12}}{d_{22}}(\bar y_1 - \mu_1),\ \frac{1}{n d_{22}}\right).$$
These expressions could also have been derived from standard bivariate normal regression results. I can now use the Gibbs sampling algorithm by alternately sampling from these two conditional distributions. Unlike the other two methods, I do not have a free parameter to set to change the acceptance rate and improve the convergence rate; the Gibbs sampler always accepts the new state. This is one of the reasons it is more widely used than the other methods.
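A minimal illustrative Python sketch of this Gibbs sampler, alternating between the two conditional distributions just derived, is given below; it is not the implementation used in the thesis. The data array y and the known variance matrix Sigma are supplied by the user, and the starting values are taken here to be the sample means.

```python
import numpy as np

def gibbs_bivariate_mean(y, Sigma, n_iter=5000, burn_in=1000, rng=None):
    """Gibbs sampler for the mean of a bivariate normal with known Sigma.

    y : array of shape (n, 2) of observations.
    Alternates draws from p(mu1 | mu2, y) and p(mu2 | mu1, y).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = y.shape[0]
    ybar1, ybar2 = y.mean(axis=0)
    D = np.linalg.inv(Sigma)                  # D = Sigma^{-1}
    d11, d12, d22 = D[0, 0], D[0, 1], D[1, 1]
    mu1, mu2 = ybar1, ybar2                   # starting values
    chain = np.empty((n_iter, 2))
    for t in range(n_iter):
        # p(mu1 | mu2, y) = N(ybar1 + (d12/d11)(ybar2 - mu2), 1/(n d11))
        mu1 = rng.normal(ybar1 + d12 / d11 * (ybar2 - mu2),
                         np.sqrt(1.0 / (n * d11)))
        # p(mu2 | mu1, y) = N(ybar2 + (d12/d22)(ybar1 - mu1), 1/(n d22))
        mu2 = rng.normal(ybar2 + d12 / d22 * (ybar1 - mu1),
                         np.sqrt(1.0 / (n * d22)))
        chain[t] = mu1, mu2
    return chain[burn_in:]
```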

3.9.4 Results

The model was fitted using all three methods described above. For the Gibbs sampler, 3 runs were performed using a burn-in of 1,000 and a main run of 5,000 updates. For both the Metropolis and Metropolis-Hastings sampling methods, 3 runs were performed for several different values of $\sigma_p$; a burn-in of 1,000 was used and a main run of 100,000 updates for both methods. The results are summarised in Table 3.1.

Table 3.1: Comparison between MCMC methods for fitting a bivariate normal model with unknown mean vector. Columns: Method, $\sigma_p$, $\hat\mu_1$ (sd), $\hat\mu_2$ (sd), acceptance % for $\mu_1$/$\mu_2$, Raftery-Lewis $\hat N$.

Theory N/A 4015 (0141) 2001 (0100) N/A N/A (0143) 2004 (0100) 886/843 81, (0141) 2002 (0100) 781/703 35,000 Metropolis (0141) 2001 (0100) 606/497 16, (0141) 2001 (0100) 478/372 14, (0141) 2002 (0100) 325/240 20, (0142) 2002 (0100) 174/124 36, (0143) 2002 (0101) 44/43 92, (0141) 2002 (0099) 89/79 34,900 Hastings (0140) 2001 (0100) 188/141 25, (0141) 2001 (0100) 179/131 31, (0141) 2001 (0100) 152/110 41, (0143) 2001 (0100) 111/78 63,200 Gibbs N/A 4014 (0140) 2002 (0103) 1000/1000 3,900

From Table 3.1 it can clearly be seen that all the methods eventually converge to approximately the correct answers. According to the Raftery-Lewis diagnostic, the Gibbs sampling method achieves the default accuracy goals in the least number of iterations. Both the Metropolis and the Hastings methods have accuracies which vary depending on the proposal distribution. The Hastings method has much smaller acceptance rates due to the bias in the proposal distribution. It is clear that the Hastings sampler takes longer to converge than the Metropolis sampler on average, and that the best proposal standard deviation $\sigma_p$ is higher for the Hastings sampler than the Metropolis sampler. Both these points are also due to the bias in the sampler, and in general a biased sampler would not be used in preference to an unbiased sampler.

Gelman, Roberts, and Gilks (1995) studied optimal Metropolis sampler proposal distributions for Gaussian target distributions. They found that the best univariate proposal distributions have standard deviations that are 2.38 times the sample standard deviation. I used the known correct standard deviations for the parameters $\mu_1$ and $\mu_2$ to find that the optimal proposal standard deviations are 0.336 for $\mu_1$ and 0.238 for $\mu_2$. In Table 3.1 the same proposal standard deviation is used for both parameters, but it is just as easy to use different proposal distributions for each parameter. Using the proposal standard deviations proposed in Gelman,

Roberts, and Gilks (1995) gives acceptance rates of approximately 44.2% and Raftery Lewis $\hat N$ values of around 14,000 for both parameters, which compare favourably with the best results in the table.

Figure 3-5: Kernel density plot of $\mu_2$ using the Gibbs sampling method and a Gaussian kernel.

Looking at the kernel density plots for the two variables, Figures 3-2 and 3-5, constructed from one run of the Gibbs sampler method, it can be seen that both variables have Gaussian posterior distributions, as expected. In the case where the posterior distributions of the parameters of interest are Gaussian, the 95% central Bayesian credible intervals (BCI) from the simulation methods will be approximately the same as the standard 95% confidence intervals (CI). This point is illustrated in Table 3.2 for the three MCMC methods. Both the Bayesian credible intervals and the confidence intervals are based on one run of each sampler respectively.

Figure 3-6: Plots of the Raftery Lewis $\hat N$ values (in thousands) for various values of $\sigma_p$, the proposal distribution standard deviation. Panels (a) and (b) plot $\hat N$ against the proposal sd for $\mu_1$ and $\mu_2$; panels (c) and (d) plot $\hat N$ against the acceptance rate; panels (e) and (f) plot $\hat N$ against the ratio of proposal sd to actual sd.

Table 3.2: Comparison between 95% confidence intervals and Bayesian credible intervals in the bivariate normal model.

Method               $\mu_1$ CI         $\mu_2$ CI         $\mu_1$ BCI        $\mu_2$ BCI
Theory               (3.739, 4.291)     (1.805, 2.197)     (3.739, 4.291)     (1.805, 2.197)
Met $\sigma_p = 0.3$     (3.737, 4.294)     (1.807, 2.196)     (3.737, 4.294)     (1.806, 2.196)
Met optimal          (3.737, 4.292)     (1.804, 2.200)     (3.738, 4.291)     (1.804, 2.200)
Hast $\sigma_p = 0.5$    (3.739, 4.288)     (1.802, 2.190)     (3.739, 4.283)     (1.804, 2.191)
Gibbs                (3.733, 4.291)     (1.799, 2.205)     (3.737, 4.290)     (1.802, 2.200)

Considering the Metropolis sampler in more detail and running the sampler with many different values of $\sigma_p$, the optimal value given by Gelman, Roberts, and Gilks (1995) can be verified. The graphs in Figure 3-6 were created by fitting smooth curves to the results from the data for the new runs of the sampler. Graphs (a) and (b) show the Raftery Lewis $\hat N$ values plotted against $\sigma_p$, the proposal standard deviation, for $\mu_1$ and $\mu_2$ respectively. Graphs (c) and (d) show the effect of the acceptance rate of the proposals on $\hat N$ for the two variables. Graphs (e) and (f) are graphs (a) and (b) rescaled by dividing $\sigma_p$ by the true standard deviations for the two parameters.

In their paper, Gelman, Roberts, and Gilks (1995) compared the effects of different values of the parameter $\sigma_p$, and of the acceptance rate of the parameter, on the efficiency of the sampler. I have substituted the Raftery and Lewis diagnostic, $\hat N$, for efficiency and found that the same values of standardised $\sigma_p$ that maximise efficiency also minimise $\hat N$. From this it appears that $1/\hat N = f(\text{efficiency})$ for some increasing $f$. It is worth noting from Figure 3-6 that while the optimal value for the ratio of proposal SD to actual SD is (close to) the 2.4 value obtained in Gelman, Roberts, and Gilks (1995), the region of near optimality is quite broad: the Raftery Lewis $\hat N$ value is below 20,000 over a wide interval of this ratio.

3.10 Summary

This example has shown how to use the three MCMC methods described earlier in this chapter on a simple problem. It has shown that the Gibbs sampler works best if the conditional distributions of the unknown parameters are known. It also shows that the Metropolis and Hastings algorithms are easier to implement

69 but need more tuning to give good answers The Hastings algorithm used here performed worst but it will be shown how more sensible Hastings algorithms can be used in later chapters The example also illustrates how the output from a simulation method can be summarised and highlights the importance of checking convergence The summary results and run diagnostics covered in this chapter have now been added to MLwiN and can be seen in Figure 3-7 This parameter has been given an informative prior distribution whose graph can also be seen in the kernel density picture In the next chapter I will discuss prior distributions in greater detail Figure 3-7: Plot of the MCMC diagnostic window in the package MLwiN for the parameter 1 from a random slopes regression model 52

70 Chapter 4 Gaussian Models 1 - Introduction 41 Introduction In this chapter the aim is to combine the knowledge gained in the previous two chapters, to use the MCMC methods described in Chapter 3 to t some of the simple multi-level models described in Chapter 2 This work will then lead on to the next chapter where I will consider how to t general models using MCMC methods I will only consider one of the three MCMC methods described in chapter 3, the Gibbs sampler, and use it to t the simple variance components model and the random slopes regression model I will give Gibbs sampling algorithms to t both these models and compare the results obtained to the results obtained with the IGLS and RIGLS maximum likelihood methods Before considering any modelling, I will rstly concentrate on one important aspect of Bayesian methods, prior distributions To create a general purpose multi-level modelling package that uses Bayesian methods, some default prior distributions must be found for all parameters These default priors should be \non-informative" and so some possible default priors will be described in the next section of this chapter These dierent candidate priors will then be compared via simulation with each other and the maximum likelihood methods I will end the chapter with some conclusions on which methods perform best 53

71 42 Prior distributions A prior distribution p() for a parameter is a probability distribution that describes all that is known about before the data has been collected There are two distinct types of prior distribution, informative priors and non-informative priors 421 Informative priors An informative prior for is a prior distribution that is used when information about the parameter of interest is available before the data is collected, and this information is to be included in the analysis For example, say Iwas interested in estimating the average height ofmaleuniversity students Then before collecting my data by sampling from the student population, I go to the library and nd that the average height of men in Britain is 179m I can then create a normal prior distribution with mean 179 and variance 2, where the value, 2 will determine the information content of the prior knowledge I could also incorporate my belief that as students are generally in the age group, and this age group is on average taller, this group will be on average taller by increasing the mean of my prior distribution 422 Non-informative priors A \non-informative" prior distribution for is a prior distribution that is used to express complete ignorance of the value of before the data is collected They are non-informative in the sense that no value is favoured over any other and are also described as diuse or at priors due to this reason and their shape The most common non-informative prior is the uniform distribution over the range of the sample space for If the parameter is dened over an innite range, for example the whole real line, then the uniform distribution is an improper prior distribution, as its distribution function does not integrate to 1 Improper prior distributions should be used with caution, and only be used if they produce proper posterior distributions 54

72 423 Priors for xed eects Fixed eect parameters have no constraints and can take any value A prior distribution for such parameters will need to be dened over the whole real line The conjugate prior distribution for such parameters is the normal distribution as will be illustrated in the algorithms described in later sections Uniform prior If a non-informative prior is required then a good choice would be a uniform prior over the whole real line, p() / 1 This prior is improper as it does not integrate to 1, but will give proper posterior distributions and can be approximated by the following prior Normal prior with huge variance As the variance of the normal distribution is increased, the distribution becomes locally at around its mean Although xed eects can take any value, close examination of the data can narrow the range of values and a suitable normal prior can be found Generally the normal prior, p() N( ) will be an acceptable approximation to a uniform distribution but if the xed eects are very large, a suitable increase in the prior variance may be necessary Figure 4-1 shows several normal priors over the range ({5,5) It can clearly be seen that as the variance increases the prior distribution becomes atter over the range and when the variance is increased to 50 the graph looks like aat line 424 Priors for single variances Variance parameters are constrained to have strictly positive values, and so prior distributions such as the normal cannot be used The conjugate prior for a variance parameter is an inverse chi squared or inverse gamma distribution As these distributions are not commonly simulated from, the precision parameter, the reciprocal of the variance is generally considered instead The conjugate priors for the precision parameter are then the chi-squared or gamma distributions There are a variety of main contenders for the non-informative distribution for the variance and these will now be considered 55

Figure 4-1: Plot of normal prior distributions over the range (-5, 5) with mean 0 and variances 1, 2, 5, 10 and 50 respectively.

Uniform prior for $\sigma^2$

The parameter of interest is the variance, $\sigma^2$, so this prior tries to allow any variance to be equally likely. This prior is used by Gelman and Rubin (1992) and Seltzer (1993) amongst others, but appears to have the disadvantage that it favours large values of $\sigma^2$. This is because even unfeasibly large values of $\sigma^2$ have equal prior probability. This prior is improper and the following is a proper alternative.

Pareto(1, c) prior for $\tau = 1/\sigma^2$

The Pareto distribution is a left-truncated gamma distribution, and when used as a prior for the precision parameter it is equivalent to a locally uniform prior for

the variance parameter:
$$\tau \sim \text{Pareto}(1, c), \qquad p(\tau) = c\,\tau^{-2}, \quad \tau > c.$$
This means that a uniform prior for $\sigma^2$ on $(0, c^{-1})$ is equivalent to a Pareto(1, c) prior for $\tau$. As $c$ is decreased the distribution will approach the improper uniform distribution on $(0, \infty)$.

Uniform prior for $\log \sigma^2$

Box and Tiao (1992) try to find `data-translated' uniform priors to represent suitably non-informative priors. They try to find a scale upon which the distribution of the parameter will have the same shape for any possible value of that parameter. For the fixed effects parameter this scale is simply the parameter's own scale, as altering a normal distribution's mean does not alter its shape, it simply translates the distribution. When considering a variance parameter, the likelihood has an inverse chi-squared distribution, and this implies that the correct scale is the log scale. Consequently Box and Tiao suggest using a uniform prior on the $\log \sigma^2$ scale. DuMouchel and Waternaux (1992) discourage the use of this improper prior distribution with hierarchical models, as they claim it can give improper posterior distributions, so instead the following proper alternative is often used.

Gamma($\epsilon$, $\epsilon$) prior for $\tau = 1/\sigma^2$

The gamma($\epsilon$, $\epsilon$) prior for $\tau$ approaches a uniform prior for $\log \sigma^2$ as $\epsilon \to 0$. In fact the improper gamma distribution for $\tau$ with both parameters equal to 0 is equivalent to a uniform prior for $\log \sigma^2$. This prior is the standard prior recommended by BUGS (Spiegelhalter et al. 1994) for variance parameters, as BUGS does not permit the use of improper priors.

Scaled inverse chi-squared($\nu$, $\hat\sigma^2$) prior for $\sigma^2$

An alternative approach would be to use an estimate of the parameter of interest to choose a particular prior distribution. This prior is then a data-driven prior

as it requires an estimate for $\sigma^2$. The parameter $\nu$ is small, and if the estimate of $\sigma^2$, $\hat\sigma^2$, is 1, this prior is equivalent to a gamma($\nu/2$, $\nu/2$) prior for $\tau$.

4.2.5 Priors for variance matrices

When considering a variance matrix, most priors for single variances can be translated to a multivariate alternative.

Uniform prior for $\Omega$

This is similar to the univariate case, i.e. $p(\Omega) \propto 1$.

Uniform prior for $\log \Omega$

This is similar to the univariate case, i.e. $p(\log \Omega) \propto 1$. This prior will not be considered directly as it gives rise to improper posterior distributions.

Wishart prior equivalent to the gamma($\epsilon$, $\epsilon$) prior

It is difficult to evaluate what would be the multivariate equivalent of the gamma($\epsilon$, $\epsilon$) prior. One candidate prior is a Wishart prior for the precision matrix, $\Omega^{-1}$, with parameters $\nu = n$ and $S = I$. In fact, it will be seen later that this prior is slightly informative and shrinks the estimate for $\Omega$ towards $I$, the identity matrix.

Wishart prior equivalent to the SI-$\chi^2$($\nu$, $\hat\sigma^2$) prior

It can be shown (Spiegelhalter et al. 1994) that if $\Omega^{-1} \sim \text{Wishart}_n(\nu, S)$ then $E(\Omega) = S/(\nu - n - 1)$. It clearly follows that if there is a prior estimate, $\hat\Omega$, for $\Omega$ and I want to incorporate this estimate into a `vaguely' informative prior, then the following is an obvious candidate:
$$\Omega^{-1} \sim \text{Wishart}_n(n + 2, \hat\Omega).$$

I will now compare the results obtained using all the above priors on 2 simple two level models.

4.3 2 level variance components model

One of the simplest possible multi-level models is the two level variance components model. This model has a single response variable and interest lies in quantifying the variability of this response at different levels. The two level variance components model can be written mathematically as:
$$y_{ij} = \beta_0 + u_j + e_{ij}, \qquad u_j \sim N(0, \sigma^2_u), \qquad e_{ij} \sim N(0, \sigma^2_e),$$
where $i = 1, \ldots, n_j$, $j = 1, \ldots, J$, $\sum_j n_j = N$, and in which all the $u_j$ and $e_{ij}$ are independent. This model can be fitted using the Gibbs sampling method as shown in the next section.

4.3.1 Gibbs sampling algorithm

The unknown parameters in the variance components model can be split into four groups: the fixed effect, $\beta_0$, the level 2 residuals $u_j$, the level 2 variance $\sigma^2_u$ and the level 1 variance $\sigma^2_e$. Conditional posterior distributions for each of these parameters need to be found so that the Gibbs sampling method described in the previous chapter can be used. Then sampling from the distributions in turn gives estimates for the parameters, and their posterior distributions can be found by simulation.

Prior distributions

I will assume a uniform prior for the fixed effect parameter $\beta_0$. The two variances $\sigma^2_u$ and $\sigma^2_e$ will take various priors in the simulation experiment, so in the algorithm I will use general scaled inverse $\chi^2$ priors with parameters $\nu_u, s^2_u$ and $\nu_e, s^2_e$ respectively. Then all the priors in the earlier section can be obtained from particular values of these parameters. The algorithm is then as follows.

Step 1: $p(\beta_0 \mid y, \sigma^2_u, \sigma^2_e, u)$. Let $\beta_0 \sim N(\hat\beta_0, \hat D_0)$; then to find $\hat\beta_0$ and $\hat D_0$:

$$p(\beta_0 \mid y, \sigma^2_u, \sigma^2_e, u) \propto \prod_{i,j}\left(\frac{1}{\sigma^2_e}\right)^{\frac{1}{2}} \exp\!\left[-\frac{1}{2\sigma^2_e}(y_{ij} - u_j - \beta_0)^2\right] \propto \exp\!\left[-\frac{N}{2\sigma^2_e}\beta_0^2 + \frac{1}{\sigma^2_e}\sum_{i,j}(y_{ij} - u_j)\beta_0 + \text{const}\right].$$
Comparing this with the form of a normal distribution and matching powers of $\beta_0$ gives
$$\hat D_0 = \frac{\sigma^2_e}{N} \qquad \text{and} \qquad \hat\beta_0 = \frac{1}{N}\sum_{i,j}(y_{ij} - u_j).$$

Step 2: $p(u_j \mid y, \sigma^2_u, \sigma^2_e, \beta_0)$. Let $u_j \sim N(\hat u_j, \hat D_j)$; then to find $\hat u_j$ and $\hat D_j$:
$$p(u_j \mid y, \sigma^2_u, \sigma^2_e, \beta_0) \propto \prod_{i=1}^{n_j}\left(\frac{1}{\sigma^2_e}\right)^{\frac{1}{2}} \exp\!\left[-\frac{1}{2\sigma^2_e}(y_{ij} - u_j - \beta_0)^2\right]\left(\frac{1}{\sigma^2_u}\right)^{\frac{1}{2}} \exp\!\left[-\frac{u_j^2}{2\sigma^2_u}\right]$$
$$\propto \exp\!\left[-\frac{1}{2}\left(\frac{n_j}{\sigma^2_e} + \frac{1}{\sigma^2_u}\right)u_j^2 + \frac{1}{\sigma^2_e}\sum_{i=1}^{n_j}(y_{ij} - \beta_0)u_j + \text{const}\right].$$
Comparing this with the form of a normal distribution and matching powers of $u_j$ gives
$$\hat D_j = \left(\frac{n_j}{\sigma^2_e} + \frac{1}{\sigma^2_u}\right)^{-1} \qquad \text{and} \qquad \hat u_j = \frac{\hat D_j}{\sigma^2_e}\sum_{i=1}^{n_j}(y_{ij} - \beta_0).$$

Step 3: $p(\sigma^2_u \mid y, \beta_0, u, \sigma^2_e)$. Consider instead $p(1/\sigma^2_u \mid y, \beta_0, u, \sigma^2_e)$ and let $1/\sigma^2_u \sim \text{gamma}(a_u, b_u)$. Then $p(1/\sigma^2_u) = (1/\sigma^2_u)^{-2}\,p(\sigma^2_u)$ and so
$$p(1/\sigma^2_u \mid y, \beta_0, u, \sigma^2_e) \propto \prod_{j=1}^{J}\left(\frac{1}{\sigma^2_u}\right)^{\frac{1}{2}} \exp\!\left[-\frac{u_j^2}{2\sigma^2_u}\right]\left(\frac{1}{\sigma^2_u}\right)^{-2} p(\sigma^2_u) \propto \left(\frac{1}{\sigma^2_u}\right)^{\left(\frac{J+\nu_u}{2} - 1\right)} \exp\!\left[-\frac{1}{2\sigma^2_u}\left(\sum_{j=1}^{J} u_j^2 + \nu_u s^2_u\right)\right].$$
Comparing this with the form of a gamma distribution produces
$$a_u = \frac{J + \nu_u}{2} \qquad \text{and} \qquad b_u = \frac{1}{2}\left(\nu_u s^2_u + \sum_{j=1}^{J} u_j^2\right).$$
A uniform prior on $\sigma^2_u$, or the equivalent Pareto prior, corresponds to $\nu_u = -2$, $s^2_u = 0$. A uniform prior on $\log \sigma^2_u$ corresponds to $\nu_u = 0$, $s^2_u = 0$, and a gamma($\epsilon$, $\epsilon$) prior for $1/\sigma^2_u$ corresponds to $\nu_u = 2\epsilon$, $s^2_u = 1$.

Step 4: $p(\sigma^2_e \mid y, \beta_0, u, \sigma^2_u)$. Consider instead $p(1/\sigma^2_e \mid y, \beta_0, u, \sigma^2_u)$ and let $1/\sigma^2_e \sim \text{gamma}(a_e, b_e)$. Then $p(1/\sigma^2_e) = (1/\sigma^2_e)^{-2}\,p(\sigma^2_e)$ and so
$$p(1/\sigma^2_e \mid y, \beta_0, u, \sigma^2_u) \propto \prod_{i,j}\left(\frac{1}{\sigma^2_e}\right)^{\frac{1}{2}} \exp\!\left[-\frac{1}{2\sigma^2_e}(y_{ij} - \beta_0 - u_j)^2\right]\left(\frac{1}{\sigma^2_e}\right)^{-2} p(\sigma^2_e) \propto \left(\frac{1}{\sigma^2_e}\right)^{\left(\frac{N+\nu_e}{2} - 1\right)} \exp\!\left[-\frac{1}{2\sigma^2_e}\left(\sum_{i,j}(y_{ij} - \beta_0 - u_j)^2 + \nu_e s^2_e\right)\right].$$
Comparing this with the form of a gamma distribution produces
$$a_e = \frac{N + \nu_e}{2} \qquad \text{and} \qquad b_e = \frac{1}{2}\left(\nu_e s^2_e + \sum_{i,j} e_{ij}^2\right).$$
A uniform prior on $\sigma^2_e$, or the equivalent Pareto prior, corresponds to $\nu_e = -2$, $s^2_e = 0$. A uniform prior on $\log \sigma^2_e$ corresponds to $\nu_e = 0$, $s^2_e = 0$, and a gamma($\epsilon$, $\epsilon$) prior for $1/\sigma^2_e$ corresponds to $\nu_e = 2\epsilon$, $s^2_e = 1$.
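A minimal illustrative Python sketch of one implementation of Steps 1 to 4 is given below; the actual samplers in this thesis are implemented in MLwiN and BUGS, not Python. Here y_groups holds the responses for each level 2 unit, and the scaled inverse chi-squared prior parameters default to the uniform-on-the-variance choice nu = -2, s^2 = 0.

```python
import numpy as np

def vc_gibbs(y_groups, n_iter=5000, nu_u=-2, s2_u=0.0, nu_e=-2, s2_e=0.0, rng=None):
    """Gibbs sampler for the 2-level variance components model (Steps 1-4).

    y_groups : list of J arrays, the responses y_ij for each level 2 unit j.
    nu_*, s2_* : scaled inverse chi-squared prior parameters for the variances
                 (defaults correspond to uniform priors on the variances).
    """
    rng = np.random.default_rng() if rng is None else rng
    J = len(y_groups)
    n_j = np.array([len(g) for g in y_groups])
    N = n_j.sum()
    beta0 = np.mean(np.concatenate(y_groups))      # crude starting values
    sig2_u, sig2_e = 1.0, 1.0
    u = np.zeros(J)
    out = np.empty((n_iter, 3))
    for t in range(n_iter):
        # Step 1: beta0 | y, u, sig2_e ~ N(beta0_hat, sig2_e / N)
        beta0_hat = sum((g - uj).sum() for g, uj in zip(y_groups, u)) / N
        beta0 = rng.normal(beta0_hat, np.sqrt(sig2_e / N))
        # Step 2: u_j | y, beta0, sig2_u, sig2_e ~ N(u_hat_j, D_hat_j)
        D_j = 1.0 / (n_j / sig2_e + 1.0 / sig2_u)
        u_hat = D_j / sig2_e * np.array([(g - beta0).sum() for g in y_groups])
        u = rng.normal(u_hat, np.sqrt(D_j))
        # Step 3: 1/sig2_u ~ Gamma((J + nu_u)/2, rate = (nu_u*s2_u + sum u_j^2)/2)
        sig2_u = 1.0 / rng.gamma((J + nu_u) / 2.0,
                                 2.0 / (nu_u * s2_u + (u ** 2).sum()))
        # Step 4: 1/sig2_e ~ Gamma((N + nu_e)/2, rate = (nu_e*s2_e + sum e_ij^2)/2)
        sse = sum(((g - beta0 - uj) ** 2).sum() for g, uj in zip(y_groups, u))
        sig2_e = 1.0 / rng.gamma((N + nu_e) / 2.0, 2.0 / (nu_e * s2_e + sse))
        out[t] = beta0, sig2_u, sig2_e
    return out
```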

79 Having found the four sets of conditional distributions, it is now simple enough to program up the algorithm and compare via simulation the various prior distributions 432 Simulation method In this simulation experiment Iwant to compare the maximum likelihood methods IGLS and RIGLS with the Gibbs sampler method using several dierent prior distributions for the variance parameters For ease of terminology I will consider this two level dataset in an educational setting and use pupils within schools as the units under consideration I will now consider what parameters in the above model are important and run several dierent sets of simulations using dierent values for these parameters For each set of parameter values 1000 simulated datasets will be generated and each method will be used to t the variance components model to each dataset I will then compare how the methods perform for each group of parameter values These comparisons will be of two sorts Firstly how biased the methods are and secondly how well the condence intervals they produce cover the data The rst two parameters I consider as importantwillinuencethe structure of the study Firstly I will consider the size of the study There are J schools, each with n j pupils to give a dataset of size N I will consider changing J, the number of schools included in the study, and consequently modifying N to reect this change Secondly the number of pupils in each school will be considered Here two schemes will be adopted, rstly having equal numbers of pupils in each school and secondly having a more widely spread distribution of pupils in each school To use a realistic scenario I will consider the JSP dataset introduced in earlier chapters If slight modications are made by removing 1 pupil at random from each of the 23 largest schools then there will be 864 students, an average of 18 students per school The number of schools included in the study can then be varied and schools can be chosen so that the average pupils per school is maintained at 18 and the sizes of the individual schools are well spread I will consider four sizes of study, 6, 12, 24 and 48 schools with a total of 108, 216, 432 and 864 pupils respectively 62

80 The 8 study designs are included in Table 41 below The individual schools in the cases with unequal n j, were chosen to resemble the actual (skewed) distribution of class size in the JSP data Table 41: Summary of study designs for variance components model simulation Study Pupils per School N The other variables that need varying are the true values given to the parameters of interest, 0 u 2 and e 2 The xed eect, 0, is not of great interest and so will be xed at 300 and not modied The two variances are more interesting and I will choose three possible values for each of these parameters The between schools variance, u 2 will take values 10, 100 and 400, and for the between pupils variation, e 2 the values 100, 400 and 800 will be used I will assume that e 2 is always greater than or equal to u 2 as this is more likely in the educational scenario I will consider the eight dierent study designs with true values that are most like the original JSP model, that is, 0 = 30 u 2 = 10 and e 2 = 40 I will then only consider study design 7, which is similar to the actual JSP dataset and modify the true values of the variance parameters This will make in total 15 dierent simulation settings 63

81 Creating the simulation datasets For the variance components model, creating the simulation datasets is easy as the only data that need generating are the values of the response variable for the N pupils Considering the case of 864 pupils within 48 schools the procedure is as follows : 1 Generate 48 u j s, one for each school, bydrawing from a normal distribution with mean 0 and variance u 2 2 Generate 864 e ij s, one for eachpupil,by drawing from a normal distribution with mean 0 and variance e 2 3 Evaluate Y ij = 0 + u j + e ij for all 864 pupils This will generate one simulation dataset for the current parameter values This dataset is then tted using each method, and the whole procedure is repeated 1000 times The datasets will be generated using a short C program The Gibbs sampling routines using the various prior distributions will be run using the BUGS package while the IGLS and RIGLS estimates will be calculated using MLwiN The main reason to use BUGS and not MLwiN to perform the Gibbs sampling runs is due to the computing resources available to me I have access to several UNIX Sun workstations on which I can run BUGS in parallel but I only have one PC to use for MLwiN and this machine is also used for MLwiN development work Comparing methods To compare the bias of each method I can nd the mean values of each parameter estimate over the 1000 runs and the standard errors of these means These means can then be compared with the true answer To compare the coverage, I want to nd how many of the 1000 runs contain the true value in an x% condence interval Ideally x% of the runs will contain the true value in an x% condence interval In particular I will concentrate on the 90% and 95% condence intervals as these are the most used condence levels For the Gibbs sampling methods, it is easy to calculate how many of the I iterations in each run are larger than the true value and so I can then nd whether or not the true value lies in an x% credible interval without actually calculating all the credible intervals I will consider three proper priors, the Pareto(1,c) prior 64

I will consider three proper priors: the Pareto(1, c) prior for 1/σ², the gamma(ε, ε) prior for 1/σ², and the scaled inverse chi-squared prior for σ² mentioned in the earlier section of this chapter. The gamma(ε, ε) estimate has been used as the parameter for the scaled inverse chi-squared prior. For a particular method, the same prior will be used for both σ²_u and σ²_e. For IGLS, we can assume normality for the fixed effects parameter and calculate a normal x% confidence interval. For the variances it is not so clear how to calculate confidence intervals, so for now I will assume normality, and later I will consider another method that improves on this assumption.

Preliminary analysis for Gibbs sampling

To gauge how long the simulations will take, and consequently how many iterations I can afford to run for each model, I have performed two preliminary tests. Firstly I ran each of the 8 study designs with a generated dataset for 50,000 iterations on a fast machine. This test will give an estimate of how long the simulation studies will take. The results are in Table 4.2.

Table 4.2: Summary of times for Gibbs sampling in the variance components model with different study designs for 50,000 iterations
Study CPU time Act time N
1 59s 64s s 68s s 128s s 107s s 203s s 203s s 398s s 411s 864

Secondly, to calculate how long to run each model I need to consider when the model has produced estimates to a given accuracy. For this I will consider the Raftery-Lewis diagnostic at two percentiles, the standard 2.5 percentile and the median. I will use a value for r in the Raftery-Lewis notation of 0.01 instead of 0.005, which is the default. This is because it takes a lot longer to obtain the same accuracy for the median as for the 2.5 percentile. The results in Table 4.3 are the average values for N̂ from 10 runs of the Gibbs

83 sampler, each of length 10,000 and with a burn-in of 1,000 The true values used are 0 =30 2 u =10 and 2 e =40 Table 43: Summary of Raftery Lewis convergence times (thousands of iterations) for various studies Study Prior 0 (25/50) u(25/50) 2 e(25/50) 2 1 Gamma 10/67 8/146 1/12 1 Pareto 20/138 3/87 1/10 2 Gamma 13/79 8/69 1/11 2 Pareto 19/168 2/91 1/10 3 Gamma 6/76 10/34 1/10 3 Pareto 8/93 2/28 1/10 5 Gamma 5/77 2/28 1/10 5 Pareto 5/90 2/27 1/10 7 Gamma 5/73 1/16 1/10 7 Pareto 5/79 1/15 1/10 The large ^N values for the median of 0 are due to large serial correlation and do not decrease much with the size of the study The large ^N values for u 2 are due to the skewness of the posterior distribution which decreases as the number of schools increases Although it is important to ensure the estimates have converged to a given accuracy, this has to be balanced by the fact that many simulation datasets are to be generated for each situation and time is therefore constrained It is important to realise that the models under study are in common usage and are known to be unimodal and so longer runs are only used to establish accuracy in the estimates and not convergence of the chain As I will be running lots of simulations and then averaging the results the accuracy of individual simulations does not have to be so rigid The number of iterations in each simulation run will depend on the size of the study The smaller studies need longer to converge but take less time per iteration so can be run for longer The lengths of simulation runs I chose are in Table 44 below Even with the modest values for N with the larger sample sizes, the full simulation took 3 months to run using 3 Sun machines 66

84 Table 44: Summary of simulation lengths for Gibbs sampling the variance components model with dierent study designs Study Length N 1 50, , , , , , , , Results : Bias In Table 45 the estimates of the relative bias for each method, obtained from the simulations are given for the eight dierent study designs It can be seen immediately from the table and Figures 4-2(i) and (ii) that the RIGLS method gives the smallest bias on almost all study designs The IGLS method generally underestimates the variances and in particular the level 2 variance All the `noninformative' MCMC methods overestimate the variances and the Pareto prior does particularly badly It is no great surprise that the level 1 variance is better estimated, and has smaller percentage biases than the level 2 variance as there are far more pupils than schools This reduction in relative bias is due to the bias of the estimates being inversely related to the number of observations they are based on The study design also has a signicant eect on how the methods perform As the number of schools is increased and consequently the number of students is also increased, all the methods give better estimates The eect of using a balanced design as opposed to an unbalanced design is unclear The IGLS, RIGLS and Pareto prior methods give better estimates with a balanced design but the other MCMC priors generally give worse estimates The size of the dataset has far more eect on the estimates than whether the design is balanced In Table 46, and Figures 4-2(iii) and (iv), study design 7, which is the design most like the original JSP dataset and is a design where most methods perform well has been considered in more detail The true values for the variance 67

85 Table 45: Estimates of relative bias for the variance parameters using dierent methods and dierent studies True level 2/1 variance values are 10 and 40 Study IGLS RIGLS Gamma SI 2 Pareto Number ( ) prior prior (1 ) prior Level 2 variance relative bias % (Monte Carlo SE) 1 {2264 (205) {097 (248) 4905 (405) 5090 (403) 4813 (1023) 2 {2007 (201) 003 (242) 5141 (398) 5275 (396) 4498 (966) 3 {1186 (157) {099 (172) 1836 (217) 1866 (216) 7489 (271) 4 {975 (138) 040 (151) 2027 (208) 2047 (207) 7088 (260) 5 {237 (113) 310 (118) 1198 (130) 1200 (130) 3077 (142) 6 {411 (112) 102 (116) 970 (128) 971 (128) 2672 (141) 7 {214 (085) 052 (086) 469 (090) 470 (090) 1248 (095) 8 {202 (081) 053 (082) 475 (086) 475 (086) 1204 (090) Level 1 variance relative bias % (Monte Carlo SE) 1 {042 (0453) {042 (0453) 279 (0473) 261 (0471) 347 (0470) 2 {045 (0458) {041 (0458) 278 (0478) 262 (0476) 358 (0476) 3 {002 (0320) {003 (0320) 163 (0328) 159 (0328) 202 (0325) 4 {016 (0323) {016 (0323) 143 (0322) 140 (0321) 194 (0319) 5 {031 (0223) {031 (0223) 028 (0224) 028 (0224) 066 (0224) 6 {015 (0223) {015 (0223) 042 (0224) 042 (0224) 083 (0224) 7 {004 (0158) {004 (0158) 025 (0158) 025 (0158) 042 (0158) 8 {009 (0158) {009 (0158) 019 (0158) 019 (0158) 038 (0158) parameters have then been modied It can be seen that the IGLS method again underestimates the true values and that the RIGLS method corrects for this What is surprising is the eects obtained when the level 2 variance is set to values much less than the level 1 variance Here the MCMC methods with the gamma and SI 2 priors now underestimate the level 2 variance The corresponding level 1 variance is still over-estimated and so perhaps some of the level 2 variance is being estimated as level 1 variance The Pareto prior biases are not similarly aected by modifying the true values of the variances In fact the percentage bias in the estimate of the level2variance increases when the true value of level 1variance is increased So to summarise the RIGLS method gives the least biased estimates in all the scenarios studied The IGLS method underestimates the variance parameters while all the MCMC methods overestimate the variance except when the true value of e 2 is much greater than the true value of u 2 Of the MCMC methods, 68

Figure 4-2: Plots of biases obtained for the various methods against study design and parameter settings. Panels (i) and (ii) show relative % bias for the level 2 and level 1 variances against simulation design; panels (iii) and (iv) show the same against parameter settings (level 2:level 1 variance combinations 1:10 through 40:80). Methods plotted: IGLS, RIGLS, Gamma, SI chi-squared and Pareto.

87 Table 46: Estimates of relative bias for the variance parameters using dierent methods and dierent true values All runs use study design 7 Level 2/1 IGLS RIGLS Gamma SI 2 Pareto Variances ( ) prior prior (1 ) prior Level 2 variance relative bias % (Monte Carlo SE) 1/10 {310 (108) 036 (111) 315 (119) 328 (120) 1592 (122) 1/40 {654 (207) 031 (214) {1848 (214) {2001 (218) 3957 (218) 1/80 {343 (302) 718 (316) {2281 (254) {2460 (261) 8453 (311) 10/10 {171 (072) 053 (073) 494 (076) 495 (076) 1056 (079) 10/40 {214 (085) 052 (086) 469 (090) 470 (090) 1248 (095) 10/80 {277 (101) 043 (103) 368 (109) 369 (109) 1484 (113) 40/40 {171 (072) 053 (073) 494 (076) 495 (076) 1056 (079) 40/80 {188 (076) 050 (077) 490 (081) 491 (081) 1123 (085) Level 1 variance relative bias % (Monte Carlo SE) 1/10 {004 (016) {004 (016) 043 (016) 042 (016) 048 (016) 1/40 {008 (016) {007 (016) 073 (016) 076 (016) 024 (016) 1/80 {018 (016) {016 (016) 064 (016) 066 (016) 017 (016) 10/10 {015 (016) {015 (016) 019 (016) 019 (016) 042 (016) 10/40 {004 (016) {004 (016) 025 (016) 025 (016) 042 (016) 10/80 {004 (016) {004 (016) 033 (016) 032 (016) 042 (016) 40/40 {015 (016) {015 (016) 019 (016) 019 (016) 042 (016) 40/80 {005 (016) {005 (016) 021 (016) 021 (016) 042 (016) the Pareto prior gives far greater bias than the other priors 434 Results : Coverage probabilities and interval widths The Bayesian MCMC methods are not designed specically to give unbiased estimates In the Bayesian framework, interval estimates and coverage probabilities are considered more important The maximum likelihood IGLS and RIGLS methods are not ideally suited for nding interval estimates and coverage probabilities as additional assumptions now have to be made to create intervals In the following tables I have made the assumption that all the parameters have Gaussian distributions In the case of the level 2 variance parameter this is a very implausible assumption and I will consider an alternative assumption later in this chapter I will rstly consider the xed eect parameter, 0 where it is plausible to 70

88 assume there is a Gaussian posterior distribution Table 47 contains the coverage probabilities for the xed eect parameter using the eight dierent study designs and Table 48 contains the corresponding interval widths It can be seen in Table 47 that the gamma and SI 2 priors perform signicantly better than the RIGLS method All three methods have actual coverage that is too small and the IGLS method gives the smallest coverage The Pareto method gives actual coverage that is far too big for the smaller studies but gives better coverage as study size increases In fact all methods perform better the larger the study and generally the coverage is slightly better when the design is balanced Table 48 echoes the results in Table 47 in that the gamma and SI 2 intervals are on average slightly wider than the IGLS and RIGLS intervals which leads to better coverage The Pareto prior gives intervals in the smaller studies that are on average almost twice as wide as the other methods and consequently gives too much coverage As the studies get larger the Pareto intervals get closer in size to the other methods and the method performs better In Table 49 when only study 7 is considered there are far smaller dierences between the various methods The Pareto prior is doing slightly better than all the other methods, while the other MCMC methods are generally improving on the maximum likelihood based methods It can be seen here that when the level 2 variance is much smaller than the level 1 variance, the gamma and SI 2 priors do worse than the IGLS and RIGLS methods Table 410 shows the corresponding interval widths which are very similar for all methods Table 411 considers the coverage for the level 2 variance parameter, 2 u using the eight dierent study designs It can be seen that there is far greater discrepancy between the maximum likelihood methods and the MCMC methods for this parameter but this is to some extent due to the normality assumption RIGLS and IGLS give coverage probabilities that are much smaller than the actual coverage should be The Pareto prior does better than the other priors when the study size is small but there is very little to choose between the priors as the size gets larger All methods give coverage probabilities that are smaller than they should be Table 412 shows that the Pareto prior has average interval widths that are four times larger than the other priors for studies 1 and 2 Thesizeoftheaverage 71

89 Table 47: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the xed eect parameter using dierent methods and dierent studies True values for the variance parameters are 10 and 40 Approximate MCSEs are 028%/015% for 90%/95% coverage estimates Study IGLS RIGLS Gamma SI 2 Pareto Number ( ) prior prior (1 )prior 1 817/ / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /957 Table 48: Average 90%/95% interval widths for the xed eect parameter using dierent studies True values for the variance parameters are 10 and 40 Study IGLS RIGLS Gamma SI 2 Pareto Number ( ) prior prior (1 )prior 1 415/ / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /208 Table 49: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the xed eect parameter using dierent methods and dierent true values All runs use study design 7 Approximate MCSEs are 028%/015% for 90%/95% coverage estimates Level 2/1 IGLS RIGLS Gamma SI 2 Pareto Variances ( ) prior prior (1 ) prior 1/10 884/ / / / /939 1/40 879/ / / / /953 1/80 883/ / / / /961 10/10 890/ / / / /951 10/40 887/ / / / /942 10/80 885/ / / / /940 40/40 890/ / / / /951 40/80 888/ / / / /947 72

90 Table 410: Average 90%/95% interval widths for the xed eect parameter using dierent true parameter values All runs use study design 7 Level 2/1 IGLS RIGLS Gamma SI 2 Pareto Variances ( ) prior prior (1 ) prior 1/10 060/ / / / /076 1/40 086/ / / / /111 1/80 112/ / / / /146 10/10 153/ / / / /196 10/40 167/ / / / /210 10/80 182/ / / / /230 40/40 306/ / / / /391 40/80 316/ / / / /403 Table 411: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 2 variance parameter using dierent methods and dierent studies True values of the variance parameters are 10 and 40 Approximate MCSEs are 028%/015% for 90%/95% coverage estimates Study IGLS RIGLS Gamma SI 2 Pareto Number ( ) prior prior (1 )prior 1 683/ / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /932 Table 412: Average 90%/95% interval widths for the level 2 variance parameter using dierent studies True values of the variance parameters are 10 and 40 Study IGLS RIGLS Gamma SI 2 Pareto Number ( ) prior prior (1 ) prior / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /

91 Table 413: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 2 variance parameter using dierent methods and dierent true values All runs use study design 7 Approximate MCSEs are 028%/015% for 90%/95% coverage estimates Level 2/1 IGLS RIGLS Gamma SI 2 Pareto Variances ( ) prior prior (1 ) prior 1/10 861/ / / / /930 1/40 840/ / / / /955 1/80 777/ / / / /959 10/10 862/ / / / /928 10/40 874/ / / / /930 10/80 863/ / / / /931 40/40 862/ / / / /928 40/80 869/ / / / /929 Table 414: Average 90%/95% interval widths for the level 2 variance parameter using dierent true parameter values All runs use study design 7 Level 2/1 IGLS RIGLS Gamma SI 2 Pareto Variances ( ) prior prior (1 ) prior 1/10 107/ / / / /153 1/40 200/ / / / /299 1/80 293/ / / / /471 10/10 706/ / / / /997 10/40 833/ / / / / /80 991/ / / / / / / / / / / / / / / / /4245 Table 415: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 1 variance parameter using dierent methods and dierent studies True values of the variance parameters are 10 and 40 Approximate MCSEs are 028%/015% for 90%/95% coverage estimates Study IGLS RIGLS Gamma SI 2 Pareto Number ( ) prior prior (1 )prior 1 878/ / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /943 74

92 Table 416: Average 90%/95% interval widths for the level 1 variance parameter using dierent studies True values of the variance parameters are 10 and 40 Study IGLS RIGLS Gamma SI 2 Pareto Number ( ) prior prior (1 ) prior / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /776 Table 417: Comparison of actual coverage percentage values for nominal 90% and 95% intervals for the level 1 variance parameter using dierent methods and dierent true values All runs use study design 7 Approximate MCSEs are 028%/015% for 90%/95% coverage estimates Level 2/1 IGLS RIGLS Gamma SI 2 Pareto Variances ( ) prior prior (1 ) prior 1/10 891/ / / / /951 1/40 894/ / / / /956 1/80 896/ / / / /955 10/10 886/ / / / /957 10/40 890/ / / / /957 10/80 891/ / / / /958 40/40 886/ / / / /957 40/80 889/ / / / /957 Table 418: Average 90%/95% interval widths for the level 1 variance parameter using dierent true parameter values All runs use study design 7 Level 2/1 IGLS RIGLS Gamma SI 2 Pareto Variances ( ) prior prior (1 ) prior 1/10 163/ / / / /194 1/40 648/ / / / /771 1/ / / / / / /10 163/ / / / /194 10/40 651/ / / / /777 10/ / / / / / /40 651/ / / / /776 40/ / / / / /

93 intervals for the maximum likelihood methods may be articially small due to the assumption of normality When the values of the variance parameters are modied and the study design is design 7, it can be seen (Table 413) that the Pareto prior still has the best coverage intervals The other MCMC methods do better than the maximum likelihood based methods except when the true value of u 2 is much smaller than the true value of e, 2 when the situation is reversed This point is emphasised in Table 414 where it can be seen that the intervals for the gamma and SI 2 are narrower than for the other methods when u 2 =1 and e 2 = 40 or 80 The Pareto prior intervals are slightly wider which explains why they cover better When comparing the coverage for the level 1 variance, e 2 in Table 415 it is easy to see that there is little to choose between the methods When the study is small the MCMC methods do slightly better than the maximum likelihood methods and overall the Pareto prior generally gives the best coverage This parameter has much more accurate coverage intervals than the other parameters In Table 416 it can be seen that the MCMC intervals are wider when there are only 108 pupils but as the size of the study gets bigger the intervals become virtually identical The same behaviour would probably be seen with the level 2 variance if the number of schools was made a lot larger Table 417 shows the coverage probabilities when we only consider study design 7 Here there is nothing to choose between any of the methods and all the coverage probabilities are very good Table 418 conrms this by showing that the intervals from all the models are virtually identical 435 Improving maximum likelihood method interval estimates for 2 u From the above results it is clear that the Gaussian distribution is a bad approximation to use when calculating interval estimates for the level 2 variance parameter, u 2 I have shown in the Gibbs sampling algorithm that the true conditional distribution for u 2 is an inverse gamma distribution and I will now try and use this fact to construct condence intervals for u 2 based on the inverse gamma distribution 76

For each of the 1,000 simulations generated, RIGLS gives an estimate of σ²_u, σ̂²_u, and a variance for σ̂²_u, var(σ̂²_u). Now if I use the assumption that σ²_u has an inverse gamma (α, β) distribution, then the mean and variance of σ²_u can be used to calculate the appropriate distribution as follows:

σ̂²_u = β / (α − 1)   ⟹   β = σ̂²_u (α − 1),

var(σ̂²_u) = β² / [(α − 1)² (α − 2)] = σ̂⁴_u / (α − 2)

⟹   α = σ̂⁴_u / var(σ̂²_u) + 2   and   β = σ̂⁶_u / var(σ̂²_u) + σ̂²_u.

Now having found the required inverse gamma distribution, to construct an x% confidence interval the quantiles from the distribution need to be found. Instead of using an inverse gamma distribution directly, the equivalent gamma distribution is used and the points are inverted:

x% CI = ( 1 / Gam_{1−x/2}(α, β) ,  1 / Gam_{x/2}(α, β) ).

The results obtained using the inverse gamma approach for the RIGLS method can be seen in Table 4.19. In comparing these results with the results obtained using a Gaussian interval for the level 2 variance (Tables 4.11 to 4.14) two points emerge. Firstly the inverse gamma method generally gives worse coverage at the 90% level than the Gaussian method but better coverage at the 95% level. This is probably due to the skewed form of the confidence interval. The exception to this rule is when the level 2 variance is small, when the inverse gamma method does much worse than the Gaussian method. This may now explain why the MCMC methods do badly in these cases, as the Gaussian interval for RIGLS was an unfair comparison, and we now see that the MCMC methods are performing better in terms of coverage than the maximum likelihood methods. Secondly, the inverse gamma intervals are on average much narrower than the Gaussian intervals.
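A sketch of this interval construction (in Python with scipy; the estimate and variance shown are placeholders, whereas the thesis used the RIGLS output from MLwiN):

```python
import numpy as np
from scipy import stats

def inv_gamma_interval(est, var_est, level=0.95):
    """Interval for a variance from a point estimate and its sampling variance,
    assuming the variance follows an inverse gamma distribution (method of moments)."""
    alpha = est**2 / var_est + 2.0          # shape: mean^2/variance + 2
    beta = est * (alpha - 1.0)              # scale: mean * (shape - 1)
    # 1/sigma^2 ~ Gamma(alpha, rate=beta); invert its quantiles to get the interval.
    tail = (1.0 - level) / 2.0
    upper = 1.0 / stats.gamma.ppf(tail, a=alpha, scale=1.0 / beta)
    lower = 1.0 / stats.gamma.ppf(1.0 - tail, a=alpha, scale=1.0 / beta)
    return lower, upper

# Placeholder RIGLS-style output for sigma^2_u: estimate 10.0 with variance 25.0.
print(inv_gamma_interval(10.0, 25.0, level=0.90))
print(inv_gamma_interval(10.0, 25.0, level=0.95))
```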

Table 4.19: Summary of results for the level 2 variance parameter, σ²_u, using the RIGLS method and inverse gamma intervals
Study Number / Level 2/1 Variances: coverage probabilities (90%/95%) and average interval widths (90%/95%)

4.3.6 Summary of results

Although these results don't look great for the Gibbs sampling methods in terms of bias, particularly for the smaller datasets, what the reader must realise is that even the largest JSP dataset studied, with 48 schools, is a small dataset in multi-level modelling terms. If I were to consider larger datasets there would be less bias and better agreement between the methods over coverage of intervals.

I have also only considered using the chain mean as a parameter estimate. As an alternative I could have considered the median or mode of the parameter, which for the variances will give smaller estimates and hence less bias.

Overall the MCMC methods appear to have an edge in terms of coverage probabilities, particularly when the inverse gamma intervals are used for the IGLS/RIGLS methods. The main danger is that people who are currently using maximum likelihood methods may decide to use the MCMC methods only on the smaller datasets, as in computational terms the MCMC methods take much longer than the other methods. The other danger is that people who are unfamiliar with multi-level modelling will not realise that, due to the structure of the problems studied, it is not only the number of level 1 units that is important but also the number of level 2 units, particularly when estimating the level 2 variance. Models with 6 or even 12 level 2 units would probably not be considered large enough for multi-level modelling. To compare the multivariate priors described in the earlier section I will now consider a second model.

4.4 Random slopes regression model

In order to compare the various priors for a variance matrix, I will now need to consider a more complicated model. One of the simplest multi-level models that includes a variance matrix is the random slopes regression model introduced in the last chapter. The random slopes regression model can be written mathematically as:

y_ij = β_0 + β_1 X_ij + u_0j + u_1j X_ij + e_ij,
u_j = (u_0j, u_1j)^T ∼ MVN(0, V_u),    e_ij ∼ N(0, σ²_e),

where i = 1, ..., n_j, j = 1, ..., J and Σ_j n_j = N. I will re-express the first line of this model as follows:

y_ij = β_0 X_0ij + β_1 X_1ij + u_0j X_0ij + u_1j X_1ij + e_ij,

where X_0ij is constant and X_1ij = X_ij in the previous notation. I will now use X_ij to mean the vector (X_0ij, X_1ij), as this change will help simplify the expressions in the algorithms that follow.

4.4.1 Gibbs sampling algorithm

As with the variance components model the parameters can be split into four groups: the fixed effects, β, the level 2 residuals, u_j, the level 2 variance matrix, V_u, and the level 1 variance, σ²_e. The conditional posterior distributions can be found using similar methodology to that used for the variance components model, so I will just outline the posteriors without explaining how to obtain them.

Prior distributions

I will assume a uniform prior for the fixed effect parameters β_0 and β_1. The level 1 variance σ²_e will take various univariate priors in the simulations; in the algorithm I will use a general scaled inverse χ² prior with parameters ν_e and s²_e. The level 2 variance matrix will similarly take various multivariate priors, and so I will assume a general Wishart prior with parameters ν_p and S_p for the precision matrix at level 2. All the priors in the earlier section can then be obtained from particular values of these parameters. The algorithm is then as follows:

Step 1: p(β | y, V_u, σ²_e, u)

Let β ∼ N(β̂, D̂); then to find β̂ and D̂:

p(β | y, V_u, σ²_e, u) ∝ ∏_{i,j} (1/σ²_e)^{1/2} exp[ −(1/(2σ²_e)) (y_ij − X_ij u_j − X_ij β)² ]

∝ exp[ −(1/(2σ²_e)) β^T ( Σ_{i,j} X_ij^T X_ij ) β + (1/σ²_e) β^T Σ_{i,j} X_ij^T (y_ij − X_ij u_j) + const ].

Comparing this with the form of a multivariate normal distribution and matching powers of β gives

D̂ = σ²_e [ Σ_{i,j} X_ij^T X_ij ]^{-1}

and

β̂ = [ Σ_{i,j} X_ij^T X_ij ]^{-1} Σ_{i,j} X_ij^T (y_ij − X_ij u_j) = (D̂ / σ²_e) Σ_{i,j} X_ij^T (y_ij − X_ij u_j).

Step 2: p(u_j | y, β, V_u, σ²_e)

Let u_j ∼ N(û_j, D̂_j); then to find û_j and D̂_j:

p(u_j | y, β, V_u, σ²_e) ∝ ∏_{i=1}^{n_j} (1/σ²_e)^{1/2} exp[ −(1/(2σ²_e)) (y_ij − X_ij β − X_ij u_j)² ] · |V_u|^{-1/2} exp[ −(1/2) u_j^T V_u^{-1} u_j ]

∝ exp( −(1/2) u_j^T [ Σ_{i=1}^{n_j} X_ij^T X_ij / σ²_e + V_u^{-1} ] u_j + (1/σ²_e) u_j^T Σ_{i=1}^{n_j} X_ij^T (y_ij − X_ij β) + const ).

Comparing this with the form of a multivariate normal distribution and matching powers of u_j gives

D̂_j = [ Σ_{i=1}^{n_j} X_ij^T X_ij / σ²_e + V_u^{-1} ]^{-1}

and

û_j = (D̂_j / σ²_e) Σ_{i=1}^{n_j} X_ij^T (y_ij − X_ij β).

Step 3: p(V_u | y, β, u, σ²_e)

Consider instead p(V_u^{-1} | y, β, u, σ²_e) and let V_u^{-1} ∼ Wishart(ν_u, S_u). Then letting p(V_u^{-1}) ∼ Wishart(ν_p, S_p) gives

99 p(v ;1 u j y u 2 e) / JY j=1 jv u j ; 1 2 exp(; 1 2 ut j V ;1 u u j )p(v ;1 u ) / jv u j ; J 2 exp(; 1 2 JX j=1 u T j V ;1 u u j ) jv u ;1 j (p;3)=2 exp(; 1 2 tr(s;1 p V u ;1 )) / jv u j (J+p;3)=2 exp(; 1 JX 2 tr(( u T j u j + S p ;1 )V u ;1 )): Then comparing this with the form of a Wishart distribution produces u = J + p and S u =( JX j=1 j=1 u T j u j + S ;1 p ) ;1 : The uniform prior on V u is equivalent to p = ;3 S p =0 Step 4 p( 2 e j y u V u ) Consider instead p(1= 2 e j y u V u ) and let 1= 2 e gamma(a e b e ) Then p(1= 2 e)=(1= 2 e) ;2 p( 2 e)andso Y p(1=e 2 j y u V u ) / ( 1 1 ) 1 i j e 2 2 exp[; (y 2e 2 ij ; X ij ; X ij u j ) 2 ]( 1 ) ;2 p( e) 2 e 2 / ( 1 ) ( N e e 2 ;1) exp[; 1 X ( (y 2e 2 ij ; X ij ; X ij u j ) 2 + e s 2 e)]: i j Then comparing this with the form of a gamma distribution produces a e = N + e 2 and b e = 1 2 ( es 2 e + X i j e 2 ij): A uniform prior on e, 2 or the equivalent Pareto prior is equivalent to e = ;2 s 2 e = 0 A uniform prior on log e 2 is equivalent to e = 0 s 2 e = 0 and a gamma( ) prior for 1=e 2 is equivalent to e =2 s 2 e =1 Having found the four sets of conditional distributions, it is now simple 82

4.4.2 Simulation method

Following on from the variance components model simulation, I now want to extend the comparisons by considering the random slopes regression model. The random slopes regression model has more parameters than the variance components model, so I could quite easily study even more designs than the 15 studied using the variance components model. Due to time constraints and to avoid too much repetition, only two main areas of interest will be considered. I will firstly again consider the study design of the model, looking at two sizes of study design and whether the design is balanced or unbalanced; these will correspond to designs 3, 4, 7 and 8 in Table 4.1. The second area of interest is due to the new model design. When looking at the variance components model I considered the effect of varying the true values of the two variance parameters. Now that there is a variance matrix at level 2, I will consider the effect of varying the correlation between the two random parameters at level 2. I will consider five different scenarios: firstly when the two variables are uncorrelated, which I will consider using all four study designs. The other four scenarios will have both large and small correlations that are positive and then negative. These correlations will only be considered using study design 7, which is similar to the actual JSP dataset. This will give a total of 8 designs. The true values for all the parameters apart from the level 2 covariance term will be the same for each design, as follows: β_0 = 30.0, β_1 = 0.5, σ_u00 = 5.0, σ_u11 = 0.5 and σ²_e = 30.0. The level 2 covariance, σ_u01, will be set to a value c to give the required correlation. For each set of parameter values I will generate 1000 simulated datasets and fit the random slopes regression model to each dataset using each method.

Creating the simulation datasets

Creating the simulation datasets is also easy for the random slopes regression model. The only data that need to be generated are the values of the response variable for the N pupils. The second variable X_ij will be fixed throughout the dataset.

Considering the case of 864 pupils within 48 schools, the procedure is as follows:

1. Generate 48 pairs (u_0j, u_1j), one for each school, by drawing from a multivariate normal distribution with mean 0 and variance matrix V_u.
2. Generate 864 values e_ij, one for each pupil, by drawing from a normal distribution with mean 0 and variance σ²_e.
3. Evaluate Y_ij = β_0 + β_1 X_ij + u_0j + u_1j X_ij + e_ij for all 864 pupils.

This will generate one simulation dataset for the current parameter values. This dataset is then fitted using each method, and the whole procedure is repeated 1000 times. The datasets will be generated using a short C program.

Comparison of methods

The priors to be considered for the level 2 variance are the uniform prior on the V_u scale, and the two Wishart priors for the level 2 precision described in the earlier section. The uniform prior method for V_u will be run using MLwiN and the other two priors using the BUGS package. The maximum likelihood IGLS and RIGLS methods will also be run using MLwiN. Again the main reason for running the bulk of the Gibbs sampling runs using BUGS is computing resources; BUGS however cannot fit the uniform prior as it is improper, and so this will be carried out using MLwiN. The estimates from RIGLS will be used as prior estimates for the `data driven' Wishart prior. The lengths of the `burn-in' and main runs of these simulations will be the same as for the equivalent study designs fitting the variance components model. The bias and coverage probabilities will be worked out in the same way as for the variance components model. The level 1 variance is not of great interest here, and so I have used the gamma(ε, ε) prior when using BUGS and the uniform prior when using MLwiN, as these are the defaults.

Preliminary analysis for IGLS and RIGLS

With the added complexity of the random slopes regression model, the IGLS and RIGLS methods occasionally, for certain datasets, have problems fitting the model. These problems can be of two types. Firstly the method may at some iteration generate an estimate for a variance matrix that is not positive definite.

Secondly the method may not converge before the maximum number of iterations has been reached. Generally what is happening in the second case is that the method is cycling between several estimates, and so increasing the maximum number of iterations will not help (see Figure 4-3).

Figure 4-3: Trajectories plot of IGLS estimates for a run of the random slopes regression model where convergence is not achieved.

The MLn command CONV (Rasbash and Woodhouse 1995) will return whether the estimation procedure has converged, and if not, which of the above two reasons is the problem. As the maximum likelihood methods are fast to run, I have run the random slopes regression model using several simulation studies to identify how well the IGLS and RIGLS methods perform in different scenarios. The results are given in Table 4.20; the studies marked with a star will be used in the main analysis. Table 4.20 shows that several factors influence how well the maximum likelihood methods perform. Firstly it should be noted that as the study size gets bigger, and consequently the number of level 2 units increases, the number of datasets that the maximum likelihood methods fail on is minimal. The number of datasets where the methods have problems increases when the size of study is decreased, and dramatically increases when the design is unbalanced. Also the correlation between the two variables at level 2 is important: if the two variables are highly correlated, either

103 Table 420: Summary of the convergence for the random slopes regression with the maximum likelihood based methods (IGLS/RIGLS) The study design is given in terms of the number of level 2 units and whether the study is balanced (B) or unbalanced (U) Study u01 Con NCon Not Posdev 3 (12U) {14 623/ /349 21/77 3 (12U) {05 902/857 93/124 5/19 * 3 (12U) /877 71/116 2/7 3 (12U) /871 91/118 3/11 3 (12U) / /366 12/76 4 (12B) {14 914/903 83/74 3/23 4 (12B) {05 986/985 13/14 1/1 * 4 (12B) /990 9/9 0/1 4 (12B) /991 6/7 0/2 4 (12B) /903 85/72 3/25 * 7 (48U) {14 986/984 13/14 1/2 * 7 (48U) {05 998/998 2/2 0/0 * 7 (48U) /1000 0/0 0/0 * 7 (48U) /1000 0/0 0/0 * 7 (48U) /983 16/15 0/2 8 (48B) {14 994/992 6/6 0/2 8 (48B) {05 999/999 1/1 0/0 * 8 (48B) /1000 0/0 0/0 8 (48B) /1000 0/0 0/0 8 (48B) /992 8/8 0/0 positively or negatively, the number of problem datasets increases Most of the studies chosen for further investigation with the Gibbs sampling methods do not have many problem datasets The study 3 scenario with u01 =0 is the worst with only 877 good datasets For the further analysis I will simply discard any problem datasets and analyse the remaining datasets using all the methods One problem that is not captured by the MLn CONV command is when the nal converged estimate is not positive denite These situations will be included in the converged category in the above table This has a knock-on eect when I consider using the RIGLS estimate as a parameter in the Wishart prior distribution Consequently if the level 2 variance matrix estimate has a 86

104 correlation outside [{1,1], I will reduce the estimate of the co-variance, ^ 01 so that the correlation becomes 0:95 before using it as a prior parameter 443 Results The results for the 8 simulation designs comparing the 2 maximum likelihood methods and the three MCMC methods can be seen in Tables 421 to 428 The columns labelled Wish 1 prior are the results for the Wishart (I 2) prior for the precision matrix, ;1 u The columns labelled Wish 2 prior are the results for the Wishart (^ u 4) prior for the precision matrix, ;1 u The results for the unbalanced design with 48 schools and uncorrelated parameters at level 2 are in Table 421 From this table it can be seen that the results for the two maximum likelihood methods and the uniform prior method are similar to the results already seen for the variance components model The IGLS method tends to underestimate the variance parameters at level 2 and the RIGLS method corrects for this giving the least biased estimates The uniform prior on the other hand tends to overestimate the level 2 variance parameters In terms of coverage probabilities there is little to choose between the RIGLS method and the uniform prior This is probably partly due to the uniform prior method giving larger intervals The two other MCMC methods give interesting results The rst Wishart prior method which has a parameter with value the identity matrix, uses this parameter as a prior guess for the level 2 variance matrix This is clearly shown by the estimate of u00 which has a true value of 5 (greater than 1) being an underestimate, and the estimate of u11 which hasatruevalue of 05 (less than 1) being an overestimate This in turn aects the coverage intervals, as the estimates for u00 give worse coverage than RIGLS while the estimates for u11 give better coverage The second Wishart prior method, based on using the RIGLS estimate of u as a parameter in the prior appears to underestimate all the parameters in the variance matrix This in turn leads to smaller average interval widths and in this case worse coverage than the RIGLS method for virtually all parameters Tables 422 to 425 contain the results when the level 2 covariance parameter, u01 is given dierent true values that give matrices with high and low, positive 87

105 Table 421: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values, u00 = 5 u01 = 0 and u11 =0:5 All 1000 runs Param IGLS RIGLS Wish 1 Wish 2 Uniform (True) prior prior prior Relative % Bias in estimates (Monte Carlo SE) Except values in [ ] which are actual biases due to the true value being 0 0(30:0) 003 (004) 003 (004) {001 (004) 000 (004) 003 (004) 1(0:5) 064 (070) 064 (070) 251 (070) 110 (070) 074 (070) u00(5:0) {288 (092) 024 (094) {300 (098) {764 (097) 2242 (108) u01(0:0) [{001 (001)] [{001 (001)] [{001 (001)] [{001 (001)] [{002 (001)] u11(0:5) {346 (072) {108 (074) 512 (075) {339 (074) 1576 (085) e(30:0) (016) 003 (016) 068 (016) 090 (017) 053 (016) Coverage Probabilities (90%/95%) : Approximate MCSE (028%/015%) 0 897/ / / / / / / / / /953 u00 872/ / / / /932 u01 901/ / / / /957 u11 862/ / / / /932 2 e 894/ / / / /946 Average Interval Widths (90%/95%) / / / / / / / / / /0456 u / / / / /7484 u / / / / /1491 u / / / / / e 5027/ / / / /

106 and negative correlation The parameter percentage biases are plotted in Figures 4-4 and 4-5 against the value of u01 The immediate thing to notice is that changes to the covariance parameter value have, for most methods and most parameters, little overall eect in terms of bias, and the results in Tables 422 to 425 are similar to those obtained in Table 421 The IGLS and RIGLS methods give similar results as before with the RIGLS method giving approximately unbiased estimates This shows that removing the datasets that did not converge does not appear to have had any noticeable eect on the bias of the estimates The uniform prior method gives approximately the same percentage bias for the variance parameters, and the covariance estimates appear (Figure 4-5 (ii)) to have percentage biases that are positively correlated with u01 The rst Wishart prior method does not exhibit the shrinkage towards 1 property for parameter u00 when the correlation is large (in magnitude) It is also noticeable (Figure 4-4 (i)) that the bias of the parameter 0 using this method is proportional to u01 The second Wishart prior method still underestimates the variance parameters at level 2, although the bias is reduced as the correlation is increased (in magnitude) but is approximately unbiased for the covariance term It also gives the largest bias for the level 1 variance parameter for all values of u01 Considering the coverage properties of the ve methods we nd dierences from the variance components model The MCMC methods now no longer give the best results for all parameters In particular the second Wishart prior method has the smallest intervals for most parameters and consequently performs poorly in terms of coverage The rst Wishart prior performs well and gives reasonable coverage for all parameters Although the uniform prior gives estimates that are highly biased it should not be disregarded as it has the best coverage properties for the level 2 variance parameters It does not perform as well when the correlation is increased (in magnitude) The RIGLS method also gives reasonable coverage for most parameters and performs better than for the variance components model When the study design is changed (Tables 426 to 428) the eects are similar to those observed for the variance components model From Figures 4-6 and 4-7 it can be seen that reducing the number of schools in the study increases the 89

107 Table 422: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values, u00 = 5 u01 = 1:4 and u11 =0:5 Only 982 runs Param IGLS RIGLS Wish 1 Wish 2 Uniform (True) prior prior prior Relative % Bias in estimates (Monte Carlo SE) 0(30:0) 002 (004) 003 (004) 007 (004) 003 (004) 003 (004) 1(0:5) 068 (069) 067 (069) 222 (069) 099 (069) 076 (070) u00(5:0) {272 (092) 040 (092) 164 (092) {643 (090) 2286 (104) u01(1:4) {207 (079) 007 (079) {531 (081) {043 (080) 1464 (093) u11(0:5) {326 (071) {094 (071) 796 (072) {263 (071) 1576 (082) e(30:0) (016) 002 (016) {008 (016) 127 (016) 032 (016) Coverage Probabilities (90%/95%) : Approximate MCSE (029%/015%) 0 895/ / / / / / / / / /962 u00 866/ / / / /935 u01 876/ / / / /943 u11 870/ / / / /948 2 e 886/ / / / /957 Average Interval Widths (90%/95%) / / / / / / / / / /0455 u / / / / /7276 u / / / / /1758 u / / / / / e 4998/ / / / /

108 Table 423: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values, u00 = 5 u01 = ;1:4 and u11 =0:5 Only 984 runs Param IGLS RIGLS Wish 1 Wish 2 Uniform (True) prior prior prior Relative % Bias in estimates (Monte Carlo SE) 0(30:0) 002 (004) 002 (004) -008 (004) -001 (004) 003 (004) 1(0:5) {001 (069) {001 (069) 227 (069) 011 (069) {001 (069) u00(5:0) {228 (092) 070 (094) 244 (094) {565 (090) 2390 (106) u01(;1:4) 143 (079) {079 (079) 468 (082) {023 (082) {1586 (093) u11(0:5) {174 (072) 068 (073) 897 (074) {123 (075) 1765 (084) e(30:0) (016) {001 (016) {010 (016) 117 (016) 029 (016) Coverage Probabilities (90%/95%) : Approximate MCSE (029%/015%) 0 895/ / / / / / / / / /961 u00 876/ / / / /929 u01 884/ / / / /953 u11 883/ / / / /936 2 e 892/ / / / /949 Average Interval Widths (90%/95%) / / / / / / / / / /0458 u / / / / /7407 u / / / / /1797 u / / / / / e 5005/ / / / /

109 Table 424: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values, u00 = 5 u01 = 0:5 and u11 =0:5 All 1000 runs Param IGLS RIGLS Wish 1 Wish 2 Uniform (True) prior prior prior Relative % Bias in estimates (Monte Carlo SE) 0(30:0) 002 (004) 002 (004) 002 (004) 001 (004) 003 (004) 1(0:5) 072 (070) 072 (070) 244 (070) 108 (070) 083 (070) u00(5:0) {300 (092) 008 (094) {312 (097) {808 (096) 2210 (108) u01(0:5) {345 (189) {145 (193) {242 (192) {080 (193) 1243 (222) u11(0:5) {379 (072) {142 (073) 507 (074) {361 (073) 1535 (098) e(30:0) (016) 004 (016) 067 (016) 097 (017) 054 (016) Coverage Probabilities (90%/95%) : Approximate MCSE (028%/015%) 0 897/ / / / / / / / / /953 u00 873/ / / / /933 u01 900/ / / / /953 u11 863/ / / / /942 2 e 890/ / / / /946 Average Interval Widths (90%/95%) / / / / / / / / / /0456 u / / / / /7441 u / / / / /1517 u / / / / / e 5026/ / / / /

110 Table 425: Summary of results for the random slopes regression with the 48 schools unbalanced design with parameter values, u00 = 5 u01 = ;0:5 and u11 =0:5 Only 998 runs Param IGLS RIGLS Wish 1 Wish 2 Uniform (True) prior prior prior Relative % Bias in estimates (Monte Carlo SE) 0(30:0) 003 (004) 003 (004) {004 (004) {001 (004) 003 (004) 1(0:5) 049 (069) 049 (069) 252 (069) 106 (069) 057 (070) u00(5:0) {262 (092) 050 (094) {260 (097) {753 (097) 2278 (108) u01(;0:5) 027 (189) {200 (193) {072 (194) {196 (196) {1812 (222) u11(0:5) {300 (072) {061 (073) 568 (074) {293 (073) 1630 (098) e(30:0) (016) 001 (016) 065 (016) 093 (017) 050 (016) Coverage Probabilities (90%/95%) : Approximate MCSE (028%/015%) 0 897/ / / / / / / / / /955 u00 878/ / / / /934 u01 898/ / / / /949 u11 870/ / / / /933 2 e 896/ / / / /948 Average Interval Widths (90%/95%) / / / / / / / / / /0457 u / / / / /7506 u / / / / /1538 u / / / / / e 5025/ / / / /

Figure 4-4: Plots of biases obtained for the various methods fitting the random slopes regression model against the value of σ_u01 (fixed effects parameters and level 1 variance parameter). Panels: (i) estimated % bias in β_0, (ii) estimated % bias in β_1, (iii) estimated % bias in σ²_e, each plotted against the true σ_u01 for IGLS, RIGLS, Wishart prior 1, Wishart prior 2 and the uniform prior.

Figure 4-5: Plots of biases obtained for the various methods fitting the random slopes regression model against the value of σ_u01 (level 2 variance parameters). Panels: (i) estimated % bias in σ_u00, (ii) estimated % bias in σ_u01, (iii) estimated % bias in σ_u11, each plotted against the true σ_u01 for IGLS, RIGLS, Wishart prior 1, Wishart prior 2 and the uniform prior.

percentage bias of all the methods, and in particular the uniform prior method. It is also noticeable that σ²_e is biased high for the two Wishart prior methods, which may explain in part why the level 2 variance parameters are biased low. The coverage properties are changed slightly when the number of schools is reduced to 12. The uniform prior method now has very poor coverage properties: its intervals are too wide and it gives far higher actual percentage coverage than nominal for the 90% and 95% intervals for the fixed effects. For the variance parameters the uniform prior gives highly biased estimates that lead to intervals with lower actual percentage coverage than required. The second Wishart prior method is once again performing poorly, and the best coverage comes from either the RIGLS method or the first Wishart prior method, depending on the parameter.

4.5 Conclusions

Simulation studies comparing various maximum likelihood and empirical Bayes methods have been performed in the past (see for example Kreft, de Leeuw, and van der Leeden (1994)). There has however been very little comparison work between fully Bayesian MCMC methods and maximum likelihood methods. This may be due to the fact that the MCMC methods take a lot longer to perform than the maximum likelihood methods; for example the two sets of simulations in this chapter together took over 6 months to perform, and this was using several machines simultaneously. Although these simulations do highlight some interesting points, it would be useful to run more simulations, particularly to compare the priors for variance matrices. The results obtained in these simulations will now be summarised, and then I will talk about how this has influenced the default prior settings in the MLwiN package.

4.5.1 Simulation results

All the simulations performed in this chapter have been compared in terms of both bias and coverage properties. Two maximum likelihood methods have been included in the simulations for completeness, but the IGLS method almost always performs worse than the RIGLS method and so can probably be disregarded.

114 Table 426: Summary of results for the random slopes regression with the 48 schools balanced design with parameter values, u00 = 5 u01 = 0:0 and u11 =0:5 All 1000 runs Param IGLS RIGLS Wish 1 Wish 2 Uniform (True) prior prior prior Relative % Bias in estimates (Monte Carlo SE) Except values in [ ] which are actual biases due to the true value being 0 0(30:0) {005 (004) {005 (004) {008 (004) {008 (004) {005 (004) 1(0:5) {003 (068) {003 (068) 213 (068) 078 (068) 007 (068) u00(5:0) {466 (090) {174 (092) {491 (095) {932 (094) 1884 (104) u01(0:0) [000 (001)] [000 (001)] [000 (001)] [000 (001)] [000 (001)] u11(0:5) {159 (075) 080 (077) 702 (077) {130 (077) 1762 (088) e(30:0) 2 {005 (016) {005 (016) 059 (016) 081 (016) 046 (016) Coverage Probabilities (90%/95%) : Approximate MCSE (028%/015%) 0 879/ / / / / / / / / /956 u00 863/ / / / /944 u01 925/ / / / /961 u11 864/ / / / /920 2 e 890/ / / / /949 Average Interval Widths (90%/95%) / / / / / / / / / /0457 u / / / / /7091 u / / / / /1448 u / / / / / e 5027/ / / / /

115 Table 427: Summary of results for the random slopes regression with the 12 schools unbalanced design with parameter values, u00 = 5 u01 = 0:0 and u11 =0:5 Only 877 runs Param IGLS RIGLS Wish 1 Wish 2 Uniform (True) prior prior prior Relative % Bias in estimates (Monte Carlo SE) Except values in [ ] which are actual biases due to the true value being 0 0(30:0) {005 (015) {006 (015) 014 (009) 010 (009) 006 (009) 1(0:5) {045 (150) {064 (149) 148 (148) {016 (148) 006 (150) u00(5:0) {478 (222) 970 (222) {415 (229) {1234 (211) (510) u01(0:0) [001 (002)] [001 (002)] [{001 (002)] [000 (002)] [{004 (005)] u11(0:5) {919 (157) 065 (171) 3476 (177) {526 (171) (374) e(30:0) 2 {056 (040) {082 (041) 212 (036) 267 (036) 162 (035) Coverage Probabilities (90%/95%) : Approximate MCSE (029%/015%) 0 835/ / / / / / / / / /979 u00 800/ / / / /818 u01 877/ / / / /987 u11 769/ / / / /855 2 e 891/ / / / /951 Average Interval Widths (90%/95%) / / / / / / / / / /1290 u / / / / /5049 u / / / / /9696 u / / / / / e 9955/ / / / /

116 Table 428: Summary of results for the random slopes regression with the 12 schools balanced design with parameter values, u00 = 5 u01 = 0:0 and u11 =0:5 Only 990 runs Param IGLS RIGLS Wish 1 Wish 2 Uniform (True) prior prior prior Relative % Bias in estimates (Monte Carlo SE) Except values in [ ] which are actual biases due to the true value being 0 0(30:0) 004 (008) 004 (008) 013 (008) 006 (008) 004 (008) 1(0:5) 048 (138) 049 (138) 173 (138) 114 (139) 096 (139) u00(5:0) {1280 (180) {032 (196) {1142 (196) {1973 (182) (442) u01(0:0) [001 (002)] 001 (002)] [{001 (002)] [{000 (002)] [{003 (004)] u11(0:5) {905 (141) 046 (154) 3395 (160) {453 (154) (340) e(30:0) (032) 005 (032) 249 (033) 290 (033) 201 (032) Coverage Probabilities (90%/95%) : Approximate MCSE (030%/016%) 0 858/ / / / / / / / / /983 u00 770/ / / / /873 u01 911/ / / / /983 u11 791/ / / / /857 2 e 902/ / / / /951 Average Interval Widths (90%/95%) / / / / / / / / / /1279 u / / / / /4538 u / / / / /8994 u / / / / / e 1004/ / / / /

Figure 4-6: Plots of biases obtained for the various methods fitting the random slopes regression model against study design (fixed effects parameters and level 1 variance parameter). Panels: (i) estimated % bias in β_0, (ii) estimated % bias in β_1, (iii) estimated % bias in σ²_e, plotted against the simulation designs 12U, 12B, 48U and 48B for IGLS, RIGLS, Wishart prior 1, Wishart prior 2 and the uniform prior.

Figure 4-7: Plots of biases obtained for the various methods fitting the random slopes regression model against study design (level 2 variance parameters). Panels: (i) estimated % bias in σ_u00, (ii) estimated bias in σ_u01, (iii) estimated % bias in σ_u11, plotted against the simulation designs 12U, 12B, 48U and 48B for IGLS, RIGLS, Wishart prior 1, Wishart prior 2 and the uniform prior.

(It is included in the package MLwiN as there exist situations where the IGLS method converges when the RIGLS method doesn't.) Although the RIGLS method performs well in terms of bias, it is not designed for interval estimation, and the additional (sometimes false) assumption that the parameter of interest has a Gaussian distribution has been used to generate interval estimates. The MCMC methods have been compared to see if they will improve on the RIGLS method in terms of coverage.

The main difficulty with the MCMC methods is choosing default priors for the variance parameters. In the univariate case I have compared three possible prior distributions using the variance components model. All three priors give variance estimates that are positively biased, but this bias decreases as N, the number of units associated with the variance, increases. Of the three priors, the Pareto prior for the precision parameter, which is a proper prior equivalent to a uniform prior for the variance, has far larger bias but in turn often has the best coverage properties. The gamma(ε, ε) prior for the precision parameter has far less bias and also improves over the maximum likelihood methods in terms of coverage, so would be preferable except that it does not easily generalise to a multivariate distribution. The final prior, which uses a prior estimate taken from the gamma prior run, gives approximately the same answers as the gamma prior.

When variance matrices are considered, as in the random slopes regression model, multivariate priors are required. The uniform prior easily translates to a multivariate uniform prior, but unfortunately the gamma prior does not. A candidate multivariate Wishart prior (Wish1) for the precision matrix was used to replace the gamma prior. A third alternative prior (Wish2), based on a prior estimate for the variance matrix, this time from RIGLS, was also considered. This third prior performed poorly: it tended to underestimate the variance matrix and generally gave worse coverage than the maximum likelihood methods. The Wish1 prior tended to shrink the variance estimates towards the identity matrix, but generally was less biased than the other two priors. The uniform prior once again was highly positively biased. In terms of coverage properties the uniform and Wish1 priors both performed as well overall as the RIGLS method, but no better. So in conclusion, in some situations the RIGLS maximum likelihood method,

which is far faster to run, improves on MCMC, and in other situations it is MCMC that has the better performance. Both the uniform and gamma priors and their multivariate equivalents have good points and bad points, but overall the gamma prior appears to be slightly better. In Chapter 6 I will consider a similar simulation study using a multi-level logistic regression model. There the approximation based methods do not perform as well, as noted by Rodriguez and Goldman (1995), and it will be shown that the MCMC methods give an improvement for these models.

4.5.2 Priors in MLwiN

The first release of the MLwiN package occurred while these simulations were still being performed. In this version I included the uniform prior on the σ² scale for all variance parameters as a default, mainly because it was simple and the easiest to extend to the multivariate case. The user is also given the option to include informative priors for variance parameters. For an informative prior the user must input a prior estimate for the variance or variance matrix and a sample size on which this prior estimate is based. If the user gives a prior sample size of 1 then the Wishart prior produced is identical to the second Wishart prior used for the random slopes regression simulations in this chapter. In future releases the gamma(ε, ε) prior for the precision and the Wish1 prior may be added as alternatives, following their performance in these simulations.

In this chapter I have introduced two simple multi-level models and shown how they can be fitted using the Gibbs sampler. I will generalise this to include the whole family of Gaussian models in the next chapter, along with showing how to apply the other MCMC methods to multi-level models.

Chapter 5

Gaussian Models 2 - General Models

In the previous chapter I introduced two simple two level models and showed how to use one MCMC method, Gibbs sampling, to fit them. In this chapter I will extend this work in two directions. Firstly I will give a general description of an N level Gaussian multi-level model and show how to fit this model using Gibbs sampling. Secondly I will show how to use other MCMC methods combined with Gibbs sampling, via a hybrid approach, to fit N level Gaussian models. I will give two alternative Metropolis-Gibbs hybrid sampling algorithms and explain how these methods can be extended into adaptive samplers. I will compare, through a simple example, how well the methods perform in terms of the time they take to produce estimates with a desired accuracy.

5.1 General N level Gaussian hierarchical linear models

In the field of education I have already looked at a two level scenario with pupils within schools. This structure could easily be extended in many directions by the addition of extra levels to the model. The schools could be divided into different education authorities, giving another, higher level. Pupils in each school could be defined by their class, allowing a level between pupils and schools. Below the pupil level, each student could sit tests over a period of several years, and so each

test could be a lower level unit. It is quite easy to see how a 2 level model can be extended to a 5 level model in an educational setting, and it is conceivable that in other application areas there could be even more levels. In the general framework, predictor variables can be defined as fixed effects or random effects at any level in this model. For example, predictors such as sex, parental background and ethnic origin are pupil level variables, whilst class size and teacher variables are class level variables, and school size and type are school level variables. The multi-level structure of these models produces similarities between the conditional distributions for predictor variables at different levels, and it will be shown later that only four parameter updating steps are needed for a general N level Gaussian model. One of the main difficulties with extending the algorithm to N levels is notational: in the two level model we have pupil i in school j, and this cannot be extended indefinitely. I will firstly look at the work on hierarchical models in the paper by Seltzer, Wong, and Bryk (1996) and show how their algorithms can be modified to fit a general 3 level multi-level model, before extending this work to N levels.

5.2 Gibbs sampling approach

Seltzer, Wong, and Bryk (1996) considered hierarchical models of 2 levels with fixed effects. They found the conditional posterior distributions for all the parameters so that a Gibbs sampling algorithm could be easily implemented. They also included specifications of prior distributions for all variance parameters and incorporated these prior distributions into their posterior distributions. They stated that it was easy to extend the algorithm to hierarchical models with 3 or more levels, but did not state how. I wish to follow on from their work but to consider a wider family of distributions, namely the N level Gaussian multi-level models. Their formulation for a 2 level hierarchical model is as follows:

y_ij = X_ij β_j + X̃_ij θ + e_ij,
β_j = W_j γ + U_j,

where e_ij ~ N(0, σ²) and U_j ~ MVN(0, T).

This is a general 2 level hierarchical model with fixed effects. To translate this into a 2 level multi-level model I will re-parameterise as follows:

y_ij = X_ij (W_j δ + U_j) + X̃_ij γ + e_ij
     = X_ij U_j + X_ij W_j δ + X̃_ij γ + e_ij
     = X_ij U_j + Z_ij θ + e_ij

where Z_ij = (X_ij W_j, X̃_ij) and θ = (δᵀ, γᵀ)ᵀ, with e_ij ~ N(0, σ²) and U_j ~ MVN(0, T).

In this formulation estimates for the variance parameters, σ² and T, as well as the fixed effects can still be found. We can also find estimates for the lowest level random variables, which are really also fixed effects. Due to the re-parameterisation the level 2 residuals, the U_j, are estimated as opposed to the β_j, which were random parameters. It is easy to calculate the β_j from the U_j and vice versa. Re-parameterising the model in this way does not have any particular advantages in terms of convergence; in fact (Gelfand, Sahu, and Carlin 1995), with some models that fit this framework, re-parameterising in the way described will give worse mixing properties for the Markov chain. The main reason for re-parameterising into this format is that we are now working on a far larger family of distributions. This is because there exist models that fit this framework but cannot be described in the previous format.

The next step is to construct conditional posterior distributions for this new model structure. I will consider a general 3 level model, as opposed to the 2 level model that Seltzer, Wong, and Bryk (1996) consider, as this easily generalises to N levels. The three level model will be defined as follows:

y_ijk = X_1ijk β_1 + X_2ijk β_2jk + X_3ijk β_3k + e_ijk

e_ijk ~ N(0, σ²),  β_2jk ~ MVN(0, V_2),  β_3k ~ MVN(0, V_3)

with β_1 as fixed effects, β_2 the level 2 residuals and β_3 the level 3 residuals. There are now 6 sets of unknowns to consider. I will consider these in turn, and will assume that the variance parameters have general scaled inverse-χ² and inverse Wishart priors, whilst the fixed effects have uniform priors. The steps required are then:

Step 1: p(β_1 | y, β_2, β_3, σ², V_2, V_3)

β_1 ~ N(β̂_1, D̂_1)

p(β_1 | y, β_2, β_3, σ², V_2, V_3) ∝ p(y | β_1, β_2, β_3, σ²) p(β_1)
  ∝ Π_ijk (1/σ) exp[-(1/2σ²)(y_ijk - X_1ijk β_1 - X_2ijk β_2jk - X_3ijk β_3k)²]
  ∝ Π_ijk (1/σ) exp[-(1/2σ²)(d_1ijk - X_1ijk β_1)²]

where d_1ijk = y_ijk - X_2ijk β_2jk - X_3ijk β_3k, giving

D̂_1 = σ² [Σ_ijk X_1ijkᵀ X_1ijk]⁻¹  and  β̂_1 = [Σ_ijk X_1ijkᵀ X_1ijk]⁻¹ Σ_ijk X_1ijkᵀ d_1ijk = (D̂_1/σ²) Σ_ijk X_1ijkᵀ d_1ijk.

This is the formula for a simple linear regression of d_1 against X_1.

Step 2: p(β_2 | y, β_1, β_3, σ², V_2, V_3)

β_2jk ~ N(β̂_2jk, D̂_2jk)

p(β_2jk | y, β_1, β_3, σ², V_2, V_3) ∝ p(y | β_1, β_2jk, β_3, σ²) p(β_2jk | V_2)
  ∝ Π_{i=1}^{n_jk} [(1/σ) exp(-(1/2σ²)(d_2ijk - X_2ijk β_2jk)²)] · |V_2|^{-1/2} exp(-½ β_2jkᵀ V_2⁻¹ β_2jk)

where d_2ijk = y_ijk - X_1ijk β_1 - X_3ijk β_3k, giving

D̂_2jk = [Σ_{i=1}^{n_jk} X_2ijkᵀ X_2ijk / σ² + V_2⁻¹]⁻¹  and  β̂_2jk = (D̂_2jk/σ²) Σ_{i=1}^{n_jk} X_2ijkᵀ d_2ijk.

Step 3: p(β_3 | y, β_1, β_2, σ², V_2, V_3)

β_3k ~ N(β̂_3k, D̂_3k)

p(β_3k | y, β_1, β_2, σ², V_2, V_3) ∝ p(y | β_1, β_2, β_3k, σ²) p(β_3k | V_3)
  ∝ Π_ij [(1/σ) exp(-(1/2σ²)(d_3ijk - X_3ijk β_3k)²)] · |V_3|^{-1/2} exp(-½ β_3kᵀ V_3⁻¹ β_3k)

where d_3ijk = y_ijk - X_1ijk β_1 - X_2ijk β_2jk, giving

D̂_3k = [Σ_ij X_3ijkᵀ X_3ijk / σ² + V_3⁻¹]⁻¹  and  β̂_3k = (D̂_3k/σ²) Σ_ij X_3ijkᵀ d_3ijk.

Step 4: p(σ² | y, β_1, β_2, β_3, V_2, V_3)

Assume σ² has a scaled inverse-χ² prior, p(σ²) ~ SIχ²(ν_e, s²_e). Considering 1/σ², and using the change of variables formula with h(σ²) = 1/σ², gives

p(1/σ²) = p(σ²) |h'(σ²)|⁻¹ = p(σ²) (1/σ⁴)⁻¹ = p(σ²) (1/σ²)⁻².

Substituting will give

p(1/σ² | y, β_1, β_2, β_3) ∝ (1/σ²)^{N/2} exp[-(1/2σ²) Σ_ijk e_ijk²] · (1/σ²)⁻² p(σ²)

so 1/σ² ~ Gamma(a, b) where

a = ½(N + ν_e),  b = ½(Σ_ijk e_ijk² + ν_e s²_e).

A uniform prior on σ² is equivalent to setting ν_e = -2 and s²_e = 0.

Step 5: p(V_2 | y, β_2, V_3)

Assume V_2 has an inverse Wishart prior, V_2 ~ IW(ν_p2, S_p2); then

p(V_2⁻¹ | y, β_2, V_3) ∝ p(β_2 | V_2) p(V_2⁻¹)

which gives

V_2⁻¹ ~ Wishart_{n_2}[Ŝ_2 = (Σ_jk β_2jk β_2jkᵀ + S_p2)⁻¹, ν̂_2 = n_jk + ν_p2].

Here Ŝ_2 is an n_2 × n_2 scale matrix, where n_2 is the number of random variables at level 2, ν̂_2 is the degrees of freedom of the Wishart distribution, and n_jk is the number of level 2 units. A uniform prior is equivalent to setting ν_p2 = -n_2 - 1 and S_p2 = 0.

Step 6: p(V_3 | y, β_3, V_2)

Assume V_3 has an inverse Wishart prior, V_3 ~ IW(ν_p3, S_p3); then

127 p(v ;1 3 j y V 2 ) / p( 3 j V 3 )p(v ;1 3 ) which gives V ;1 3 Wishart n3 [S 3 =( X k 3k T 3k + S p3 ) ;1 3 = n k + p3 ]: Here S 3 is a n 3 n 3 scale matrix where n 3 is the number of random variables at level 3 and 3 is the degrees of freedom of the Wishart distribution, and n k is the number of level 3 units A uniform prior is equivalent to setting p3 = ;n 3 ; 1, and S p3 =0 The above algorithm already shows some similarities between steps It can be seen that steps 2 and 3 are eectively the same form but with summations over dierent levels The same is also true for steps 5 and 6 and so although I have written the algorithm in six steps it could actually be written out in four I will now consider the N level model and show that this also only needs four steps 53 Generalising to N levels For an N level model there is 1 set of xed eects, N sets of residuals ( although residuals at level 1 can be calculated via subtraction and so do not need to be sampled ) and N sets of variance parameters These parameters can be split into 4 groups in such a way that all parameters in each group have posteriors of the same form as illustrated previously in the 3level model 1 The xed eects 2 The N ; 1 sets of residuals (excluding level 1) 3 The level 1 scalar variance 2 4 The N ; 1 higher level variances I will need some additional notation as using summations over N levels ie N indices becomes impractical and messy Firstly I will describe level 1 as the 110

observation level, and units at level 1 as observations. Then let M_T be the set of all observations in the model, and let M_{l,j} be the set of observations that at level l are in category j. For example, in the simple 2 level educational datasets in Chapter 4, M_T will contain all pupils in all schools, while M_{2,j} will contain all the pupils in school j. Now also let X_li be the vector of variables at level l for observation i, where l = 1 refers to the variables associated with the fixed effects. Finally let the random parameters at level l, l > 1, be denoted by β_lj, where j is one of the combinations of higher level terms. (The fixed effects will be β_1.) Also d_li = e_i + X_li β_lj, in the same way as d_2ijk = e_ijk + X_2ijk β_2jk in the 3 level model.

I will use the following prior distributions: for the level 1 variance, p(σ²) ~ SIχ²(ν_e, s²_e); for the level l variance, where l > 1, V_l ~ IW(ν_Pl, S_Pl); and for the fixed effects, β_1 ~ N(μ_p, S_p).

I will now describe the four steps.

5.3.1 Algorithm 1

Step 1 - The fixed effects, β_1

p(β_1 | y, ...) ∝ p(y | β_1, ...) p(β_1)

β_1 ~ MVN(β̂_1, D̂_1)

where

D̂_1 = [Σ_{i∈M_T} X_1iᵀ X_1i / σ² + S_p⁻¹]⁻¹

and

β̂_1 = D̂_1 [Σ_{i∈M_T} X_1iᵀ d_1i / σ² + S_p⁻¹ μ_p].

Step 2 - The level l residuals, β_l

p(β_l | y, ...) ∝ p(y | β_l, ...) p(β_l | V_l)

β_lj ~ MVN(β̂_lj, D̂_lj)

where

D̂_lj = [Σ_{i∈M_{l,j}} X_liᵀ X_li / σ² + V_l⁻¹]⁻¹

and

β̂_lj = (D̂_lj / σ²) Σ_{i∈M_{l,j}} X_liᵀ d_li.

Step 3 - The level 1 scalar variance σ²

p(1/σ² | y, ...) ∝ p(y | σ², ...) p(1/σ²)

1/σ² ~ Gamma(a_pos, b_pos)

where a_pos = ½(N + ν_e) and b_pos = ½(Σ_n e_n² + ν_e s²_e). For a uniform prior, ν_e = -2 and s²_e = 0.

Step 4 - The level l variance, V_l

p(V_l⁻¹ | y, ...) ∝ p(β_l | V_l) p(V_l⁻¹)

V_l⁻¹ ~ Wishart_{n_rl}[Ŝ_pos = (Σ_{i=1}^{n_l} β_li β_liᵀ + S_Pl)⁻¹, ν̂_pos = n_l + ν_Pl]

where n_l is the number of level l units. For a uniform prior, S_Pl = 0 and

ν_Pl = -n_rl - 1, where n_rl is the number of random variables at level l.

5.3.2 Computational considerations

When writing Gibbs sampling code one of the main concerns is the speed of processing. The code for 1 iteration will be repeated thousands of times, and so any small speed gain for an individual iteration will be magnified greatly. The actual memory requirements for storing intermediate quantities will be small in comparison to the size of the results. There is therefore scope to store a few more intermediate results if they will in turn speed up the code. I will now explain two computational steps that will speed up the processing time.

Speed up 1

From the 4 steps shown in the general N level algorithm, note that the quantities Σ_{i∈M_{l,j}} X_liᵀ X_li and Σ_{i∈M_T} X_1iᵀ X_1i are fixed constant matrices. It would save a large amount of time if these quantities are calculated at the beginning and then stored, so that they can be used in each iteration.

Speed up 2

Much use is made of quantities such as d_li, which are equal to e_i + c_li, where c_li is the product of a parameter vector and a data vector; for example d_2i = e_i + X_2i β_2j. If I store e_i, the level 1 residual for observation i, then whenever one of the d quantities needs to be calculated (steps 1 and 2 of the algorithm), I can add on the current value of the parameter multiplied by the data vector, for example X_2i β_2j, to produce d_2i. Then I use this to calculate a new value for the appropriate β. Once a new value has been calculated, this procedure can be applied backwards to give the new value of the level 1 residual e_i, i.e. subtract the new parameter value multiplied by the data vector. This idea will also be repeated in later methods.
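As a concrete illustration of the four steps and of Speed up 1, the sketch below codes one Gibbs iteration for the simplest special case, a 2 level variance components model y_ij = X_ij β + u_j + e_ij with uniform priors on β and both variances. It is a minimal illustration in Python with my own function and variable names, and is not the MLwiN implementation.

import numpy as np

def gibbs_sweep(y, X, group, XtX, state, rng):
    """One iteration of Algorithm 1 for a 2 level variance components model
    with uniform priors (a sketch, not the MLwiN code)."""
    beta, u, sigma2_e, sigma2_u = state
    n_j = np.bincount(group)                   # observations per level 2 unit
    N, J = len(y), len(n_j)

    # Step 1: fixed effects | rest ~ MVN(beta_hat, sigma2_e * (X'X)^-1)
    d1 = y - u[group]
    beta_hat = np.linalg.solve(XtX, X.T @ d1)  # XtX stored once (Speed up 1)
    beta = rng.multivariate_normal(beta_hat, sigma2_e * np.linalg.inv(XtX))

    # Step 2: level 2 residuals | rest, one normal draw per level 2 unit
    d2 = y - X @ beta
    D_j = 1.0 / (n_j / sigma2_e + 1.0 / sigma2_u)
    u_hat = D_j * np.bincount(group, weights=d2) / sigma2_e
    u = u_hat + np.sqrt(D_j) * rng.standard_normal(J)

    # Step 3: 1/sigma2_e ~ Gamma(a, b), a = (N + nu_e)/2, b = (sum e^2 + nu_e s2_e)/2;
    # the uniform prior corresponds to nu_e = -2, s2_e = 0
    e = d2 - u[group]
    sigma2_e = 1.0 / rng.gamma((N - 2) / 2.0, 2.0 / np.sum(e ** 2))

    # Step 4: scalar case of the Wishart update for the level 2 variance;
    # uniform prior: S_Pl = 0, nu_Pl = -n_rl - 1 = -2
    sigma2_u = 1.0 / rng.gamma((J - 2) / 2.0, 2.0 / np.sum(u ** 2))

    return beta, u, sigma2_e, sigma2_u

Speed up 2 would amount to carrying the level 1 residual vector e between updates instead of recomputing it, adding and subtracting the relevant X_li β_lj terms as each block is updated.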

5.4 Method 2: Metropolis Gibbs hybrid method with univariate updates

In the previous section I have given an algorithm to fit the multi-level Gaussian model using the Gibbs sampler. In the next chapter the models considered do not give conditional distributions that have nice forms to be simulated from easily using the Gibbs sampler. I will now fit the current models using some alternative MCMC methods, which can then be used on the models in the next chapter. The steps that cause the simple Gibbs sampler problems in the multi-level logistic regression models in the next chapter are updating the residuals and fixed effects. The first plan is to replace the Gibbs sampler on these steps with univariate normal proposal Metropolis steps, as described in the following algorithm.

5.4.1 Algorithm 2

Step 1 - The fixed effects, β_1

For i in 1, ..., N_fixed:

β_1i^(t) = β*_1i with probability min(1, p(β*_1i | y, ...)/p(β_1i^(t-1) | y, ...))
         = β_1i^(t-1) otherwise,

where β*_1i = β_1i^(t-1) + ε_1i, ε_1i ~ N(0, σ²_p1i).

Step 2 - The level l residuals, β_l

For l in 2, ..., N, j in 1, ..., n_l and i in 1, ..., n_rl:

β_lji^(t) = β*_lji with probability min(1, p(β*_lji | y, ...)/p(β_lji^(t-1) | y, ...))
          = β_lji^(t-1) otherwise,

where β*_lji = β_lji^(t-1) + ε_lji, ε_lji ~ N(0, σ²_plji); n_l is the number of level l units, and n_rl is the number of random parameters at level l.

Step 3 - The level 1 scalar variance σ²

This step is the same as in Algorithm 1.

Step 4 - The level l variance, V_l

This step is the same as in Algorithm 1.

5.4.2 Choosing proposal distribution variances

When using the Gibbs sampler on multi-level models, having defined the steps of the algorithm the only remaining task is to fix starting values for all the parameters. Generally starting values could be set fairly arbitrarily and the results should be similar. To improve the mixing of the Markov chains, and to utilise MLwiN's other facilities, I use the current estimates obtained by the maximum likelihood IGLS or RIGLS methods as starting values. Having set the starting values it is then simply a question of running through the steps of the algorithm repeatedly.

When Metropolis steps are introduced to the algorithm, there is now one more set of parameters that need to be assigned values. In Steps 1 and 2 of the above algorithm there are normal proposal distributions with undefined variances, and these variances need to be given sensible values. The Metropolis steps will actually work with any positive values for the proposal variances, and will eventually, given time, give estimates with a reasonable accuracy, but ideally we would like accurate estimates in the minimum number of iterations. To achieve this aim, proposal variances that give a chain that mixes well are desirable.

Gelman, Roberts, and Gilks (1995) explore efficient Metropolis proposal distributions for normally distributed data in some detail. They show that the ideal proposal standard deviation for a parameter of interest is approximately 2.4 times the standard deviation of that parameter. This implies the ideal proposal distribution variance is 5.8 times the variance of the parameter. This means that if an estimate of the variance of the parameter of interest is available, this result can be used and the Metropolis algorithm can be run efficiently. Fortunately MLwiN also gives standard errors for the estimates produced by IGLS or RIGLS, and so these values can be used.

The models studied in Gelman, Roberts, and Gilks (1995) are fairly simple and there is no guarantee that the optimal value of 5.8 for the scaling factor will follow for multi-level models. To test this out I will consider a few simple multi-level models and find through simulation whether this optimal value of 5.8 holds.
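To make Steps 1 and 2 of Algorithm 2 concrete, the following sketch shows a single univariate random-walk Metropolis update together with the scaled-variance choice of proposal just described. The log_conditional function, the example numbers and all names are illustrative assumptions rather than MLwiN code; only the ratio of conditional densities is ever needed, so it is evaluated on the log scale.

import numpy as np

def metropolis_update(k, theta, log_conditional, prop_sd, rng):
    """Univariate random-walk Metropolis step for component k (a sketch).
    log_conditional(theta) returns log p(theta | y, ...) up to a constant."""
    proposal = theta.copy()
    proposal[k] = theta[k] + rng.normal(0.0, prop_sd[k])   # theta* = theta(t-1) + eps
    log_ratio = log_conditional(proposal) - log_conditional(theta)
    accept = np.log(rng.uniform()) < log_ratio              # accept with prob min(1, ratio)
    return (proposal, True) if accept else (theta, False)

# Proposal variances scaled from (assumed) RIGLS standard errors using the
# Gelman, Roberts and Gilks (1995) factor 5.8, i.e. proposal SD = sqrt(5.8) * SE.
rigls_se = np.array([0.40, 0.05])      # illustrative values only
prop_sd = np.sqrt(5.8) * rigls_se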

Finding optimal scaling factors

To find optimal scaling factors for the variance of the proposal distribution a practical approach was taken. Several values for the scaling factor, spread over the range 0.05 to 20, were considered, and for each value 3 MCMC runs with a burn-in of 500 and a main run of 50,000 were performed. The same value was used as a multiplier for the variance estimate from the RIGLS method for each fixed effect and higher level residual parameter. To find the optimal value the Raftery Lewis statistic was calculated. In Chapter 3 I showed that the Raftery Lewis N̂ statistic is equivalent to the reciprocal of the efficiency of the estimate, so the optimal scaling factor will be the scaling factor value that minimises N̂.

The method was firstly used on the two simple models considered in Chapter 4, the variance components and random slopes regression models. The results can be seen in Figures 5-1, 5-2 and 5-3. From these figures the shape of the graphs can be seen to be similar to those in Chapter 3 (Figure 3-6), although the graphs in Chapter 3 are based on the scale factor for the standard deviation and not the variance. What is immediately clear is that the value 5.8 is not the minimum for the scale factor, as was found in the simple Gaussian model in Chapter 3. In the second example (Figures 5-2 and 5-3) it can be seen that the optimal scale factor is not even the same for both parameters. There is some noise when using N̂ as an estimate of efficiency, but a rough estimate of the minimum can be obtained, and on all three graphs this is far smaller than 5.8. This creates a problem, as the same scale factor is being used for each parameter, and so this constraint prevents the use of the different optimal values. The calculations in Gelman, Roberts, and Gilks (1995) that give the optimal value of the scale factor are fairly mathematically complex. Due to this complexity I do not intend to attempt to find similar optimal formulae mathematically for multi-level models in this thesis.

The method was also considered on some other models and the results can be seen in Table 5.1. In Table 5.1, β_0 is the intercept, β_1 is the Math3 effect, and β_2 is the sex effect. The models are all either variance components or random slopes regression models, with the number indexing the number of fixed effects. The final model (SCH1) uses a different educational dataset from Goldstein et al (1998), which has 4059 students in 65 schools.

[Figure: two panels plotting the Raftery Lewis N̂ for parameter β_0 against the proposal scale factor and against the acceptance rate.]

Figure 5-1: Plots of the effect of varying the scale factor for the proposal variance, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β_0 parameter in the variance components model on the JSP dataset

[Figure: two panels plotting the Raftery Lewis N̂ for parameter β_0 against the proposal scale factor and against the acceptance rate.]

Figure 5-2: Plots of the effect of varying the scale factor for the proposal variance, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β_0 parameter in the random slopes regression model on the JSP dataset

[Figure: two panels plotting the Raftery Lewis N̂ for parameter β_1 against the proposal scale factor and against the acceptance rate.]

Figure 5-3: Plots of the effect of varying the scale factor for the proposal variance, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β_1 parameter in the random slopes regression model on the JSP dataset

To this dataset a similar random slopes regression model has been fitted to assess whether the results obtained for the JSP dataset are unique.

Table 5.1: Optimal scale factors for proposal variances and best acceptance rates for several models

Model   Optimal scale factor   Acceptance percentage
VC      …                      …%-70%
VC      …                      …%-70%   40%-60%
VC      …                      …%-70%   40%-60%   40%-65%
RSR     …                      …%-75%   35%-65%
RSR     …                      …%-70%   40%-70%   40%-65%
SCH1    …                      …%-80%   40%-70%

If the acceptance rate is considered instead, it can be seen that the graphs are far flatter, and there is a wide range of acceptance rates that give similar values of N̂. Gelman, Roberts, and Gilks (1995) calculate the optimal acceptance rate to be 44% for Gaussian data, and although this value appears to give a reasonably low N̂, it does not appear to be the minimum for all parameters. It will however give far better results than using the scale factor of 5.8. In Table 5.1 ranges of values have been given for the acceptance rates, as the graphs of N̂ values over these ranges are fairly flat. It can be seen that all parameters considered give good results with acceptance rates between 45% and 60%. This shows that if a proposal distribution that gives the same desired acceptance rate for every parameter could be found, then this would be a better method than using the scale factor method considered thus far. This is the motivation behind considering adaptive samplers.

5.4.3 Adaptive Metropolis univariate normal proposals

An additional problem with using the variance estimates produced by IGLS and RIGLS to calculate the proposal distribution variances is the assumption that these methods give good estimates. This does not however explain the discrepancies from the value 5.8 for the scale factor, as the IGLS and RIGLS variance estimates will generally be too small, which would have the opposite

effect on the scaling factor. An alternative approach would be to have starting proposal distributions and then adapt these distributions as the algorithm is running, to improve the mixing of the Markov chain. Care has to be taken when performing adaptive Metropolis sampling (Gelfand and Sahu 1994), as the simulations produced may not be a Markov chain. Gilks, Roberts, and Sahu (1996) give a mathematical method based on Markov chain regeneration that will give time points when it is acceptable to modify the proposal distribution during the monitoring run of the chain. This method, although shown to be effective in the paper, is rather complicated, and so I decided instead to use the simpler approach of adapting the proposal distributions in a preliminary period before the 'burn-in' and main monitoring run of the chain.

Muller (1993) gives a simple adaptive Metropolis sampler based on the belief that the ideal sampler will accept approximately 50% of the iterations. Gelman, Roberts, and Gilks (1995) show that for univariate normal proposals, used on a multivariate normal posterior density, the ideal acceptance rate is 44%, but from Table 5.1 it can be seen that for multi-level models 50% is an equally good acceptance rate. Muller (1993) considers the last 10 observed acceptance probabilities and uses the simple approach of modifying the proposal distribution if the average of these acceptance rates lies outside the range 0.2 to 0.8.

There are several factors to consider when designing an adaptive algorithm: firstly how often to adapt the proposal distributions, secondly how to adapt the proposal distributions, and thirdly when to stop the adapting period and to continue with the 'burn-in' period. I will outline two adaptive Metropolis algorithms that aim to give acceptance rates of 44% for all parameters, although 44% can be substituted by any other percentage.

Adaptive sampler 1

This method has been implemented in MLwiN and has an adapting period of unknown length (up to an upper limit), followed by the usual 'burn-in' period and finally the main run from which the estimates are obtained. The objective of this method is to achieve an acceptance rate of x% for all the parameters of interest. Although in the MLwiN package the proposal distributions used in the non-adaptive method will be used as initial proposal distributions, the algorithm will work on arbitrary starting proposals, as illustrated in the examples below.

The algorithm needs the user to input 2 parameters: firstly x%, the desired acceptance rate, which in the example will be 44%, and secondly a tolerance parameter, which in the example will be 10%. This tolerance parameter governs when the algorithm stops and is meant to signify bounds on the desired acceptance rate; that is, the desired acceptance rate is 44%, but if the acceptance rate is somewhere between 34% and 54% we are fairly happy. The algorithm then runs the sampler with the current proposal distributions for batches of 100 iterations, and at the end of each batch of 100 the proposal distributions are modified. This procedure is repeated until the tolerance conditions are achieved. The modification procedure that happens after each batch of 100 is detailed in the algorithm below.

Method

The following algorithm is repeated for each parameter. Let N_Acc be the number of iterations accepted in the current batch for the chosen parameter (out of 100), OPT_Acc be the desired acceptance rate, and P_SD be the current proposal standard deviation for the parameter.

If N_Acc > OPT_Acc:  P_SD = P_SD × [2 - (100 - N_Acc)/(100 - OPT_Acc)]

If N_Acc < OPT_Acc:  P_SD = P_SD / [2 - N_Acc/OPT_Acc]

The above will modify the proposal standard deviation by a greater amount the further the acceptance rate is from the desired acceptance rate. If the acceptance rate is too small, then the proposed new values are too far from the current value, and so the proposal SD is decreased. If the acceptance rate is too high, then the proposed new values are not exploring enough of the posterior distribution, and so the proposal SD is increased.

To check if the tolerance condition is achieved, N_Acc is compared with the tolerance interval (OPT_Acc - TOL_Acc, OPT_Acc + TOL_Acc). If three successive values of N_Acc are in this interval, then the parameter is marked as satisfying the tolerance conditions. Once all parameters have been marked, the tolerance condition is satisfied. After a parameter has been marked it is still modified as before until all parameters are marked, but each parameter only needs to be marked once for the algorithm to end.
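The modification rule is simple to implement; the sketch below applies it to one parameter after a batch of 100 iterations, together with the tolerance check used for marking. Function and variable names are mine, and the surrounding bookkeeping (batching and the upper limit on the adapting period described below) is omitted.

def adapt_proposal_sd(p_sd, n_acc, opt_acc=44.0):
    """Adaptive sampler 1: rescale one proposal standard deviation from the
    number of acceptances n_acc in the last batch of 100 (a sketch)."""
    if n_acc > opt_acc:
        # acceptance rate too high: widen the proposal
        p_sd *= 2.0 - (100.0 - n_acc) / (100.0 - opt_acc)
    elif n_acc < opt_acc:
        # acceptance rate too low: narrow the proposal
        p_sd /= 2.0 - n_acc / opt_acc
    return p_sd

def within_tolerance(n_acc, opt_acc=44.0, tol_acc=10.0):
    """True if the batch acceptance count lies in (opt_acc - tol_acc, opt_acc + tol_acc)."""
    return abs(n_acc - opt_acc) < tol_acc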

To limit the time spent in the adapting procedure an upper limit is set (in MLwiN this is 5,000 iterations), and after this time the adapting period ends regardless of whether the tolerance conditions are met. Note that it may be better to use the sum of the actual Metropolis acceptance probabilities, as in Muller (1993), instead of N_Acc in the above algorithm, although preliminary investigations show no significant differences in the proposal distributions chosen.

Results

Table 5.2 shows the adapting period for the two fixed effects parameters for one run of the random slopes regression model with the JSP dataset. Here the starting values have been chosen arbitrarily to be 1.0, whereas when this method is used in MLwiN the RIGLS estimates will be used instead. From the table it can be seen that both parameters have fulfilled the tolerance criteria by 700 iterations. However, as the adapting period also includes the level 2 residuals, it is not complete until 3,300 iterations, when the final set of residuals satisfy the criteria.

Table 5.2: Demonstration of Adaptive Method 1 for parameters β_0 and β_1 using arbitrary (1.000) starting values. Columns: N; β_0 SD, N_Acc, N in Tol; β_1 SD, N_Acc, N in Tol.

In Table 5.3, runs of length 50,000 for various different methods using the same

random slopes regression model are compared. The four methods considered are the Gibbs sampling method used in Chapter 4, and three versions of the Metropolis Gibbs hybrid method: firstly using proposal SDs set at 1.0 for all parameters, secondly using the RIGLS starting values to create the proposal distributions, and finally using the first adaptive method. After 50,000 iterations, the parameter estimates of all four methods are reasonably similar. The Raftery Lewis N̂ values show more clearly how well the methods are performing. The Gibbs sampler generally has the lowest values of N̂, with the adaptive method the best of the hybrid methods. The need to choose good proposal distributions is highlighted by the huge N̂ value for β_1 using the arbitrary proposal distribution SD of 1.0. This value is over 30 times longer than the suggested run length for Gibbs for β_1. The acceptance rates and proposal standard deviations in Table 5.3 show how far from the expected 44% acceptance rate the RIGLS starting values method actually is. This in turn explains why the N̂ value for β_0 using this method is larger than for the adaptive method. The table also shows that the adaptive method is a better approach than using the RIGLS starting values, and this is backed up by Figures 5-1 to 5-3 seen earlier. The results in Table 5.3 are based on only one run of each method, but other runs were performed and similar results were obtained.

Adaptive sampler 2

Two criticisms that may be levelled against the first adaptive sampler are firstly that there is no definite length for the adapting period, and secondly that the method includes a tolerance parameter which has to be set. This second method will hopefully be an improvement on the first algorithm that does away with the tolerance parameter and gives acceptance rates closer to the desired acceptance rate.

Method

In the first sampler, although the modification is smaller the closer the current acceptance rate is to the desired acceptance rate, the size of the change made to the proposal SD does not vary with time.

Table 5.3: Comparison of results for the random slopes regression model on the JSP dataset using uniform priors for the variances, and different MCMC methods. Each method was run for 50,000 iterations after a burn-in of 500.

Par       Gibbs            MH (SD = 1)      MH (RIGLS)       MH Adapt
β_0       … (0.396)        30.59 (0.374)    30.60 (0.406)    30.57 (0.417)
β_1       … (0.048)        0.614 (0.047)    0.614 (0.051)    0.616 (0.049)
σ²_u00    … (1.732)        5.699 (1.716)    5.702 (1.656)    5.780 (1.754)
σ_u01     -0.426 (0.163)   -0.428 (0.162)   -0.420 (0.160)   -0.436 (0.168)
σ²_u11    … (0.024)        0.055 (0.023)    0.054 (0.025)    0.055 (0.024)
σ²_e      … (1.339)        26.93 (1.342)    26.99 (1.337)    26.92 (1.336)

Raftery and Lewis diagnostic (N̂)
β_0       10,520           60,728           58,778           32,…
β_1       …                …,954            24,421           24,999
σ²_u00    5,792            6,175            5,645            5,684
σ_u01     4,866            4,714            4,882            5,212
σ²_u11    12,345           9,480            14,389           11,877
σ²_e      3,898            3,810            3,867            3,835

Acceptance rates for fixed effects (%)
β_0       100%             21.7%            23.2%            46.3%
β_1       100%             3.6%             34.0%            40.6%

Proposal standard deviations      …

I will try to incorporate the MCMC technique of simulated annealing (Geman and Geman 1984), by allowing the change to the proposal SD to decrease with time. The algorithm will then be run for a fixed length of time, T_max (T_max is chosen to be 5,000 in the example), and the proposal distributions will be modified every 100 iterations. The following procedure will be carried out for each parameter at time t:

If N_Acc > OPT_Acc:  P_SD = P_SD × [1 + (1 - (100 - N_Acc)/(100 - OPT_Acc)) × (T_max - t + 100)/T_max]

If N_Acc < OPT_Acc:  P_SD = P_SD / [1 + (1 - N_Acc/OPT_Acc) × (T_max - t + 100)/T_max]

So after the first 100 iterations the range of possible changes to the proposal SD is (½ P_SD, 2 P_SD), as in the first algorithm, but this shrinks to (T_max/(T_max + 100) × P_SD, (T_max + 100)/T_max × P_SD) at time T_max.

Results

Table 5.4 shows the adapting period for the two fixed effects parameters for one run of the random slopes regression model with the JSP dataset using the second method. Here again, the starting values have been chosen arbitrarily to be 1.0, whereas this method could use the RIGLS estimates from MLwiN. From Table 5.4 it can be seen that as time increases the changes to the proposal standard deviation become smaller, as with a simulated annealing algorithm.

The two methods were each run ten times for 5,000 iterations using the random slopes regression model, with the ideal acceptance rate set to 44%. The actual acceptance rates achieved for the two fixed effects were recorded for both methods. For parameter β_0, the first method obtained acceptance rates between 40.0% and 49.0%, whilst the second method obtained rates of between 43.3% and 47.3%. For parameter β_1, the first method obtained rates between 41.0% and 48.4%, whilst the second method obtained rates between 43.3% and 46.4%. It is not entirely fair to compare these figures directly, as the second method was run for 5,000 iterations every time whereas the first method ran on average for only 2,100 iterations; however, the second method does appear to give a more accurate acceptance rate. A balance has to be struck between the additional burden of, on average, 2,900 extra iterations (in this example) in the adapting period, and any gain in speed of obtaining accurate estimates.

Table 5.4: Demonstration of Adaptive Method 2 for parameters β_0 and β_1 using arbitrary (1.000) starting values. Columns: N; β_0 SD, N_Acc; β_1 SD, N_Acc.

Although this second method is an interesting alternative to the first adaptive method, it will not be considered further in this thesis as it does not offer any significant improvements. Instead I will now go on to consider multivariate normal Metropolis updating methods.

5.5 Method 3: Metropolis Gibbs hybrid method with block updates

One disadvantage of using univariate normal proposal distributions is that the correlation between parameters is completely ignored, and so if two parameters are highly correlated it would be nice to adjust for this in the proposal distribution. Highly correlated parameters are generally avoided by centering predictor variables, but sometimes large correlations still exist.

The Gibbs sampler algorithm updates parameters in blocks; for example the fixed effects are updated together and all residuals for one level 2 unit are updated

together. The second hybrid method will mimic the Gibbs sampler steps for residuals and fixed effects by using multivariate normal proposal Metropolis steps for these blocks, as described in the following algorithm.

5.5.1 Algorithm 3

Step 1 - The fixed effects, β_1

β_1^(t) = β*_1 with probability min(1, p(β*_1 | y, ...)/p(β_1^(t-1) | y, ...))
        = β_1^(t-1) otherwise,

where β*_1 = β_1^(t-1) + ε_1, ε_1 ~ MVN(0, Σ_1).

Step 2 - The level l residuals, β_l

For l in 2, ..., N and j in 1, ..., n_l:

β_lj^(t) = β*_lj with probability min(1, p(β*_lj | y, ...)/p(β_lj^(t-1) | y, ...))
         = β_lj^(t-1) otherwise,

where β*_lj = β_lj^(t-1) + ε_lj, ε_lj ~ MVN(0, Σ_lj), and n_l is the number of level l units.

Step 3 - The level 1 scalar variance σ²

This step is the same as in Algorithm 1.

Step 4 - The level l variance, V_l

This step is the same as in Algorithm 1.

5.5.2 Choosing proposal distribution variances

When using multivariate normal proposal distributions a similar problem exists as for the univariate case: the variance matrices for the proposal distributions in Steps 1 and 2 need to be assigned values. This time the Metropolis steps 1 and 2 will work with any positive definite matrix for the proposal variances; however, ideally a matrix that gives estimates with a reasonable accuracy in the minimum number of iterations is desired.
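A sketch of the Step 1 block update is given below: the whole fixed effects vector is proposed at once from a multivariate normal random walk. The names are illustrative, and the choice of the proposal covariance matrix prop_cov is exactly the question addressed in this subsection.

import numpy as np

def block_metropolis_update(beta, log_conditional, prop_cov, rng):
    """Multivariate random-walk Metropolis update for one block of
    parameters, e.g. all fixed effects (a sketch of Algorithm 3, Step 1)."""
    eps = rng.multivariate_normal(np.zeros(len(beta)), prop_cov)
    proposal = beta + eps
    log_ratio = log_conditional(proposal) - log_conditional(beta)
    if np.log(rng.uniform()) < log_ratio:
        return proposal, True
    return beta, False

With a RIGLS estimate of the covariance matrix of the block, the scale factor discussed next (asymptotically 5.66/d) would multiply this matrix to give prop_cov.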

Gelman, Roberts, and Gilks (1995) also consider the case of a multivariate normal Metropolis proposal distribution on multivariate normal posterior distributions. They calculate the optimal scale factor to be used as a multiplier for the estimated standard deviation matrix for dimensions 1 to 10, and also found an asymptotically optimal estimator for this scale factor. This asymptotic estimator is 2.38/√d, where d is the dimension of the proposal distribution. When considering the estimated covariance matrix this multiplier becomes 5.66/d. This now implies that if an estimate of the covariance matrix of the parameters of interest can be found, then this can be multiplied by this optimal scale factor and the resulting matrix can be used as the variance matrix for the proposal distribution. Fortunately MLwiN will give the covariance matrices associated with both the fixed effects and the residuals, and so these values can be used.

Finding optimal scaling factors

When I considered the univariate normal proposal distributions I found that the optimal value of 5.8 for the scale factor from Gelman, Roberts, and Gilks (1995) did not actually follow for multi-level models. For the multivariate normal proposal distributions I will now consider again the random slopes regression model on the JSP dataset. The asymptotic estimator gives the value 2.83 when d = 2, as in the random slopes regression model. The optimal estimate from Gelman, Roberts, and Gilks (1995) is 2.89, so the asymptotic estimator is fairly accurate when d = 2.

I used a similar approach to that used for the univariate proposal distributions. Several values of the scaling factor, spread over the range 0.02 to 10, were chosen, and for each value three MCMC runs with a burn-in of 1,000 and a main run of 50,000 were performed. As with the univariate case, the Raftery Lewis N̂ statistic was calculated to measure efficiency. The results for the random slopes regression model can be seen in Figures 5-4 and 5-5.

From Figures 5-4 and 5-5 the optimal scale factors for both β_0 and β_1 appear to be around 0.75, which is far smaller than the values from Gelman, Roberts, and Gilks (1995). As with the univariate case, the acceptance rate gives a far flatter graph. Here acceptance rates in the range 30% to 75% for β_0 and in the range 30% to 70% for β_1 give similar low values of N̂. Gelman, Roberts, and Gilks (1995) give the acceptance rate of 35.2% as optimal for a bivariate normal proposal, which does appear in these ranges.

[Figure: two panels plotting the Raftery Lewis N̂ for parameter β_0 against the proposal scale factor and against the acceptance rate.]

Figure 5-4: Plots of the effect of varying the scale factor for the multivariate normal proposal distribution, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β_0 parameter in the random slopes regression model on the JSP dataset

[Figure: two panels plotting the Raftery Lewis N̂ for parameter β_1 against the proposal scale factor and against the acceptance rate.]

Figure 5-5: Plots of the effect of varying the scale factor for the multivariate normal proposal distribution, and hence the Metropolis acceptance rate, on the Raftery Lewis diagnostic for the β_1 parameter in the random slopes regression model on the JSP dataset

The flatness of the acceptance rate graph around the minimum again implies that finding proposal distributions that give a desired acceptance rate for every parameter is a better approach than the scale factor method. This leads us to consider adaptive multivariate samplers.

5.5.3 Adaptive multivariate normal proposal distributions

The adaptive samplers considered for univariate normal proposal distributions can both be extended to multivariate proposals. I will only consider modifying the first sampler, which is used in MLwiN, but the alternative sampler based on simulated annealing could also be easily modified. When considering multivariate proposals there is more flexibility in the possible variance matrices that can be considered. I could simply modify the univariate algorithm by allowing the scale factor to vary and keeping the original estimate of the covariance matrix (generally from RIGLS) fixed. This is rather restrictive, as the proposal variance would then have to be a scalar multiple of the initial variance estimate, and so an alternative approach will be considered.

In Gelman, Roberts, and Gilks (1995) different optimal acceptance rates are given for different dimensions, and these range from 44% when d = 1 down to 25%. The optimal acceptance rate when d = 2 is 35.2%, although Figures 5-4 and 5-5 show that for the random slopes regression model acceptance rates between 30% and 70% are all reasonably good.

Adaptive sampler 3

This sampler is a slightly modified generalisation of adaptive sampler 1. The objective of the method is again to achieve an acceptance rate of x% for all blocks of parameters. Effectively any positive definite matrix can be used as an initial variance matrix for the proposal distribution, but in practice it is better to use good estimates, as problems may occur if there are no changes accepted in the first batch of 100 iterations.

The algorithm needs the user to input 2 parameters: firstly x%, the desired acceptance rate, and secondly a tolerance parameter. This tolerance parameter will work in exactly the same way as in sampler 1. The algorithm runs the

sampler with the current proposal distributions for batches of 100 iterations, and at the end of each batch of 100 the proposal distributions are modified. Each proposal distribution consists of two distinct parts: firstly the current estimate of the covariance matrix for the block of parameters considered, and secondly the scale factor by which this matrix is multiplied to give the proposal distribution variance. The main difference from the univariate case is that the current estimate of the covariance matrix is updated after every 100 iterations, whereas for the univariate case the variance estimate remains fixed at the RIGLS estimate. For the first 100 iterations the RIGLS estimate for the covariance matrix is used. Then after each 100 iterations the covariance matrix is calculated from all the iterations run thus far. The procedure to follow after every 100 iterations is as given below.

Method

The following algorithm is repeated for each block of parameters. Let N_Acc be the number of moves accepted in the current batch for the chosen block (out of 100), OPT_Acc be the desired acceptance rate, and SF be the current proposal scale factor for the block.

If N_Acc > OPT_Acc:  SF = SF × [2 - (100 - N_Acc)/(100 - OPT_Acc)]

If N_Acc < OPT_Acc:  SF = SF / [2 - N_Acc/OPT_Acc]

The above will modify the proposal scale factor by a greater amount the further the current acceptance rate is from the desired acceptance rate. If the acceptance rate is too small, then the proposed new values are too far from the current value, and so the scale factor is decreased. If the acceptance rate is too high, then the proposed new values are not exploring enough of the posterior distribution, and so the scale factor is increased.

To calculate the actual variance matrix for the proposal distribution, this scale factor has to be multiplied by the current estimate of the covariance matrix for the block of parameters. This estimate is based on the values obtained from the iterations run thus far, and after each batch of 100 iterations this estimate is

modified accordingly. The procedure for checking that the tolerance criteria are satisfied and the maximum length of the adapting period are both the same as in adaptive sampler 1.

Results

Table 5.5 shows the adapting period for the block of two fixed effects parameters in the random slopes regression model with the JSP dataset. The starting values are the RIGLS estimates of the covariance matrix from MLwiN, and the desired acceptance rate is 35%. The columns labelled V_p00, V_p01 and V_p11 are the elements of the proposal variance matrix. It is interesting to note that there is a huge jump in the proposal variance matrix after 100 iterations. This is because the actual iterations are then used to estimate the covariance matrix instead of the RIGLS estimates, and the estimate after 100 iterations will be less accurate than the RIGLS estimate. However, as the number of iterations increases the accuracy will improve. From Table 5.5 it can be seen that this block of parameters fulfils the tolerance criteria by 400 iterations. However, as the adapting period also includes the level 2 residuals, it is not complete until 1,300 iterations, when the final set of residuals satisfy the criteria. It is interesting to note that for the fixed effects only 17 iterations were accepted in the last block of 100, and so the final proposal distribution is quite different from the penultimate one. This does not however seem to have affected the results in Table 5.6.

In Table 5.6, runs of length 50,000 iterations for various different multivariate proposal distribution methods are compared. These results can also be compared with Table 5.3, which gives results for the same model but using Gibbs sampling and univariate proposal distribution methods. The three proposal distributions considered in Table 5.6 are firstly an arbitrary identity matrix for the proposal variance, secondly the estimate from RIGLS multiplied by the scale factor (2.9) as the variance matrix, and thirdly the adaptive method with 35% as the desired acceptance rate.

The method using the identity matrix as the proposal variance shows the importance of choosing a sensible proposal distribution. It only has an acceptance rate of less than 1%, which leads to huge values of N̂ for the fixed effects. Although the estimates it produces are similar to the other methods, this is due to using the starting values from RIGLS.

Table 5.5: Demonstration of Adaptive Method 3 for the fixed effects parameter vector using RIGLS starting values. Columns: N, SF, N_Acc, N in Tol, V_p00, V_p01, V_p11.

The estimates of parameter standard deviations it produces are too small due to the low acceptance rate, and this gives a better indication of the method's poor performance. The method based on the RIGLS starting values and a scale factor of 2.9 gives results that are far better, and values of N̂ that are similar to the equivalent univariate method. The adaptive method, as in the univariate case, improves on the scale factor method, although the univariate adaptive method appears to do better than the multivariate method in terms of N̂ values.

This section has shown that Metropolis block updating methods can be produced by modifying the univariate updating methods. For the one bivariate example considered, the block updating methods do not show any improvement over their univariate equivalents in terms of minimising expected run lengths N̂. There is scope to consider these methods in more detail and to see whether there are any improvements on other datasets where there is greater correlation between parameters in a block, but not in this thesis.

Table 5.6: Comparison of results for the random slopes regression model on the JSP dataset using uniform priors for the variances, and different block updating MCMC methods. Each method was run for 50,000 iterations after a burn-in of 500.

Par       MH (V_p = I)     MH (RIGLS)       MH Adapt
β_0       … (0.359)        30.60 (0.405)    30.58 (0.397)
β_1       … (0.042)        0.615 (0.048)    0.615 (0.048)
σ²_u00    … (1.702)        5.680 (1.735)    5.624 (1.713)
σ_u01     -0.429 (0.161)   -0.424 (0.163)   -0.427 (0.165)
σ²_u11    … (0.024)        0.054 (0.024)    0.056 (0.024)
σ²_e      … (1.336)        26.98 (1.341)    27.00 (1.342)

Raftery and Lewis diagnostic (N̂)
β_0       1,045,468        62,443           43,…
β_1       …,906            48,805           36,881
σ²_u00    5,917            6,058            6,017
σ_u01     4,842            5,065            4,762
σ²_u11    10,424           18,712           14,535
σ²_e      3,860            3,767            3,791

Acceptance rates for fixed effects (%)
          0.96%            18.7%            36.2%

Proposal variance matrix
V_p00     1                …                …
V_p01     0                -0.024           -0.009
V_p11     1                …                …
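A sketch of the adaptation used in sampler 3 is given below: the scale factor is tuned with the same rule as in the univariate sampler, and the covariance part of the proposal is re-estimated from the chain after each batch of 100 iterations. All names are mine, and the tolerance and upper-limit bookkeeping is omitted.

import numpy as np

def adapt_block_proposal(scale, n_acc, draws, opt_acc=35.0):
    """Adaptive sampler 3 (sketch): update the scale factor from the batch
    acceptance count, and recompute the proposal variance matrix as
    scale times the covariance of the draws collected so far."""
    if n_acc > opt_acc:
        scale *= 2.0 - (100.0 - n_acc) / (100.0 - opt_acc)
    elif n_acc < opt_acc:
        scale /= 2.0 - n_acc / opt_acc
    cov_est = np.cov(np.asarray(draws), rowvar=False)   # one row per stored iteration
    return scale, scale * cov_est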

5.6 Summary

In this chapter several MCMC methods for fitting N level Gaussian models have been discussed and algorithms produced. The Gibbs sampling method introduced in the last chapter for two simple multi-level models was extended to fit general N level models. For N level Gaussian models, where the conditional distributions required in the Gibbs algorithm have forms that are easy to simulate from, this method performs best.

Two other hybrid methods based on a combination of Metropolis and Gibbs sampling steps were introduced in this chapter. These methods do not perform as well as the Gibbs method for the Gaussian models, but can be easily applied to models with complicated conditional distributions, as will be seen in the next chapter. The first method uses univariate normal proposal distributions, whilst the second method uses multivariate normal proposal distributions. The two methods were compared using a simple example model and no benefit was seen in using the multivariate proposals, and so the univariate proposal method will be used in the next chapter.

Two approaches were considered for generating optimal proposal distributions for the Metropolis steps in the hybrid algorithms. The first approach was based on using scaled estimates of the variance of the parameter of interest as variances for the proposal distribution. This approach has some problems, as the results in Gelman, Roberts, and Gilks (1995) for optimal scale factors for multivariate normal posterior distributions do not follow for multi-level models. The second approach considered uses an adapting period before the main run of the Markov chain, in which the proposal distributions are modified to give particular acceptance rates for the parameters of interest. This approach works better as, for the univariate proposal distributions, acceptance rates in the range 45% to 60% led to close to optimal proposals in all the examples considered.

One type of comparison that is missing from this chapter, and this thesis in general, is timing comparisons, and they deserve some mention before I close this chapter.

5.6.1 Timing considerations

In this thesis the only places where timings are included are generally to justify the dimensions of simulation runs, and not to compare individual methods. The main reason for not including timing comparisons is that I personally think that they should only be done on released software, on a stand alone machine and by a third party. Before the MCMC options were incorporated into the MLwiN package they existed in a more primitive version as a stand alone C program. At this time I compared my stand alone code with the BUGS package (Spiegelhalter et al 1994), mainly to confirm that my code gave reasonable estimates, but also to compare the speed differences. I found that my Gibbs sampling code was slightly quicker for the few small models tested. This was to be expected, as the algorithms in this chapter generate from the posterior distributions directly, whilst the BUGS package uses the adaptive rejection method. I would expect now that the BUGS package will outperform the MLwiN Gibbs sampler for the Gaussian models. This is because the Gibbs sampling code in MLwiN is embedded beneath the graphical interface, which slows the code down considerably. Some of the methods described in this chapter have not yet been added to the released version of MLwiN and have not yet been optimised, and so comparisons would not be fair.

All comparisons obviously depend on the efficiency of the coding and the model considered. However, it is generally the case that a Metropolis sampling step will be quicker than a Gibbs sampling step. Also, Metropolis steps should in general be quicker than rejection sampling algorithms and adaptive rejection sampling, as only one value is generated per parameter per iteration using Metropolis. The Metropolis steps, however, in general need longer to get estimates of a given accuracy, as demonstrated earlier in this chapter.

Chapter 6

Logistic Regression Models

6.1 Introduction

In the previous chapters the models considered have been restricted to the family of Gaussian multi-level models. This family of models is very useful and can fit most datasets well. The models considered thus far have all had a response variable that is assumed to be defined on the whole real line. There are many variables that are not defined on the whole real line, for example age, which must be positive, and sex, which is either male or female. In this chapter I am interested in the second type of variable, one which has two possible states that can be defined as zero and one, i.e. a binary response.

These types of variables, as seen at the end of Chapter 2, also appear in linear modelling as responses. In this case they are fitted as a Bernoulli response using generalized linear modelling. The most common way of fitting such a model is by using the logit link function, which is the canonical link for the binomial and Bernoulli distributions. The technique is then known as logistic regression. McCullagh and Nelder (1983) is a useful text for all generalized linear models, including logistic regression models. In a similar way that Gaussian linear models can be extended to multi-level Gaussian models, logistic regression models can also be extended to multi-level logistic regression models.

In this chapter I will define the general model structure for a multi-level binary response logistic regression model. In the last chapter it was pointed out that the simple Gibbs sampling method cannot be used to fit these models, as

the full conditional distributions do not all have forms that are easily simulated from. Gilks (1995) shows that this is true for a very simple logistic regression model and then gives some alternative ways to fit such models. I will show how the Metropolis-Gibbs hybrid methods of the last chapter can easily be adapted to fit these logistic regression models. In fact the motivation behind putting these hybrid methods in the last chapter was to be able to fit multi-level logistic regression models.

I will then consider two examples to show some other fields of application of hierarchical modelling that have models of this structure. The first example is taken from survey sampling and involves a political voting dataset from the British Election Study. This example will be used to illustrate how to fit multi-level binary response models, and how to calculate optimal Metropolis proposal distributions. The second example considers a collection of simulated datasets, designed to represent closely the structure of a dataset used in an analysis of health care utilisation in Guatemala. These simulated datasets were considered in Rodriguez and Goldman (1995), where deficiencies in the quasi-likelihood methods used by MLn to fit binary response models were pointed out. I hope to show that the Metropolis-Gibbs hybrid methods will improve on the quasi-likelihood methods.

6.2 Multi-level binary response logistic regression models

The multi-level binary response logistic regression model has a similar structure to the Gaussian models discussed thus far. The only difference is how the response variable is linked to the predictor variables. When considering the Gaussian models, a three level model was described and an algorithm for this model was included; a generalisation was then given to N levels. A 3 level binary response logistic regression model can be defined as follows:

y_ijk ~ Bernoulli(p_ijk)  where  logit(p_ijk) = X_1ijk β_1 + X_2ijk β_2jk + X_3ijk β_3k

β_2jk ~ MVN(0, V_2),  β_3k ~ MVN(0, V_3).
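To fix ideas, the sketch below simulates a small dataset from a 2 level special case of this model (a single random intercept at level 2). The sample sizes and parameter values are arbitrary illustrations, not the designs used later in this chapter.

import numpy as np

rng = np.random.default_rng(1)

n_groups, n_per_group = 50, 20                 # illustrative sizes only
beta = np.array([-0.5, 0.8])                   # fixed effects (intercept, slope)
sigma2_u = 0.25                                # level 2 variance

group = np.repeat(np.arange(n_groups), n_per_group)
x = rng.normal(size=n_groups * n_per_group)
X = np.column_stack([np.ones_like(x), x])
u = rng.normal(0.0, np.sqrt(sigma2_u), size=n_groups)

eta = X @ beta + u[group]                      # linear predictor
p = 1.0 / (1.0 + np.exp(-eta))                 # inverse logit
y = rng.binomial(1, p)                         # binary (Bernoulli) responses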

This 3 level model can be easily extended to N levels in a similar way to the Gaussian models. Then any of the Metropolis-Gibbs hybrid methods can be adapted to fit the logistic model. I will only consider the method with univariate updates here, and I now show how this can be adapted from the Gaussian version.

6.2.1 Metropolis Gibbs hybrid method with univariate updates

As can be seen in the above model definition, multi-level logistic regression models do not have variance terms at level 1. This is because for the Bernoulli distribution both the mean and the variance are functions of the parameter p_ijk only (E(y_ijk) = p_ijk, Var(y_ijk) = p_ijk(1 - p_ijk)). Therefore if the mean is estimated, then the variance will be fixed. This means that the algorithm for a general N level logistic regression model has only three steps.

Notation

In the following algorithm I will use similar notation to that used for the N level Gaussian models in Chapter 5. Let M_T be the set of all observations in the model, and let M_{l,j} be the set of observations that at level l are in category j. Let X_li be the vector of variables at level l for observation i, where l = 1 refers to the variables associated with the fixed effects. Let the random parameters at level l, l > 1, be denoted by β_lj, where j is one of the combinations of higher level terms, and let the fixed effects be β_1. Finally let V_l be the level l variance matrix.

I will use the abbreviation (Xβ)_i to mean the sum of all the predictor terms for observation i. For example, in the three level model definition earlier, (Xβ)_i = X_1ijk β_1 + X_2ijk β_2jk + X_3ijk β_3k. Using these notational shortcuts, the model can be written:

y_i ~ Bernoulli(p_i)  where  logit(p_i) = (Xβ)_i,  β_lj ~ MVN(0, V_l).

Algorithm

The main differences between this algorithm and the Gaussian algorithm arise from the different likelihood functions. For prior distributions, I will allow the fixed effects to have any prior distribution and the level l variance to have a general inverse Wishart prior, V_l ~ IW(S_Pl, ν_Pl). The three steps of the algorithm are then as follows:

Step 1 - The fixed effects, β_1

For i in 1, ..., N_fixed:

β_1i^(t) = β*_1i with probability min(1, p(β*_1i | y, ...)/p(β_1i^(t-1) | y, ...))
         = β_1i^(t-1) otherwise,

where β*_1i = β_1i^(t-1) + ε_1i, ε_1i ~ N(0, σ²_p1i), and

p(β_1i | y, ...) ∝ p(β_1) Π_{i∈M_T} (1 + e^{-(Xβ)_i})^{-y_i} (1 + e^{(Xβ)_i})^{y_i - 1}.

Step 2 - The level l residuals, β_l

For l in 2, ..., N, j in 1, ..., n_l and i in 1, ..., n_rl:

β_lji^(t) = β*_lji with probability min(1, p(β*_lji | y, ...)/p(β_lji^(t-1) | y, ...))
          = β_lji^(t-1) otherwise,

where β*_lji = β_lji^(t-1) + ε_lji, ε_lji ~ N(0, σ²_plji), and

p(β_lji | y, ...) ∝ Π_{i∈M_{l,j}} (1 + e^{-(Xβ)_i})^{-y_i} (1 + e^{(Xβ)_i})^{y_i - 1} · |V_l|^{-1/2} exp(-½ β_ljᵀ V_l⁻¹ β_lj).

Step 3 - The level l variance, V_l

p(V_l⁻¹ | y, ...) ∝ p(β_l | V_l) p(V_l⁻¹)

V_l⁻¹ ~ Wishart_{n_rl}[Ŝ_pos = (Σ_{i=1}^{n_l} β_li β_liᵀ + S_Pl)⁻¹, ν̂_pos = n_l + ν_Pl]

where n_l is the number of level l units. If we want a uniform prior then we need S_Pl = 0 and ν_Pl = -n_rl - 1, where n_rl is the number of random variables at level l.
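The only new ingredient relative to the Gaussian Algorithm 2 is the Bernoulli likelihood in Steps 1 and 2, which is best evaluated on the log scale. The sketch below codes the log conditional posterior of the fixed effects under a flat prior (the product in Step 1 above) and uses it inside the univariate Metropolis update. The offset argument carries the higher level residual contributions to (Xβ)_i, and all names are illustrative rather than MLwiN code.

import numpy as np

def log_cond_fixed(beta, X, offset, y):
    """log p(beta | y, ...) up to a constant for the logistic model with a
    flat prior on the fixed effects (a sketch of the Step 1 density)."""
    eta = X @ beta + offset                       # (X beta)_i for every observation
    # log of prod_i (1 + exp(-eta_i))^(-y_i) * (1 + exp(eta_i))^(y_i - 1)
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def metropolis_fixed_effect(k, beta, prop_sd, X, offset, y, rng):
    """One univariate Metropolis update for fixed effect k (a sketch)."""
    proposal = beta.copy()
    proposal[k] += rng.normal(0.0, prop_sd[k])
    log_ratio = (log_cond_fixed(proposal, X, offset, y)
                 - log_cond_fixed(beta, X, offset, y))
    if np.log(rng.uniform()) < log_ratio:
        return proposal, True
    return beta, False

For a residual block the same likelihood term is restricted to the observations in M_{l,j}, and the log prior term -½ β_ljᵀ V_l⁻¹ β_lj is added to the log density.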

Note that it is possible, by reparameterising this model, to use the Gibbs sampler instead of the Metropolis sampler for residuals at levels 3 and upwards, but this is not considered in this thesis.

6.2.2 Other existing methods

The existing methods for fitting multi-level logistic regression models in MLwiN are described briefly at the end of Chapter 2. They are quasi-likelihood methods based around Taylor series expansions. Marginal quasi-likelihood (MQL) is described in Goldstein (1991), and penalised quasi-likelihood (PQL) is introduced in Laird (1978).

MCMC methods have also been used to fit these models. Zeger and Karim (1991) give a Gibbs sampling approach to fitting multi-level logistic regression models, amongst other multi-level models. They use rejection sampling with a Gaussian kernel that is a good estimate of the current likelihood function to generate new estimates for the higher level residuals. The BUGS package (Spiegelhalter et al 1994) will also fit these models. It uses adaptive rejection sampling (Gilks and Wild 1992), as discussed in Chapter 3, in place of rejection sampling. Breslow and Clayton (1993) consider multi-level logistic regression models within the family of generalized linear mixed models. They perform some brief comparisons between the quasi-likelihood methods and the Gibbs sampling results in Zeger and Karim (1991).

6.3 Example 1: Voting intentions dataset

6.3.1 Background

The dataset used in this example is a component of the British Election Study analysed in Heath, Yang, and Goldstein (1996). This dataset also appears in Goldstein et al (1998), where it is used as the main example in the binary response models chapter. The subsample analysed in Goldstein et al (1998) contains data on 800 voters from 110 constituencies, who were asked how they voted in the 1983 election. Their response was categorised as to whether they

voted Conservative or not, and the interest was in how the voters' opinions on certain issues influenced their voting intentions. The explanatory variables are the voters' opinions, scored on a 21 point scale and then centred around its mean, for the following four issues: firstly whether Britain should possess nuclear weapons (Def); secondly whether low unemployment or low inflation is more important (Unemp); thirdly whether or not they would prefer tax cuts or higher taxes to pay for more government spending (Tax); and finally whether they are in favour of privatisation of public services (Priv).

I will use this example for two purposes: firstly to show the differences in the estimates produced by the Metropolis-Gibbs hybrid method and the quasi-likelihood methods, and secondly to see if the findings on the ideal scaling for the Metropolis proposal distributions from the last chapter extend to logistic regression models.

6.3.2 Model

The model fitted to the dataset is the same model as in Goldstein et al (1998). There will be five fixed effects, an intercept term and fixed effects for the four opinion variables described above, along with a random term to measure the constituency effect. Let p_ij be the probability that the ith voter in the jth constituency voted Conservative; then

logit(p_ij) = β_1 + β_2 Def_ij + β_3 Unemp_ij + β_4 Tax_ij + β_5 Priv_ij + u_j

where u_j ~ N(0, σ²_u). To translate this to the response variable, y_ij, requires y_ij ~ Bernoulli(p_ij). This model now fits into the framework described in the earlier section and can be fitted using the Metropolis Gibbs hybrid method.

6.3.3 Results

The two quasi-likelihood methods, MQL and PQL, and the Metropolis Gibbs hybrid method were all used to fit the above model, and the results are given in Table 6.1. The Metropolis Gibbs hybrid method was run using the adaptive

method described in the last chapter, with a desired acceptance rate of 44%. A uniform prior was used for the variance parameter σ²_u.

Table 6.1: Comparison of results from the quasi-likelihood methods and the MCMC method for the voting intentions dataset. The MCMC method is based on a run of 50,000 iterations after a burn-in of 500 and an adapting period.

Par     MQL1             PQL2             MH Adapt
β_1     -0.355 (0.092)   -0.367 (0.094)   -0.375 (0.102)
β_2     … (0.018)        0.092 (0.018)    0.095 (0.019)
β_3     … (0.019)        0.046 (0.019)    0.046 (0.020)
β_4     … (0.013)        0.069 (0.014)    0.070 (0.014)
β_5     … (0.018)        0.143 (0.018)    0.146 (0.019)
σ²_u    … (0.112)        0.154 (0.117)    0.253 (0.154)

From the table it can be seen that the PQL method gives parameter estimates that are larger (in magnitude) than the MQL method estimates, for both the fixed effects and the level 2 variance. The MCMC method gives estimates that are larger (in magnitude) than both quasi-likelihood methods, particularly for the level 2 variance. In the simulations in Chapter 4 it was shown that for Gaussian models the uniform prior for the variance parameter gave variance estimates that were biased high, particularly for small datasets. This dataset is not particularly small (110 level 2 units), and so more investigation is needed into which of the methods is giving the better variance estimate. Analysis of which method is performing best in terms of bias and coverage properties will be performed on the second example in this chapter.

6.3.4 Substantive conclusions

Considering just the PQL results and back-transforming the variables onto an interpretable scale, the following conclusions can be made. Firstly, a voter with average views on all four issues (Def_ij = Unemp_ij = Tax_ij = Priv_ij = 0) had a 40.9% probability of voting Conservative. A voter was more likely to vote Conservative if they were in favour of Britain possessing nuclear weapons (5 points above the average score implies a 52.3% probability of voting Conservative). They were more likely to vote Conservative if they preferred low inflation to low unemployment (5 points above the average score implies a 49.5% probability of

6.3.5 Optimum proposal distributions

In the last chapter I showed that the optimum value for the scaling factor for the variance of a univariate normal proposal distribution, as suggested by Gelman, Roberts, and Gilks (1995), does not generally carry over to multi-level models. I will now do a similar analysis of the multi-level logistic regression model by considering the voting intentions example. To find optimal proposal distributions, several values of the scaling factor spread over the range 0.05 to 2.0 were chosen. For each value of the scaling factor 3 runs were performed, with a burn-in of 500 iterations and a main run of 50,000 iterations. As before the Raftery-Lewis \hat{N} statistic was calculated for each parameter in each run. Then the optimal scaling factor was chosen to be the value that minimises \hat{N} (an illustrative sketch of this search is given after Table 6.2). The results for the voting intention dataset are summarised in Table 6.2.

Table 6.2: Optimal scale factors for proposal variances and best acceptance rates for the voting intentions model.

Par        Optimal SF    Acceptance %    Min \hat{N}
\beta_1                       -60%          15K
\beta_2                       -60%          13K
\beta_3                       -60%          14K
\beta_4                       -60%          14K
\beta_5                       -60%          14K

Table 6.2 shows that for this model the results in Gelman, Roberts, and Gilks (1995) appear to be close to the optimal values obtained from the simulations. One problem that is not highlighted by this table is that for this model the worst mixing properties are exhibited by the level 2 variance parameter, \sigma^2_u.
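The following minimal sketch (Python) illustrates the kind of scale-factor search just described. It is hypothetical: run_chain stands for whatever routine runs the sampler for a given scale factor and returns the sampled values of one parameter, and a crude autocorrelation-based effective-sample-size proxy is used here in place of the Raftery-Lewis \hat{N} used above (a larger effective sample size corresponds to a smaller required run length).

```python
import numpy as np

def ess(chain, max_lag=200):
    """Crude effective-sample-size proxy based on summed autocorrelations."""
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x.var() * n)
    rho_sum = 0.0
    for lag in range(1, min(max_lag, n)):
        if acf[lag] < 0.05:        # stop once autocorrelation has died away
            break
        rho_sum += acf[lag]
    return n / (1.0 + 2.0 * rho_sum)

def tune_scale_factor(run_chain, scale_factors, n_iter=5000):
    """Run a short chain for each candidate scale factor and prefer the one
    with the largest effective sample size."""
    results = {sf: ess(run_chain(sf, n_iter)) for sf in scale_factors}
    return max(results, key=results.get), results
```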

Figure 6-1 shows the effect of varying the scale factor on the \hat{N} value for the parameter \sigma^2_u, along with a best-fit loess curve. From this figure there appears to be no clear relationship between the value of the scale factor and the \hat{N} values for \sigma^2_u. This parameter is actually updated using Gibbs sampling, so it is only affected indirectly by modifying the scale factor. The loess curve does show a slight upturn as the scale factor gets smaller, but there is far greater variability in the \hat{N} values and so a clear relationship is more difficult to establish.

[Figure 6-1: Plot of the effect of varying the scale factor for the univariate normal proposal distribution on the Raftery-Lewis diagnostic for the \sigma^2_u parameter in the voting intentions dataset.]

To confirm that the behaviour seen above for the scale factor follows for all multi-level logistic regression models I considered again an example from Chapter 2.

At the end of Chapter 2 I introduced multi-level logistic regression models by converting the response variable M5 from the JSP dataset into a pass/fail indicator, Mp5, which depended on whether the M5 mark was at least 30 or not. Model 2.4 (below), amongst others, was fitted to the JSP dataset:

Mp5_ij ~ Bernoulli(p_ij)
log(p_ij / (1 - p_ij)) = \beta_0 + \beta_1 M3_ij + SCHOOL_j
SCHOOL_j ~ N(0, \sigma^2_s)

I then repeated the procedure for finding the optimal proposal distributions detailed above using this model instead of the voting intentions model. For this model the optimum value of the scale factor was found to be between 0.1 and 0.2 for both \beta_0 and \beta_1. The minimum values of \hat{N} for \beta_0 and \beta_1 (both roughly 60,000) were for this model found to be far greater than the minimum \hat{N} value for \sigma^2_u (roughly 5,000). The only common factor between the two models is the optimal acceptance rate, which for Model 2.4 is in the range 40% to 70% for both \beta_0 and \beta_1. This adds further weight to using the adapting procedure detailed in Chapter 5. As in Chapter 5, no simple formula for an optimal scaling factor appears to exist for the multi-level models in this chapter. However, if the adapting method is used, a desired acceptance rate in the range 40% to 60% for the univariate normal Metropolis proposals will give close to optimal proposal distributions.

6.4 Example 2 : Guatemalan child health dataset

6.4.1 Background

The original Guatemalan Child Health dataset consisted of a subsample of respondents from the 1987 National Survey of Maternal and Child Health. The subsample has 2449 responses and a three level structure of births within mothers within communities. The subsample consists of all women from the chosen communities who had some form of prenatal care during pregnancy. The response variable is whether this prenatal care was modern (physician or trained nurse) or not.

Rodriguez and Goldman (1995) use the structure of this dataset to consider how well quasi-likelihood methods compare to ignoring the multi-level structure and fitting a standard logistic regression. They do this by constructing simulated datasets based on the original structure but with known true values for the fixed effects and variances. Rodriguez and Goldman (1995) consider the MQL method and show that the estimates of the fixed effects produced by MQL are worse than the estimates produced by a standard logistic regression disregarding the multi-level structure. Goldstein and Rasbash (1996) consider the same problem but use the PQL method. They show that the results produced by PQL second order estimation are far better than for MQL but still biased. They also state that the example considered, with large underlying random parameter values, is unusual: if the variances in a variance components model do not exceed 0.5, which is more common, the first order PQL estimation method and even the first order MQL method will be adequate. Although Rodriguez and Goldman (1995) considered several different models and several different values for the parameters, Goldstein and Rasbash (1996) only consider the model with the Guatemala structure and the parameter values on which MQL performs badly. I will now try using the MCMC Metropolis-Gibbs hybrid method described earlier in the chapter to fit the same model and compare the results.

6.4.2 Model

The model considered is as follows:

y_ijk ~ Bernoulli(p_ijk)

where

logit(p_ijk) = \beta_0 + \beta_1 x_1ijk + \beta_2 x_2jk + \beta_3 x_3k + u_jk + v_k

and

u_jk ~ N(0, \sigma^2_u)   and   v_k ~ N(0, \sigma^2_v).

In this formulation i, j and k index the level 1, 2 and 3 units respectively. The variables x_1, x_2 and x_3 are composite variables at each level, as the original model contained many covariates at each level.

6.4.3 Original 25 datasets

Goldstein and Rasbash (1996) considered the first 25 datasets simulated by Rodriguez and Goldman (1995) to construct a table of comparisons between MQL and PQL (Table 1 in Goldstein and Rasbash (1996)). This table will now be reconstructed here but will also include the MCMC method with the two alternative priors for the variance parameters. The two priors to be considered are firstly a gamma prior for the precision parameters at both levels 2 and 3, and secondly a uniform prior for the variance parameters at levels 2 and 3. The MCMC procedures were run using a burn-in of 500 iterations after the adaptive method with desired acceptance rate 44%. The main run was of length 100,000 for each dataset, which was based on preliminary analysis using the Raftery-Lewis diagnostic. The MCMC methods took 2 hours each per dataset using MLwiN on a Pentium 200 MHz PC. The quasi-likelihood results here vary from those seen in Goldstein and Rasbash (1996) as the tolerance was set at the default level in MLwiN (relative change from one iteration to the next of at most 0.01). In Goldstein and Rasbash (1996) a more stringent convergence criterion is used (relative change from one iteration to the next of at most 0.001) and the results are in that case slightly less biased. The other difference with this table is that I am considering the variance parameters rather than the standard deviations at levels 2 and 3, as this is the variable reported by MLwiN. The results of this analysis can be seen in Table 6.3.

Results

From Table 6.3 the improvements achieved by the MCMC methods in terms of bias can be clearly seen. Both prior distributions for the variance parameters give results that are less biased than the PQL 2 method. The biases in the random parameters are more extreme for the quasi-likelihood methods when the variances are considered instead of the standard deviations. There is little to choose between the two MCMC variance priors in this example.

Table 6.3: Summary of results (with Monte Carlo standard errors) for the first 25 datasets of the Rodriguez-Goldman example.

Parameter (True)     MQL 1            PQL 2            Gamma prior      Uniform prior
\beta_0 (0.65)       0.483 (0.028)    0.615 (0.035)    0.643 (0.037)    0.659 (0.038)
\beta_1 (1.00)       0.758 (0.033)    0.948 (0.043)    0.996 (0.045)    1.017 (0.047)
\beta_2 (1.00)       0.760 (0.013)    0.951 (0.018)    1.004 (0.020)    1.025 (0.021)
\beta_3 (1.00)       0.744 (0.033)    0.952 (0.043)    0.997 (0.044)    1.022 (0.046)
\sigma^2_v (1.00)    0.530 (0.016)    0.837 (0.036)    0.970 (0.054)    1.066 (0.057)
\sigma^2_u (1.00)    0.025 (0.008)    0.513 (0.036)    0.928 (0.041)    1.044 (0.044)

The gamma prior gives results that are biased on the low side for the variance parameters, while the uniform prior gives estimates that are biased high for both the variances and the fixed effects. With only 25 datasets the coverage properties of the 4 methods cannot be evaluated and so more datasets are needed.

6.4.4 Simulating more datasets

Although the results from these 25 datasets appear to show an improvement in the unbiasedness of estimates produced by the MCMC methods over the quasi-likelihood methods, to emphasise this improvement more datasets are needed. Also, with more datasets the coverage properties of the methods as well as the bias can be evaluated.

Simulation procedure

The MLwiN SIMU, PRED and BRAN commands (Rasbash and Woodhouse 1995) were used to generate 500 datasets with the same underlying structure as the 25 datasets from Rodriguez and Goldman (1995). The simulation procedure is as follows (a sketch of the same procedure is given after this list):

1. Generate 161 v_k s, one for each community, by drawing from a normal distribution with mean 0 and variance \sigma^2_v (1.0).

2. Generate 1558 u_jk s, one for each mother, by drawing from a normal distribution with mean 0 and variance \sigma^2_u (1.0).

3. Evaluate logit(p_ijk) = \beta_0 + \beta_1 x_1ijk + \beta_2 x_2jk + \beta_3 x_3k + u_jk + v_k for all 2449 births using the PRED command.

4. Use the BRAN command to generate a 0 or 1 (y_ijk) for each birth based on the relative likelihoods of a 0 and 1 response and a random uniform draw.
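Outside MLwiN, the same data-generating process can be sketched directly. The function below (Python) is illustrative only: the covariate vectors, the community and mother index vectors, and the true fixed effects would come from the original Rodriguez-Goldman design, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_dataset(x1, x2, x3, community, mother, beta, sigma2_v=1.0, sigma2_u=1.0):
    """One simulated dataset with the Guatemala structure: births (level 1)
    within mothers (level 2) within communities (level 3).  community and
    mother are 0-based integer index vectors, one entry per birth."""
    v = rng.normal(0.0, np.sqrt(sigma2_v), size=community.max() + 1)   # 161 community effects
    u = rng.normal(0.0, np.sqrt(sigma2_u), size=mother.max() + 1)      # 1558 mother effects
    eta = beta[0] + beta[1] * x1 + beta[2] * x2 + beta[3] * x3 + u[mother] + v[community]
    p = 1.0 / (1.0 + np.exp(-eta))                                     # inverse logit
    return rng.binomial(1, p)                                          # Bernoulli response per birth
```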

The same model considered so far was fitted to these 500 datasets using the two quasi-likelihood methods and the MCMC method with the two different priors. As the MCMC methods are time consuming, the main run length used for finding estimates for the 500 datasets was reduced from 100,000 iterations to 25,000 iterations. The adaptive procedure and a burn-in of 500 iterations were used as before. The results for the 500 simulations can be seen in Table 6.4.

Results

Table 6.4 confirms the bias results already seen when only 25 datasets were considered. The MQL method performs badly for the fixed effects and hopelessly for the variance parameters. The PQL method performs a lot better, with smaller fixed effect biases, but still shows large bias for the variances. Both the MCMC priors perform much better than the quasi-likelihood methods and there is little bias. The one exception is the variance parameters using the uniform prior, where some positive bias similar to that seen in the Gaussian models in Chapter 4 can be seen.

The coverage properties are illustrated in Table 6.4 and Figures 6-2 and 6-3 (a sketch of how an empirical coverage rate can be computed from the 500 fits is given below). Here we see again how poor the MQL method is on this example. In fact the method is so poor at estimating \sigma^2_u that none of the 500 datasets has a 95% interval estimate that contains the true value. This is partly due to the large number of runs that have level 2 variance estimates of 0. The PQL method does reasonably well in terms of coverage for the fixed effects, but not as well as the MCMC methods. It also gives variance estimates with very poor coverage properties, particularly at level 2. The MCMC methods both give very similar coverage properties and are both a vast improvement over the quasi-likelihood methods. The uniform prior in general gives slightly better coverage estimates, but this is not always true and the uniform prior also has larger interval widths.
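For reference, the following minimal sketch (Python) shows how an empirical coverage rate of this kind can be computed from a set of simulated-dataset fits. It assumes normal approximation intervals (estimate plus or minus z times the standard error); for the MCMC runs central posterior intervals would be used instead, so the function is illustrative rather than a description of the exact procedure used here.

```python
import numpy as np
from scipy.stats import norm

def empirical_coverage(estimates, std_errors, truth, level=0.95):
    """Fraction of simulated datasets whose interval estimate covers the
    true parameter value, using estimate +/- z * SE intervals."""
    z = norm.ppf(0.5 + level / 2.0)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    return np.mean((lower <= truth) & (truth <= upper))
```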

Table 6.4: Summary of results (with Monte Carlo standard errors) for the Rodriguez-Goldman example with 500 generated datasets.

Parameter (True)     MQL 1            PQL 2            Gamma prior      Uniform prior

Estimates (Monte Carlo SE)
\beta_0 (0.65)       0.474 (0.007)    0.612 (0.009)    0.638 (0.010)    0.655 (0.010)
\beta_1 (1.00)       0.741 (0.007)    0.945 (0.009)    0.991 (0.010)    1.015 (0.010)
\beta_2 (1.00)       0.753 (0.004)    0.958 (0.005)    1.006 (0.006)    1.031 (0.005)
\beta_3 (1.00)       0.727 (0.009)    0.942 (0.011)    0.982 (0.012)    1.007 (0.013)
\sigma^2_v (1.00)    0.550 (0.004)    0.888 (0.009)    1.023 (0.011)    1.108 (0.011)
\sigma^2_u (1.00)    0.026 (0.002)    0.568 (0.010)    0.964 (0.018)    1.130 (0.016)

Coverage Probabilities (90%/95%)
\beta_0              67.6/
\beta_1
\beta_2
\beta_3                                                                        /93.6
\sigma^2_v           0.6/2.4          70.2/                                    /92.2
\sigma^2_u           0.0/0.0          21.2/                                    /93.0

Average Interval Widths (90%/95%)
\beta_0
\beta_1
\beta_2
\beta_3
\sigma^2_v           0.339/
\sigma^2_u           0.149/

[Figure 6-2: Plots comparing the actual coverage of the four estimation methods (MQL, PQL, MCMC Gamma prior, MCMC Uniform prior) with their nominal coverage for the parameters \beta_0, \beta_1 and \beta_2.]

[Figure 6-3: Plots comparing the actual coverage of the four estimation methods (MQL, PQL, MCMC Gamma prior, MCMC Uniform prior) with their nominal coverage for the parameters \beta_3, \sigma^2_v and \sigma^2_u.]

6.4.5 Conclusions

Rodriguez and Goldman (1995) originally pointed out the deficiencies of the MQL method on multi-level logistic regression models with structures similar to our datasets. Goldstein and Rasbash (1996) then showed how the PQL method improved greatly on the MQL method but still showed some bias. In this section I have shown that the Metropolis-Gibbs hybrid method described earlier in this thesis gives even better estimates, both in terms of bias and coverage. It is also clear that the choice of prior distribution for the variance parameters is not as important in this problem due to the large numbers of level 2 and 3 units, and both priors considered give better estimates than the quasi-likelihood methods.

One point to note, as shown in the simulations in Breslow and Clayton (1993), is that the under-estimation from the quasi-likelihood methods is worst when there is a Bernoulli response. When the model is changed to a binomial response and the denominator increased, this under-estimation is reduced. I will discuss briefly models with a general binomial response in Chapter 8.

6.5 Summary

In this chapter the family of hierarchical binary response logistic regression models has been introduced and it has been shown how to adapt the Metropolis-Gibbs hybrid method to fit these models. Two examples that show yet more applications of multi-level modelling were used to illustrate fitting such models. The first example had data on the intentions of voters in the 1983 election, and was used to demonstrate the differences between the estimates from the quasi-likelihood methods and the MCMC methods. It was also used to find optimal proposal distributions for the Metropolis steps of the MCMC method.

The second example used simulated datasets based on a dataset on child health in Guatemala. This example was used to compare the performance of the quasi-likelihood methods, MQL and PQL, against the Metropolis-Gibbs hybrid method using two prior distributions for the variance parameters, where the true values of all parameters were known. In this example it was shown that the MCMC methods outperform the quasi-likelihood methods both in terms of bias and coverage properties. It was also shown that with such a large dataset the

choice of prior distribution for the variance is not of great importance.

Chapter 7

Gaussian Models 3 - Complex Variation at Level 1

7.1 Model definition

In Chapters 4 and 5 I considered Gaussian models with a simple level 1 variance \sigma^2. In all the algorithms given, the Gibbs sampler was used to update all variance parameters. Although in Chapter 5 I considered some hybrid methods containing a mix of Gibbs and Metropolis sampling steps, the Metropolis steps updated the fixed effects and residuals, and all variance parameters were always updated using the Gibbs sampler. In this chapter I will remove the restriction that the model must have a simple constant variance at level 1 and instead allow the level 1 variance to depend on other predictor variables. I will consider, by way of an example, the following simple two level model with complex variation at level 1:

y_ij = X^F_ij \beta^F + X^R_ij \beta^R_j + X^C_ij e_ij,
e_ij ~ MVN(0, V_C),   \beta^R_j ~ MVN(0, V_R).

If I consider again the JSP dataset, introduced in Chapter 2, with M5, the maths score in year 5, as the response variable, then I could extend the random slopes regression model by allowing the M3 predictor variable to be random at both levels 1 and 2.

What this would then mean is that the variability of an individual pupil's M5 score is not only dependent on school level variables but also on his/her individual maths score in year 3. The model can be written as follows:

M5_ij = \beta^F_0 + \beta^R_0j + e_0ij + M3_ij (\beta^F_1 + \beta^R_1j + e_1ij),
e_ij = (e_0ij, e_1ij)^T ~ MVN(0, V_C),   \beta^R_j = (\beta^R_0j, \beta^R_1j)^T ~ MVN(0, V_R).

Then

Var(M5_ij | \beta^F, \beta^R_j) = V_C00 + 2 V_C01 M3_ij + V_C11 (M3_ij)^2,

and in this case a simple Gibbs sampling approach cannot be used to update the level 1 variance parameters. (A short sketch of how this per-observation variance is evaluated is given at the end of this section.) In the random slopes regression example in Chapter 2 there was a single e_ij for each observation (pupil), which was evaluated as follows:

e_ij = y_ij - X^F_ij \beta^F - X^R_ij \beta^R_j.

Then the level 1 variance \sigma^2 could be updated via a scaled inverse \chi^2 distribution based solely on the e_ij s. When the variation at level 1 is complex there is not a single e_ij for each observation, and instead

X^C_ij e_ij = y_ij - X^F_ij \beta^F - X^R_ij \beta^R_j,

where X^C_ij is the data vector of variables that are random at level 1 for observation ij, and e_ij is the vector of residuals for observation ij. From this I would have to estimate each vector e_ij via an MCMC updating rule, which is difficult. Instead I propose to use Metropolis-Hastings sampling on the level 1 variance matrix. Before discussing the problem of creating a Hastings update for a variance matrix, I will first consider the simpler case of using MCMC to update a scalar variance.
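As a small illustration of the variance function above, the per-observation level 1 variance implied by a given V_C can be evaluated directly. The sketch below (Python) uses illustrative values only, not fitted JSP estimates.

```python
import numpy as np

def level1_variance(V_C, X_C):
    """sigma^2_ij = (X^C_ij)^T V_C X^C_ij for every observation; X_C has one
    row per observation, here (1, M3_ij)."""
    return np.einsum("ij,jk,ik->i", X_C, V_C, X_C)

V_C = np.array([[28.0, -0.5],
                [-0.5,  0.5]])                      # illustrative level 1 variance matrix
X_C = np.column_stack([np.ones(3), [-10.0, 0.0, 10.0]])   # three pupils with centred M3 scores
print(level1_variance(V_C, X_C))                    # V_C00 + 2*V_C01*M3 + V_C11*M3^2
```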

7.2 Updating methods for a scalar variance

In earlier chapters I dealt with multi-level models where the variance at level 1, \sigma^2, is a scalar. In these chapters the conditional distribution of \sigma^2 is easily determined, and a Gibbs sampling step can be used to update \sigma^2. Similarly BUGS (Spiegelhalter et al. 1994) fits such models using an adaptive rejection Gibbs sampling algorithm. I am interested in finding alternative approaches for sampling parameters that are restricted to being strictly positive. Two alternative approaches are described below.

7.2.1 Metropolis algorithm for log \sigma^2

I have already used the Metropolis algorithm via a normal proposal distribution to update parameters defined on the whole real line. A similar approach can be used on the current problem, but firstly the variable of interest must be transformed to a variable that is defined on the whole real line. Draper and Cheal (1997), when analysing a problem from Copas and Li (1997), use the approach of transforming \sigma^2 to log \sigma^2. They then use a multivariate normal proposal for all unknowns in the model, including log \sigma^2. I will use a univariate proposal for \sigma^2 as I am updating it separately from the other unknowns. I will therefore consider updating log \sigma^2 by using a univariate Normal(0, s^2) proposal distribution. As the parameter of interest has been transformed from \sigma^2 to log \sigma^2, I need to consider the Jacobian of the transformation and build this into the prior. The one disadvantage of this method is that it cannot extend to the harder problem where I have a variance matrix to update. The technique of transforming variance parameters to the log scale is not unique to Draper and Cheal (1997) and is also used in Section 11.6 of Gelman et al. (1995) on a hierarchical model.

7.2.2 Hastings algorithm for \sigma^2

In Chapter 3 I used normal proposal distributions for unknown parameters \theta, which had their mean fixed at the current value \theta_t, to generate \theta_{t+1}. This type of proposal has the advantage that p(\theta_{t+1} | \theta_t) = p(\theta_t | \theta_{t+1}), and so the Metropolis algorithm can be used. As I am now considering a parameter, \sigma^2, that is restricted to be positive, I want to use a proposal that generates strictly positive values. I am therefore going to use a scaled inverse chi-squared distribution with expectation the current estimate for \sigma^2, \sigma^2_t, to generate \sigma^2_{t+1}. The scaled inverse chi-squared distribution with parameters \nu and s^2 has expectation \nu s^2 / (\nu - 2), so letting \nu = w + 2 and s^2 = w \sigma^2_t / (w + 2), where w is a positive integer degrees of freedom parameter, produces a distribution with expectation \sigma^2_t.

The parameter w can be set to any value, and plays a similar role to the variance parameter in the Metropolis proposal distribution, as it affects the acceptance probability. This proposal is not symmetric, p(\sigma^2_{t+1} | \sigma^2_t) \neq p(\sigma^2_t | \sigma^2_{t+1}), so the Hastings ratio has to be calculated. Assuming that currently \sigma^2_t = a and that the value \sigma^2_{t+1} = b is generated, the Hastings ratio is the ratio of the proposal density for the reverse move to that for the forward move:

hr = p(a | SI\chi^2(w+2, wb/(w+2))) / p(b | SI\chi^2(w+2, wa/(w+2)))
   = [ b^{w/2+1} a^{-(w/2+2)} exp(-wb/(2a)) ] / [ a^{w/2+1} b^{-(w/2+2)} exp(-wa/(2b)) ]
   = (b/a)^{w+3} exp( (w/2)(a/b - b/a) ).

This Hastings ratio can then be used in the Hastings algorithm, with the proposed value accepted with probability min(1, hr x p(b | y)/p(a | y)). A sketch of this update step is given below.
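The following minimal sketch (Python) implements this update. The function log_post is a stand-in for the log posterior of \sigma^2 under whatever prior is chosen; the draw from the scaled inverse chi-squared proposal uses the identity SI\chi^2(\nu, s^2) = \nu s^2 / \chi^2_\nu.

```python
import numpy as np

rng = np.random.default_rng(2)

def hastings_sigma2_step(sigma2, w, log_post):
    """One Hastings update for a scalar variance using a scaled inverse
    chi-squared proposal with w + 2 degrees of freedom and expectation equal
    to the current value."""
    a = sigma2
    # b ~ SI-chi^2(w+2, w*a/(w+2)), i.e. (w+2) * s^2 / chi^2_{w+2} = w*a / chi^2_{w+2}
    b = w * a / rng.chisquare(w + 2)
    # log Hastings ratio: (w+3) * log(b/a) + (w/2) * (a/b - b/a)
    log_hr = (w + 3) * np.log(b / a) + 0.5 * w * (a / b - b / a)
    log_accept = log_hr + log_post(b) - log_post(a)
    return b if np.log(rng.uniform()) < log_accept else a
```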

7.2.3 Example : Normal observations with an unknown variance

The three methods discussed in this section will now be illustrated by the following example. I generate 100 observations from a normal distribution with known mean 0 and known variance 4. Then I will assume the global mean is known to be zero and interest lies in estimating the variance parameter \sigma^2. Assigning a scaled inverse chi-squared prior for \sigma^2, the model is then as follows:

Y_i ~ N(0, \sigma^2),   i = 1, ..., 100,
\sigma^2 ~ SI\chi^2(\nu_0, \sigma^2_0).

This problem has a conjugate prior and consequently the posterior has a known distribution as follows:

\sigma^2 ~ SI\chi^2( \nu_0 + n, (\nu_0 \sigma^2_0 + nV) / (\nu_0 + n) ),

where n is the number of Y's (in this case 100) and V = (1/n) \sum_{i=1}^{n} (y_i - \mu)^2. For the prior distribution the values \nu_0 = 3 and \sigma^2_0 = 6 will be used; combined with the sample variance V of the 100 simulated observations, this leads to the posterior distribution

\sigma^2 ~ SI\chi^2(103, 4.43).

This distribution has posterior mean 103 x 4.43 / 101, approximately 4.52, for \sigma^2, with standard deviation approximately 0.64. The Gibbs sampling method used by BUGS will sample IID from the posterior distribution for \sigma^2, and so should perform best. The Gibbs method can sample \sigma^2 IID from its posterior distribution as it is the only unknown in this model. The Metropolis log \sigma^2 method and the Hastings method should get reasonable answers but may take longer to obtain accuracy.

7.2.4 Results

In the following analysis I will compare the three methods described earlier. For the Gibbs sampler method a burn-in of 1,000 updates and a main run of 5,000 updates was used for each of three random number seeds and the results were averaged. Several different standard deviation values s in the Metropolis method and several different degrees of freedom values w in the Hastings method were used for comparative purposes. For each s or w, three runs were performed with a burn-in of 1,000, a main run of 100,000 and different random number seeds. The results obtained were averaged over the three runs. The Raftery-Lewis convergence diagnostic column of Table 7.1 contains the largest value of \hat{N}, the estimated run length required, across the three runs. As can be seen from the table, the largest \hat{N} is 83,800, which is smaller than the 100,000 run lengths, and so this suggests all the runs have achieved their default accuracy goals.

Table 7.1 shows that the Gibbs method has the predicted fast convergence rate. All three methods give the correct answer to two decimal places. The convergence rates of the Metropolis and Hastings methods vary with the parameter value of the proposal distribution. This is also linked with the acceptance rates of the two algorithms. The best proposal parameter values are those giving an acceptance rate of approximately 44%, although in this simple example acceptance rates of between 30% and 60% give similar convergence rates.

Table 7.1: Comparison between three MCMC methods for a univariate normal model with unknown variance.

Method                          s/w    Mean    Sd    Acc %    R/L \hat{N}
Theory                          N/A                  N/A      N/A
Gibbs                           N/A
Metropolis (several values s)
Hastings (several values w)

There is therefore scope to incorporate the adaptive procedure illustrated in Chapter 5 into these methods. I will now return to the original problem of updating a variance matrix.

7.3 Updating methods for a variance matrix

In the models in Chapter 5 the level 2 variance can be a matrix. In the algorithms given, a simple Gibbs sampling step is used to update this variance as it has an inverse Wishart posterior distribution with parameters that are easily evaluated. If there is complex variation at level 1, then the level 1 variance is a matrix and this will also have an inverse Wishart posterior distribution. Unfortunately the parameters will depend on the vector of level 1 residuals, e_ij, which are not easily evaluated. Consequently an alternative method is needed that does not need to evaluate the level 1 residuals.

When I considered the case of a scalar variance, I used a Hastings step that had a proposal distribution of a similar form to the posterior distribution, and I will consider a similar approach here.

7.3.1 Hastings algorithm with an inverse Wishart proposal

When there was a scalar variance \sigma^2 to update, a proposal distribution that generated strictly positive values was required. Now that there is a variance matrix \Sigma to update, I require a proposal distribution that generates positive definite matrices. I am therefore going to use an inverse Wishart proposal distribution with expectation the current estimate for \Sigma, \Sigma_t, to generate \Sigma_{t+1}. The inverse Wishart distribution for a k x k matrix with parameters \nu and S has expectation (\nu - k - 1)^{-1} S, so letting \nu = w + k + 1 and S = w \Sigma_t, where w is a positive integer degrees of freedom parameter, produces a distribution with expectation \Sigma_t. As in the univariate case, the parameter w is set to a value that gives the desired acceptance rate. Again the proposal is not symmetric, so the Hastings ratio must be calculated. Assuming that currently \Sigma_t = A and that \Sigma_{t+1} = B is generated, the Hastings ratio (the reverse proposal density over the forward proposal density, as in the scalar case) is as follows:

hr = p(A | IW(w+k+1, wB)) / p(B | IW(w+k+1, wA))
   = [ |wB|^{(w+k+1)/2} |A|^{-(w+2k+2)/2} exp(-(1/2) tr(wB A^{-1})) ] / [ |wA|^{(w+k+1)/2} |B|^{-(w+2k+2)/2} exp(-(1/2) tr(wA B^{-1})) ]
   = ( |B| / |A| )^{(2w+3k+3)/2} exp( (w/2)( tr(A B^{-1}) - tr(B A^{-1}) ) ),

which reduces to the scalar-variance ratio of Section 7.2.2 when k = 1. A sketch of this proposal and ratio is given below.
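A minimal sketch of this step (Python): inv_wishart_proposal draws a positive definite matrix with the stated expectation, and log_hastings_ratio evaluates the log of the ratio just derived. The Wishart draw is built from standard normals, which assumes the degrees of freedom w + k + 1 is an integer, as in the text.

```python
import numpy as np

rng = np.random.default_rng(3)

def inv_wishart_proposal(current, w):
    """Draw a proposal with expectation `current` from an inverse Wishart
    distribution with w + k + 1 degrees of freedom and scale w * current."""
    k = current.shape[0]
    df, scale = w + k + 1, w * current
    L = np.linalg.cholesky(np.linalg.inv(scale))       # W ~ Wishart(df, scale^{-1})
    Z = L @ rng.standard_normal((k, df))
    return np.linalg.inv(Z @ Z.T)                      # invert to get the inverse Wishart draw

def log_hastings_ratio(current, proposed, w):
    """log of q(current | proposed) / q(proposed | current), with A = current
    and B = proposed as in Section 7.3.1."""
    k = current.shape[0]
    _, logdet_a = np.linalg.slogdet(current)
    _, logdet_b = np.linalg.slogdet(proposed)
    power = 0.5 * (2 * w + 3 * k + 3)
    trace_term = np.trace(current @ np.linalg.inv(proposed)) - np.trace(proposed @ np.linalg.inv(current))
    return power * (logdet_b - logdet_a) + 0.5 * w * trace_term
```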

7.3.2 Example : Bivariate normal observations with an unknown variance matrix

A second simple example will now be used to show that this method works. I will compare the results obtained with the theoretical answers and the Gibbs sampling results. One hundred observations were generated from a bivariate normal distribution with known mean vector \mu = (4, 2)^T and a known 2 x 2 variance matrix. I then assume that the mean vector \mu is known and interest lies in estimating the variance matrix \Sigma. An inverse Wishart prior distribution for \Sigma was assigned, and the model is then as follows:

Y_i ~ MVN(\mu, \Sigma),
\Sigma ~ IW(\nu_0, S_0).

The inverse Wishart prior is conjugate for \Sigma and consequently the posterior for \Sigma has the following distribution:

p(\Sigma | Y) ~ IW( \nu_0 + n, S_0 + nV ),

where n is the number of Y_i s, in this case 100, and V = (1/n) \sum_{i=1}^{n} (Y_i - \mu)(Y_i - \mu)^T. In this example the value \nu_0 = 3 was used, together with a fixed 2 x 2 scale matrix S_0, for the prior distribution. The resulting posterior for \Sigma is an inverse Wishart distribution with 103 degrees of freedom, whose posterior mean matrix has leading entries 2.188 and -0.363. I will again compare the Gibbs sampling method used by BUGS, which samples directly from the posterior distribution for \Sigma and should perform well, with the Hastings method, which should take longer to converge.

7.3.3 Results

For the Gibbs sampler method, a burn-in of 1,000 updates and a main run of 5,000 updates was used for each of 3 random number seeds and the results were averaged. For the Hastings method several different degrees of freedom values w were used. For each w, three runs were performed with different random number seeds, with a burn-in of 1,000 and a main run of 100,000. The Raftery and Lewis convergence diagnostic is the maximum \hat{N} of the three variables monitored over the three runs. The results can be seen in Table 7.2.

Table 7.2: Comparison between two MCMC methods for a bivariate normal model with unknown variance matrix.

Method      w     \Sigma_11         \Sigma_12          \Sigma_22         Acc %    R/L \hat{N}
Theory      N/A   2.188                                                  N/A      N/A
Gibbs       N/A   2.193 (0.315)     -0.365 (0.167)     1.178 (0.168)              K
Hastings          (0.315)           -0.362 (0.165)     1.181 (0.167)              K
                  (0.313)           -0.364 (0.166)     1.180 (0.169)              K
                  (0.309)           -0.361 (0.165)     1.180 (0.168)              K
                  (0.312)           -0.364 (0.167)     1.181 (0.169)              K
                  (0.312)           -0.365 (0.167)     1.181 (0.170)              K
                  (0.312)           -0.365 (0.168)     1.186 (0.169)              K

From Table 7.2 it can be seen that the choice of 100,000 as the main run length for the Hastings method satisfies the Raftery-Lewis convergence diagnostic for all selected values of w. It can also be seen that the default accuracy goals are achieved quickest when the acceptance rate is between 30% and 35%. This is different from the univariate case, but this is because the new procedure involves updating the whole variance matrix and not just a single parameter. In fact Gelman, Roberts, and Gilks (1995) suggest a rate of 31.6% for a 3 dimensional normal update, which compares favourably with this analysis. It can be clearly seen that the method is estimating the variance matrix correctly. I now need to incorporate this method into the algorithm for the models that are the theme of this chapter, multi-level models with complex variation at level 1.

7.4 Applying inverse Wishart updates to complex variation at level 1

In Chapter 5 I showed how the Gibbs algorithm for a Gaussian 3 level model is easily generalised to N levels. In this section I will simply describe how to use the inverse Wishart updating step for the level 1 variance with a 2 level model. Extending this algorithm to N levels should be analogous to the work in Chapter 5. The model to be considered is as follows :

y ij =X F ij F +X R ij R j +X C ije ij

184 e ij MVN(0 V C ) R j MVN(0 V R ): The important part of the algorithm is to store the variance for each individual, ij 2 =(X C ij) T V C X C ij: All the other parameters then depend on the level 1variance, V C through these individual variances This means that the algorithm below is, apart from the updating step for V C almost identical to the algorithm for the same model without complex variation The main dierence is that everywhere that 2 appeared in the old algorithm, it is replaced by ij 2 and this often involves moving the ij 2 inside summations 741 MCMC algorithm I will assume that the following general priors are used, for the level 1 variance, p(v C ) IW( 1 S 1 ), for the level 2 variance, p(v R ) IW( 2 S 2 ) and for the xed eects, F N( p S p ) The algorithm then has four steps as follows : Step 1 - The xed eects, F p( F j y R V C V R ) / p(y j F R V C V R )p( F ) where F MV N( b F b D F ) bd F =[ X ij (X F ij) T X F ij 2 ij + S p ;1 ] ;1 and b F = b D F 2 4 X ij (X F ij) T (y ij ; X R ij R j ) 2 ij 3 + S p ;1 p 5 : 167

185 Step 2 - The level 2 residuals, R p( R j y F V C V R ) / p(y j F R V C V R )p( R jv R ) where R j MVN( b R j b D R j ) bd R j =[ n X j i=1 (X R ij) T X R ij 2 ij +V ;1 R ] ;1 and b R j = b D R j n X j i=1 Step 3 - The level 1 variance, V C (X R ij) T (y ij ; X F ij F ) : ij 2 This step now involves a Hastings update using an inverse Wishart proposal distribution V (t) C = V C with probability min(1 hr p(v C j y :::)=p(v (t;1) C j y :::)) = V (t;1) C otherwise where V C IW n1 (S = wv (t) C = w+n 1 +1) wbeing a tuning parameter which will aect the acceptance rate of the Hastings proposals and n 1 the number of random parameters at level 1 The Hastings ratio, hr is as follows hr = j V(t;1) C j V C j where = 2w+3n j exp(w 2 (tr(v C(V (t;1) C Step 4 - The level 2 variance, V R ) ;1 ) ; tr(v (t;1) C (V C) ;1 ))) p(v ;1 R j y F V C ) / p( R j V R )p(v ;1 R ) 168

186 V ;1 R Wishart n2 [S pos =( JX j=1 R j ( R j ) T + S P ) ;1 pos = J + P ] where n 2 is the number of random variables at level 2 If a uniform prior is required then set S P = 0 and P = ;n 2 ; Example 1 The example to be used to illustrate the algorithm was described at the start of the chapter and consists of the JSP dataset with an extension to the random slopes regression model so that the predictor variable M3 is random at level 1 The model is as follows : e ij = e 0ij e 1ij M5 ij = F 0 + R 0j + e 0ij + M3 ij ( F 1 + R 1j + e 1ij) 1 0 A MVN(0 VC ) j R R 0j 1j R 1 A MVN(0 VR ): The predictor variable M3 has been centred around its mean I will not use the actual response variable M5 as this produces estimates via IGLS and RIGLS that do not lead to a positive denite matrix V C This problem will be discussed in the next section where an alternative method is discussed Instead I will use the MLn SIMU command (Rasbash and Woodhouse 1995) to create a simulated dataset with a positive denite matrix V C The results from one simulated dataset are given in Table 73 The results for the MCMC methods are based on the average of three runs each of length 50,000 after a burn-in of 500 iterations The value of w = 150 was selected as this gives an acceptance rate of approximately 32%, and the rate suggested in Gelman, Roberts, and Gilks (1995) for a 3 dimensional normal update is 31:6% The last column contains results for method 2 which is discussed later in the chapter From Table 73 it should be noted that as the results are for only one dataset generated using the values in the `True' column, the estimates should not be identical to these true values What is clear is that the MCMC methods are 169

187 Table 73: Comparison between IGLS/RIGLS and MCMC method on a simulated dataset with the layout of the JSP dataset Parameter IGLS RIGLS MCMC IW MCMC (True) w = 150 Method 2 0 (30:0) (0487) (0492) (0527) (0526) 1 (0:5) 0537 (0080) 0537 (0081) 0536 (0088) 0538 (0088) V R00 (6:0) 8417 (2294) 8642 (2344) (2867) (2866) V R01 (;0:25) {0546 (0282) {0560 (0288) {0662 (0361) {0658 (0359) V R11 (0:1) 0176 (0062) 0183 (0064) 0231 (0084) 0230 (0084) V C00 (28:0) (2187) (2188) (2279) (2291) V C01 (;0:5) {0654 (0261) {0656 (0261) {0675 (0270) {0677 (0271) V C11 (0:5) 0571 (0091) 0571 (0091) 0589 (0099) 0589 (0100) exhibiting behaviour that could be predicted from the results in Chapter 4 The MCMC methods are using uniform priors for both variance matrices and this is giving larger variance estimates The discrepancy is also larger at level 2 where the estimates are based on 48 schools than at level 1 where the estimates are based on 887 pupils as expected The MCMC methods have wider uncertainty bands than those from IGLS and RIGLS which is also to be expected 743 Conclusions From this one example it can be concluded that provided the dataset used gives level 1 variance estimates in IGLS/RIGLS that produce a positive denite matrix then the above method can be used to t the model and give MCMC estimates One important point to note is that in the earlier examples with a single residual at level 1 for each individual, e ij these residuals could be estimated by subtraction at each iteration This would then lead to MCMC estimates for each e ij Following this through to the complex variation models, the subtraction approach will give estimates for the composite residual, X C ije ij but not for the individual e ij The best that is available is similar to the IGLS/RIGLS methods These methods calculate the residuals based on the nal estimates of the other parameters and could simply be applied while using the MCMC estimates for the other parameters The one point I touched on earlier is that the IGLS/RIGLS method for the 170

188 original dataset produces estimates that do not form a positive denite variance matrix, V C Although the method I have just given is a useful way of using the Metropolis-Hastings sampler, I will in the next section show an alternative method that can handle non-positive denite variance matrices 75 Method 2 : Using truncated normal Hastings update steps The inverse Wishart updating method assumes that the variance `matrix' at level 1 must be positive denite Although the way the model has been written thus far suggests that the variance at level 1 should be a matrix, an alternative form would be to consider the variance at level 1 as a quadratic form For example using the JSP example considered already, Var(M5 ij j F R j )=A +2BM3 ij + C(M3 ij ) 2 : Using the constraint that the A B stronger constraint than is actually needed 0 B 1 A is positive denite is a C Positive denite matrices will guarantee that any vector Xij C will produce a positive variance, but in the JSP example the rst random variable is constant and the second variable, M3 takes integer values, before centering, between 0 and 40 So a looser constraint is to allow all values (A B C )such that A +2B M3 ij + C (M3 ij ) 2 > 0 8 i j: This constraint looks quite complicated to work with but if I consider each of the variables, A B and C separately and assume the other variables are xed the constraints become easier I will now consider the steps required for our simple example before generalising the algorithm to all problems with complex variation at level Update steps at level 1 for JSP example At iteration t, assume that the current values for the parameters are A (t) B (t) and C (t), and let ij 2 = Var(M5 ij j F j R ) Then I will update the three parameters in turn 171

189 Updating parameter A At time t, 2 ij = A (t) +2B (t) M3 ij + C (t) (M3 ij ) 2 > 0 8 i j: So let 2B (t) M3 ij + C (t) (M3 ij ) 2 = ;r A ij then A (t) >r A ij 8 i j: This implies A (t) > max A where max A = max(r A ij): I will use a normal proposal distribution with variance, v A but only consider values generated that satisfy the constraint This will lead to a truncated normal proposal as shown in Figure 7-1 (i) The Hastings ratio can then be calculated by the ratio of the two truncated normal distributions shown in Figure 7-1 (i) and (ii) Letting the value for A at time t be A c and the proposed value for time t +1be A The update step is now as follows : hr = p(a(t+1) = A j A (t) = A c ) p(a (t+1) = A c j A (t) = A ) = 1 ; ((max A ; A )= p v A ) 1 ; ((max A ; A c )= p v A ) : A (t+1) = A with probability min(1 hr p(a j y :::)=p(a (t) j y :::)) = A (t) otherwise: Updating parameter B At time t, 2 ij = A (t) +2B (t) M3 ij + C (t) (M3 ij ) 2 > 0 8 i j: 172

190 So let A (t) + C (t) (M3 ij ) 2 = ;r B ij then B (t) > rij=(2 B M3 ij ) 8 M3 ij > 0 and B (t) < rij=(2 B M3 ij ) 8 M3 ij < 0: This leads to two constraints : B (t) >max B + where max B + = max(rij=(2 B M3 ij )) M3 ij > 0) and B (t) < min B ; where min B ; = min(rij=(2 B M3 ij )) M3 ij < 0): I will use a normal proposal distribution, with variance v B, but only consider values generated that satisfy these constraints This will lead to a truncated normal proposal as shown in Figure 7-1 (iii) The Hastings ratio can then be calculated by the ratio of the two truncated normal distributions shown in Figure 7-1 (iii) and (iv) Letting the value for B at time t be B c and the proposed value for time t +1beB hr = p(b(t+1) = B j B (t) = B c ) p(b (t+1) = B c j B (t) = B ) = ((min B ; ; B )= p v B ) ; ((max B + ; B )= p v B ) ((min B ; ; B c )= p v B ) ; ((max B + ; B c )= p v B ) : The update step is now as follows : B (t+1) = B with probability min(1 hr p(b j y :::)=p(b (t) j y :::)) = B (t) otherwise: Updating parameter C At time t, 2 ij = A (t) +2B (t) M3 ij + C (t) (M3 ij ) 2 > 0 8 i j: 173

191 So let A (t) +2B (t) M3 ij = ;r C ij then C (t) >r A ij=(m3 ij ) 2 8 i j: This implies C (t) >max C where max C = max(r C ij=(m3 ij ) 2 ): I will use a normal proposal distribution but only consider values generated that satisfy the constraint This will lead to a truncated normal proposal as shown in Figure 7-1 (i) The Hastings ratio can then be calculated by the ratio of the two truncated normal distributions shown in Figure 7-1 (i) and (ii) Letting the value for C at time t be C c and the proposed value for time t +1be C The update step is now as follows : hr = p(c (t+1) = C j C (t) = C c ) p(c (t+1) = C c j C (t) = C ) = 1 ; ((max C ; C c )= p v C ) 1 ; ((max C ; C c )= p v C ) : C (t+1) = C with probability min(1 hr p(c j y :::)=p(c (t) j y :::)) = C (t) otherwise: The results of using this second method on the simulated dataset example of the last section can be seen in the last column of Table 73 Here it can be seen that although the two methods are based on dierent constraints, this example where the correct answer has a positive denite framework means that both methods give similar estimates Before generalising the algorithm for all covariance structures at level 1, I will now return to the original data from the JSP dataset This will show that this second method has the advantage of tting models with a non-positive denite variance structure at level 1 174

[Figure 7-1: Plots of truncated univariate normal proposal distributions for a parameter \theta. A is the current value \theta_c and B is the proposed new value \theta^*; M is max and m is min, the truncation points. The distributions in (i) and (iii) have mean \theta_c, while the distributions in (ii) and (iv) have mean \theta^*.]
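The following minimal sketch (Python) shows a constrained update of the kind pictured in Figure 7-1 for a single level 1 variance term: a normal proposal centred at the current value, restricted to values above the lower truncation point, with the standard Metropolis-Hastings correction given by the ratio of the two truncation normalising constants. The name log_post is a stand-in for the log posterior of the parameter with all other quantities held fixed.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def truncated_normal_step(theta, lower, sd_prop, log_post):
    """One Hastings update for a parameter constrained to lie above `lower`,
    using a normal proposal truncated to (lower, infinity)."""
    # draw from the truncated proposal by simple rejection
    while True:
        prop = theta + sd_prop * rng.standard_normal()
        if prop > lower:
            break
    # Hastings correction: ratio of the truncation normalising constants
    log_hr = np.log(1.0 - norm.cdf((lower - theta) / sd_prop)) \
           - np.log(1.0 - norm.cdf((lower - prop) / sd_prop))
    log_accept = log_hr + log_post(prop) - log_post(theta)
    return prop if np.log(rng.uniform()) < log_accept else theta
```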

193 752 Proposal distributions I have not as yet mentioned the proposal distributions used in this method in much detail I simply stated that I was using a truncated normal distribution with a particular variance for the untruncated version The problem of choosing a value for the variance parameter is the same problem that I had when considering updating the xed eects and higher level residuals by Metropolis Hastings updates Here like there, two possible solutions are to use the variance of the parameter estimate from the RIGLS procedure multiplied by a suitable scaling to give a variance for the normal proposal distribution, or to use an adaptive approach before the burn-in and main run of the simulation In example 1, I used a scaling of 58 on the variance scale which gave acceptance rates of between 35% and 50% 753 Example 2 : Non-positive denite and incomplete variance matrices at level 1 The original JSP dataset gave a non-positive denite matrix as an estimate for the level 1 variance The term V e11 is very small and is not statistically signicant, so it could be removed from the model This will mean that in practical terms the variance of individual students' M 5 scores is dependent on their M 3 score in a linear way, rather than a quadratic This will then lead to an incomplete variance matrix at level 1 This sort of variance structure at level 1 can be useful in other situations, for example if the data consists of boys and girls and it is believed that there is a dierence in variabilitybetween boys and girls Then the variance equation could be as follows : Var(Y ij j F R )=V e00 + V e01 Boy ij : where Boy ij has value 1 if the child is a boy and 0 if the child is a girl, and V e01 can be negative Here including the quadratic term will not make sense as (Boy ij ) 2 = Boy ij 8 i j so an incomplete variance structure should be used Table 74 contains the estimates of the RIGLS and second MCMC methods for the complex variation model tted to the original JSP data (Model 2) The 176

194 table also contains estimates when rstly V e11 is removed (Model 3), and secondly V e01 is removed (Model 4) The MCMC algorithm described earlier can easily accommodate the removal of terms from the variance matrix The term is simply set to zero before running the simulation method and during the method the term is never updated Table 74: Comparison between RIGLS and MCMC method 2 on three models with complex variation tted to the JSP dataset Parameter Model 2 Model 3 Model (0356) (0356) (0367) (0033) 0617 (0033) 0612 (0042) V R (1230) 4271 (1230) 4638 (1306) RIGLS V R01 {0246 (0098) {0247 (0098) {0340 (0117) V R (0010) 0017 (0010) 0028 (0017) V C (1465) (1419) (1651) V C01 {1206 (0137) {1203 (0079) 0000 (0000) V C (0027) 0000 (0000) 0063 (0039) (0387) (0385) (0398) (0039) 0617 (0039) 0614 (0048) V R (1647) 5330 (1651) 5603 (1709) MCMC V R01 {0319 (0143) {0327 (0142) {0394 (0160) V R (0016) 0030 (0016) 0048 (0023) V C (1528) (1409) (1545) V C01 {1221 (0136) {1178 (0079) 0000 (0000) V C (0028) 0000 (0000) 0067 (0033) Scale Accept Acc(V C00 ) Acc(V C01 ) Acc(V C11 ) RL(V C00 ) 51K 516K 227K Diag RL(V C01 ) 119K 1001K RL(V C11 ) 134K 233K The MCMC estimates are based on the average of 3 runs each of length 50,000 after a burn-in of 500 Each run takes approximately an hour on a Pentium 200MHz PC The Raftery Lewis estimates are larger than the 50,000 iterations performed on each run, however the estimates in the table are based on the average of 3 runs, which gives a potential 150,000 iterations per estimate The 177

195 Raftery Lewis estimates may also be inated as they are based on a thinned chain of length 10,000 (every 5th iteration of the main chain) due to storage restrictions The scaling of the proposal distribution variances was chosen based on some shorter runs, so that acceptance rates were close to 44% From Table 74 it can be seen that the MCMC method can handle rstly non-positive denite level 1 variance matrices, and secondly incomplete level 1 variance matrices The MCMC estimates exhibit behaviour similar to that seen in example 1 The level 2 MCMC variance estimates are larger than those from RIGLS but at level 1 there is less dierence in the estimates from the two methods In fact the level 1 variance term, V C00 has smaller estimates using the MCMC method Prior distributions In the above example, uniform priors were used for all the variance parameters As an alternative, informativeinverse Wishart priors could be used for the level 1 variance matrix, if the matrix is complete If the matrix is incomplete, an alternative prior to the uniform is not immediately obvious The problem of nding alternative priors in this case is outside the scope of this thesis A possible solution would be to use univariate normal prior distributions for each parameter as the likelihood would ensure that the posterior gave acceptable values for the positivity constraints 754 General algorithm for truncated normal proposal method The algorithm for a general Gaussian multi-level model using the updating method of truncated normal proposals at level1follows directly from the example in the last section As both this method and the earlier method calculate the ijs 2 the update steps are all the same except for the update step for the level 1 variance I will assume the model is as for the rst method, y ij =X F ij F +X R ij R j +X C ije ij e ij MVN(0 V C ) R j MVN(0 V R ): 178

196 Step 3 - The level 1 variance, V C The variance matrix V C is considered term by term and one of the following two steps is followed depending on whether the term lies on the diagonal Updating diagonal terms, V Cnn At time t, where 2 ij = (X C ij) T V (t) C X C ij > 0 = (X C ij(n)) 2 V (t) Cnn ; r C ij(nn) > 0 8 i j So r C ij(nn) =(X C ij(n)) 2 V (t) Cnn ; (X C ij) T V (t) C X C ij: V (t) Cnn >max Cnn where max Cnn = max(r C ij(nn)=(x C ij(n)) 2 ): I will use a normal proposal distribution with variance, v nn but only consider values generated that satisfy the constraint This will lead to a truncated normal proposal as shown in Figure 7-1 (i) The Hastings ratio can then be calculated by the ratio of the two truncated normal distributions shown in Figure 7-1 (i) and (ii) Letting the value for V Cnn at time t be A and the proposed value for time t +1beB, The update step is now as follows : hr = p(v(t+1) Cnn = B j V (t) Cnn = A) p(v (t+1) Cnn = A j V (t) Cnn = B) = 1 ; ((max Cnn ; B)= p v nn ) 1 ; ((max Cnn ; A)= p v nn ) : V (t+1) Cnn = V Cnn with probability min(1 hr p(v Cnnjy :::) ) = V (t) Cnn otherwise: p(v (t) Cnnjy :::) 179

197 Updating non diagonal terms, V Cmn At time t, where 2 ij = (X C ij) T V (t) C X C ij > 0 = 2 X C ij(m)x C ij(n)v (t) Cmn ; r C ij(mn) > 0 8 i j So r C ij(mn) =2 X C ij(m)x C ij(n)v (t) Cmn ; (X C ij) T V (t) C X C ij: where V (t) Cmn >max Cmn + max Cmn + = max(r C ij(mn)=(2 X C ij(m)x C ij(n)) X C ij(m)x C ij(n) > 0) and where V (t) Cmn <min Cmn ; min Cmn ; = min(r C ij(mn)=(2 X C ij(m)x C ij(n)) X C ij(m)x C ij(n) < 0): I will use a normal proposal distribution with variance, v mn but only consider values generated that satisfy the constraint This will lead to a truncated normal proposal as shown in Figure 7-1 (iii) The Hastings ratio can then be calculated by the ratio of the two truncated normal distributions shown in Figure 7-1 (iii) and (iv) Letting the value for V Cmn at time t be A and the proposed value for time t +1beB, hr = p(v(t+1) Cmn = B j V (t) Cmn = A) p(v (t+1) Cmn = A j V (t) Cmn = B) = ((min Cmn ; ; B)=p v nn ) ; ((max Cmn + ; B)= p v nn ) ((min Cmn ; ; A)= p v nn ) ; ((max Cmn + ; A)= p v nn ) : The update step is now as follows : 180

198 V (t+1) Cmn = V Cmn with probability min(1 hr p(v Cmnjy :::) ) = V (t) Cmn 76 Summary otherwise: p(v (t) Cmnjy :::) In this chapter two MCMC algorithms have been given for the solution of Gaussian multi-level models with complex variation at level 1 The rst method uses an inverse Wishart proposal distribution for the level 1 variance matrix, and can only be used when the variance matrix at level 1 is strictly positive denite The second method uses truncated normal proposals for the individual variance terms It is not constrained to only positive denite matrices and can even cope with incomplete variance matrices Both these methods were illustrated through some simple examples and gave sensible estimates for the problems given The one problem that both methods fail to solve is that they do not give estimates for individual level 1 residuals This along with other models that are not covered in this thesis will be discussed in the next chapter, along with potential solutions 181

199 Chapter 8 Conclusions and Further Work 81 Conclusions The general aim of this thesis is to combine the two areas of multi-level modelling and Markov chain Monte Carlo (MCMC) methods by tting multi-level models using MCMC methods This task was split into three parts Firstly the types of problems that are tted in multi-level modelling were identied and the existing maximum likelihood methods were investigated Secondly MCMC algorithms for these models were derived and nally these methods were compared to the maximum likelihood based methods both in terms of estimate bias and coverage properties Two simple 2 level Gaussian models were rstly considered and it was shown how to t these models using the Gibbs sampler method Then extensive simulation studies were carried out to compare dierent prior distributions for the variance parameters in these models using the Gibbs sampler method with the two maximum likelihood methods IGLS and RIGLS on these two models The results showed that in terms of bias the RIGLS method was less biased than the MCMC methods In terms of coverage properties the MCMC methods do better than the maximum likelihood methods in many situations although not always The conclusions are summarised in more detail in Section 45 The Gibbs sampler algorithms given for the two simple multi-level models were then generalised to t the family of N level Gaussian models Two alternativehybrid Metropolis-Gibbs methods were also given along with adaptive samplers and all these methods were compared with each other 182

200 It was found that the Gibbs sampler method was better than the hybrid methods for the Gaussian models in terms of number of iterations required to achieve a desired accuracy Of the hybrid methods there was little to choose between the univariate proposal distribution method and the multivariate proposal distribution method It was also found that the adaptive samplers were a better method of achieving desired accuracies in a minimum number of iterations than the methods using proposal distributions based on scaled variance estimates The univariate Normal proposal distribution Metropolis method was adapted to t binary response multi-level models This method was then compared with the quasi-likelihood methods via a simulation study on one binary response model from Rodriguez and Goldman (1995) where the quasi-likelihood methods perform particularly badly For this model it was shown that the Metropolis-Gibbs hybrid method performs much better both in terms of bias and coverage properties It was also shown that for this model, the choice of prior distribution for the variance parameters is less important Finally Gaussian models with complex variation at level 1 were considered and two MCMC methods with Hastings updates at level 1 were given The rst method was based on an inverse Wishart proposal distribution for the level 1 variance matrix but could only be used when the level 1 variance parameters formed a complete positive denite matrix The second method was based on truncated normal proposals for the individual variance components and could be used on any variance structure at level 1 Both these methods were tested on some simple examples As can be seen many dierent multi-level models have been considered in this thesis but these are just a selection from a far larger eld The further work section following these conclusions will outline some other models that I did not have time to consider Before mentioning this future work I will mention briey the implementation of MCMC methods in the MLwiN package (Goldstein et al 1998) 811 MCMC options in the MLwiN package A by-product of this thesis has been the programming of MCMC methods into the MLwiN package (Goldstein et al 1998) The rst release of MLwiN was in 183

201 February 1998 and over the rst 6 months the package has been used by hundreds of users from the social science community and elsewhere, some familiar with its forerunner, MLn (Rasbash and Woodhouse 1995) and some new users Although many users who are familiar with the IGLS and RIGLS methods may not use the MCMC methods, it is hoped that by implementing MCMC in MLwiN, we will be exposing more people to MCMC methods Early feedback is encouraging, and two workshops have been set up to teach these new users more about the MCMC methods in MLwiN and Bayesian statistics in general I have also received several questions about the MCMC options in MLwiN that show that a new MCMC user community hasemerged Apart from introducing MCMC methods to a new community of users, the MLwiN package has been invaluable to me, for performing the simulation studies found in this thesis On this note I must also acknowledge the BUGS package (Spiegelhalter et al 1994) which was used to compare results in the original programming of the MCMC options in MLwiN and to perform some of the simulations in Chapter 4 82 Further work In this section I intend to include a brief introduction to other models that have not been considered in this thesis but which can be tted in the MLwiN package using maximum likelihood methods Some of these models have notbeen considered due to time constraints although tting them using MCMC is not dicult Other models are more dicult or have simply not been considered yet but are included for completeness Goldstein (1995) contains more information on tting all the following models using maximum likelihood methods 821 Binomial responses Binomial response models are used to t datasets where the response variable is a proportion The binomial distribution has two parameters, p the probability of success, and n the number of trials, known as the denominator in Goldstein (1995) The binary response models considered in Chapter 6 are a special case of the binomial response models where the denominator is 1 for all observations 184

The logit link function is used to fit binomial response models, as shown for the binary response models in Chapter 6. In fact a multi-level binomial response model can be converted to a multi-level binary response model by including an extra level in the model for the individual trials. A 3 level binomial response logistic regression model can be defined as follows:

$$y_{ijk} \sim \mathrm{Binomial}(n_{ijk},\, p_{ijk})$$

where

$$\mathrm{logit}(p_{ijk}) = X_{1ijk}\,\beta_1 + X_{2ijk}\,\beta_{2jk} + X_{3ijk}\,\beta_{3k}, \qquad \beta_{2jk} \sim \mathrm{MVN}(0, V_2), \quad \beta_{3k} \sim \mathrm{MVN}(0, V_3).$$

The Metropolis Gibbs hybrid method can be used to fit binomial response models, and this can easily be done via minor alterations to the algorithm for the binary response model in Chapter 6. This has not yet been implemented in MLwiN, but the only main requirements are to store the denominator $n_i$ for each observation $i$ and then to modify two conditional distributions. The conditional distribution for the fixed effects should now be

$$p(\beta_1 \mid y, \ldots) \;\propto\; p(\beta_1) \prod_{i \in m_T} \left(1 + e^{-(X\beta)_i}\right)^{-y_i} \left(1 + e^{(X\beta)_i}\right)^{y_i - n_i}$$

and the conditional distribution for the level $l$ residuals should now be

$$p(\beta_{lj} \mid y, \ldots) \;\propto\; \left[\prod_{i \in m_{lj}} \left(1 + e^{-(X\beta)_i}\right)^{-y_i} \left(1 + e^{(X\beta)_i}\right)^{y_i - n_i}\right] |V_l|^{-\frac{1}{2}} \exp\!\left(-\tfrac{1}{2}\, \beta_{lj}^T V_l^{-1} \beta_{lj}\right).$$
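To make the required modification concrete, here is a small sketch in Python of a univariate random walk Metropolis step that evaluates the log of the binomial conditional for the fixed effects given above. This is my own illustrative code rather than anything taken from MLwiN or from the algorithms in Chapter 6; the function names, the log_prior argument and the vector of proposal standard deviations are all hypothetical, and for simplicity the linear predictor here involves only the coefficients being updated, whereas in the full hybrid algorithm the current residual values would also contribute to $(X\beta)_i$.

import numpy as np

def log_binomial_cond(beta, X, y, n, log_prior):
    # Log of the binomial conditional for the fixed effects shown above:
    # sum_i [ -y_i*log(1 + e^(-eta_i)) + (y_i - n_i)*log(1 + e^(eta_i)) ]
    # plus the log prior, where eta_i = (X beta)_i.  logaddexp(0, x) gives
    # log(1 + e^x) in a numerically stable way.
    eta = X @ beta
    loglik = -y * np.logaddexp(0.0, -eta) + (y - n) * np.logaddexp(0.0, eta)
    return loglik.sum() + log_prior(beta)

def rw_metropolis_sweep(beta, X, y, n, log_prior, prop_sd, rng):
    # One sweep of univariate Normal random walk Metropolis updates,
    # accepting each proposed component with probability min(1, alpha).
    for k in range(len(beta)):
        cand = beta.copy()
        cand[k] += rng.normal(0.0, prop_sd[k])
        log_alpha = (log_binomial_cond(cand, X, y, n, log_prior)
                     - log_binomial_cond(beta, X, y, n, log_prior))
        if np.log(rng.uniform()) < log_alpha:
            beta = cand
    return beta

Setting every element of n to 1 recovers the binary response case, which matches the remark above that the binary models are the special case with denominator 1 for all observations.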

Extra binomial variation

As with the binary response models, the binomial response models do not have a parameter for the level 1 variance. This is because if the mean and the denominator of a binomial distribution are known then the variance can be calculated, $\mathrm{var}(y_i) = n_i p_i (1 - p_i)$. The multi-level binomial response model could also be written as follows:

$$y_i = p_i + e_i z_i, \qquad z_i = \sqrt{p_i (1 - p_i)/n_i},$$

where

$$\mathrm{logit}(p_i) = (X\beta)_i, \qquad \beta_{lj} \sim \mathrm{MVN}(0, V_l).$$

If the model is assumed to have binomial variation then $\sigma_e^2$ is constrained to 1. The assumption of binomial variation need not hold, and the quasi-likelihood methods in MLwiN will allow the constraint to be dropped. When the assumption of binomial variation is dropped, the variation is known as extra binomial variation. Work is needed to fit models with extra binomial variation using the MCMC methods.

8.2.2 Multinomial models

The binomial model is used for proportional data where the response variable is a collection of observations that have two possible states (0 and 1). This is a special case of the multinomial model, which is used when the response variable is a collection of observations with $S$ possible states. The quasi-likelihood methods in MLwiN can fit multinomial models but as yet I have not considered how to fit these models using MCMC. This work would probably follow easily after fitting the multivariate Normal response models mentioned later in this chapter.

8.2.3 Poisson responses for count data

The other sort of discrete data that needs to be considered is count data. As count data are restricted to being non-negative integer values, the Poisson distribution is usually used for the response variable, along with the log link function. A multi-level Poisson model with a log link can therefore be written:

$$y_i \sim \mathrm{Poisson}(\pi_i)$$

where

$$\log(\pi_i) = (X\beta)_i, \qquad \beta_{lj} \sim \mathrm{MVN}(0, V_l).$$

The Poisson model is related to the binomial in that it has a variance that is fixed if the mean is known. The algorithm for the binary response model can also be modified to fit Poisson models by modifying the conditional distributions.

For the Poisson model the conditional distribution for the fixed effects should now be

$$p(\beta_1 \mid y, \ldots) \;\propto\; p(\beta_1) \prod_{i \in m_T} e^{-e^{(X\beta)_i}} \left(e^{(X\beta)_i}\right)^{y_i}$$

and the conditional distribution for the level $l$ residuals should now be

$$p(\beta_{lj} \mid y, \ldots) \;\propto\; \left[\prod_{i \in m_{lj}} e^{-e^{(X\beta)_i}} \left(e^{(X\beta)_i}\right)^{y_i}\right] |V_l|^{-\frac{1}{2}} \exp\!\left(-\tfrac{1}{2}\, \beta_{lj}^T V_l^{-1} \beta_{lj}\right).$$

Extra Poisson variation

An alternative form for the multi-level Poisson response model is as follows:

$$y_i = \pi_i + e_i z_i, \qquad z_i = \sqrt{\pi_i},$$

where

$$\log(\pi_i) = (X\beta)_i, \qquad \beta_{lj} \sim \mathrm{MVN}(0, V_l).$$

If the model is assumed to have Poisson variation then $\sigma_e^2$ is constrained to 1. The assumption of Poisson variation need not hold, and the quasi-likelihood methods in MLwiN will allow the constraint to be dropped. When the assumption of Poisson variation is dropped, the excess heterogeneity is known as extra Poisson variation. More work is needed to fit models with extra Poisson variation using the MCMC methods.
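The corresponding change for the Poisson conditionals given above is simply a different likelihood contribution; the following minimal sketch follows the same (hypothetical) conventions as the binomial sketch earlier and could be dropped into the same Metropolis sweep.

import numpy as np

def log_poisson_cond(beta, X, y, log_prior):
    # Log of the Poisson conditional for the fixed effects shown above:
    # sum_i [ -e^(eta_i) + y_i * eta_i ] plus the log prior, where
    # eta_i = (X beta)_i; the y_i! term does not involve beta and is dropped.
    eta = X @ beta
    return np.sum(-np.exp(eta) + y * eta) + log_prior(beta)

The level $l$ residual conditionals change in the same way, with the multivariate Normal term in $\beta_{lj}$ added on the log scale.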

8.2.4 Extensions to complex variation at level 1

In Chapter 7 I introduced two methods based on Hastings updating steps that will fit Gaussian models with complex variation. The Hastings update step for the level 1 variance parameters was only added to the general Gibbs sampling algorithm from Chapter 5, and uniform priors were used for these parameters. More work needs to be done on incorporating the Hastings step into the hybrid Metropolis Gibbs methods of Chapter 5 and on allowing users to specify informative prior distributions for these models. In particular, informative prior distributions for incomplete variance matrices at level 1 will need some more thought.

Simulating level 1 residuals

At present the Hastings algorithm works with the composite level 1 residual, $\sum_r C_{rij} e_{rij}$, to generate simulated values for the level 1 variance parameters. This has the disadvantage that the $r$ individual level 1 residuals for observation $ij$, $e_{rij}$, cannot then be estimated by the MCMC method. However, the maximum likelihood methods have a procedure for generating residuals which can be used with the MCMC estimates for the fixed effects and the variance parameters. This procedure can be used for now, and more work on finding MCMC methods that simulate the individual residuals can be performed.

Incomplete variance matrices at higher levels

The MLwiN package allows the variance structure at any level to be an incomplete matrix. This can be useful if the random parameters at a higher level are thought to be uncorrelated, as the covariance term can then be set to zero in the model. The quasi-likelihood methods have no problem fitting these models, but these models are outside the general Gaussian framework fitted by the MCMC algorithms in Chapter 5. The second Hastings update method given in Chapter 7 for the problem of complex variation at level 1 dealt with incomplete variance matrices at level 1, and this method could probably also be used at higher levels. The one disadvantage is that the higher level residuals will then not be estimated; instead the composite higher level residuals will be used, as for complex variation at level 1.
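To indicate how the truncated Normal proposals of Chapter 7 might be carried over to a single variance component at a higher level, the following sketch (again my own illustrative Python, not the MLwiN implementation; log_cond stands for the log of the full conditional of the component being updated and is hypothetical) draws a candidate from a Normal distribution centred at the current value and truncated to be positive, and applies the Hastings correction that arises because such a proposal is not symmetric.

import numpy as np
from scipy.stats import norm

def truncated_normal_update(sigma2, log_cond, prop_sd, rng):
    # Propose from N(sigma2, prop_sd^2) truncated to (0, infinity); simple
    # rejection is adequate here because sigma2 > 0 means at least half of
    # the untruncated proposals are positive.
    cand = rng.normal(sigma2, prop_sd)
    while cand <= 0.0:
        cand = rng.normal(sigma2, prop_sd)

    # Hastings correction log[q(sigma2 | cand) / q(cand | sigma2)]: the
    # Normal kernels cancel by symmetry, leaving only the ratio of the
    # truncation normalising constants Phi(sigma2/prop_sd) / Phi(cand/prop_sd).
    log_hastings = norm.logcdf(sigma2 / prop_sd) - norm.logcdf(cand / prop_sd)

    log_alpha = log_cond(cand) - log_cond(sigma2) + log_hastings
    if np.log(rng.uniform()) < log_alpha:
        return cand
    return sigma2

For a covariance element of an incomplete variance matrix the truncation to positive values would not apply, and a check that the proposed matrix remains positive definite would be needed before evaluating the conditional.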

8.2.5 Multivariate response models

All the models studied in this thesis have a single response variable, but many problems exist where there are many response variables. A simple approach would be to fit each response variable separately with its own multi-level model. There is often, however, correlation between the response variables, and this correlation needs to be modelled. To model many responses together, multivariate response models need to be fitted.

The MLwiN package uses a clever approach to fitting such models. An extra level is added to the model and the various responses for each observation are considered as different observations at this bottom level. This in effect reduces the multivariate response model to a special case of the univariate response model with no variability at level 1. Then through the use of indicator variables the multivariate response model can be fitted. For examples of multivariate response models and more details on how they are fitted using maximum likelihood methods in MLwiN, see Chapter 4 of Goldstein (1995). Work is required to fit these models using MCMC methods in MLwiN.
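To illustrate the restacking idea, here is a small sketch in Python using pandas; the dataset and all column names are invented purely for illustration and do not correspond to any example in the thesis. Two responses per pupil are stacked into a single column, so that the responses form a new bottom level nested within pupils, and indicator variables mark which response each new row refers to; interacting the indicators with the predictors then gives each response its own intercept and slopes.

import pandas as pd

# An invented dataset with two responses (e.g. two exam components) per pupil.
wide = pd.DataFrame({
    "pupil":  [1, 2, 3],
    "school": [1, 1, 2],
    "gender": [0, 1, 1],
    "resp1":  [52.0, 61.0, 47.0],
    "resp2":  [58.0, 66.0, 50.0],
})

# Stack the responses so each becomes a separate observation at a new
# bottom level, with pupils moving up to become level 2 units.
long = wide.melt(id_vars=["pupil", "school", "gender"],
                 value_vars=["resp1", "resp2"],
                 var_name="response", value_name="y")

# Indicator variables for the two responses; their interactions with the
# predictors reproduce separate coefficients for each response.
long["ind1"] = (long["response"] == "resp1").astype(int)
long["ind2"] = (long["response"] == "resp2").astype(int)
long["gender_1"] = long["gender"] * long["ind1"]
long["gender_2"] = long["gender"] * long["ind2"]

The between-response covariance is then modelled by giving the indicator variables random coefficients at the pupil level, with no residual variation at the new bottom level, as described above.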

Bibliography

Bernardo, J. M. and A. F. M. Smith (1994). Bayesian Theory. Chichester: Wiley.
Box, G. E. P. and M. E. Muller (1958). A Note on the Generation of Random Normal Deviates. Annals of Mathematical Statistics 29, 610–611.
Box, G. E. P. and G. C. Tiao (1992). Bayesian Inference in Statistical Analysis. New York: John Wiley.
Breslow, N. E. and D. G. Clayton (1993). Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association 88, 9–25.
Brooks, S. P. and G. O. Roberts (1997). Assessing Convergence of Markov Chain Monte Carlo Algorithms. Unpublished.
Bryk, A. S. and S. W. Raudenbush (1992). Hierarchical Linear Models. Newbury Park: Sage.
Bryk, A. S., S. W. Raudenbush, M. Seltzer, and R. Congdon (1988). An Introduction to HLM: Computer Program and User's Guide (2.0 ed.). Chicago: University of Chicago Dept of Education.
Chatfield, C. (1989). The Analysis of Time Series: An Introduction (4th ed.). London: Chapman and Hall.
Copas, J. B. and H. G. Li (1997). Inference for Non-Random Samples. Journal of the Royal Statistical Society, Series B 59, 55–96.
Cowles, M. K. and B. P. Carlin (1996). Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review. Journal of the American Statistical Association 91, 883–904.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion). Journal of the Royal Statistical Society, Series B 39, 1–38.
Draper, D. (1995). Inference and Hierarchical Modeling in the Social Sciences. Journal of Educational and Behavioral Statistics 20, 115–147.
Draper, D. and R. Cheal (1997). Practical MCMC for Assessment and Propagation of Model Uncertainty. Unpublished.
DuMouchel, W. and C. Waternaux (1992). Hierarchical Model for Combining Information and for Meta-analyses (Discussion). In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 4, pp. 338–341. Oxford: Clarendon Press.
Gelfand, A. E., S. E. Hills, A. Racine-Poon, and A. F. M. Smith (1990). Illustration of Bayesian Inference in Normal Data Models Using Gibbs Sampling. Journal of the American Statistical Association 85, 972–985.
Gelfand, A. E. and S. K. Sahu (1994). On Markov Chain Monte Carlo Acceleration. Journal of Computational and Graphical Statistics 3, 261–276.
Gelfand, A. E., S. K. Sahu, and B. P. Carlin (1995). Efficient Parameterizations for Normal Linear Mixed Models. Biometrika 82, 479–488.
Gelfand, A. E. and A. F. M. Smith (1990). Sampling Based Approaches to Calculating Marginal Densities. Journal of the American Statistical Association 85, 398–409.
Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (1995). Bayesian Data Analysis. London: Chapman and Hall.
Gelman, A., G. O. Roberts, and W. R. Gilks (1995). Efficient Metropolis Jumping Rules. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 5, pp. 599–607. Oxford: Oxford University Press.
Gelman, A. and D. B. Rubin (1992). Inference from Iterative Simulation Using Multiple Sequences. Statistical Science 7, 457–472.
Geman, S. and D. Geman (1984). Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.

Geweke, J. (1992). Evaluating the Accuracy of Sampling Based Approaches to the Calculation of Posterior Moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 4, pp. 169–193. Oxford: Oxford University Press.
Gilks, W. R. (1995). Full Conditional Distributions. In W. R. Gilks, S. Richardson, and D. J. Spiegelhalter (Eds.), Markov Chain Monte Carlo in Practice. London: Chapman and Hall.
Gilks, W. R., G. O. Roberts, and S. K. Sahu (1996). Adaptive Markov Chain Monte Carlo. Research report 20, Statistics Laboratory, University of Cambridge.
Gilks, W. R. and P. Wild (1992). Adaptive Rejection Sampling for Gibbs Sampling. Journal of the Royal Statistical Society, Series C 41, 337–348.
Goldstein, H. (1986). Multilevel mixed linear model analysis using iterative generalised least squares. Biometrika 73, 43–56.
Goldstein, H. (1989). Restricted unbiased iterative generalised least squares estimation. Biometrika 76, 622–623.
Goldstein, H. (1991). Nonlinear Multilevel Models With an Application to Discrete Response Data. Biometrika 78, 45–51.
Goldstein, H. (1995). Multilevel Statistical Models (2nd ed.). London: Edward Arnold.
Goldstein, H. and J. Rasbash (1996). Improved Approximations for Multilevel Models with Binary Responses. Journal of the Royal Statistical Society, Series A 159, 505–513.
Goldstein, H., J. Rasbash, I. Plewis, D. Draper, W. Browne, M. Yang, G. Woodhouse, and M. Healy (1998). A User's Guide to MLwiN (1.0 ed.). London: Institute of Education.
Goldstein, H. and D. J. Spiegelhalter (1996). League Tables and Their Limitations: Statistical Issues in Comparisons of Institutional Performance. Journal of the Royal Statistical Society, Series A 159, 385–409.
Hastings, W. K. (1970). Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 57(1), 97–109.

Heath, A., M. Yang, and H. Goldstein (1996). Multilevel Analysis of the Changing Relationship between Class and Party in Britain. Quality and Quantity 30, 389–404.
Hills, S. W. and A. F. M. Smith (1992). Parameterization Issues in Bayesian Inference. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 4, pp. 227–246. Oxford: Oxford University Press.
Kreft, I. G. G., J. de Leeuw, and R. van der Leeden (1994). Review of Five Multilevel Analysis Programs: BMDP-5V, GENMOD, HLM, ML2, and VARCL. American Statistician 48, 324–335.
Laird, N. M. (1978). Empirical Bayes Methods for Two-Way Contingency Tables. Biometrika 65, 581–590.
Longford, N. T. (1987). A Fast Scoring Algorithm for Maximum Likelihood Estimation in Unbalanced Mixed Models with Nested Random Effects. Biometrika 74, 817–827.
Longford, N. T. (1988). VARCL - software for variance components analysis of data with hierarchically nested random effects (maximum likelihood) (1.0 ed.). Princeton, NJ: Educational Testing Service.
MacEachern, S. N. and L. M. Berliner (1994). Subsampling the Gibbs Sampler. The American Statistician 48, 188–190.
McCullagh, P. and J. A. Nelder (1983). Generalized Linear Models. London: Chapman and Hall.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller (1953). Equations of State Calculations by Fast Computing Machines. Journal of Chemical Physics 21, 1087–1092.
Muller, P. (1993). A generic approach to posterior integration and Gibbs sampling. Technical report, ISDS, Duke University.
Raftery, A. E. and S. M. Lewis (1992). How Many Iterations in the Gibbs Sampler? In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 4, pp. 763–773. Oxford: Oxford University Press.

Rasbash, J. and G. Woodhouse (1995). MLn: Command Reference Guide (1.0 ed.). London: Institute of Education.
Ripley, B. D. (1987). Stochastic Simulation. New York: Wiley.
Rodriguez, G. and N. Goldman (1995). An Assessment of Estimation Procedures for Multilevel Models with Binary Responses. Journal of the Royal Statistical Society, Series A 158, 73–89.
Seltzer, M. H. (1993). Sensitivity Analysis for Fixed Effects in the Hierarchical Model: A Gibbs Sampling Approach. Journal of Educational Statistics 18, 207–235.
Seltzer, M. H., W. H. Wong, and A. S. Bryk (1996). Bayesian Analysis in Applications of Hierarchical Models: Issues and Methods. Journal of Educational and Behavioral Statistics 21, 131–167.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
Spiegelhalter, D. J., A. Thomas, N. G. Best, and W. R. Gilks (1994). BUGS: Bayesian inference using Gibbs sampling, Version 0.30a. Technical report, MRC Biostatistics Unit, Cambridge.
Spiegelhalter, D. J., A. Thomas, N. G. Best, and W. R. Gilks (1995). BUGS: Bayesian inference using Gibbs sampling, Version 0.50. Technical report, MRC Biostatistics Unit, Cambridge.
Stiratelli, R., N. M. Laird, and J. Ware (1984). Random Effects Models for Serial Observations with Binary Responses. Biometrics 40, 961–971.
Woodhouse, G., J. Rasbash, H. Goldstein, and M. Yang (1995). Introduction to Multilevel Modelling. In G. Woodhouse (Ed.), A Guide to MLn for New Users. Institute of Education.
Zeger, S. L. and M. R. Karim (1991). Generalized Linear Models with Random Effects: A Gibbs Sampling Approach. Journal of the American Statistical Association 86, 79–86.
