Generalized Linear Mixed Models via Monte Carlo Likelihood Approximation

Short Title: Monte Carlo Likelihood Approximation
http://users.stat.umn.edu/~christina/googleproposal.pdf

Christina Knudson

Bio

I'm a doctoral candidate at the University of Minnesota's School of Statistics. I am ABD and about a year away from graduating. I graduated cum laude from Carleton College with a BA in Mathematics. I was born and raised in Decorah, Iowa, which is one of the top 20 small towns in America (according to Smithsonian Magazine).

I started coding in the spring of 2007, first in Python and then Java. I started using R during the summer of 2007 at the Summer Institute for Training in Biostatistics, and I continued programming in R during my summer internship at the National Institutes of Health in 2008. Most of my graduate coursework has been in R, and I have taught undergraduate classes in R at the University of Minnesota as well.

Part of my work for my doctoral thesis is an R package that fits Generalized Linear Mixed Models (GLMMs) using Monte Carlo Likelihood Approximation (MCLA). I have written part of my package already, and I plan to expand and generalize it this summer.

Contact Information

Student name: Christina Knudson
Link id: cknudson05
Student postal address: 1901 Minnehaha Ave, Apt 317, Minneapolis MN, 55404
Telephone: 1-507-384-2220
Emails: knud0158@umn.edu, christina@umn.edu, cknudson05@gmail.com
Website: http://users.stat.umn.edu/~christina/

Student Affiliation

Institution: University of Minnesota
Program: Statistics
Stage of completion: Early 2015
Contact to verify: charlie@stat.umn.edu or galin@stat.umn.edu
Advisors: Charles Geyer and Galin Jones

Schedule Conflicts

During August 3 through 7, I will attend the Joint Statistical Meetings to present my R package.

Mentors

Mentor names: Charles Geyer and Galin Jones
Mentor emails: charlie@stat.umn.edu and galin@stat.umn.edu
Mentor link ids: cjgeyer

I have been in touch with my mentors about my package. I meet with each of them at least weekly, and sometimes I talk to Charlie several times per week.

Background

GLMMs are popular in many fields, from ecology to economics; their popularity is apparent from a Google search, which yields 242,000 results. The challenge for researchers is finding an easy-to-implement and reliable method for fitting and testing GLMMs. For very simple problems with just a few random effects, the likelihood can be approximated by numerical integration, but many models have crossed random effects, which numerical integration cannot handle. Thus, a commonly used method is penalized quasi-likelihood (PQL), which is implemented in packages such as lme4, nlme, and MASS. However, PQL approximates the likelihood to an unknown accuracy and suffers from problematic inferential properties, such as parameter estimates that tend to be too low (McCulloch and Searle, 2001). Since the quasi-likelihood approximates the likelihood to an unknown accuracy, any inference based on it also has an unknown level of accuracy: without bootstrapping, a PQL user cannot know how valid their confidence intervals or likelihood ratio test results are. The popularity of PQL despite these inadequacies shows that there is high demand for tools to fit GLMMs.
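To illustrate the status quo described above, a PQL fit can be obtained with MASS::glmmPQL. The simulated data below are purely illustrative, not one of the data sets used in this project.

```r
# Hedged illustration: fitting a Bernoulli GLMM by PQL with MASS::glmmPQL.
# The data are simulated purely for illustration.
library(MASS)   # provides glmmPQL (which fits via nlme's lme)

set.seed(1)
g <- gl(10, 20)                      # 10 groups of 20 observations each
u <- rnorm(10, mean = 0, sd = 1)     # group-level random intercepts
x <- rnorm(200)
p <- plogis(-0.5 + 0.8 * x + u[g])
d <- data.frame(y = rbinom(200, 1, p), x = x, g = g)

fit <- glmmPQL(y ~ x, random = ~ 1 | g, family = binomial, data = d,
               verbose = FALSE)
fit$coefficients$fixed               # PQL estimates of the fixed effects
```

As the Background explains, these point estimates carry no likelihood-based accuracy guarantee; that gap is what MCLA addresses.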
Monte Carlo Likelihood Approximation (MCLA) is another tool for fitting GLMMs. This method approximates the likelihood through either Markov chain Monte Carlo (MCMC) or ordinary Monte Carlo (OMC), and the resulting likelihood approximation is used to fit and test GLMMs (Geyer and Thompson, 1992). Because MCLA approximates the entire likelihood, any type of likelihood-based inference can be performed. Inference such as maximum likelihood estimation or likelihood ratio testing is standard for many simpler models, but MCLA is the only method that can perform these techniques for GLMMs. Moreover, MCLA is supported by a rigorous theoretical foundation supplied by Geyer (1994) and Sung and Geyer (2007).

Despite MCLA's solid theoretical underpinnings, it is not yet a widely used technique. MCLA via MCMC is too difficult for most users because they do not know when the Markov chain has run long enough to produce reliable answers. Sung and Geyer's (2007) version of MCLA via OMC is more user-friendly but is limited to smaller problems.

My current work performs MCLA via OMC with an improved importance sampling distribution. Rather than selecting an importance sampling distribution independently of the data, my package uses an importance sampling distribution, specified based on the data, that is similar to the true distribution of the random effects. With this importance sampling distribution, my package performs MCLA for GLMMs with a Poisson or Bernoulli response using the canonical link. The package assumes the random effects are independently drawn from a normal distribution with mean 0 and unknown variances; there can be any number of fixed or random effects. The package is in the testing stage and is nearing completion for the setting just described. This package is part of my doctoral thesis in statistics, which I am earning at the University of Minnesota with Professors Charles Geyer and Galin Jones as my co-advisors.
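To make the idea concrete, here is a minimal base-R sketch of MCLA via OMC for a one-random-intercept Bernoulli GLMM. The fixed N(0, tau^2) proposal and the specific model are assumptions for this sketch; they are not the package's actual importance sampling distribution, which is chosen from the data.

```r
# Minimal sketch of Monte Carlo likelihood approximation via ordinary
# Monte Carlo (OMC) importance sampling, for the illustrative model
#   y_ij | u_i ~ Bernoulli(plogis(beta + u_i)),  u_i ~ N(0, sigma^2).
# The N(0, tau^2) proposal is an assumption for this sketch only.

mc_loglik <- function(beta, sigma, y, group, m = 2e4, tau = 1.5) {
  u <- rnorm(m, 0, tau)                         # importance samples
  logw <- dnorm(u, 0, sigma, log = TRUE) -      # log importance weights:
          dnorm(u, 0, tau,   log = TRUE)        # target density / proposal
  ll <- 0
  for (i in unique(group)) {
    yi <- y[group == i]
    # log f(y_i | u_k) for every sampled u_k
    logf <- vapply(u, function(uk)
      sum(dbinom(yi, 1, plogis(beta + uk), log = TRUE)), numeric(1))
    a <- logf + logw                            # log of weighted integrand
    M <- max(a)                                 # log-mean-exp, computed stably
    ll <- ll + M + log(mean(exp(a - M)))
  }
  ll
}

# Simulated data for the sketch
set.seed(42)
group <- rep(1:10, each = 5)
u <- rnorm(10, 0, 1)
y <- rbinom(50, 1, plogis(0.3 + u[group]))

mc_loglik(beta = 0.3, sigma = 1, y = y, group = group)
```

Handing a function like mc_loglik to optim() would then give Monte Carlo maximum likelihood estimates, and the same approximated likelihood supports likelihood ratio tests.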
Goals and objectives for Google Summer of Code

My goals are (1) to rewrite sections of my package in C to improve its speed, (2) to write functions that perform likelihood ratio tests for comparing nested models, and (3) to write additional functions that fit models with correlated random effects.
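Goal (3) will replace the current diagonal covariance of the random effects with a structured one that decays with distance. A minimal sketch, assuming the common exponential-decay form Sigma[i,j] = sigma2 * exp(-d[i,j] / rho) (the exact parameterization the package will use is not yet fixed):

```r
# Sketch of an exponential-decay covariance for correlated random effects:
#   Sigma[i, j] = sigma2 * exp(-d[i, j] / rho)
# This parameterization is an assumption for illustration; the final form
# for the package is still to be worked out.

exp_cov <- function(coords, sigma2, rho) {
  d <- as.matrix(dist(coords))      # pairwise distances between locations
  sigma2 * exp(-d / rho)
}

set.seed(7)
coords <- matrix(runif(20), nrow = 10, ncol = 2)   # 10 random 2-D locations
Sigma <- exp_cov(coords, sigma2 = 1, rho = 0.5)

# Draw one vector of correlated random effects: chol(Sigma) returns upper
# triangular R with Sigma = t(R) %*% R, so t(R) %*% z has covariance Sigma.
u <- drop(t(chol(Sigma)) %*% rnorm(10))
```

The matrix is symmetric and positive definite by construction, so the Cholesky factorization used to draw correlated effects always exists.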
Details

I consider my goals separately, since completion of one goal does not rely on completion of the others.

(1) Two steps stand out in my package as time-consuming: the step that chooses the parameters of the importance sampling distribution and the step that maximizes the likelihood approximation. I will therefore rewrite these two functions in C. The main obstacle will be coding in C, with which I do not have extensive experience. Because I have written functioning R code that performs these steps, I will be able to compare the R results to the C results to verify that my functions are correct. I have been working with a couple of data sets, including the benchmark Booth and Hobert (1999) data set with known maximum likelihood estimates, so I will be able to test my code on these data sets. Because I can rewrite one function in C without affecting the other, I can write the functions in either order. I will rewrite the step that maximizes the likelihood approximation first, since that function is more computationally intensive and more important; the function that chooses the parameters of the importance sampling distribution can be rewritten second because it is already less time-consuming. The equations for these functions are detailed in my design document, which is on my website at http://users.stat.umn.edu/~knud0158/designdoc.pdf.

(2) Hypothesis testing for nested models can be split into three cases: the nested models differ in their fixed effects but have the same variance components; the nested models differ by one variance component and possibly some fixed effects; or the nested models differ by two or more variance components and possibly by some fixed effects. I have worked out the details for calculating the test statistics and p-values for the first two cases in my design document at http://users.stat.umn.edu/~knud0158/designdoc.pdf.
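The second case amounts to testing whether the extra variance component is zero, which puts the null hypothesis on the boundary of the parameter space, so the usual chi-square asymptotics do not apply. One common remedy, the 50:50 chi-square mixture of Self and Liang (1987), is sketched below; this illustrates the boundary issue and is not necessarily the statistic worked out in the design document.

```r
# Hedged sketch: likelihood ratio test of H0: sigma^2 = 0 for a single
# variance component (fixed effects unchanged). Because sigma^2 = 0 lies
# on the boundary of the parameter space, a common remedy is a 50:50
# mixture of chi^2_0 and chi^2_1 (Self and Liang, 1987); the design
# document may derive a different statistic.

lrt_boundary_pvalue <- function(loglik_full, loglik_null) {
  stat <- 2 * (loglik_full - loglik_null)
  if (stat <= 0) return(1)                 # chi^2_0 component: point mass at 0
  0.5 * pchisq(stat, df = 1, lower.tail = FALSE)
}

lrt_boundary_pvalue(-120.3, -121.3)   # illustrative log-likelihood values
```

The log-likelihoods here are placeholders; in the package they would be Monte Carlo likelihood approximations evaluated at the two fitted models.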
Coding the last case will take longer because I will need to determine the test statistic and its sampling distribution. Part of the challenge will be writing a function so that, given two models, the code will know which test statistics and p-values to calculate and report. My advisor Charlie has written a function in his
aster package that also does model comparison, so I will look to that for guidance. I will be able to test my code on the Coull and Agresti (2000) flu data set by modeling the log odds of catching the flu over four years. The model will have a few variance components that I will be able to test: one for a subject-specific random effect, another for a year-to-year random effect, and another for the decreased chance of catching the flu when a strain of flu virus reappears in a later year.

(3) The covariance matrix for the random effects in my currently working code is diagonal, meaning the random effects are independently drawn based on one of possibly many variance components. I would like to generalize the covariance matrix in order to fit models with location dependence. This generality would make my R package more usable and practical. To fit these types of models, I would like to code an additional variance structure with exponential decay based on the distance between observations. I have not yet added the details of these changes to my design document, but I have written my current package with these future changes in mind. I will test my code on the Caffo et al. (2005) automobile theft data set by modeling the number of cars stolen in a Baltimore neighborhood based on the distance to sites of other car thefts, and I will compare my results to those achieved through Monte Carlo EM.

Proposed Timeline

May 19 to May 26: look at Charlie's aster package code for model comparison. Design and write code for my own package to determine how two models differ.
May 26 to June 2: write the hypothesis testing code for the first two cases detailed earlier.
June 2 to June 15: test and correct the hypothesis testing code.
June 16: complete documentation for the hypothesis testing function and submit the updated R package to CRAN.
June 16 to June 30: write the C function that maximizes the likelihood approximation.
June 30 to July 7: test the newly written C function and compare it to my R results.
July 7: submit the updated R package to CRAN.
July 7 to July 14: write the C function that selects the importance sampling distribution.
July 14 to July 21: test the newly written C function and compare it to my R results.
July 21: submit the updated R package to CRAN.
July 21 to July 28: write the function for the new variance structure and incorporate it into the package.
July 28 to August 2: test the new variance structure and compare results to those reported by Caffo et al. (2005).
August 8 to August 11: document the new variance structure.
August 11: submit the final version of the fully updated R package to CRAN.

I expect to complete all work by August 11. If something starts to take longer than predicted, I may need to postpone the new variance structure to the fall, since the first two goals are more important.

References

Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society, Series B, 61:265–285.

Caffo, B., Jank, W., and Jones, G. (2005). Ascent-based Monte Carlo EM. Journal of the Royal Statistical Society, Series B, 67:261–274.

Coull, B. and Agresti, A. (2000). Random effects modeling of multiple binomial responses using the multivariate binomial logit-normal distribution. Biometrics, 56:73–80.

Geyer, C. J. (1994). On the convergence of Monte Carlo maximum likelihood calculations. Journal of the Royal Statistical Society, Series B, 56:261–274.
Geyer, C. J. and Thompson, E. (1992). Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society, Series B, 54:657–699.

McCulloch, C. and Searle, S. (2001). Generalized, Linear, and Mixed Models. John Wiley and Sons, New York.

Sung, Y. J. and Geyer, C. J. (2007). Monte Carlo likelihood inference for missing data models. Annals of Statistics, 35:990–1011.