Generalized Linear Mixed Models via Monte Carlo Likelihood Approximation
Short Title: Monte Carlo Likelihood Approximation




http://users.stat.umn.edu/~christina/googleproposal.pdf

Christina Knudson

Bio

I'm a doctoral candidate at the University of Minnesota's School of Statistics. I am ABD and about a year away from graduating. I graduated cum laude from Carleton College with a BA in Mathematics. I was born and raised in Decorah, Iowa, which is one of the top 20 small towns in America (according to Smithsonian Magazine). I started coding in the spring of 2007, first in Python and then Java. I started using R during the summer of 2007 at the Summer Institute for Training in Biostatistics, and I continued programming in R during my summer internship at the National Institutes of Health in 2008. Most of my graduate coursework has been in R, and I have taught undergraduate classes in R at the University of Minnesota as well. Part of my work for my doctoral thesis is an R package that fits Generalized Linear Mixed Models (GLMMs) using Monte Carlo Likelihood Approximation (MCLA). I have written part of my package already, and I plan to expand and generalize it this summer.

Contact Information

Student name: Christina Knudson
Link id: cknudson05
Student postal address: 1901 Minnehaha Ave, Apt 317, Minneapolis MN, 55404
Telephone: 1-507-384-2220
Emails: knud0158@umn.edu, christina@umn.edu, cknudson05@gmail.com
Website: http://users.stat.umn.edu/~christina/

Student Affiliation

Institution: University of Minnesota
Program: Statistics

Stage of completion: Early 2015
Contact to verify: charlie@stat.umn.edu or galin@stat.umn.edu
Advisors: Charles Geyer and Galin Jones

Schedule Conflicts

During August 3 through 7, I will attend the Joint Statistical Meetings to present my R package.

Mentors

Mentor names: Charles Geyer and Galin Jones
Mentor emails: charlie@stat.umn.edu and galin@stat.umn.edu
Mentor link ids: cjgeyer

I have been in touch with my mentors about my package. I meet with each of them at least weekly, and sometimes I talk to Charlie several times per week.

Background

GLMMs are popular in many fields, from ecology to economics; their popularity is apparent from a Google search, which yields 242,000 results. The challenge for researchers is finding an easy-to-implement and reliable method for fitting and testing GLMMs. For very simple problems with just a few random effects, the likelihood can be approximated by numerical integration. Most models, however, have crossed random effects, which numerical integration cannot handle. A commonly used alternative is penalized quasi-likelihood (PQL), which is implemented in packages such as lme4, nlme, and MASS. However, PQL approximates the likelihood to an unknown accuracy and suffers from problematic inferential properties, such as parameter estimates that tend to be too low (McCulloch and Searle, 2001). Because the accuracy of the quasi-likelihood approximation is unknown, any inference based on it also has an unknown level of accuracy: without bootstrapping, a PQL user cannot know how valid their confidence intervals or likelihood ratio tests are. The popularity of PQL despite these inadequacies shows that there is high demand for tools to fit GLMMs.

Monte Carlo Likelihood Approximation (MCLA) is another tool for fitting GLMMs. This method approximates the likelihood through either Markov chain Monte Carlo (MCMC) or ordinary Monte Carlo (OMC), and the resulting likelihood approximation is used to fit and test GLMMs (Geyer and Thompson, 1992); a generic sketch of the approximation appears at the end of this section. Because MCLA approximates the entire likelihood, any type of likelihood-based inference can be performed. Inference such as maximum likelihood estimation or likelihood ratio testing is standard for many simpler models, but MCLA is the only method that can perform these techniques for GLMMs. Moreover, MCLA is supported by a rigorous theoretical foundation supplied by Geyer (1994) and Sung and Geyer (2007). Despite MCLA's solid theoretical underpinnings, it is not yet a widely used technique. MCLA via MCMC is too difficult for most users because they do not know when the Markov chain has run long enough to produce reliable answers. Sung and Geyer's (2007) version of MCLA via OMC is more user-friendly but is limited to smaller problems.

My current work performs MCLA via OMC with an improved importance sampling distribution. Rather than selecting an importance sampling distribution independently of the data, my package specifies, based on the data, an importance sampling distribution that is similar to the true distribution of the random effects. With this importance sampling distribution, my package performs MCLA for GLMMs with a Poisson or Bernoulli response using the canonical link. The package assumes the random effects are independently drawn from a normal distribution with mean 0 and unknown variances; there can be any number of fixed or random effects. The package is in the testing stage and is nearing completion for the setting just described. This package is part of my doctoral thesis in statistics, which I am earning at the University of Minnesota with Professors Charles Geyer and Galin Jones as my co-advisors.

Goals and objectives for Google Summer of Code

My goals are to:
(1) rewrite sections of my package in C to improve its speed;
(2) write functions that perform likelihood ratio tests for comparing nested models;
(3) write additional functions that fit models with correlated random effects.
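As a point of reference for the background above, the ordinary Monte Carlo (importance sampling) form of the likelihood approximation can be written in generic notation (this is the standard form from the literature, not a quotation from my design document) as

\[
  \hat{L}_m(\theta) \;=\; \frac{1}{m} \sum_{k=1}^{m}
  \frac{f_\theta(y \mid u_k)\, f_\theta(u_k)}{g(u_k)},
  \qquad u_1, \ldots, u_m \ \text{i.i.d.} \sim g,
\]

where f_\theta(y | u) is the conditional density of the response given the random effects u, f_\theta(u) is the normal density of the random effects, and g is the importance sampling distribution; the approximated log likelihood \log \hat{L}_m(\theta) is then maximized over \theta.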

Details

I consider my goals separately, since completion of one goal does not rely on completion of the others.

(1) Two steps stand out in my package as time-consuming: the step that chooses the parameters of the importance sampling distribution and the step that maximizes the likelihood approximation. Thus, I will rewrite these two functions in C. The main obstacle here will be coding in C, with which I do not have extensive experience. Because I have written functioning R code that performs these steps, I will be able to compare the R results to the C results to verify that my functions are correct. I have been working with a couple of data sets, including the benchmark Booth and Hobert (1999) data set with known maximum likelihood estimates, so I will be able to test my code on them. Because I can rewrite one function in C without affecting the other, I should be able to write these functions in either order. I think it would be better to rewrite the step that maximizes the likelihood approximation first, since that function is more computationally intensive and also more important. The function that chooses the parameter values of the importance sampling distribution can be rewritten in C second because it is already the less time-consuming of the two. The equations for these functions are detailed in my design document, which is on my website at http://users.stat.umn.edu/~knud0158/designdoc.pdf.

(2) Hypothesis testing for nested models can be split into three cases: the nested models differ in their fixed effects but have the same variance components; the nested models differ by one variance component and possibly some fixed effects; or the nested models differ by two or more variance components and possibly some fixed effects. I have worked out the details for calculating the test statistics and p-values for the first two cases in my design document at http://users.stat.umn.edu/~knud0158/designdoc.pdf. Coding the last case will take longer because I will need to determine the test statistic and its sampling distribution. Part of the challenge will be writing a function so that, given two models, the code knows which test statistics and p-values to calculate and report; a rough sketch of the first case appears below.
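For the first case (nested models that differ only in their fixed effects), the likelihood ratio test is the usual chi-square comparison of maximized log likelihoods. A minimal sketch in R, with made-up log-likelihood values standing in for the package's output (the object names here are illustrative, not the package's interface):

    logl_full    <- -180.2   # maximized log likelihood of the larger model (made-up value)
    logl_reduced <- -184.9   # maximized log likelihood of the nested model (made-up value)
    df_diff      <- 2        # number of fixed effects dropped from the larger model

    lrt_stat <- 2 * (logl_full - logl_reduced)                     # likelihood ratio statistic
    p_value  <- pchisq(lrt_stat, df = df_diff, lower.tail = FALSE)
    c(statistic = lrt_stat, p.value = p_value)

The second and third cases are not this simple: when a variance component is zero under the null hypothesis, the parameter sits on the boundary of the parameter space and the usual chi-square reference distribution generally no longer applies, which is why those cases are treated separately.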

My advisor Charlie has written a function in his aster package that also does model comparison, so I will look to it for guidance. I will be able to test my code on the Coull and Agresti (2000) flu data set by modeling the log odds of catching the flu over four years. The model will have a few variance components that I can test: one for a subject-specific random effect, another for a year-to-year random effect, and another for the decreased chance of catching the flu when a strain of flu virus reappears in a later year.

(3) The covariance matrix for the random effects in my currently working code is diagonal, meaning the random effects are drawn independently, each governed by one of possibly many variance components. I would like to generalize the covariance matrix in order to fit models with location dependence; this generality would make my R package more usable and practical. To fit these types of models, I would like to code an additional variance structure with exponential decay based on the distance between observations, sketched below. I have not yet written the details of these changes into my design document, but I have written my current package with these future changes in mind. I will test my code on the Caffo et al. (2005) automobile theft data set by modeling the number of cars stolen in a Baltimore neighborhood based on the distance to sites of other car thefts, and I will compare my results to those obtained through Monte Carlo EM.
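To fix ideas, a minimal R sketch of an exponential-decay covariance structure of this kind (the coordinates, variance sigma2, and decay parameter phi are made up for illustration; the package's actual parameterization is still to be decided):

    set.seed(1)
    coords <- cbind(runif(10), runif(10))   # locations of 10 observations (illustrative)
    D      <- as.matrix(dist(coords))       # pairwise distances between observations
    sigma2 <- 1.5                           # variance of the random effects (illustrative)
    phi    <- 0.3                           # decay parameter (illustrative)
    Sigma  <- sigma2 * exp(-D / phi)        # covariance decays exponentially with distance
    # correlated random effects could then be drawn as u ~ N(0, Sigma), e.g.
    u <- drop(crossprod(chol(Sigma), rnorm(nrow(Sigma))))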

Proposed Timeline

May 19 to May 26: look at Charlie's aster package code for model comparison. Design and write code for my own package to determine how two models differ.
May 26 to June 2: write the hypothesis testing code for the first two cases detailed earlier.
June 2 to June 15: test and correct the hypothesis testing code.
June 16: complete documentation for the hypothesis testing function and submit the updated R package to CRAN.
June 16 to June 30: write the C function that maximizes the likelihood approximation.
June 30 to July 7: test the newly written C function and compare its results to my R results.
July 7: submit the updated R package to CRAN.
July 7 to July 14: write the C function that selects the importance sampling distribution.
July 14 to July 21: test the newly written C function and compare its results to my R results.
July 21: submit the updated R package to CRAN.
July 21 to July 28: write the function for the new variance structure and incorporate it into the package.
July 28 to August 2: test the new variance structure and compare results to those reported by Caffo et al. (2005).
August 8 to August 11: document the new variance structure.
August 11: submit the final version of the fully updated R package to CRAN.

I expect to complete all work by August 11. If something takes longer than predicted, I may need to postpone the new variance structure to the fall, since the first two goals are more important.

References

Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society, Series B, 61:265-285.

Caffo, B., Jank, W., and Jones, G. (2005). Ascent-based Monte Carlo EM. Journal of the Royal Statistical Society, Series B, 67:261-274.

Coull, B. and Agresti, A. (2000). Random effects modeling of multiple binomial responses using the multivariate binomial logit-normal distribution. Biometrics, 56:73-80.

Geyer, C. J. (1994). On the convergence of Monte Carlo maximum likelihood calculations. Journal of the Royal Statistical Society, Series B, 56:261-274.

Geyer, C. J. and Thompson, E. A. (1992). Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society, Series B, 54:657-699.

McCulloch, C. E. and Searle, S. R. (2001). Generalized, Linear, and Mixed Models. John Wiley and Sons, New York.

Sung, Y. J. and Geyer, C. J. (2007). Monte Carlo likelihood inference for missing data models. Annals of Statistics, 35:990-1011.