Automated Biosurveillance Data from England and Wales, 1991–2011




Article DOI: http://dx.doi.org/10.3201/eid1901.120493

Technical Appendix

This online appendix provides technical details of the statistical methods, a further technical description of the results, and the online Technical Appendix Figures 1 to 5 referred to in the main text. References are numbered as in the main text.

Statistical Methods

The models used to analyse the data were quasi-Poisson models of the form

$$E(y_t) = \mu_t, \qquad \mathrm{var}(y_t) = \phi\mu_t \quad \text{with } \phi \ge 1, \tag{A}$$

where $y_t$ is the count of a particular organism in week $t$ and $\phi$ is a dispersion parameter representing extra-Poisson variability when $\phi > 1$. Generalised linear models (GLM) of the form

$$\log(\mu_t) = \alpha + \beta t + \mathrm{seas}(t) \tag{B}$$

were used, where $\mathrm{seas}(t)$ is a 12-level factor representing seasons, the factor levels roughly corresponding to calendar months (13). In analyses where more detailed modelling of trends was required we used generalised additive models (GAM) of the form

$$\log(\mu_t) = \alpha + s(t) + \mathrm{seas}(t) \tag{C}$$

where $s(t)$ represents a smooth function of time (14). Time $t$ was centred in all analyses.

Estimation of the regression parameters was by maximum quasi-likelihood, and inclusion of model terms was tested using the likelihood ratio test. The dispersion parameter $\phi$ was estimated by dividing the Pearson chi-square by the degrees of freedom (13, p. 200). To avoid overfitting the model with sparse data, if the estimated value of $\phi$ was less than 1 we forced it to equal 1 ($\phi$ was only ever less than 1 with sparse data).

A natural statistical model for surveillance data is the negative binomial model, describing a random variable which is conditionally Poisson with mean $\nu$, with $\nu$ itself a random variable with a gamma density of mean $\mu$. This is a negative binomial distribution with mean $\mu$ and variance of the form $\phi\mu$, with $\phi > 1$ (13, p. 199). The skewness of the distribution is $(2\phi - 1)/(\phi\mu)^{1/2}$.

To study the mean-variance and mean-skewness relationships empirically we first deseasonalised the data using the transformation

$$z_t = y_t / \exp(\mathrm{seas}(t)), \tag{D}$$

where $y_t$ is the observed count and $\mathrm{seas}(t)$ is the fitted seasonal factor. Mean-variance relationships were analysed by plotting the log of the variance of $z_t$ against the log of the mean of $z_t$ in adjacent 6-month periods. The line $\log(\mathrm{var}) = \log(\mathrm{mean})$ corresponds to the Poisson distribution, and the line $\log(\mathrm{var}) = \log(\phi) + \log(\mathrm{mean})$ corresponds to the negative binomial model. Consistency of the data with the latter was investigated by testing the null hypothesis that the slope in the normal errors regression of $\log(\mathrm{var})$ against $\log(\mathrm{mean})$ is 1.

Similarly, we plotted the skewness of $z_t$ against the log of the mean of $z_t$. The curve $\mathrm{skewness} = \exp(-0.5\log(\mathrm{mean}))$ corresponds to the Poisson distribution, and the curve $\mathrm{skewness} = \exp\left(\log\left(\phi^{-1/2}(2\phi - 1)\right) - 0.5\log(\mathrm{mean})\right)$ corresponds to the negative binomial model. Consistency of the data with the latter was investigated by testing the null hypothesis that the coefficient of $\log(\mathrm{mean})$ in the normal errors regression with log link is $-0.5$. Formal goodness of fit tests were not employed, owing to the sparsity of the data for many organisms.

Results

Means, Seasonality and Trends

Linear trends were investigated using the model in equation B, with slope parameter $\beta$. We investigated seasonality as follows. When the estimated value of $\alpha$ in the GLM of equation B was non-negative, as was the case for 283 organisms, we fitted the GAM of equation C. We did not seek to fit the GAM to sparse data (defined as those organisms with $\alpha < 0$) owing to convergence problems. Seasonality was assessed from the final model fitted for each time series.

Dispersion

We studied overdispersion relative to the Poisson distribution for all 2,254 organisms.
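The dispersion estimate described under Statistical Methods (Pearson chi-square divided by the residual degrees of freedom, forced to 1 when it falls below 1) can be illustrated with a minimal sketch. The function name `pearson_dispersion`, the example counts, and the use of an intercept-only fit are assumptions made for illustration; they are not taken from the appendix, which fits the full models of equations B and C.

```python
def pearson_dispersion(counts, fitted_means, n_params):
    """Pearson chi-square divided by residual degrees of freedom, floored at 1.

    counts       : observed weekly counts y_t
    fitted_means : fitted means mu_t from any log-linear model (all > 0)
    n_params     : number of estimated regression parameters
    """
    if len(counts) != len(fitted_means):
        raise ValueError("counts and fitted_means must have the same length")
    df = len(counts) - n_params
    if df <= 0:
        raise ValueError("need more observations than parameters")
    # Pearson chi-square for a Poisson-type variance function V(mu) = mu
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(counts, fitted_means))
    # Estimates below 1 are forced to 1, as described for sparse data
    return max(1.0, chi2 / df)


# Hypothetical example: intercept-only fit, so every fitted mean is the sample mean
weekly_counts = [0, 3, 1, 7, 2, 9, 0, 4]
mean_count = sum(weekly_counts) / len(weekly_counts)
phi_hat = pearson_dispersion(weekly_counts, [mean_count] * len(weekly_counts), n_params=1)
```

Passing the fitted means in as an argument keeps the sketch independent of how the model was fitted, so the same function would apply to fitted values from a GLM or a GAM.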
We first obtained the log of the mean weekly count, $\alpha$. Then, for organisms with specimen dates spanning 52 or fewer weeks, we fitted the GLM of equation B and obtained the dispersion parameter, $\phi$. For organisms spanning more than 52 weeks we used the procedure described previously, fitting a GLM or a GAM according to the value of $\alpha$, and obtained the dispersion parameter. The means shown in online Technical Appendix Figure 4 (left) and Figure 4 (right) are the values of $\exp(\alpha)$ (the data were centred prior to analysis).

Relationships between Mean, Variance and Skewness

Figure 5 (B) and Figure 6 (B) of the main text show histograms of the slope parameters obtained for the two regressions, of log(variance) against log(mean), and of skewness against log(mean). The median value of the slope parameter for the linear regression of log(variance) on log(mean) was 1.2, corresponding to a variance function proportional to $\mu^{1.2}$. For the log-linear regression of skewness against log(mean), the median slope parameter was $-0.34$, corresponding to a skewness function proportional to $\mu^{-0.34}$. Thus, the data tend to exhibit greater variance and skewness than under the negative binomial model, though very substantial departures from it are uncommon. The quasi-Poisson and negative binomial models provide reasonable compromises if a single model is sought for all organisms, though there is clearly room for improvement, perhaps by allowing a more general power dependence between the variance (and the skewness) and the mean.

Technical Appendix Figures 1 to 5

All additional figures are referred to in the main text. Technical Appendix Figure 4 is also referred to under Results.
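The regression of log(variance) on log(mean) reported under Relationships between Mean, Variance and Skewness can be sketched as follows. This is an illustration only: the function names, the 26-week block length standing in for "adjacent 6-month periods", and the synthetic moments are assumptions, and the sketch returns the slope and its standard error rather than carrying out the formal test of slope 1.

```python
import math
from statistics import mean, variance


def block_moments(z, block_len=26):
    """Mean and sample variance of deseasonalised counts z_t in adjacent blocks.

    block_len=26 weeks is an assumed stand-in for a 6-month period.
    A trailing incomplete block is dropped.
    """
    starts = range(0, len(z) - block_len + 1, block_len)
    return [(mean(z[i:i + block_len]), variance(z[i:i + block_len])) for i in starts]


def loglog_slope(moments):
    """Least-squares fit of log(variance) = intercept + slope * log(mean).

    Returns (slope, intercept, standard error of the slope).
    Blocks with zero mean or zero variance are skipped, since their logs are undefined.
    """
    pts = [(math.log(m), math.log(v)) for m, v in moments if m > 0 and v > 0]
    n = len(pts)
    if n < 3:
        raise ValueError("need at least three usable blocks")
    xbar = sum(x for x, _ in pts) / n
    ybar = sum(y for _, y in pts) / n
    sxx = sum((x - xbar) ** 2 for x, _ in pts)
    sxy = sum((x - xbar) * (y - ybar) for x, y in pts)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((y - intercept - slope * x) ** 2 for x, y in pts)
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, se


# Synthetic moments following variance = 2 * mean**1.2 exactly
example = [(m, 2 * m ** 1.2) for m in (1, 2, 4, 8, 16)]
slope, intercept, se = loglog_slope(example)
```

A slope near 1 would be consistent with the negative binomial line $\log(\mathrm{var}) = \log(\phi) + \log(\mathrm{mean})$, with the intercept estimating $\log(\phi)$; comparing `(slope - 1) / se` with a t distribution on $n - 2$ degrees of freedom is one way the null hypothesis could be examined.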

Technical Appendix Figure 1. Number of laboratories reporting by date of specimen and date of report. Both decline over time. The numbers by date of report are lower than the number by date of specimen, suggesting batching of reports.

Technical Appendix Figure 2. Differences in weekly counts, date of report minus date of specimen. Left: isolates. Right: organism types. Both fluctuate around zero, with substantial variance.

Technical Appendix Figure 3. Histograms of reporting delays (days) for four organisms. Modal delays are longer for the salmonellas, owing to the extra typing step involved.

Technical Appendix Figure 4. Dispersion parameter. Left: plotted against mean count (on log scale). Right: for data by week of report and by week of specimen, with diagonal line.

Technical Appendix Figure 5. Weekly counts of Helicobacter pylori isolates. Left: histogram, showing an excess of weeks with a count of zero. Right: time series of weekly counts, showing sudden changes in level, and a long run of zeroes between 1996 and 2001.