Modeling Count Data from Hawk Migrations


M.S. Plan B Project Report
January 12, 2011
Fengying Miao, M.S. Applied and Computational Mathematics Candidate
Dr. Ronald Regal, Advisor
University of Minnesota Duluth, Department of Mathematics and Statistics

TABLE OF CONTENTS
i. ACKNOWLEDGEMENTS
ii. ABSTRACT
1 INTRODUCTION
2 HAWK EXAMPLE
3 THE GENERAL AND GENERALIZED LINEAR MODELS
3.1 General Linear Model
3.2 Exponential Family of Distributions
3.3 Generalized Linear Models
4 DO NOT LOG-TRANSFORM COUNT DATA
5 COMPARISON OF ESTIMATION METHODS USING THE DELTA METHOD
5.1 Expected Values and Variances of Nonlinear Functions of Random Variables
5.2 Single Mean
5.2.1 Poisson for single mean
5.2.2 Log-normal for single mean
5.3 Two Means
5.3.1 Poisson for two means
5.3.2 Log-normal for two means
5.4 Alternative Nonlinear Models
6 FURTHER COMPARISONS OF MODELS
6.1 Single Mean
6.1.1 Exact calculation for a single mean
6.1.2 Specific example of comparing exact calculation and the delta method
6.2 Two Means
6.2.1 Exact calculation for difference between means
6.2.2 Comparisons of Models for Two Means by Doing Simulation
6.3 Regression
7 FITTING MODELS TO HAWK DATA
7.1 Simple introduction to some potential variables
7.2 Fitting Models
7.2.1 Fit Mixed Model to Data
7.2.2 Fit Nlmixed Model to Data

7.3 Summary of Findings
CONCLUSION
REFERENCES
APPENDICES
SAS Code
R Code

i. ACKNOWLEDGEMENTS

I would like to take this opportunity to give my sincere thanks to my advisor, Dr. Ronald Regal, for his great support and guidance, which made it possible for me to finish the project. Dr. Ronald Regal is the best advisor I have ever had. I will never forget his great help in my study and life, or what he told me: that finding the limits of our knowledge and understanding is always important. I also want to thank Dr. Richard Green and Dr. Gerald Niemi for being on my degree committee, reviewing my report and providing useful suggestions. I also thank Heidi Seeland for providing the datasets in this project. Thanks to Dr. Zhuangyi Liu for accepting me into this good program and giving me the chance to learn from Dr. Regal and other great people.

ii. ABSTRACT

The General Linear Model (LM), with its assumptions of independence, linearity and equal variance, underlies most statistical analyses. Because of its generality, many kinds of data are transformed to satisfy its assumptions. Count data are often log-transformed using Ln(Y+1) to more nearly match the assumptions. However, adding a value of one to counts might generate biases, so we need to choose a proper model for count data. In addition, E(Ln(X)) is not the same as Ln(E(X)), so even if the relationship is linear for Ln(E(X)), the same will not be exactly true for E(Ln(X)). To avoid or reduce the bias from transforming data, the Generalized Linear Model (GLM) and the nonlinear mixed (NLMIXED) model could be considered instead.

This report investigates how LM regression models, Generalized Linear Models based on Poisson and negative binomial distributions, and approximate nonlinear models fit with NLMIXED compare when estimating the slope of a linear trend in count data. The comparisons are implemented in the statistical software SAS with the procedures PROC REG, PROC GENMOD, PROC MIXED and PROC NLMIXED. A real data set from a hawk migration is analyzed and fitted with the mixed model. The NLMIXED model is used to analyze the relationship between the variances and means of the real data.

1 INTRODUCTION

A statistical model is used to predict the probabilistic future behavior of a system from data. The main purpose of model building is to obtain estimates with small bias and little variability. The traditional linear model (LM) has been widely used, since many data can be modeled this way and a large body of theory is available. Transformations such as the square root and the logarithm are often applied to data, usually to the response variable, to meet the assumptions of the LM. These methods might work well for continuous response variables and for certain discrete variables, such as count data with few zero observations. Zero observations rule out a direct log transformation. For example, in a study where migrating hawks are counted hourly, the numbers counted are often zero.

More and more methods and models have been explored to relax these assumptions. The Generalized Linear Model (GLM), an extension of the LM, allows the analyst to specify the distribution of the data, which addresses the problem of transforming data to be normally distributed. The NLMIXED model, in which both fixed and random effects are allowed to have nonlinear relationships with the response variable, has become increasingly popular and allows flexible nonlinear functions as well as user-specified likelihood functions. These newer models can be applied to a wider range of real problems, and statistical software has kept pace with the numerical methods and made them more applicable.

To get the best estimates for a particular system, it is important to fit a model that describes the data well. In this report I describe my investigations into finding appropriate models to analyze count data such as hawk migration counts. Model selection among LM regression, GLMs and NLMIXED is based on simulations that are done separately for Poisson data and log-normal data. The GLMs are used by specifying Poisson and negative binomial distributions. Relative bias and relative RMSE are used to evaluate how well the models work. The relationship between the variance and the mean of the real data is modeled with the NLMIXED model, which cannot deal with the complicated random effects in the real data we are going to study. A mixed model fit with SAS PROC MIXED is used for the real hawk data to account for the dependence of observations on the same day and from the same site.

2 HAWK EXAMPLE

A data set from monitoring of migrating hawks is used to illustrate the issues and conclusions in this report. The data were collected in the fall of 2008 by Heidi Seeland and Anna Peterson, graduate students at UMD with Dr. Gerald Niemi as their advisor. In this section, the structure of the hawk data is described. Further details on fitting models to the data are discussed in later chapters.

The data set contains counts of hawks and eagles at three distances from the shore of Lake Superior, over seven hours on certain days between August 29, 2008, and November 11, 2008. The sampling plan had eight sets of three sampling locations spread out along the north shore of Lake Superior. The eight sets of three sampling sites were called transects and were numbered from 1 to 8 up the shore, starting from Duluth.

Fig. 1 Locations of Hawk Counts (Seeland 2010)

One general category of hawks is accipiters, which fly lower and closer to tree cover. To make the data set more understandable, let us first introduce buteos, larger hawks with broad wings that soar higher on wind currents. Figure 2 shows a plot of the average number of buteos per 7-hour day plotted against date.

Fig. 2 Average Buteos Counts per day VS Dates in Original Scale

In this original scale, the daily buteo counts are widely dispersed over time. The huge variation makes the form of the time trend unclear. As shown in Figure 3, a log transformation of the buteo counts gives a more clearly increasing trend across time.

Fig. 3 Average Buteos Counts per day VS Dates in Log-Scale

The relationship between buteo counts and date follows a general linear trend, except for the last two points. For simplicity in applying models in this project, which focuses on estimating the slope of a linear trend, we will leave out points after November 1 when demonstrating the fitting of some models to these data.

3 THE GENERAL AND GENERALIZED LINEAR MODELS

Recent papers, including O'Hara and Kotze (2010), have advocated generalized linear models with Poisson or negative binomial distributions rather than normal linear models in the log scale. Before comparing these models in a wider range of situations than considered by O'Hara and Kotze, I will briefly describe general and generalized linear models.

3.1 General Linear Model

Consider a situation where we are interested, for example, in describing the number of tickets people get annually for violating traffic regulations as a function of their age. The average number of violation tickets is predicted by the following equation

$$y = \beta_0 + \beta_1 x + \varepsilon \qquad (3.1)$$

where $y$ is the response variable, Violation Tickets, $x$ is the explanatory variable, Age, and $\varepsilon$ measures the deviation of the measured $y$ from its expected value. It may now be asked whether, after allowing for the effect of age, a person's sex has any influence on the frequency of violation tickets. Under this assumption, the appropriate model might be described as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \qquad (3.2)$$

where $x_1$ and $x_2$ represent Age and Sex, respectively. Each time a new variable is introduced into the model, an additional parameter is added. This process is an approach by which we find a mathematical description of the structure in the values of the response variable. The two models discussed above involve a linear combination of the parameters $\beta_0$, $\beta_1$, $\beta_2$, and are consequently known as linear models. For example, a polynomial regression model belongs to this category despite the fact that $y$ is a nonlinear function of the explanatory variable. The general form of linear models is described as

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i \qquad (3.3)$$

where $i = 1, \dots, n$ and $\varepsilon_i$ represents the error that the explanatory variables cannot explain. By introducing vectors and matrices, (3.3) can be rewritten as

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \qquad (3.4)$$

or in the following compact form

$$y = X\beta + \varepsilon. \qquad (3.5)$$

Besides linearity, the usual general linear model also assumes normality, independence and equal variance of the observations, which can be written as

$$y = X\beta + \varepsilon, \quad \text{where } E(\varepsilon) = 0, \; \varepsilon \sim N(0, \sigma^2 I). \qquad (3.6)$$

3.2 Exponential Family of Distributions

Linear models are postulated more often than nonlinear ones because they are mathematically easier to manipulate and usually easier to interpret. They appear to provide an adequate description of many data sets. A wider class of distributions that includes the normal distribution is called the exponential family of distributions. Consider a single random variable $Y$ whose probability distribution depends on a single parameter $\theta$. The distribution belongs to the exponential family if it can be written in the form

$$f(y; \theta) = \exp\left[a(y)\,b(\theta) + c(\theta) + d(y)\right]. \qquad (3.7)$$

If $a(y) = y$, the distribution is said to be in canonical form, and $b(\theta)$ is sometimes called the natural parameter of the distribution. Other parameters are regarded as nuisance parameters. The exponential family includes such useful distributions as the binomial, Poisson, negative binomial, and gamma distributions, in addition to the normal distribution.

3.3 Generalized Linear Models

Many types of data are not normally distributed in the original scale. To address this problem, a transformation may be used to normalize the data. Often, people try the log transformation first, before evaluating other transformations. But discrete response variables, such as bird count data, often contain many zero observations and are unlikely to have a normally distributed error structure. Maindonald & Braun (2007) argued that generalized linear models (GLMs) have largely removed the need for transforming count data, and GLMs are now commonly used.

A GLM is an extension of the well-known linear model to response variables that follow any probability distribution in the exponential family. The key idea is that the relationship between the mean $\mu_i$ and a linear predictor is specified by a link function:

$$g(\mu_i) = x_i^T \beta \qquad (3.8)$$

where $\mu_i = E(Y_i)$ and $g$ is a link function that links the random component, $E(Y_i)$, to the systematic component $x_i^T \beta$. Equation (3.8) can be written as

$$\mu_i = g^{-1}(x_i^T \beta). \qquad (3.9)$$

For example, count data could be appropriately analyzed as a Poisson random variable within the context of the Generalized Linear Model. So, for the observed bird count $Y_i$, we have $Y_i \sim \text{Poisson}(\mu_i)$. The probability function for $Y_i$ is

$$P(Y_i = y) = \frac{\mu_i^{y} e^{-\mu_i}}{y!}, \quad y = 0, 1, 2, \dots \qquad (3.10)$$

If we had a covariate $x$ for the predictor days, then

$$\log(\mu_i) = \beta_0 + \beta_1 x_i. \qquad (3.11)$$

For the Poisson distribution, the mean and variance are equal. Real data do not always follow this, and the variance ($\sigma^2$) is often much larger than the mean $\mu$. This so-called overdispersion can be incorporated into a model in several ways. These all estimate the amount of extra variation but make different assumptions about how this extra variation scales with the mean. The negative binomial distribution, for example, assumes $\sigma^2 = \mu + \mu^2/\theta$ with an overdispersion parameter $\theta$ and the mean $\mu$. The negative binomial distribution approaches the Poisson distribution when $\theta$ is much bigger than $\mu$, i.e. as $\theta$ approaches infinity.

To introduce the negative binomial distribution in a simple way, we only use one variable here. Suppose $Y \sim NB(\mu, \theta)$ where $\mu > 0$ and $\theta > 0$. Then we can describe the probability function of the negative binomial distribution as follows:

$$P(Y = y) = \frac{\Gamma(y + \theta)}{\Gamma(\theta)\, y!} \left(\frac{\theta}{\theta + \mu}\right)^{\theta} \left(\frac{\mu}{\theta + \mu}\right)^{y}, \quad y = 0, 1, 2, \dots \qquad (3.12)$$
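The variance relationship for the negative binomial can be checked numerically. The sketch below is in Python rather than the report's SAS and assumes the mean/overdispersion parameterization with variance $\mu + \mu^2/\theta$; the function names and the values $\mu = 5$, $\theta = 2$ are illustrative only.

```python
import math

def negbin_pmf(y, mu, theta):
    """P(Y = y) for a negative binomial with mean mu and overdispersion theta.

    Assumed parameterization: Var(Y) = mu + mu**2 / theta.
    """
    log_p = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def poisson_pmf(y, mu):
    """P(Y = y) for a Poisson distribution with mean mu."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

def moments(pmf, upper=2000):
    """Mean and variance of a count distribution, summing the pmf up to `upper`."""
    probs = [pmf(y) for y in range(upper)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

mu, theta = 5.0, 2.0  # illustrative values, not taken from the report
mean, var = moments(lambda y: negbin_pmf(y, mu, theta))
print(mean, var)                  # close to mu and mu + mu**2 / theta
print(negbin_pmf(3, mu, 1e6), poisson_pmf(3, mu))  # nearly equal for large theta
```

With a very large $\theta$ the two probability functions agree closely, which is the Poisson limit described above.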

The negative binomial is also a Gamma-Poisson mixture. Suppose $Y \mid \lambda \sim \text{Poisson}(\lambda)$ and $\lambda \sim \text{Gamma}$ with shape $\theta$ and rate $\theta/\mu$. Then we have the following:

$$P(Y = y) = \int_0^\infty \frac{\lambda^{y} e^{-\lambda}}{y!} \cdot \frac{(\theta/\mu)^{\theta}}{\Gamma(\theta)} \lambda^{\theta - 1} e^{-\theta\lambda/\mu} \, d\lambda = \frac{\Gamma(y + \theta)}{\Gamma(\theta)\, y!} \left(\frac{\theta}{\theta + \mu}\right)^{\theta} \left(\frac{\mu}{\theta + \mu}\right)^{y}. \qquad (3.13)$$

4 DO NOT LOG-TRANSFORM COUNT DATA

O'Hara and Kotze (2010) provide a detailed discussion in their paper "Do Not Log-transform Count Data". In that paper, they point out that log-transformation of counts has the additional quandary of how to deal with zero observations. With just one zero observation (if this observation represents a sampling unit), the whole data set is usually adjusted by adding a value (usually 1, the lowest possible nonzero count) before transformation. They therefore proposed GLMs for count data. They simulated data sets from a negative binomial distribution with different values of $\theta$. Low $\theta$ indicates greater variance in the data. From section 3, we know that the negative binomial distribution can be viewed as a gamma mixture of Poissons. Low $\theta$ concentrates the Gamma density of $\lambda$ near small values, thus generating more clumped data. For each simulation, n=100 data points were simulated at each of 20 mean values, µ = 1, 2,..., 20. Five hundred replicate simulations were carried out for each value of $\theta$. They then compared the outcome of fitting models to data transformed in various ways (log, square root) with results from fitting overdispersed quasi-Poisson models and negative binomial models to untransformed count data. The simulations were compared by calculating the mean bias and root mean-squared error in estimating log(µ).

In their results, the quasi-Poisson and negative binomial models behave similarly, having negligible bias, whereas the models based on a normal distribution are all biased, particularly at low means and high variances. The square-root transformation has a lower bias than any of the log-transformations, unless the mean is low. Thus, they recommend that count data not be transformed for use in parametric tests. For such data, GLMs and their derivatives are more appropriate. However, their simulations were from negative binomial distributions. Overdispersed Poisson models still model the variance as proportional to the mean, whereas negative binomial models include a term in the variance proportional to the mean squared.

In many data sets, Ln(Y+1) is fairly normal. In the discussions from here on, Log and Ln are interchangeable notations. In statistics, Log generally means Ln. For example, in SAS, Log(y) means Ln(y). Fitting a linear relationship to Ln(Y+1) of the daily counts of buteos shown in Figure 3 gives us the following normal plot of residuals.

This normal plot is reasonably straight, at least close enough for normal methods to work well. In later sections where we fit models to hourly counts, the normal plot is even straighter. The results from O'Hara and Kotze are limited to 1) negative binomial data, 2) estimating a single mean and 3) very large replication, n=100. The generalized linear models worked in their simulations, but how will they work in estimating slopes of trends if the data are normal in the log scale, with variances unlike those of Poisson or negative binomial data?

5 COMPARISON OF ESTIMATION METHODS USING THE DELTA METHOD

5.1 Expected Values and Variances of Nonlinear Functions of Random Variables

In the discussions below, I will use Taylor series approximations for the expected values and variances of nonlinear functions. First, I describe these methods, commonly known as propagation of error or the delta method.

Suppose we have a random variable $X$, and we know $E(X)$ and $Var(X)$, but we are interested in the mean and variance of $Y = g(X)$ for some function $g$. For example, we might be able to measure $X$ and determine its mean and variance, but we are really interested in $Y$, which is related to $X$ in a known way. If $g$ is linear, then this is straightforward:

$$g(X) = a + bX \qquad (5.1.1)$$
$$E[g(X)] = a + b\,E(X) \qquad (5.1.2)$$
$$Var[g(X)] = b^2\,Var(X) \qquad (5.1.3)$$

However, in many cases $g$ is not linear. In many areas of mathematics we find approximations by linearizing a nonlinear problem we cannot solve exactly. In probability and statistics, this method is called propagation of error or the delta method.

Denote by $\mu$ the mean of $X$ and by $\sigma^2$ its variance. We use a first-order Taylor series approximation around $\mu$:

$$g(X) \approx g(\mu) + g'(\mu)(X - \mu) \qquad (5.1.4)$$

Since $E(X - \mu) = 0$,

$$E[g(X)] \approx g(\mu) \qquad (5.1.5)$$
$$Var[g(X)] \approx [g'(\mu)]^2 \sigma^2 \qquad (5.1.6)$$
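As a small numerical illustration of the first-order approximation, the Python sketch below compares the delta-method mean and variance of $\log X$ with exact values for a hypothetical discrete $X$ taking the values 8 to 12 with equal probability. The distribution and the names are illustrative assumptions, not part of the report.

```python
import math

def delta_first_order(g, g_prime, mu, var):
    """First-order delta-method mean and variance of g(X)."""
    return g(mu), g_prime(mu) ** 2 * var

# Hypothetical discrete X: equally likely values 8..12 (mean 10, variance 2).
values = [8, 9, 10, 11, 12]
probs = [0.2] * 5
mu = sum(v * p for v, p in zip(values, probs))
var = sum((v - mu) ** 2 * p for v, p in zip(values, probs))

# Delta-method approximation for g(x) = log(x), with g'(x) = 1/x.
approx_mean, approx_var = delta_first_order(math.log, lambda x: 1.0 / x, mu, var)

# Exact moments of log(X) by direct summation over the support.
exact_mean = sum(math.log(v) * p for v, p in zip(values, probs))
exact_var = sum((math.log(v) - exact_mean) ** 2 * p for v, p in zip(values, probs))

print(approx_mean, exact_mean)  # the exact mean is slightly smaller
print(approx_var, exact_var)
```

The exact mean falls a little below $\log \mu$ because the logarithm is concave, which is the gap a second-order term is meant to capture.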

We have $E[g(X)] \approx g(\mu)$, but we know that in general $E[g(X)] \neq g(E[X])$ from Jensen's Inequality. Thus, we can carry the Taylor series expansion to the second order to get an improved approximation of $E[g(X)]$:

$$g(X) \approx g(\mu) + g'(\mu)(X - \mu) + \tfrac{1}{2} g''(\mu)(X - \mu)^2 \qquad (5.1.7)$$

Taking the expectation of the right-hand side, we have

$$E[g(X)] \approx g(\mu) + g'(\mu)\,E(X - \mu) + \tfrac{1}{2} g''(\mu)\,E[(X - \mu)^2] \qquad (5.1.8)$$
$$E[g(X)] \approx g(\mu) + \tfrac{1}{2} g''(\mu)\,\sigma^2 \qquad (5.1.9)$$

How good such approximations are depends on how nonlinear $g$ is in the neighborhood of $\mu$ defined by the size of $\sigma$, where $\sigma$ is the standard deviation of $X$.

In comparing the disadvantages of using Log(Y+1), the Poisson case, with Y from a Poisson distribution, should be studied, since log-normal estimation is at a great disadvantage there. Using the delta method, we can start by comparing Poisson and log-normal estimation for the simple case of a single mean, the case considered by O'Hara and Kotze, and then compare two means. For observation $Y_i$ from the hawk counts, we have $\log \mu_i = \beta_0 + \beta_1 x_i$ as in (3.11). To keep the notation consistent through the discussions of one mean, two means, and regression, I will use $\beta_0$ or $\beta_1$ throughout for the parameter being estimated.

5.2 Single Mean

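Most of this report focuses on estimating changes or slopes across time, for example estimating how bird populations are changing across several years of monitoring. But first I will briefly discuss the case of estimating a single mean. In the one-mean case, we consider the average number of hawks at a single value of $x$, where $x$ is the predictor day. Let $\mu = E(Y)$ and $\beta_0 = \log \mu$, and let $Y_1, \dots, Y_n$ be $n$ replicate counts.

5.2.1 Poisson for single mean

From (3.10), the Poisson estimate of $\beta_0$ is $\hat\beta_0 = \log \bar{Y}$ with $Var(\bar{Y}) = \mu/n$, so for the Poisson likelihood, using the delta method, the expectation and variance of $\hat\beta_0$ can be obtained as follows:

$$E(\hat\beta_0) \approx \log \mu - \frac{1}{2n\mu} \qquad (5.2.1)$$
$$Var(\hat\beta_0) \approx \frac{1}{n\mu} \qquad (5.2.2)$$

Note that the degree of bias depends on the number of replicates, $n$. O'Hara and Kotze used n=100, which results in little bias if Poisson data are modeled.

5.2.2 Log-normal for single mean

If we use a normal distribution as an approximation to the distribution of $\mathrm{Ln}(Y+1)$, then $\hat\beta_0$ is the average of the values $\mathrm{Ln}(Y_i + 1)$. For the Poisson model above we use the log of the average, whereas in the normal model we use the average of the log values.

$$E(\hat\beta_0) \approx \log(\mu + 1) - \frac{\mu}{2(\mu + 1)^2} \qquad (5.2.3)$$
$$Var(\hat\beta_0) \approx \frac{\mu}{n(\mu + 1)^2} \qquad (5.2.4)$$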
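The two single-mean estimators can be compared numerically. The Python sketch below assumes Poisson counts, takes the Poisson estimate to be the log of the sample mean and the normal-model estimate to be the average of log(Y+1), and checks the delta-method biases against direct summation over the Poisson distribution. The function names and the values µ = 5, n = 10 are illustrative assumptions.

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with mean lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def delta_bias_log_mean(mu, n):
    """Delta-method bias of log(Ybar) as an estimate of log(mu)."""
    return -1.0 / (2.0 * n * mu)

def delta_bias_mean_log_plus1(mu):
    """Delta-method bias of the average of log(Y+1) as an estimate of log(mu)."""
    return math.log(mu + 1.0) - mu / (2.0 * (mu + 1.0) ** 2) - math.log(mu)

def exact_bias_log_mean(mu, n):
    """Exact bias of log(S/n), S ~ Poisson(n*mu), conditional on S > 0."""
    lam = n * mu
    upper = int(lam + 20 * math.sqrt(lam) + 50)  # far into the upper tail
    total = sum(math.log(s / n) * poisson_pmf(s, lam) for s in range(1, upper))
    return total / (1.0 - math.exp(-lam)) - math.log(mu)

def exact_bias_mean_log_plus1(mu):
    """Exact bias of the average of log(Y+1); it does not depend on n."""
    upper = int(mu + 20 * math.sqrt(mu) + 50)
    total = sum(math.log(y + 1) * poisson_pmf(y, mu) for y in range(upper))
    return total - math.log(mu)

mu, n = 5.0, 10  # illustrative values
print(delta_bias_log_mean(mu, n), exact_bias_log_mean(mu, n))
print(delta_bias_mean_log_plus1(mu), exact_bias_mean_log_plus1(mu))
```

Increasing `n` shrinks the first pair of biases toward zero, while the second pair is unchanged, since the bias of an average of log values is the bias of a single log value.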

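Note that the Poisson model has the smaller bias, with an expected value closer to log(µ). The smaller bias is more pronounced for larger n, such as n=100 for O'Hara and Kotze. Unlike Poisson estimation, the bias in using the mean of log values does not disappear with increasing n.

5.3 Two Means

Suppose that $\mu_1$ and $\mu_2$ correspond to the average number of hawks at two values of $x$ one unit apart. In this case a regression of Y on X will give a slope that is the same as the difference between the means. Considering the difference between means in the log scale, I will use $\beta_0$ for $\log \mu_1$ and $\beta_1$ for $\log \mu_2 - \log \mu_1$.

5.3.1 Poisson for two means

From (3.10), we know the true $\beta_0 = \log \mu_1$ and $\beta_1 = \log \mu_2 - \log \mu_1$. Then we get $\hat\beta_0 = \log \bar{Y}_1$ and $\hat\beta_1 = \log \bar{Y}_2 - \log \bar{Y}_1$, where $\bar{Y}_1$ and $\bar{Y}_2$ are the averages of the $n$ counts in each group. By applying the delta method, we have

$$E(\hat\beta_1) \approx \log \mu_2 - \log \mu_1 - \frac{1}{2n\mu_2} + \frac{1}{2n\mu_1} \qquad (5.3.1)$$
$$Var(\hat\beta_1) \approx \frac{1}{n\mu_1} + \frac{1}{n\mu_2} \qquad (5.3.2)$$

5.3.2 Log-normal for two means

For the log-transformation to $\mathrm{Ln}(Y+1)$, $\hat\beta_0$ is the average of $\mathrm{Ln}(Y+1)$ in the first group, and $\hat\beta_1$ is the difference between the two group averages of $\mathrm{Ln}(Y+1)$.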

We are more concerned about the slope, so we would like to obtain the following by using the delta method:

$$E(\hat\beta_1) \approx \left[\log(\mu_2 + 1) - \frac{\mu_2}{2(\mu_2 + 1)^2}\right] - \left[\log(\mu_1 + 1) - \frac{\mu_1}{2(\mu_1 + 1)^2}\right] \qquad (5.3.3)$$
$$Var(\hat\beta_1) \approx \frac{\mu_1}{n(\mu_1 + 1)^2} + \frac{\mu_2}{n(\mu_2 + 1)^2} \qquad (5.3.4)$$

Again, the primary disadvantage of using normal likelihood methods is the larger bias. The results given above are based on approximations, but according to the simulations and exact calculations given below, the general trends are accurate.

5.4 Alternative Nonlinear Models

In the previous sections we used the delta method to find approximations for the mean and variance of the parameter estimates, and we saw that using $\mathrm{Ln}(Y+1)$ results in more biased estimates. Alternatively, we could use these approximations to derive less biased estimators. Since the means are no longer linear functions of the parameters, we need nonlinear models to accomplish the estimation. Nonlinear mixed models, in which fixed and random effects have nonlinear relationships to the response variable, are becoming more and more popular. For $\mathrm{Ln}(Y+1)$, with $\mu = E(Y)$ and $\sigma^2 = Var(Y)$, using Taylor series expansions:

$$E[\mathrm{Ln}(Y+1)] \approx \log(\mu + 1) - \frac{\sigma^2}{2(\mu + 1)^2} \qquad (5.4.1)$$
$$Var[\mathrm{Ln}(Y+1)] \approx \frac{\sigma^2}{(\mu + 1)^2} \qquad (5.4.2)$$

If we assume that the variance is equal to the mean, as in a Poisson distribution, then a normal approximation will use

$$\mathrm{Ln}(Y+1) \sim N\left(\log(\mu + 1) - \frac{\mu}{2(\mu + 1)^2}, \; \frac{\mu}{(\mu + 1)^2}\right) \qquad (5.4.3)$$

More generally, we can assume an overdispersion model such as $\sigma^2 = c\mu$ or $\sigma^2 = c\mu^r$. Since both the mean and the variance are nonlinear functions of the parameters, a procedure such as SAS NLMIXED is used to fit these nonlinear models, as I discuss further below.

6 FURTHER COMPARISONS OF MODELS

The final purpose is to fit a good model to data on hawk migration by modeling effects such as date, time of day, weather and distance from shore. To compare alternative models, I did simulations and exact calculations to investigate how Poisson, negative binomial and log-normal models compare when the data are Poisson, negative binomial, or log-normal for log(y+1). I also investigated methods for bias correction using approximate propagation of error methods for the log-normal model for log(y+1). A simple way to check different models is to look only at how hawk counts are distributed as a function of time.

6.1 Single Mean

6.1.1 Exact calculation for a single mean

In section 5, we discussed the application of the delta method to a single mean and to two means. For the single-mean case, we only compare the exact calculation with the delta method. Let $S = \sum_{i=1}^{n} Y_i \sim \text{Poisson}(n\mu)$ and $\hat\beta_0 = \log(S/n)$. The exact calculation, with the estimator undefined for S=0, is

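$$E(\hat\beta_0 \mid S > 0) = \sum_{s=1}^{\infty} \log(s/n)\, \frac{P(S = s)}{1 - P(S = 0)} \qquad (6.1.1)$$
$$Var(\hat\beta_0 \mid S > 0) = \sum_{s=1}^{\infty} [\log(s/n)]^2\, \frac{P(S = s)}{1 - P(S = 0)} - [E(\hat\beta_0 \mid S > 0)]^2 \qquad (6.1.2)$$

Using the average of $\mathrm{Ln}(Y_i + 1)$ as the estimator,

$$E(\hat\beta_0) = \sum_{y=0}^{\infty} \log(y + 1)\, P(Y = y) \qquad (6.1.3)$$
$$Var(\hat\beta_0) = \frac{1}{n}\left\{\sum_{y=0}^{\infty} [\log(y + 1)]^2\, P(Y = y) - [E(\hat\beta_0)]^2\right\} \qquad (6.1.4)$$

The two methods are not on an equal footing above, since the Poisson calculations do not use the S=0 cases, but these cases are not common in the models considered, and comparisons of exact and delta-method results for the same model are completely comparable.

6.1.2 Specific example of comparing exact calculation and the delta method

I use one simple case to illustrate the differences between the exact calculation and the delta method approximation. Assume that we have $n$ observations and that the observed hawk counts have mean $\mu$. Then the true value of $\beta_0$ is $\log \mu$. Applying the equations in sections 5.2 and 6.1, results for the bias and root mean squared error (RMSE) of the estimate of $\beta_0$ are shown below.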

Table of bias and RMSE for the given $n$ and $\mu$. Columns: (1) exact calculation for Poisson regression; (2) delta method for Poisson regression; (3) exact calculation for log-normal of Ln(Y+1); (4) delta method for log-normal of Ln(Y+1).

$$\text{Bias} = E(\hat\beta_0) - \beta_0 \qquad (6.1.5)$$
$$\text{RMSE} = \sqrt{Var(\hat\beta_0) + \text{Bias}^2} \qquad (6.1.6)$$

Conclusions from these results are as follows.

1) Comparing the first and second columns, or the third and fourth, we see that the delta method approximations are quite good for means of this size. For smaller means the approximations will be less precise. The delta method approximations could be used to develop more efficient models, as done later in this report.

2) Comparing the first and third columns, the normal approximation has larger bias, smaller variance, and a slightly larger RMSE. Approximately bias-corrected estimators could therefore be competitive with generalized linear models.

For comparing models with discrete distributions and closed-form solutions, such as log(y+1) or Poisson estimation, the simulations of O'Hara and Kotze can be replaced with exact calculations. In addition, simple delta method approximations can be used for initial comparisons of alternative modeling methods before using lengthier exact calculations for final results on promising methods.

6.2 Two Means

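The next step up in complexity is comparing two means. The difference between the means is the same as the regression slope with only two x values. In this section let us look at this simpler case before moving on to the more usual regression case.

6.2.1 Exact calculation for difference between means

In the two-means case, I compare exact calculation, the delta method and simulation for the Poisson and log-normal models. We again assume the data are Poisson distributed, with means $\mu_1$ and $\mu_2$. Based on (3.10), for the Poisson method of estimation, $\hat\beta_1 = \log \bar{Y}_2 - \log \bar{Y}_1$, where $\bar{Y}_k = S_k/n$ and $S_k = \sum_{i=1}^{n} Y_{ki} \sim \text{Poisson}(n\mu_k)$, and the exact $E(\hat\beta_1)$ and $Var(\hat\beta_1)$ can be obtained as follows:

$$E(\hat\beta_1) = E(\log \bar{Y}_2 \mid S_2 > 0) - E(\log \bar{Y}_1 \mid S_1 > 0) \qquad (6.2.1)$$
$$Var(\hat\beta_1) = Var(\log \bar{Y}_1 \mid S_1 > 0) + Var(\log \bar{Y}_2 \mid S_2 > 0) \qquad (6.2.2)$$

where $E(\log \bar{Y}_k \mid S_k > 0)$ and $Var(\log \bar{Y}_k \mid S_k > 0)$ are given by (6.1.1) and (6.1.2) with $\mu = \mu_k$, $k = 1, 2$.

For the method that uses $\mathrm{Ln}(Y+1)$, we have the following, with $m_k$ and $v_k$ the exact mean and variance from (6.1.3) and (6.1.4) for $\mu = \mu_k$:

$$E(\hat\beta_1) = m_2 - m_1 \qquad (6.2.3)$$
$$Var(\hat\beta_1) = v_1 + v_2 \qquad (6.2.4)$$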
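The exact calculation for two means can be sketched in Python by summing over the Poisson distribution for each group. The sketch assumes independent Poisson groups of equal size, conditions the Poisson estimator on a nonzero total, and uses the illustrative values µ1 = 5, µ2 = 10, n = 10; the function names are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with mean lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def log_mean_moments(mu, n):
    """Exact mean and variance of log(S/n), S ~ Poisson(n*mu), given S > 0."""
    lam = n * mu
    upper = int(lam + 20 * math.sqrt(lam) + 50)
    norm = 1.0 - math.exp(-lam)
    m1 = sum(math.log(s / n) * poisson_pmf(s, lam) for s in range(1, upper)) / norm
    m2 = sum(math.log(s / n) ** 2 * poisson_pmf(s, lam) for s in range(1, upper)) / norm
    return m1, m2 - m1 ** 2

def mean_log_plus1_moments(mu, n):
    """Exact mean and variance of the average of n values of log(Y+1)."""
    upper = int(mu + 20 * math.sqrt(mu) + 50)
    m1 = sum(math.log(y + 1) * poisson_pmf(y, mu) for y in range(upper))
    m2 = sum(math.log(y + 1) ** 2 * poisson_pmf(y, mu) for y in range(upper))
    return m1, (m2 - m1 ** 2) / n

def slope_bias_rmse(moments, mu1, mu2, n):
    """Bias and RMSE of the estimated difference of log means (two independent groups)."""
    e1, v1 = moments(mu1, n)
    e2, v2 = moments(mu2, n)
    bias = (e2 - e1) - math.log(mu2 / mu1)
    return bias, math.sqrt(v1 + v2 + bias ** 2)

print(slope_bias_rmse(log_mean_moments, 5.0, 10.0, 10))        # Poisson estimator
print(slope_bias_rmse(mean_log_plus1_moments, 5.0, 10.0, 10))  # Ln(Y+1) estimator
```

Passing the moment function as an argument keeps the bias and RMSE bookkeeping identical for both estimators, so any difference in the output comes from the estimators themselves.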

6.2.2 Comparisons of Models for Two Means by Doing Simulation

Here I compare the biases and RMSEs of $\hat\beta_1$ for different methods of estimation: Poisson, negative binomial, regression on log(y+1), and the nonlinear mixed model. Data sets were simulated from a Poisson distribution. To check whether the means and the number of data points in each simulation matter, I simulated data sets for four combinations of the two means $(\mu_1, \mu_2)$ and the number of data points ($n = 10$ or $n = 20$). The data were analyzed assuming that time is a factor. Models were fitted making the following assumptions about the response, y:

1. y follows a Poisson distribution
2. y follows a negative binomial distribution
3. log(y+1) follows a normal distribution
   a. A standard regression with mean linearly related to x and constant variance.
   b. Nonlinear approximations to the mean and variance with nlmixed.

The simulations were compared by using the mean bias and root mean-squared error (RMSE). Simulations and analyses were carried out in the SAS statistical program using proc reg, proc genmod and proc nlmixed.

Fig. 4 and Fig. 5 show the bias and RMSE of $\hat\beta_1$ for the different models, for data generated from different pairs of means and numbers of data points. For example, 5_10with20 means that the data are generated from $\mu_1 = 5$, $\mu_2 = 10$ and $n = 20$. The amount of bias does not depend much on the two means or the number of data points, even though the regression model for log(y+1) depends a little on the two means. Overall, the nonlinear mixed model gives the best estimate of the slope, that is, of the difference of means. The data set with the higher mean generates lower bias than the one with the lower mean.

Fig. 4 Estimated mean bias from four different models, applied to data simulated from a Poisson distribution. A low bias means that the model will basically return the true value.

The root mean-squared error shows a similar pattern, with the nonlinear mixed model having a low RMSE. A combination of a higher mean and more data points gives a lower RMSE. From the plots in Fig. 4 and Fig. 5, the Poisson, negative binomial and nlmixed models perform well for Poisson data no matter what values are chosen for $\mu_1$ and $\mu_2$. In short, we do not have to worry that the choice of these values affects the outcome of comparing models.
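A stripped-down version of this kind of simulation can be sketched in Python. It is not the report's SAS code: it covers only two of the estimators (the Poisson estimate, taken as the log ratio of the sample means, and the difference of the averages of log(y+1)), uses a simple Poisson generator suitable for small means, and assumes the illustrative settings µ1 = 5, µ2 = 10, n = 10.

```python
import math
import random

def rpoisson(rng, lam):
    """Poisson draw by Knuth's multiplication method; adequate for small lam."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate(mu1, mu2, n, nsim, seed=1895):
    """Return ((bias, rmse) for the Poisson estimator, (bias, rmse) for log(y+1))."""
    rng = random.Random(seed)
    true = math.log(mu2 / mu1)
    est_pois, est_ln = [], []
    for _ in range(nsim):
        y1 = [rpoisson(rng, mu1) for _ in range(n)]
        y2 = [rpoisson(rng, mu2) for _ in range(n)]
        if sum(y1) == 0 or sum(y2) == 0:
            continue  # log of a zero mean is undefined; skip such replicates
        est_pois.append(math.log(sum(y2) / sum(y1)))
        est_ln.append(sum(math.log(y + 1) for y in y2) / n
                      - sum(math.log(y + 1) for y in y1) / n)

    def bias_rmse(est):
        bias = sum(est) / len(est) - true
        rmse = math.sqrt(sum((e - true) ** 2 for e in est) / len(est))
        return bias, rmse

    return bias_rmse(est_pois), bias_rmse(est_ln)

pois, ln1 = simulate(5.0, 10.0, n=10, nsim=2000)
print("Poisson   bias, RMSE:", pois)
print("log(y+1)  bias, RMSE:", ln1)
```

Both estimators are computed on the same simulated replicates, so the comparison is paired and the difference in bias is not masked by simulation noise.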

Fig. 5 Estimated root mean-squared error from four different models, applied to data simulated from a Poisson distribution

6.3 Regression

From the previous sections, using nonlinear approximations to the means and variances is a viable alternative to fitting the correct model when data are Poisson. If data were always Poisson, these methods would not be necessary. But since data often follow fairly log-normal patterns, with much larger variances relative to the mean than a Poisson distribution or even a negative binomial distribution, these methods could work well over a large range of models. For the hawk migration data, our primary interests are usually regression-type analyses, such as whether the populations are decreasing over a span of years.

To decide what methods I should use to fit models to the hawk data, I will compare Poisson regression, ordinary regression on Ln(Y+1), negative binomial regression, and nlmixed models when the data are Poisson and when they are log-normally distributed. We are most interested in estimating the slope, $\beta_1$, to monitor changes in bird populations over time. The results will be shown with the relative bias and relative RMSE of $\hat\beta_1$, which makes it easier to see how large the bias and RMSE are without referring to the actual parameter values.

$$\text{relative bias} = \frac{E(\hat\beta_1) - \beta_1}{\beta_1} \qquad (6.2.5)$$
$$\text{relative RMSE} = \frac{\sqrt{E[(\hat\beta_1 - \beta_1)^2]}}{\beta_1} \qquad (6.2.6)$$

For example, a value of 0.2 means that the error is 20% of the true value. For simplicity here, we consider the number of days past September 1 as a trend factor for the hourly hawk counts and generate data for days from 0 to 50.

6.3.1 Regression with Poisson data

The simplest model for count data is a Poisson distribution, so as in previous sections, data are first simulated from a Poisson distribution. The SAS statistical program is the main one used for the analyses. The main procedures are proc reg, proc genmod (for Poisson and negative binomial regression) and proc nlmixed. The following code is used to generate Poisson data with mean $\exp(\beta_0 + \beta_1 \cdot \text{days})$:

    %let b0=0;
    %let b1=0.02;
    %let nsim=1000;
    %let n=10;

    data sim3.mydata;
      call streaminit(1895);
      do isim=1 to &nsim;            /* simulation replicate */
        do days=0 to 50 by 5;        /* days past September 1 */
          do rep=1 to &n;            /* replicate counts per day */
            mu=&b0+(&b1)*days;       /* linear predictor on the log scale */
            y=rand('poisson',exp(mu));
            ln_y_1=log(y+1);
            output;
          end;
        end;
      end;
    run;

The values for β0 and β1 are chosen to represent relatively small expected counts corresponding to hourly observations of hawk counts, Y; the mean counts increase from an average of 1 bird per hour to 2.7 birds per hour. For the nlmixed model for Ln(Y+1), the approximations to the mean and variance of Ln(Y+1) play a big role in the parameter estimates. From section 6.1, the delta method is a good way to do the approximations. The nlmixed code is as follows:

    proc nlmixed data=sim3.mydata;
      ods output ParameterEstimates=parm_nlmix;
      by isim;
      title 'nlmixed';
      parms b0=0 b1=0.031 r=1 c=0.45;
      bounds c > 0;
      mu = b0 + b1*days;                              /* log of E(Y) */
      mu_y = exp(mu);
      var_y = c*mu_y**r;                              /* Var(Y) = c*E(Y)**r */
      mu_ln = log(mu_y+1) - 0.5*var_y/((mu_y+1)**2);  /* approximate mean of Ln(Y+1) */
      var_ln = abs(var_y/(mu_y+1)**2);                /* approximate variance of Ln(Y+1) */
      model ln_y_1 ~ normal(mu_ln, var_ln);
    run;

The comparison results can be seen in Table (1). The nlmixed model does not work as well as the Poisson and negative binomial models, both of which work very well for Poisson data. Since the negative binomial model does very well for Poisson data, we sacrifice little in fitting this more general model. The nlmixed approximation does not do a bad job either. Perhaps the extra flexibility of the nlmixed models for more complex data will be worth the small sacrifice in efficiency when the data are Poisson. From this simulation result, it is clear that the regression model does not fit Poisson data well. We see the comparison more intuitively in Fig. 6 and Fig. 7.
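For readers without SAS, the core of this comparison can be sketched in Python. The sketch covers only Poisson regression (fitted by a hand-written Newton-Raphson step, standing in for proc genmod) and ordinary least squares on log(y+1) (standing in for proc reg). The number of replicates is reduced to 200 to keep it fast, and the relative bias and RMSE are taken to be the bias and RMSE divided by the true slope. All function names are hypothetical.

```python
import math
import random

def rpoisson(rng, lam):
    """Poisson draw by Knuth's multiplication method; adequate for small lam."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def fit_poisson(xs, ys, iters=25):
    """Newton-Raphson for log(mu) = b0 + b1*x under a Poisson likelihood."""
    b0, b1 = math.log(max(sum(ys) / len(ys), 1e-8)), 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            m = math.exp(b0 + b1 * x)
            g0 += y - m            # score for b0
            g1 += (y - m) * x      # score for b1
            h00 += m               # information matrix entries
            h01 += m * x
            h11 += m * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def ols_slope(xs, zs):
    """Least-squares slope of z on x."""
    xbar, zbar = sum(xs) / len(xs), sum(zs) / len(zs)
    sxz = sum((x - xbar) * (z - zbar) for x, z in zip(xs, zs))
    return sxz / sum((x - xbar) ** 2 for x in xs)

def rel_bias_rmse(est, true):
    """Bias and RMSE of the estimates, each divided by the true value."""
    bias = sum(est) / len(est) - true
    rmse = math.sqrt(sum((e - true) ** 2 for e in est) / len(est))
    return bias / true, rmse / true

def simulate(b0=0.0, b1=0.02, n=10, nsim=200, seed=1895):
    rng = random.Random(seed)
    xs = [d for d in range(0, 51, 5) for _ in range(n)]  # days 0..50 by 5, n reps
    pois, ols = [], []
    for _ in range(nsim):
        ys = [rpoisson(rng, math.exp(b0 + b1 * x)) for x in xs]
        pois.append(fit_poisson(xs, ys)[1])
        ols.append(ols_slope(xs, [math.log(y + 1) for y in ys]))
    return rel_bias_rmse(pois, b1), rel_bias_rmse(ols, b1)

print(simulate())  # ((rel_bias, rel_rmse) Poisson, (rel_bias, rel_rmse) log(y+1))
```

The negative binomial and nlmixed fits are left out because they need a general optimizer; the two fits shown are enough to reproduce the contrast between a likelihood matched to the data and a regression on the transformed counts.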

Table (1) Relative bias and relative RMSE by method. Columns: Obs, method, MEAN, rel_bias, rel_rmse. Methods: Poisson, Reg, Negbin, nlmixed.

Fig. 6 The estimated relative bias for different models

Fig. 7 The estimated relative RMSE for different models

6.3.2 Regression with log-normal data

As discussed earlier, in many cases Ln(Y+1) is fairly normally distributed. Potentially, when data are of this sort, generalized linear models such as Poisson regression or negative binomial regression might not be efficient compared to methods assuming normal errors. In this section, I simulate hawk data with Y following a discrete version of a log-normal distribution, where Ln(Y) is normal. Because the discrete version will have zero counts, the analysis will be performed with Ln(Y+1). I will take the variance of Y to be proportional to a power of the expected value of Y. Since $Y$ has a log-normal distribution, $X = \mathrm{Ln}(Y) \sim N(\mu, \sigma^2)$, where $Y = e^{X}$. The following equations show the way to generate the random variables. The mean and variance of a log-normal variable are as follows:

$$E(Y) = \exp\left(\mu + \tfrac{1}{2}\sigma^2\right) \qquad (6.2.7)$$
$$Var(Y) = \left(e^{\sigma^2} - 1\right) \exp\left(2\mu + \sigma^2\right) \qquad (6.2.8)$$

For my simulations

$$E(Y) = \exp\left(\mu + \tfrac{1}{2}\sigma^2\right) = \mu_Y \qquad (6.2.9)$$
$$Var(Y) = \left(e^{\sigma^2} - 1\right) \mu_Y^2 = c\,\mu_Y^{\,r} \qquad (6.2.10)$$

where $c$ and $r$ are both constants. Solving for $\sigma^2$ we find

$$\sigma^2 = \log\left(1 + c\,\mu_Y^{\,r-2}\right) \qquad (6.2.11)$$

Then, knowing $\sigma^2$ from (6.2.11), it is easy to get $\mu$ from (6.2.9):

$$\mu = \log \mu_Y - \tfrac{1}{2}\sigma^2 \qquad (6.2.12)$$

For a Poisson distribution Var(Y) = E(Y), which corresponds to r=1 and c=1. Meanwhile, Y has constant variance in the Ln scale when r=2. Based on the above, the SAS code for generating the data is as follows:

    %let b0=0;
    %let b1=0.02;
    %let nsim=1000;
    %let n=10;
    %let c=1;

    data one;
      title 'Run Simulation';
      call streaminit( );                        /* seed value missing in the source */
      do isim=1 to &nsim;
        do days=0 to 50 by 5;
          do r= 1 to 3.0 by 0.2;
            do rep=1 to &n;
              ln_mu_y=&b0+(&b1)*days;            /* log of E(Y) */
              mu_y=exp(ln_mu_y);
              sig_2=log(1+(&c)*exp((r-2)*ln_mu_y));   /* (6.2.11) */
              std=sqrt(sig_2);
              mu=ln_mu_y-0.5*sig_2;              /* (6.2.12) */
              x=rand('normal',mu,std);
              y1=exp(x);                         /* continuous log-normal value */
              rem = y1 - floor(y1);
              y = floor(y1) + 1*(rand('uniform') < rem);  /* randomized rounding */
              ln_y_1 = log(y+1);
              var_y = (exp(sig_2) - 1)*mu_y**2;
              mu_y_r = mu_y**r;
              output;
            end;
          end;
        end;
      end;
    run;
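The moment matching and the randomized rounding in this data step can be checked with a short Python sketch. The function names are hypothetical, and the check assumes the variance model Var(Y) = c E(Y)^r used in the data step.

```python
import math
import random

def lognormal_params(mean_y, c, r):
    """(mu, sigma2) of Ln(Y) giving E(Y) = mean_y and Var(Y) = c * mean_y**r."""
    sigma2 = math.log(1.0 + c * mean_y ** (r - 2.0))
    mu = math.log(mean_y) - 0.5 * sigma2
    return mu, sigma2

def lognormal_moments(mu, sigma2):
    """Mean and variance of Y = exp(X) with X ~ N(mu, sigma2)."""
    mean = math.exp(mu + 0.5 * sigma2)
    var = (math.exp(sigma2) - 1.0) * mean ** 2
    return mean, var

def stochastic_round(rng, value):
    """Round down or up at random so that the expected result equals value."""
    base = math.floor(value)
    return base + (1 if rng.random() < value - base else 0)

# Moment check for an illustrative case: E(Y) = 2, c = 1, r = 1 (Poisson-like variance).
mu, sigma2 = lognormal_params(2.0, c=1.0, r=1.0)
print(lognormal_moments(mu, sigma2))  # mean and variance both close to 2

# Rounding check: the average of many rounded copies of 1.75 stays near 1.75.
rng = random.Random(1)
draws = [stochastic_round(rng, 1.75) for _ in range(20000)]
print(sum(draws) / len(draws))
```

Because the rounding goes up with probability equal to the fractional part, the discretized counts keep the mean of the continuous log-normal values, although their variance is slightly larger.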

The code y = floor(y1) + 1*(rand('uniform') < rem); keeps the expected value of the rounded version, Y, the same as the expected value of Y1. For example, if Y1 = 1.75, then floor(Y1) = 1, the largest integer less than or equal to Y1, and Y=1 with probability 0.25 and Y=2 with probability 0.75. For these simulated values the mean of Y increases from 1.0 at days=0 to 2.7 at days=50. These values were chosen to represent fairly small counts corresponding to the smaller hourly recordings in the data of Seeland (2010).

Hawk data are count data, which possibly include many zeros, so it is more meaningful to compare models with the dependent variable Ln(Y+1), where Y represents buteo counts. In our simulation, it is easy to build a relationship between Ln(Y+1) and the independent variable days with regression, Poisson and negative binomial models. Here I mainly introduce the nlmixed model. Before finalizing the nlmixed model, the first key step is to find good approximations to the mean and variance of Ln(Y+1). Three approximation methods were compared, and the one whose approximate mean and variance of Ln(Y+1) were closest to the real mean and variance was chosen.

To do a meaningful simulation, simulating the data and choosing an appropriate approximation method in the nlmixed model are the two key steps. The main step in our simulation is selecting and checking the approximation method. Different expansions of Ln(Y+1) generate different approximations of its variance and mean. For example, we can apply a Taylor series expansion of Ln(Y+1) about the mean of Y. In our simulation, we used a different approach, constructing a log likelihood function, which can be seen in the following nlmixed code:

proc nlmixed data=one;
  ods output ParameterEstimates=parm_nlmix2;
  by isim r;
  title 'nlmixed normal intervals';
  parms b0=0 b1=0.013 r=3 c=1;

  bounds c > 0;
  ln_mu_y = b0 + b1*days;
  sig_2 = log(c*exp((r-2)*ln_mu_y)+1);
  mu = ln_mu_y - 0.5*sig_2;
  /* probability that the underlying lognormal value rounds to the observed count */
  if y = 0 then LogLike = log(probnorm((log(0.5)-mu)/sqrt(sig_2)));
  else LogLike = log(probnorm((log(y+0.5)-mu)/sqrt(sig_2))
                   - probnorm((log(y-0.5)-mu)/sqrt(sig_2)));
  model y ~ general(loglike);
run;

This likelihood treats the observed counts as rounded lognormal data. Since this was the way the simulation data were generated, this maximum likelihood method should be optimal, at least asymptotically for large sample sizes. The results of comparing this nlmixed model with the regression, Poisson and negative binomial models are shown in Fig. 8 and Fig. 9. The regression model clearly performs poorly, with large relative biases and RMSEs. The Poisson and negative binomial models perform well, with good estimates. The nlmixed model works very well, especially when r is greater than 2.

Fig. 8 Relative bias against different r
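The log likelihood used in the nlmixed code above treats each count as a lognormal value rounded to the nearest integer. A minimal Python sketch of the same contribution for one observation follows; it is an illustration under the same parametrization as the SAS code, with hypothetical function names, and it uses the error function for the normal CDF in place of SAS's probnorm.

```python
import math

def norm_cdf(z):
    """Standard normal CDF, playing the role of probnorm in the SAS code."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rounded_lognormal_loglik(y, ln_mu_y, r, c):
    """Log probability that a lognormal value with mean exp(ln_mu_y) rounds to the count y."""
    sig_2 = math.log(c * math.exp((r - 2.0) * ln_mu_y) + 1.0)
    sd = math.sqrt(sig_2)
    mu = ln_mu_y - 0.5 * sig_2
    upper = norm_cdf((math.log(y + 0.5) - mu) / sd)
    # For y = 0 the interval is (0, 0.5), so the lower CDF value is 0.
    lower = 0.0 if y == 0 else norm_cdf((math.log(y - 0.5) - mu) / sd)
    return math.log(upper - lower)

# The interval probabilities over y = 0, 1, 2, ... should add up to (almost) 1.
probs = [math.exp(rounded_lognormal_loglik(y, 0.5, 2.0, 1.0)) for y in range(100)]
print(sum(probs))
```

For very large counts the two CDF values can both round to 1 in floating point, so a production version would need a more careful tail computation; the sketch ignores this.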

Fig. 9 Relative RMSE against different r

Summaries for Model Comparison

Comparing relative bias and RMSE across models, the nlmixed model generally does a good job whether the data follow a Poisson distribution or a log normal distribution. Surprisingly, the regression model doesn't work well for log normal data. Meanwhile, the Poisson and negative binomial models still perform well. When the variance is proportional to a large power of the mean, say 3 or more, the nlmixed nonlinear approximation works better, but for data between Poisson, r = 1, and lognormal with constant variance, r = 2, the generalized linear models, particularly the negative binomial model, work well even for lognormal data. The negative binomial variance allows both Poisson variance, with θ large, and variance proportional to the mean squared, with θ small. In the next section in the analysis of the hawk

data, we will note that the variance is estimated to be proportional to $\mu^{1.8}$, which is within the range where either negative binomial or nonlinear approximations work well.

7 FITTING MODELS TO HAWK DATA

In this section, we look further at the hawk example introduced in section 2. From Fig. 3 in section 2, which plots the logarithm of average buteo counts each day against date, there appears to be a linearly increasing trend in date. Does the wind during the observation hour affect buteo migration? Could the distance to dry land be a factor in buteo counts? To forecast buteo counts validly, model selection is important. Buteo counts are discrete variables and might include zeros. Models for such data include the Poisson and negative binomial distributions, but it's possible that there are too many zeros for Poisson or negative binomial distributions. Another option is to use Ln(Y+1) and apply methods for normally distributed data. The Central Limit Theorem (CLT) helps make models that assume normality of the data work. This led us to do the simulations in section 6 and try to find an appropriate model for hawk data.

7.1 Simple introduction to some potential variables

Hawk counts were recorded under certain weather, geographic and geological conditions. Here is how we use these conditions.

1. Wind is considered as one possible factor. The best wind direction is nearly zero degrees (north), so north is chosen as the reference wind direction. Wind was recorded as degrees clockwise from north. The variable Wind_north_sp is wind speed times the cosine of the wind angle relative to north and can be understood as the strength of the northerly wind vector.

The variable Wind_Prev is used to record the number of days that winds did not have a westerly component before the observation day.

2. We wonder if the time of day, that is, the specific hour when observations began, could be a factor in counting migrating buteos, so the variable Time represents the starting time of observations on a day.

3. We have noticed that buteo counts increase slightly with date, so we use the variable day to represent the number of days since Sept. 1.

4. Precipitation is also considered as a potential predictor. The variable Precip_Pre records the number of days with 50% or more hours of precipitation prior to the observation day.

5. Likewise, we wonder if the distance to water affects buteo counts. Distance to the shore of Lake Superior is used to see if buteo migration is related to this geographical location.

7.2 Fitting Models

From section 6.2.3, it seems that the negative binomial model would be a reasonable choice given that NLMIXED cannot handle the random effects that we need in the model. These mixed models with negative binomial data can be fit with SAS procedure NLMIXED. However, fitting these types of models with these random effects turns out to be tricky. Nonlinear optimization and numerical integration are needed, and for all models we fit, the resulting gradient vector at the "solution" was not close to zero, which is what we want if we are at a local maximum of the log-likelihood function. So we are back to using Ln(Y+1) as an initial analysis of these data. From the simulations, using Ln(Y+1) should be less efficient than using the better methods, so at least if an effect comes out significant using Ln(Y+1), this would likely be the case fitting more

efficient models. Further work will need to be done beyond this project to figure out how to fit more complicated models.

As a first step, we try many independent variables, e.g. 19, and use a selection criterion to obtain potential variables. Generally, several models might be highly similar in the quality of fit under this criterion. Based on its values, we can only choose a shorter list of independent variables to start studying. The runs were done without random effects in the models, since software is readily available to do this. The p-values will not be correct, but the relative importance of the potential independent variables should be fine. We then fit a model including the variables that appear in the top models under the criterion. The p-value is then used as one of the criteria to cut down variables based on former runs. Here is an example of how we drop variables. By running a regression model including the independent variable temp_chg, we found that the p-value of temp_chg is very large, indicating that temp_chg is not needed if the other variables are included in the model. We can say that temperature change is not an important predictor of buteo counts, so the variable temp_chg need not be considered in the model. Finally, only 14 independent variables are left after using p-values in this way to reduce the variables.

7.2.1 Fit Mixed Model to Data

Mixed models are widely used to model a linear relationship when the dependent data have known structure. The most commonly used mixed model involves repeated measurement. Repeated measures are encountered in hawk data, so a mixed model is applied in analyzing the relationship between the logarithm of the average buteo counts and some possible predictors.

Transects are numbered by ordering the distances from Duluth up the North Shore of Lake Superior. Drawing conclusions about places in general is more meaningful than finding out the effect of these specific transects. Thus, transect is considered as a random effect here. Date is the day of observing hawk migration, which is treated as a random effect too. The sites on a given transect were at different distances from shore, recorded as the variable shore (a, b, or c), where shore = a is closest to Lake Superior and shore = c is farthest from Lake Superior. To account for dependence of hourly measurements at the same site, a shore*date random effect is also included. Even though nlmixed might fit well for hawk data, nlmixed cannot handle both date and shore*date random effects. This is the reason for using mixed rather than nlmixed to include those random effects in the model.

Proc Mixed in the SAS system provides a very flexible platform for dealing with repeated measures problems. The mixed model can provide a better p-value than the regression model for cutting down variables. One of the mixed model programs is as follows:

proc mixed data=fengying.buteos_before_nov plots=residualpanel(unpack);
  class transect date shore;
  model Ln_buteos_plus_1 = day shore Wind_Prev Precip_Pre wind_east wind_north_sp time time*time
        / residual outpm=outpm solution;
  random date shore*date;
  ods select solutionf covparms tests3 ResidualQQplot;
run;

The estimated transect variance comes out as 0. To make the estimation converge more simply and be more likely to find the right MLE, transect is taken out of the random effects for the mixed models in our study.

7.2.2 Fit Nlmixed Model to Data

The nlmixed procedure cannot handle the model with both date and shore*date random effects, but the nlmixed model can be used to check the relationship between variances and mean

and also to check for the best wind direction. This model is fixed to include all the variables retained from previous runs. No random effects were used in our nlmixed model, to make the estimation easier. Applying the approximation method we finalized in section 6 to the hawk data, we came up with the following nlmixed code:

proc nlmixed data=fengying.buteos_before_nov;
  parms b0=-8.8 b_day=0.02 b_shore_a=0.4 b_shore_b=0.3 b_wind_prev=0.2 b_time=1.5 b_time_2 = b_precip_pre=-0.75 k=0.2 r=2 c=0.5 theta=0;
  bounds c > 0;
  /* wind component along the direction theta (degrees clockwise from north) */
  wind = k*wind_sp*cos( (wind_dir-theta)*3.14159265/180 );
  ln_mu_y = b0 + b_day*day + b_shore_a*shore_a + b_shore_b*shore_b + wind
          + b_wind_prev*wind_prev + b_time*time + b_time_2*time*time
          + b_precip_pre*precip_pre;
  sig_2 = log(c*exp((r-2)*ln_mu_y)+1);
  mu = ln_mu_y - 0.5*sig_2;
  y = buteos;
  if y = 0 then ll = log(probnorm((log(0.5)-mu)/sqrt(sig_2)));
  else ll = log(probnorm((log(y+0.5)-mu)/sqrt(sig_2))
              - probnorm((log(y-0.5)-mu)/sqrt(sig_2)));
  model y ~ general(ll);
run;

Using this code, the maximum likelihood estimate of the clockwise angle relative to north has a standard error of 10° and is very nearly true north. The estimate of r is 1.8 with a standard error of 0.19, corresponding to a variance roughly proportional to $\mu^{1.8}$, indicating Y is approximately log normal. We can say that the data would not be modeled well as Poisson data.

7.3 Summary of Findings

One of the mixed models was introduced in section 7.2.1. First, a newly built model prompts us to look at how the errors of the model are distributed. In Fig. 10, we can see that the residuals lie close to the straight line except for the last two points, which indicates the data are fairly well approximated by a normal distribution.

Fig. 10 QQ-plot for residuals from a mixed model

Intuitively, a good model should have predicted values as close to the true values as possible. The $R^2$ value of a model is the square of the correlation between the fitted and observed values. It is interesting to see how the predicted values from the mixed model compare with the real values of Ln(buteos+1). In Fig. 11 the basic trend can be described by the equation y=x, except for two outliers. The $R^2$ value for this model is about 0.4. Based on the simulations, better models could potentially be fit, but generally speaking, this model works fairly well for these buteo data.
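The definition of $R^2$ used above, the squared correlation between fitted and observed values, can be written as a short Python sketch. The function name and the example numbers are hypothetical and are not taken from the buteo data.

```python
def r_squared(observed, fitted):
    """Square of the Pearson correlation between observed and fitted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_f = sum(fitted) / n
    cov = sum((o - mean_o) * (f - mean_f) for o, f in zip(observed, fitted))
    ss_o = sum((o - mean_o) ** 2 for o in observed)
    ss_f = sum((f - mean_f) ** 2 for f in fitted)
    return cov * cov / (ss_o * ss_f)

# Made-up values of Ln(buteos+1) and predicted means, for illustration only.
obs = [0.0, 0.7, 1.1, 1.6, 2.3]
fit = [0.4, 0.6, 1.2, 1.3, 1.9]
print(r_squared(obs, fit))
```

In the SAS run, the fitted values would come from the outpm= data set requested in the model statement of proc mixed.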

Fig. 11 Ln(buteos+1) VS. Predicted Mean of Ln(buteos+1)

The following Table 7.3(1) of p-values shows that the effects day, wind_north_sp and time are significant in predicting buteo counts. Again, better models could potentially be fit, but the significant p-values from this model are reliable. This is like using non-parametric methods when data are normal or some other distribution. The statistics are not as efficient as they could be, but significant effects can still be considered significant.

Type 3 Tests of Fixed Effects

Effect           Num DF   Den DF   F Value   Pr > F
day                                          <.0001
shore
Wind_Prev
Precip_Pre
wind_east
wind_north_sp                                <.0001
time                                         <.0001
time*time                                    <.0001

Table 7.3(1) Test Results of Fixed Effects

The day effect has been shown in Fig. 3. More and more buteos migrate as the date gets closer to winter. In this part, we are more concerned with the wind_north_sp and time effects. To check their effects, LSMEANS statements were added to the previous mixed model, one at a time. For example, lsmeans wind_north_sp /obsmargins;. For the mixed model with this LSMEANS statement, the variable wind_east has a large p-value and is not useful to the model. Fig. 12 plots the estimates against the least squares means of wind_north_sp, using each unique value of wind_north_sp as its own effect, that is, using wind_north_sp as a "class" variable rather than a linear effect. There is a clear increasing trend in wind_north_sp, which further illustrates that north is the best wind direction for buteo migration.

Fig. 12 Estimates VS. LS-means of wind_north_sp

For another mixed model with LS-means for the variable time, wind_east also turned out to be an unimportant predictor. The relationship between estimate and time seems to be a parabola, which

is shown in Fig. 13 using each hour of the day as its own effect, that is, time as a "class" variable rather than a quadratic effect. Basically, the buteo migration peak in a day is in the early afternoon.

Fig. 13 Estimates VS. LS-means of time

In the nlmixed model, we want to see whether there is a relationship between the variance of Y and the mean of Y, where Y is the buteo count. We used the idea that $\mathrm{Var}(Y) = c\,\mu_Y^{r}$ in the nlmixed model. After running the nlmixed code in section 7.2.2, the estimate of r is 1.8 with a standard error of 0.19. In other words, the variance of the buteo count is approximately proportional to the square of the mean buteo count, so the mixed model, which assumes equal variances on the log scale, is reasonable.

8 CONCLUSION

Count data are commonly studied nowadays. The LM, GLM, MIXED and NLMIXED models can be considered for modeling count data. The outcome from our simulations comes in line with

some of the results from O'Hara and Kotze (2010) that log-transformation of count data performs poorly while negative binomial and Poisson models work well, so we do not recommend log-transforming count data with many zero observations. The negative binomial model might perform better than the Poisson model for these kinds of data. The mixed model provides an effective way to analyze count data with complicated random effects, which the NLMIXED model cannot handle. When applying the NLMIXED model, the main focus should be on choosing a good approximation method. SAS is a good statistical software package for fitting these models.

In our simulations the negative binomial model did well even for lognormal data. However, in 2008 the large bursts of buteos during the migration were not as pronounced. With more very large count days, the variance may be a higher power of the mean, where the nonlinear models would have an advantage over the negative binomial models.

To answer the research questions for the hawk data set in this report, the mixed model is used for the analysis. The effects of day, wind_north_sp and time play central roles in estimating buteo counts during a certain period of time. It is understandable that an increasing number of buteos migrate as time gets closer to winter. Buteos fly from north to south with the benefit of a north wind when winter is coming, so it makes sense that wind_north_sp is a significant factor and that there is an increasing trend in wind_north_sp. It is possible that buteos prefer to migrate during a slightly warmer time, in the early afternoon, which can also be seen from Fig. 13 with its downward parabola.

For future studies, we can apply the mixed model to other bird data, such as accipiters, which fly closer to the ground. By analyzing other bird data, we can further see if day, wind_north_sp and time are still significant in predicting other birds' counts.
Although the transformation of hawk data, to some extent, supports the mixed model, the generalized mixed