Modeling Count Data from Hawk Migrations


M.S. Plan B Project Report
January 12, 2011

Fengying Miao
M.S. Applied and Computational Mathematics Candidate

Dr. Ronald Regal, Advisor

University of Minnesota Duluth
Department of Mathematics and Statistics

Table of Contents

i. ACKNOWLEDGEMENTS
ii. ABSTRACT
1 INTRODUCTION
2 HAWK EXAMPLE
3 THE GENERAL AND GENERALIZED LINEAR MODELS
    3.1 General Linear Model
    3.2 Exponential Family of Distributions
    3.3 Generalized Linear Models
4 DO NOT LOG-TRANSFORM COUNT DATA
5 COMPARISON OF ESTIMATION METHODS USING THE DELTA METHOD
    5.1 Expected Values and Variances of Nonlinear Functions of Random Variables
    5.2 Single Mean
        5.2.1 Poisson for single mean
        5.2.2 Log-normal for single mean
    5.3 Two Means
        5.3.1 Poisson for two means
        5.3.2 Log-normal for two means
    5.4 Alternative Nonlinear Models
6 FURTHER COMPARISONS OF MODELS
    6.1 Single Mean
        6.1.1 Exact calculation for a single mean
        6.1.2 Specific example of comparing exact calculation and
    6.2 Two Means
        6.2.1 Exact calculation for difference between means
        6.2.2 Comparisons of Models for Two Means by Doing Simulation
        6.2.3 Regression
7 FITTING MODELS TO HAWK DATA
    7.1 Simple introduction to some potential variables
    7.2 Fitting Models
        7.2.1 Fit Mixed Model to Data
        7.2.2 Fit Nlmixed Model to Data
    7.3 Summary of Findings
CONCLUSION
REFERENCES
APPENDICES
    SAS Code
    R code

i. ACKNOWLEDGEMENTS

I would like to take this opportunity to give my sincere thanks to my advisor, Dr. Ronald Regal, for his great support and guidance, which made it possible for me to finish the project. Dr. Ronald Regal is the best advisor I have ever had. I will never forget his great help in my study and life, or what he told me: that finding the limits of our knowledge and understanding is always important. I also want to give my thanks to Dr. Richard Green and Dr. Gerald Niemi for being on my degree committee, reviewing my report and providing useful suggestions. I also thank Heidi Seeland for providing the datasets in this project. Thanks to Dr. Zhuangyi Liu for accepting me into this good program and giving me the chance to learn from Dr. Regal and other great people.

ii. ABSTRACT

The General Linear Model (LM), with its assumptions of independence, linearity and equal variance, underlies most statistical analyses. Because of its generality, many kinds of data are transformed to satisfy its assumptions. Count data are often log-transformed using Ln(Y+1) to more nearly match the assumptions. However, adding a value of one to counts might generate biases, so we need to choose a proper model for count data. In addition, E(Ln(X)) is not the same as Ln(E(X)), so even if the relationship is linear for Ln(E(X)), the same will not be perfectly true for E(Ln(X)). To avoid or reduce the bias from transforming data, the Generalized Linear Model (GLM) and the nonlinear mixed (NLMIXED) model could be considered instead. This report investigates how LM regression models, Generalized Linear Models based on Poisson and negative binomial distributions, and approximate nonlinear models fit with NLMIXED compare when estimating the slope of a linear trend in count data. The model comparisons are implemented in the statistical software SAS using the procedures PROC REG, PROC GENMOD, PROC MIXED and PROC NLMIXED. A real data set from a hawk migration is analyzed and fitted with the mixed model. The NLMIXED model is used to analyze the relationship between the variances and means of the data.

1 INTRODUCTION

A statistical model is used to predict the probabilistic future behavior of a system from data. The main purpose of model building is to obtain proper estimates with small bias and little variability. The traditional model (LM) has been widely used, since many data can be modeled this way and there is a large body of theory that can be applied. Different methods, such as the square-root and log transformations, are often used to transform data, usually the response variable, to meet the assumptions of LM. These methods might work well for continuous response variables and for certain discrete variables, such as count data with few zero observations; zero observations rule out a direct log transformation. For example, in a study where migrating hawks are counted hourly, the numbers counted are often zero.

More and more methods and models have been explored to break the limits of these assumptions. The Generalized Linear Model (GLM), an extension of LM, allows the analyst to specify the distribution of the data, which addresses the problem of transforming data to be normally distributed. The NLMIXED model, in which both fixed and random effects are allowed to have nonlinear relationships with the response variable, has become increasingly popular and allows flexible nonlinear functions as well as user-specified likelihood functions. These newer models can be applied to a wider range of real problems. Statistical software has kept pace with the numerical methods and has made them more applicable.

To get the best estimates of the response variable for a particular system, it is important to fit a model that describes the data well. In this report I describe my investigations into finding appropriate models to analyze count data such as hawk migration counts. Model selection among LM regression, GLMs and NLMIXED is based on simulations that are done separately for Poisson data and log-normal data. The GLMs are used by specifying Poisson and negative binomial distributions. Relative bias and relative RMSE are used to evaluate how well the models work. The relationship between the variance and mean of the real data is modeled using the NLMIXED model, which cannot deal with the complicated random effects in the real data we are going to study. A mixed model fit with SAS PROC MIXED is used for the real hawk data to account for dependence of observations on the same day and from the same site.

2 HAWK EXAMPLE

A data set from monitoring of migrating hawks is used to illustrate the issues and conclusions in this report. The data were collected in the fall of 2008 by Heidi Seeland and Anna Peterson, graduate students at UMD with Dr. Gerald Niemi as their advisor. In this section, the structure of the hawk data is described. Further details on fitting models to the data are discussed in later chapters.

The data set contains counts of hawks and eagles at three distances from the shore of Lake Superior, over seven hours on certain days between August 29, 2008, and November 11, 2008. The sampling plan had eight sets of three sampling locations spread out along the north shore of Lake Superior. The eight sets of three sampling sites were called transects and numbered from 1 to 8 up the shore starting from Duluth.

Fig. 1 Locations of Hawk Counts (Seeland 2010)

One general category of hawks is accipiters, which fly lower and closer to tree cover. To make the data set more understandable, let's first introduce buteos, larger hawks with broad wings that soar higher on wind currents. Figure 2 shows a plot of the average number of buteos per 7-hour day plotted against date.

Fig. 2 Average Buteos Counts per day VS Dates in Original Scale

In this original scale, the daily buteo counts are widely dispersed over time. The huge variation makes the form of the time trend unclear. As shown in Figure 3, a log transformation of the buteo counts gives a more clearly increasing trend across time.

Fig. 3 Average Buteos Counts per day VS Dates in Log-Scale

The relationship between buteo counts and dates follows a general linear trend, except for the last two points. For simplicity in applying the models in this project, which focus on estimating the slope of a linear trend, we will leave out points after November 1 when demonstrating the fitting of some models to these data.

3 THE GENERAL AND GENERALIZED LINEAR MODELS

Recent papers including O'Hara and Kotze (2010) have advocated generalized linear models with Poisson or negative binomial distributions rather than normal linear models in the log scale. Before comparing these models in a wider range of situations than considered by O'Hara and Kotze, I will briefly describe general and generalized linear models.

3.1 General Linear Model

Consider a situation where we are interested, for example, in describing the number of violation tickets people get annually for violating traffic regulations as a function of their age. The average number of violation tickets is predicted by the following equation

$$y = \beta_0 + \beta_1 x + \varepsilon \tag{3.1}$$

where $y$ is the response variable, Violation Tickets, $x$ is the explanatory variable, Age, and $\varepsilon$ measures the deviation of the measured $y$ from its expected value. It may now be asked whether, after allowing for the effect of age, a person's sex has any influence on the frequency of violation tickets people get. Under this assumption, the appropriate model might be described as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \tag{3.2}$$

where $x_1$ and $x_2$ represent Age and Sex, respectively. Each time a new variable is introduced into the model, an additional parameter is added. This process is an approach by which we find a mathematical description of the structure in the values of the response variable. The two models discussed above involve a linear combination of the parameters $\beta_0$, $\beta_1$, $\beta_2$, and are consequently known as linear models. For example, a polynomial regression model belongs to this category despite the fact that $y$ is a non-linear function of the explanatory variable. The general form of linear models is described as

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i \tag{3.3}$$

where $i = 1, \dots, n$ and $\varepsilon_i$ represents the error that the explanatory variables cannot explain. By introducing vectors and matrices, (3.3) can be rewritten as

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \tag{3.4}$$

or in the following compact form

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \tag{3.5}$$

Besides linearity, the usual general linear model also assumes normality, independence and equal variance of observations, which can be written as

$$y_i = \mathbf{x}_i^{T}\boldsymbol{\beta} + \varepsilon_i \quad \text{where } \varepsilon_i \sim N(0, \sigma^2) \text{ independently},\; i = 1, \dots, n. \tag{3.6}$$

3.2 Exponential Family of Distributions

Linear models are postulated more often than non-linear ones because they are mathematically easier to manipulate and usually easier to interpret. They appear to provide an adequate description of many data sets. A wider class that includes the normal distribution is called the exponential family of distributions. Consider a single random variable $Y$ whose probability distribution depends on a single parameter θ. The distribution belongs to the exponential family if it can be written in the form

$$f(y; \theta) = \exp\left[a(y)\,b(\theta) + c(\theta) + d(y)\right]. \tag{3.7}$$

If $a(y) = y$, the distribution is said to be in canonical form, and $b(\theta)$ is sometimes called the natural parameter of the distribution. Other parameters are regarded as nuisance parameters. The exponential family includes such useful distributions as the binomial, Poisson, negative binomial, and gamma distributions, in addition to the normal distribution.

3.3 Generalized Linear Models

Many types of data are not normally distributed in the original scale. To address this problem, a transformation may be used to normalize the data. Often, people try the log transformation first, before evaluating other transformation techniques. But discrete response variables, such as bird count data, often contain many zero observations and are unlikely to have a normally distributed error structure. Maindonald & Braun (2007) argued that generalized linear models (GLMs) have largely removed the need for transforming count data. More recently, GLMs have been further developed and are commonly used. A GLM is an extension of the well-known linear models to include response variables that follow any probability distribution in the exponential family of distributions. The key idea is that the relationship between the mean $\mu_i$ and a linear predictor $\mathbf{x}_i^{T}\boldsymbol{\beta}$ is specified by a link function:

$$g(\mu_i) = \mathbf{x}_i^{T}\boldsymbol{\beta} \tag{3.8}$$

where $\mu_i = E(Y_i)$ and $g$ is a link function that links the random component, $E(Y_i)$, to the systematic component $\mathbf{x}_i^{T}\boldsymbol{\beta}$. Equation (3.8) can be written as

$$\mu_i = g^{-1}\left(\mathbf{x}_i^{T}\boldsymbol{\beta}\right). \tag{3.9}$$

For example, count data could be appropriately analyzed as a Poisson random variable within the context of the Generalized Linear Model. So, for the observed bird count $Y_i$, we have $Y_i \sim \text{Poisson}(\mu_i)$. The probability function for $Y_i$ is

$$P(Y_i = y_i) = \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}, \quad y_i = 0, 1, 2, \dots \tag{3.10}$$

If we had a covariate $x$ for the predictor days, then

$$\log(\mu_i) = \beta_0 + \beta_1 x_i. \tag{3.11}$$

For the Poisson distribution, the mean and variance are equal. Real data do not always follow this, and the variance ($\sigma^2$) is often much larger than the mean µ. This so-called overdispersion can be incorporated into a model in several ways. These all estimate the amount of extra variation but make different assumptions about how this extra variation scales with the mean. The negative binomial distribution, for example, assumes $\sigma^2 = \mu + \mu^2/k$ with an overdispersion parameter $k$ and the mean $\mu$. The negative binomial distribution approaches the Poisson distribution when $k$ is much bigger than $\mu$, i.e. as $k$ approaches infinity. To introduce the negative binomial distribution in a simple way, we only use one variable here. Suppose $Y \sim \text{NB}(\mu, k)$ where $\mu = E(Y)$. Then we can describe the probability function of the negative binomial distribution as follows:

$$P(Y = y) = \frac{\Gamma(y + k)}{\Gamma(k)\, y!} \left(\frac{k}{k + \mu}\right)^{k} \left(\frac{\mu}{k + \mu}\right)^{y}, \quad y = 0, 1, 2, \dots \tag{3.12}$$
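The mean-variance relationship of the negative binomial can be checked numerically. The following is a minimal Python sketch, not part of the SAS analysis in this report; it assumes the parameterization with mean `mu` and overdispersion parameter `k` (variance `mu + mu**2 / k`), and the values `mu = 5`, `k = 2` are hypothetical and chosen only for illustration.

```python
import math

def neg_binomial_pmf(y, mu, k):
    """P(Y = y) for a negative binomial with mean mu and variance mu + mu**2 / k."""
    log_p = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
             + k * math.log(k / (k + mu)) + y * math.log(mu / (k + mu)))
    return math.exp(log_p)

def poisson_pmf(y, mu):
    """P(Y = y) for a Poisson distribution with mean mu."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

def moments(pmf, upper=2000):
    """Mean and variance of a count distribution, summing the pmf up to `upper`."""
    mean = sum(y * pmf(y) for y in range(upper))
    var = sum((y - mean) ** 2 * pmf(y) for y in range(upper))
    return mean, var

# Hypothetical illustration values (not taken from the hawk data).
mu, k = 5.0, 2.0
nb_mean, nb_var = moments(lambda y: neg_binomial_pmf(y, mu, k))
print(nb_mean, nb_var, mu + mu ** 2 / k)

# For very large k the negative binomial is close to the Poisson distribution.
max_gap = max(abs(neg_binomial_pmf(y, mu, 1e6) - poisson_pmf(y, mu)) for y in range(50))
print(max_gap)
```

A small `k` inflates the variance well above the mean, while a large `k` makes the two probability functions nearly identical.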

The negative binomial is also a Gamma-Poisson mixture. Suppose $Y \mid \lambda \sim \text{Poisson}(\lambda)$ and $\lambda \sim \text{Gamma}$ with shape $k$ and scale $\mu/k$. Then we have:

$$P(Y = y) = \int_0^\infty \frac{\lambda^{y} e^{-\lambda}}{y!} \cdot \frac{\lambda^{k-1} e^{-\lambda k/\mu}}{\Gamma(k)\,(\mu/k)^{k}} \, d\lambda = \frac{\Gamma(y + k)}{\Gamma(k)\, y!} \left(\frac{k}{k + \mu}\right)^{k} \left(\frac{\mu}{k + \mu}\right)^{y}. \tag{3.13}$$

4 DO NOT LOG-TRANSFORM COUNT DATA

O'Hara and Kotze (2010) provide a detailed discussion in their paper "Do Not Log-transform Count Data". In that paper, they point out that log transformation of counts has the additional quandary of how to deal with zero observations. With just one zero observation (if this observation represents a sampling unit), the whole data set is usually adjusted by adding a value (usually 1, the lowest possible nonzero count) before transformation, so they turned to GLMs to deal with count data. They simulated data sets from a negative binomial distribution with different values of $k$. A low $k$ indicates greater variance in the data. From section 3, we know that the negative binomial distribution can be viewed as a gamma mixture of Poissons. A low $k$ shrinks the graph of the Gamma probability function of $\lambda$, which pulls values to a smaller domain, thus generating more clumped data. For each simulation, n=100 data points were simulated at each of 20 mean values, µ = 1, 2,..., 20. Five hundred replicate simulations were carried out for each value of $k$. Then they compared the outcome of fitting models to data transformed in various ways (log, square root) with results from fitting overdispersed quasi-Poisson models and negative binomial models to untransformed count data. The simulations were compared by calculating the mean bias and root mean-squared error in estimating log(µ).

In their results, the quasi-Poisson and negative binomial models behave similarly, having negligible bias, whereas the models based on a normal distribution are all biased, particularly at low means and high variances. The square-root transformation has a lower bias than any of the log transformations, unless the mean is low. Thus, they recommend that count data not be transformed for use in parametric tests. For such data, GLMs and their derivatives are more appropriate.

However, their simulations were from negative binomial distributions. Poisson models with extra-Poisson variation still model the variation as proportional to the mean, whereas negative binomial models include a term in the variance proportional to the mean squared. In many data sets, Ln(Y+1) is fairly normal. For all of the discussion from here on, Log and Ln are interchangeable notations. In statistics, Log generally means Ln. For example, in SAS, Log(y) means Ln(y). Fitting a linear relationship to Ln(Y+1) of the daily counts of buteos shown in Figure 3 gives us the following normal plot of residuals.

This normal plot is reasonably straight, at least close enough for normal methods to work well. In later sections where we fit models to hourly counts, the normal plot is even straighter. The results from O'Hara and Kotze are limited to 1) negative binomial data, 2) estimating a single mean and 3) very large replication, n=100. The generalized linear models worked in their simulations, but how will they work in estimating slopes of trends if the data are normal with variances unlike those of Poisson or negative binomial data?

5 COMPARISON OF ESTIMATION METHODS USING THE DELTA METHOD

5.1 Expected Values and Variances of Nonlinear Functions of Random Variables

In the discussions below, I will use Taylor series approximations for approximating expected values and variances of nonlinear functions. First, I describe these methods, commonly known as propagation of error or the delta method.

Suppose we have a random variable $X$, and we know $E(X)$ and $\text{Var}(X)$, but we are interested in the mean and variance of $Y = g(X)$ for some function $g$. For example, we might be able to measure $X$ and determine its mean and variance, but we are really interested in $Y$, which is related to $X$ in a known way. If $g$ is linear, then this is straightforward:

$$g(X) = a + bX \tag{5.1.1}$$
$$E(Y) = a + b\,E(X) \tag{5.1.2}$$
$$\text{Var}(Y) = b^2\,\text{Var}(X) \tag{5.1.3}$$

However, in many cases $g$ is not linear. In many areas of mathematics we find approximations by linearizing a nonlinear problem we cannot solve exactly. In probability and statistics, this method is called propagation of error or the delta method. Denote by $\mu$ the mean of $X$. We use a first-order Taylor series approximation around $\mu$,

$$g(X) \approx g(\mu) + g'(\mu)(X - \mu), \tag{5.1.4}$$

and since $E(X - \mu) = 0$,

$$E[g(X)] \approx g(\mu), \tag{5.1.5}$$
$$\text{Var}[g(X)] \approx [g'(\mu)]^2\,\text{Var}(X). \tag{5.1.6}$$

We have $E[g(X)] \approx g(\mu)$, but we know that in general $E[g(X)] \neq g(E[X])$ from Jensen's Inequality. Thus, we can carry out the Taylor series expansion to the second order to get an improved approximation of $E[g(X)]$:

$$g(X) \approx g(\mu) + g'(\mu)(X - \mu) + \tfrac{1}{2} g''(\mu)(X - \mu)^2. \tag{5.1.7}$$

Taking the expectation of the right-hand side, we have

$$E[g(X)] \approx g(\mu) + g'(\mu)\,E(X - \mu) + \tfrac{1}{2} g''(\mu)\,E(X - \mu)^2 \tag{5.1.8}$$
$$= g(\mu) + \tfrac{1}{2} g''(\mu)\,\sigma^2. \tag{5.1.9}$$

How good such approximations are depends on how nonlinear $g$ is in the neighborhood of $\mu$ defined by the size of $\sigma$, where $\sigma$ is the standard deviation of $X$.

In comparing the disadvantages of using Log(Y+1), the Poisson case, with Y from a Poisson distribution, should be studied, since this is where log-normal estimation is at a great disadvantage. Using the delta method, we can start by comparing Poisson and log-normal estimation for the simple case of a single mean, the case considered by O'Hara and Kotze, and then compare two means. For the observations from the hawk counts, we have $\log(\mu_i) = \beta_0 + \beta_1 x_i$. To make the notation consistent through the discussions of one mean, two means, and regression, throughout I will use $\beta_0$ or $\beta_1$.

5.2 Single Mean

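Most of this report focuses on estimating changes or slopes across time, for example estimating how bird populations are changing across several years of monitoring. But first I will briefly discuss the case of estimating a single mean. In the one-mean case, we consider the average number of hawks at $x = 0$, where $x$ is the predictor day. Let $Y_1, \dots, Y_n$ be the $n$ replicate counts with mean $\mu$.

5.2.1 Poisson for single mean

From (3.10), we have $\mu = \exp(\beta_0)$ at $x = 0$, so for the Poisson likelihood the estimate is $\hat\beta_0 = \log(\bar Y)$. Using the delta method with $\text{Var}(\bar Y) = \mu/n$, the expectation and variance of $\hat\beta_0$ can be obtained as follows:

$$E(\hat\beta_0) \approx \log(\mu) - \frac{1}{2n\mu} \tag{5.2.1}$$
$$\text{Var}(\hat\beta_0) \approx \frac{1}{n\mu} \tag{5.2.2}$$

Note that the degree of bias depends on the number of replicates, $n$. O'Hara and Kotze used n=100, which results in little bias if Poisson data are modeled.

5.2.2 Log-normal for single mean

If we use a normal distribution as an approximation to the distribution of Ln(Y+1), then $\hat\beta_0 = \frac{1}{n}\sum_{i=1}^{n} \ln(Y_i + 1)$. For the Poisson model above we use the log of the average, whereas in the normal model we use the average of log values.

$$E(\hat\beta_0) \approx \ln(\mu + 1) - \frac{\mu}{2(\mu + 1)^2} \tag{5.2.3}$$
$$\text{Var}(\hat\beta_0) \approx \frac{\mu}{n(\mu + 1)^2} \tag{5.2.4}$$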
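To make the contrast between the log of the average and the average of the logs concrete, the delta-method approximations for the single-mean case can be evaluated numerically. This is a minimal Python sketch under stated assumptions: the counts are Poisson with mean `mu`, the formulas are the second-order approximations for log(mean of Y) and for the mean of Ln(Y+1), and `mu = 5` with `n = 10` or `n = 100` are hypothetical values used only for illustration.

```python
import math

def poisson_log_mean_delta(mu, n):
    """Delta-method mean and variance of log(Ybar) for n Poisson(mu) counts."""
    expected = math.log(mu) - 1.0 / (2.0 * n * mu)
    variance = 1.0 / (n * mu)
    return expected, variance

def mean_log_plus_one_delta(mu, n):
    """Delta-method mean and variance of the average of ln(Y + 1)."""
    expected = math.log(mu + 1.0) - mu / (2.0 * (mu + 1.0) ** 2)
    variance = mu / (n * (mu + 1.0) ** 2)
    return expected, variance

mu = 5.0  # hypothetical mean count
bias = {}
for n in (10, 100):
    pois_mean, _ = poisson_log_mean_delta(mu, n)
    norm_mean, _ = mean_log_plus_one_delta(mu, n)
    # Bias is measured against log(mu), the quantity being estimated.
    bias[n] = (pois_mean - math.log(mu), norm_mean - math.log(mu))
    print(n, bias[n])
```

Under these assumptions the approximate Poisson bias shrinks like 1/n, while the approximate bias of the average of Ln(Y+1) is the same for every n.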

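Note that the Poisson model has smaller bias, with expected value closer to log(µ). The smaller bias is more pronounced for larger n, such as the n=100 used by O'Hara and Kotze. Unlike Poisson estimation, the bias in using the mean of log values does not disappear with increasing n.

5.3 Two Means

Suppose that $\mu_1$ and $\mu_2$ correspond to the average number of hawks at $x = 0$ and $x = 1$, respectively. In this case a regression of Y on X will give a slope that is the same as the difference between the means. Considering the difference between means, I will use $\beta_0$ for $\log(\mu_1)$ and $\beta_1$ for $\log(\mu_2) - \log(\mu_1)$.

5.3.1 Poisson for two means

From (3.10), we know the true $\beta_0 = \log(\mu_1)$ and $\beta_1 = \log(\mu_2) - \log(\mu_1)$. Then we get $\hat\beta_0 = \log(\bar Y_1)$ and $\hat\beta_1 = \log(\bar Y_2) - \log(\bar Y_1)$, where $\bar Y_1 = \frac{1}{n}\sum_{i} Y_{1i}$ and $\bar Y_2 = \frac{1}{n}\sum_{i} Y_{2i}$. By applying the delta method, we have

$$E(\hat\beta_1) \approx \log(\mu_2) - \log(\mu_1) - \frac{1}{2n\mu_2} + \frac{1}{2n\mu_1} \tag{5.3.1}$$
$$\text{Var}(\hat\beta_1) \approx \frac{1}{n\mu_1} + \frac{1}{n\mu_2} \tag{5.3.2}$$

5.3.2 Log-normal for two means

For the log transformation to Ln(Y+1), $\hat\beta_0 = \frac{1}{n}\sum_i \ln(Y_{1i} + 1)$ and $\hat\beta_1 = \frac{1}{n}\sum_i \ln(Y_{2i} + 1) - \frac{1}{n}\sum_i \ln(Y_{1i} + 1)$.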

We are more concerned about the slope, so we would like to obtain the following by using the delta method:

$$E(\hat\beta_1) \approx \ln(\mu_2 + 1) - \ln(\mu_1 + 1) - \frac{\mu_2}{2(\mu_2 + 1)^2} + \frac{\mu_1}{2(\mu_1 + 1)^2} \tag{5.3.3}$$
$$\text{Var}(\hat\beta_1) \approx \frac{1}{n}\left[\frac{\mu_1}{(\mu_1 + 1)^2} + \frac{\mu_2}{(\mu_2 + 1)^2}\right] \tag{5.3.4}$$

Again, the primary disadvantage of using normal likelihood methods is the larger bias. The results given above are based on approximations, but according to the simulations and exact calculations given below, the general trends are accurate.

5.4 Alternative Nonlinear Models

In the previous sections we used the delta method to find approximations for the mean and variance of the parameter estimates, and we saw that using Ln(Y+1) results in more biased estimates. Alternatively, we could use these approximations to derive less biased estimators. Since the means are no longer linear functions of the parameters, we will need to use nonlinear models to accomplish the estimation. Nonlinear mixed models, in which fixed and random effects have nonlinear relationships to the response variable, are becoming more and more popular. For Ln(Y+1), using Taylor series expansions with $\mu = E(Y)$ and $\sigma^2 = \text{Var}(Y)$:

$$E[\ln(Y + 1)] \approx \ln(\mu + 1) - \frac{\sigma^2}{2(\mu + 1)^2} \tag{5.4.1}$$
$$\text{Var}[\ln(Y + 1)] \approx \frac{\sigma^2}{(\mu + 1)^2} \tag{5.4.2}$$

If we assume that the variance is equal to the mean, as in a Poisson distribution, then a normal approximation will use

$$\ln(Y + 1) \sim N\!\left(\ln(\mu + 1) - \frac{\mu}{2(\mu + 1)^2},\; \frac{\mu}{(\mu + 1)^2}\right). \tag{5.4.3}$$

More generally, we can assume an overdispersion model such as $\sigma^2 = c\mu$ or $\sigma^2 = c\mu^{r}$. Since both the mean and variance are nonlinear functions of the parameters, procedures such as SAS NLMIXED are used to fit these nonlinear models, as I discuss further below.

6 FURTHER COMPARISONS OF MODELS

The final purpose is to fit a good model for data on hawk migration by modeling effects such as date, time of day, weather and distance from shore. To compare alternative models, I did simulations and exact calculations to investigate how Poisson, negative binomial and log-normal models compare when the data are Poisson, negative binomial and log-normal for log(y+1). I also investigated methods for bias correction using approximate propagation-of-error methods for the log-normal model for log(y+1). A simple way to check different models is to look only at how hawk counts are distributed as a function of time.

6.1 Single Mean

6.1.1 Exact calculation for a single mean

In section 5.2, we discussed the application of the delta method for a single mean, followed by the case of two means. For the single-mean case, we only compare the exact calculation with the delta method. Let $S = \sum_{i=1}^{n} Y_i$, so that $S \sim \text{Poisson}(n\mu)$, and $\hat\beta_0 = \log(S/n)$. The exact calculation with the estimator undefined for S=0 is

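$$E(\hat\beta_0) = \sum_{s=1}^{\infty} \log\!\left(\frac{s}{n}\right) \frac{P(S = s)}{1 - P(S = 0)} \tag{6.1.1}$$
$$\text{Var}(\hat\beta_0) = \sum_{s=1}^{\infty} \left[\log\!\left(\frac{s}{n}\right)\right]^2 \frac{P(S = s)}{1 - P(S = 0)} - \left[E(\hat\beta_0)\right]^2 \tag{6.1.2}$$

where $S = \sum_{i=1}^{n} Y_i \sim \text{Poisson}(n\mu)$. Using Ln(Y+1), with $\hat\beta_0$ the average of the $\ln(Y_i + 1)$,

$$E(\hat\beta_0) = \sum_{y=0}^{\infty} \ln(y + 1)\, P(Y = y) \tag{6.1.3}$$
$$\text{Var}(\hat\beta_0) = \frac{1}{n}\left\{\sum_{y=0}^{\infty} [\ln(y + 1)]^2\, P(Y = y) - \left[\sum_{y=0}^{\infty} \ln(y + 1)\, P(Y = y)\right]^2\right\} \tag{6.1.4}$$

The two methods aren't on equal footing above, since the Poisson calculations don't use the S=0 cases, but these are not common in the models considered, and comparisons of exact and delta-method results, for the same model, are completely comparable.

6.1.2 Specific example of comparing exact calculation and the delta method

I use one simple case to illustrate the differences between the exact calculation and the delta method approximation. Assume that we have $n$ observations and that the observed hawk counts have mean $\mu$. Then the true value of $\beta_0$ is $\log(\mu)$. Applying the equations in sections 5.2 and 6.1, results for the bias and root mean squared error (RMSE) of the parameter estimate of $\beta_0$ are shown below.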
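The comparison of exact and delta-method results for a single mean can be sketched numerically. The Python code below is an illustration rather than the calculation used for this report: it assumes Poisson counts, conditions the Poisson estimator log(S/n) on S > 0 as described above, truncates the infinite sums at a large upper limit, and uses the hypothetical values `mu = 5` and `n = 10`.

```python
import math

def poisson_pmf(s, lam):
    """P(S = s) for a Poisson distribution with mean lam."""
    return math.exp(s * math.log(lam) - lam - math.lgamma(s + 1))

def exact_log_mean(mu, n, upper=1000):
    """Exact mean and variance of log(S / n) given S > 0, with S ~ Poisson(n * mu)."""
    lam = n * mu
    p_positive = 1.0 - poisson_pmf(0, lam)
    terms = [(math.log(s / n), poisson_pmf(s, lam) / p_positive) for s in range(1, upper)]
    mean = sum(v * p for v, p in terms)
    var = sum((v - mean) ** 2 * p for v, p in terms)
    return mean, var

def exact_mean_log_plus_one(mu, n, upper=1000):
    """Exact mean and variance of the average of ln(Y + 1) over n Poisson(mu) counts."""
    terms = [(math.log(y + 1.0), poisson_pmf(y, mu)) for y in range(upper)]
    mean = sum(v * p for v, p in terms)
    var_single = sum((v - mean) ** 2 * p for v, p in terms)
    return mean, var_single / n

mu, n = 5.0, 10  # hypothetical values for illustration
pois_exact = exact_log_mean(mu, n)
pois_delta = (math.log(mu) - 1.0 / (2 * n * mu), 1.0 / (n * mu))
norm_exact = exact_mean_log_plus_one(mu, n)
norm_delta = (math.log(mu + 1) - mu / (2 * (mu + 1) ** 2), mu / (n * (mu + 1) ** 2))

for label, (mean, var) in [("Poisson exact", pois_exact), ("Poisson delta", pois_delta),
                           ("Ln(Y+1) exact", norm_exact), ("Ln(Y+1) delta", norm_delta)]:
    bias = mean - math.log(mu)
    rmse = math.sqrt(var + bias ** 2)
    print(f"{label:14s} bias={bias:+.4f} rmse={rmse:.4f}")
```

For means of this size the exact and delta-method columns agree closely, and the Ln(Y+1) estimator shows the larger bias.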

Table for $\hat\beta_0$. Columns: (1) exact calculation for Poisson regression; (2) delta method for Poisson regression; (3) exact calculation for log-normal of Ln(Y+1); (4) delta method for log-normal of Ln(Y+1). The first row gives the true $\beta_0$.

The bias and RMSE are defined as

$$\text{Bias} = E(\hat\beta_0) - \beta_0 \tag{6.1.5}$$
$$\text{RMSE} = \sqrt{\text{Var}(\hat\beta_0) + \text{Bias}^2} \tag{6.1.6}$$

Conclusions from these results are as follows.

1) Comparing the first and second or third and fourth columns, we see that the delta method approximations are quite good for means of this size. For smaller means the approximations will be less precise. The delta method approximations could be used to develop more efficient models, as done later in this report.

2) Comparing the first and third columns, the normal approximation has larger bias, smaller variance, and a slightly larger RMSE. Approximately bias-corrected estimators could therefore be competitive with generalized linear models.

For comparing models with discrete distributions and closed-form solutions, such as log(y+1) or Poisson estimation, the simulations of O'Hara and Kotze can be replaced with exact calculations. In addition, simple delta method approximations can be used for initial comparisons of alternative modeling methods before using more lengthy exact calculations for final results on promising methods.

6.2 Two Means

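The next step up in complexity is comparing two means. The difference between means is the same as the regression slope with only two x values. In this section let's look at this simpler case before moving on to a more usual regression case.

6.2.1 Exact calculation for difference between means

In the two-means case, I would like to discuss the comparisons among exact calculation, the delta method and simulation for the Poisson and log-normal models. We again assume the data are Poisson distributed with means $\mu_1$ and $\mu_2$. Based on (3.10), for the Poisson method of estimation, $\hat\beta_1 = \log(S_2/n) - \log(S_1/n)$, where $S_1 = \sum_i Y_{1i}$ and $S_2 = \sum_i Y_{2i}$, and $E(\hat\beta_1)$ and $\text{Var}(\hat\beta_1)$ in the exact calculation can be obtained by the following:

$$E(\hat\beta_1) = \sum_{s=1}^{\infty} \log\!\left(\frac{s}{n}\right) p_2(s) - \sum_{s=1}^{\infty} \log\!\left(\frac{s}{n}\right) p_1(s) \tag{6.2.1}$$
$$\text{Var}(\hat\beta_1) = \sum_{k=1}^{2} \left\{\sum_{s=1}^{\infty} \left[\log\!\left(\frac{s}{n}\right)\right]^2 p_k(s) - \left[\sum_{s=1}^{\infty} \log\!\left(\frac{s}{n}\right) p_k(s)\right]^2\right\} \tag{6.2.2}$$

where $p_k(s) = P(S_k = s)/[1 - P(S_k = 0)]$ with $S_k \sim \text{Poisson}(n\mu_k)$, $k = 1, 2$. For the method that uses Ln(Y+1), we have the following:

$$E(\hat\beta_1) = \sum_{y=0}^{\infty} \ln(y + 1)\, P(Y_2 = y) - \sum_{y=0}^{\infty} \ln(y + 1)\, P(Y_1 = y) \tag{6.2.3}$$
$$\text{Var}(\hat\beta_1) = \frac{1}{n}\left\{\text{Var}[\ln(Y_1 + 1)] + \text{Var}[\ln(Y_2 + 1)]\right\} \tag{6.2.4}$$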
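The two-means comparison can be illustrated in the same way. The following Python sketch is an assumption-laden illustration, not the report's own calculation: it takes Poisson counts with hypothetical means `mu1 = 5` and `mu2 = 10` and `n = 10` replicates per mean, conditions the Poisson estimator on nonzero totals, and truncates the infinite sums at a large upper limit.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson distribution with mean lam."""
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def exact_mean_log_sum(lam, upper=2000):
    """E[log(S) | S > 0] for S ~ Poisson(lam)."""
    p_positive = 1.0 - poisson_pmf(0, lam)
    return sum(math.log(s) * poisson_pmf(s, lam) for s in range(1, upper)) / p_positive

def exact_mean_log_plus_one(mu, upper=2000):
    """E[ln(Y + 1)] for Y ~ Poisson(mu)."""
    return sum(math.log(y + 1.0) * poisson_pmf(y, mu) for y in range(upper))

mu1, mu2, n = 5.0, 10.0, 10  # hypothetical values for illustration
true_slope = math.log(mu2) - math.log(mu1)

# Poisson estimator: log(S2 / n) - log(S1 / n) = log(S2) - log(S1).
poisson_exact = exact_mean_log_sum(n * mu2) - exact_mean_log_sum(n * mu1)
poisson_delta = true_slope - 1.0 / (2 * n * mu2) + 1.0 / (2 * n * mu1)

# Normal estimator: difference of the averages of ln(Y + 1).
lognormal_exact = exact_mean_log_plus_one(mu2) - exact_mean_log_plus_one(mu1)
lognormal_delta = (math.log(mu2 + 1) - math.log(mu1 + 1)
                   - mu2 / (2 * (mu2 + 1) ** 2) + mu1 / (2 * (mu1 + 1) ** 2))

print("true slope     ", true_slope)
print("Poisson exact  ", poisson_exact, " delta ", poisson_delta)
print("Ln(Y+1) exact  ", lognormal_exact, " delta ", lognormal_delta)
```

With these values the Poisson estimate of the difference is nearly unbiased, while the Ln(Y+1) estimate is pulled toward zero.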

6.2.2 Comparisons of Models for Two Means by Doing Simulation

Here I compare the biases and RMSEs of $\hat\beta_1$ under different methods of estimation, such as Poisson, Ln(Y+1), and the non-linear mixed model. Data sets were simulated from a Poisson distribution. To check whether the means and the number of data points in each simulation are factors, I simulated data sets with two different pairs of means $(\mu_1, \mu_2)$, each with n=10 and n=20 data points. The data were analyzed assuming that time is a factor. Models were fitted making the following assumptions about the response, y:

1. y follows a Poisson distribution
2. y follows a negative binomial distribution
3. log(y+1) follows a normal distribution
   a. A standard regression with mean linearly related to x and constant variance.
   b. Nonlinear approximations to the mean and variance with nlmixed.

The simulations were again compared using the mean bias and root mean-squared error (RMSE). Simulations and analyses were carried out in the SAS statistical program using proc reg, proc genmod and proc nlmixed.

Fig. 4 and Fig. 5 show the bias and RMSE of $\hat\beta_1$ for the different models, for data generated from different pairs of means and numbers of data points. For example, 5_10with20 means that the data are generated from $\mu_1 = 5$, $\mu_2 = 10$ and $n = 20$. The amount of bias doesn't depend much on the two means or the number of data points, even though the regression model for log(y+1) has a little dependence on the two means. But basically, the nonlinear mixed model gives the best estimate of the slope, that is, the difference of means. The data set with the higher mean generates lower bias than the one with the lower mean.

Fig. 4 Estimated mean bias from four different models, applied to data simulated from a Poisson distribution. A low bias means that the model will basically return the true value.

The root mean-squared error shows a similar pattern, with the non-linear mixed model having a low RMSE. A combination of higher mean and more data points gives lower RMSE. From the plots in Fig. 4 and Fig. 5, the Poisson, negative binomial and nlmixed models perform well for Poisson data no matter what values are chosen for the two means and n. In short, we don't have to worry that the selection of these values affects the outcome of comparing models.

Fig. 5 Estimated root mean-squared error from four different models, applied to data simulated from a Poisson distribution

6.2.3 Regression

From the previous sections, using nonlinear approximations to the means and variances is a viable alternative to fitting the correct model when data are Poisson. If data were always Poisson, these methods would not be necessary. But since data often follow fairly lognormal patterns, with much larger variances relative to the mean than a Poisson or even a negative binomial distribution, these methods could work well over a large range of models. For the hawk migration data, our primary interests are usually regression-type analyses, such as whether the populations are decreasing over a span of years. To decide what methods I should use to fit models to the hawk data, I will compare regression, Poisson regression, negative binomial regression, and nlmixed models when data are Poisson and log-normally distributed. We are most interested in estimating the slope, $\beta_1$, to monitor changes in bird populations over time. The results will be shown with the relative bias and relative RMSE of $\hat\beta_1$, which makes it easier to see how large the bias and RMSE are without referring to the actual parameter values.

$$\text{Relative bias} = \frac{E(\hat\beta_1) - \beta_1}{\beta_1} \tag{6.2.5}$$
$$\text{Relative RMSE} = \frac{\sqrt{E\left[(\hat\beta_1 - \beta_1)^2\right]}}{\beta_1} \tag{6.2.6}$$

For example, a value of 0.2 means that the error is 20% of the true value. For simplicity here, we consider the number of days past September 1 as a trend factor for the hourly hawk counts and generate data corresponding to days from 0 to 50.

Regression with Poisson data

The simplest model for count data is a Poisson distribution, so, as in previous sections, data are first simulated from a Poisson distribution. The SAS statistical program is the main one used for analyses. The main procedures include proc reg, proc genmod (for Poisson and negative binomial regression) and proc nlmixed. The following code is used to generate Poisson data with mean $\exp(\beta_0 + \beta_1 x)$:

    %let b0=0;
    %let b1=0.02;
    %let nsim=1000;
    %let n=10;

    data sim3.mydata;
      call streaminit(1895);              /* fixed seed for reproducibility */
      do isim=1 to &nsim;                 /* simulation replicate */
        do days=0 to 50 by 5;             /* days past September 1 */
          do rep=1 to &n;                 /* replicate counts per day */
            mu=&b0+(&b1)*days;            /* linear predictor on the log scale */
            y=rand('poisson',exp(mu));    /* Poisson count with mean exp(mu) */
            ln_y_1=log(y+1);              /* response for the normal models */
            output;
          end;
        end;
      end;
    run;

The values for β 0 and β 1 are chosen to represent relatively small expected counts corresponding to hourly observation of hawk counts, Y; the mean counts increase from an average of 1 bird per hour to 2.7 birds per hour. For the nlmixed model for Ln(Y+1), the approximations to the mean and variance of Ln(Y+1) play a big role in the estimates of the parameters. From section 6.1, the delta method will be a good way to do the approximations. The nlmixed code is as follows:

    proc nlmixed data=sim3.mydata;
      ods output ParameterEstimates=parm_nlmix;
      by isim;
      title 'nlmixed';
      parms b0=0 b1=0.031 r=1 c=0.45;
      bounds c > 0;
      mu = b0 + b1*days;                                  /* log of E(Y) */
      mu_y = exp(mu);                                     /* E(Y) */
      var_y = c*mu_y**r;                                  /* Var(Y) = c*E(Y)**r */
      mu_ln = log(mu_y+1) - 0.5*var_y/((mu_y+1)**2);      /* approximate E[Ln(Y+1)] */
      var_ln = abs(var_y/(mu_y+1)**2);                    /* approximate Var[Ln(Y+1)] */
      model ln_y_1 ~ normal(mu_ln, var_ln);
    run;

The comparison results can be seen in the following table (1). The nlmixed model doesn't work as well as the Poisson and negative binomial models, both of which work very well for Poisson data. In particular, we sacrifice little in fitting the more general negative binomial model to Poisson data. The nlmixed approximation does not do a bad job either. Perhaps the extra flexibility of the nlmixed models for more complex data will be worth the small sacrifice in efficiency when the data are Poisson. From this simulation result, it is obvious that the regression model does not fit Poisson data well. We see the comparison more intuitively in Fig. 6 and Fig. 7.
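The simulation design above can also be sketched outside SAS. The following Python code is a simplified illustration and not the analysis reported here: it uses only the standard library, generates Poisson counts with the same β 0 = 0, β 1 = 0.02 and day grid, and compares an ordinary regression on ln(y+1) with a Poisson regression fitted by a hand-written Newton-Raphson loop. The helper names, the number of replicates `nsim=200` (smaller than the 1000 above to keep it quick), and the omission of the negative binomial and nlmixed fits are all assumptions of this sketch.

```python
import math
import random

def rpoisson(rng, lam):
    """Draw one Poisson(lam) count by Knuth's method (adequate for small means)."""
    limit, count, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    return sxy / sxx

def poisson_slope(x, y, iterations=25):
    """Newton-Raphson fit of log(mu) = b0 + b1 * x under a Poisson likelihood."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * a) for a in x]
        g0 = sum(b - m for b, m in zip(y, mu))                  # score for b0
        g1 = sum((b - m) * a for a, b, m in zip(x, y, mu))      # score for b1
        h00 = sum(mu)                                           # information matrix
        h01 = sum(m * a for a, m in zip(x, mu))
        h11 = sum(m * a * a for a, m in zip(x, mu))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b1

def simulate(b0=0.0, b1=0.02, nsim=200, n=10, seed=1895):
    """Relative bias and relative RMSE of the slope for two estimation methods."""
    rng = random.Random(seed)
    days = [d for d in range(0, 51, 5) for _ in range(n)]
    estimates = {"reg": [], "poisson": []}
    for _ in range(nsim):
        y = [rpoisson(rng, math.exp(b0 + b1 * d)) for d in days]
        estimates["reg"].append(ols_slope(days, [math.log(v + 1) for v in y]))
        estimates["poisson"].append(poisson_slope(days, y))
    summary = {}
    for name, values in estimates.items():
        rel_bias = (sum(values) / nsim - b1) / b1
        rel_rmse = math.sqrt(sum((v - b1) ** 2 for v in values) / nsim) / b1
        summary[name] = (rel_bias, rel_rmse)
    return summary

summary = simulate()
for name, (rel_bias, rel_rmse) in summary.items():
    print(f"{name:8s} rel_bias={rel_bias:+.3f} rel_rmse={rel_rmse:.3f}")
```

In this sketch the regression on ln(y+1) underestimates the slope by a large fraction of its true value, while the Poisson fit is close to unbiased, which is the pattern described in the text.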

Table (1)
Columns: Obs, method, MEAN, rel_bias, rel_rmse
Rows (method): Poisson, Reg, Negbin, nlmixed

Fig. 6 The estimated relative bias for different models

Fig. 7 The estimated relative RMSE for different models

Regression with lognormal data

As discussed earlier, in many cases Ln(Y+1) is fairly normally distributed. Potentially, when data are of this sort, generalized linear models such as Poisson regression or negative binomial regression might not be efficient compared to methods assuming normal errors. In this section, I simulate hawk data with Y following a discrete version of a lognormal distribution where Ln(Y) is normal. Because the discrete version will have zero counts, the analysis will be performed with Ln(Y+1). I will take the variance of Y to be proportional to a power of the expected value of Y.

Since $Y$ has a log-normal distribution, $X = \ln(Y) \sim N(\mu, \sigma^2)$, where $Y = e^{X}$. The following equations show the way to generate the random variables. The mean and variance of a log-normal variable are as follows:

$$E(Y) = \exp\!\left(\mu + \frac{\sigma^2}{2}\right) \tag{6.2.7}$$
$$\text{Var}(Y) = \left(e^{\sigma^2} - 1\right)\exp\!\left(2\mu + \sigma^2\right) \tag{6.2.8}$$

For my simulations

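$$E(Y) = \exp\!\left(\mu + \frac{\sigma^2}{2}\right) = \exp(\beta_0 + \beta_1 x) \tag{6.2.9}$$
$$\text{Var}(Y) = c\,[E(Y)]^{r} \tag{6.2.10}$$

where $c$ and $r$ are both constants. Solving for $\sigma^2$ we find

$$\sigma^2 = \log\!\left(1 + c\,[E(Y)]^{\,r-2}\right). \tag{6.2.11}$$

Then, knowing $\sigma^2$ from (6.2.11), it is easy to get $\mu$ from (6.2.9):

$$\mu = \log[E(Y)] - \frac{\sigma^2}{2}. \tag{6.2.12}$$

For a Poisson distribution Var(Y) = E(Y), which corresponds to r=1 and c=1. Meanwhile, Y has constant variance in the Ln scale when r=2. Based on the above, the SAS code for generating the data is as follows:

    %let b0=0;
    %let b1=0.02;
    %let nsim=1000;
    %let n=10;
    %let c=1;

    data one;
      title 'Run Simulation';
      call streaminit( );                 /* seed value not legible in the source; supply a positive integer */
      do isim=1 to &nsim;
        do days=0 to 50 by 5;
          do r= 1 to 3.0 by 0.2;          /* power in Var(Y) = c*E(Y)**r */
            do rep=1 to &n;
              ln_mu_y=&b0+(&b1)*days;     /* log of E(Y) */
              mu_y=exp(ln_mu_y);
              sig_2=log(1+(&c)*exp((r-2)*ln_mu_y));   /* equation (6.2.11) */
              std=sqrt(sig_2);
              mu=ln_mu_y-0.5*sig_2;       /* equation (6.2.12) */
              x=rand('normal',mu,std);
              y1=exp(x);                  /* continuous lognormal value */
              rem = y1 - floor(y1);
              y = floor(y1) + 1*(rand('uniform') < rem);   /* random rounding to an integer */
              ln_y_1 = log(y+1);
              var_y = (exp(sig_2) - 1)*mu_y**2;
              mu_y_r = mu_y**r;
              output;
            end;
          end;
        end;
      end;
    run;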
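The parameter choices in this data-generating step can be checked with a short calculation. The following Python sketch is an illustration under the assumptions stated in this section (E(Y) fixed on the log scale, Var(Y) = c·E(Y)^r); the function names and the test values are hypothetical. It confirms that the normal parameters recover the intended mean and variance, and that the random rounding step keeps the expected value unchanged.

```python
import math
import random

def lognormal_params(mu_y, c, r):
    """Normal-scale mean and variance giving E(Y) = mu_y and Var(Y) = c * mu_y**r."""
    sig2 = math.log(1.0 + c * mu_y ** (r - 2.0))
    mu = math.log(mu_y) - 0.5 * sig2
    return mu, sig2

def stochastic_round(value, rng):
    """Round down, then add 1 with probability equal to the fractional part."""
    base = math.floor(value)
    return base + (1 if rng.random() < value - base else 0)

# Check the moment formulas for a few hypothetical settings.
checks = []
for mu_y, c, r in [(1.0, 1.0, 1.0), (2.7, 1.0, 2.0), (2.0, 0.5, 3.0)]:
    mu, sig2 = lognormal_params(mu_y, c, r)
    mean = math.exp(mu + 0.5 * sig2)
    var = (math.exp(sig2) - 1.0) * math.exp(2.0 * mu + sig2)
    checks.append((mean, mu_y, var, c * mu_y ** r))
    print(mean, mu_y, var, c * mu_y ** r)

# Random rounding of 1.75 gives 1 or 2, with an average close to 1.75.
rng = random.Random(2011)
rounded = [stochastic_round(1.75, rng) for _ in range(20000)]
rounded_mean = sum(rounded) / len(rounded)
print(rounded_mean)
```

Because the rounding adds 1 with probability equal to the fractional part, the integer counts have the same expected value as the continuous lognormal values they replace.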

The code y = floor(y1) + 1*(rand('uniform') < rem); keeps the expected value of the rounded count Y the same as the expected value of the unrounded y1. For example, if Y1 = 1.75, then floor(Y1) = 1, the largest integer less than or equal to Y1, and Y=1 with probability 0.25 and Y=2 with probability 0.75. For these simulated values the mean of Y increases from 1.0 at days=0 to 2.7 at days=50. These values were chosen to represent fairly small counts corresponding to smaller hourly recordings for the data of Seeland (2010).

Hawk data are count data, which possibly include many zeros, so it is more meaningful to compare models with the dependent variable Ln(Y+1), where Y represents buteo counts. In our simulation, it is easy to build a relationship between the response and the independent variable days with regression, Poisson and negative binomial models. Here I mainly introduce the nlmixed model. Before finalizing the nlmixed model, the first key step is to find good approximations to the mean and variance of Ln(Y+1). Three approximation methods were compared. The one chosen was the one whose approximate mean and variance of Ln(Y+1) were closest to the true mean and variance. To do a meaningful simulation, simulating the data and choosing an appropriate approximation method in the nlmixed model are the two key steps. The main step in our simulation is selecting and checking the approximation method. Different expansions generate different approximations of the mean and variance of Ln(Y+1). For example, we can apply a Taylor series expansion. In our simulation, we used a different approach by constructing a log likelihood function, which can be seen in the following nlmixed code:

    proc nlmixed data=one;
      ods output ParameterEstimates=parm_nlmix2;
      by isim r;
      title 'nlmixed normal intervals';
      parms b0=0 b1=0.013 r=3 c=1;

      bounds c > 0;
      ln_mu_y = b0 + b1*days;
      sig_2 = log(c*exp((r-2)*ln_mu_y)+1);    /* variance of Ln(Y) */
      mu = ln_mu_y - 0.5*sig_2;               /* mean of Ln(Y) */
      /* probability that the unrounded lognormal value falls within 0.5 of y */
      if y = 0 then
        LogLike = log(probnorm((log(0.5)-mu)/sqrt(sig_2)));
      else
        LogLike = log(probnorm((log(y+0.5)-mu)/sqrt(sig_2))
                    - probnorm((log(y-0.5)-mu)/sqrt(sig_2)));
      model y ~ general(LogLike);
    run;

This likelihood treats the observed counts as rounded lognormal data. Since this is close to the way the simulation data were generated, this maximum likelihood method should be nearly optimal, at least asymptotically for large sample sizes. Comparing this nlmixed model with the regression, Poisson and negative binomial models, the resulting plots are shown in Fig. 8 and Fig. 9. The regression model clearly performs poorly, with large relative biases and RMSEs. The Poisson and negative binomial models perform well with good estimates. The nlmixed model works very well, especially when r is greater than 2.

Fig. 8 Relative bias against different r
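The interval likelihood in the nlmixed code above can be written as a small stand-alone function to make its logic explicit. This is a hedged Python sketch under the same parameterization (ln_mu_y, r, c); the names `norm_cdf` and `rounded_lognormal_loglike` are hypothetical, and the function is a direct transcription rather than a numerically hardened implementation, so the difference of two normal probabilities can underflow for extremely large counts.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function (plays the role of probnorm)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rounded_lognormal_loglike(y, ln_mu_y, r, c):
    """Log likelihood of an integer count y viewed as a rounded lognormal value.

    ln_mu_y is the log of E(Y); Var(Y) is assumed to equal c * E(Y)**r.
    """
    sig_2 = math.log(c * math.exp((r - 2.0) * ln_mu_y) + 1.0)  # variance of Ln(Y)
    mu = ln_mu_y - 0.5 * sig_2                                  # mean of Ln(Y)
    sd = math.sqrt(sig_2)
    upper = norm_cdf((math.log(y + 0.5) - mu) / sd)
    # For y = 0 the interval is (0, 0.5], so the lower probability is zero.
    lower = 0.0 if y == 0 else norm_cdf((math.log(y - 0.5) - mu) / sd)
    return math.log(upper - lower)


if __name__ == "__main__":
    # The interval probabilities over all counts should add up to about 1.
    total = sum(math.exp(rounded_lognormal_loglike(y, 0.5, 2.0, 1.0)) for y in range(201))
    print(total)
```

Because the intervals (y-0.5, y+0.5] tile the positive axis, the probabilities telescope, which gives a quick sanity check on the likelihood.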

Fig. 9 Relative RMSE against different r

Summaries for Model Comparison

Comparing relative bias and RMSE across models, the nlmixed model generally does a good job whether the data follow a Poisson distribution or a log normal distribution. Surprisingly, the regression model doesn't work well for log normal data. Meanwhile, the Poisson and negative binomial models still perform well. When the variance is proportional to a large power of the mean, say 3 or more, the nlmixed nonlinear approximation works better, but for data between Poisson, r = 1, and lognormal with constant variance, r = 2, the generalized linear models, particularly the negative binomial model, work well even for lognormal data. The negative binomial variance, μ + μ²/θ, allows both Poisson variance with θ large and variance proportional to the mean squared with θ small. In the next section in the analysis of the hawk

data, we will note that the variance is estimated to be proportional to $\mu^{1.8}$, which is within the range where either negative binomial or nonlinear approximations work well.

7 FITTING MODELS TO HAWK DATA

In this section, we will look further at the hawk example introduced in section 2. From Fig. 3 in section 2, which plots the logarithm of average buteo counts each day against date, we can see that there might be a linearly increasing trend in date. Does the wind during the observation hour affect buteo migration? Could the distance to dry land be a factor in buteo counts? To make valid forecasts of buteo counts, model selection is important to us. Buteo counts are discrete variables and might include zeros. Models for such data include Poisson and negative binomial distributions, but it's possible that there are too many zeros for Poisson or negative binomial distributions. Another option is to use Ln(Y+1) and apply methods for normally distributed data. The Central Limit Theorem (CLT) helps models that assume normality of the data to work. This led us to do the simulations in section 6 and try to find an appropriate model for hawk data.

7.1 Simple introduction to some potential variables

Hawk counts were recorded under certain weather, geographic and geological conditions. Let's look at the basic ideas behind how we use these conditions.

1. Wind is considered as one possible factor. The best wind direction is nearly zero degrees, that is, north, so north is chosen as the reference wind direction. Wind was recorded as degrees clockwise from north. The variable Wind_north_sp is wind speed times the cosine of the wind angle relative to north and can be understood as the strength of the northerly wind vector.

The variable Wind_Prev is used to record the number of days that winds did not have a westerly component before the observation day.

2. We wonder if the time of day, that is, the specific hour when observations began, could be a factor in counting migrating buteos, so the variable Time will represent the starting time of observations in a day.

3. We have noticed that buteo counts slightly increase with date. We therefore use the variable day to represent the number of days since Sept. 1.

4. Precipitation is also considered as a potential predictor. The variable Precip_Pre recorded the number of days with 50% or more hours of precipitation prior to the observation day.

5. Likewise, we wonder if the distance to water would affect buteo counts. Distance to the shore of Lake Superior is used to see if buteo migration is somehow related to this geographical location.

7.2 Fitting Models

From section 6.2.3, it seems that the negative binomial model would be a reasonable choice given that NLMIXED cannot handle all of the random effects that we need in the model. These mixed models with negative binomial data can be fit with SAS procedure NLMIXED. However, fitting these types of models with these random effects turns out to be tricky. Nonlinear optimization and numerical integration are needed, and for all models we fit, the resulting gradient vector in the "solution" was not close to zero, which is what we want if we are at a local maximum of the log-likelihood function.

So we are back to using Ln(Y+1) as an initial analysis of these data. From the simulations, using Ln(Y+1) should be less efficient than using the better methods, so at least if an effect comes out significant using Ln(Y+1), this would likely be the case fitting more

efficient models. Further work will need to be done beyond this project to figure out how to fit more complicated models.

As a first step, we try many independent variables, e.g. 19, and use a selection criterion to obtain potential variables. Generally, several models might be highly similar in the quality of the fit based on this criterion. Based on the criterion values, we can only choose a shorter list of independent variables to start studying. The runs were done without random effects in the models, since software is readily available to do this. The p-values will not be correct, but the relative importance of the potential independent variables should be fine. We then fit a model including the variables that appeared in the top models based on the criterion. The p-value is then used as one of the criteria to cut down variables based on the former runs. Here is an example of how we eliminate variables. By running a regression model including the independent variable temp_chg, we found that the p-value of temp_chg is very large, indicating that temp_chg is not needed if other variables are included in the model. We can say that temperature change is not an important predictor of buteo counts, so the variable temp_chg need not be considered in the model. In the end only 14 independent variables are left after using p-values in this way to reduce the variables.

7.2.1 Fit Mixed Model to Data

Mixed models are widely used to model a linear relationship when the data are dependent with a known structure. The most commonly used mixed model involves repeated measurements. Repeated measures are encountered in hawk data, so a mixed model is applied in analyzing the relationship between the logarithm of the average buteo counts and some possible predictors.

Transects are numbered by ordering the distances from Duluth up to the North Shore of Lake Superior. Drawing general conclusions about places in general is more meaningful than finding out the effect of these specific transects. Thus, transect is considered as a random effect here. Date is the day of observing hawk migration, which is treated as a random effect too. The sites on a given transect were at different distances from shore, recorded as the variable shore (a, b, or c), where shore = a is closest to Lake Superior and shore = c is farthest from Lake Superior. To account for dependence of hourly measurements at the same site, a shore*date random effect is also included.

Even though nlmixed might fit the hawk data well, nlmixed cannot handle both date and shore*date random effects. This is the reason for using mixed rather than nlmixed for including those random effects in the model. Proc Mixed in the SAS system provides a very flexible platform for dealing with repeated measures problems. The mixed model can provide a better p-value than the regression model for cutting down variables. One of the mixed model programs is as follows:

    proc mixed data=fengying.buteos_before_nov plots=residualpanel(unpack);
      class transect date shore;
      model Ln_buteos_plus_1 = day shore Wind_Prev Precip_Pre wind_east
                               wind_north_sp time time*time
                               / residual outpm=outpm solution;
      random date shore*date;
      ods select solutionf covparms tests3 ResidualQQplot;
    run;

The estimated transect variance comes out as 0. To make the convergence of the estimation simpler and more likely to find the right MLE, transect is taken out of the random effects for the mixed models in our study.

7.2.2 Fit Nlmixed Model to Data

The nlmixed procedure cannot handle the model with both date and shore*date random effects, but the nlmixed model can be used to check the relationship between variances and mean

and also to check for the best wind direction. This model was set up to include all the variables retained from previous runs. No random effects were used in our nlmixed model, to make the estimation easier. Applying the approximation method we finalized in section 6 to the hawk data, we came up with the following nlmixed code:

    proc nlmixed data=fengying.buteos_before_nov;
      parms b0=-8.8 b_day=0.02 b_shore_a=0.4 b_shore_b=0.3 b_wind_prev=0.2
            b_time=1.5 b_time_2 = b_precip_pre=-0.75 k=0.2 r=2 c=0.5 theta=0;
      bounds c > 0;
      /* wind component along the direction theta; 3.14159265/180 converts degrees to radians */
      wind = k*wind_sp*cos( (wind_dir-theta)*3.14159265/180 );
      ln_mu_y = b0 + b_day*day + b_shore_a*shore_a + b_shore_b*shore_b + wind
                + b_wind_prev*wind_prev + b_time*time + b_time_2*time*time
                + b_precip_pre*precip_pre;
      sig_2 = log(c*exp((r-2)*ln_mu_y)+1);
      mu = ln_mu_y - 0.5*sig_2;
      y = buteos;
      if y = 0 then
        ll = log(probnorm((log(0.5)-mu)/sqrt(sig_2)));
      else
        ll = log(probnorm((log(y+0.5)-mu)/sqrt(sig_2))
               - probnorm((log(y-0.5)-mu)/sqrt(sig_2)));
      model y ~ general(ll);
    run;

Using this code, the maximum likelihood estimate of the clockwise angle relative to north, which has a standard error of 10°, is very nearly true north. The estimate of r is 1.8 with a standard error of 0.19, corresponding to a variance proportional to $\mu^{1.8}$, which is close to the r = 2 case of constant variance on the log scale and consistent with Y being roughly log normal. We can say that the data would not be modeled well as Poisson data, since r = 1 lies about four standard errors below the estimate.

7.3 Summary of Findings

One of the mixed models was introduced in section 7.2.1. First, a newly built model prompts us to look at how the errors of the model are distributed. In Fig. 10, we can see that the residuals lie close to the straight line except for the last two points, which indicates that the data are fairly well approximated by a normal distribution.
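The wind term in the nlmixed code of section 7.2.2 projects the wind vector onto a reference direction: with theta fixed at 0 it reduces to the northerly component used for Wind_north_sp, and with theta estimated it lets the data pick the best direction. A minimal Python sketch of that projection follows; the function name `wind_component` and its arguments are hypothetical, and angles are assumed to be in degrees clockwise from north as described in section 7.1.

```python
import math


def wind_component(speed, direction_deg, best_deg=0.0):
    """Wind speed times the cosine of the angle between the wind and best_deg.

    direction_deg and best_deg are in degrees clockwise from north;
    best_deg = 0 gives the component relative to north.
    """
    return speed * math.cos(math.radians(direction_deg - best_deg))


if __name__ == "__main__":
    for direction in (0.0, 45.0, 90.0, 180.0):
        print(direction, wind_component(10.0, direction))
```

A wind at right angles to the reference direction contributes nothing, and an opposing wind contributes a negative value, so a single coefficient on this term captures both helpful and hindering winds.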

Fig. 10 QQ-plot for residuals from a mixed model

Intuitively, a good model should have predicted values as close to the true values as possible. The R² value of a model is the square of the correlation between the fitted and observed values. It is interesting to see how the predicted values from the mixed model compare with the observed values of Ln(buteos+1). In Fig. 11 the basic trend can be described by the line y=x, except for two outliers. The R² value for this model is about 0.4. Based on the simulations, better models could potentially be fit, but generally speaking, this model works fairly well for these buteo data.
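The definition of R² used above, the squared correlation between fitted and observed values, can be computed directly from the two columns plotted in Fig. 11. The sketch below is illustrative Python with a hypothetical function name `r_squared` and made-up toy numbers; it is not the hawk data and does not reproduce the value of about 0.4.

```python
import math


def r_squared(observed, fitted):
    """Square of the Pearson correlation between observed and fitted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_f = sum(fitted) / n
    cov = sum((o - mean_o) * (f - mean_f) for o, f in zip(observed, fitted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_f = sum((f - mean_f) ** 2 for f in fitted)
    r = cov / math.sqrt(var_o * var_f)
    return r * r


if __name__ == "__main__":
    # Toy values for illustration only.
    obs = [0.0, 0.7, 1.1, 1.6, 2.2]
    fit = [0.3, 0.6, 1.0, 1.2, 1.9]
    print(r_squared(obs, fit))
```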

Fig. 11 Ln(buteos+1) vs. Predicted Mean of Ln(buteos+1)

The following Table 7.3(1) with p-values shows that the effects day, wind_north_sp and time are significant in predicting buteo counts. Again, better models could potentially be fit, but the significant p-values from this model are reliable. This is like using non-parametric methods when data are normal or follow some other distribution. The statistics are not as efficient as they could be, but significant effects can still be considered significant.

Type 3 Tests of Fixed Effects

    Effect           Num DF   Den DF   F Value   Pr > F
    day                                          <.0001
    shore
    Wind_Prev
    Precip_Pre
    wind_east
    wind_north_sp                                <.0001
    time                                         <.0001
    time*time                                    <.0001

Table 7.3(1) Test Results of Fixed Effects

The day effect has been shown in Fig. 3. More and more buteos migrate as the date gets closer to winter. In this part, we are more concerned with the wind_north_sp and time effects. To check their effects, LSMEANS statements were added to the previous mixed model, one at a time. For example, lsmeans wind_north_sp /obsmargins;. In the mixed model with this LSMEANS statement, the variable wind_east has a large p-value and is not useful to the model. Fig. 12 plots the estimates against the least squares means of wind_north_sp, using each unique value of wind_north_sp as its own effect, that is, using wind_north_sp as a "class" variable rather than a linear effect. There is a clear increasing trend with wind_north_sp, which further illustrates that north is the best wind direction for buteo migration.

Fig. 12 Estimates vs. LS-means of wind_north_sp

For another mixed model with LS-means for the variable time, wind_east also turned out to be an unimportant predictor. The relationship between the estimate and time seems to be a parabola, which

is shown in Fig. 13 using each hour of the day as its own effect, that is, time as a "class" variable rather than a quadratic effect. Basically, the buteo migration peak in a day is in the early afternoon.

Fig. 13 Estimates vs. LS-means of time

In the nlmixed model, we want to see if there exists a relationship between the variance of Y and the mean of Y, where Y is the buteo count. We used the idea that Var(Y) = c·E(Y)^r in the nlmixed model. After running the nlmixed code in section 7.2.2, the estimate of r is 1.8 with a standard error of 0.19. In other words, the variance of the buteo count is approximately proportional to the square of the mean of the buteo counts, so the mixed model, which assumes equal variances on the log scale, is reasonable.

8 CONCLUSION

Count data are commonly studied nowadays. The LM, GLM, MIXED and NLMIXED models can be considered for modeling count data. The outcome from our simulations comes in line with

some of the results from O'Hara and Kotze (2010) that log-transformation of count data performs poorly while negative binomial and Poisson models work well, so we do not recommend log-transforming count data with many zero observations. The negative binomial model might perform better than the Poisson model for these kinds of data. The mixed model provides an effective alternative to the NLMIXED model for analyzing count data with complicated random effects. When applying the NLMIXED model, the main focus should be put on choosing a good approximation method. SAS is a good statistical software package for fitting these models.

In our simulations the negative binomial model did well even for lognormal data. However, in 2008 there were not as pronounced large bursts of buteos during the migration. With more very large count days, the variance may be a higher power of the mean, where the nonlinear models would have an advantage over the negative binomial models.

To answer the research questions for the hawk data set in this report, the mixed model is used to analyze it. The effects of day, wind_north_sp and time play central roles in estimating buteo counts during a certain period of time. It is understandable that an increasing number of buteos migrate as time gets closer to winter. Buteos fly from north to south with the benefit of a north wind when winter is coming, so it makes sense that wind_north_sp is a significant factor and that there is an increasing trend in wind_north_sp. It is possible that buteos prefer to migrate during a slightly warmer time, in the early afternoon, which can also be seen from Fig. 13 with its downward parabola.

For future studies, we can apply the mixed model to other bird data, such as accipiters, which fly closer to the ground. By analyzing other bird data, we can further see if day, wind_north_sp and time are still significant in predicting other birds' counts.
Although the transformation of hawk data, to some extent, supports the mixed model, the generalized mixed

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Offset Techniques for Predictive Modeling for Insurance

Offset Techniques for Predictive Modeling for Insurance Offset Techniques for Predictive Modeling for Insurance Matthew Flynn, Ph.D, ISO Innovative Analytics, W. Hartford CT Jun Yan, Ph.D, Deloitte & Touche LLP, Hartford CT ABSTRACT This paper presents the

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

More information

Simple linear regression

Simple linear regression Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

More information

2. Simple Linear Regression

2. Simple Linear Regression Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

More information

2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)

2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR) 2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came

More information

OBJECTIVE ASSESSMENT OF FORECASTING ASSIGNMENTS USING SOME FUNCTION OF PREDICTION ERRORS

OBJECTIVE ASSESSMENT OF FORECASTING ASSIGNMENTS USING SOME FUNCTION OF PREDICTION ERRORS OBJECTIVE ASSESSMENT OF FORECASTING ASSIGNMENTS USING SOME FUNCTION OF PREDICTION ERRORS CLARKE, Stephen R. Swinburne University of Technology Australia One way of examining forecasting methods via assignments

More information

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

More information

Simple Regression Theory II 2010 Samuel L. Baker

Simple Regression Theory II 2010 Samuel L. Baker SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

More information

Poisson Models for Count Data

Poisson Models for Count Data Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F Random and Mixed Effects Models (Ch. 10) Random effects models are very useful when the observations are sampled in a highly structured way. The basic idea is that the error associated with any linear,

More information

Simple Linear Regression

Simple Linear Regression STAT 101 Dr. Kari Lock Morgan Simple Linear Regression SECTIONS 9.3 Confidence and prediction intervals (9.3) Conditions for inference (9.1) Want More Stats??? If you have enjoyed learning how to analyze

More information

AP Physics 1 and 2 Lab Investigations

AP Physics 1 and 2 Lab Investigations AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks

More information

Multivariate Normal Distribution

Multivariate Normal Distribution Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

More information

Univariate Regression

Univariate Regression Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

More information

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

More information

Premaster Statistics Tutorial 4 Full solutions

Premaster Statistics Tutorial 4 Full solutions Premaster Statistics Tutorial 4 Full solutions Regression analysis Q1 (based on Doane & Seward, 4/E, 12.7) a. Interpret the slope of the fitted regression = 125,000 + 150. b. What is the prediction for

More information

4. Simple regression. QBUS6840 Predictive Analytics. https://www.otexts.org/fpp/4

4. Simple regression. QBUS6840 Predictive Analytics. https://www.otexts.org/fpp/4 4. Simple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/4 Outline The simple linear model Least squares estimation Forecasting with regression Non-linear functional forms Regression

More information

5. Linear Regression

5. Linear Regression 5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4

More information

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Module No. #01 Lecture No. #15 Special Distributions-VI Today, I am going to introduce

More information

COLLEGE ALGEBRA. Paul Dawkins

COLLEGE ALGEBRA. Paul Dawkins COLLEGE ALGEBRA Paul Dawkins Table of Contents Preface... iii Outline... iv Preliminaries... Introduction... Integer Exponents... Rational Exponents... 9 Real Exponents...5 Radicals...6 Polynomials...5

More information

Normality Testing in Excel

Normality Testing in Excel Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com

More information

GLM I An Introduction to Generalized Linear Models

GLM I An Introduction to Generalized Linear Models GLM I An Introduction to Generalized Linear Models CAS Ratemaking and Product Management Seminar March 2009 Presented by: Tanya D. Havlicek, Actuarial Assistant 0 ANTITRUST Notice The Casualty Actuarial

More information

T O P I C 1 2 Techniques and tools for data analysis Preview Introduction In chapter 3 of Statistics In A Day different combinations of numbers and types of variables are presented. We go through these

More information

Using simulation to calculate the NPV of a project

Using simulation to calculate the NPV of a project Using simulation to calculate the NPV of a project Marius Holtan Onward Inc. 5/31/2002 Monte Carlo simulation is fast becoming the technology of choice for evaluating and analyzing assets, be it pure financial

More information

Logistic Regression (a type of Generalized Linear Model)

Logistic Regression (a type of Generalized Linear Model) Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge

More information

Algebra I Vocabulary Cards

Algebra I Vocabulary Cards Algebra I Vocabulary Cards Table of Contents Expressions and Operations Natural Numbers Whole Numbers Integers Rational Numbers Irrational Numbers Real Numbers Absolute Value Order of Operations Expression

More information

PCHS ALGEBRA PLACEMENT TEST

PCHS ALGEBRA PLACEMENT TEST MATHEMATICS Students must pass all math courses with a C or better to advance to the next math level. Only classes passed with a C or better will count towards meeting college entrance requirements. If

More information

Introduction to Regression and Data Analysis

Introduction to Regression and Data Analysis Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

More information

Module 5: Multiple Regression Analysis

Module 5: Multiple Regression Analysis Using Statistical Data Using to Make Statistical Decisions: Data Multiple to Make Regression Decisions Analysis Page 1 Module 5: Multiple Regression Analysis Tom Ilvento, University of Delaware, College

More information

Nonlinear Regression Functions. SW Ch 8 1/54/

Nonlinear Regression Functions. SW Ch 8 1/54/ Nonlinear Regression Functions SW Ch 8 1/54/ The TestScore STR relation looks linear (maybe) SW Ch 8 2/54/ But the TestScore Income relation looks nonlinear... SW Ch 8 3/54/ Nonlinear Regression General

More information

Algebra 2 Chapter 1 Vocabulary. identity - A statement that equates two equivalent expressions.

Algebra 2 Chapter 1 Vocabulary. identity - A statement that equates two equivalent expressions. Chapter 1 Vocabulary identity - A statement that equates two equivalent expressions. verbal model- A word equation that represents a real-life problem. algebraic expression - An expression with variables.

More information

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

More information
