# Simultaneous Inference in General Parametric Models

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Simultaneous Inference in General Parametric Models Torsten Hothorn Institut für Statistik Ludwig-Maximilians-Universität München Ludwigstraße 33, D München, Germany Frank Bretz Statistical Methodology, Clinical Information Sciences Novartis Pharma AG CH-4002 Basel, Switzerland Peter Westfall Texas Tech University Lubbock, TX 79409, U.S.A March 5, 2015 Abstract Simultaneous inference is a common problem in many areas of application. If multiple null hypotheses are tested simultaneously, the probability of rejecting erroneously at least one of them increases beyond the pre-specified significance level. Simultaneous inference procedures have to be used which adjust for multiplicity and thus control the overall type I error rate. In this paper we describe simultaneous inference procedures in general parametric models, where the experimental questions are specified through a linear combination of elemental model parameters. The framework described here is quite general and extends the canonical theory of multiple comparison procedures in ANOVA models to linear regression problems, generalized linear models, linear mixed effects models, the Cox model, robust linear models, etc. Several examples using a variety of different statistical models illustrate the breadth This is a preprint of an article published in Biometrical Journal, Volume 50, Number 3, Copyright 2008 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim; available online biometrical-journal.com. 1

2 of the results. For the analyses we use the R add-on package multcomp, which provides a convenient interface to the general approach adopted here. Key words: multiple tests, multiple comparisons, simultaneous confidence intervals, adjusted p-values, multivariate normal distribution, robust statistics. 1 Introduction Multiplicity is an intrinsic problem of any simultaneous inference. If each of k, say, null hypotheses is tested at nominal level α, the overall type I error rate can be substantially larger than α. That is, the probability of at least one erroneous rejection is larger than α for k 2. Common multiple comparison procedures adjust for multiplicity and thus ensure that the overall type I error remains below the pre-specified significance level α. Examples of such multiple comparison procedures include Dunnett s many-to-one comparisons, Tukey s all-pairwise comparisons, sequential pairwise contrasts, comparisons with the average, changepoint analyses, dose-response contrasts, etc. These procedures are all well established for classical regression and ANOVA models allowing for covariates and/or factorial treatment structures with i.i.d. normal errors and constant variance, see Bretz et al and the references therein. For a general reading on multiple comparison procedures we refer to Hochberg and Tamhane 1987 and Hsu In this paper we aim at a unified description of simultaneous inference procedures in parametric models with generally correlated parameter estimates. Each individual null hypothesis is specified through a linear combination of elemental model parameters and we allow for k of such null hypotheses to be tested simultaneously, regardless of the number of elemental model parameters p. The general framework described here extends the current canonical theory with respect to the following aspects: i model assumptions such as normality and homoscedasticity are relaxed, thus allowing for simultaneous inference in generalized linear models, mixed effects models, survival models, etc.; ii arbitrary linear functions of the elemental parameters are allowed, not just contrasts of means in ANCOVA models; iii computing the reference distribution is feasible for arbitrary designs, especially for unbalanced designs; and iv a unified implementation is provided which allows for a fast transition of the theoretical results to the desks of data analysts interested in simultaneous inferences for multiple hypotheses. Accordingly, the paper is organized as follows. Section 2 defines the general model and obtains the asymptotic or exact distribution of linear functions of elemental model parameters under rather weak conditions. In Section 3 we describe the framework for simultaneous inference procedures in general parametric models. An overview about important applications of the methodology is given in Section 4 followed by a short discussion of the software implementation in Section 5. Most interesting from a practical point of view is Section 6 where we analyze four rather challenging problems with the tools developed in this paper.

3 2 Model and Parameters In this section we introduce the underlying model assumptions and derive some asymptotic results necessary in the subsequent sections. The results from this section form the basis for the simultaneous inference procedures described in Section 3. Let MZ 1,...,Z n,θ,η denote a semi-parametric statistical model. The set of n observations is described by Z 1,...,Z n. The model contains fixed but unknown elemental parameters θ R p and other random or nuisance parameters η. We are primarily interested in the linear functions ϑ := Kθ of the parameter vector θ as specified through the constant matrix K R k,p. In what follows we describe the underlying model assumptions, the limiting distribution of estimates of our parameters of interest ϑ, as well as the corresponding test statistics for hypotheses about ϑ and their limiting joint distribution. Suppose ˆθ n R p is an estimate of θ and S n R p,p is an estimate of covˆθ n with a n S n P Σ R p,p 1 for some positive, nondecreasing sequence a n. Furthermore, we assume that a multivariate central limit theorem holds, i.e., an 1/2 d ˆθ n θ N p 0,Σ. 2 Ifboth1and2arefulfilledwewrite ˆθ n a N p θ,s n. Then, bytheorem3.3.ainserfling 1980, the linear function ˆϑ n = Kˆθ n, i.e., an estimate of our parameters of interest, also follows an approximate multivariate normal distribution ˆϑ n = Kˆθ n a N k ϑ,s n with covariance matrix S n := KS n K for any fixed matrix K R k,p. Thus we need not to distinguish between elemental parameters θ or derived parameters ϑ = Kθ that are of interest to the researcher. Instead we simply assume for the moment that we have in analogy to 1 and 2 ˆϑ n a N k ϑ,s n with a n S n P Σ := KΣK R k,k 3 and that the k parameters in ϑ are themselves the parameters of interest to the researcher. It is assumed that the diagonal elements of the covariance matrix are positive, i.e., Σ jj > 0 for j = 1,...,k. Then, the standardized estimator ˆϑ n is again asymptotically normally distributed T n := D 1/2 n ˆϑ n ϑ a N k 0,R n 4 where D n = diags n is the diagonal matrix given by the diagonal elements of S n and R n = Dn 1/2 S ndn 1/2 R k,k 1

4 is the correlation matrix of the k-dimensional statistic T n. To demonstrate 4, note that with 3 we have a n S P n Σ P and a n D n diagσ. Define the sequence ã n needed to establish ã-convergence in 4 by ã n 1. Then we have ã n R n = D 1/2 n S nd 1/2 n = a n D n 1/2 a n S na n D n 1/2 P diagσ 1/2 Σ diagσ 1/2 =: R R k,k where the convergence in probability to a constant follows from Slutzky s Theorem Theorem 1.5.4, Serfling, 1980 and therefore 4 holds. To finish note that T n = Dn 1/2 ˆϑ n ϑ = a n D n 1/2 an 1/2 d ˆϑ n ϑ N k 0,R. For the purposes of multiple comparisons, we need convergence of multivariate probabilities calculated for the vector T n when T n is assumed normally distributed with R n treated as if it were the true correlation matrix. However, such probabilities Pmax T n t P are continuous functions of R n and a critical value t which converge by R n R as a consequence of Theorem 1.7 in Serfling In cases where T n is assumed multivariate t distributed with R n treated as the estimated correlation matrix, we have similar convergence as the degrees of freedom approach infinity. Since we only assume that the parameter estimates are asymptotically normally distributed with a consistent estimate of the associated covariance matrix being available, our framework covers a large class of statistical models, including linear regression and ANOVA models, generalized linear models, linear mixed effects models, the Cox model, robust linear models, etc. Standard software packages can be used to fit such models and obtain the estimates ˆθ n and S n which are essentially the only two quantities that are needed for what follows in Section 3. It should be noted that the elemental parameters θ are not necessarily means or differences of means in ANCOVA models. Also, we do not restrict our attention to contrasts of such means, but allow for any set of constants leading to the linear functions ϑ = Kθ of interest. Specific examples for K and θ will be given later in Sections 4 and 6. 3 Global and Simultaneous Inference Based on the results from Section 2, we now focus on the derivation of suitable inference procedures. We start considering the general linear hypothesis Searle, 1971 formulated in terms of our parameters of interest ϑ H 0 : ϑ := Kθ = m. 2

5 Under the conditions of H 0 it follows from Section 2 that T n = D 1/2 n ˆϑ n m a N k 0,R n. This approximating distribution will now be used as the reference distribution when constructing the inference procedures. The global hypothesis H 0 can be tested using standard global tests, such as the F- or the χ 2 -test. An alternative approach is to use maximum tests, as explained in Subsection 3.1. Note that a small global p-value obtained from one of these procedures leading to a rejection of H 0 does not give further indication about the nature of the significant result. Therefore, one is often interested in the individual null hypotheses H j 0 : ϑ j = m j. Testing the hypotheses set {H 1 0,...,H k 0} simultaneously thus requires the individual assessments while maintaining the familywise error rate, as discussed in Subsection 3.2 At this point it is worth considering two special cases. A stronger assumption than asymptotic normality of ˆθ n in 2 is exact normality, i.e., ˆθ n N p θ,σ. If the covariance matrix Σ is known, it follows by standard arguments that T n N k 0,R, when T n is normalized using fixed, known variances. Otherwise, in the typical situation of linear models with normal i.i.d. errors, Σ = σ 2 A, where σ 2 is unknown but A is fixed and known, the exact distribution of T n is a k-dimensional multivariate t k ν,r distribution with ν degrees of freedom ν = n p 1 for linear models, see Tong Global Inference The F- and the χ 2 -test are classical approaches to assess the global null hypothesis H 0. Standard results such as Theorem 3.5, Serfling, 1980 ensure that X 2 = T nr + nt n d χ 2 RankR when ˆθ n a N p θ,s n F = T nr + T n RankR FRankR,ν when ˆθ n N p θ,σ 2 A, where RankR and ν are the corresponding degrees of freedom of the χ 2 and F distribution, respectively. Furthermore, RankR n + denotes the Moore-Penrose inverse of the correlation matrix RankR. Another suitable scalar test statistic for testing the global hypothesis H 0 is to consider the maximum of the individual test statistics T 1,n,...,T k,n of the multivariate statistic T n = T 1,n,...,T k,n, leading to a max-t type test statistic max T n. The distribution of this statistic under the conditions of H 0 can be handled through the k-dimensional distribution t t Pmax T n t = ϕ k x 1,...,x k ;R,νdx 1 dx k =: g ν R,t 5 t t 3

6 for some t R, where ϕ k is the density function of either the limiting k-dimensional multivariate normal with ν = and the operator or the exact multivariate t k ν,r- distribution with ν < and the = operator. Since R is usually unknown, we plug-in the consistent estimate R n as discussed in Section 2. The resulting global p-value exact or approximate, depending on context for H 0 is 1 g ν R n,max t when T = t has been observed. Efficient methods for approximating the above multivariate normal and t integrals are described in Genz 1992; Genz and Bretz 1999; Bretz et al and Genz and Bretz In contrast to the global F- or χ 2 -test, the max-t test based on the test statistic max T n also provides information, which of the k individual null hypotheses H j 0,j = 1,...,k is significant, as well as simultaneous confidence intervals, as shown in the next subsection. 3.2 Simultaneous Inference We now consider testing the k null hypotheses H 1 0,...,H k 0 individually and require that the familywise error rate, i.e., the probability of falsely rejecting at least one true null hypothesis, is bounded by the nominal significance level α 0,1. In what follows we use adjusted p-values to describe the decision rules. Adjusted p-values are defined as the smallest significance level for which one still rejects an individual hypothesis H j 0, given a particular multiple test procedure. In the present context of single-step tests, the at least asymptotic adjusted p-value for the jth individual two-sided hypothesis H j 0 : ϑ j = m j,j = 1,...,k, is given by p j = 1 g ν R n, t j, where t 1,...,t k denote the observed test statistics. By construction, we can reject an individual null hypothesis H j 0, j = 1,...,k, whenever the associated adjusted p-value is less than or equal to the pre-specified significance level α, i.e., p j α. The adjusted p-values are calculated from expression 5. Similar results also hold for one-sided testing problems. The adjusted p-values for onesided cases are defined analogously, using one-sided multidimensional integrals instead of the two-sided integrals 5. Again, we refer to Genz 1992; Genz and Bretz 1999; Bretz et al and Genz and Bretz 2002 for the numerical details. In addition to a simultaneous test procedure, a at least approximate simultaneous 1 2α 100% confidence interval for ϑ is given by ˆϑ n ±q α D 1/2 n where q α is the 1 α quantile of the distribution asymptotic, if necessary of T n. This quantile can be calculated or approximated via 5, i.e., q α is chosen such that g ν R n,q α = 1 α. The corresponding one-sided versions are defined analogously. 4

7 It should be noted that the simultaneous inference procedures described so far belong to the class of single-step procedures, since a common critical value q α is used for the individual tests. Single-step procedures have the advantage that corresponding simultaneous confidence intervals are easily available, as previously noted. However, single-step procedures can always be improved by stepwise extensions based on the closed test procedure. That is, for a given family of null hypotheses H 1 0,...,H k 0, an individual hypothesis H j 0 is rejected only if all intersection hypotheses H J = i J Hi 0 with j J {1,...,k} are rejected Marcus et al., Such stepwise extensions can thus be applied to any of the methods discussed in this paper, see for example Westfall 1997 and Westfall and Tobias Applications The methodological framework described in Sections 2 and 3 is very general and thus applicable to a wide range of statistical models. Many estimation techniques, such as restricted maximum likelihood and M-estimation, provide at least asymptotically normal estimates of the elemental parameters together with consistent estimates of their covariance matrix. In this section we illustrate the generality of the methodology by reviewing some potential applications. Detailed numerical examples are discussed in Section 6. In what follows, we assume m = 0 only for the sake of simplicity. The next paragraphs highlight a subjective selection of some special cases of practical importance. Multiple Linear Regression. In standard regression models the observations Z i of subject i = 1,...,n consist of a response variable Y i and a vector of covariates X i = X i1,...,x iq, such that Z i = Y i,x i and p = q+1. The response is modelled by a linear combination of the covariates with normal error ε i and constant variance σ 2, Y i = β 0 + q β j X ij +σε i, j=1 whereε = ε 1,...,ε n N n 0,I n.theelementalparametervectorisθ = β 0,β 1,...,β q, which is usually estimated by ˆθ n = X X 1 X Y N q+1 θ,σ 2 X X 1, where Y = Y 1,...,Y n denotes the response vector and X = 1,X ij ij denotes the design matrix, i = 1,...,n,j = 1,...,q. Thus, for every matrix K R k,q+1 of constants determining the experimental questions of interest we have ˆϑ n = Kˆθ n N k Kθ,σ 2 K X X 1 K. 5

8 Under the null hypothesis ϑ = 0 the standardized test statistic follows a multivariate t distribution T n = Dn 1/2 ˆϑ n t q+1 n q,r, where D n = ˆσ 2 diagk X X 1 K is the diagonal matrix of the estimated variances of Kˆθ and R is the correlation matrix as given in Section 3. The body fat prediction example presented in Subsection 6.2 illustrates the application of simultaneous inference procedures in the context of variable selection in linear regression models. One-way ANOVA. Consider a one-way ANOVA model for a factor measured at q levels with a continuous response Y ij = µ+γ j +ε ij 6 and independent normal errors ε ij N 1 0,σ 2,j = 1,...,q,i = 1,...,n j. Note that the model description in 6 is overparameterized. A standard approach is to consider a suitable re-parametrization. The so-called treatment contrast vector θ = µ,γ 2 γ 1,γ 3 γ 1,...,γ q γ 1 is, forexample, the defaultre-parametrization usedas elemental parameters in the R-system for statistical computing R Development Core Team, Many classical multiple comparison procedures can be embedded into this framework, including Dunnett s many-to-one comparisons and Tukey s all-pairwise comparisons. For Dunnett s procedure, the differences γ j γ 1 are tested for all j = 2,...,q, where γ 1 denotes the mean treatment effect of a control group. In the notation from Section 2 we thus have resulting in the parameters of interest K Dunnett = 0,diagq ϑ Dunnett = K Dunnett θ = γ 2 γ 1,γ 3 γ 1,...,γ q γ 1 of interest. For Tukey s procedure, the interest is in all-pairwise comparisons of the parameters γ 1,...,γ q. For q = 3, for example, we have K Tukey = with parameters of interest ϑ Tukey = K Tukey θ = γ 2 γ 1,γ 3 γ 1,γ 2 γ 3. Many further multiple comparison procedures have been investigated in the past, which all fit into this framework. We refer to Bretz et al for a related comprehensive list. 6

9 Note that under the standard ANOVA assumptions of i.i.d. normal errors with constant variance the vector of test statistics T n follows a multivariate t distribution. Thus, related simultaneous tests and confidence intervals do not rely on asymptotics and can be computed analytically instead, as shown in Section 3. To illustrate simultaneous inference procedures in one-way ANOVA models, we consider all pairwise comparisons of expression levels for various genetic conditions of alcoholism in Subsection 6.1. Further parametric models. In generalized linear models, the exact distribution of the parameter estimates is usually unknown and thus the asymptotic normal distribution is the basis for all inference procedures. When we are interested in inference about model parameters corresponding to levels of a certain factor, the same multiple comparison procedures as sketched above are available. Linear and non-linear mixed effects models fitted by restricted maximum-likelihood provide the data analyst with asymptotically normal estimates and a consistent covariance matrix as well so that all assumptions of our framework are met and one can set up simultaneous inference procedures for these models as well. The same is true for the Cox model or other parametric survival models such as the Weibull model. We use logistic regression models to estimated the probability of suffering from Alzheimer s disease in Subsection 6.3, compare several risk factors for survival of leukemia patients by means of a Weibull model in Subsection 6.4 and obtain probability estimates of deer browsing for various tree species from mixed models in Subsection 6.5. Robust simultaneous inference. Yet another application is to use robust variants of the previously discussed statistical models. One possibility is to consider the use of sandwich estimators S n for the covariance matrix covˆθ n when, for example, the variance homogeneity assumption is violated. An alternative is to apply robust estimation techniques in linear models, for example S-, M- or MM-estimation see Rousseeuw and Leroy, 2003; Stefanski and Boos, 2002; Yohai, 1987, for example, which again provide us with asymptotically normal estimates. The reader is referred to Subsection 6.2 for some numerical examples illustrating these ideas. 5 Implementation The multcomp package Hothorn et al., 2008 in R R Development Core Team, 2008 provides a general implementation of the framework for simultaneous inference in semiparametric models described in Sections 2 and 3. The numerical examples in Section 6 will all be analyzed using the multcomp package. In this section we briefly introduce the userinterface and refer the reader to the online documentation of the package for the technical details. 7

10 Estimated model coefficients ˆθ n and their estimated covariance matrix S n are accessible in R via coef and vcov methods available for most statistical models in R, such as objects of class lm, glm, coxph, nlme, mer or survreg. Having this information at hand, the glht function sets up the general linear hypothesis for a model model and a representation of the matrix K via its linfct argument: glhtmodel, linfct, alternative = c"two.sided", "less", "greater", rhs = 0,... The two remaining arguments alternative and rhs define the direction of the alternative see Section 3 and m, respectively. The matrix K can be described in three different ways: ❼ by a matrix with lengthcoefmodel columns, or ❼ by an expression or character vector giving a symbolic description of the linear functions of interest, or ❼ by an object of class mcp for multiple comparison procedure. The last alternative is convenient when contrasts of factor levels are to be compared and the model contrasts used to define the design matrix of the model have to be taken into account. The mcp function takes the name of the factor to be tested as an argument as well as a character defining the type of comparisons as its value. For example, mcptreat = "Tukey" sets up a matrix K for Tukey s all-pairwise comparisons among the levels of the factor treat, which has to appear on right hand side of the model formula of model. In this particular case, we need to assume that model.frame and model.matrix methods for model are available as well. The mcp function must be used with care when defining parameters of interest in two-way ANOVA or ANCOVA models. Here, the definition of treatment differencessuch as Tukey s all-pair comparisons or Dunnett s comparison with a control might be problem-specific. For example, in an ANCOVA model here without intercept term Y ij = γ j +β j X i +ε ij ; j = 1,...,q,i = 1,...,n j the parameters of interest might be γ j γ 1 +β j x β 1 x for some value x of the continuous covariate X rather than the comparisons with a control γ j γ 1 that would be computed by mcp with "Dunnett" option. The same problem occurs when interaction terms are present in a two-way ANOVA model, where the hypotheses might depend on the sample sizes. Because it is impossible to determine the parameters of interest automatically in this case, mcp in multcomp will by default generate comparisons for the main effects γ j only, ignoring covariates and interactions. Since version 1.1-2, one can specify to average over interaction terms and covariates using arguments interaction_average = TRUE and covariate_average = TRUE respectively, whereas versions older than automatically averaged over interaction terms. We suggest to the users, however, that they write out, 8

11 manually, the set of contrasts they want. One should do this whenever there is doubt about what the default contrasts measure, which typically happens in models with higher order interaction terms. We refer to Hsu 1996, Chapter 7, and Searle 1971, Chapter 7.3, for further discussions and examples on this issue. Objects of class glht returned by glht include coef and vcov methods to compute ˆϑ n and S n. Furthermore, a summary method is available to perform different tests max t, χ 2 and F-tests and p-value adjustments, including those taking logical constraints into account Shaffer, 1986; Westfall, In addition, the confint method applied to objects of class glht returns simultaneous confidence intervals and allows for a graphical representation of the results. The numerical accuracy of adjusted p-values and simultaneous confidence intervals implemented in multcomp is continuously checked against results reported by Westfall et al Illustrations 6.1 Genetic Components of Alcoholism Various studies have linked alcohol dependence phenotypes to chromosome 4. One candidate gene is NACP non-amyloid component of plaques, coding for alpha synuclein. Bönsch et al found longer alleles of NACP-REP1 in alcohol-dependent patients compared with healthy controls and report that the allele lengths show some association with levels of expressed alpha synuclein mrna in alcohol-dependent subjects see Figure 1. Allele length is measured as a sum score built from additive dinucleotide repeat length and categorized into three groups: short 0 4, n = 24, intermediate 5 9, n = 58, and long 10 12, n = 15. The data are available from package coin. Here, we are interested in comparing the distribution of the expression level of alpha synuclein mrna in three groups of subjects defined by the allele length. Thus, we fit a simple one-way ANOVA model to the data and define K such that Kθ contains all three group differences Tukey s all-pairwise comparisons: R> data"alpha", package = "coin" R> amod <- aovelevel ~ alength, data = alpha R> amod_glht <- glhtamod, linfct = mcpalength = "Tukey" R> amod_glht\$linfct Intercept alengthintermediate alengthlong intermediate - short long - short long - intermediate attr,"type" [1] "Tukey" 9

12 n = 24 n = 58 n = 15 Expression Level short intermediate long NACP REP1 Allele Length Figure 1: alpha data: Distribution of levels of expressed alpha synuclein mrna in three groups defined by the NACP-REP1 allele lengths. The amod_glht object now contains information about the estimated linear function ˆϑ n andtheircovariancematrixs n whichcanbeinspectedviathecoefandvcovmethods: R> coefamod_glht intermediate - short long - short long - intermediate R> vcovamod_glht intermediate - short long - short long - intermediate intermediate - short long - short long - intermediate The summary and confint methods can be used to compute a summary statistic including adjusted p-values and simultaneous confidence intervals, respectively: R> confintamod_glht Simultaneous Confidence Intervals Multiple Comparisons of Means: Tukey Contrasts Fit: aovformula = elevel ~ alength, data = alpha Quantile = % family-wise confidence level 10

13 Linear Hypotheses: Estimate lwr upr intermediate - short == long - short == long - intermediate == R> summaryamod_glht Simultaneous Tests for General Linear Hypotheses Multiple Comparisons of Means: Tukey Contrasts Fit: aovformula = elevel ~ alength, data = alpha Linear Hypotheses: Estimate Std. Error t value Pr> t intermediate - short == long - short == long - intermediate == Signif. codes: 0 *** ** 0.01 * Adjusted p values reported -- single-step method Because of the variance heterogeneity that can be observed in Figure 1, one might be concerned with the validity of the above results stating that there is no difference between any combination of the three allele lengths. A sandwich estimator S n might be more appropriate in this situation, and the vcov argument can be used to specify a function to compute some alternative covariance estimator S n as follows: R> amod_glht_sw <- glhtamod, linfct = mcpalength = "Tukey", + vcov = sandwich R> summaryamod_glht_sw Simultaneous Tests for General Linear Hypotheses Multiple Comparisons of Means: Tukey Contrasts Fit: aovformula = elevel ~ alength, data = alpha Linear Hypotheses: Estimate Std. Error t value Pr> t intermediate - short == long - short == * long - intermediate == Signif. codes: 0 *** ** 0.01 *

14 Adjusted p values reported -- single-step method We used the sandwich function from package sandwich Zeileis, 2004, 2006 which provides us with a heteroscedasticity-consistent estimator of the covariance matrix. This result is more in line with previously published findings for this study obtained from nonparametric test procedures such as the Kruskal-Wallis test. A comparison of the simultaneous confidence intervals calculated based on the ordinary and sandwich estimator is given in Figure 2. Tukey ordinary S n Tukey sandwich S n intermediate short intermediate short long short long short long intermediate long intermediate Difference Difference Figure 2: alpha data: Simultaneous confidence intervals based on the ordinary covariance matrix left and a sandwich estimator right. 6.2 Prediction of Total Body Fat Garcia et al report on the development of predictive regression equations for body fat content by means of p = 9 common anthropometric measurements which were obtained for n = 71 healthy German women. In addition, the women s body composition was measured by Dual Energy X-Ray Absorptiometry DXA. This reference method is very accurate in measuring body fat but finds little applicability in practical environments, mainly because of high costs and the methodological efforts needed. Therefore, a simple regression equation for predicting DXA measurements of body fat is of special interest for the practitioner. Backward-elimination was applied to select important variables from the available anthropometrical measurements and Garcia et al report a final linear model utilizing hip circumference, knee breadth and a compound covariate which is defined as the sum of log chin skinfold, log triceps skinfold and log subscapular skinfold. Here, we fit the saturated model to the data and use the max-t test over all t-statistics to select important variables based on adjusted p-values. The linear model including all covariates and the classical unadjusted p-values are given by 12

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### Multiple Testing. Joseph P. Romano, Azeem M. Shaikh, and Michael Wolf. Abstract

Multiple Testing Joseph P. Romano, Azeem M. Shaikh, and Michael Wolf Abstract Multiple testing refers to any instance that involves the simultaneous testing of more than one hypothesis. If decisions about

### Generalized Linear Models

Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the

### Simultaneous Inference in General Parametric Models

Simultaneous Inference in General Parametric Models Torsten Hothorn Institut für Statistik Ludwig-Maximilians-Universität München Ludwigstraße 33, D 80539 München, Germany Frank Bretz Statistical Methodology,

### SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

### ANOVA. February 12, 2015

ANOVA February 12, 2015 1 ANOVA models Last time, we discussed the use of categorical variables in multivariate regression. Often, these are encoded as indicator columns in the design matrix. In [1]: %%R

### Statistical Models in R

Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

### Multiple Linear Regression

Multiple Linear Regression A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is

### Multivariate Logistic Regression

1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

### A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps

### Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### Recall this chart that showed how most of our course would be organized:

Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

### Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

### Uncertainty quantification for the family-wise error rate in multivariate copula models

Uncertainty quantification for the family-wise error rate in multivariate copula models Thorsten Dickhaus (joint work with Taras Bodnar, Jakob Gierl and Jens Stange) University of Bremen Institute for

### Classical Hypothesis Testing in R. R can do all of the common analyses that are available in SPSS, including:

Classical Hypothesis Testing in R R can do all of the common analyses that are available in SPSS, including: Classical Hypothesis Testing in R R can do all of the common analyses that are available in

### Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

### Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups

Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups Achim Zeileis, Torsten Hothorn, Kurt Hornik http://eeecon.uibk.ac.at/~zeileis/ Overview Motivation: Trees, leaves, and

### Chapter 4: Statistical Hypothesis Testing

Chapter 4: Statistical Hypothesis Testing Christophe Hurlin November 20, 2015 Christophe Hurlin () Advanced Econometrics - Master ESA November 20, 2015 1 / 225 Section 1 Introduction Christophe Hurlin

### MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

### Regression in ANOVA. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Regression in ANOVA James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) 1 / 30 Regression in ANOVA 1 Introduction 2 Basic Linear

### Statistical Models in R

Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Linear Models in R Regression Regression analysis is the appropriate

### Data Analysis. Lecture Empirical Model Building and Methods (Empirische Modellbildung und Methoden) SS Analysis of Experiments - Introduction

Data Analysis Lecture Empirical Model Building and Methods (Empirische Modellbildung und Methoden) Prof. Dr. Dr. h.c. Dieter Rombach Dr. Andreas Jedlitschka SS 2014 Analysis of Experiments - Introduction

### Analysis of Variance. MINITAB User s Guide 2 3-1

3 Analysis of Variance Analysis of Variance Overview, 3-2 One-Way Analysis of Variance, 3-5 Two-Way Analysis of Variance, 3-11 Analysis of Means, 3-13 Overview of Balanced ANOVA and GLM, 3-18 Balanced

### Implementing a Class of Structural Change Tests: An Econometric Computing Approach

Implementing a Class of Structural Change Tests: An Econometric Computing Approach Achim Zeileis http://www.ci.tuwien.ac.at/~zeileis/ Contents Why should we want to do tests for structural change, econometric

### An analysis method for a quantitative outcome and two categorical explanatory variables.

Chapter 11 Two-Way ANOVA An analysis method for a quantitative outcome and two categorical explanatory variables. If an experiment has a quantitative outcome and two categorical explanatory variables that

### STATISTICS 8, FINAL EXAM. Last six digits of Student ID#: Circle your Discussion Section: 1 2 3 4

STATISTICS 8, FINAL EXAM NAME: KEY Seat Number: Last six digits of Student ID#: Circle your Discussion Section: 1 2 3 4 Make sure you have 8 pages. You will be provided with a table as well, as a separate

### Multiple Comparisons Using R

Multiple Comparisons Using R Multiple Comparisons Using R Frank Bretz Torsten Hothorn Peter Westfall Chapman & Hall/CRC Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742

### NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

### 11. Analysis of Case-control Studies Logistic Regression

Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

### Maximally Selected Rank Statistics in R

Maximally Selected Rank Statistics in R by Torsten Hothorn and Berthold Lausen This document gives some examples on how to use the maxstat package and is basically an extention to Hothorn and Lausen (2002).

### Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne

Applied Statistics J. Blanchet and J. Wadsworth Institute of Mathematics, Analysis, and Applications EPF Lausanne An MSc Course for Applied Mathematicians, Fall 2012 Outline 1 Model Comparison 2 Model

### Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

### 11/20/2014. Correlational research is used to describe the relationship between two or more naturally occurring variables.

Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection

### Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

### Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

### Week 5: Multiple Linear Regression

BUS41100 Applied Regression Analysis Week 5: Multiple Linear Regression Parameter estimation and inference, forecasting, diagnostics, dummy variables Robert B. Gramacy The University of Chicago Booth School

### E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F

Random and Mixed Effects Models (Ch. 10) Random effects models are very useful when the observations are sampled in a highly structured way. The basic idea is that the error associated with any linear,

### COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES.

277 CHAPTER VI COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES. This chapter contains a full discussion of customer loyalty comparisons between private and public insurance companies

### Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

### An example ANOVA situation. 1-Way ANOVA. Some notation for ANOVA. Are these differences significant? Example (Treating Blisters)

An example ANOVA situation Example (Treating Blisters) 1-Way ANOVA MATH 143 Department of Mathematics and Statistics Calvin College Subjects: 25 patients with blisters Treatments: Treatment A, Treatment

### We extended the additive model in two variables to the interaction model by adding a third term to the equation.

Quadratic Models We extended the additive model in two variables to the interaction model by adding a third term to the equation. Similarly, we can extend the linear model in one variable to the quadratic

### Hypothesis Testing Level I Quantitative Methods. IFT Notes for the CFA exam

Hypothesis Testing 2014 Level I Quantitative Methods IFT Notes for the CFA exam Contents 1. Introduction... 3 2. Hypothesis Testing... 3 3. Hypothesis Tests Concerning the Mean... 10 4. Hypothesis Tests

### Survival Analysis of Left Truncated Income Protection Insurance Data. [March 29, 2012]

Survival Analysis of Left Truncated Income Protection Insurance Data [March 29, 2012] 1 Qing Liu 2 David Pitt 3 Yan Wang 4 Xueyuan Wu Abstract One of the main characteristics of Income Protection Insurance

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### Tutorial 5: Hypothesis Testing

Tutorial 5: Hypothesis Testing Rob Nicholls nicholls@mrc-lmb.cam.ac.uk MRC LMB Statistics Course 2014 Contents 1 Introduction................................ 1 2 Testing distributional assumptions....................

### Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This

### Handling attrition and non-response in longitudinal data

Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein

### Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software

STATA Tutorial Professor Erdinç Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software 1.Wald Test Wald Test is used

### From the help desk: Bootstrapped standard errors

The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution

### R: A Free Software Project in Statistical Computing

R: A Free Software Project in Statistical Computing Achim Zeileis Institut für Statistik & Wahrscheinlichkeitstheorie http://www.ci.tuwien.ac.at/~zeileis/ Acknowledgments Thanks: Alex Smola & Machine Learning

### UNDERSTANDING THE TWO-WAY ANOVA

UNDERSTANDING THE e have seen how the one-way ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables

### Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month

### Chapter 7. One-way ANOVA

Chapter 7 One-way ANOVA One-way ANOVA examines equality of population means for a quantitative outcome and a single categorical explanatory variable with any number of levels. The t-test of Chapter 6 looks

### Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups

Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln Log-Rank Test for More Than Two Groups Prepared by Harlan Sayles (SRAM) Revised by Julia Soulakova (Statistics)

### data visualization and regression

data visualization and regression Sepal.Length 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 I. setosa I. versicolor I. virginica I. setosa I. versicolor I. virginica Species Species

### Applications of R Software in Bayesian Data Analysis

Article International Journal of Information Science and System, 2012, 1(1): 7-23 International Journal of Information Science and System Journal homepage: www.modernscientificpress.com/journals/ijinfosci.aspx

### Econometric Analysis of Cross Section and Panel Data Second Edition. Jeffrey M. Wooldridge. The MIT Press Cambridge, Massachusetts London, England

Econometric Analysis of Cross Section and Panel Data Second Edition Jeffrey M. Wooldridge The MIT Press Cambridge, Massachusetts London, England Preface Acknowledgments xxi xxix I INTRODUCTION AND BACKGROUND

### Statistics Review PSY379

Statistics Review PSY379 Basic concepts Measurement scales Populations vs. samples Continuous vs. discrete variable Independent vs. dependent variable Descriptive vs. inferential stats Common analyses

### Chapter 5 Analysis of variance SPSS Analysis of variance

Chapter 5 Analysis of variance SPSS Analysis of variance Data file used: gss.sav How to get there: Analyze Compare Means One-way ANOVA To test the null hypothesis that several population means are equal,

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### ANOVA Analysis of Variance

ANOVA Analysis of Variance What is ANOVA and why do we use it? Can test hypotheses about mean differences between more than 2 samples. Can also make inferences about the effects of several different IVs,

### Chapter 7 Section 7.1: Inference for the Mean of a Population

Chapter 7 Section 7.1: Inference for the Mean of a Population Now let s look at a similar situation Take an SRS of size n Normal Population : N(, ). Both and are unknown parameters. Unlike what we used

### INTRODUCTORY STATISTICS

INTRODUCTORY STATISTICS FIFTH EDITION Thomas H. Wonnacott University of Western Ontario Ronald J. Wonnacott University of Western Ontario WILEY JOHN WILEY & SONS New York Chichester Brisbane Toronto Singapore

### paircompviz: An R Package for Visualization of Multiple Pairwise Comparison Test Results

paircompviz: An R Package for Visualization of Multiple Pairwise Comparison Test Results Michal Burda University of Ostrava Abstract The aim of this paper is to describe an R package for visualization

### Biostatistics: Types of Data Analysis

Biostatistics: Types of Data Analysis Theresa A Scott, MS Vanderbilt University Department of Biostatistics theresa.scott@vanderbilt.edu http://biostat.mc.vanderbilt.edu/theresascott Theresa A Scott, MS

### Correlation and Simple Linear Regression

Correlation and Simple Linear Regression We are often interested in studying the relationship among variables to determine whether they are associated with one another. When we think that changes in a

### Lecture 8: Gamma regression

Lecture 8: Gamma regression Claudia Czado TU München c (Claudia Czado, TU Munich) ZFS/IMS Göttingen 2004 0 Overview Models with constant coefficient of variation Gamma regression: estimation and testing

### SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way

### Psychology 205: Research Methods in Psychology

Psychology 205: Research Methods in Psychology Using R to analyze the data for study 2 Department of Psychology Northwestern University Evanston, Illinois USA November, 2012 1 / 38 Outline 1 Getting ready

### Inferential Statistics

Inferential Statistics Sampling and the normal distribution Z-scores Confidence levels and intervals Hypothesis testing Commonly used statistical methods Inferential Statistics Descriptive statistics are

### False Discovery Rates

False Discovery Rates John D. Storey Princeton University, Princeton, USA January 2010 Multiple Hypothesis Testing In hypothesis testing, statistical significance is typically based on calculations involving

### 10. Analysis of Longitudinal Studies Repeat-measures analysis

Research Methods II 99 10. Analysis of Longitudinal Studies Repeat-measures analysis This chapter builds on the concepts and methods described in Chapters 7 and 8 of Mother and Child Health: Research methods.

### GLM I An Introduction to Generalized Linear Models

GLM I An Introduction to Generalized Linear Models CAS Ratemaking and Product Management Seminar March 2009 Presented by: Tanya D. Havlicek, Actuarial Assistant 0 ANTITRUST Notice The Casualty Actuarial

### INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA)

INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA) As with other parametric statistics, we begin the one-way ANOVA with a test of the underlying assumptions. Our first assumption is the assumption of

### Poisson Models for Count Data

Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

### DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,

### N-Way Analysis of Variance

N-Way Analysis of Variance 1 Introduction A good example when to use a n-way ANOVA is for a factorial design. A factorial design is an efficient way to conduct an experiment. Each observation has data

### CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 3 Problems with small populations 8 II. Why Random Sampling is Important 9 A myth,

### Nonlinear Regression:

Zurich University of Applied Sciences School of Engineering IDP Institute of Data Analysis and Process Design Nonlinear Regression: A Powerful Tool With Considerable Complexity Half-Day : Improved Inference

### The Variability of P-Values. Summary

The Variability of P-Values Dennis D. Boos Department of Statistics North Carolina State University Raleigh, NC 27695-8203 boos@stat.ncsu.edu August 15, 2009 NC State Statistics Departement Tech Report

### Partial Least Squares (PLS) Regression.

Partial Least Squares (PLS) Regression. Hervé Abdi 1 The University of Texas at Dallas Introduction Pls regression is a recent technique that generalizes and combines features from principal component

### Simple Linear Regression Inference

Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

### LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

### Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

### SCHOOL OF MATHEMATICS AND STATISTICS

RESTRICTED OPEN BOOK EXAMINATION (Not to be removed from the examination hall) Data provided: Statistics Tables by H.R. Neave MAS5052 SCHOOL OF MATHEMATICS AND STATISTICS Basic Statistics Spring Semester

### II. DISTRIBUTIONS distribution normal distribution. standard scores

Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

### Notes for STA 437/1005 Methods for Multivariate Data

Notes for STA 437/1005 Methods for Multivariate Data Radford M. Neal, 26 November 2010 Random Vectors Notation: Let X be a random vector with p elements, so that X = [X 1,..., X p ], where denotes transpose.

### Simple Predictive Analytics Curtis Seare

Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

### Permutation Tests for Comparing Two Populations

Permutation Tests for Comparing Two Populations Ferry Butar Butar, Ph.D. Jae-Wan Park Abstract Permutation tests for comparing two populations could be widely used in practice because of flexibility of

### GLM, insurance pricing & big data: paying attention to convergence issues.

GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK - michael.noack@addactis.com Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.

### Publication List. Chen Zehua Department of Statistics & Applied Probability National University of Singapore

Publication List Chen Zehua Department of Statistics & Applied Probability National University of Singapore Publications Journal Papers 1. Y. He and Z. Chen (2014). A sequential procedure for feature selection

Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL

### 7. Tests of association and Linear Regression

7. Tests of association and Linear Regression In this chapter we consider 1. Tests of Association for 2 qualitative variables. 2. Measures of the strength of linear association between 2 quantitative variables.

### BayesX - Software for Bayesian Inference in Structured Additive Regression

BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich

### Using R for Linear Regression

Using R for Linear Regression In the following handout words and symbols in bold are R functions and words and symbols in italics are entries supplied by the user; underlined words and symbols are optional

### X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

### For example, enter the following data in three COLUMNS in a new View window.

Statistics with Statview - 18 Paired t-test A paired t-test compares two groups of measurements when the data in the two groups are in some way paired between the groups (e.g., before and after on the