Tests in a case control design including relatives

Stefanie Biedermann (i), Eva Nagel (i), Axel Munk (ii), Hajo Holzmann (ii), Ansgar Steland (i)

Abstract

We present a new approach to handling dependent data arising from relatives in case control studies in genetic epidemiology. A general theorem is proved that characterizes the joint asymptotic distribution of the entries of the associated contingency table under the null hypothesis of independence of phenotype and genotype at the observed locus. This allows us to handle several common test statistics, namely the attributable risk, the relative risk, and the odds ratio. Simulations reveal good finite sample behavior of these test statistics. Our results can also be applied to more general settings in medicine or the natural sciences whenever data from related subjects are involved.

Keywords and Phrases: allele frequency, case control design, causal locus, contingency tables, dependent data, likelihood ratio test, risk measures.

1 Introduction

The analysis of contingency tables is an important tool in epidemiological studies; see, e.g., [12]. For the typical situation of comparing the relative frequency of a certain feature among two samples of individuals (cases and controls), there is an extensive amount of literature on statistical tests within this setting as long as the data are independent. When data from relatives are included in a case control study, the use of the present methods is subject to severe limitations.

(i) Ruhr-Universität Bochum, Fakultät für Mathematik, 44780 Bochum, Germany
(ii) Georg-August-Universität Göttingen, Institut für Mathematische Stochastik, 37073 Göttingen, Germany
The aim of this article is to give a new approach to handling data from relatives within a case control study. It is based on tests in contingency tables and considers the well-known risk measures, such as the odds ratio, the attributable risk and the relative risk, from a new point of view. We prove a result on the asymptotic distribution of the entries of the contingency table under the null hypothesis that the feature under consideration does not influence the phenotype. From this we derive the asymptotic distributions of the test statistics under consideration, enabling us to derive critical values for the test problem. Our method is widely applicable to all kinds of tests that can be derived from contingency tables and to data from relatives in all fields of science.

It is still a major challenge of genetic research to explore the genetic component of common diseases such as hypertension or asthma. Therefore, there is considerable interest in statistical methods to test candidate markers and to find predisposing alleles. Much effort has been devoted to this problem, deriving several methods under the assumption of independent data; see, e.g., [2], [7], [9], [11], [16] or [19]. This assumption, however, is violated if two individuals in the sample are related. Particularly in case of a rare disease, researchers cannot afford to include only unrelated subjects in a study, since this will inevitably result in a severe loss of power.

In recent years, some methods have been published which take dependencies among the data into account; see, e.g., [18] and [3], who pursue likelihood-based approaches, or [4], [5] and [17], who consider score tests derived from identity by descent (IBD) data from affected sib-pairs. In [15] the test for trend in proportions introduced by [1] is extended to the case where family data must be accounted for by computing the exact variance of the test statistic. The approaches described above depend sensitively on several nuisance parameters from the underlying genetic model as well as on the model itself. To provide nonparametric asymptotic level α tests we propose nonparametric estimators for these parameters, and simultaneously allow for dependent data.

This article is organized as follows. Section 2 deals with a more detailed discussion of the genetic model under consideration. The proposed tests and their asymptotic
distributions are given in Section 3. Section 4 provides various important extensions of our method, e.g., the inclusion of more different types of relatives, missing data, polygenic disorders or different inheritance models. The finite sample properties of the tests under parameter settings of practical interest are studied by simulations in Section 5, including comparisons to the corresponding tests using data from independent individuals only. The proofs of our results, finally, are deferred to an appendix.

2 Genetic model and assumptions

At first glance it seems that our assumptions concerning the number of candidate loci (one) and the mode of inheritance (dominant) imposed at this stage are quite restrictive, but we show in Section 4 how to apply our approach to an arbitrary number of markers and predisposing loci as well as to different modes of inheritance. Hence, for simplicity of motivation and derivation of results, let us assume that there are two biallelic loci L1 and L2 with alleles A and Ā. L1 is the candidate locus where data are observed, whereas L2 denotes the true but unknown causal locus. We want to ascertain if there is significant evidence for linkage disequilibrium between L1 and L2 by comparing the relative frequencies of allele A at L1 in the case and the control group, respectively.

We assume a dominant inheritance model, i.e. for any individual the probability of being affected given there is at least one allele A present at locus L2 is equal to the penetrance f, whereas individuals without any allele A at L2 are affected with probability zero. The penetrance f is a positive but unknown constant, f ≤ 1.

We consider the following study setup. A data set containing data from n_1 affected children is randomly sampled from the population as the case group. Similarly, a random sample of n_2 unaffected children forms the basis of the control group. Then the total number of children is n = n_1 + n_2. There are no degrees of relationship allowed among these children, to avoid dependencies within the data at this stage of the experiment. The sample consisting of data of unrelated children only will be
referred to as the basic sample in the following. The children in both groups are then tested for the presence of allele A at locus L1, which is suspected to have an impact on developing the disease. More precisely, to each child we assign a random variable C_i, i = 1, ..., n, which counts the number of occurrences of allele A at the candidate locus L1, i.e.

C_i = 2  if allele A occurs twice at locus L1 of child i,
      1  if allele A occurs exactly once at locus L1 of child i,
      0  otherwise.

The classic case control design introduced so far will now be enriched by dependent data from relatives from the associated nuclear families, i.e. the data consists of the children and their parents, bringing the total number of individuals taking part in the study to 3n. The random variables corresponding to parental observations are defined equivalently and denoted by A_i and B_i, where A_i is given by

A_i = 2  if allele A occurs twice at locus L1 of the first parent of child i,
      1  if allele A occurs exactly once at locus L1 of the first parent of child i,
      0  otherwise,

and B_i is defined analogously for the second parent. From the above study setup, it follows that there are dependencies among the data and, as a further complication, the number of parents belonging to either the case or the control group is also random. More precisely, the numbers of affected or unaffected first and second parents conditioned on the children's phenotype are random numbers given by

R_1 = number of affected first parents with affected children,
R_2 = number of affected first parents with unaffected children,
R_3 = number of unaffected first parents with unaffected children,
R_4 = number of unaffected first parents with affected children,

and S_1, ..., S_4 are defined analogously for the second parent. Evidently, the random variables R_k, S_k, k = 1, ..., 4, are linked by the equalities R_1 + R_4 = n_1, R_2 + R_3 = n_2, S_1 + S_4 = n_1 and S_2 + S_3 = n_2. Furthermore, these variables follow binomial distributions specified as follows: R_1 ~ Bin(n_1, p_1), R_2 ~ Bin(n_2, p_2) and S_1 ~ Bin(n_1, p_1), S_2 ~ Bin(n_2, p_2), where the parameters p_1, p_2 denote the probabilities that a parent is affected given the corresponding child is affected or unaffected, respectively. Finally, the covariance structure among these random variables is characterized by the parameters p_3 and p_4 with Cov(R_1, S_1) = n_1 p_3 and Cov(R_2, S_2) = n_2 p_4.

Furthermore, in this article we suppose that the following standard assumptions on the various processes involved in inheritance are satisfied.

(A1) The random processes yielding the phenotype given the genotype are independent for different individuals.
(A2) Gamete formation is independent of phenotype.
(A3) Hardy-Weinberg equilibrium holds for each locus.
(A4) Offspring data are randomly sampled from the two groups in the population.

Our interest is in testing whether there is an influence of the observed genotype at locus L1 on the phenotype. We therefore test the null hypothesis H_0 that there is no linkage disequilibrium between the observed locus L1 and the disease locus L2 against the alternative H_1 of the existence of linkage disequilibrium (LD) between these two loci. Note that the linkage disequilibrium coefficient δ describing the LD between alleles at the two loci L1 and L2 within the population is given by δ = δ_{AA} = h_{AA} - p_{A0} p_{2A0}, where h_{AA} denotes the haplotype frequency of alleles A, A at loci L1 and L2, and p_{A0}, p_{2A0} stand for the allele frequencies at L1 and L2, respectively. Recall that a
single parameter is sufficient to describe the LD between two biallelic loci since the LD coefficients between arbitrary alleles at these loci only differ in sign. In terms of the LD coefficient δ, the testing problem is thus given by testing the null hypothesis

(1)   H_0 : δ = 0

against either the one-sided or the two-sided alternative

(2)   H_1 : δ > 0   or   H_1 : δ ≠ 0.

In order to construct tests for these hypotheses on the basis of observations, i.e. realizations of the random variables C_i, A_i, i = 1, ..., n, we reformulate the above hypotheses (1) and (2) in terms of the allele frequencies of A at L1 in the two respective groups. Let p_v denote the frequency of A at L1 among the affected individuals in the population, and p_w the corresponding term among the unaffected individuals. The parameters p_v and p_w can be expressed by δ, p_{A0} and p_{2A0} as follows:

(3)   p_v = p_{A0} + δ (1 - p_{2A0}) / (p_{2A0} (2 - p_{2A0})),
      p_w = p_{A0} - δ f (1 - p_{2A0}) / (1 - f p_{2A0} (2 - p_{2A0})).

We note that p_w also depends on the value of the penetrance f. The hypotheses (1) and (2) can thus be reformulated equivalently in the following way:

(4)   H_0 : p_v = p_w = p_{A0}

(5)   H_1 : p_v > p_w   or   H_1 : p_v ≠ p_w.

3 Test statistics and their asymptotic distributions

The null hypothesis H_0 implies that the allele frequency of A at locus L1 is the same among affected and unaffected individuals in the population. It is therefore reasonable
to consider test statistics based on differences or ratios of estimators for the allele frequencies p_v and p_w in the case and the control group to detect deviations from the null hypothesis. We choose as estimators the empirical counterparts of p_v and p_w in the two respective groups, which can easily be obtained from the corresponding contingency table (see Table 1). The sets S_A and S_B occurring in Table 1 are defined as the sets of indices i, 1 ≤ i ≤ n, such that the i-th first or second parent is affected, respectively, i.e.

S_A = {i | i-th first parent affected, 1 ≤ i ≤ n},   S_B = {i | i-th second parent affected, 1 ≤ i ≤ n}.

Insert Table 1

The empirical allele frequencies p̂_v and p̂_w within the case and the control group, respectively, can be expressed as functions of the entries of Table 1 as follows:

p̂_v = N_1 / (N_1 + N_3),   p̂_w = N_2 / (6n - N_1 - N_3).

From Table 1, numerous test statistics can be derived. Since our aim is to compare in some sense the empirical allele frequencies p̂_v and p̂_w in the two respective groups, we focus on functions which measure the difference between two probabilities. Comparing probabilities such as allele frequencies is usually done by calculating the values of the corresponding risk measures, i.e. the attributable risk, the odds ratio, and the relative risk. If the sample allele frequencies instead of the true but unknown allele frequencies are inserted, we can use the risk measures as test statistics. This procedure yields the following test statistics:

attributable risk   T_{n1} = N_1/(N_1 + N_3) - N_2/(6n - N_1 - N_3) = p̂_v - p̂_w,

odds ratio          T_{n2} = N_1 (6n - N_1 - N_2 - N_3) / (N_2 N_3) = p̂_v (1 - p̂_w) / (p̂_w (1 - p̂_v)),

relative risk       T_{n3} = N_1 (6n - N_1 - N_3) / (N_2 (N_1 + N_3)) = p̂_v / p̂_w.

We are now ready to present the main result of this article, which gives the joint asymptotic distribution of the empirical allele frequencies p̂_v and p̂_w under the null
hypothesis H_0. This result enables us to calculate the asymptotic distributions of the test statistics T_{n1}, T_{n2} and T_{n3}, respectively, thus giving the asymptotic critical values. The proof of Theorem 1 is deferred to the appendix. Since under H_0 we have Var(C_1) = Var(A_1) = Var(B_1) and Cov(C_1, A_1) = Cov(C_1, B_1), the results are given in terms of Var(C_1) and Cov(C_1, A_1) instead of these five different expressions, for brevity of notation.

Theorem 1 Assume (A1)-(A4) and that n_1/n tends to c ∈ (0, 1). Under the null hypothesis δ = 0, the asymptotic distribution of the empirical allele frequencies is given by

(6)   √n (p̂_v - p_{A0}, p̂_w - p_{A0})^T  →_D  N(0, Σ),

where the entries of the covariance matrix Σ are given by

Σ_{1,1} = Var(C_1)/(12 t_1) + c p_1 Cov(C_1, A_1)/(9 t_1^2),
Σ_{1,2} = Σ_{2,1} = Cov(C_1, A_1) (c(1 - p_1) + (1 - c)p_2)/(18 t_1(1 - t_1)),
Σ_{2,2} = Var(C_1)/(12 (1 - t_1)) + (1 - c)(1 - p_2) Cov(C_1, A_1)/(9 (1 - t_1)^2),

with Var(C_1) = 2p_{A0}(1 - p_{A0}) and Cov(C_1, A_1) = p_{A0}(1 - p_{A0}). The parameter t_1 denotes the asymptotic expected percentage of affected individuals in the study, i.e.

t_1 = lim_{n→∞} E[(n_1 + R_1 + R_2 + S_1 + S_2)/(3n)] = (c + 2c p_1 + 2(1 - c)p_2)/3.

To find critical values for the tests under consideration, we will utilize the above result to derive the asymptotic distributions of the test statistics T_{n1}, T_{n2} and T_{n3}.

3.1 Asymptotic distributions of the test statistics

The asymptotic distributions of the statistics T_{n1}, T_{n2} and T_{n3} (attributable risk, odds ratio, and relative risk) are given in the following corollary, which is an application of the delta method. We consider the logarithm of T_{n2} and T_{n3}, since these test statistics were found to preserve the nominal significance level more precisely than the original versions.
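Before stating the corollary, a minimal Python sketch (not part of the proposed methodology; the function name and the example counts are purely illustrative) shows how p̂_v, p̂_w and the statistics T_{n1}, log(T_{n2}) and log(T_{n3}) are computed from the counts N_1, N_2, N_3 of Table 1.

import math

def risk_statistics(N1, N2, N3, n):
    """Compute T_n1, log(T_n2) and log(T_n3) from the counts of Table 1.

    N1, N3 : numbers of A and Abar alleles at L1 in the case group,
    N2     : number of A alleles at L1 in the control group,
    n      : number of children (the study comprises 3n individuals, i.e. 6n alleles).
    """
    p_v = N1 / (N1 + N3)           # empirical allele frequency in the case group
    p_w = N2 / (6 * n - N1 - N3)   # empirical allele frequency in the control group

    t_n1 = p_v - p_w                                       # attributable risk
    t_n2 = math.log(p_v * (1 - p_w) / (p_w * (1 - p_v)))   # log odds ratio
    t_n3 = math.log(p_v / p_w)                             # log relative risk
    return t_n1, t_n2, t_n3

# purely illustrative counts
print(risk_statistics(N1=150, N2=130, N3=210, n=100))

Critical values for these statistics are then obtained from the normal limits given in the following corollary, after the asymptotic variances have been replaced by the estimates of Section 3.2.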
Corollary 2 Under the assumptions of Theorem 1, the asymptotic distributions of the test statistics T_{n1}, log(T_{n2}) and log(T_{n3}) under H_0 are given by

(7)   √n T_{n1} →_D N(0, σ_1^2),   √n log(T_{ni}) →_D N(0, σ_i^2),   i = 2, 3,

where the asymptotic variances σ_1^2, σ_2^2 and σ_3^2 are obtained as

(8)
σ_1^2 = Var(C_1)/(12 t_1(1 - t_1)) + Cov(C_1, A_1)(c p_1 - (c + c p_1 + (1 - c)p_2)t_1 + t_1^2)/(9 t_1^2(1 - t_1)^2)
      = p_{A0}(1 - p_{A0})(2c p_1 + (3 - c)t_1 - 4t_1^2)/(18 t_1^2(1 - t_1)^2),

σ_2^2 = Var(C_1)/(12 t_1(1 - t_1) p_{A0}^2(1 - p_{A0})^2) + Cov(C_1, A_1)(c p_1 - (c + c p_1 + (1 - c)p_2)t_1 + t_1^2)/(9 t_1^2(1 - t_1)^2 p_{A0}^2(1 - p_{A0})^2)
      = (2c p_1 + (3 - c)t_1 - 4t_1^2)/(18 t_1^2(1 - t_1)^2 p_{A0}(1 - p_{A0})),

σ_3^2 = Var(C_1)/(12 t_1(1 - t_1) p_{A0}^2) + Cov(C_1, A_1)(c p_1 - (c + c p_1 + (1 - c)p_2)t_1 + t_1^2)/(9 t_1^2(1 - t_1)^2 p_{A0}^2)
      = (1 - p_{A0})(2c p_1 + (3 - c)t_1 - 4t_1^2)/(18 t_1^2(1 - t_1)^2 p_{A0}).

Remark 3 If only the independent random variables C_i, i = 1, ..., n, from the basic sample are included in the test, the distributions of T_{n1}, log(T_{n2}) and log(T_{n3}) under the null hypothesis H_0 also feature √n-convergence to centered normal distributions, with asymptotic variances σ_{1,c}^2, σ_{2,c}^2 and σ_{3,c}^2 given by

(9)   σ_{1,c}^2 = Var(C_1)/(4c(1 - c)),   σ_{2,c}^2 = Var(C_1)/(4c(1 - c) p_{A0}^2(1 - p_{A0})^2),   σ_{3,c}^2 = Var(C_1)/(4c(1 - c) p_{A0}^2).

Since c = lim_{n→∞} n_1/n denotes the asymptotic fraction of affected individuals from the basic sample, this quantity corresponds exactly to the expression t_1 in the variances σ_1^2, σ_2^2 and σ_3^2 from Corollary 2. Keeping in mind that the overall number of individuals taking part in the study has tripled by including the parental data A_i, B_i, i = 1, ..., n, the first addend of σ_i^2, i = 1, 2, 3, from (8) is consistent with σ_{i,c}^2, i = 1, 2, 3, from (9), respectively. The second addend of σ_i^2, i = 1, 2, 3, involving the covariance between
offspring and parental data has no corresponding term in σ_{i,c}^2, i = 1, 2, 3, due to the requirement that the individuals from the basic sample are unrelated. We can thus find the expression for σ_{i,c}^2, i = 1, 2, 3, as a special case of σ_i^2, i = 1, 2, 3, given the parental (and therefore dependent) data are excluded.

3.2 Estimation of unknown parameters

To construct statistical tests using the asymptotic normality of the test statistics T_{n1}, log(T_{n2}) and log(T_{n3}), we have to estimate the asymptotic variances. Note that σ_1^2, σ_2^2 and σ_3^2 depend on the unknown parameters p_{A0}, p_1 and p_2, but these parameters can be estimated in practice. By Slutsky's Theorem, the results given in Corollary 2 still hold if σ_1^2, σ_2^2 and σ_3^2 are replaced by consistent estimators. Under the null hypothesis H_0, the allele frequency p_{A0} of A at L1 in the two respective groups and the probabilities p_1 and p_2 of a parent being affected given the corresponding child is affected or unaffected, respectively, can be estimated unbiasedly and √n-consistently by the sample means of the corresponding random variables, i.e.

p̂_{A0} = (1/(6n)) Σ_{i=1}^{n} (C_i + A_i + B_i),   p̂_1 = (R_1 + S_1)/(2n_1),   p̂_2 = (R_2 + S_2)/(2n_2).

Note that for the estimation of p_{A0} we have to divide by 6n since C_i, A_i and B_i, i = 1, ..., n, all have expectation 2p_{A0}. For estimating the variance Var(C_1) = Var(A_1) = Var(B_1) and the covariance Cov(C_1, A_1) = Cov(C_1, B_1), there are several possibilities. Firstly, one could simply replace p_{A0} by its estimator p̂_{A0} in the relations Var(C_1) = 2p_{A0}(1 - p_{A0}) and Cov(C_1, A_1) = p_{A0}(1 - p_{A0}). Alternatively, one could use the estimators v̂ for Var(C_1) and ĉov for Cov(C_1, A_1) given by

v̂ = (1/(3n)) Σ_{i=1}^{n} (C_i^2 + A_i^2 + B_i^2) - (1/(9n(n - 1))) Σ_{i=1}^{n} Σ_{j≠i} (C_i + A_i + B_i)(C_j + A_j + B_j),

ĉov = (1/(2n)) Σ_{i=1}^{n} (C_i A_i + C_i B_i) - (1/(9n(n - 1))) Σ_{i=1}^{n} Σ_{j≠i} (C_i + A_i + B_i)(C_j + A_j + B_j).
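For illustration, the sketch below (hypothetical helper names; the data are assumed to be ordered with the n_1 case families first) evaluates p̂_{A0}, p̂_1, p̂_2, v̂ and ĉov; the double sum over i ≠ j is computed via the identity Σ_{i≠j} T_i T_j = (Σ_i T_i)^2 - Σ_i T_i^2.

import numpy as np

def estimate_parameters(C, A, B, aff_first, aff_second, n1):
    """Nonparametric estimators from Section 3.2 (illustrative sketch).

    C, A, B               : counts of allele A at L1 (0, 1 or 2) for child, first and
                            second parent; the n1 case families come first.
    aff_first, aff_second : 0/1 affection indicators of the first and second parent.
    n1                    : number of affected children (cases).
    """
    C, A, B = map(np.asarray, (C, A, B))
    aff_first, aff_second = np.asarray(aff_first), np.asarray(aff_second)
    n = len(C)
    n2 = n - n1

    p_A0_hat = (C + A + B).sum() / (6 * n)
    p1_hat = (aff_first[:n1].sum() + aff_second[:n1].sum()) / (2 * n1)   # (R_1 + S_1)/(2 n_1)
    p2_hat = (aff_first[n1:].sum() + aff_second[n1:].sum()) / (2 * n2)   # (R_2 + S_2)/(2 n_2)

    T = C + A + B                                                # family totals
    cross = (T.sum() ** 2 - (T ** 2).sum()) / (9 * n * (n - 1))  # (1/(9n(n-1))) sum_{i != j} T_i T_j
    v_hat = (C ** 2 + A ** 2 + B ** 2).sum() / (3 * n) - cross   # estimator of Var(C_1)
    cov_hat = (C * A + C * B).sum() / (2 * n) - cross            # estimator of Cov(C_1, A_1)
    return p_A0_hat, p1_hat, p2_hat, v_hat, cov_hat

Plugging these estimates into (8) yields consistent estimates of σ_1^2, σ_2^2 and σ_3^2, which, by Slutsky's Theorem as noted above, may replace the true variances when computing critical values.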
Straightforward calculations show that these estimators are unbiased with variance converging to 0 at a rate of 1/n. Finally, note that the parameters p_3 and p_4, which describe the dependency structure between the parental phenotypes given the offspring phenotype, do not appear in the asymptotic variances of the test statistics in (8).

4 Extensions

From a practical viewpoint, the generalization to more general types of relatives than trios, such as the additional inclusion of sib data, may be a concern. Following [11], [16] and [10], a gain in power can be achieved by using data from related cases and unrelated controls to increase the allele frequency difference between the two groups. In this design only affected relatives of affected children are genotyped to enrich the data set. It is even possible to allow for missing data under a missing at random assumption. Since additional nuisance parameters then appear, more detailed investigations of these issues should be a subject of future research.

We introduced our method for a monogenic disease and a single marker locus for the sake of clarity and brevity of this article. We now explain how the proposed tests can be extended to genetically more complex models. Assume that a set of k ≥ 1 candidate loci has been chosen. We can then test whether a specific allele combination (genotype), say G = (g_1, ..., g_k), at these k loci contributes to the phenotype expression as follows. Define a virtual biallelic locus L with alleles G and Ḡ, where Ḡ consists of all theoretically possible allele combinations specified by the k markers except the candidate genotype G. The proposed tests can then be applied to L, where the candidate allele A from the original test is replaced by G. The only difference with respect to the original procedure is the fact that a genotype can only appear once for each individual, so that the random variables C_i, A_i, B_i, which describe genotype features as the sum of two indicators on A in the one-locus case, must be replaced by indicators on G, yielding a modified variance of the statistic. Using this approach one may test all relevant allele
combinations. If the k markers are located on the same chromosome, one could also think of conducting test procedures on haplotypes to take account of cis and trans effects; see [13]. To test the effect of a specific haplotype, say H, on the phenotype, define a virtual biallelic locus with alleles H and H̄ and apply our method. If the candidate loci are not very close, we have to adjust the expression for the covariance between parental and offspring (virtual) genotypes (corresponding to Cov(C_1, A_1) in the original tests) by some nuisance parameters to allow for recombination, i.e. the analogue of Cov(C_1, A_1) will not exactly equal p_{A0}(1 - p_{A0}) but will include the aforementioned parameters. This is, however, no obstacle to the use of these tests, since the covariance is estimated anyway. The estimator ĉov from Section 3.2 will still provide a √n-consistent estimate of the covariance. If the DNA sequence containing the markers is not too long, the method of molecular haplotyping can be used to identify the haplotypes; see [8]. Otherwise, haplotypes can either be determined by pedigree analysis or (if that is not possible) haplotype frequencies can be estimated, for example, by an EM algorithm; see [6]. A more detailed exploration of this subject will be given in a subsequent article.

An important question is whether the proposed tests are also valid under more general inheritance modes. A general model attributes different risks f_0, f_1, f_2 to each genotype with respect to L2, the index corresponding to the number of alleles A. Then f_0 = 0, f_1 = f_2 = f yields the dominant and f_0 = f_1 = 0, f_2 = f the recessive model. Theorem 1 still remains valid even in this general setting. Note that although the values of the parameters p_1 and p_2, which denote the probabilities that a parent is affected given that the child is affected or not, depend on the mode of inheritance, these parameters are still consistently estimated by the estimators p̂_1 and p̂_2. However, under the alternative hypothesis H_1 the allele frequencies p_v and p_w of A at L1 in the two different groups depend on the different values of f_0, f_1 and f_2. Therefore, when simulating the power of the tests, the mode of inheritance has to be taken into account.
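To make the connection to the simulation study in the next section concrete, the following sketch generates trio data under Hardy-Weinberg equilibrium, LD coefficient δ and a general penetrance model (f_0, f_1, f_2). It is an illustration under stated assumptions (no recombination between L1 and L2, trios drawn from the general population rather than ascertained by the child's phenotype), not the simulation code used for Section 5, and all names are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

def simulate_trios(p_A0, p_2A0, delta, f, n_trios):
    """Simulate marker genotype counts (C_i, A_i, B_i) and phenotypes for n_trios families.

    p_A0, p_2A0 : frequencies of allele A at the marker L1 and of the risk allele at L2
    delta       : LD coefficient, h_AA = p_A0 * p_2A0 + delta
    f           : penetrances (f_0, f_1, f_2) for 0, 1, 2 risk alleles at L2
    """
    # haplotype frequencies for (marker allele, risk allele); index = 2*marker + risk
    h = np.array([
        (1 - p_A0) * (1 - p_2A0) + delta,
        (1 - p_A0) * p_2A0 - delta,
        p_A0 * (1 - p_2A0) - delta,
        p_A0 * p_2A0 + delta,
    ])
    # two haplotypes per parent: random union of gametes (Hardy-Weinberg equilibrium)
    hap = rng.choice(4, size=(n_trios, 2, 2), p=h)      # axes: family, parent, haplotype
    marker = hap // 2                                    # 1 if the haplotype carries A at L1
    risk = hap % 2                                       # 1 if the haplotype carries the risk allele at L2

    A_count = marker[:, 0, :].sum(axis=1)                # first parent's number of A alleles at L1
    B_count = marker[:, 1, :].sum(axis=1)                # second parent's number of A alleles at L1

    # the child inherits one randomly chosen haplotype from each parent (no recombination assumed)
    pick = rng.integers(0, 2, size=(n_trios, 2))
    idx = np.arange(n_trios)
    child_hap = np.stack([hap[idx, 0, pick[:, 0]], hap[idx, 1, pick[:, 1]]], axis=1)
    C_count = (child_hap // 2).sum(axis=1)

    f = np.asarray(f)
    child_affected = rng.random(n_trios) < f[(child_hap % 2).sum(axis=1)]
    parents_affected = rng.random((n_trios, 2)) < f[risk.sum(axis=2)]
    return C_count, A_count, B_count, child_affected, parents_affected

# dominant model with penetrance 0.6; delta must keep all haplotype frequencies non-negative
C, A, B, aff_child, aff_par = simulate_trios(0.2, 0.1, 0.01, (0.0, 0.6, 0.6), n_trios=1000)

In a case control design as described in Section 2, one would retain simulated trios according to the child's phenotype until n_1 affected and n_2 unaffected children have been collected.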
5 Simulation Study

We explored in a simulation study whether there is a substantial increase in power when parental data are included, and to what extent the actual level of significance is affected by the dependency structure of the data. We considered testing H_0 : δ = 0 against H_1 : δ > 0 at a nominal level α = 0.05. To simulate random variables with the same distributions as C_i, A_i and B_i, i = 1, ..., n, four parameters have to be specified in advance. We fixed the values of the two allele frequencies p_{A0}, p_{2A0}, the penetrance f and the LD coefficient δ. Since the LD coefficient δ depends on the haplotype frequency and the allele frequencies, which complicates comparisons between different pairs of loci, it is better to consider the LD coefficient standardized by its maximal value,

δ' = δ / (min{p_{A0}, p_{2A0}} - p_{A0} p_{2A0}),

whose values lie in the unit interval [0, 1] if there is positive or no LD (δ ≥ 0) between alleles A at L1 and A at L2. In the following, we use δ' instead of δ. Each parameter constellation was simulated using 10,000 runs for sample sizes n = 60, n = 100 and n = 200. Here n_1 was chosen according to the equal expected allocation rule, i.e. the expected number of affected individuals in the sample is equal to the expected number of unaffected individuals. For some parameter combinations, this choice was not admissible due to the restriction that the value of n_1 must not be greater than n. In these cases, we simulated the tests for several choices of n_1 with 0.75n ≤ n_1 ≤ 0.95n. We found that the tests are not very sensitive with respect to the particular choice of n_1 within this range.

The simulated rejection rates under H_0 : δ = 0 are given in Tables 2 and 3. Since in the following we will compare the performance of our tests with the performance of the corresponding tests including either data from unrelated subjects only or data from children with one parent each, we label by (**) the tests including data from both parents for each child.

Tables 2 and 3 show that the tests preserve the α-level relatively well, even in small samples (n = 60). We simulated various further parameter combinations and always
received similar results. Only for very small allele frequencies p_{A0} (< 0.05) do the tests exceed the α-level when the sample size is small, due to p_{A0} being estimated as zero in many runs. In this case, the test based on the attributable risk turns out to be the most robust one among the three tests under consideration.

Tables 4, 5 and 6 show the simulated powers of the tests when the alternative hypothesis H_1 holds. For comparison, we also display the corresponding values when only data from unrelated subjects are included in the test statistics. In this case, the number of affected individuals n_1 was chosen according to the equal allocation rule n_1 = n/2. The tests including offspring data only are not given any label. The tests labeled by (*) denote the tests including the data from the basic sample plus the data from one randomly chosen parent for each child. For these tests, the expected equal allocation rule was used to determine n_1. In Table 4, we chose relatively small allele frequencies of A at the two loci L1 and L2, a small amount of LD measured by δ' = 0.2, and a high penetrance of f = 0.6. We observe that the inclusion of parental data increases the powers of the tests considerably. The same holds true for medium-scale values of the allele frequencies p_{A0} and p_{2A0}, as can be seen from Table 5. Table 6, finally, displays the values of the simulated powers when the value of the penetrance f is chosen relatively small. Further, we observed virtually no difference in performance between the three tests under consideration, so that any of these tests could be used in practice.

6 Discussion

We introduce new contingency table tests for a case-control design including related subjects to assess whether a candidate locus is in linkage disequilibrium with a predisposing locus. Starting from a basic sample of i.i.d. cases and controls, parental genotype and phenotype data are added to the sample. To define asymptotic level α tests we provide the asymptotic distributions of the attributable risk, the relative risk, and the odds ratio.
For the nuisance parameters induced by genetic theory, we propose nonparametric estimators which are consistent for any underlying genetic inheritance model. Thus, our tests are indeed asymptotic nonparametric level α tests. Simulations show that the tests under consideration preserve the nominal significance level very well and are quite robust with respect to the genetic parameters. The statistical power to detect linkage disequilibrium, i.e. to detect predisposing genes, is substantially increased by including parental information. The considerations made in Section 4 show that the tests are also applicable to test for linkage disequilibrium between several markers and causal loci in the case of polygenic diseases, if some slight modifications are implemented.

Acknowledgements: This research was supported by the Deutsche Forschungsgemeinschaft (project: Statistical methods for the analysis of association studies in human genetics). H. Holzmann and A. Munk acknowledge financial support of the Deutsche Forschungsgemeinschaft, MU 1230/8-1. The authors are also grateful to S. Böhringer for some interesting discussions and helpful comments on an earlier version of this article.

Appendix: Proofs

Proof of Theorem 1: The main difficulty in the proof is to show that the components N_1, N_2 and N_3 of Table 1 are jointly asymptotically normally distributed. This is not evident since N_1 and N_2 contain sums of randomly chosen variables A_i and B_i. However, these can be reconstructed as sums of deterministically chosen random variables as follows. To each affected child, we assign a seven-dimensional random vector X_i = (X_i^{(1)}, X_i^{(2)}, X_i^{(3)}, X_i^{(4)}, X_i^{(5)}, X_i^{(6)}, X_i^{(7)})^T, i = 1, ..., n_1, where

X_i^{(2)} = I{first parent of child i is affected},
X_i^{(5)} = I{second parent of child i is affected},

and X_i^{(1)} = C_i, X_i^{(3)} = A_i, X_i^{(4)} = X_i^{(2)} X_i^{(3)}, X_i^{(6)} = B_i and X_i^{(7)} = X_i^{(5)} X_i^{(6)}. Analogously, we define the vectors Y_j = (Y_j^{(1)}, Y_j^{(2)}, Y_j^{(3)}, Y_j^{(4)}, Y_j^{(5)}, Y_j^{(6)}, Y_j^{(7)})^T, j = 1, ..., n_2, for each child from the control group. We have thus arranged dependent random variables corresponding to related individuals within the vectors X_i and Y_j. Since we did not allow for any degree of relationship among the children, the sequences X_i, i = 1, ..., n_1, and Y_j, j = 1, ..., n_2, are independent, and the random vectors X_i, i = 1, ..., n_1, as well as Y_j, j = 1, ..., n_2, are independent and identically distributed. For the sake of brevity of notation, we denote the expectation of C_i, A_i and B_i, i = 1, ..., n, by p_A, i.e. p_A = 2 p_{A0}.

Lemma 4 Let n_1/n tend to some constant c ∈ (0, 1) for n → ∞. Denote by m_x and m_y the means of X_1 and Y_1, respectively, and by K_x, K_y the corresponding covariance matrices. Then under H_0

(10)   √n ( (1/n) Σ_{i=1}^{n_1} X_i - (n_1/n) m_x ,  (1/n) Σ_{j=1}^{n_2} Y_j - (n_2/n) m_y )^T  →_D  N(0, Σ_K),

where the covariance matrix Σ_K, given by a block diagonal matrix with blocks c K_x and (1 - c) K_y, is non-degenerate. Explicitly, we have

m_x = (p_A, p_1, p_A, p_1 p_A, p_1, p_A, p_1 p_A)^T,   m_y = (p_A, p_2, p_A, p_2 p_A, p_2, p_A, p_2 p_A)^T,

K_x =
[ Var(C_1)           0                Cov(C_1,A_1)   p_1 Cov(C_1,A_1)           0                Cov(C_1,B_1)   p_1 Cov(C_1,B_1)          ]
[ 0                  p_1(1-p_1)       0              p_A p_1(1-p_1)             p_3              0              p_3 p_A                   ]
[ Cov(C_1,A_1)       0                Var(C_1)       p_1 Var(C_1)               0                0              0                         ]
[ p_1 Cov(C_1,A_1)   p_A p_1(1-p_1)   p_1 Var(C_1)   p_1 p_A(1+(0.5-p_1)p_A)    p_3 p_A          0              p_3 p_A^2                 ]
[ 0                  p_3              0              p_3 p_A                    p_1(1-p_1)       0              p_A p_1(1-p_1)            ]
[ Cov(C_1,B_1)       0                0              0                          0                Var(C_1)       p_1 Var(C_1)              ]
[ p_1 Cov(C_1,B_1)   p_3 p_A          0              p_3 p_A^2                  p_A p_1(1-p_1)   p_1 Var(C_1)   p_1 p_A(1+(0.5-p_1)p_A)   ],
K_y =
[ Var(C_1)           0                Cov(C_1,A_1)   p_2 Cov(C_1,A_1)           0                Cov(C_1,B_1)   p_2 Cov(C_1,B_1)          ]
[ 0                  p_2(1-p_2)       0              p_A p_2(1-p_2)             p_4              0              p_4 p_A                   ]
[ Cov(C_1,A_1)       0                Var(C_1)       p_2 Var(C_1)               0                0              0                         ]
[ p_2 Cov(C_1,A_1)   p_A p_2(1-p_2)   p_2 Var(C_1)   p_2 p_A(1+(0.5-p_2)p_A)    p_4 p_A          0              p_4 p_A^2                 ]
[ 0                  p_4              0              p_4 p_A                    p_2(1-p_2)       0              p_A p_2(1-p_2)            ]
[ Cov(C_1,B_1)       0                0              0                          0                Var(C_1)       p_2 Var(C_1)              ]
[ p_2 Cov(C_1,B_1)   p_4 p_A          0              p_4 p_A^2                  p_A p_2(1-p_2)   p_2 Var(C_1)   p_2 p_A(1+(0.5-p_2)p_A)   ].

Proof of Lemma 4: The expectations and covariance matrices of X_1 and Y_1 under H_0 are obtained by straightforward calculations. Note that under the null hypothesis the random variables X_i^{(2)}, X_i^{(5)} and Y_j^{(2)}, Y_j^{(5)} corresponding to phenotype data are independent of the random variables X_i^{(1)}, X_i^{(3)}, X_i^{(6)} and Y_j^{(1)}, Y_j^{(3)}, Y_j^{(6)} describing genotype features of the individuals. Asymptotic normality follows from an application of the multivariate central limit theorem for i.i.d. vectors to both sequences separately, getting asymptotic results for the sums of the vectors X_i and Y_j. Since the sequences X_i, i = 1, ..., n_1, and Y_j, j = 1, ..., n_2, are independent, it remains to put the sums into one vector and standardize accordingly. To prove that both covariance matrices K_x and K_y are non-degenerate we calculate the determinants of the respective matrices and obtain

|K_x| = 0.5 p_1^2 (1 - p_1)^2 Var(C_1)^5 (p_1^2(1 - p_1)^2 - p_3^2),
|K_y| = 0.5 p_2^2 (1 - p_2)^2 Var(C_1)^5 (p_2^2(1 - p_2)^2 - p_4^2),

where we used that Var(C_1) = 2 p_{A0}(1 - p_{A0}) and Cov(C_1, A_1) = p_{A0}(1 - p_{A0}). Since p_{A0}, p_1 and p_2 are assumed to lie in (0, 1), it remains to show that p_3^2 < p_1^2(1 - p_1)^2 and p_4^2 < p_2^2(1 - p_2)^2. Recall that the term p_1^2(1 - p_1)^2 - p_3^2 is given by

p_1^2(1 - p_1)^2 - p_3^2 = Var(X_1^{(2)}) Var(X_1^{(5)}) - [Cov(X_1^{(2)}, X_1^{(5)})]^2 ≥ 0

by Hölder's inequality. As X_1^{(2)} and X_1^{(5)} are not linearly dependent (the four possible outcome combinations for X_1^{(2)} and X_1^{(5)} each occur with positive probability), even
the strict inequality holds and thus the determinant of K_x is positive. An analogous reasoning applies to show that the determinant of K_y is also greater than zero. An application of the delta method will now give us the joint asymptotic distribution of the entries of the contingency table under H_0.

Lemma 5 Let, again, n_1/n tend to some arbitrary constant c ∈ (0, 1) for n → ∞. Then the following assertion holds under the null hypothesis H_0:

(11)   √n ( (N_1 - E[N_1])/n,  (N_2 - E[N_2])/n,  (N_3 - E[N_3])/n )^T  →_D  N(0, Σ_N),

with expectations

E[N_1] = (n_1 + 2n_1 p_1 + 2n_2 p_2) p_A,
E[N_2] = (3n - n_1 - 2n_1 p_1 - 2n_2 p_2) p_A,
E[N_3] = (n_1 + 2n_1 p_1 + 2n_2 p_2)(2 - p_A),

and the entries Σ_{N,i,j}, i, j = 1, 2, 3, of the covariance matrix Σ_N given by

Σ_{N,1,1} = Var(C_1){c + 2c p_1 + 2(1 - c)p_2} + 4 Cov(C_1, A_1) c p_1
            + 2 p_A^2 {c p_1(1 - p_1) + (1 - c)p_2(1 - p_2) + c p_3 + (1 - c)p_4},

Σ_{N,1,2} = 2 Cov(C_1, A_1){c(1 - p_1) + (1 - c)p_2}
            - 2 p_A^2 {c p_1(1 - p_1) + (1 - c)p_2(1 - p_2) + c p_3 + (1 - c)p_4},

Σ_{N,1,3} = -Var(C_1){c + 2c p_1 + 2(1 - c)p_2} - 4 Cov(C_1, A_1) c p_1
            + 2 p_A(2 - p_A){c p_1(1 - p_1) + (1 - c)p_2(1 - p_2) + c p_3 + (1 - c)p_4},

Σ_{N,2,2} = Var(C_1){3 - c - 2c p_1 - 2(1 - c)p_2} + 4 Cov(C_1, A_1)(1 - c)(1 - p_2)
            + 2 p_A^2 {c p_1(1 - p_1) + (1 - c)p_2(1 - p_2) + c p_3 + (1 - c)p_4},

Σ_{N,2,3} = -2 Cov(C_1, A_1){c(1 - p_1) + (1 - c)p_2}
            - 2 p_A(2 - p_A){c p_1(1 - p_1) + (1 - c)p_2(1 - p_2) + c p_3 + (1 - c)p_4},
Σ_{N,3,3} = Var(C_1){c + 2c p_1 + 2(1 - c)p_2} + 4 Cov(C_1, A_1) c p_1
            + 2(2 - p_A)^2 {c p_1(1 - p_1) + (1 - c)p_2(1 - p_2) + c p_3 + (1 - c)p_4}.

Proof of Lemma 5: We can express the entries of the contingency table (Table 1) as functions of the sums of the entries of the vectors X_i, i = 1, ..., n_1, and Y_j, j = 1, ..., n_2, i.e.

N_1 = Σ_{i=1}^{n_1} (X_i^{(1)} + X_i^{(4)} + X_i^{(7)}) + Σ_{j=1}^{n_2} (Y_j^{(4)} + Y_j^{(7)}),

N_2 = Σ_{i=1}^{n_1} (X_i^{(3)} - X_i^{(4)} + X_i^{(6)} - X_i^{(7)}) + Σ_{j=1}^{n_2} (Y_j^{(1)} + Y_j^{(3)} - Y_j^{(4)} + Y_j^{(6)} - Y_j^{(7)}),

N_3 = 2n_1 + 2 Σ_{i=1}^{n_1} (X_i^{(2)} + X_i^{(5)}) + 2 Σ_{j=1}^{n_2} (Y_j^{(2)} + Y_j^{(5)}) - N_1.

Interpreting the vector (N_1, N_2, N_3)^T as a (measurable and differentiable) function from IR^{14} to IR^{3}, we obtain the statement of Lemma 5 by application of the delta method. Application of the delta method to the function (p̂_v, p̂_w)^T, where p̂_v = N_1/(N_1 + N_3) and p̂_w = N_2/(6n - N_1 - N_3), then yields (6). The formulas for Var(C_1), Cov(C_1, A_1) and t_1 are obtained by straightforward calculations.

References

[1] Armitage P. (1955). Tests for linear trends in proportions and frequencies. Biometrics, 11, 375-386.

[2] Böhringer S., Hardt C., Miterski B., Steland A. and Epplen J.T. (2003). Multilocus statistics to uncover epistasis and heterogeneity in complex diseases: revisiting a set of multiple sclerosis data. European Journal of Human Genetics, 11, 573-584.
[3] Böhringer S. and Steland A. (2004). Testing genetic causality for binary traits using parent-offspring data. Under revision in: Biometrical Journal.

[4] Dudoit S. and Speed T.P. (1999). A score test for linkage using identity by descent data from sibships. Annals of Statistics, 27, 943-986.

[5] Dudoit S. and Speed T.P. (2000). A score test for the linkage analysis of qualitative and quantitative traits based on identity by descent data from sib-pairs. Biostatistics, 1, 1-26.

[6] Fallin D., Cohen A., Essioux L., Chumakov I., Blumenfeld M., Cohen D. and Schork N.J. (2001). Genetic analysis of case/control data using estimated haplotype frequencies: Application to APOE locus variation and Alzheimer's disease. Genome Research, 11, 143-151.

[7] Hoh J., Wille A. and Ott J. (2001). Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Research, 11, 2115-2119.

[8] Michalatos-Beloin S., Tishkoff S., Bentley K., Kidd K. and Ruano G. (1996). Molecular haplotyping of genetic markers 10 kb apart by allele-specific long-range PCR. Nucleic Acids Research, 24, 4841-4843.

[9] Nelson M.R., Kardia S.L., Ferrell R.E. and Sing C.F. (2001). A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Research, 11, 458-470.

[10] Risch N. (2000). Searching for genetic determinants in the new millennium. Nature, 405, 847-856.

[11] Risch N. and Teng J. (1998). The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases. I. DNA pooling. Genome Research, 8, 1273-1288.
[12] Rothman K.J. and Greenland S. (1998). Modern Epidemiology. Lippincott-Raven, PA.

[13] Schaid D.J., Rowland C.M., Tines D.E., Jacobson R.M. and Poland G.A. (2002). Score tests for association between traits and haplotypes when linkage phase is ambiguous. American Journal of Human Genetics, 70, 425-434.

[14] Serfling R.J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons Inc., New York.

[15] Slager S.L. and Schaid D.J. (2001). Evaluation of candidate genes in case-control studies: a statistical method to account for related subjects. American Journal of Human Genetics, 68, 1457-1462.

[16] Teng J. and Risch N. (1999). The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases. II. Individual genotyping. Genome Research, 9, 234-241.

[17] Whittemore A.S. and Halpern J. (1994). A class of tests for linkage using affected pedigree members. Biometrics, 50, 118-127.

[18] Whittemore A.S. and Tu I.P. (2000). Detection of disease genes by use of family data. I. Likelihood-based theory. American Journal of Human Genetics, 66, 1328-1340.

[19] Wille A., Hoh J. and Ott J. (2003). Sum statistics for the joint detection of multiple disease loci in case-control association studies with SNP markers. Genetic Epidemiology, 25, 350-359.
Table 1: Two-by-two contingency table displaying the respective numbers of alleles A / Ā at L1 in the case / control group.

              number of alleles A at L1                                           number of alleles Ā at L1
case          N_1 := Σ_{i=1}^{n_1} C_i + Σ_{i∈S_A} A_i + Σ_{i∈S_B} B_i            N_3 := 2n_1 + 2R_1 + 2R_2 + 2S_1 + 2S_2 - N_1
control       N_2 := Σ_{i=n_1+1}^{n} C_i + Σ_{i∉S_A} A_i + Σ_{i∉S_B} B_i          N_4 := 6n - N_1 - N_2 - N_3

Table 2: p_{A0} = 0.2, p_{2A0} = 0.015, δ = 0, f = 0.5

test                       n = 60    n = 100   n = 200
attributable risk (**)      5.59%     5.56%     5.53%
log odds ratio (**)         4.73%     4.74%     4.88%
log relative risk (**)      4.91%     5.01%     5.04%

Table 3: p_{A0} = 0.2, p_{2A0} = 0.1, δ = 0, f = 0.3

test                       n = 60    n = 100   n = 200
attributable risk (**)      5.52%     5.18%     5.04%
log odds ratio (**)         5.12%     4.86%     4.78%
log relative risk (**)      5.13%     4.95%     4.86%
Table 4: p_{A0} = 0.05, p_{2A0} = 0.001, δ' = 0.2, f = 0.6

test                       n = 60    n = 100   n = 200
attributable risk           58.44%    76.65%    95.95%
attributable risk (*)       69.46%    86.61%    98.71%
attributable risk (**)      80.53%    94.19%    99.79%
log odds ratio              58.68%    77.72%    95.85%
log odds ratio (*)          69.76%    86.21%    98.65%
log odds ratio (**)         76.75%    92.29%    99.73%
log relative risk           60.82%    78.11%    95.97%
log relative risk (*)       70.70%    86.72%    98.70%
log relative risk (**)      79.05%    93.38%    99.76%

Table 5: p_{A0} = 0.2, p_{2A0} = 0.1, δ' = 0.2, f = 0.5

test                       n = 60    n = 100   n = 200
attributable risk           30.08%    40.67%    64.36%
attributable risk (*)       36.42%    51.80%    76.94%
attributable risk (**)      46.45%    63.84%    86.52%
log odds ratio              30.30%    41.10%    63.29%
log odds ratio (*)          36.50%    51.29%    76.47%
log odds ratio (**)         42.43%    60.02%    84.51%
log relative risk           30.82%    41.41%    64.21%
log relative risk (*)       36.99%    51.82%    76.68%
log relative risk (**)      44.56%    62.33%    85.54%
Table 6: p_{A0} = 0.4, p_{2A0} = 0.2, δ' = 0.4, f = 0.2

test                       n = 60    n = 100   n = 200
attributable risk           36.51%    50.62%    75.41%
attributable risk (*)       40.07%    57.96%    81.76%
attributable risk (**)      55.72%    74.72%    94.18%
log odds ratio              35.28%    49.06%    73.88%
log odds ratio (*)          38.90%    56.10%    80.45%
log odds ratio (**)         51.05%    71.16%    93.01%
log relative risk           38.99%    49.97%    74.04%
log relative risk (*)       40.47%    57.51%    81.54%
log relative risk (**)      51.47%    71.87%    93.25%