Tests in a case control design including relatives


Stefanie Biedermann (i), Eva Nagel (i), Axel Munk (ii), Hajo Holzmann (ii), Ansgar Steland (i)

(i) Ruhr-Universität Bochum, Fakultät für Mathematik, Bochum, Germany
(ii) Georg-August-Universität Göttingen, Institut für Mathematische Stochastik, Göttingen, Germany

Abstract

We present a new approach to handling dependent data arising from relatives in case control studies in genetic epidemiology. We prove a general theorem characterizing the joint asymptotic distribution of the entries of the associated contingency table under the null hypothesis that phenotype and genotype at the observed locus are independent. The theorem covers several common test statistics, namely the attributable risk, the relative risk, and the odds ratio. Simulations reveal good finite sample behavior of these test statistics. Our results can also be applied to more general settings in medicine or natural science whenever data from related subjects are involved.

Keywords and Phrases: allele frequency, case control design, causal locus, contingency tables, dependent data, likelihood ratio test, risk measures.

1 Introduction

The analysis of contingency tables is an important tool in epidemiological studies; see, e.g., [12]. For the typical situation of comparing the relative frequency of a certain feature between two samples of individuals (cases and controls), there is an extensive literature on statistical tests as long as the data are independent. When data from relatives are included in a case control study, the existing methods are subject to severe limitations.
The aim of this article is to give a new approach to handling data from relatives within a case control study. The approach is based on tests in contingency tables and considers the well-known risk measures, i.e. the odds ratio, the attributable risk and the relative risk, from a new point of view. We prove a result on the asymptotic distribution of the entries of the contingency table under the null hypothesis that the feature under consideration does not influence the phenotype. From this we derive the asymptotic distributions of the test statistics under consideration, which yields critical values for the testing problem. Our method is widely applicable to all kinds of tests that can be derived from contingency tables and to data from relatives in all fields of science.

It is still a major challenge of genetic research to explore the genetic component of common diseases such as hypertension or asthma. Therefore, there is considerable interest in statistical methods to test candidate markers and to find predisposing alleles. Much effort has been devoted to this problem, and several methods have been derived under the assumption of independent data; see, e.g., [2], [7], [9], [11], [16] or [19]. This assumption, however, is violated if two individuals in the sample are related. Particularly in the case of a rare disease, researchers cannot afford to include only unrelated subjects in a study, since this will inevitably result in a severe loss of power.

In recent years, methods have been published that take dependencies among the data into account; see, e.g., [18] and [3], who pursue likelihood-based approaches, or [4], [5] and [17], who consider score tests derived from identity by descent (IBD) data from affected sib-pairs. In [15] the test for trend in proportions introduced by [1] is extended to the case where family data must be accounted for by computing the exact variance of the test statistic.
The approaches described above depend sensitively on several nuisance parameters from the underlying genetic model as well as on the model itself. To provide nonparametric asymptotic level $\alpha$ tests, we propose nonparametric estimators for these parameters and simultaneously allow for dependent data.

This article is organized as follows. Section 2 gives a more detailed discussion of the genetic model under consideration. The proposed tests and their asymptotic
distributions are given in Section 3. Section 4 provides various important extensions of our method, e.g., the inclusion of further types of relatives, missing data, polygenic disorders or different inheritance models. The finite sample properties of the tests under parameter settings of practical interest are studied by simulations in Section 5, including comparisons to the corresponding tests using data from independent individuals only. Finally, the proofs of our results are deferred to an appendix.

2 Genetic model and assumptions

At first glance, the assumptions imposed at this stage concerning the number of candidate loci (one) and the mode of inheritance (dominant) seem quite restrictive, but we show in Section 4 how to apply our approach to an arbitrary number of markers and predisposing loci as well as to different modes of inheritance. Hence, for simplicity of motivation and derivation of results, let us assume that there are two biallelic loci L1 and L2 with alleles A and Ā. L1 is the candidate locus where data are observed, whereas L2 denotes the true but unknown causal locus. We want to ascertain whether there is significant evidence for linkage disequilibrium between L1 and L2 by comparing the relative frequencies of allele A at L1 in the case and the control group, respectively. We assume a dominant inheritance model, i.e. for any individual the probability of being affected, given that at least one allele A is present at locus L2, is equal to the penetrance $f$, whereas individuals without any allele A at L2 are affected with probability zero. The penetrance $f$ is a positive but unknown constant, $f \le 1$.

We consider the following study setup. A data set containing data from $n_1$ affected children is randomly sampled from the population as part of the case group. Similarly, a random sample of $n_2$ unaffected children forms the basis of the control group. The total number of children is then $n = n_1 + n_2$.
No degree of relationship is allowed among these children, to avoid dependencies within the data at this stage of the experiment. The sample consisting of data of unrelated children only will be
denoted as the basic sample in the following. The children in both groups are then tested for the presence of allele A at locus L1, which is suspected to have an impact on developing the disease. More precisely, to each child we assign a random variable $C_i$, $i = 1, \ldots, n$, which counts the number of occurrences of allele A at the candidate locus L1, i.e.

$$C_i = \begin{cases} 2 & \text{if allele A occurs twice at locus L1 of child } i, \\ 1 & \text{if allele A occurs exactly once at locus L1 of child } i, \\ 0 & \text{otherwise.} \end{cases}$$

The classic case control design introduced so far will now be enriched by dependent data from relatives from the associated nuclear families, i.e. the data consist of the children and their parents, bringing the total number of individuals taking part in the study to $3n$. The random variables corresponding to parental observations are defined equivalently and denoted by $A_i$ and $B_i$, where $A_i$ is given by

$$A_i = \begin{cases} 2 & \text{if allele A occurs twice at locus L1 of the first parent of child } i, \\ 1 & \text{if allele A occurs exactly once at locus L1 of the first parent of child } i, \\ 0 & \text{otherwise,} \end{cases}$$

and $B_i$ is defined analogously for the second parent.

From the above study setup, it follows that there are dependencies among the data and, as a further complication, the number of parents belonging to either the case or the control group is also random. More precisely, the numbers of affected or unaffected first and second parents conditioned on the children's phenotype are random numbers given by

$R_1$ = number of affected first parents with affected children,
$R_2$ = number of affected first parents with unaffected children,
$R_3$ = number of unaffected first parents with unaffected children,
$R_4$ = number of unaffected first parents with affected children,

and $S_1, \ldots, S_4$ are defined analogously for the second parent. Evidently, the random variables $R_k$, $S_k$, $k = 1, \ldots, 4$, are linked by the equalities $R_1 + R_4 = n_1$, $R_2 + R_3 = n_2$, $S_1 + S_4 = n_1$ and $S_2 + S_3 = n_2$. Furthermore, these variables follow binomial distributions specified as follows: $R_1 \sim \mathrm{Bin}(n_1, p_1)$, $R_2 \sim \mathrm{Bin}(n_2, p_2)$ and $S_1 \sim \mathrm{Bin}(n_1, p_1)$, $S_2 \sim \mathrm{Bin}(n_2, p_2)$, where the parameters $p_1$, $p_2$ denote the probabilities that a parent is affected given that the corresponding child is affected or unaffected, respectively. Finally, the covariance structure among these random variables is characterized by the parameters $p_3$ and $p_4$ with $\mathrm{Cov}(R_1, S_1) = n_1 p_3$ and $\mathrm{Cov}(R_2, S_2) = n_2 p_4$.

Furthermore, in this article we suppose that the following standard assumptions on the various processes involved in inheritance are satisfied.

(A1) The random processes yielding the phenotype given the genotype are independent for different individuals.
(A2) Gamete formation is independent of phenotype.
(A3) Hardy-Weinberg equilibrium holds for each locus.
(A4) Offspring data are randomly sampled from the two groups in the population.

Our interest is in testing whether the observed genotype at locus L1 influences the phenotype. We therefore test the null hypothesis $H_0$ that there is no linkage disequilibrium between the observed locus L1 and the disease locus L2 against the alternative $H_1$ that there is linkage disequilibrium (LD) between these two loci. Note that the linkage disequilibrium coefficient $\delta$ describing the LD between alleles at the two loci L1 and L2 within the population is given by

$$\delta = \delta_{AA} = h_{AA} - p_{A0}\, p_{2A0},$$

where $h_{AA}$ denotes the haplotype frequency of alleles A, A at loci L1 and L2, and $p_{A0}$, $p_{2A0}$ stand for the allele frequencies at L1 and L2, respectively. Recall that a
single parameter is sufficient to describe the LD between two biallelic loci since the LD coefficients between arbitrary alleles at these loci only differ in sign. In terms of the LD coefficient $\delta$, the testing problem is thus given by testing the null hypothesis

$$H_0 : \delta = 0 \tag{1}$$

against either the one-sided or the two-sided alternative

$$H_1 : \delta > 0 \quad \text{or} \quad H_1 : \delta \neq 0. \tag{2}$$

In order to construct tests for these hypotheses on the basis of observations, i.e. realizations of the random variables $C_i$, $A_i$, $i = 1, \ldots, n$, we reformulate the above hypotheses (1) and (2) in terms of the allele frequencies of A at L1 in the two respective groups. Let $p_v$ denote the frequency of A at L1 among the affected individuals in the population, and $p_w$ the corresponding frequency among the unaffected individuals. The parameters $p_v$ and $p_w$ can be expressed in terms of $\delta$, $p_{A0}$ and $p_{2A0}$ as follows:

$$p_v = p_{A0} + \delta\, \frac{1 - p_{2A0}}{p_{2A0}\,(2 - p_{2A0})}, \qquad p_w = p_{A0} - \delta\, \frac{f\,(1 - p_{2A0})}{1 - f\, p_{2A0}\,(2 - p_{2A0})}. \tag{3}$$

We note that $p_w$ also depends on the value of the penetrance $f$. The hypotheses (1) and (2) can thus be reformulated equivalently in the following way:

$$H_0 : p_v = p_w = p_{A0} \tag{4}$$

$$H_1 : p_v > p_w \quad \text{or} \quad H_1 : p_v \neq p_w. \tag{5}$$

3 Test statistics and their asymptotic distributions

The null hypothesis $H_0$ implies that the allele frequency of A at locus L1 is the same among affected and unaffected individuals in the population. It is therefore reasonable
to consider test statistics based on differences or ratios of estimators for the allele frequencies $p_v$ and $p_w$ in the case and the control group to detect deviations from the null hypothesis. We choose as estimators the empirical counterparts of $p_v$ and $p_w$ in the two respective groups, which can easily be obtained from the corresponding contingency table (see Table 1). (The sets $S_A$, $S_B$ occurring in Table 1 are defined as the sets of indices $i$, $1 \le i \le n$, such that the $i$-th first or second parent is affected, respectively, i.e. $S_A = \{i : i\text{-th first parent affected},\ 1 \le i \le n\}$, $S_B = \{i : i\text{-th second parent affected},\ 1 \le i \le n\}$.)

Insert Table 1

The empirical allele frequencies $\hat p_v$ within the case group and $\hat p_w$ within the control group can be expressed as functions of the entries of Table 1 as follows:

$$\hat p_v = \frac{N_1}{N_1 + N_3}, \qquad \hat p_w = \frac{N_2}{6n - N_1 - N_3}.$$

From Table 1, numerous test statistics can be derived. Since our aim is to compare the empirical allele frequencies $\hat p_v$ and $\hat p_w$ in the two respective groups, we focus on functions which measure the difference between two probabilities. Comparing probabilities such as allele frequencies is usually done by calculating the values of the corresponding risk measures, i.e. the attributable risk, the odds ratio, and the relative risk. If the sample allele frequencies are inserted in place of the true but unknown allele frequencies, we can use the risk measures as test statistics. This procedure yields the following test statistics:

$$\text{attributable risk:} \quad T_{n1} = \frac{N_1}{N_1 + N_3} - \frac{N_2}{6n - N_1 - N_3} = \hat p_v - \hat p_w,$$

$$\text{odds ratio:} \quad T_{n2} = \frac{N_1\,(6n - N_1 - N_2 - N_3)}{N_2\, N_3} = \frac{\hat p_v\,(1 - \hat p_w)}{\hat p_w\,(1 - \hat p_v)},$$

$$\text{relative risk:} \quad T_{n3} = \frac{N_1\,(6n - N_1 - N_3)}{N_2\,(N_1 + N_3)} = \frac{\hat p_v}{\hat p_w}.$$

We are now ready to present the main result of this article, which gives the joint asymptotic distribution of the empirical allele frequencies $\hat p_v$ and $\hat p_w$ under the null
hypothesis $H_0$. This result enables us to calculate the asymptotic distributions of the test statistics $T_{n1}$, $T_{n2}$ and $T_{n3}$, respectively, and thus the asymptotic critical values. The proof of Theorem 1 is deferred to the appendix. Since under $H_0$ we have $\mathrm{Var}(C_1) = \mathrm{Var}(A_1) = \mathrm{Var}(B_1)$ and $\mathrm{Cov}(C_1, A_1) = \mathrm{Cov}(C_1, B_1)$, the results are stated in terms of $\mathrm{Var}(C_1)$ and $\mathrm{Cov}(C_1, A_1)$ instead of these five different expressions, for brevity of notation.

Theorem 1 Assume (A1)-(A4) and that $n_1/n$ tends to $c \in (0, 1)$. Under the null hypothesis $\delta = 0$, the asymptotic distribution of the empirical allele frequencies is given by

$$\sqrt{n} \begin{pmatrix} \hat p_v - p_{A0} \\ \hat p_w - p_{A0} \end{pmatrix} \xrightarrow{D} N(0, \Sigma), \tag{6}$$

where the entries of the covariance matrix $\Sigma$ are given by

$$\Sigma_{1,1} = \frac{\mathrm{Var}(C_1)}{12\, t_1} + \frac{c\, p_1\, \mathrm{Cov}(C_1, A_1)}{9\, t_1^2},$$

$$\Sigma_{1,2} = \Sigma_{2,1} = \frac{\mathrm{Cov}(C_1, A_1)\,\bigl(c\,(1 - p_1) + (1 - c)\, p_2\bigr)}{18\, t_1\,(1 - t_1)},$$

$$\Sigma_{2,2} = \frac{\mathrm{Var}(C_1)}{12\,(1 - t_1)} + \frac{(1 - c)(1 - p_2)\, \mathrm{Cov}(C_1, A_1)}{9\,(1 - t_1)^2},$$

with $\mathrm{Var}(C_1) = 2 p_{A0}(1 - p_{A0})$ and $\mathrm{Cov}(C_1, A_1) = p_{A0}(1 - p_{A0})$. The parameter $t_1$ denotes the asymptotic expected fraction of affected individuals in the study, i.e.

$$t_1 = \lim_{n \to \infty} E\bigl[(n_1 + R_1 + R_2 + S_1 + S_2)/(3n)\bigr] = \bigl(c + 2 c\, p_1 + 2 (1 - c)\, p_2\bigr)/3.$$

To find critical values for the tests under consideration, we will use the above result to derive the asymptotic distributions of the test statistics $T_{n1}$, $T_{n2}$ and $T_{n3}$.

3.1 Asymptotic distributions of the test statistics

The asymptotic distributions of the statistics $T_{n1}$, $T_{n2}$ and $T_{n3}$ (attributable risk, odds ratio, and relative risk) are given in the following corollary, which is an application of the delta method. We consider the logarithms of $T_{n2}$ and $T_{n3}$, since these test statistics were found to preserve the nominal significance level more precisely than the original versions.
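As a numerical illustration of Theorem 1, the following minimal Python sketch evaluates $t_1$ and the entries of $\Sigma$ for given parameter values. The function names and the example values are assumptions made here for illustration, not part of the article. The helper `variance_of_difference` computes $\Sigma_{1,1} + \Sigma_{2,2} - 2\Sigma_{1,2}$, the asymptotic variance of $\sqrt{n}(\hat p_v - \hat p_w)$ implied by (6).

```python
def theorem1_sigma(p_a0, c, p1, p2):
    """Return (t1, Sigma) as stated in Theorem 1 under H0.

    p_a0: allele frequency of A at L1; c: limit of n1/n;
    p1, p2: probability that a parent is affected given an
    affected / unaffected child. Names are illustrative only.
    """
    var_c = 2.0 * p_a0 * (1.0 - p_a0)   # Var(C_1)
    cov_ca = p_a0 * (1.0 - p_a0)        # Cov(C_1, A_1)
    t1 = (c + 2.0 * c * p1 + 2.0 * (1.0 - c) * p2) / 3.0
    s11 = var_c / (12.0 * t1) + c * p1 * cov_ca / (9.0 * t1 ** 2)
    s12 = cov_ca * (c * (1.0 - p1) + (1.0 - c) * p2) / (18.0 * t1 * (1.0 - t1))
    s22 = (var_c / (12.0 * (1.0 - t1))
           + (1.0 - c) * (1.0 - p2) * cov_ca / (9.0 * (1.0 - t1) ** 2))
    return t1, [[s11, s12], [s12, s22]]


def variance_of_difference(sigma):
    """Asymptotic variance of sqrt(n) * (p_v_hat - p_w_hat)."""
    return sigma[0][0] + sigma[1][1] - 2.0 * sigma[0][1]


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    t1, sigma = theorem1_sigma(p_a0=0.3, c=0.5, p1=0.4, p2=0.1)
    print(t1, sigma, variance_of_difference(sigma))
```

The variance returned by `variance_of_difference` can be compared with the closed-form expression for the attributable risk as a consistency check.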
Corollary 2 Under the assumptions of Theorem 1, the asymptotic distributions of the test statistics $T_{n1}$, $\log(T_{n2})$ and $\log(T_{n3})$ under $H_0$ are given by

$$\sqrt{n}\, T_{n1} \xrightarrow{D} N(0, \sigma_1^2), \qquad \sqrt{n}\, \log(T_{ni}) \xrightarrow{D} N(0, \sigma_i^2), \quad i = 2, 3, \tag{7}$$

where the asymptotic variances $\sigma_1^2$, $\sigma_2^2$ and $\sigma_3^2$ are obtained as

$$\begin{aligned}
\sigma_1^2 &= \frac{\mathrm{Var}(C_1)}{12\, t_1 (1 - t_1)} + \frac{\mathrm{Cov}(C_1, A_1)\,\bigl(c p_1 - (c + c p_1 + (1 - c) p_2)\, t_1 + t_1^2\bigr)}{9\, t_1^2 (1 - t_1)^2} \\
&= \frac{p_{A0}(1 - p_{A0})}{18\, t_1^2 (1 - t_1)^2}\,\bigl(2 c p_1 + (3 - c)\, t_1 - 4 t_1^2\bigr), \\
\sigma_2^2 &= \frac{\mathrm{Var}(C_1)}{12\, t_1 (1 - t_1)\, p_{A0}^2 (1 - p_{A0})^2} + \frac{\mathrm{Cov}(C_1, A_1)\,\bigl(c p_1 - (c + c p_1 + (1 - c) p_2)\, t_1 + t_1^2\bigr)}{9\, t_1^2 (1 - t_1)^2\, p_{A0}^2 (1 - p_{A0})^2} \\
&= \frac{1}{18\, t_1^2 (1 - t_1)^2\, p_{A0}(1 - p_{A0})}\,\bigl(2 c p_1 + (3 - c)\, t_1 - 4 t_1^2\bigr), \\
\sigma_3^2 &= \frac{\mathrm{Var}(C_1)}{12\, t_1 (1 - t_1)\, p_{A0}^2} + \frac{\mathrm{Cov}(C_1, A_1)\,\bigl(c p_1 - (c + c p_1 + (1 - c) p_2)\, t_1 + t_1^2\bigr)}{9\, t_1^2 (1 - t_1)^2\, p_{A0}^2} \\
&= \frac{1 - p_{A0}}{18\, t_1^2 (1 - t_1)^2\, p_{A0}}\,\bigl(2 c p_1 + (3 - c)\, t_1 - 4 t_1^2\bigr).
\end{aligned} \tag{8}$$

Remark 3 If only the independent random variables $C_i$, $i = 1, \ldots, n$, from the basic sample are included in the test, the distributions of $T_{n1}$, $\log(T_{n2})$ and $\log(T_{n3})$ under the null hypothesis $H_0$ also feature $\sqrt{n}$-convergence to centered normal distributions, with asymptotic variances $\sigma_{1,c}^2$, $\sigma_{2,c}^2$ and $\sigma_{3,c}^2$ given by

$$\sigma_{1,c}^2 = \frac{\mathrm{Var}(C_1)}{4\, c (1 - c)}, \qquad \sigma_{2,c}^2 = \frac{\mathrm{Var}(C_1)}{4\, c (1 - c)\, p_{A0}^2 (1 - p_{A0})^2}, \qquad \sigma_{3,c}^2 = \frac{\mathrm{Var}(C_1)}{4\, c (1 - c)\, p_{A0}^2}. \tag{9}$$

Since $c = \lim_{n \to \infty} n_1/n$ denotes the asymptotic fraction of affected individuals in the basic sample, this quantity corresponds exactly to the expression $t_1$ in the variances $\sigma_1^2$, $\sigma_2^2$ and $\sigma_3^2$ from Corollary 2. Keeping in mind that the overall number of individuals taking part in the study has tripled by including the parental data $A_i$, $B_i$, $i = 1, \ldots, n$, the first addend of $\sigma_i^2$, $i = 1, 2, 3$, from (8) is consistent with $\sigma_{i,c}^2$, $i = 1, 2, 3$, from (9), respectively. The second addend of $\sigma_i^2$, $i = 1, 2, 3$, involving the covariance between
offspring and parental data has no corresponding term in $\sigma_{i,c}^2$, $i = 1, 2, 3$, due to the requirement that the individuals from the basic sample are unrelated. We can thus find the expression for $\sigma_{i,c}^2$, $i = 1, 2, 3$, as a special case of $\sigma_i^2$, $i = 1, 2, 3$, when the parental (and therefore dependent) data are excluded.

3.2 Estimation of unknown parameters

To construct statistical tests using the asymptotic normality of the test statistics $T_{n1}$, $\log(T_{n2})$ and $\log(T_{n3})$, we have to estimate the asymptotic variances. Note that $\sigma_1^2$, $\sigma_2^2$ and $\sigma_3^2$ depend on the unknown parameters $p_{A0}$, $p_1$ and $p_2$, but these parameters can be estimated in practice. By Slutsky's Theorem, the results given in Corollary 2 still hold if $\sigma_1^2$, $\sigma_2^2$ and $\sigma_3^2$ are replaced by consistent estimators.

Under the null hypothesis $H_0$, the allele frequency $p_{A0}$ of A at L1 in the two respective groups and the probabilities $p_1$ and $p_2$ of a parent being affected given that the corresponding child is affected or unaffected, respectively, can be estimated without bias and $\sqrt{n}$-consistently by the sample means of the corresponding random variables, i.e.

$$\hat p_{A0} = \frac{1}{6n} \sum_{i=1}^{n} (C_i + A_i + B_i), \qquad \hat p_1 = \frac{R_1 + S_1}{2 n_1}, \qquad \hat p_2 = \frac{R_2 + S_2}{2 n_2}.$$

Note that for the estimation of $p_{A0}$ we have to divide by $6n$ since $C_i$, $A_i$ and $B_i$, $i = 1, \ldots, n$, all have expectation $2 p_{A0}$.

For estimating the variance $\mathrm{Var}(C_1) = \mathrm{Var}(A_1) = \mathrm{Var}(B_1)$ and the covariance $\mathrm{Cov}(C_1, A_1) = \mathrm{Cov}(C_1, B_1)$, there are several possibilities. Firstly, one could simply replace $p_{A0}$ by its estimator $\hat p_{A0}$ in the relations $\mathrm{Var}(C_1) = 2 p_{A0}(1 - p_{A0})$ and $\mathrm{Cov}(C_1, A_1) = p_{A0}(1 - p_{A0})$. Alternatively, one could use the estimators $\hat v$ for $\mathrm{Var}(C_1)$ and $\widehat{\mathrm{cov}}$ for $\mathrm{Cov}(C_1, A_1)$ given by

$$\hat v = \frac{1}{3n} \sum_{i=1}^{n} \bigl(C_i^2 + A_i^2 + B_i^2\bigr) - \frac{1}{9 n (n - 1)} \sum_{i=1}^{n} \sum_{j \neq i} (C_i + A_i + B_i)(C_j + A_j + B_j),$$

$$\widehat{\mathrm{cov}} = \frac{1}{2n} \sum_{i=1}^{n} \bigl(C_i A_i + C_i B_i\bigr) - \frac{1}{9 n (n - 1)} \sum_{i=1}^{n} \sum_{j \neq i} (C_i + A_i + B_i)(C_j + A_j + B_j).$$
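The estimators of Section 3.2 can be sketched in a few lines of Python. This is a minimal sketch under the assumption that the trio genotype counts are available as three equally long lists `C`, `A`, `B` with entries in {0, 1, 2}; the function name, argument names and the toy data are hypothetical. The double sum over $j \neq i$ is evaluated through the identity $\sum_{i}\sum_{j \neq i} s_i s_j = (\sum_i s_i)^2 - \sum_i s_i^2$.

```python
def estimate_parameters(C, A, B, r1, s1, r2, s2, n1, n2):
    """Plug-in estimators from Section 3.2 (illustrative sketch).

    C, A, B: allele counts (0, 1 or 2) of child, first and second parent.
    r1, s1: affected first/second parents of affected children.
    r2, s2: affected first/second parents of unaffected children.
    n1, n2: numbers of affected and unaffected children.
    """
    n = len(C)
    totals = [c + a + b for c, a, b in zip(C, A, B)]

    p_a0_hat = sum(totals) / (6.0 * n)
    p1_hat = (r1 + s1) / (2.0 * n1)
    p2_hat = (r2 + s2) / (2.0 * n2)

    # Sum over ordered pairs i != j of totals[i] * totals[j].
    cross = sum(totals) ** 2 - sum(t * t for t in totals)
    correction = cross / (9.0 * n * (n - 1))

    v_hat = sum(c * c + a * a + b * b for c, a, b in zip(C, A, B)) / (3.0 * n) - correction
    cov_hat = sum(c * a + c * b for c, a, b in zip(C, A, B)) / (2.0 * n) - correction

    return {"p_a0": p_a0_hat, "p1": p1_hat, "p2": p2_hat,
            "var": v_hat, "cov": cov_hat}


if __name__ == "__main__":
    # Toy data for four trios, chosen only to exercise the code.
    est = estimate_parameters(C=[2, 1, 0, 1], A=[1, 1, 0, 2], B=[2, 0, 1, 1],
                              r1=1, s1=0, r2=0, s2=1, n1=2, n2=2)
    print(est)
```

With so few trios the estimates are of course very noisy; the example only shows how the formulas map onto code.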
Straightforward calculations show that these estimators are unbiased with variance converging to 0 at a rate of $1/n$. Finally, note that the parameters $p_3$ and $p_4$ describing the dependency structure between parental phenotypes given the offspring phenotype do not appear in the asymptotic variances of the test statistics in (8).

4 Extensions

From a practical viewpoint, the generalization to types of relatives beyond trios, for instance the additional inclusion of sibling data, may be of concern. Following [11], [16] and [10], a gain in power can be achieved by using data from related cases and unrelated controls to increase the allele frequency difference between the two groups. In this design only affected relatives of affected children are genotyped to enrich the data set. It is even possible to allow for missing data under a missing at random assumption. Since additional nuisance parameters then appear, more detailed investigations of these issues should be a subject of future research.

We introduced our method for a monogenic disease and a single marker locus for the sake of clarity and brevity. We will now explain how the proposed tests can be extended to genetically more complex models. Assume that a set of $k \ge 1$ candidate loci has been chosen. We can then test whether a specific allele combination (genotype), say $G = (g_1, \ldots, g_k)$, at these $k$ loci contributes to the phenotype expression as follows. Define a virtual biallelic locus $L$ with alleles $G$ and $\bar G$, where $\bar G$ consists of all theoretically possible allele combinations specified by the $k$ markers except the candidate genotype $G$. The proposed tests can then be applied to $L$, where the candidate allele A from the original test is replaced by $G$.
The only difference with respect to the original procedure is that genotypes can only appear once for each individual, so that the random variables $C_i$, $A_i$, $B_i$, which describe genotype features as the sum of two indicators on A in the one-locus case, must be replaced by indicators on $G$, yielding a modified variance of the statistic. Using this approach one may test all relevant allele
combinations. If the $k$ markers are located on the same chromosome, one could also think of conducting test procedures on haplotypes to take account of cis and trans effects; see [13]. To test the effect of a specific haplotype, say $H$, on the phenotype, define a virtual biallelic locus with alleles $H$ and $\bar H$ and apply our method. If the candidate loci are not very close, we have to adjust the expression for the covariance between parental and offspring (virtual) genotypes (corresponding to $\mathrm{Cov}(C_1, A_1)$ in the original tests) by some nuisance parameters to allow for recombination, i.e. the analogue of $\mathrm{Cov}(C_1, A_1)$ will not be exactly equal to $p_{A0}(1 - p_{A0})$ but will include the aforementioned parameters. This is, however, no obstacle to the use of these tests since the covariance will be estimated anyway. The estimator $\widehat{\mathrm{cov}}$ from Section 3.2 will still provide a $\sqrt{n}$-consistent estimate of the covariance. If the DNA sequence containing the markers is not too long, the method of molecular haplotyping can be used to identify the haplotypes; see [8]. Otherwise haplotypes can either be determined by pedigree analysis or (if that is not possible) haplotype frequencies can be estimated, for example by an EM algorithm; see [6]. A more detailed exploration of this subject will be given in a subsequent article.

An important question is whether the proposed tests are also valid under more general inheritance modes. A general model attributes different risks $f_0$, $f_1$, $f_2$ to each genotype with respect to L2, the index corresponding to the number of alleles A. Then $f_0 = 0$, $f_1 = f_2 = f$ yields the dominant and $f_0 = f_1 = 0$, $f_2 = f$ the recessive model. Theorem 1 remains valid even in this general setting.
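To make the general penetrance model concrete, the following small Python sketch computes the probability of being affected under Hardy-Weinberg equilibrium at L2 for arbitrary risks $f_0$, $f_1$, $f_2$, and shows how the dominant and recessive models arise as special cases. The function names and numerical values are assumptions chosen for illustration; only the genotype probabilities $(1-q)^2$, $2q(1-q)$, $q^2$ for an allele frequency $q$ are used.

```python
def prevalence(q, f0, f1, f2):
    """P(affected) when allele A at L2 has frequency q and HWE holds.

    f0, f1, f2 are the risks for carrying 0, 1 or 2 copies of A.
    """
    return f0 * (1 - q) ** 2 + f1 * 2 * q * (1 - q) + f2 * q ** 2


def dominant(f):
    """Risks (f0, f1, f2) of the dominant model with penetrance f."""
    return (0.0, f, f)


def recessive(f):
    """Risks (f0, f1, f2) of the recessive model with penetrance f."""
    return (0.0, 0.0, f)


if __name__ == "__main__":
    q, f = 0.2, 0.6  # hypothetical values
    print(prevalence(q, *dominant(f)))   # equals f * q * (2 - q)
    print(prevalence(q, *recessive(f)))  # equals f * q ** 2
```

Under the dominant model the prevalence reduces to $f\, q\,(2 - q)$, the quantity that appears in the denominator of $p_w$ in (3) with $q = p_{2A0}$.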
Note that although the values of the parameters $p_1$ and $p_2$, which denote the probabilities that a parent is affected given that the child is affected or not, depend on the mode of inheritance, these parameters are still consistently estimated by the estimators $\hat p_1$ and $\hat p_2$. However, under the alternative hypothesis $H_1$ the allele frequencies $\hat p_v$ and $\hat p_w$ of A at L1 in the two different groups depend on the values of $f_0$, $f_1$ and $f_2$. Therefore, when simulating the power of the tests, the mode of inheritance has to be taken into account.
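Combining the closed form of $\sigma_1^2$ from Corollary 2 with plug-in estimates as in Section 3.2, a one-sided test based on the attributable risk might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: $N_1$ and $N_3$ are read as the numbers of A and non-A alleles among affected individuals and $N_2$ as the number of A alleles among unaffected individuals (inferred from the formulas for $\hat p_v$ and $\hat p_w$, since Table 1 is not reproduced here), $c$ is replaced by $n_1/n$, and the default `alpha=0.05` is a hypothetical choice.

```python
from math import sqrt
from statistics import NormalDist


def attributable_risk_test(N1, N2, N3, n, n1, p1_hat, p2_hat, alpha=0.05):
    """One-sided test of H0: delta = 0 against H1: delta > 0 using T_n1.

    n: number of trios, n1: number of affected children.
    alpha=0.05 is an illustrative default, not taken from the article.
    """
    p_v = N1 / (N1 + N3)            # empirical frequency among affected
    p_w = N2 / (6 * n - N1 - N3)    # empirical frequency among unaffected
    t_n1 = p_v - p_w                # attributable risk statistic

    p_a0 = (N1 + N2) / (6 * n)      # pooled allele frequency estimate
    c = n1 / n                      # plug-in for the limit of n1/n
    t1 = (c + 2 * c * p1_hat + 2 * (1 - c) * p2_hat) / 3

    # Closed form of sigma_1^2 from (8) with estimates plugged in.
    sigma1_sq = (p_a0 * (1 - p_a0)
                 * (2 * c * p1_hat + (3 - c) * t1 - 4 * t1 ** 2)
                 / (18 * t1 ** 2 * (1 - t1) ** 2))

    z = sqrt(n) * t_n1 / sqrt(sigma1_sq)
    critical = NormalDist().inv_cdf(1 - alpha)
    return {"T_n1": t_n1, "z": z, "critical": critical, "reject": z > critical}


if __name__ == "__main__":
    # Hypothetical counts for n = 100 trios with 50 affected children.
    print(attributable_risk_test(N1=80, N2=120, N3=120, n=100, n1=50,
                                 p1_hat=0.4, p2_hat=0.1))
```

The tests based on $\log(T_{n2})$ and $\log(T_{n3})$ would follow the same pattern with $\sigma_2^2$ and $\sigma_3^2$ in place of $\sigma_1^2$.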
5 Simulation Study

We explored by a simulation study whether there is a substantial increase in power when parental data are included, and to what extent the actual level of significance is affected by the dependency structure of the data. We considered testing $H_0 : \delta = 0$ against $H_1 : \delta > 0$ at a nominal level $\alpha$. To simulate random variables with the same distributions as $C_i$, $A_i$ and $B_i$, $i = 1, \ldots, n$, four parameters have to be specified in advance. We fixed the values of the two allele frequencies $p_{A0}$, $p_{2A0}$, the penetrance $f$ and the LD coefficient $\delta$. Since the LD coefficient $\delta$ depends on the haplotype frequency and the allele frequencies, which complicates comparisons between different pairs of loci, it is better to consider the standardized LD coefficient

$$\delta' = \frac{\delta}{\min\{p_{A0}, p_{2A0}\} - p_{A0}\, p_{2A0}},$$

which is scaled by the maximal LD coefficient in such a way that its values lie in the unit interval $[0, 1]$ if there is positive or no LD, $\delta \ge 0$, between alleles A at L1 and A at L2. In the following, we use $\delta'$ instead of $\delta$.

Each parameter constellation was simulated using 10,000 runs for sample sizes $n = 60$, $n = 100$ and $n = 200$. Here $n_1$ was chosen according to the equal expected allocation rule, i.e. the expected number of affected individuals in the sample is equal to the expected number of unaffected individuals. For some parameter combinations, this choice was not admissible due to the restriction that the value of $n_1$ must not be greater than $n$. In these cases, we simulated the tests for several choices of $n_1$ with $0.75 n \le n_1 \le n$. We found that the tests are not very sensitive with respect to the particular choice of $n_1$ within this range. The simulated rejection rates under $H_0 : \delta = 0$ are given in Table 2 and Table 3.
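For a given parameter constellation, the LD coefficient and the group-specific allele frequencies from (3) can be computed as in the following Python sketch. It assumes the dominant model of Section 2 and the standardization of the LD coefficient by $\min\{p_{A0}, p_{2A0}\} - p_{A0}\, p_{2A0}$; the function names and the example values are hypothetical and are not the settings used for the tables.

```python
def delta_from_standardized(delta_std, p_a0, p_2a0):
    """Convert a standardized LD value in [0, 1] back to delta."""
    d_max = min(p_a0, p_2a0) - p_a0 * p_2a0   # maximal positive LD
    return delta_std * d_max


def group_allele_frequencies(delta, p_a0, p_2a0, f):
    """Allele frequencies of A at L1 among affected (p_v) and
    unaffected (p_w) individuals, following equation (3)."""
    p_v = p_a0 + delta * (1 - p_2a0) / (p_2a0 * (2 - p_2a0))
    p_w = p_a0 - delta * f * (1 - p_2a0) / (1 - f * p_2a0 * (2 - p_2a0))
    return p_v, p_w


if __name__ == "__main__":
    # Hypothetical constellation: small allele frequencies, high penetrance.
    p_a0, p_2a0, f = 0.1, 0.1, 0.6
    delta = delta_from_standardized(0.2, p_a0, p_2a0)
    p_v, p_w = group_allele_frequencies(delta, p_a0, p_2a0, f)
    print(delta, p_v, p_w)
```

A useful sanity check is that weighting $p_v$ and $p_w$ by the probabilities of being affected, $f\, p_{2A0}(2 - p_{2A0})$, and unaffected recovers the population frequency $p_{A0}$.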
Since in the following we will compare the performance of our tests with the performance of the corresponding tests including either data from unrelated subjects only or data from children with one parent each, we label by ( ) the tests including data from both parents for each child. Tables 2 and 3 show that the tests preserve the $\alpha$-level relatively well, even in small samples ($n = 60$). We simulated various further parameter combinations and always
obtained similar results. Only for very small (< 0.05) allele frequencies $p_{A0}$ do the tests exceed the $\alpha$-level when the sample size is small, because $p_{A0}$ is estimated as zero in many runs. In this case, the test based on the attributable risk turns out to be the most robust among the three tests under consideration.

Tables 4, 5 and 6 show the simulated powers of the tests when the alternative hypothesis $H_1$ is valid. For comparison, we also display the corresponding values when only data from unrelated subjects are included in the test statistics. In this case, the number of affected individuals $n_1$ was chosen according to the equal allocation rule $n_1 = n/2$. The tests including offspring data only are not given any label. The tests labeled by ( ) denote the tests including the data from the basic sample plus the data from one randomly chosen parent for each child. For these tests, the expected equal allocation rule was used to determine $n_1$.

In Table 4, we chose relatively small allele frequencies of A at the two loci L1 and L2, a small amount of LD given by a standardized LD coefficient of 0.2, and a high penetrance of $f = 0.6$. We observe that the inclusion of parental data increases the powers of the tests considerably. The same holds true for medium scale values of the allele frequencies $p_{A0}$ and $p_{2A0}$, as can be seen from Table 5. Table 6, finally, displays the values of the simulated powers when the value for the penetrance $f$ is chosen relatively small. Further, we observed virtually no difference in performance between the three tests under consideration, so that any of these tests could be used in practice.

6 Discussion

We introduce new contingency table tests, for a case-control design including related subjects, of whether a candidate locus is in linkage disequilibrium with a predisposing locus. Starting from a basic sample of i.i.d. cases and controls, parental genotype and phenotype data are added to the sample.
To define asymptotic level $\alpha$ tests we provide the asymptotic distributions of the attributable risk, the relative risk, and the odds ratio. For the nuisance parameters induced by genetic theory, we propose nonparametric estimators which are consistent for any underlying genetic inheritance model. Thus, our tests are indeed asymptotic nonparametric level $\alpha$ tests. Simulations show that the tests under consideration preserve the nominal significance level very well and are quite robust with respect to the genetic parameters. The statistical power to detect linkage disequilibrium, i.e. to detect predisposing genes, is substantially increased by including parental information. The considerations made in Section 4 show that, with some slight modifications, the tests are also applicable to testing for linkage disequilibrium between several markers and causal loci in the case of polygenic diseases.

Acknowledgements: This research was supported by the Deutsche Forschungsgemeinschaft (project: Statistical methods for the analysis of association studies in human genetics). H. Holzmann and A. Munk acknowledge financial support of the Deutsche Forschungsgemeinschaft, MU 1230/81. The authors are also grateful to S. Böhringer for some interesting discussions and helpful comments on an earlier version of this article.

Appendix: Proofs

Proof of Theorem 1: The main difficulty in the proof is to show that the components $N_1$, $N_2$ and $N_3$ of Table 1 are jointly asymptotically normally distributed. This is not evident since $N_1$ and $N_2$ contain sums of randomly chosen variables $A_i$ and $B_i$. However, these can be reconstructed as sums of deterministically chosen random variables as follows. To each affected child, we assign a seven-dimensional random vector

$$X_i = \bigl(X_i^{(1)}, X_i^{(2)}, X_i^{(3)}, X_i^{(4)}, X_i^{(5)}, X_i^{(6)}, X_i^{(7)}\bigr)^T, \quad i = 1, \ldots, n_1,$$

where $X_i^{(2)} = I\{\text{first parent of child } i \text{ is affected}\}$,
$X_i^{(5)} = I\{\text{second parent of child } i \text{ is affected}\}$, and $X_i^{(1)} = C_i$, $X_i^{(3)} = A_i$, $X_i^{(4)} = X_i^{(2)} X_i^{(3)}$, $X_i^{(6)} = B_i$ and $X_i^{(7)} = X_i^{(5)} X_i^{(6)}$. Analogously, we define the vectors

$$Y_j = \bigl(Y_j^{(1)}, Y_j^{(2)}, Y_j^{(3)}, Y_j^{(4)}, Y_j^{(5)}, Y_j^{(6)}, Y_j^{(7)}\bigr)^T, \quad j = 1, \ldots, n_2,$$

for each child from the control group. We have thus arranged dependent random variables corresponding to related individuals within the vectors $X_i$ and $Y_j$. Since we did not allow for any degree of relationship among the children, the sequences $X_i$, $i = 1, \ldots, n_1$, and $Y_j$, $j = 1, \ldots, n_2$, are independent, and the random vectors $X_i$, $i = 1, \ldots, n_1$, as well as $Y_j$, $j = 1, \ldots, n_2$, are independent and identically distributed. For the sake of brevity of notation, we denote the expectation of $C_i$, $A_i$ and $B_i$, $i = 1, \ldots, n$, by $p_A$, i.e. $p_A = 2 p_{A0}$.

Lemma 4 Let $n_1/n$ tend to some constant $c \in (0, 1)$ as $n \to \infty$. Denote by $m_x$ and $m_y$ the means of $X_1$ and $Y_1$, respectively, and by $K_x$, $K_y$ the corresponding covariance matrices. Then under $H_0$

$$\begin{pmatrix} \dfrac{1}{\sqrt{n}} \displaystyle\sum_{i=1}^{n_1} X_i - \dfrac{n_1}{\sqrt{n}}\, m_x \\[2ex] \dfrac{1}{\sqrt{n}} \displaystyle\sum_{j=1}^{n_2} Y_j - \dfrac{n_2}{\sqrt{n}}\, m_y \end{pmatrix} \xrightarrow{D} N(0, \Sigma_K), \tag{10}$$

where the covariance matrix $\Sigma_K$, given by a block matrix with blocks $c K_x$ and $(1 - c) K_y$, is nondegenerate. Explicitly, we have that

$$m_x = (p_A,\ p_1,\ p_A,\ p_1 p_A,\ p_1,\ p_A,\ p_1 p_A)^T, \qquad m_y = (p_A,\ p_2,\ p_A,\ p_2 p_A,\ p_2,\ p_A,\ p_2 p_A)^T,$$

$$K_x = \begin{pmatrix}
\mathrm{Var}(C_1) & 0 & \mathrm{Cov}(C_1, A_1) & p_1 \mathrm{Cov}(C_1, A_1) & 0 & \mathrm{Cov}(C_1, B_1) & p_1 \mathrm{Cov}(C_1, B_1) \\
0 & p_1 (1 - p_1) & 0 & p_A p_1 (1 - p_1) & p_3 & 0 & p_3 p_A \\
\mathrm{Cov}(C_1, A_1) & 0 & \mathrm{Var}(C_1) & p_1 \mathrm{Var}(C_1) & 0 & 0 & 0 \\
p_1 \mathrm{Cov}(C_1, A_1) & p_A p_1 (1 - p_1) & p_1 \mathrm{Var}(C_1) & p_1 p_A \bigl(1 + (0.5 - p_1) p_A\bigr) & p_3 p_A & 0 & p_3 p_A^2 \\
0 & p_3 & 0 & p_3 p_A & p_1 (1 - p_1) & 0 & p_A p_1 (1 - p_1) \\
\mathrm{Cov}(C_1, B_1) & 0 & 0 & 0 & 0 & \mathrm{Var}(C_1) & p_1 \mathrm{Var}(C_1) \\
p_1 \mathrm{Cov}(C_1, B_1) & p_3 p_A & 0 & p_3 p_A^2 & p_A p_1 (1 - p_1) & p_1 \mathrm{Var}(C_1) & p_1 p_A \bigl(1 + (0.5 - p_1) p_A\bigr)
\end{pmatrix}.$$
More informationAdaptive linear stepup procedures that control the false discovery rate
Biometrika (26), 93, 3, pp. 491 57 26 Biometrika Trust Printed in Great Britain Adaptive linear stepup procedures that control the false discovery rate BY YOAV BENJAMINI Department of Statistics and Operations
More informationRECENTLY, there has been a great deal of interest in
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 47, NO. 1, JANUARY 1999 187 An Affine Scaling Methodology for Best Basis Selection Bhaskar D. Rao, Senior Member, IEEE, Kenneth KreutzDelgado, Senior Member,
More informationAn Introduction to Regression Analysis
The Inaugural Coase Lecture An Introduction to Regression Analysis Alan O. Sykes * Regression analysis is a statistical tool for the investigation of relationships between variables. Usually, the investigator
More informationHow to Use Expert Advice
NICOLÒ CESABIANCHI Università di Milano, Milan, Italy YOAV FREUND AT&T Labs, Florham Park, New Jersey DAVID HAUSSLER AND DAVID P. HELMBOLD University of California, Santa Cruz, Santa Cruz, California
More informationHighRate Codes That Are Linear in Space and Time
1804 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 48, NO 7, JULY 2002 HighRate Codes That Are Linear in Space and Time Babak Hassibi and Bertrand M Hochwald Abstract Multipleantenna systems that operate
More informationAn Introduction to Variable and Feature Selection
Journal of Machine Learning Research 3 (23) 11571182 Submitted 11/2; Published 3/3 An Introduction to Variable and Feature Selection Isabelle Guyon Clopinet 955 Creston Road Berkeley, CA 9478151, USA
More informationSample Attrition Bias in Randomized Experiments: A Tale of Two Surveys
DISCUSSION PAPER SERIES IZA DP No. 4162 Sample Attrition Bias in Randomized Experiments: A Tale of Two Surveys Luc Behaghel Bruno Crépon Marc Gurgand Thomas Le Barbanchon May 2009 Forschungsinstitut zur
More informationLearning to Select Features using their Properties
Journal of Machine Learning Research 9 (2008) 23492376 Submitted 8/06; Revised 1/08; Published 10/08 Learning to Select Features using their Properties Eyal Krupka Amir Navot Naftali Tishby School of
More informationHow to Deal with The LanguageasFixedEffect Fallacy : Common Misconceptions and Alternative Solutions
Journal of Memory and Language 41, 416 46 (1999) Article ID jmla.1999.650, available online at http://www.idealibrary.com on How to Deal with The LanguageasFixedEffect Fallacy : Common Misconceptions
More informationA WithinSubject Analysis of OtherRegarding Preferences
No 06 A WithinSubject Analysis of OtherRegarding Preferences Mariana Blanco, Dirk Engelmann, HansTheo Normann September 2010 IMPRINT DICE DISCUSSION PAPER Published by HeinrichHeineUniversität Düsseldorf,
More informationAn Account of Global Factor Trade
An Account of Global Factor Trade By DONALD R. DAVIS AND DAVID E. WEINSTEIN* A half century of empirical work attempting to predict the factor content of trade in goods has failed to bring theory and data
More informationDISCUSSION PAPER SERIES. IZA DP No. 1938
DISCUSSION PAPER SERIES IZA DP No. 1938 American Exceptionalism in a New Light: A Comparison of Intergenerational Earnings Mobility in the Nordic Countries, the United Kingdom and the United States Markus
More informationBeyond Correlation: Extreme Comovements Between Financial Assets
Beyond Correlation: Extreme Comovements Between Financial Assets Roy Mashal Columbia University Assaf Zeevi Columbia University This version: October 14, 2002 Abstract This paper investigates the potential
More informationFast Solution of l 1 norm Minimization Problems When the Solution May be Sparse
Fast Solution of l 1 norm Minimization Problems When the Solution May be Sparse David L. Donoho and Yaakov Tsaig October 6 Abstract The minimum l 1 norm solution to an underdetermined system of linear
More informationDecoding by Linear Programming
Decoding by Linear Programming Emmanuel Candes and Terence Tao Applied and Computational Mathematics, Caltech, Pasadena, CA 91125 Department of Mathematics, University of California, Los Angeles, CA 90095
More informationGraphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations
Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations Jure Leskovec Carnegie Mellon University jure@cs.cmu.edu Jon Kleinberg Cornell University kleinber@cs.cornell.edu Christos
More informationNCSS Statistical Software
Chapter 06 Introduction This procedure provides several reports for the comparison of two distributions, including confidence intervals for the difference in means, twosample ttests, the ztest, the
More informationRevisiting the Edge of Chaos: Evolving Cellular Automata to Perform Computations
Revisiting the Edge of Chaos: Evolving Cellular Automata to Perform Computations Melanie Mitchell 1, Peter T. Hraber 1, and James P. Crutchfield 2 In Complex Systems, 7:8913, 1993 Abstract We present
More informationBIS RESEARCH PAPER NO. 112 THE IMPACT OF UNIVERSITY DEGREES ON THE LIFECYCLE OF EARNINGS: SOME FURTHER ANALYSIS
BIS RESEARCH PAPER NO. 112 THE IMPACT OF UNIVERSITY DEGREES ON THE LIFECYCLE OF EARNINGS: SOME FURTHER ANALYSIS AUGUST 2013 1 THE IMPACT OF UNIVERSITY DEGREES ON THE LIFECYCLE OF EARNINGS: SOME FURTHER
More information