Tests in a case control design including relatives


 Julian Francis Crawford
 3 years ago
 Views:
Transcription
1 Tests in a case control design including relatives Stefanie Biedermann i, Eva Nagel i, Axel Munk ii, Hajo Holzmann ii, Ansgar Steland i Abstract We present a new approach to handle dependent data arising from relatives in case control studies in genetic epidemiology. A general theorem is shown characterizing the joint asymptotic distribution of the entries in the associated contingency table under the null hypothesis of independence of phenotype and genotype at the observed locus, allowing to handle several common test statistics, namely the attributable risk, the relative risk, and the odds ratio. Simulations reveal good finite sample behavior of these test statistics. Our results can also be applied to more general settings in medicine or natural science whenever data from related subjects are involved. Keywords and Phrases: allele frequency, case control design, causal locus, contingency tables, dependent data, likelihood ratio test, risk measures. 1 Introduction The analysis of contingency tables is an important tool in epidemiological studies; see, e.g., [12]. For the typical situation of comparing the relative frequency of a certain feature among two samples of individuals (cases and controls), there is an extensive amount of literature on statistical tests within this setting as long as the data are independent. When data from relatives are included in a case control study, the use of the present methods is subject to severe limitations. i RuhrUniversität Bochum, Fakultät für Mathematik, Bochum, Germany ii GeorgAugustUniversität Göttingen, Institut für Mathematische Stochastik, Göttingen, Germany 1
2 The aim of this article is to give a new approach to handle data of relatives within a case control study that is based on tests in contingency tables considering the wellknown risk measures such as the odds ratio, the attributable risk and the relative risk from a new point of view. We prove a result on the asymptotic distribution of the entries of the contingency table under the null hypothesis that the feature under consideration does not influence the phenotype. From this we derive the asymptotic distributions of the test statistics under consideration, enabling us to derive critical values for the test problem. Our method is widely applicable to all kinds of tests that can be derived from contingency tables and to data of relatives from all fields of science. It is still a major challenge of genetic research to explore the genetic component of common diseases such as hypertension or asthma. Therefore, there is considerable interest in statistical methods to test candidate markers and to find predisposing alleles. Much effort has been devoted to this problem, deriving several methods under the assumption of independent data; see, e.g., [2], [7], [9], [11], [16] or [19]. This assumption, however, is violated if two individuals in the sample are related. Particularly in case of a rare disease, researchers cannot afford to include only unrelated subjects in a study, since this will inevitably result in a severe loss of power. In recent years, some methods have been published, which take dependencies among data into account; see, e.g., [18] and [3], who pursue likelihoodbased approaches, or [4], [5] and [17], who consider score tests derived from identity by descent (IBD) data from affected sibpairs. In [15] the test for trend in proportions introduced by [1] is extended to the case where family data must be accounted for by computing the exact variance of the test statistic. The approaches described above depend sensitively on several nuisance parameters from the underlying genetic model as well as the model itself. To provide nonparametric asymptotic level α tests we propose nonparametric estimators for these parameters, and simultaneously allow for dependend data. This article is organized as follows. Section 2 deals with a more detailed discussion of the genetic model under consideration. The proposed tests and their asymptotic 2
3 distributions are given in Section 3. Section 4 provides various important extensions of our method, e.g., the inclusion of more different types of relatives, missing data, polygenic disorders or different inheritance models. The finite sample properties of the tests under parameter settings of practical interest are studied by simulations in Section 5, inlcuding comparisons to the corresponding tests using data from independent individuals only. The proofs of our results, finally, are deferred to an appendix. 2 Genetic model and assumptions At first glance it seems that our assumptions concerning the number of candidate loci (one) and the mode of inheritance (dominant) imposed at this stage are quite restrictive, but we show in Section 4 how to apply our approach to an arbitrary number of markers and predisposing loci as well as different modes of inheritance. Hence, for simplicity of motivation and derivation of results, let us assume that there are two biallelic loci L1 and L2 with alleles A and Ā. L1 is the candidate locus where data are observed, whereas L2 denotes the true but unknown causal locus. We want to ascertain if there is significant evidence for linkage disequilibrium between L1 and L2 by comparing the relative frequencies of allele A at L1 in the case and the control group, respectively. We assume an dominant inheritance model, i.e. for any individual the probability of being affected given there is at least one allele A present at locus L2 is equal to the penetrance f, whereas individuals without any allele A at L2 are affected with probability zero. In The penetrance, f, is a positive but unknown constat, f 1. We consider the following study setup. A data set containing data from n 1 affected children is randomly sampled from the population as part of the cases group. Similarly, a random sample of n 2 unaffected children forms the basis of the control group. Then the total number of children is n = n 1 + n 2. There are no degrees of relationship allowed among these children to avoid dependencies within the data at this stage of the experiment. The sample consisting of data of unrelated children only will be 3
4 denoted by basic sample in the following. The children in both groups are then tested on the existence of allele A at locus L1, which is suspected to have an impact on developing the disease. More precisely, to each child we assigned a random variable C i, i = 1,..., n, which counts the number of occurrences of allele A at the candidate locus L1, i.e. 2 : Allele A occurs twice at locus L1 of child i C i = 1 : Allele A occurs exactly once at locus L1 of child i 0 : otherwise. The classic case control design introduced so far will now be enriched by dependent data from relatives from the associated nuclear families, i.e. the data consists of the children and their parents, bringing the total number of individuals taking part in the study to 3n. The random variables corresponding to parental observations are defined equivalently and denoted by A i and B i, where A i is given by 2 : Allele A occurs twice at locus L1 of the first parent of child i A i = 1 : Allele A occurs exactly once at locus L1 of the first parent of child i 0 : otherwise and B i is defined analogously for the second parent. From the above study setup, it follows that there are dependencies among the data and, as a further complication, the number of parents belonging to either the case or the control group is also random. More precisely, the numbers of affected or unaffected first and second parents conditioned on the children s phenotype are random numbers given by R 1 = number of affected first parents with affected children, R 2 = number of affected first parents with unaffected children, R 3 = number of unaffected first parents with unaffected children, 4
5 R 4 = number of unaffected first parents with affected children, and S 1,..., S 4 are defined analogously for the second parent. Evidently, the random variables R k, S k, k = 1,..., 4, are linked by the equalities R 1 +R 4 = n 1, R 2 +R 3 = n 2, S 1 + S 4 = n 1 and S 2 + S 3 = n 2. Furthermore, these variables follow binomial distributions specified as follows: R 1 Bin(n 1, p 1 ), R 2 Bin(n 2, p 2 ) and S 1 Bin(n 1, p 1 ), S 2 Bin(n 2, p 2 ), where the parameters p 1, p 2 denote the probabilities that a parent is affected given the corresponding child is affected or unaffected, respectively. Finally, the covariance structure among these random variables is characterized by the parameters p 3 and p 4 with Cov(R 1, S 1 ) = n 1 p 3 and Cov(R 2, S 2 ) = n 2 p 4. Furthermore, in this article we suppose that the following standard assumptions on the various processes involved in inheritance are satisfied. (A1) The random processes yielding the phenotype given the genotype are independent for different individuals. (A2) Gamete formation is independent of phenotype. (A3) HardyWeinberg equilibrium holds for each locus. (A4) Offspring data are randomly sampled from the two groups in the population. Our interest is on testing whether there is an influence of the observed genotype at locus L1 on the phenotype. We therefore test the null hypothesis H 0 that there is no linkage disequilibrium between the observed locus L1 and the disease locus L2 against the alternative H 1 of the existence of linkage disequilibrium (LD) between these two loci. Note that the linkage disequilibrium coefficient δ describing the LD between alleles at the two loci L1 and L2 within the population is given by δ = δ AA = h AA p A0 p 2A0, where h AA denotes the haplotype frequency of alleles A, A at loci L1 and L2, and p A0, p 2A0 stand for the allele frequencies at L1 and L2, respectively. Recall that a 5
6 single parameter is sufficient to describe the LD between two biallelic loci since the LD coefficients between arbitrary alleles at these loci only differ in sign. In terms of the LD coefficient δ, the testing problem is thus given by testing the null hypothesis (1) H 0 : δ = 0 against either the onesided or the twosided alternative (2) H 1 : δ > 0 or H 1 : δ 0. In order to construct tests for these hypotheses on the basis of observations, i.e. realizations of the random variables C i, A i, i = 1,..., n, we reformulate the above hypotheses (1) and (2) in terms of the allele frequencies of A at L1 in the two respective groups. Let p v denote the frequency of A at L1 among the affected individuals in the population, and p w the corresponding term among the unaffected individuals. The parameters p v and p w can be expressed by δ, p A0 and p 2A0 as follows (3) p v p w = p A0 + δ = p A0 δ 1 p 2A0 p 2A0 (2 p 2A0 ) f(1 p 2A0 ) 1 fp 2A0 (2 p 2A0 ). We note that p w also depends on the value of the penetrance f. The hypotheses (1) and (2) can thus be reformulated equivalently in the following way: (4) H 0 : p v = p w = p A0 (5) H 1 : p v > p w or H 1 : p v p w. 3 Test statistics and their asymptotic distributions The null hypothesis H 0 implies that the allele frequency of A at locus L1 is the same among affected and unaffected individuals in the population. It is therefore reasonable 6
7 to consider test statistics based on differences or ratios of estimators for the allele frequencies p v and p w in the case and the control group to detect deviations from the null hypothesis. We choose as estimators the empirical counterparts of p v and p w in the two respective groups, which can easily be obtained from the corresponding contingency table (see Table 1). (The sets S A, S B, occurring in Table 1 are defined as the sets of indices i, 1 i n so that the i th first or second parent is affected, respectively, i.e. S A = {i i th first parent affected, 1 i n}, S B = {i i th second parent affected, 1 i n}. Insert Table 1 The empirical allele frequency ˆp v and ˆp w within the control group can be expressed as functions of the entries of Table 1 as follows. ˆp v = N 1 /(N 1 + N 3 ), ˆp w = N 2 /(6n N 1 N 3 ) From Table 1, numerous test statistics can be derived. Since our aim is to compare in some sense the empirical allele frequencies ˆp v and ˆp w in the two respective groups, we focus on functions which measure the difference between two probabilities. Comparing probabilities such as allele frequencies is usually done by calculating the values of the corresponding risk measures, i.e. the attributable risk, the odds ratio, and the relative risk. If the sample allele frequencies instead of the true but unknown allele frequencies are inserted, we can use the risk measures as test statistics. This procedure yields the following test statistics attributable risk T n1 = N 1 N 2 = ˆp v ˆp w, N 1 + N 3 6n N 1 N 3 odds ratio T n2 = N 1(6n N 1 N 2 N 3 ) N 2 N 3 = ˆp v(1 ˆp w ) ˆp w (1 ˆp v ), relative risk T n3 = N 1(6n N 1 N 3 ) N 2 (N 1 + N 3 ) = ˆp v ˆp w. We are now ready to present the main result of this article, which gives the joint asymptotic distribution of the empirical allele frequencies ˆp v and ˆp w under the null 7
8 hypothesis H 0. This result enables us to calculate the asymptotic distributions of the test statistics T n1, T n2 and T n3, respectively, thus giving the asymptotic critical values. The proof of Theorem 1 is deferred to the appendix. Since under H 0 we have Var(C 1 ) = Var(A 1 ) = Var(B 1 ) and Cov(C 1, A 1 ) = Cov(C 1, B 1 ) the results are given in terms of Var(C 1 ) and Cov(C 1, A 1 ) instead of these five different expressions for brevity of notations. Theorem 1 Assume (A1)(A4) and that n 1 /n tends to c (0, 1). Under the null hypothesis δ = 0, the asymptotic distribition of the allele frequencies is given by n ˆp v p A0 (6) D N (0, Σ), ˆp w p A0 where the entries of the covariance matrix Σ are given by Σ 1,1 = Var(C 1 )/(12 t 1 ) + c p 1 Cov(C 1, A 1 )/(9 t 2 1) Σ 1,2 = Σ 2,1 = Cov(C 1, A 1 )(c(1 p 1 ) + (1 c)p 2 )/(18 t 1 (1 t 1 )) Σ 2,2 = Var(C 1 )/(12 (1 t 1 )) + (1 c)(1 p 2 )Cov(C 1, A 1 )/(9 (1 t 1 ) 2 ) with Var(C 1 ) = 2p A0 (1 p A0 ) and Cov(C 1, A 1 ) = p A0 (1 p A0 ). The parameter t 1 denotes the asymptotic expected percentage of affected individuals in the study, i.e. t 1 = lim n E[(n 1 + R 1 + R 2 + S 1 + S 2 )/(3n)] = (c + 2c p 1 + 2(1 c)p 2 )/3. To find critical values for the tests under consideration, we will utilize the above result to derive the asymptotic distributions of the test statistics T n1, T n2 and T n Asymptotic distributions of the test statistics The asymptotic distributions of the statistics T n1, T n2 and T n3 (attributable risk, odds ratio, and the relative risk,) are given in the following Corollary which is an application of the method. We consider the logarithm of T n2 and T n3, since these test statistics were found to preserve the nominal significance level more precisely than the original versions. 8
9 Corollary 2 Under the assumptions of Theorem 1 the asymptotic distributions of the test statistics T n1, log(t n2 ) and log(t n3 ) under H 0 are given by (7) n Tn1 D N (0, σ1), 2 D n log(t ni ) N (0, σi 2 ), i = 1, 2, where the asymptotic variances σ 2 1, σ 2 2 and σ 2 3 are obtained as σ1 2 = Var(C 1) 12 t 1 (1 t 1 ) + Cov(C 1, A 1 )(cp 1 (c + cp 1 + (1 c)p 2 )t 1 + t 2 1) 9 t 2 1(1 t 1 ) 2 = p A0(1 p A0 ) ( ) 2cp 18 t 2 1(1 t 1 ) (3 c)t 1 4t 2 1, σ2 2 = (8) = σ3 2 = = Var(C 1 ) 12 t 1 (1 t 1 )p 2 A0 (1 p A0) + Cov(C 1, A 1 )(cp 1 (c + cp 1 + (1 c)p 2 )t 1 + t 2 1) 2 9 t 2 1(1 t 1 ) 2 p 2 A0 (1 p A0) 2 1 ( ) 2cp 18 t 2 1(1 t 1 ) (3 c)t 1 4t 2 1, p A0 (1 p A0 ) Var(C 1 ) + Cov(C 1, A 1 )(cp 1 (c + cp 1 + (1 c)p 2 )t 1 + t 2 1) 12 t 1 (1 t 1 )p 2 A0 9 t 2 1(1 t 1 ) 2 p 2 A0 (1 p A0 ) ( ) 2cp 18 t 2 1(1 t 1 ) (3 c)t 1 4t 2 1 p A0 Remark 3 If only the independent random variables C i, i = 1,..., n, from the basic sample are included in this test the distributions of T n1, log(t n2 ) and log(t n3 ) under the null hypothesis H 0 also feature nconvergence to centered normal distributions with asymptotic variances σ1,c, 2 σ2,c 2 and σ3,c 2 given by (9) σ 2 1,c = Var(C 1) 4 c(1 c), σ2 2,c = Var(C 1 ) 4 c(1 c)p 2 A0 (1 p A0), 2 σ2 3,c = Var(C 1). 4 c(1 c)p 2 A0 Since c = lim n n 1 /n denotes the asymptotic fraction of affected individuals from the basic sample this quantity exactly corresponds to the expression t 1 in the variances σ1, 2 σ2 2 and σ3 2 from Corollary 2. Keeping in mind that the overall number of individuals taking part in the study has tripled by including the parental data A i, B i, i = 1,..., n, the first addend of σi 2, i = 1, 2, 3, from (8) is consistent with σi,c, 2 i = 1, 2, 3, from (9), respectively. The second addend of σi 2, i = 1, 2, 3, involving the covariance between 9
10 offspring and parental data has no corresponding term in σi,c, 2 i = 1, 2, 3, due to the requirement that the individuals from the basic sample are unrelated. We can thus find the expression for σi,c, 2 i = 1, 2, 3, as a special case of σi 2, i = 1, 2, 3, given the parental (and therefore dependent) data are excluded. 3.2 Estimation of unknown parameters To construct statistical tests using the asymptotic normality of the test statistics T n1, log(t n2 ) and log(t n3 ), we have to estimate the asymptotic variances. Note that σ 2 1, σ 2 2 and σ 2 3 depend on the unknown parameters p A0, p 1 and p 2, but these parameters can be estimated in practice. By Slutsky s Theorem, the results given in Corollary 2 still hold if σ 2 1, σ 2 2 and σ 2 3 are replaced by consistent estimators. Under the null hypothesis H 0, the allele frequency p A0 of A at L1 in the two respective groups and the probabilities p 1 and p 2 of a parent being affected given the corresponding child is affected or unaffected, respectively, can be estimated unbiasedly and n consistently by the sample means of the corresponding random variables, i.e. ˆp A0 = 1/(6n) n (C i + A i + B i ), ˆp 1 = (R 1 + S 1 )/(2n 1 ), ˆp 2 = (R 2 + S 2 )/(2n 2 ). Note that for the estimation of p A0 we have to divide by 6n since C i, A i and B i, i = 1,..., n, all have expectation 2p A0. For estimating the variance Var(C 1 ) = Var(A 1 ) = Var(B 1 ) and the covariance Cov(C 1, A 1 ) = Cov(C 1, B 1 ), there are several possibilities. Firstly, one could simply replace p A0 by its estimator ˆp A0 in the relations Var(C 1 ) = 2p A0 (1 p A0 ) and Cov(C 1, A 1 ) = p A0 (1 p A0 ). Alternatively, one could use the estimators ˆv for Var(C 1 ) and cov ˆ for Cov(C 1, A 1 ) given by n n ˆv = 1/(3n) (Ci 2 + A 2 i + Bi 2 ) 1/(9n(n 1)) n (C i + A i + B i )(C j + A j + B j ) i cov ˆ = 1/(2n) n (C i A i + C i B i ) 1/(9n(n 1)) n n (C i + A i + B i )(C j + A j + B j ). j i 10
11 Straightforward calculations show that these estimators are unbiased with variance converging to 0 at a rate of 1/n. Finally note that the parameters p 3 and p 4 describing the dependency structure between parental phenotypes given the offspring phenotype do not appear in the asymptotic variances of the test statistics in (8). 4 Extensions From a practical viewpoint the generalization to more general types of relatives than trios as the additional inclusion of sibs data may be a concern. Following [11], [16] and [10], a gain in power can be achieved by using data from related cases and unrelated controls to increase the allele frequency difference between the two groups. In this design only affected relatives of affected children are genotyped to enrich the data set. It is even possible to allow for missing data under a missing at random assumption. Since then additional nuisance parameters appear, more detailed investigations of these issues should be a subject of future research. We introduced our method for a monogenic disease and a single marker locus for the sake of clearness and brevity of this article. We will now explain, how the proposed tests can be extended to genetically more complex models. Assume that a set of k 1 candidate loci has been chosen. We can then test if a specific allele combination (genotype), say G = (g 1,..., g k ), at these k loci contributes to the phenotype expression as follows. Define a virtual biallelic locus L with alleles G and Ḡ where Ḡ consists of all theoretically possible allele combinations specified by the k markers except the candidate genotype G. The proposed tests can then be applied to L where the candidate allele A from the original test is replaced by G. The only difference with respect to the original procedure is the fact that genotypes can only appear once for each individual so that the random variables C i, A i, B i describing genotype features as the sum of two indicators on A in the one locus case must be replaced by indicators on G, yielding a modified variance of the statistic. Using that approach one may test all relevant allele 11
12 combinations. If the k markers are located on the same chromosome one could also think of conducting test procedures on haplotypes to take account of cis and trans effects, see [13]. To test the effect of a specific haplotype, say H, on the phenotype, define a virtual biallelic locus with alleles H and H and applying our method. If the candidate loci are not very close, we have to adjust the expression for the covariance between parental and offspring (virtual) genotypes (corresponding to Cov(C 1, A 1 ) in the original tests) by some nuisance parameters to allow for recombination, i.e. the analogon to Cov(C 1, A 1 ) will not exactly be equal to p A0 (1 p A0 ) but include the aforementioned parameters. This is, however, no obstacle for the use of these tests since the covariance will be estimated anyway. The estimator cov ˆ from section 3.2 will still provide a nconsistent estimate for the covariance. If the DNA sequence containing the markers is not too long the method of molecular haplotyping can be used to identify the haplotypes; see [8]. Otherwise haplotypes can either be determined by pedigree analysis or (if that is not possible) haplotype frequencies can be estimated for example by an EMalgorithm; see [6]. A more detailed exploration of this subject will be given in a subsequent article. An important question is whether the proposed tests are also valid under more general inheritance modes. A general model attributes different risks f 0, f 1, f 2, to each genotype with respect to L2, the index corresponding to the number of alleles A. Then f 0 = 0, f 1 = f 2 = f yields the dominant and f 0 = f 1 = 0, f 2 = f the recessive model. Theorem 1 still remains valid even in this general setting. Note that although the values of the parameters p 1 and p 2, which denote the probabilities that a parent is affected given that the child is affected or not, depend on the mode of inheritance, these parameters are still consistently estimated by the estimators ˆp 1 and ˆp 2. However, under the alternative hypothesis H 1 the allele frequencies ˆp v and ˆp w of A at L1 in the two different groups depend on the different values of f 0, f 1 and f 2. Therefore when simulating the power of the tests, the mode of inheritance has to be taken into account. 12
13 5 Simulation Study We explored by a simulation study whether there is a substantial increase in power when parental data are included, and to which extent the actual level of significance is affected by the dependency structure of the data. We considered testing H 0 : δ = 0 against H 1 : δ > 0 at a nominal level α = To simulate random variables with the same distributions as C i, A i and B i, i = 1,..., n, four parameters have to be specified in advance. We fixed the values of the two allele frequencies p A0, p 2A0, the penetrance f and the LD coefficient δ. Since the LD coefficient δ depends on the haplotype frequency and the allele frequencies, which complicates comparisons between different pairs of loci, it is better to consider the maximal LD coefficient δ = δ min{p A0,p 2A0 } p A0 p 2A0, which is standardized in such a way that its values lie in the unit interval [0, 1] if there is positive or no LD δ 0 between alleles A at L1 and A at L2. In the following, we use δ instead of δ. Each parameter constellation was simulated using 10,000 runs for sample sizes n = 60, n = 100 and n = 200. Here n 1 was chosen according to the equal expected allocation rule, i.e. the expected number of affected individuals in the sample is equal to the expected number of unaffected individuals. For some parameter combinations, this choice was not admissible due to the restriction that the value of n 1 must not be greater than n. In these cases, we simulated the tests for several choices of n 1 with 0.75n n n. We found that the tests are not very sensitive with respect to the particular choice of n 1 within this range. The simulated rejection rates under H 0 : δ = 0 are given in Table 2 and Table 3). Since in the following we will compare the performance of our tests with the performance of the corresponding tests including either data from unrelated subjects only or data from children with one parent each, we label by ( ) the tests including data from both parents for each child. Tables 2 and 3 show that the tests preserve the αlevel relatively well, even in small samples (n = 60). We simulated various further parameter combinations and always 13
14 received similar results. Only for very small (< 0.05) allele frequencies p A0, the tests exceed the αlevel when the sample size is small due to estimating p A0 to zero in many runs. In this case, the test based on the attributable risk turns out to be the most robust one among the three tests under consideration. Tables 4, 5 and 6 show the simulated powers of the tests when the alternative hypothesis H 1 is valid. For comparison, we also display the corresponding values when only data of unrelated subjects are included in the test statistics. In this case, the number of affected individuals n 1 was chosen according to the equal allocation rule n 1 = n/2. The tests including offspring data only are not given any label. The tests labeled by ( ) denote the tests including the data from the basic sample plus the data from one randomly chosen parent for each child. For these tests, the expected equal allocation rule was used to determine n 1. In Table 4, we chose relatively small allele frequencies of A at the two loci L1 and L2, a small amount of LD measured by δ = 0.2 and a high penetrance of f = 0.6. We observe that the inclusion of parental data increases the powers of the tests considerably. The same holds true for medium scale values of the allele frequencies p A0 and p 2A0 as can be seen from Table 5. Table 6, finally, displays the values of the simulated powers when the value for the penetrance f is chosen relatively small. Further,we observed virtually no difference in performance between the three tests under consideration so that any of these tests could be used in practice. 6 Discussion We introduce new contingency table tests for a casecontrol design including related subjects whether a candidate locus is in linkage disequlibrium with a predisposing locus. Starting from a basic sample of i.i.d. cases and controls, parental genotype and phenotype data are added to the sample. To define asymptotic level α tests we provide the asymptotic distributions of the attributable, the relative risk, and the odds ra 14
15 tio. For the nuisance parameters induced by genetic theory, we propose nonparametric which are consistent for any underlying genetic inheritance model. Thus, our tests are indeed asymptotic nonparametric level α tests. Simulations show that the tests under consideration preserve the nominal significance level very well and are quite robust with respect to the genetic parameters. The statistical power to detect linkage disequilibrium, i.e. to detect predisposing genes, is substantially increased by including parental information. The considerations made in Section 4 show that the tests are also applicable to test for linkage disequilibrium between several markers and causal loci in case of polygenic diseases if some slight modifications are implemented. Acknowledgements: This research was supported by the Deutsche Forschungsgemeinschaft (project: Statistical methods for the analysis of association studies in human genetics). H. Holzmann and A. Munk acknowledge financial support of the Deutsche Forschungsgemeinschaft, MU 1230/81. The authors are also grateful to S. Böhringer for some interesting discussions and helpful comments on an earlier version of this article. Appendix: Proofs Proof of Theorem 1: The main difficulty in the proof is to show that the components N 1, N 2 and N 3 of Table 1 are jointly asymptotically normally distributed. This is not evident since N 1 and N 2 contain sums of randomly chosen variables A i and B i. However, these can be reconstructed as sums of deterministically chosen random variables as follows. To each affected child, we assign a sevendimensional random vector X i = (X (1) i, X (2) i, X (3) i, X (4) i, X (5) i, X (6) i, X (7) i ) T, i = 1,..., n 1, where X (2) i = I{first parent of child i is affected}, 15
16 X (5) i = I{second parent of child i is affected}, and X (1) i = C i, X (3) i = A i, X (4) i = X (2) i X (3) i, X (6) i = B i and X (7) i = X (5) i X (6) i. Analogously, we define the vectors Y j = (Y (1) j, Y (2) j, Y (3) j, Y (4) j, Y (5) j, Y (6) j, Y (7) j ) T, j = 1,..., n 2, for each child from the control group. We have thus arranged dependent random variables corresponding to related individuals within the vectors X i and Y j. Since we did not allow for any degree of relationship among the children, the sequences X i, i = 1,..., n 1, and Y j, j = 1,..., n 2, are independent, and the random vectors X i, i = 1,..., n 1, as well as Y j, j = 1,..., n 2, are independent identically distributed. For the sake of brevity of notation, we denote the expectation of C i, A i and B i, i = 1,..., n, by p A, i.e. p A = 2 p A0. Lemma 4 Let n 1 /n tend to some constant c (0, 1) for n. Denote by m x and m y the means of the X 1 and Y 1, respectively, and by K x, K y the corresponding covariance matrices. Then under H 0 (10) 1 n1 n n X i n 1 n m x 1 n n2 Y i n 2 n m y D N (0, Σ K ), where the covariance matrix Σ K, given by a block matrix with blocks ck x and (1 c)k y, is nondegenerate. Explicitely, we have that m x = (p A, p 1, p A, p 1 p A, p 1, p A, p 1 p A ) T, m y = (p A, p 2, p A, p 2 p A, p 2, p A, p 2 p A ) T, Var(C 1 ) 0 Cov(C 1, A 1 ) p 1 Cov(C 1, A 1 ) 0 Cov(C 1, B 1 ) p 1 Cov(C 1, B 1 ) 0 p 1 (1 p 1 ) 0 p A p 1 (1 p 1 ) p 3 0 p 3 p A Cov(C 1, A 1 ) 0 Var(C 1 ) p 1 Var(C 1 ) K x = p 1 Cov(C 1, A 1 ) p A p 1 (1 p 1 ) p 1 Var(C 1 ) p 1 p A (1 + (0.5 p 1 )p A ) p 3 p A 0 p 3 p 2. A 0 p 3 0 p 3 p A p 1 (1 p 1 ) 0 p A p 1 (1 p 1 ) Cov(C 1, B 1 ) Var(C 1 ) p 1 Var(C 1 ) p 1 Cov(C 1, B 1 ) p 3 p A 0 p 3 p 2 A p A p 1 (1 p 1 ) p 1 Var(C 1 ) p 1 p A (1 + (0.5 p 1 )p A ) 16
17 Var(C 1 ) 0 Cov(C 1, A 1 ) p 2 Cov(C 1, A 1 ) 0 Cov(C 1, B 1 ) p 2 Cov(C 1, B 1 ) 0 p 2 (1 p 2 ) 0 p A p 2 (1 p 2 ) p 4 0 p 4 p A Cov(C 1, A 1 ) 0 Var(C 1 ) p 2 Var(C 1 ) K y = p 2 Cov(C 1, A 1 ) p A p 2 (1 p 2 ) p 2 Var(C 1 ) p 2 p A (1 + (0.5 p 2 )p A ) p 4 p A 0 p 4 p 2. A 0 p 4 0 p 4 p A p 2 (1 p 2 ) 0 p A p 2 (1 p 2 ) Cov(C 1, B 1 ) Var(C 1 ) p 2 Var(C 1 ) p 2 Cov(C 1, B 1 ) p 4 p A 0 p 4 p 2 A p A p 2 (1 p 2 ) p 2 Var(C 1 ) p 2 p A (1 + (0.5 p 2 )p A ) Proof of Lemma 4: The expectations and covariance matrices of X 1 and Y 1 under H 0 are obtained by straightforward calculations. Note that under the null hypothesis the random variables X (2) i, X (5) i and Y (2) j, Y (5) j corresponding to phenotype data are independent of the random variables X (1) i, X (3) i, X (6) i and Y (1) j, Y (3) j, Y (6) j describing genotype features of the individuals. Asymptotic normality follows from an application of the multivariate central limit theorem for iid vectors to both sequences separately, getting asymptotic results for the sums of the vectors X i and Y j. Since the sequences X i, i = 1,..., n 1, and Y j, j = 1,..., n 2, are independent, it remains to put the sums into one vector and standardize accordingly. To prove that both covariance matrices K x and K y are nondegenerate we calculate the determinants of the respective matrices and obtain K x = 0.5p 2 1(1 p 1 ) 2 Var(C 1 ) 5 (p 2 1(1 p 1 ) 2 p 2 3) K y = 0.5p 2 2(1 p 2 ) 2 Var(C 1 ) 5 (p 2 2(1 p 2 ) 2 p 2 4) where we used that Var(C 1 ) = 2 p A0 (1 p A0 ) and Cov(C 1, A 1 ) = p A0 (1 p A0 ). Since p A0, p 1 and p 2 are assumed to lie in (0, 1) it remains to show that p 2 3 < p 2 1(1 p 1 ) 2 and p 2 4 < p 2 2(1 p 2 ) 2. Recall that the term p 2 1(1 p 1 ) 2 p 2 3 is given by p 2 1(1 p 1 ) 2 p 2 3 = Var(X (2) 1 )Var(X (5) 1 ) [Cov(X (2) 1, X (5) 1 )] 2 0 by Hölder s inequality. As X (2) 1 and X (5) 1 are not linearly dependent (the four possible outcome combinations for X (2) 1 and X (5) 1 each occur with positive probability) even 17
18 the strict inequality holds and thus the determinant of K x is positive. An analogous reasoning applies to show that K y is also greater than zero. An application of the method will now give us the joint asymptotic distribution of the entries of the contingency table under H 0. Lemma 5 Let, again, n 1 /n tend to some arbitrary constant c (0, 1) for n. Then the following assertion holds under the null hypothesis H 0. (11) with expectations n N 1 E[N 1] n n N 2 E[N 2] n n N 3 E[N 3] n n D N (0, Σ N ), E[N 1 ] = (n 1 + 2n 1 p 1 + 2n 2 p 2 )p A, E[N 2 ] = (3n n 1 2n 1 p 1 2n 2 p 2 )p A, E[N 3 ] = (n 1 + 2n 1 p 1 + 2n 2 p 2 )(2 p A ), and the entries Σ N,i,j, i, j = 1, 2, 3, of the covariance matrix Σ N are given by Σ N,1,1 = Var(C 1 ){c + 2cp 1 + 2(1 c)p 2 } + 4Cov(C 1, A 1 )cp 1 +2p 2 A{cp 1 (1 p 1 ) + (1 c)p 2 (1 p 2 ) + cp 3 + (1 c)p 4 } Σ N,1,2 = 2Cov(C 1, A 1 ){c(1 p 1 ) + (1 c)p 2 } 2p 2 A{cp 1 (1 p 1 ) + (1 c)p 2 (1 p 2 ) + cp 3 + (1 c)p 4 } Σ N,1,3 = Var(C 1 ){c + 2cp 1 + 2(1 c)p 2 } 4Cov(C 1, A 1 )cp 1 +2p A (2 p A ){cp 1 (1 p 1 ) + (1 c)p 2 (1 p 2 ) + cp 3 + (1 c)p 4 } Σ N,2,2 = Var(C 1 ){3 c 2cp 1 2(1 c)p 2 } + 4Cov(C 1, A 1 )(1 c)(1 p 2 ) +2p 2 A{cp 1 (1 p 1 ) + (1 c)p 2 (1 p 2 ) + cp 3 + (1 c)p 4 } Σ N,2,3 = 2Cov(C 1, A 1 ){c(1 p 1 ) + (1 c)p 2 } 2p A (2 p A ){cp 1 (1 p 1 ) + (1 c)p 2 (1 p 2 ) + cp 3 + (1 c)p 4 } 18
19 Σ N,3,3 = Var(C 1 ){c + 2cp 1 + 2(1 c)p 2 } + 4Cov(C 1, A 1 )cp 1 +2(2 p A ) 2 {cp 1 (1 p 1 ) + (1 c)p 2 (1 p 2 ) + cp 3 + (1 c)p 4 }. Proof of Lemma 5: We can express the entries of the contingency table 1 as functions of the sums of the entries of the vectors X i, i = 1,..., n 1, Y j, j = 1,..., n 2, i.e. N 1 = N 2 = + n 1 n 2 j=1 n 1 X (1) i + Y (1) j + X (6) i n 1 n 1 n 1 X (4) i + X (3) i X (7) i + n 2 j=1 n 1 n 2 j=1 Y (4) j + X (4) i + Y (6) j n 1 n 2 j=1 n 2 j=1 X (7) i + Y (3) j Y (7) j n 2 j=1 n 2 j=1 Y (7) j Y (4) j N 3 n 1 = 2n 1 n X (5) i n 1 i + 2 X (1) n 1 X (7) X (2) i n 2 i + 2 j=1 n 1 Y (5) j n 2 i + 2 X (4) n 2 j=1 j=1 Y (7) j. Y (2) j n 2 j=1 Y (4) j Interpreting the vector (N 1, N 2, N 3 ) T as a (measurable and differentiable) function from IR 14 to IR 3, we obtain the statement of Lemma 5 by application of the method. Application of the method to the function (ˆp v, ˆp w ) T, where ˆp v = N 1 /(N 1 + N 3 ) and ˆp w = N 2 /(6n N 1 N 3 ) then yields (6). The formulas for Var(C 1 ), Cov(C 1, A 1 ) and t 1 are obtained by straightforward calculations. References [1] Armitage P. (1955). Tests for linear trends in proportions and frequencies. Biometrics, 11, [2] Böhringer S., Hardt C., Miterski B., Steland A. and Epplen J.T. (2003). Multilocus statistics to uncover epistasis and heterogeneity in complex diseases: revisiting a set of multiple sclerosis data. European Journal of Human Genetics, 11,
20 [3] Böhringer S. and Steland A. (2004). Testing genetic causality for binary traits using parentoffspring data. Under revision in: Biometrical Journal. [4] Dudoit S. and Speed T.P. (1999). A score test for linkage using identity by descent data from sibships, Annals of Statistics, 27, [5] Dudoit S. and Speed T.P. (2000). A score test for the linkage analysis of qualitative and quantitative traits based on identity by descent data from sibpairs, Biostatistics, 1, [6] Fallin D., Cohen A., Essioux L., Chumakov I., Blumenfeld M., Cohen D. and Schork N.J. (2001). Genetic analysis of case/control data using estimated haplotype frequencies: Application to APOE locus variation and Alzheimer s disease. Genome Research, 11, [7] Hoh J., Wille A. and Ott J. (2001). Trimming, weighting, and grouping SNPs in human casecontrol association studies. Genome Research, 11, [8] MichalatosBeloin S., Tishkoff S., Bentley K., Kidd K. and Ruano G. (1996). Molecular haplotyping of genetic markers 10 kb apart by allelespecific longrange PCR. Nucleic Acids Research, 24, [9] Nelson M.R., Kardia S.L., Ferrell R.E. and Sing C.F. (2001). A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Research, 11, [10] Risch N. (2000). Searching for genetic determinants in the new millenium, Nature, 405, [11] Risch N. and Teng J. (1998). The relative power of familybased and casecontrol designs for linkage disequilibrium studies of complex human diseases. I. DNA pooling. Genome Research, 8,
Recombination and Linkage
Recombination and Linkage Karl W Broman Biostatistics & Medical Informatics University of Wisconsin Madison http://www.biostat.wisc.edu/~kbroman The genetic approach Start with the phenotype; find genes
More informationCOMPLEX GENETIC DISEASES
COMPLEX GENETIC DISEASES Date: Sept 28, 2005* Time: 9:30 am 10:20 am* Room: G202 Biomolecular Building Lecturer: David Threadgill 4340 Biomolecular Building dwt@med.unc.edu Office Hours: by appointment
More informationChapter 4. The analysis of Segregation
Chapter 4. The analysis of Segregation 1 The law of segregation (Mendel s rst law) There are two alleles in the two homologous chromosomes at a certain marker. One of the two alleles received from one
More informationHLA data analysis in anthropology: basic theory and practice
HLA data analysis in anthropology: basic theory and practice Alicia SanchezMazas and José Manuel Nunes Laboratory of Anthropology, Genetics and Peopling history (AGP), Department of Anthropology and Ecology,
More informationLogistic Regression (1/24/13)
STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used
More informationMATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators...
MATH4427 Notebook 2 Spring 2016 prepared by Professor Jenny Baglivo c Copyright 20092016 by Jenny A. Baglivo. All Rights Reserved. Contents 2 MATH4427 Notebook 2 3 2.1 Definitions and Examples...................................
More informationCHAPTER NINE. Key Concepts. McNemar s test for matched pairs combining pvalues likelihood ratio criterion
CHAPTER NINE Key Concepts chisquare distribution, chisquare test, degrees of freedom observed and expected values, goodnessoffit tests contingency table, dependent (response) variables, independent
More informationTest of Hypotheses. Since the NeymanPearson approach involves two statistical hypotheses, one has to decide which one
Test of Hypotheses Hypothesis, Test Statistic, and Rejection Region Imagine that you play a repeated Bernoulli game: you win $1 if head and lose $1 if tail. After 10 plays, you lost $2 in net (4 heads
More informationGenetic Linkage Analysis
S SG ection ON tatistical enetics Genetic Linkage Analysis Hemant K Tiwari, Ph.D. Section on Statistical Genetics Department of Biostatistics School of Public Health How do we find genes implicated in
More informationThe Delta Method and Applications
Chapter 5 The Delta Method and Applications 5.1 Linear approximations of functions In the simplest form of the central limit theorem, Theorem 4.18, we consider a sequence X 1, X,... of independent and
More informationOutlier Detection and False Discovery Rates. for Wholegenome DNA Matching
Outlier Detection and False Discovery Rates for Wholegenome DNA Matching Tzeng, JungYing, Byerley, W., Devlin, B., Roeder, Kathryn and Wasserman, Larry Department of Statistics Carnegie Mellon University
More informationMultifactorial Traits. Chapter Seven
Multifactorial Traits Chapter Seven Multifactorial Not all diseases are Mendelian Multifactorial = many factors In Genetics: Multifactorial = both environment and genetics (usually more than one gene)
More informationLecture 7: Binomial Test, Chisquare
Lecture 7: Binomial Test, Chisquare Test, and ANOVA May, 01 GENOME 560, Spring 01 Goals ANOVA Binomial test Chi square test Fisher s exact test Su In Lee, CSE & GS suinlee@uw.edu 1 Whirlwind Tour of One/Two
More informationMultiple Testing. Joseph P. Romano, Azeem M. Shaikh, and Michael Wolf. Abstract
Multiple Testing Joseph P. Romano, Azeem M. Shaikh, and Michael Wolf Abstract Multiple testing refers to any instance that involves the simultaneous testing of more than one hypothesis. If decisions about
More informationTwolocus population genetics
Twolocus population genetics Introduction So far in this course we ve dealt only with variation at a single locus. There are obviously many traits that are governed by more than a single locus in whose
More informationNature of Genetic Material. Nature of Genetic Material
Core Category Nature of Genetic Material Nature of Genetic Material Core Concepts in Genetics (in bold)/example Learning Objectives How is DNA organized? Describe the types of DNA regions that do not encode
More informationQuantitative Genetics: II  Advanced Topics:
1999, 000 Gregory Carey Chapter 19: Advanced Topics  1 Quantitative Genetics: II  Advanced Topics: In this section, mathematical models are developed for the computation of different types of genetic
More informationRecall this chart that showed how most of our course would be organized:
Chapter 4 OneWay ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical
More informationGENOMIC SELECTION: THE FUTURE OF MARKER ASSISTED SELECTION AND ANIMAL BREEDING
GENOMIC SELECTION: THE FUTURE OF MARKER ASSISTED SELECTION AND ANIMAL BREEDING Theo Meuwissen Institute for Animal Science and Aquaculture, Box 5025, 1432 Ås, Norway, theo.meuwissen@ihf.nlh.no Summary
More informationGAW 15 Problem 3: Simulated Rheumatoid Arthritis Data Full Model and Simulation Parameters
GAW 15 Problem 3: Simulated Rheumatoid Arthritis Data Full Model and Simulation Parameters Michael B Miller , Michael Li , Gregg Lind , SoonYoung
More informationSample size determination in logistic regression
Sankhyā : The Indian Journal of Statistics 2010, Volume 72B, Part 1, pp. 5875 c 2010, Indian Statistical Institute Sample size determination in logistic regression M. Khorshed Alam University of Cincinnati
More informationPopulation Genetics (Outline)
Population Genetics (Outline) Definition of terms of population genetics: population, species, gene, pool, gene flow Calculation of genotypic of homozygous dominant, recessive, or heterozygous individuals,
More informationGlobally, about 9.7% of cancers in men are prostate cancers, and the risk of developing the
Chapter 5 Analysis of Prostate Cancer Association Study Data 5.1 Risk factors for Prostate Cancer Globally, about 9.7% of cancers in men are prostate cancers, and the risk of developing the disease has
More informationCONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE
1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 3 Problems with small populations 8 II. Why Random Sampling is Important 9 A myth,
More informationThe Variability of PValues. Summary
The Variability of PValues Dennis D. Boos Department of Statistics North Carolina State University Raleigh, NC 276958203 boos@stat.ncsu.edu August 15, 2009 NC State Statistics Departement Tech Report
More informationMultivariate Normal Distribution
Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #47/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues
More informationBasic Premises of Population Genetics
Population genetics is concerned with the origin, amount and distribution of genetic variation present in populations of organisms, and the fate of this variation through space and time. The fate of genetic
More informationBIOSTATISTICS QUIZ ANSWERS
BIOSTATISTICS QUIZ ANSWERS 1. When you read scientific literature, do you know whether the statistical tests that were used were appropriate and why they were used? a. Always b. Mostly c. Rarely d. Never
More informationNonAdditive Animal Models
NonAdditive Animal Models 1 NonAdditive Genetic Effects Nonadditive genetic effects (or epistatic effects) are the interactions among loci in the genome. There are many possible degrees of interaction
More informationChapter 25: Population Genetics
Chapter 25: Population Genetics Student Learning Objectives Upon completion of this chapter you should be able to: 1. Understand the concept of a population and polymorphism in populations. 2. Apply the
More informationBasic Principles of Forensic Molecular Biology and Genetics. Population Genetics
Basic Principles of Forensic Molecular Biology and Genetics Population Genetics Significance of a Match What is the significance of: a fiber match? a hair match? a glass match? a DNA match? Meaning of
More informationIntroduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models General Linear Models  part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK2800 Kgs. Lyngby
More informationTutorial 5: Hypothesis Testing
Tutorial 5: Hypothesis Testing Rob Nicholls nicholls@mrclmb.cam.ac.uk MRC LMB Statistics Course 2014 Contents 1 Introduction................................ 1 2 Testing distributional assumptions....................
More information1 Limiting distribution for a Markov chain
Copyright c 2009 by Karl Sigman Limiting distribution for a Markov chain In these Lecture Notes, we shall study the limiting behavior of Markov chains as time n In particular, under suitable easytocheck
More informationLOGNORMAL MODEL FOR STOCK PRICES
LOGNORMAL MODEL FOR STOCK PRICES MICHAEL J. SHARPE MATHEMATICS DEPARTMENT, UCSD 1. INTRODUCTION What follows is a simple but important model that will be the basis for a later study of stock prices as
More informationThe VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series.
Cointegration The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series. Economic theory, however, often implies equilibrium
More informationHeritability: Twin Studies. Twin studies are often used to assess genetic effects on variation in a trait
TWINS AND GENETICS TWINS Heritability: Twin Studies Twin studies are often used to assess genetic effects on variation in a trait Comparing MZ/DZ twins can give evidence for genetic and/or environmental
More informationE3: PROBABILITY AND STATISTICS lecture notes
E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................
More informationChapter 7. Sealedbid Auctions
Chapter 7 Sealedbid Auctions An auction is a procedure used for selling and buying items by offering them up for bid. Auctions are often used to sell objects that have a variable price (for example oil)
More informationMCB142/IB163 Mendelian and Population Genetics 9/19/02
MCB142/IB163 Mendelian and Population Genetics 9/19/02 Practice questions lectures 512 (Hardy Weinberg, chisquare test, Mendel s second law, gene interactions, linkage maps, mapping human diseases, nonrandom
More informationSTATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 SigmaRestricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
More informationDepartment of Economics
Department of Economics On Testing for Diagonality of Large Dimensional Covariance Matrices George Kapetanios Working Paper No. 526 October 2004 ISSN 14730278 On Testing for Diagonality of Large Dimensional
More informationUnraveling versus Unraveling: A Memo on Competitive Equilibriums and Trade in Insurance Markets
Unraveling versus Unraveling: A Memo on Competitive Equilibriums and Trade in Insurance Markets Nathaniel Hendren January, 2014 Abstract Both Akerlof (1970) and Rothschild and Stiglitz (1976) show that
More informationNotes for STA 437/1005 Methods for Multivariate Data
Notes for STA 437/1005 Methods for Multivariate Data Radford M. Neal, 26 November 2010 Random Vectors Notation: Let X be a random vector with p elements, so that X = [X 1,..., X p ], where denotes transpose.
More informationAP: LAB 8: THE CHISQUARE TEST. Probability, Random Chance, and Genetics
Ms. Foglia Date AP: LAB 8: THE CHISQUARE TEST Probability, Random Chance, and Genetics Why do we study random chance and probability at the beginning of a unit on genetics? Genetics is the study of inheritance,
More informationHuman Mendelian Disorders. Genetic Technology. What is Genetics? Genes are DNA 9/3/2008. Multifactorial Disorders
Human genetics: Why? Human Genetics Introduction Determine genotypic basis of variant phenotypes to facilitate: Understanding biological basis of human genetic diversity Prenatal diagnosis Predictive testing
More informationWhy Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
More informationMultivariate Analysis of Ecological Data
Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology
More informationInstitute of Actuaries of India Subject CT3 Probability and Mathematical Statistics
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in
More informationLinkage, recombination and gene mapping. Why are linkage relationships important?
Transmission patterns of linked genes Linkage, recombination and gene mapping Why are linkage relationships important? Linkage and independent assortment  statistical tests of hypotheses Recombination
More informationDoptimal plans in observational studies
Doptimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
More informationTopic 6: Conditional Probability and Independence
Topic 6: September 1520, 2011 One of the most important concepts in the theory of probability is based on the question: How do we modify the probability of an event in light of the fact that something
More informationChapter 4: Vector Autoregressive Models
Chapter 4: Vector Autoregressive Models 1 Contents: Lehrstuhl für Department Empirische of Wirtschaftsforschung Empirical Research and und Econometrics Ökonometrie IV.1 Vector Autoregressive Models (VAR)...
More informationChapter 23. (Mendelian) Population. Gene Pool. Genetic Variation. Population Genetics
30 25 Chapter 23 Population Genetics Frequency 20 15 10 5 0 A B C D F Grade = 57 Avg = 79.5 % (Mendelian) Population A group of interbreeding, sexually reproducing organisms that share a common set of
More information2 GENETIC DATA ANALYSIS
2.1 Strategies for learning genetics 2 GENETIC DATA ANALYSIS We will begin this lecture by discussing some strategies for learning genetics. Genetics is different from most other biology courses you have
More informationVasicek Single Factor Model
Alexandra Kochendörfer 7. Februar 2011 1 / 33 Problem Setting Consider portfolio with N different credits of equal size 1. Each obligor has an individual default probability. In case of default of the
More informationInferential Statistics
Inferential Statistics Sampling and the normal distribution Zscores Confidence levels and intervals Hypothesis testing Commonly used statistical methods Inferential Statistics Descriptive statistics are
More informationCONTRIBUTIONS TO ZERO SUM PROBLEMS
CONTRIBUTIONS TO ZERO SUM PROBLEMS S. D. ADHIKARI, Y. G. CHEN, J. B. FRIEDLANDER, S. V. KONYAGIN AND F. PAPPALARDI Abstract. A prototype of zero sum theorems, the well known theorem of Erdős, Ginzburg
More informationOverview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
More informationExtraneous markers used for genetic similarity leads to loss of power in GWAS and heritability determination
Extraneous markers used for genetic similarity leads to loss of power in GWAS and heritability determination Christoph Lippert 1*, Gerald Quon 1, Jennifer Listgarten 1*, and David Heckerman 1* 1 escience
More informationThe Basics of Interest Theory
Contents Preface 3 The Basics of Interest Theory 9 1 The Meaning of Interest................................... 10 2 Accumulation and Amount Functions............................ 14 3 Effective Interest
More informationMATH10212 Linear Algebra. Systems of Linear Equations. Definition. An ndimensional vector is a row or a column of n numbers (or letters): a 1.
MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0534405967. Systems of Linear Equations Definition. An ndimensional vector is a row or a column
More informationIt is important to bear in mind that one of the first three subscripts is redundant since k = i j +3.
IDENTIFICATION AND ESTIMATION OF AGE, PERIOD AND COHORT EFFECTS IN THE ANALYSIS OF DISCRETE ARCHIVAL DATA Stephen E. Fienberg, University of Minnesota William M. Mason, University of Michigan 1. INTRODUCTION
More information11. Analysis of Casecontrol Studies Logistic Regression
Research methods II 113 11. Analysis of Casecontrol Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
More informationFalse Discovery Rates
False Discovery Rates John D. Storey Princeton University, Princeton, USA January 2010 Multiple Hypothesis Testing In hypothesis testing, statistical significance is typically based on calculations involving
More informationMathematics Course 111: Algebra I Part IV: Vector Spaces
Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 19967 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are
More informationU.C. Berkeley CS276: Cryptography Handout 0.1 Luca Trevisan January, 2009. Notes on Algebra
U.C. Berkeley CS276: Cryptography Handout 0.1 Luca Trevisan January, 2009 Notes on Algebra These notes contain as little theory as possible, and most results are stated without proof. Any introductory
More information3.6: General Hypothesis Tests
3.6: General Hypothesis Tests The χ 2 goodness of fit tests which we introduced in the previous section were an example of a hypothesis test. In this section we now consider hypothesis tests more generally.
More informationBasics of Marker Assisted Selection
asics of Marker ssisted Selection Chapter 15 asics of Marker ssisted Selection Julius van der Werf, Department of nimal Science rian Kinghorn, Twynam Chair of nimal reeding Technologies University of New
More informationChapter 11: Two Variable Regression Analysis
Department of Mathematics Izmir University of Economics Week 1415 20142015 In this chapter, we will focus on linear models and extend our analysis to relationships between variables, the definitions
More information1 Approximating Set Cover
CS 05: Algorithms (Grad) Feb 224, 2005 Approximating Set Cover. Definition An Instance (X, F ) of the setcovering problem consists of a finite set X and a family F of subset of X, such that every elemennt
More informationNonparametric adaptive age replacement with a onecycle criterion
Nonparametric adaptive age replacement with a onecycle criterion P. CoolenSchrijner, F.P.A. Coolen Department of Mathematical Sciences University of Durham, Durham, DH1 3LE, UK email: Pauline.Schrijner@durham.ac.uk
More information5. Convergence of sequences of random variables
5. Convergence of sequences of random variables Throughout this chapter we assume that {X, X 2,...} is a sequence of r.v. and X is a r.v., and all of them are defined on the same probability space (Ω,
More informationC: LEVEL 800 {MASTERS OF ECONOMICS( ECONOMETRICS)}
C: LEVEL 800 {MASTERS OF ECONOMICS( ECONOMETRICS)} 1. EES 800: Econometrics I Simple linear regression and correlation analysis. Specification and estimation of a regression model. Interpretation of regression
More informationIEOR 6711: Stochastic Models I Fall 2012, Professor Whitt, Tuesday, September 11 Normal Approximations and the Central Limit Theorem
IEOR 6711: Stochastic Models I Fall 2012, Professor Whitt, Tuesday, September 11 Normal Approximations and the Central Limit Theorem Time on my hands: Coin tosses. Problem Formulation: Suppose that I have
More informationSECOND PART, LECTURE 4: CONFIDENCE INTERVALS
Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 4: CONFIDENCE INTERVALS Lecture 4: Confidence Intervals
More informationTHE SELECTION OF RETURNS FOR AUDIT BY THE IRS. John P. Hiniker, Internal Revenue Service
THE SELECTION OF RETURNS FOR AUDIT BY THE IRS John P. Hiniker, Internal Revenue Service BACKGROUND The Internal Revenue Service, hereafter referred to as the IRS, is responsible for administering the Internal
More informationLeast Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN13: 9780470860809 ISBN10: 0470860804 Editors Brian S Everitt & David
More informationQuantitative Methods for Finance
Quantitative Methods for Finance Module 1: The Time Value of Money 1 Learning how to interpret interest rates as required rates of return, discount rates, or opportunity costs. 2 Learning how to explain
More informationP (A) = lim P (A) = N(A)/N,
1.1 Probability, Relative Frequency and Classical Definition. Probability is the study of random or nondeterministic experiments. Suppose an experiment can be repeated any number of times, so that we
More informationDETERMINANTS. b 2. x 2
DETERMINANTS 1 Systems of two equations in two unknowns A system of two equations in two unknowns has the form a 11 x 1 + a 12 x 2 = b 1 a 21 x 1 + a 22 x 2 = b 2 This can be written more concisely in
More informationStatistics in Retail Finance. Chapter 6: Behavioural models
Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics: Behavioural
More informationp Values and Alternative Boundaries
p Values and Alternative Boundaries for CUSUM Tests Achim Zeileis Working Paper No. 78 December 2000 December 2000 SFB Adaptive Information Systems and Modelling in Economics and Management Science Vienna
More informationProbability & Statistics Primer Gregory J. Hakim University of Washington 2 January 2009 v2.0
Probability & Statistics Primer Gregory J. Hakim University of Washington 2 January 2009 v2.0 This primer provides an overview of basic concepts and definitions in probability and statistics. We shall
More informationStatistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013
Statistics I for QBIC Text Book: Biostatistics, 10 th edition, by Daniel & Cross Contents and Objectives Chapters 1 7 Revised: August 2013 Chapter 1: Nature of Statistics (sections 1.11.6) Objectives
More informationProbability and Statistics
CHAPTER 2: RANDOM VARIABLES AND ASSOCIATED FUNCTIONS 2b  0 Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute  Systems and Modeling GIGA  Bioinformatics ULg kristel.vansteen@ulg.ac.be
More informationAssessment Schedule 2014 Biology: Demonstrate understanding of genetic variation and change (91157) Evidence Statement
NCEA Level 2 Biology (91157) 2014 page 1 of 5 Assessment Schedule 2014 Biology: Demonstrate understanding of genetic variation and change (91157) Evidence Statement NCEA Level 2 Biology (91157) 2014 page
More informationExact Nonparametric Tests for Comparing Means  A Personal Summary
Exact Nonparametric Tests for Comparing Means  A Personal Summary Karl H. Schlag European University Institute 1 December 14, 2006 1 Economics Department, European University Institute. Via della Piazzuola
More informationVI. Introduction to Logistic Regression
VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models
More informationNon Parametric Inference
Maura Department of Economics and Finance Università Tor Vergata Outline 1 2 3 Inverse distribution function Theorem: Let U be a uniform random variable on (0, 1). Let X be a continuous random variable
More informationVilnius University. Faculty of Mathematics and Informatics. Gintautas Bareikis
Vilnius University Faculty of Mathematics and Informatics Gintautas Bareikis CONTENT Chapter 1. SIMPLE AND COMPOUND INTEREST 1.1 Simple interest......................................................................
More informationAssessment Schedule 2012 Science: Demonstrate understanding of biological ideas relating to genetic variation (90948)
NCEA Level 1 Science (90948) 2012 page 1 of 5 Assessment Schedule 2012 Science: Demonstrate understanding of biological ideas relating to genetic variation (90948) Assessment Criteria ONE (a) (b) DNA contains
More informationMINITAB ASSISTANT WHITE PAPER
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. OneWay
More informationAlgebra Unpacked Content For the new Common Core standards that will be effective in all North Carolina schools in the 201213 school year.
This document is designed to help North Carolina educators teach the Common Core (Standard Course of Study). NCDPI staff are continually updating and improving these tools to better serve teachers. Algebra
More informationHypothesis Testing. Chapter Introduction
Contents 9 Hypothesis Testing 553 9.1 Introduction............................ 553 9.2 Hypothesis Test for a Mean................... 557 9.2.1 Steps in Hypothesis Testing............... 557 9.2.2 Diagrammatic
More informationCOMMUTATIVITY DEGREES OF WREATH PRODUCTS OF FINITE ABELIAN GROUPS
COMMUTATIVITY DEGREES OF WREATH PRODUCTS OF FINITE ABELIAN GROUPS IGOR V. EROVENKO AND B. SURY ABSTRACT. We compute commutativity degrees of wreath products A B of finite abelian groups A and B. When B
More information2.3 Convex Constrained Optimization Problems
42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions
More informationTopic 21: Goodness of Fit
Topic 21: December 5, 2011 A goodness of fit tests examine the case of a sequence of independent observations each of which can have 1 of k possible categories. For example, each of us has one of 4 possible
More informationA study on the biaspect procedure with location and scale parameters
통계연구(2012), 제17권 제1호, 1926 A study on the biaspect procedure with location and scale parameters (Short Title: Biaspect procedure) HyoIl Park 1) Ju Sung Kim 2) Abstract In this research we propose a
More informationINTRODUCTORY STATISTICS
INTRODUCTORY STATISTICS FIFTH EDITION Thomas H. Wonnacott University of Western Ontario Ronald J. Wonnacott University of Western Ontario WILEY JOHN WILEY & SONS New York Chichester Brisbane Toronto Singapore
More information