Allele frequency estimation in the human ABO blood group system




Pedro J.N. Silva
Faculdade de Ciencias da Universidade de Lisboa
Campo Grande, C2, 4o. piso
P-1700 LISBOA
PORTUGAL
Pedro.Silva@fc.ul.pt
2002

Table of Contents

Overview
Theory
    Population genetics
        Genetics nomenclature
        The ABO system
        Hardy-Weinberg frequencies
        ABO allele frequency estimators
            Bernstein (1925)
            Bernstein (1930)
            Wiener (1929)
            Maximum Likelihood and the EM algorithm
    Statistics
        Maximum Likelihood
        The EM algorithm
        Log-likelihood ratio test
        Pearson's χ² test
S2 ABOestimator
    Description
    How to get the latest version
Recommended reading

Overview

We deal here with the estimation of allele frequencies of the human ABO blood group system. It is assumed that

- the ABO system is determined by three alleles of a single gene, call them A, B and O
- A and B are codominant, and both are dominant over O
- this gene is in Hardy-Weinberg frequencies in the population
- the data are a random sample from the population

You should be familiar with classical population genetics, maximum likelihood estimation and the EM algorithm, as well as statistical testing in general and goodness-of-fit tests in particular. Towards the end, there is a plug for a computer program that you may find useful for the actual calculations. See some suggested bibliography at the end; a brief summary of relevant theory follows.

Theory

Population genetics

Genetics nomenclature

A gene is a unit of hereditary transmission (or, as some would say, a gene is whatever geneticists study...). Different forms of the same gene are known as alleles (e.g., A and a; A, B and O). Alleles may be combined in genotypes (e.g., AB, or OO), which may or may not have distinct phenotypes (e.g., white or red flowers; different blood groups), depending on dominance relationships. For example, since AA and AO have the same phenotype (blood group A), different from that of OO, we say A is dominant over O; on the other hand, AA, AB and BB all have distinct phenotypes (blood groups A, AB and B, respectively), so we say A and B are codominant.

The relative proportion of each allele in a population is called its allele frequency; similarly, the relative proportion of each genotype is its genotypic frequency and, as you can guess, the relative proportion of each phenotype is the phenotypic frequency.

As long as there is no dominance, the frequency of one allele can be estimated from the genotypic frequencies by adding the homozygote frequencies and half the heterozygote frequencies (for the respective allele). For example, for two alleles,

$p_A = \frac{N_{AA} + \frac{1}{2} N_{Aa}}{N} = AA + \frac{1}{2} Aa$

where $N_{AA}$ and $N_{Aa}$ are the numbers of AA and Aa individuals among the N sampled, and AA and Aa on the right-hand side stand for the corresponding genotypic frequencies. However, if there is dominance we cannot distinguish (some of) the homozygotes from (some of) the heterozygotes, so this simple procedure cannot be used, and we can run into trouble.

The ABO system

ABO is a blood group system notorious for being responsible for blood transfusion accidents. It was among the first human traits proven to be Mendelian. It was often used in forensic (identification and paternity) studies, but has been superseded in this role by other genetic markers. It remains clinically important, and is a great system for teaching.

We assume that

- the ABO system is determined by three alleles of a single gene, call them A, B and O
- A and B are codominant, and both are dominant over O
- this gene is in Hardy-Weinberg frequencies in the population

Note that these assumptions are not necessarily true. Why the Hardy-Weinberg assumption? Because without it, estimation of allele frequencies is not possible in this case (because of dominance). Wanna try? :-) Because of its importance in the estimation proceedings, this assumption should always be tested.

These assumptions, and some of their consequences, are summarized in the following table, where p, q and r are the frequencies of alleles A, B and O, respectively:

Phenotype        Genotype    Phenotypic    Genotypic    Expected
(Blood group)                frequency     frequency    frequency
------------------------------------------------------------------
A                AA + AO     A             AA + AO      p^2 + 2pr
B                BB + BO     B             BB + BO      q^2 + 2qr
AB               AB          AB            AB           2pq
O                OO          O             OO           r^2
------------------------------------------------------------------
Total                                                   1

It is interesting to note that the genetic basis of the ABO system was not determined by family investigations, as might be expected, but by testing the predictions of the competing genetic hypotheses (two genes with two alleles each vs. the above model) against actual population data, using the Hardy-Weinberg law.

Hardy-Weinberg frequencies

While the (complete set of) genotypic frequencies always determine the allelic frequencies, the reverse is not necessarily true, that is, we cannot always calculate the genotypic frequencies from the allelic. Given some assumptions (random union of gametes, with or without random mating; very large, in theory infinite, population size; absence of selection, migration, etc.), however, the genotypic frequencies eventually take a form that depends only on the allele frequencies. For example, for an autosomal gene with just two alleles (A and a) with respective frequencies p and q, we have three genotypes (AA, Aa and aa), whose frequencies are $p^2$, $2pq$ and $q^2$. These genotypic frequencies can be thought of as the expansion of the square of the sum of the allele frequencies (the subscripts label the alleles):

$(p_A + q_a)^2 = p_A^2 + 2 p_A q_a + q_a^2$

This result was published independently by the British mathematician G.H. Hardy and the German physician W. Weinberg in 1908. For more than two alleles we have

$(p_1 + p_2 + \dots + p_n)^2 = p_1^2 + 2 p_1 p_2 + \dots + 2 p_1 p_n + p_2^2 + \dots + 2 p_{n-1} p_n + p_n^2$

ABO allele frequency estimators

Bernstein (1925)

Let us agree to name the three allele frequencies of the ABO system p (of allele A), q (of B) and r (of O, you guessed it!).

The oldest estimator of the ABO allele frequencies is due to Bernstein (1925), who had determined the genetic basis of this blood group just the year before using Hardy-Weinberg frequencies. Since the expected (Hardy-Weinberg) frequency of individuals with blood group O is $r^2$, a fairly obvious estimate of r is

$r' = \sqrt{O}$

(A, B and O here denote the observed phenotypic frequencies of the respective blood groups.) On the other hand, the expected combined frequency of blood groups A and O is

$E(A + O) = p^2 + 2pr + r^2 = (p + r)^2 = (1 - q)^2$

and therefore q can be estimated by

$q' = 1 - \sqrt{A + O}$

and in a similar way we obtain

$p' = 1 - \sqrt{B + O}$

So, p', q' and r' are Bernstein's 1925 estimators. They (normally) do not add up to one, which of course is not altogether desirable.

Bernstein (1930)

As noted above, Bernstein's 1925 estimators do not necessarily add up to one. To solve this, we could simply divide them by their sum, but in 1930 Bernstein suggested a much better procedure. Let d be the difference between unity and the sum of Bernstein's 1925 estimates:

$d = 1 - (p' + q' + r')$

The new estimators are then

$p'' = p' (1 + d/2)$
$q'' = q' (1 + d/2)$
$r'' = (r' + d/2)(1 + d/2)$

They still don't quite add up to one, but the difference is much smaller ($d^2/4$, as you should check). In fact, they are usually quite close to the maximum likelihood estimators (especially if the Hardy-Weinberg assumption holds).

Wiener (1929)

In 1929, Wiener suggested an alternative to Bernstein's 1925 estimators. They are, perhaps, more intuitive, but seldom work any better (a tribute to Bernstein's insight).
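The two Bernstein estimators above can be checked numerically. The following is a minimal sketch in Python, not part of the original text or of the program described later; the function names and the phenotype counts are hypothetical, chosen only for illustration. It also confirms the claim that the 1930 estimates miss unity by exactly $d^2/4$.

```python
import math

def bernstein_1925(n_a, n_b, n_ab, n_o):
    """Return Bernstein's 1925 estimates (p', q', r') from phenotype counts."""
    n = n_a + n_b + n_ab + n_o
    a, b, o = n_a / n, n_b / n, n_o / n  # observed phenotypic frequencies
    p1 = 1 - math.sqrt(b + o)
    q1 = 1 - math.sqrt(a + o)
    r1 = math.sqrt(o)
    return p1, q1, r1

def bernstein_1930(p1, q1, r1):
    """Return the corrected estimates (p'', q'', r'') and the deviation d."""
    d = 1 - (p1 + q1 + r1)
    factor = 1 + d / 2
    return p1 * factor, q1 * factor, (r1 + d / 2) * factor, d

# Hypothetical phenotype counts (A, B, AB, O), for illustration only.
p1, q1, r1 = bernstein_1925(186, 38, 13, 284)
p2, q2, r2, d = bernstein_1930(p1, q1, r1)
print("1925:", p1, q1, r1, "sum =", p1 + q1 + r1)
print("1930:", p2, q2, r2, "sum =", p2 + q2 + r2)
print("1 - sum and d^2/4:", 1 - (p2 + q2 + r2), d * d / 4)
```

The last line follows from the algebra: the sum of the 1930 estimates is $(1 + d/2)(p' + q' + r' + d/2) = (1 + d/2)(1 - d/2) = 1 - d^2/4$.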

The estimator of r is actually the same as Bernstein's 1925. Remembering that the expected (Hardy-Weinberg) frequency of individuals with blood group O is $r^2$, we get

$r''' = \sqrt{O}$

The expected combined frequency of blood groups A and O is (still)

$E(A + O) = (p + r)^2$

so p can be estimated by

$p''' = \sqrt{A + O} - \sqrt{O}$

and similarly

$q''' = \sqrt{B + O} - \sqrt{O}$

Like the other heuristic estimators, these do not normally add up to one.

Maximum Likelihood and the EM algorithm

See below, under Statistics.

Statistics

Maximum Likelihood

Suppose we want to estimate a parameter from given observations. If we have a probabilistic model for the data (such as the binomial model for the tossing of a coin), we can (in principle) calculate the probability of getting our observations for each value that the parameter can take. The maximum likelihood method consists in choosing the parameter value that maximizes the probability of the data, also known as the likelihood of the parameter.

The method can be justified using Bayes' theorem (with uniform priors, or large enough sample sizes to overcome whatever priors we have), or by its results. In fact, maximum likelihood estimators tend to be consistent and efficient, but are often biased.

Estimation is not the only application of likelihoods. For example, they can also be used for hypothesis testing.

The EM algorithm

The EM algorithm is a general method to obtain maximum likelihood (ML) estimates, starting from reasonable guesses. It is not the only method, and cannot always be used, but when applicable tends to work well in practice.

Here is a brief description of the EM algorithm applied to the ABO case.

The general idea is simple. We start from estimates of the allele frequencies, and use them to calculate the expected frequencies of all genotypes (step E of the EM algorithm), assuming them to be Hardy-Weinberg frequencies. Then, we use those fake but complete genotypic frequencies to obtain new estimates of the allele frequencies, using maximum likelihood (the step M). We then use these new allele frequency estimates in a new E step, and so forth, in an iterative fashion, until the values converge or we get tired.

Log-likelihood ratio test

This test compares the unconstrained likelihood of the data with the (smaller) likelihood imposing the (null) hypothesis under test, in our case, that the ABO gene is in Hardy-Weinberg frequencies in the population. If the hypothesis is true, the difference in likelihoods should be small and, conversely, if it is false the difference should be large. We can use the fact that the distribution of twice the difference of the logarithms of the likelihoods (or, which amounts to the same, twice the logarithm of the ratio of the likelihoods) tends asymptotically (as the sample size increases) to the χ² distribution to perform an actual statistical test, i.e., to help us decide whether the difference is large enough to reject the null.

The number of degrees of freedom depends on whether the hypothesis is extrinsic (fully specified in the absence of the sample) or intrinsic (depends on parameters that have to be estimated from the sample). To be concrete, let us think about goodness-of-fit tests. For extrinsic hypotheses, the number of degrees of freedom is simply the number of classes minus one; for intrinsic hypotheses it is usually determined as the number of classes minus one minus the number of independent parameters estimated from the sample.

Let the data be categorized in k classes, each with $O_i$ observations, and let the corresponding expected (derived from the hypothesis) numbers be $E_i$. In practice, the goodness-of-fit test statistic can be computed as

$G = 2 \sum_{i=1}^{k} O_i \ln \frac{O_i}{E_i}$

Pearson's χ² test

This procedure tests the goodness-of-fit of a given hypothesis to the data. Let the data be categorized in k classes, each with $O_i$ observations, and let the corresponding expected (derived from the hypothesis) numbers be $E_i$. The test statistic is then

$X^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$

and its asymptotic (as the sample size increases) distribution is the χ². The number of degrees of freedom is calculated as for the log-likelihood ratio test above.
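The EM iteration and the two goodness-of-fit statistics described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code of the program described below: the function names, starting values, tolerance and phenotype counts are all hypothetical. The M step is the simple allele-counting formula applied to the completed genotype counts, and the p-value uses the fact that, with four phenotype classes and two independent parameters (p and q) estimated from the sample, there is 4 - 1 - 2 = 1 degree of freedom.

```python
import math

def em_abo(n_a, n_b, n_ab, n_o, p=1/3, q=1/3, tol=1e-10, max_iter=1000):
    """EM (gene-counting) estimates of (p, q, r) from ABO phenotype counts."""
    n = n_a + n_b + n_ab + n_o
    for _ in range(max_iter):
        r = 1 - p - q
        # E step: split groups A and B into expected genotype counts,
        # assuming Hardy-Weinberg proportions (AA : AO = p^2 : 2pr).
        n_aa = n_a * p / (p + 2 * r)
        n_ao = n_a - n_aa
        n_bb = n_b * q / (q + 2 * r)
        n_bo = n_b - n_bb
        # M step: count alleles in the completed genotype data.
        p_new = (2 * n_aa + n_ao + n_ab) / (2 * n)
        q_new = (2 * n_bb + n_bo + n_ab) / (2 * n)
        converged = abs(p_new - p) + abs(q_new - q) < tol
        p, q = p_new, q_new
        if converged:
            break
    return p, q, 1 - p - q

def goodness_of_fit(observed, expected):
    """Return (G, X^2) for observed and expected class counts."""
    g = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return g, x2

def chi2_pvalue_1df(x):
    """Upper tail probability of the chi-squared distribution with 1 df."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2))

# Hypothetical phenotype counts (A, B, AB, O), for illustration only.
observed = [186, 38, 13, 284]
n = sum(observed)
p, q, r = em_abo(*observed)
expected = [n * (p * p + 2 * p * r), n * (q * q + 2 * q * r),
            n * 2 * p * q, n * r * r]
g, x2 = goodness_of_fit(observed, expected)
print("ML estimates:", p, q, r)
print("G =", g, "p-value =", chi2_pvalue_1df(g))
print("X2 =", x2, "p-value =", chi2_pvalue_1df(x2))
```

Starting from equal allele frequencies is only one reasonable guess; Bernstein's estimates would also serve as starting values.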

S2 ABOestimator

Description

S2 ABOestimator is a program to estimate the allele frequencies of the ABO blood group system, and perform a couple of statistical tests on the data. It requires MS Windows (95+) or a good emulator. It is very simple to use, and is meant to be used in teaching, particularly

1. to compare simple heuristic estimates of the allele frequencies
2. to show the EM algorithm in action, to obtain maximum likelihood (ML) estimates of the allele frequencies
3. to perform goodness-of-fit tests of the Hardy-Weinberg assumption

Some example data sets are provided to get you started. You can also use your own data (or make them up, and experiment). A rather extensive help file is provided (from which most of this text is excerpted). Comments (pan or praise) welcome!

N.B. This program is NOT meant to be used in clinical applications, or any life-threatening situations. It is offered "as is", with no warranties whatsoever.

The program is Copyright Pedro J.N. Silva, 2000.

How to get the latest version

S2 ABOestimator is postcardware, meaning that if you like it, you should send me a nice postcard from your home town, but you should never have to pay for it, or ask anyone to pay you for it. Even if you are so mean as to actually use the child of my labors and not send me a postcard, please send me an email, so I know who is using the program and what for, and can tell you about bugs and new versions.

You can always get the latest version of S2 ABOestimator from its home page, currently http://alf1.cii.fc.ul.pt/~pedro/soft/abestimator/

Have much fun!

Pedro J.N. Silva
Faculdade de Ciencias da Universidade de Lisboa
Campo Grande, C2, 4o. piso
P-1700 LISBOA
PORTUGAL
Pedro.Silva@fc.ul.pt

Recommended reading

Hartl, D.L. and Clark, A.G. 1989. Principles of Population Genetics, 2nd ed. Sinauer. (Hardy-Weinberg law, EM algorithm)

Li, C.C. 1978. First course in Population Genetics. Boxwood. (Genetics of the ABO system, Hardy-Weinberg law, ML estimation, goodness-of-fit tests)

Sokal, R.R. and Rohlf, F.J. 1995. Biometry, 3rd ed. Freeman. (ML estimation, goodness-of-fit tests)

Vogel, F. and Motulsky, A.G. 1986. Human Genetics, 2nd ed. Springer-Verlag. (Genetics of the ABO system, Hardy-Weinberg law, ML estimation, goodness-of-fit tests)

Weir, B. 1998. Genetic data analysis, 2nd ed. Sinauer. (ML estimation, EM algorithm, goodness-of-fit tests)