Allele frequency estimation in the human ABO blood group system

Pedro J.N. Silva
Faculdade de Ciências da Universidade de Lisboa
Campo Grande, C2, 4o. piso
P-1700 LISBOA
PORTUGAL
Pedro.Silva@fc.ul.pt
2000

Table of Contents

OVERVIEW
THEORY
    Population genetics
        Genetics nomenclature
        The ABO system
        Hardy-Weinberg frequencies
    ABO allele frequency estimators
        Bernstein (1925)
        Bernstein (1930)
        Wiener (1929)
        Maximum Likelihood and the EM algorithm
    Statistics
        Maximum Likelihood
        The EM algorithm
        Log-likelihood ratio test
        Pearson's χ² test
S2 ABOESTIMATOR
    Description
    How to get the latest version
RECOMMENDED READING

Overview

We deal here with the estimation of allele frequencies of the human ABO blood group system. It is assumed that

- the ABO system is determined by three alleles of a single gene, call them A, B and O
- A and B are codominant, and both are dominant over O
- this gene is in Hardy-Weinberg frequencies in the population
- the data are a random sample from the population

You should be familiar with classical population genetics, maximum likelihood estimation and the EM algorithm, as well as statistical testing in general and goodness-of-fit tests in particular. Towards the end, there is a plug for a computer program that you may find useful for the actual calculations.

See some suggested bibliography at the end; a brief summary of relevant theory follows.

Theory

Population genetics

Genetics nomenclature

A gene is a unit of hereditary transmission (or, as some would say, a gene is whatever geneticists study...). Different forms of the same gene are known as alleles (e.g., A and a; A, B and O). Alleles may be combined in genotypes (e.g., AB, or OO), which may or may not have distinct phenotypes (e.g., white or red flowers; different blood groups), depending on dominance relationships. For example, since AA and AO have the same phenotype (blood group A), different from that of OO, we say A is dominant over O; on the other hand, AA, AB and BB all have distinct phenotypes (blood groups A, AB and B, respectively), so we say A and B are codominant.

The relative proportion of each allele in a population is called its allele frequency; similarly, the relative proportion of each genotype is its genotypic frequency and, as you can guess, the relative proportion of each phenotype is the phenotypic frequency.

As long as there is no dominance, the frequency of one allele can be estimated from the genotypic frequencies by adding the homozygote frequency and half the heterozygote frequencies (for the respective allele). For example, for two alleles, with $N_{AA}$ and $N_{Aa}$ the numbers of AA and Aa individuals among the $N$ sampled,

$$p_A = \frac{2N_{AA} + N_{Aa}}{2N} = AA + \frac{1}{2}Aa$$

However, if there is dominance we cannot distinguish (some of) the homozygotes and (some of) the heterozygotes, so this simple procedure cannot be used, and we can run into trouble.

The ABO system

ABO is a blood group system notorious for being responsible for blood transfusion accidents. It was among the first human traits proven to be Mendelian. It was often used in forensic (identification and paternity) studies, but has been superseded in this by other genetic markers. It remains clinically important, and a great system for teaching.

We assume that

- the ABO system is determined by three alleles of a single gene, call them A, B and O
- A and B are codominant, and both are dominant over O
- this gene is in Hardy-Weinberg frequencies in the population

Note that these assumptions are not necessarily true.

Why the Hardy-Weinberg assumption? For without it, estimation of allele frequencies is not possible in this case (because of dominance). Wanna try? :-) Because of its importance in the estimation procedure, this assumption should always be tested.

These assumptions, and some of their consequences, are summarized in the following table, where p, q and r are the frequencies of alleles A, B and O, respectively:

Phenotype        Genotype    Phenotypic   Genotypic   Expected
(Blood group)                frequency    frequency   frequency
-----------------------------------------------------------------
A                AA + AO     A            AA + AO     $p^2 + 2pr$
B                BB + BO     B            BB + BO     $q^2 + 2qr$
AB               AB          AB           AB          $2pq$
O                OO          O            OO          $r^2$
-----------------------------------------------------------------
Total                                                 1

It is interesting to note that the genetic basis of the ABO system was not determined by family investigations, as might be expected, but by testing the predictions of the competing genetic hypotheses (two genes with two alleles each vs. the above model) against actual population data, using the Hardy-Weinberg law.

Hardy-Weinberg frequencies

While the (complete set of) genotypic frequencies always determine the allelic frequencies, the reverse is not necessarily true, that is, we cannot always calculate the genotypic frequencies from the allelic ones. Under some assumptions, however, the genotypic frequencies eventually take a form that depends only on the allele frequencies. These assumptions are random union of gametes (with or without random mating), very large population size (in theory, infinite), absence of selection, migration, etc.

For example, for an autosomal gene with just two alleles (A and a) with respective frequencies p and q, we have three genotypes (AA, Aa and aa), whose frequencies are $p^2$, $2pq$ and $q^2$. These genotypic frequencies can be thought of as the expansion of the square of the sum of the allele frequencies:

$$(p\,A + q\,a)^2 = p^2\,AA + 2pq\,Aa + q^2\,aa$$

This result was published independently by the British mathematician G.H. Hardy and the German physician W. Weinberg in 1908.

For more than two alleles we have

$$(p_1 + p_2 + \dots + p_n)^2 = p_1^2 + 2p_1p_2 + \dots + 2p_1p_n + p_2^2 + \dots + 2p_{n-1}p_n + p_n^2$$

ABO allele frequency estimators

Bernstein (1925)

Let us agree to name the three allele frequencies of the ABO system p (of allele A), q (of B) and r (of O, you guessed it!).
The oldest estimator of the ABO allele frequencies is due to Bernstein (1925), who had determined the genetic basis of this blood group just the year before using Hardy-Weinberg frequencies. In what follows A, B, AB and O stand for the observed phenotypic frequencies of the respective blood groups. Since the expected (Hardy-Weinberg) frequency of individuals with blood group O is $r^2$, a fairly obvious estimate of r is

$$r' = \sqrt{O}$$

On the other hand, the expected combined frequency of blood groups A and O is

$$E(A + O) = (p + r)^2 = (1 - q)^2$$

and therefore q can be estimated by

$$q' = 1 - \sqrt{A + O}$$

and in a similar way we obtain

$$p' = 1 - \sqrt{B + O}$$

So, p', q' and r' are Bernstein's 1925 estimators. They (normally) do not add up to one, which of course is not altogether desirable.

Bernstein (1930)

As noted above, Bernstein's 1925 estimators do not necessarily add up to one. To solve this, we could simply divide them by their sum, but in 1930 Bernstein suggested a much better procedure. Let d be the difference between unity and the sum of Bernstein's 1925 estimates:

$$d = 1 - (p' + q' + r')$$

The new estimators are then

$$p'' = p'\left(1 + \frac{d}{2}\right)$$

$$q'' = q'\left(1 + \frac{d}{2}\right)$$

$$r'' = \left(r' + \frac{d}{2}\right)\left(1 + \frac{d}{2}\right)$$

They still don't quite add up to one, but the difference is much smaller ($d^2/4$, as you should check). In fact, they are usually quite close to the maximum likelihood estimators (especially if the Hardy-Weinberg assumption holds).

Wiener (1929)

In 1929, Wiener suggested an alternative to Bernstein's 1925 estimators. They are, perhaps, more intuitive, but seldom work any better (a tribute to Bernstein's insight).
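Before turning to Wiener's formulas, the two Bernstein procedures described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the phenotype counts are made up for the example and are not taken from any program mentioned in this text.

```python
from math import sqrt

def bernstein_1925(n_a, n_b, n_ab, n_o):
    """Bernstein's 1925 estimates (p', q', r') from phenotype counts."""
    n = n_a + n_b + n_ab + n_o
    # Observed phenotypic frequencies of blood groups A, B and O.
    # (The AB frequency enters only through the total n.)
    a, b, o = n_a / n, n_b / n, n_o / n
    p1 = 1 - sqrt(b + o)
    q1 = 1 - sqrt(a + o)
    r1 = sqrt(o)
    return p1, q1, r1

def bernstein_1930(p1, q1, r1):
    """Bernstein's 1930 correction applied to the 1925 estimates."""
    d = 1 - (p1 + q1 + r1)      # deviation of the 1925 sum from unity
    k = 1 + d / 2
    return p1 * k, q1 * k, (r1 + d / 2) * k

# Hypothetical counts for blood groups A, B, AB and O.
counts = (186, 38, 13, 284)
first = bernstein_1925(*counts)
second = bernstein_1930(*first)
d = 1 - sum(first)
print("1925:", first, "sum =", sum(first))
print("1930:", second, "sum =", sum(second))
print("1 - d^2/4 =", 1 - d * d / 4)
```

The last two printed values should agree, which is the check suggested above: the 1930 estimates miss unity by exactly $d^2/4$.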

The estimator of r is actually the same as Bernstein's 1925. Remembering that the expected (Hardy-Weinberg) frequency of individuals with blood group O is $r^2$, we get

$$r''' = \sqrt{O}$$

The expected combined frequency of blood groups A and O is (still)

$$E(A + O) = (p + r)^2$$

so p can be estimated by

$$p''' = \sqrt{A + O} - \sqrt{O}$$

and similarly

$$q''' = \sqrt{B + O} - \sqrt{O}$$

Like the other heuristic estimators, these do not normally add up to one.

Maximum Likelihood and the EM algorithm

See below, under Statistics.

Statistics

Maximum Likelihood

Suppose we want to estimate a parameter from given observations. If we have a probabilistic model for the data (such as the binomial model for the tossing of a coin), we can (in principle) calculate the probability of getting our observations for each value that the parameter can take. The maximum likelihood method consists in choosing the parameter value that maximizes the probability of the data, also known as the likelihood of the parameter.

The method can be justified using Bayes' theorem (with uniform priors, or large enough sample sizes to overcome whatever priors we have), or by its results. In fact, maximum likelihood estimators tend to be consistent and efficient, but are often biased.

Estimation is not the only application of likelihoods. For example, they can also be used for hypothesis testing.

The EM algorithm

The EM algorithm is a general method to obtain maximum likelihood (ML) estimates, starting from reasonable guesses. It is not the only method, and cannot always be used, but when applicable tends to work well in practice.

Here is a brief description of the EM algorithm applied to the ABO case.

The general idea is simple. We start from estimates of the allele frequencies, and use them to calculate the expected frequencies of all genotypes (step E of the EM algorithm), assuming them to be Hardy-Weinberg frequencies. Then, we use those fake but complete genotypic frequencies to obtain new estimates of the allele frequencies, using maximum likelihood (the step M). We then use these new allele frequency estimates in a new E step, and so forth, in an iterative fashion, until the values converge or we get tired.

Log-likelihood ratio test

This test compares the unconstrained likelihood of the data with the (smaller) likelihood obtained by imposing the (null) hypothesis under test, in our case, that the ABO gene is in Hardy-Weinberg frequencies in the population. If the hypothesis is true, the difference in likelihoods should be small and, conversely, if it is false the difference should be large.

We can use the fact that the distribution of twice the difference of the logarithms of the likelihoods (or, which amounts to the same, twice the logarithm of the ratio of the likelihoods) tends asymptotically (as the sample size increases) to the χ² distribution to perform an actual statistical test, i.e., to help us decide whether the difference is large enough to reject the null.

The number of degrees of freedom depends on whether the hypothesis is extrinsic (fully specified in the absence of the sample) or intrinsic (depends on parameters that have to be estimated from the sample). To be concrete, let us think about goodness-of-fit tests. For extrinsic hypotheses, the number of degrees of freedom is simply the number of classes minus one; for intrinsic hypotheses it is usually determined as the number of classes minus one minus the number of independent parameters estimated from the sample.

Let the data be categorized in k classes, each with $O_i$ observations, and let the corresponding expected (derived from the hypothesis) numbers be $E_i$. In practice, the goodness-of-fit test statistic can be computed as

$$G = 2 \sum_{i=1}^{k} O_i \ln \frac{O_i}{E_i}$$

Pearson's χ² test

This procedure tests the goodness-of-fit of a given hypothesis to the data. Let the data be categorized in k classes, each with $O_i$ observations, and let the corresponding expected (derived from the hypothesis) numbers be $E_i$. The test statistic is then

$$X^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

and its asymptotic (as the sample size increases) distribution is the χ². The number of degrees of freedom is calculated as for the log-likelihood ratio test above.
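Both goodness-of-fit statistics are easy to compute by hand or in code. The following Python sketch is a hedged illustration, not a description of any particular program: the function names are invented here, and the p-value helper assumes one degree of freedom, which is what the rule above gives for the ABO case when the allele frequencies are estimated from the sample (four phenotype classes, minus one, minus two independent allele frequencies).

```python
from math import erfc, log, sqrt

def expected_counts(p, q, r, n):
    """Hardy-Weinberg expected counts for blood groups A, B, AB and O."""
    return [n * (p * p + 2 * p * r),
            n * (q * q + 2 * q * r),
            n * 2 * p * q,
            n * r * r]

def g_statistic(observed, expected):
    """Log-likelihood ratio statistic G; empty classes contribute zero."""
    return 2 * sum(obs * log(obs / exp)
                   for obs, exp in zip(observed, expected) if obs > 0)

def pearson_x2(observed, expected):
    """Pearson's X^2 statistic."""
    return sum((obs - exp) ** 2 / exp
               for obs, exp in zip(observed, expected))

def p_value_1df(statistic):
    """Upper tail of the chi-square distribution with ONE degree of freedom.

    Assumption: df = 1, as in the ABO case with estimated allele
    frequencies. Other df values would need a general chi-square routine.
    """
    return erfc(sqrt(max(statistic, 0.0) / 2))

# Hypothetical counts (A, B, AB, O) and hypothetical allele frequencies.
observed = [186, 38, 13, 284]
expected = expected_counts(0.21, 0.05, 0.74, sum(observed))
for name, stat in (("G", g_statistic(observed, expected)),
                   ("X2", pearson_x2(observed, expected))):
    print(name, "=", stat, " P =", p_value_1df(stat))
```

In a real analysis the allele frequencies passed to `expected_counts` would be the maximum likelihood estimates rather than the round numbers used in this example.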

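The EM iteration described in the Statistics section can also be sketched in Python. This is a minimal illustration of gene counting under the Hardy-Weinberg assumption; the function name, the default starting values, the tolerance and the iteration cap are all assumptions made for this example.

```python
def em_abo(n_a, n_b, n_ab, n_o, start=(1/3, 1/3, 1/3),
           tol=1e-10, max_iter=1000):
    """EM (gene counting) estimates of p, q, r from phenotype counts.

    Returns (p, q, r, iterations). Starting values, tolerance and the
    iteration cap are illustrative choices.
    """
    n = n_a + n_b + n_ab + n_o
    p, q, r = start
    for iteration in range(1, max_iter + 1):
        # E step: split the A and B phenotypes into their hidden
        # genotypes in Hardy-Weinberg proportions, e.g. AA : AO = p^2 : 2pr.
        n_aa = n_a * p / (p + 2 * r)
        n_ao = n_a - n_aa
        n_bb = n_b * q / (q + 2 * r)
        n_bo = n_b - n_bb
        # M step: count alleles in the completed genotype table.
        p_new = (2 * n_aa + n_ao + n_ab) / (2 * n)
        q_new = (2 * n_bb + n_bo + n_ab) / (2 * n)
        r_new = (n_ao + n_bo + 2 * n_o) / (2 * n)
        change = max(abs(p_new - p), abs(q_new - q), abs(r_new - r))
        p, q, r = p_new, q_new, r_new
        if change < tol:
            break
    return p, q, r, iteration

# Counts that fit Hardy-Weinberg exactly for p = 0.3, q = 0.1, r = 0.6.
p, q, r, steps = em_abo(45, 13, 6, 36)
print(p, q, r, "after", steps, "iterations")
```

Because the M step simply counts alleles, the three estimates add up to one at every iteration, unlike the heuristic estimators. Any of the heuristic estimates could be passed as `start` in place of the equal frequencies used by default.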
S2 ABOestimator

Description

S2 ABOestimator is a program to estimate the allele frequencies of the ABO blood group system, and perform a couple of statistical tests on the data. It requires MS Windows (95+) or a good emulator.

It is very simple to use, and is meant to be used in teaching, particularly

1. to compare simple heuristic estimates of the allele frequencies
2. to show the EM algorithm in action, to obtain maximum likelihood (ML) estimates of the allele frequencies
3. to perform goodness-of-fit tests of the Hardy-Weinberg assumption

Some example data sets are provided to get you started. You can also use your own data (or make them up, and experiment). A rather extensive help file is provided (from which most of this text is excerpted).

Comments (pan or praise) welcome!

N.B. This program is NOT meant to be used in clinical applications, or any life-threatening situations. It is offered "as is", with no warranties whatsoever.

The program is Copyright Pedro J.N. Silva, 2000.

How to get the latest version

S2 ABOestimator is postcardware, meaning that if you like it, you should send me a nice postcard from your home town, but you should never have to pay for it, or ask anyone to pay you for it. Even if you are so mean as to actually use the child of my labors and not send me a postcard, please send me an email, so I know who is using the program and what for, and can tell you about bugs and new versions.

You can always get the latest version of S2 ABOestimator from its home page, currently

http://alf1.cii.fc.ul.pt/~pedro/soft/abestimator/

Have much fun!

Pedro J.N. Silva
Pedro.Silva@fc.ul.pt

Recommended reading

Hartl, D.L. and Clark, A.G. 1989. Principles of Population Genetics, 2nd ed. Sinauer. (Hardy-Weinberg law, EM algorithm)

Li, C.C. 1978. First course in Population Genetics. Boxwood. (Genetics of the ABO system, Hardy-Weinberg law, ML estimation, goodness-of-fit tests)

Sokal, R.R. and Rohlf, F.J. 1995. Biometry, 3rd ed. Freeman. (ML estimation, goodness-of-fit tests)

Vogel, F. and Motulsky, A.G. 1986. Human Genetics, 2nd ed. Springer-Verlag. (Genetics of the ABO system, Hardy-Weinberg law, ML estimation, goodness-of-fit tests)

Weir, B. 1998. Genetic data analysis, 2nd ed. Sinauer. (ML estimation, EM algorithm, goodness-of-fit tests)