Chapter 9. Applications of probability. 9.1 The genetic code



Chapter 9 Applications of probability

In this chapter we use the tools of elementary probability to investigate problems of several kinds. First, we study the language of life by focusing on the universal genetic code. We ask how that code (consisting of four letters, called nucleotides or bases) is distributed over the 20 or so words (amino acids) that make up functional units, the proteins. We find that the code has interesting properties that make some amino acids more likely to occur than others (given random sequences of bases). Next, we turn attention to how the sentences made up of these elementary words are shaped by enzymes that cut DNA. We then turn to genetics, and ask how traits are inherited and passed from one generation to the next. We explore how information about family traits can help to answer questions about the likelihood of a genetic disease being transmitted. We end the chapter with other examples of biological and non-biological applications of the laws of chance.

9.1 The genetic code

All living things are made of basic building blocks called proteins. These proteins come in a rich diversity, and each of these is in turn composed of elementary units called amino acids. A protein is created from a linear chain of such units, assembled one unit at a time in a long string. As it is being produced, the chain folds back on itself to produce a three-dimensional structure, like a beaded necklace that curls and folds up on itself. The beads (representing amino acids) come in 20 varieties. All share an identical framework, with distinct side chains that give them each unique properties: some are polar, and tend to favor interactions with water, while others are hydrophobic, i.e. favor an environment that is protected from water.
Although the primary structure of the protein constrains the overall structure that can form, the side chains influence which amino acids tend to lie close to one another in the folded 3D protein, and which sections of the chain form helical or flat portions of the protein. The sequence in which these beads are strung together, i.e. the linear sequence of amino acids, determines the final structure of the protein.

All our cells are made up of a dazzling variety of proteins. Some are structural, and some are functional (enzymes, which carry out catalytic activity). Instructions for the assembly of each and every one of these proteins are encoded in the genetic material, i.e. in DNA (deoxyribonucleic acid). This instruction book is itself a linear chain of code, composed of letters that represent

v.2005.1 - January 5, 2009

the sequence of amino acids in each protein. Let us consider one such portion of DNA that encodes a single protein, i.e. one gene. DNA is composed of just four distinct units, called nucleotides. These four components, Adenine, Cytosine, Guanine, and Thymine (A, C, G, T), arranged in a long string, one after the other, are translated into the corresponding necklace of amino acids that compose a protein whenever the given gene is activated. (We shall not discuss here the details of the amazing processes that carry out this translation, since our focus will be on some aspects of the code.)

[Figure 9.1. Top: nucleotides on a DNA strand showing 4 codons, ATT GAG CAT GCA. Middle: the amino acid sequence on the protein, I E H A. Bottom: 3D protein structure composed of amino acid subunits.]

Figure 9.1: The sequence of nucleotides on a DNA chain represents a sequence of amino acids along the length of a protein. (DNA is actually a double helix, with two intertwined complementary strands, but here we show only one of these strands.) The strand illustrates four typical codons, and their corresponding amino acids (I=Isoleucine, E=Glutamic acid, H=Histidine, A=Alanine; see the Appendix for common abbreviations of the names of the amino acids). The lower panel shows the 3D structure formed by the folded protein chain.

DNA nucleotide words have to spell out 20 distinct meanings, corresponding to the 20 amino acids listed in Table 9.1. We first consider what features of this code follow from these simple facts.

1. DNA is composed of 4 nucleotides (A, T, G, C).
2. These nucleotides, arranged one after the other, code for 20 amino acids.

Facts 1 and 2 imply that we cannot make a simple correspondence of one nucleotide representing one amino acid, since 4 letters cannot distinguish 20 meanings. The code must therefore contain longer words. Assuming that all words

are the same length, we consider pairs of nucleotides. Then there are 4 × 4 = 16 possible two-nucleotide words, and this is still not enough to represent 20 amino acids. Consider all three-nucleotide words. There are 4 × 4 × 4 = 64 such words. This would more than suffice, and if we assume that there are no nonsense words (that represent no amino acid at all) then it is clear that many synonyms must exist, i.e. several distinct three-nucleotide words that all represent the same amino acid. It is now known that, indeed, the three-letter words here envisioned form the standard basic genetic code, i.e. the code shared by most organisms on our planet. From now on, we refer to the three-nucleotide word that codes for an amino acid as a codon. Table 9.1 lists the amino acids with their corresponding codons. (We also observe that there are three codons that are used to punctuate the code, i.e. signal the end of a sequence coding for one protein.)

First    Second base
base     T             C             A             G
T        TTT Phe (F)   TCT Ser (S)   TAT Tyr (Y)   TGT Cys (C)
         TTC Phe (F)   TCC Ser (S)   TAC Tyr (Y)   TGC Cys (C)
         TTA Leu (L)   TCA Ser (S)   TAA STOP      TGA STOP
         TTG Leu (L)   TCG Ser (S)   TAG STOP      TGG Trp (W)
C        CTT Leu (L)   CCT Pro (P)   CAT His (H)   CGT Arg (R)
         CTC Leu (L)   CCC Pro (P)   CAC His (H)   CGC Arg (R)
         CTA Leu (L)   CCA Pro (P)   CAA Gln (Q)   CGA Arg (R)
         CTG Leu (L)   CCG Pro (P)   CAG Gln (Q)   CGG Arg (R)
A        ATT Ile (I)   ACT Thr (T)   AAT Asn (N)   AGT Ser (S)
         ATC Ile (I)   ACC Thr (T)   AAC Asn (N)   AGC Ser (S)
         ATA Ile (I)   ACA Thr (T)   AAA Lys (K)   AGA Arg (R)
         ATG Met (M)   ACG Thr (T)   AAG Lys (K)   AGG Arg (R)
G        GTT Val (V)   GCT Ala (A)   GAT Asp (D)   GGT Gly (G)
         GTC Val (V)   GCC Ala (A)   GAC Asp (D)   GGC Gly (G)
         GTA Val (V)   GCA Ala (A)   GAA Glu (E)   GGA Gly (G)
         GTG Val (V)   GCG Ala (A)   GAG Glu (E)   GGG Gly (G)

Table 9.1: Table of genetic code

Simple arguments, phrased in terms of probabilities, can be used to study properties of this universal code of life. We shall use this biological example to motivate some very elementary calculations in probability.
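The counting argument for word length can be checked by brute force. The following is a minimal Python sketch; the names `count_words` and `min_word_length` are illustrative and do not come from the text.

```python
from itertools import product

BASES = "ACGT"  # the four nucleotides


def count_words(length):
    """Count all nucleotide words of the given length by listing them."""
    return sum(1 for _ in product(BASES, repeat=length))


def min_word_length(n_meanings):
    """Smallest word length whose word count covers n_meanings."""
    length = 1
    while count_words(length) < n_meanings:
        length += 1
    return length


for n in (1, 2, 3):
    print(n, count_words(n))
print("shortest word length for 20 amino acids:", min_word_length(20))
```

Enumerating the words rather than computing a power of 4 keeps the sketch close to the argument in the text: one-letter and two-letter words are too few, and three-letter words are the first that suffice.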
9.1.1 Simple amino acid probabilities

Let us imagine selecting one codon at random from a DNA sequence. How likely is it that the codon represents a specific amino acid, e.g. Glycine (Gly)? To answer this question, we need to know something about the probability of finding a given nucleotide in the three positions within a

codon. Some data about this are investigated in the problem set. Here we simplify the discussion by assuming that each of the four nucleotides is equally likely to occur in each of the positions of a given codon. (This is not, in fact, the case, but it serves the purpose of demonstrating essential probability concepts.)

P(T) = P(A) = P(G) = P(C)

Since only these four nucleotides are used, it must be true that

P(T) + P(A) + P(G) + P(C) = 1

and therefore, the equally likely assumption means that P(T) = P(A) = P(G) = P(C) = 0.25. This means that each of the nucleotides has a probability 1/4 of occurring in any spot. There are 4 × 4 × 4 = 64 possible combinations of three letters, as shown in Table 9.1. Out of these, four codons (GGT, GGC, GGA, GGG) code for Glycine. The probability that a codon chosen at random codes for Glycine is then P(Gly) = 4/64 = 1/16.

Which amino acid(s) would occur with highest probability? With lowest probability? Amino acids that would be represented with greatest probability include Leucine (Leu), whose codons can be any of (TTA, TTG, CTT, CTC, CTA, CTG). This means that the probability of leucine is P(Leu) = 6/64 = 3/32. From Table 9.1 we can also see that Arginine (Arg) has the same probability, with codons (CGT, CGC, CGA, CGG, AGA, AGG), as does Serine (Ser), with codons (TCT, TCC, TCA, TCG, AGT, AGC). The amino acids with fewest codons are Tryptophan (Trp), whose only codon is TGG, and Methionine (Met), whose only codon is ATG. Each of these would occur with probability 1/64.

What is the likelihood that the codon represents the end (STOP) of a protein chain? There are three STOP sequences, so these occur with probability 3/64.

Amino acids with multiple (synonymous) codons

Since there are 64 possible distinct codons and only 20 amino acids, if the code were assigned uniformly to all the amino acids, then on average there should be 64/20 = 3.2 (i.e. just over 3) different codons representing a single amino acid. (We will refer to such codons as synonymous, since they have the same interpretation.)
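The probabilities quoted above can be reproduced by counting codons in Table 9.1. The sketch below is one possible way to do this in Python, under the same assumption that all 64 codons are equally likely. The compact 64-character string is just a convenient encoding of Table 9.1 (one-letter amino acid codes, `*` for STOP); the names `CODON_TABLE` and `amino_probabilities` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

BASES = "TCAG"
# Table 9.1 read codon by codon in the order TTT, TTC, TTA, TTG, TCT, ...
# (first base slowest, third base fastest); '*' marks a STOP codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def amino_probabilities():
    """P(amino acid) when each of the 64 codons has probability 1/64."""
    counts = Counter(CODON_TABLE.values())
    return {aa: Fraction(n, len(CODON_TABLE)) for aa, n in counts.items()}


probs = amino_probabilities()
for symbol, name in [("G", "Gly"), ("L", "Leu"), ("W", "Trp"), ("*", "STOP")]:
    print(name, probs[symbol])
```

Using `Fraction` keeps the results in the same exact form as the text (for example 1/16 rather than 0.0625).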
However, the assignment of codons is far from uniform. As shown in Table 9.1, some amino acids have up to 6 synonyms (e.g. Leucine, Arginine, and Serine) whereas others have only 1 (e.g. Methionine, Tryptophan). (The STOP mark is coded by three distinct codons.) We could represent this information using a table, a bar graph, or a cumulative function, just as we did for results of an experiment. An analysis of this type is carried out in the problem set.

Mutations and codon volatility

A mutation is an event that changes one or more of the instructions encoded in a gene. The simplest conceivable mutation is a single substitution of one nucleotide by some other nucleotide somewhere along the DNA chain. This is called a single nucleotide mutation. Here we will assume that each nucleotide in any codon has an equal probability of being replaced by this type of rare chance event. We will examine how the amino acid represented by the given codon would change. First, we observe that certain nucleotide substitutions result in synonymous codons for the same amino acid. For example, the substitution of adenine for thymine that changes GGT to GGA preserves the

assignment of the amino acid glycine. The term volatility of a codon denotes the proportion of its single nucleotide mutations that lead to different amino acids. In general, since only one letter in the 3-letter code is changed, and each of the 3 positions can be replaced by any of the 3 other nucleotides, there are 3 × 3 = 9 distinct neighbors of each codon that result from a one-letter change. The volatility is computed as the fraction of such neighbors (excluding the STOP codons) that result in new assignments. The volatility of a whole gene is defined as the average volatility of its codons.

[Figure 9.2. Diagram of the codons CTG (Leu) and TTG (Leu) with their single-nucleotide neighbors. Neighbors of CTG: ATG (Met), GTG (Val), TTG (Leu), CCG (Pro), CAG (Gln), CGG (Arg), CTA (Leu), CTT (Leu), CTC (Leu). Neighbors of TTG: ATG (Met), GTG (Val), CTG (Leu), TCG (Ser), TAG (Stop), TGG (Trp), TTA (Leu), TTT (Phe), TTC (Phe).]

Figure 9.2: A substitution in one nucleotide can change the meaning of a codon. Here we show two codons that both represent the amino acid leucine, and the result of all possible single nucleotide substitutions.

In general, the higher the volatility, the more likely it is that a mutation will result in a change in the structure of a protein. Some changes of this type would be catastrophic, resulting in a broken or nonfunctional protein. Others would be beneficial, leading to changes that improve the function, change the structure, or lead to new properties. One situation in which gene volatility is desirable is the evolution of parasites that must rapidly alter their external coats to avoid being detected and eliminated by a host immune system. The human malaria parasite, Plasmodium falciparum, is one example in which high volatility is evident in numerous genes.

Examples

The amino acid Valine (Val) is coded by any of the four codons GTT, GTC, GTA, and GTG. Clearly, changing the last nucleotide in any of these codons leads to a synonymous codon. A change in either of the first two nucleotides leads to a distinct amino acid (and none leads to the termination sequence). This means that all codons for Valine have the same volatility.
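The definition of codon volatility given above can be turned into a short computation. The following Python sketch assumes the convention stated in the text, namely that STOP neighbors are left out of both the numerator and the denominator; the table encoding and the name `volatility` are illustrative choices, and the function is meant for amino acid codons only (not for STOP codons themselves).

```python
from fractions import Fraction
from itertools import product

BASES = "TCAG"
# Table 9.1 in the order TTT, TTC, TTA, TTG, TCT, ...; '*' marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def volatility(codon):
    """Fraction of non-STOP single-nucleotide neighbors coding a new amino acid."""
    own = CODON_TABLE[codon]
    changed = total = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue  # not a substitution
            neighbor = codon[:pos] + base + codon[pos + 1:]
            amino = CODON_TABLE[neighbor]
            if amino == "*":
                continue  # STOP neighbors are excluded from the count
            total += 1
            changed += amino != own
    return Fraction(changed, total)


for codon in ("GTT", "CTG", "TTG", "TGG"):
    print(codon, volatility(codon))
```

A gene's volatility, defined in the text as the average over its codons, could be obtained by averaging `volatility` over the codons of a sequence.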
To compute this volatility, we note that the three possible substitutions in nucleotide 1 and the three possible substitutions in nucleotide 2 (for a total of 6 possible changes) will be nonsynonymous. Thus the volatility of any of the Valine codons is 6/9 = 0.667.

Only one sequence, TGG, codes for Tryptophan. None of the 9 possible changes of a single nucleotide is synonymous: two of them (TAG and TGA) give STOP sequences and are excluded, and each of the remaining 7 leads to a different amino acid. Thus, tryptophan has volatility 7/7 = 1.

Not all codons for a given amino acid have the same volatility. In Figure 9.2 we show two codons for leucine. The codon CTG has 5 neighbors that are not synonyms (Met, Gln, Arg, Pro, Val), so

its volatility is 5/9 = 0.556. The codon TTG has 6 out of 8 nonsynonymous neighbors (excluding the STOP sequence TAG), so has volatility 6/8 = 0.75.

9.1.2 Cutting the strand: How restriction enzymes work

The genetic material in a cell is manipulated in many ways by enzymes that copy, splice, translate, and rearrange it. Certain biomolecules, called restriction enzymes, are responsible for carefully cutting DNA at specific markers based on its nucleotide sequence. In general, these enzymes look for some multiple-nucleotide pattern, and cut the strand of DNA at the location of that pattern. (Many of these enzymes are now used in biological experiments to manipulate DNA artificially in order to dissect it into manageable fragments for sequencing purposes.)

We here consider an example in which the probability of finding a nucleotide at a given position is used to compute the mean length of fragments produced by a simplified DNA-cutting enzyme of this type. We simplify the problem to its most basic level, to illustrate how probabilities of single nucleotides are combined to answer more involved questions.

For simplicity, assume that a hypothetical restriction enzyme cuts a DNA strand repeatedly, but only after a specific nucleotide, e.g. the base G, as shown in Figure 9.3. We ask: what is the mean length of the fragments created?

[Figure 9.3. A DNA strand A A T C C T A G T C with a cut after G; the positions before the G are labeled "P(no G here)" and the position of the G is labeled "P(G here)".]

Figure 9.3: A restriction enzyme (shown here as a pair of scissors) cuts a DNA strand. In this simple example, the cut is always after the nucleotide G.

To compute the mean length of fragments, we first need to find the probability of a given length, l. This is the product of the probabilities that no G is found in any of the first l − 1 positions, followed by a G in the l-th position.
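The fragment-length probability just described can be explored numerically. The sketch below assumes, as in Section 9.1.1, that every base occurs with probability 1/4 at each position, so that P(G) = 1/4 and P(no G) = 3/4. The function names and the cutoff of 200 bases are illustrative choices, not part of the text.

```python
from fractions import Fraction

P_G = Fraction(1, 4)  # assumed probability of G at any position


def fragment_length_probability(length, p_cut=P_G):
    """P(no G in the first length-1 positions, then G at position length)."""
    return (1 - p_cut) ** (length - 1) * p_cut


def mean_fragment_length(max_length, p_cut=P_G):
    """Mean length, truncating the infinite sum at max_length."""
    return sum(
        length * fragment_length_probability(length, p_cut)
        for length in range(1, max_length + 1)
    )


print(fragment_length_probability(3))     # (3/4)(3/4)(1/4)
print(float(mean_fragment_length(200)))   # truncated sum, very close to 4
```

Because the terms shrink geometrically, truncating the sum at a few hundred bases changes the mean by a negligible amount.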

Mean length of fragments of a simple restriction enzyme

We assume that all nucleotides appear randomly with equal probabilities. Then, as before, P(G) = P(C) = P(A) = P(T) = 1/4. A fragment has length l if the enzyme encounters l − 1 bases that are NOT G followed by base G in the l-th position. Thus, the probability of a fragment of length l is:

$$P(\text{fragment of length } l) = \underbrace{\left(\frac{3}{4}\right)\left(\frac{3}{4}\right)\cdots\left(\frac{3}{4}\right)}_{l-1\ \text{terms}} \cdot \frac{1}{4} = \left(\frac{3}{4}\right)^{l-1}\frac{1}{4}.$$

The mean length of all fragments is found by computing

$$\bar{l} = \sum_{l=0}^{\infty} l\,P(l)$$

where we have used P(l) to denote P(fragment of length l). (The l = 0 term contributes nothing to the sum.) Then, by the above,

$$\bar{l} = \sum_{l=0}^{\infty} l \left(\frac{3}{4}\right)^{l-1}\frac{1}{4} = \frac{1}{4}\sum_{l=0}^{\infty} l\,r^{l-1}$$

where we have taken out the common factor (1/4) and used the notation r = 3/4. In order to evaluate this, we need a formula for the sum of the series shown here. In the problem set, we show (using the derivative of a geometric series) that for any r such that |r| < 1,

$$\sum_{k=0}^{\infty} k\,r^{k-1} = \frac{1}{(1-r)^2}.$$

Thus

$$\bar{l} = \frac{1}{4}\sum_{l=0}^{\infty} l\,r^{l-1} = \frac{1}{4}\cdot\frac{1}{(1-r)^2} = \frac{1}{4}\cdot\frac{1}{(1-(3/4))^2} = \frac{1}{4}\cdot\frac{1}{(1/4)^2} = 4.$$

Thus, for an enzyme that cuts after every occurrence of G, the mean length of the fragments is 4 nucleotides. We could reason intuitively that this makes sense: if bases are equally likely, then G occurs in roughly 1/4 of the positions. Thus it would occur roughly once per 4 nucleotides, giving rise to a 4-nucleotide fragment on average.

9.2 Hardy-Weinberg genetics

Each of us has two entire sets of chromosomes: one set inherited from our mother, and one set from our father. These chromosomes carry genes, the units of genetic material that code for proteins and ultimately, through complicated biochemistry and molecular biology, determine all of our physical traits.

We will investigate how a single gene (with two flavors, called alleles) is passed from one generation to the next. We will consider a particularly simple situation, when the single gene determines some physical trait (such as eye color). The trait (say blue or green eyes) will be denoted the phenotype and the actual pair of genes (one on each parentally derived chromosome) will be called the genotype.

Suppose that the gene for eye color comes in two forms that will be referred to as A and a. For example, A might be an allele for blue eyes, whereas a could be an allele for brown eyes. Then each individual must have one of the following combinations: AA, Aa, or aa. (Note that aA and Aa will be considered to be equivalent here, i.e. the order of the letters in the genotype is not important.)

Suppose we know that the fraction of all genes for eye color of type A in the population is p, and the fraction of all genes for eye color of type a is q, where p + q = 1. (This means that there are only two possibilities for the gene type, of course.) Then we can interpret p and q as the probabilities that a gene selected at random from the population will turn out to be type A (respectively a), i.e., P(A) = p, P(a) = q.

Now suppose we draw at random two alleles out of the (large) population. (This is like tossing a coin twice.) We get the following cases with the given probabilities (computed with the second multiplication principle). The probability of finding the genotype aA, say, is the same as the probability of a AND A, and the probability of any ordered genotype (aA, AA, aa, or Aa) is then the product of probabilities as follows:

Genotype:      aA     AA     aa     Aa
Probability:   pq     p^2    q^2    pq

We do not distinguish between genotypes aA and Aa, since the order of the alleles does not matter. Thus the probability of having a genotype which includes both a and A (aA or Aa) is 2pq.
If the population size is N, then, on average we would expect Np^2 individuals of type AA, Nq^2 of type aa and 2Npq individuals of the mixed type. The total probability of any of the genotypes is

p^2 + 2pq + q^2 = (p + q)^2 = 1

9.2.1 Random non-assortative mating

We now examine what happens if mates are chosen (at random) and father and mother pass down one or another copy of their alleles to the progeny. We investigate how the proportion of genes of various types is arranged. In the table below, we show the possible genotypes of the mother and father, and calculate the probability that mating of such individuals would occur under the assumption that choice of mate is random, i.e. does not depend at all on eye color. We assume that the allele from the father (carried by his sperm) is independent of the allele found in the mother's egg cell. This means that we can use the multiplicative property of probability to determine the probability of a given combination of parental alleles (i.e. P(x, y) = P(x)P(y)).

For example, the probability that a couple chosen at random will consist of a woman of genotype Aa and a man of genotype AA is the product of the fraction of females that are of type Aa and the fraction of males that are of type AA. But that is just (2pq)(p^2) = 2p^3 q. Now let us examine the distribution of possible offspring of various parents. In the table, we note, for example, that in the case that the couple are both of type Aa, each parent can donate either a or A to the progeny, so we expect to see children of types aa, Aa, AA in the ratio 1:2:1.

We can now group together and summarize all the progeny of a given genotype, with the probabilities that they are produced by one or another such random mating. Using this table, we can then determine the probability of each of the three genotypes in the next generation.

                     Mother: AA          Mother: Aa                 Mother: aa
                     (p^2)               (2pq)                      (q^2)
Father: AA (p^2)     AA                  1/2 Aa, 1/2 AA             Aa
                     p^4                 2pq·p^2                    p^2 q^2
Father: Aa (2pq)     1/2 Aa, 1/2 AA      1/4 aa, 1/2 Aa, 1/4 AA     1/2 aa, 1/2 Aa
                     2pq·p^2             4p^2 q^2                   2pq·q^2
Father: aa (q^2)     Aa                  1/2 Aa, 1/2 aa             aa
                     p^2 q^2             2pq·q^2                    q^4

Table 9.2: The frequency of progeny of various types in Hardy-Weinberg genetics can be calculated as shown in this table. The genotype of the mother is shown across the top and the father's genotype is shown in the left column. The various progeny resulting from mating are the table entries, with the probability of each mating directly underneath.

Problem

Find the probability that a random (Hardy-Weinberg) mating will give rise to a progeny of type AA.

Solution 1

Using Table 9.2, we see that there are only four ways that a child of type AA can result from a mating: either both parents are AA, or one or the other parent is Aa (and the other is AA), or both parents are Aa. Thus, for children of type AA the probability is

P(child of type AA) = p^4 + (1/2)(2pq·p^2) + (1/2)(2pq·p^2) + (1/4)(4p^2 q^2)

Simplifying leads to

P(child of type AA) = p^2 (p^2 + 2pq + q^2) = p^2 (p + q)^2 = p^2

In the problem set, we also find that the probability of a child of type Aa is 2pq, and the probability of the child being type aa is q^2. We thus observe that the frequency of genotypes of the progeny is exactly the same as that of the parents. This type of genetic makeup is termed Hardy-Weinberg genetics.

Alternate solution

[Figure 9.4. Tree diagram. Top: child AA. Second level: father and mother. Each parent branches to genotype Aa (probability 2pq) or AA (probability p^2). Each genotype then passes on allele A with probability 1/2 (from Aa) or 1 (from AA). Bottom: (pq + p^2) · (pq + p^2).]

Figure 9.4: A tree diagram to aid the calculation of the probability that a child with genotype AA results from random non-assortative (Hardy-Weinberg) mating.

In Figure 9.4, we show an alternate solution to the same problem using a tree diagram. Reading from the top down, we examine all the possibilities at each branch point. A child AA cannot have any parent of genotype aa, so the genotype of both father and mother could only have been one of AA or Aa. Each arrow indicating the given case is accompanied by the probability of that event. (For example, a random individual has probability 2pq of having genotype Aa, as shown on the arrows from the father and mother to these genotypes.) Continuing down the branches, we ask with what probability the given parent would have contributed an allele of type A to the child. For a parent of type AA, this is certain, so the given branch carries probability 1. For a parent of type Aa, the probability that A is passed down to the child is only 1/2.

The combined probability is computed as follows: we determine the probability of getting an A from the father (of type AA OR Aa). This is

P(A from father) = (1/2)(2pq) + 1·p^2 = pq + p^2

and we multiply it by a similar probability of getting A from the mother (of type AA OR Aa). (We must multiply, since we need A from the father AND A from the mother for the genotype AA.) Thus

P(child of type AA) = (pq + p^2)(pq + p^2) = p^2 (q + p)^2 = p^2 · 1 = p^2.
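The bookkeeping in Table 9.2 can be checked for a particular value of p. The following Python sketch assumes random mating and independent transmission of alleles from each parent, as in the text. The names `genotype_frequencies`, `PASS_A`, and `next_generation` are illustrative, and the value p = 3/10 is an arbitrary example.

```python
from fractions import Fraction
from itertools import product


def genotype_frequencies(p):
    """Hardy-Weinberg genotype frequencies for allele frequency p of A."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


# Probability that a parent of the given genotype passes allele A to a child.
PASS_A = {"AA": Fraction(1), "Aa": Fraction(1, 2), "aa": Fraction(0)}


def next_generation(p):
    """Offspring genotype frequencies, summed over all nine matings."""
    parents = genotype_frequencies(p)
    child = {"AA": Fraction(0), "Aa": Fraction(0), "aa": Fraction(0)}
    for (mother, f_m), (father, f_f) in product(parents.items(), repeat=2):
        pair = f_m * f_f  # probability of this mating (random mate choice)
        a_m, a_f = PASS_A[mother], PASS_A[father]
        child["AA"] += pair * a_m * a_f
        child["Aa"] += pair * (a_m * (1 - a_f) + (1 - a_m) * a_f)
        child["aa"] += pair * (1 - a_m) * (1 - a_f)
    return child


p = Fraction(3, 10)  # example allele frequency, chosen arbitrarily
print(genotype_frequencies(p))
print(next_generation(p))
```

For this value of p the two printed dictionaries agree, which is the statement that the genotype frequencies of the progeny match those of the parents.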
It is of interest to investigate what happens when one of the assumptions we made is relaxed, for example, when the genotype of the individual has an impact on survival or ability to reproduce.

9.3 Random walker

In this section we discuss an application of the binomial distribution to the process of a random walk. As shown in Figure 9.5(a), we consider a straight (1-dimensional) path and an erratic walker who takes steps randomly to the left or right. We will assume that the walker never stops. With probability p, she takes a step towards the right, and with probability q she takes a step towards the left. (Since these are the only two choices, it must be true that p + q = 1.) In Figure 9.5(b) we show the walker's position, x, plotted versus the number of steps (n) she has taken. (We may as well assume that the steps occur at regular intervals of time, so that the horizontal axis of this plot can be thought of as a time axis.)

[Figure 9.5. (a) A walker on the x axis (positions −1, 0, 1 marked) stepping left with probability q and right with probability p. (b) Plot of position x versus number of steps n.]

Figure 9.5: A random walker in 1 dimension takes a step to the right with probability p and a step to the left with probability q.

The process described here is classic, and often attributed to a drunken wanderer. In our case, we could consider this motion as a 1D simplification of the random tumbles and swims of a bacterium in its turbulent environment. It is usually the case that a goal of this swim is a search for some nutrient source, or possibly avoidance of poor environmental conditions. We shall see that if the probabilities of left and right motion are unequal (i.e. the motion is biased in one direction or another) this swimmer tends to drift along towards a preferred direction.

In this problem, each step has only two outcomes (analogous to a trial in a Bernoulli experiment). We could imagine the walker tossing a coin to determine whether to move right or left. We wish to characterize the probability of the walker being at a certain position at a given time, and to find her expected position after n steps. Our familiarity with Bernoulli trials and the binomial distribution will prove useful in this context.
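The walker described above is easy to simulate, which gives a feel for the drift of a biased walk. The sketch below is a simple Monte Carlo experiment in Python; the function names, the number of walkers, and the fixed seed are illustrative assumptions, and the printed averages are estimates that fluctuate with the seed.

```python
import random


def walk(n_steps, p, rng):
    """Final position after n_steps, stepping +1 with probability p, else -1."""
    x = 0
    for _ in range(n_steps):
        x += 1 if rng.random() < p else -1
    return x


def mean_final_position(n_steps, p, n_walkers=10_000, seed=0):
    """Average final position over many independent simulated walkers."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    return sum(walk(n_steps, p, rng) for _ in range(n_walkers)) / n_walkers


print(mean_final_position(100, 0.5))  # unbiased walk: average stays near 0
print(mean_final_position(100, 0.6))  # biased walk: average drifts to the right
```

Each call to `rng.random()` plays the role of the coin toss mentioned in the text, so every step is one Bernoulli trial.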
Example

(a) What is the probability of a run of steps as follows: RLRRRLRLLL

(b) Find the probability that the walker moves k steps to the right out of a total run of n consecutive steps.

(c) Suppose that p = q = 1/2. What is the probability that a walker starting at the origin returns to the origin on her 10th step?

Solution

(a) The probability of the run RLRRRLRLLL is the product $pqpppqpqqq = p^5 q^5$. Note the similarity to the question "What is the probability of tossing HTHHHTHTTT?"

(b) This problem is identical to the problem of k heads in n tosses of a coin. The probability of such an event is given by a term in the binomial distribution:
\[ P(k \text{ out of } n \text{ moves to right}) = C(n,k)\, p^k q^{n-k}. \]

(c) The walker returns to the origin after 10 steps only if she has taken 5 steps to the left (total) and 5 steps to the right (total). The order of the steps does not matter. Thus this problem reduces to problem (b) with 5 steps out of 10 taken to the right. The probability is thus
\[ P(\text{back at 0 after 10 steps}) = P(\text{5 out of 10 steps to right}) = C(10,5)\, p^5 q^5 = C(10,5) \left(\frac{1}{2}\right)^{10} = \frac{10!}{5!\,5!} \cdot \frac{1}{1024} = \frac{252}{1024} \approx 0.24609. \]

Mean position

We now ask how to determine the expected position of the walker after n steps, i.e. how the mean value of x depends on the number of steps and the probabilities associated with each step. After 1 step, with probability p the position is x = +1 and with probability q the position is x = -1. The expected (mean) position after 1 move is thus
\[ \bar{x}_1 = p(+1) + q(-1) = p - q. \]
But the process follows a binomial distribution: the number k of steps to the right has mean np, and the position after n steps is x = k - (n - k) = 2k - n. Since q = 1 - p, the mean after n steps is
\[ \bar{x}_n = 2np - n = n(p - q). \]

9.4 Further examples and problems

This section is intended to help with practice of concepts of permutations and combinations, and to provide further examples of calculations of discrete probability.
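As a first piece of practice, the random-walk results of Section 9.3 can be checked numerically. The sketch below uses exact fractions. The function names and the biased value p = 7/10 are illustrative assumptions, not taken from the notes.

```python
from fractions import Fraction
from math import comb


def prob_right_steps(n, k, p):
    """P(k of n steps are to the right) = C(n, k) p^k q^(n-k), with q = 1 - p."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)


def mean_position(n, p):
    """Mean of x = k - (n - k) = 2k - n over the binomial distribution of k."""
    return sum((2 * k - n) * prob_right_steps(n, k, p) for k in range(n + 1))


half = Fraction(1, 2)
p_return = prob_right_steps(10, 5, half)
print(p_return, float(p_return))  # 63/256, about 0.24609

p = Fraction(7, 10)  # an arbitrary biased example, not from the text
print(mean_position(10, p), 10 * (p - (1 - p)))  # both equal n(p - q) = 4
```

Summing over the whole distribution is slower than using the closed form n(p - q), but it checks that formula independently instead of assuming it.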

The Monty Hall problem

Monty Hall was the host of a television game show called Let's Make a Deal. This example concerns a problem in probability named in his honour. The problem attracted media attention when it was featured in a column written by Marilyn vos Savant.

The game goes as follows: There are three doors, and a contestant is offered the opportunity to win a car (behind one of the doors). Behind each of the other two doors is a less attractive option (e.g. a goat). The contestant at first selects one door. Monty Hall (who knows what is behind each door) opens one of the other two doors to reveal a goat and asks: "Would you like to change your mind, or to stay with your original selection?" The question to be answered is whether, given this information, it is a better strategy to switch or to stay with the first door.

We analyze this problem with the tree diagram shown in Figure 9.6. Starting from the top of the diagram, we show each possibility with a branch, and assign a probability, assuming (as usual) that it is equally likely that the car is behind any of the three doors, and that Monty Hall always opens one of the other doors that hides a goat. For example, if your selection is labeled A, and it so happens that there is a goat behind A, then the car is equally likely to be behind door B or door C (probability 1/2 each). Monty will certainly open the other door with the goat (probability 1). In that case, switching leads to a win. We compute the probability of those events by multiplying the assigned probabilities down the length of each branch, and then adding the results of the relevant branches (as shown in the dotted lines at the base of the diagram in Figure 9.6).
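The branch-by-branch calculation just described can be sketched in Python. This is an illustrative enumeration, not the notes' own method: it branches first on the location of the car rather than on "goat in A" versus "car in A", which is an equivalent way of organizing the same tree. The function name is hypothetical, and the code assumes that when Monty has two goat doors to choose from he opens either with probability 1/2.

```python
from fractions import Fraction

DOORS = ("A", "B", "C")


def monty_hall_tree(first_pick="A"):
    """Walk every branch and return (P(win if stay), P(win if switch)).

    Assumptions (from the text): the car is equally likely to be behind each
    door, and Monty always opens a goat door other than the contestant's pick.
    Additional assumption: with two goat doors available, he opens either one
    with probability 1/2.
    """
    p_stay = Fraction(0)
    p_switch = Fraction(0)
    for car in DOORS:
        p_car = Fraction(1, 3)
        # Doors Monty may open: not the contestant's pick and not the car.
        options = [d for d in DOORS if d != first_pick and d != car]
        for opened in options:
            # Multiply the probabilities down the length of the branch.
            p_branch = p_car * Fraction(1, len(options))
            # The door the contestant would switch to is the one left over.
            switch_to = next(d for d in DOORS if d not in (first_pick, opened))
            if car == first_pick:
                p_stay += p_branch
            if car == switch_to:
                p_switch += p_branch
    return p_stay, p_switch


p_stay, p_switch = monty_hall_tree()
print(p_stay, p_switch)  # 1/3 2/3
```

Using exact fractions keeps the branch products in the same form as the hand calculation, so each term can be compared with the corresponding branch of the tree.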
If the car is actually behind the initially selected door, staying leads to a win, but as shown in the calculation in Figure 9.6, this happens with lower probability under the assumptions of the problem. Therefore, the winning strategy is to switch your selection after Monty Hall opens one of the doors.

Example 2

A class of 12 students is divided into three equal teams of 4 to work together on three history group projects. The first team will investigate the civilization of the Mayas (project 1), the second team will research the Aztecs (project 2), and the third team will work on the Incas (project 3). How many different ways are there of forming the teams? Assume that the order of individuals within a team does not matter, but the order in which teams are picked determines which team gets project 1, project 2, etc.

Solution

There will be a total of 3 teams formed by this subdivision. As noted above, the order of the teams matters, since the projects assigned are distinct. We must find out how many ways there are of choosing people from the class to fill each of these teams. For the first team, the number of ways of selecting 4 out of 12 people is given by
\[ C(12,4) = \frac{12!}{8!\,4!} = \frac{12 \cdot 11 \cdot 10 \cdot 9}{4 \cdot 3 \cdot 2} = 495. \]

[Figure 9.6: The Monty Hall problem. Tree diagram: you first pick door A. With probability 2/3 there is a goat in A (the car is then in B or C with probability 1/2 each, and Monty opens the remaining goat door with probability 1). With probability 1/3 the car is in A (Monty opens B or C with probability 1/2 each). P(win if stay) = (1/3)(1/2) + (1/3)(1/2) = 1/3. P(win if switch) = (2/3)(1/2) + (2/3)(1/2) = 2/3.]

[Remark: we use the formula for combinations here, because the arrangement of individuals within a team is not relevant: a team composed of Mary, Bob, Jack, and Jane is the same as a team composed of Bob, Jane, Jack, and Mary.]

For each one of these ways we now have many ways of choosing the remaining teams. Once the first team has formed, we have only 8 people left to choose from for the other teams. So for the second team, the number of combinations is
\[ C(8,4) = \frac{8!}{4!\,4!} = 70. \]
We now have only four people left, and they have to form the last team. The total number of ways of forming these teams is the product of the three results obtained above, i.e. $495 \cdot 70 \cdot 1 = 34650$.

Example 3

(a) George has a photograph of each of his 8 sisters, but a wallet-sized photo album with only 5 spaces for these photos. How many different arrangements of these photos could George make in the available space?

(b) What is the probability that both Mary and Jane will be included in the photo album if each photograph is selected randomly for the display? (Assume that Mary and Jane are two of the sisters.)

Solution

(a) Here the arrangement of the photos is relevant, so we must consider permutations. (This means that a display in which Mary is first and Jane second is considered distinct from a display in which Jane is first and Mary is second.) The number of ways of arranging 8 objects into 5 slots is
\[ P(8,5) = \frac{8!}{(8-5)!} = 8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 = 6720. \]

(b) The probability that Mary will be selected to fill the first slot is 1/8. The probability that she is selected for any one of the 5 slots is thus 5/8. But if Mary was selected, there would then be 7 sisters left to choose among, and 4 slots in which to place their photographs. Thus the probability that Jane is then selected for one of these other slots is 4/7. The probability that both of the above occur is
\[ \frac{5}{8} \cdot \frac{4}{7} = \frac{20}{56} \approx 0.357. \]
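The counts in Examples 2 and 3 can be checked with a few lines of Python. This is a sketch for verification only: `math.comb` and `math.perm` require Python 3.8 or later, and the six sister names other than Mary and Jane are placeholders that do not appear in the text.

```python
from fractions import Fraction
from itertools import permutations
from math import comb, perm

# Example 2: ordered choice of three teams of 4 from 12 students.
n_teams = comb(12, 4) * comb(8, 4) * comb(4, 4)
print(n_teams)  # 34650

# Example 3(a): ordered arrangements of 5 photos chosen from 8.
n_arrangements = perm(8, 5)
print(n_arrangements)  # 6720

# Example 3(b): count the arrangements that contain both Mary and Jane.
# The other six names are placeholders, not from the text.
sisters = ["Mary", "Jane", "S3", "S4", "S5", "S6", "S7", "S8"]
favourable = sum(
    1 for album in permutations(sisters, 5)
    if "Mary" in album and "Jane" in album
)
p_both = Fraction(favourable, n_arrangements)
print(p_both, float(p_both))  # 5/14, about 0.357
```

Part (b) is counted by brute force over all 6720 arrangements, not with the product (5/8)(4/7), so the two approaches check each other.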

9.5 Appendix

Abbrev (1-letter)  Abbrev (3-letter)  Amino acid
A                  Ala                Alanine
R                  Arg                Arginine
N                  Asn                Asparagine
D                  Asp                Aspartic acid
C                  Cys                Cysteine
Q                  Gln                Glutamine
E                  Glu                Glutamic acid
G                  Gly                Glycine
H                  His                Histidine
I                  Ile                Isoleucine
L                  Leu                Leucine
K                  Lys                Lysine
M                  Met                Methionine
F                  Phe                Phenylalanine
P                  Pro                Proline
S                  Ser                Serine
T                  Thr                Threonine
W                  Trp                Tryptophan
Y                  Tyr                Tyrosine
V                  Val                Valine

Table 9.3: Common abbreviations for the amino acids

9.6 For further reading

Plotkin JB, Dushoff J, Fraser HB (2004) Detecting selection using a single nucleotide sequence of M. tuberculosis and P. falciparum, Nature 428:942-945 (April 29, 2004).

Plotkin JB, Dushoff J (2003) Codon bias and frequency-dependent selection on the hemagglutinin epitopes of influenza A virus, PNAS 100:7152-7157 (June 10, 2003).

Zhang J (2005) On the evolution of codon volatility, Genetics 169:495-501 (January 2005).