# Hidden Markov Models with Applications to DNA Sequence Analysis. Christopher Nemeth, STOR-i


May 4, 2011

## Contents

1 Introduction
2 Hidden Markov Models
  2.1 Introduction
  2.2 Determining the observation sequence
    2.2.1 Brute Force Approach
    2.2.2 Forward-Backward Algorithm
  2.3 Determining the state sequence
    2.3.1 Viterbi Algorithm
  2.4 Parameter Estimation
    2.4.1 Baum-Welch Algorithm
3 HMM Applied to DNA Sequence Analysis
  3.1 Introduction to DNA
  3.2 CpG Islands
  3.3 Modelling a DNA sequence with a known number of hidden states
  3.4 DNA sequence analysis with an unknown number of hidden states
  3.5 Evaluation
4 Conclusions

## Abstract

The Hidden Markov Model (HMM) is a model with a finite number of states, each associated with a probability distribution. The transitions between states cannot be directly measured (they are hidden), but in a particular state an observation can be generated, and it is the observations, not the states themselves, which are visible to an outside observer. By applying a series of statistical techniques it is nonetheless possible to gain insight into the hidden states via the observations they generate. In the case of DNA analysis, we observe a strand of DNA which we believe can be segmented into homogeneous regions, each identifying a specific function of the DNA strand. Through the use of HMMs we can determine which parts of the strand belong to which segments by matching segments to hidden states. HMMs have been applied to a wide range of applications including speech recognition, signal processing and econometrics, to name a few. Here we discuss the theory behind HMMs and their applications in DNA sequence analysis. We shall be specifically interested in three questions: how can we determine from which state our observations are being generated? How do we determine the parameters of our model? And how do we determine the sequence of hidden states given our observations? While there are many applications of HMMs, we shall only be concerned with their use in DNA sequence analysis and shall cover examples in the literature where HMMs have been successfully used. In this area we compare two approaches to using HMMs for DNA segmentation: firstly, the approach taken when the number of hidden states is known, and secondly, how it is possible to segment a DNA sequence when the number of hidden states is unknown.

## Chapter 1: Introduction

Since the discovery of DNA by Crick and Watson in 1953, scientists have endeavoured to better understand the basic building blocks of life. By identifying patterns in the DNA structure it is possible not only to categorise different species, but, on a more detailed level, to discover more subtle characteristics such as gender, eye colour, predisposition to disease, etc. Scientists can gain a better understanding of DNA by segmenting long DNA strands into smaller homogeneous regions which differ in composition from the rest of the sequence. Identifying these homogeneous regions may prove useful to scientists who wish to understand the functional importance of the DNA sequence. There are various statistical techniques available to assist in the segmentation effort, which are covered in Braun and Muller (1998). However, here we shall only focus on the use of hidden Markov models (HMMs) as an approach to DNA segmentation.

Hidden Markov models were first discussed by Baum and Petrie (1966) and have since been applied in a variety of fields such as speech recognition, hand-writing identification, signal processing, bioinformatics, climatology and econometrics (see Cappe (2001) for a detailed list of applications). HMMs offer a way to model the latent structure of temporally dependent data, where we assume that the observed process evolves independently given an unobserved Markov chain. The Markov chain has a discrete, finite number of states which switch between one another with small transition probabilities. Given that these states are unobserved and random in occurrence they form a hidden Markov chain. It is possible to model the sequence of state changes that occur in the hidden Markov chain via observations which are dependent on the hidden states.

Since the 1980s and early 1990s HMMs have been applied to DNA sequence analysis, beginning with the seminal paper by Churchill (1989) that first applied HMMs to DNA segmentation. Since then hidden Markov models have been widely used in the field of DNA sequence analysis, with many papers evolving and updating the original idea laid out by Churchill. Aside from the papers written in this area, the practical use of these techniques has found its way into gene-finding software such as FGENESH+, GENSCAN and SLAM, which can be used to predict the location of genes in a genetic sequence. In this report we shall firstly outline the theory behind HMMs, covering parameter estimation, identification of hidden states and determining the sequence of hidden states. We shall then develop this theory through several motivating examples in the field of DNA sequence analysis; specifically, the theory behind HMMs is applied to practical examples of modelling the hidden states both when the number of hidden states is known and when it is unknown. We shall conclude the report by discussing extensions which can be made to the HMM.

## Chapter 2: Hidden Markov Models

### 2.1 Introduction

A Markov chain represents a sequence of random variables $q_1, q_2, \dots, q_T$ where the future state $q_{t+1}$ is dependent only on the current state $q_t$:

$$P(q_{t+1} \mid q_1, q_2, \dots, q_t) = P(q_{t+1} \mid q_t) \tag{2.1}$$

There are a finite number of states which the chain can be in at time $t$, which we define as $S_1, S_2, \dots, S_N$, where at any time $t$, $q_t = S_i$, $1 \le i \le N$. In the case of a hidden Markov model it is not possible to directly observe the state $q_t$ at any time $t$. Instead we observe an extra stochastic process $Y_t$ which is dependent on our unobservable state $q_t$. We refer to these unobservable states as hidden states, where all inference about the hidden states is made through the observations $Y_t$, as shown in Figure 2.1.

Figure 2.1: Hidden Markov model with observations $Y_t$ and hidden states $q_t$.
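To make the generative structure concrete, the following short simulation sketch (our own illustration, not from the report) draws a hidden state sequence from the Markov chain and an observation from each visited state:

```python
import random

def simulate_hmm(T, pi, A, B):
    """Generate hidden states q_1..q_T and observations Y_1..Y_T.

    pi: initial state distribution, A: state transition matrix,
    B:  emission matrix, B[i][k] = P(Y_t = k | q_t = S_i).
    """
    states, obs = [], []
    q = random.choices(range(len(pi)), weights=pi)[0]  # draw q_1 from pi
    for _ in range(T):
        states.append(q)
        # emit an observation from the current (hidden) state
        obs.append(random.choices(range(len(B[q])), weights=B[q])[0])
        # move to the next hidden state according to row q of A
        q = random.choices(range(len(A[q])), weights=A[q])[0]
    return states, obs
```

An outside observer would see only `obs`; the list `states` is exactly the hidden sequence the rest of this chapter tries to recover.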

**Example.** Imagine we have two coins, one of which is fair and the other biased. If we choose one of the coins at random and toss it, how can we determine whether we are tossing the fair coin or the biased coin, based on the outcome of the coin tosses? One way of modelling this problem is through a hidden Markov model where we treat the coins as the hidden states $q_t$ (i.e. fair, biased) and the tosses of the coin as the observations $Y_t$ (i.e. heads, tails). As we can see from Figure 2.1, the observation at time $t$ is dependent on the choice of coin, which can be either fair or biased. The coin used at time $t$ is in turn dependent on the coin used at time $t-1$, according to the transition probability between the two states.

**Formal Definition.** Throughout this report we shall use the notation of Rabiner (1989). Hidden Markov models extend simpler Markov models in that the states $S_i$ of the HMM cannot be observed directly. In order to determine the state of the HMM we must make inference from the observation of some random process $Y_t$ which is dependent on the state at time $t$.

The HMM is characterised by a discrete set of $N$ states, $S = \{S_1, S_2, \dots, S_N\}$, where the state at time $t$ is denoted $q_t$ (i.e. $q_t = S_i$). Generally speaking the states are connected in such a way that it is possible to move from any state to any other state (e.g. in the case of an ergodic model). The movement between states is defined through a matrix of state transition probabilities $A = \{a_{ij}\}$, where

$$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N \tag{2.2}$$

For the special case where any state can reach any other state in a single step, $a_{ij} > 0$ for all $i, j$. The observations $Y_t$ can take $M$ distinct values (i.e. symbols per state). The observation symbols are the observed output of the system and are sometimes referred to as the discrete alphabet.

We define the probability of a given symbol being observed from state $j$ at time $t$ through a probability distribution $B = \{b_j(k)\}$, where

$$b_j(k) = P(Y_t = k \mid q_t = S_j), \quad 1 \le j \le N, \; 1 \le k \le M \tag{2.3}$$

which is sometimes referred to as the emission probability, as this is the probability that state $j$ generates the observation $k$. Finally, in order to model the beginning of the process we introduce an initial state distribution $\pi = \{\pi_i\}$, where

$$\pi_i = P(q_1 = S_i), \quad 1 \le i \le N \tag{2.4}$$

Now that we have defined our observations and our states we can model the system. In order to do this we require the model parameters $N$ (number of states) and $M$ (number of distinct observations), the observation symbols, and the three probability measures $A$, $B$ and $\pi$. For completeness we define the complete set of parameters as $\lambda = (A, B, \pi)$. For the remainder of this section we aim to cover three issues associated with HMMs, which are as follows:

- **Issue 1:** How can we calculate $P(Y_{1:T} \mid \lambda)$, the probability of observing the sequence $Y_{1:T} = \{Y_1, Y_2, \dots, Y_T\}$ for a given model with parameters $\lambda = (A, B, \pi)$?
- **Issue 2:** Given an observation sequence $Y_{1:T}$ with model parameters $\lambda$, how do we determine the sequence of hidden states based on the observations, $P(Q_{1:T} \mid Y_{1:T}, \lambda)$ with $Q_{1:T} = \{q_1, q_2, \dots, q_T\}$?
- **Issue 3:** How can we determine the optimal model parameter values for $\lambda = (A, B, \pi)$ so as to maximise $P(Y_{1:T} \mid Q_{1:T}, \lambda)$?

### 2.2 Determining the observation sequence

#### 2.2.1 Brute Force Approach

Suppose we wish to calculate the probability of observing a given sequence of observations $Y_{1:T} = \{Y_1, Y_2, \dots, Y_T\}$ from a given model. This can be useful as it allows us to test the validity of the model. If there are several candidate models available to choose from, then our aim will be to choose the model which best explains the observations, in other words the model which maximises $P(Y_{1:T} \mid \lambda)$.

This problem can be solved by enumerating all of the possible state sequences $Q_{1:T} = \{q_1, q_2, \dots, q_T\}$ which could generate the observations. The probability of observing the sequence $Y_{1:T}$ given a state sequence is

$$P(Y_{1:T} \mid Q_{1:T}, \lambda) = \prod_{t=1}^{T} P(Y_t \mid q_t, \lambda) \tag{2.5}$$

where we assume conditional independence of the observations given the states, which gives

$$P(Y_{1:T} \mid Q_{1:T}, \lambda) = b_{q_1}(Y_1) \, b_{q_2}(Y_2) \cdots b_{q_T}(Y_T) \tag{2.6}$$

The joint probability of $Y_{1:T}$ and $Q_{1:T}$ given model parameters $\lambda$, $P(Y_{1:T}, Q_{1:T} \mid \lambda)$, can be found by multiplying (2.6) by $P(Q_{1:T} \mid \lambda)$:

$$P(Y_{1:T}, Q_{1:T} \mid \lambda) = P(Y_{1:T} \mid Q_{1:T}, \lambda) \, P(Q_{1:T} \mid \lambda) \tag{2.7}$$

where the probability of such a state sequence $Q_{1:T}$ occurring is

$$P(Q_{1:T} \mid \lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T} \tag{2.8}$$

In order to calculate the probability of observing $Y_{1:T}$ we simply sum the joint probability given in (2.7) over all possible state sequences:

$$P(Y_{1:T} \mid \lambda) = \sum_{\text{all } Q} P(Y_{1:T} \mid Q_{1:T}, \lambda) \, P(Q_{1:T} \mid \lambda) \tag{2.9}$$

$$= \sum_{q_1, q_2, \dots, q_T} \pi_{q_1} b_{q_1}(Y_1) \, a_{q_1 q_2} b_{q_2}(Y_2) \cdots a_{q_{T-1} q_T} b_{q_T}(Y_T) \tag{2.10}$$

Calculating $P(Y_{1:T} \mid \lambda)$ through direct enumeration of all the state sequences may seem like the simplest approach. However, while this approach may appear straightforward, the required computation is not: calculating $P(Y_{1:T} \mid \lambda)$ in this fashion requires on the order of $2TN^T$ calculations, which is computationally expensive even for small problems. Given the computational complexity of this approach, an alternative is required.
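The brute-force sum (2.9)-(2.10) is short enough to write directly; the following Python sketch (our own illustration, with invented variable names) enumerates every one of the $N^T$ state sequences:

```python
import itertools

def brute_force_likelihood(obs, pi, A, B):
    """P(Y_{1:T} | lambda) by summing over all N^T state sequences.

    pi: initial distribution, pi[i] = P(q_1 = S_i)
    A:  transition matrix,   A[i][j] = P(q_{t+1} = S_j | q_t = S_i)
    B:  emission matrix,     B[j][k] = P(Y_t = k | q_t = S_j)
    """
    N, T = len(pi), len(obs)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):  # every state sequence
        # pi_{q_1} b_{q_1}(Y_1) a_{q_1 q_2} b_{q_2}(Y_2) ... as in (2.10)
        p = pi[Q[0]] * B[Q[0]][obs[0]]
        for t in range(1, T):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][obs[t]]
        total += p
    return total
```

The exponential cost is visible in the loop over `itertools.product(range(N), repeat=T)`: even a two-state model with $T = 100$ would require $2^{100}$ terms.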

#### 2.2.2 Forward-Backward Algorithm

A computationally faster way of determining $P(Y_{1:T} \mid \lambda)$ is the forward-backward algorithm (Baum and Egon (1967) and Baum (1972)), which comprises two parts. Firstly, we compute forwards through the sequence of observations the joint probability of $Y_{1:t}$ and the state $q_t = S_i$ at time $t$, i.e. $P(q_t = S_i, Y_{1:t})$. Secondly, we compute backwards the probability of the observations $Y_{t+1:T}$ given the state at time $t$, i.e. $P(Y_{t+1:T} \mid q_t = S_i)$. We can then combine the forward and backward passes to obtain the probability of a state $S_i$ at any given time $t$ from the entire set of observations:

$$P(q_t = S_i, Y_{1:T} \mid \lambda) = P(Y_1, Y_2, \dots, Y_t, q_t = S_i \mid \lambda) \, P(Y_{t+1}, Y_{t+2}, \dots, Y_T \mid q_t = S_i, \lambda) \tag{2.11}$$

The forward-backward algorithm also allows us to define the probability of being in state $S_i$ at time $t$ ($q_t = S_i$) by taking (2.11) and $P(Y_{1:T} \mid \lambda)$ from either the forward or backward pass of the algorithm:

$$P(q_t = S_i \mid Y_{1:T}, \lambda) = \frac{P(q_t = S_i, Y_{1:T} \mid \lambda)}{P(Y_{1:T} \mid \lambda)} \tag{2.12}$$

In the next section we will see how (2.12) can be used for determining the entire state sequence $q_1, q_2, \dots, q_T$.

**Forward Algorithm.** We define a forward variable $\alpha_t(i)$ as the joint probability of the partial observation sequence $Y_{1:t}$ and the state $q_t = S_i$ at time $t$, given model parameters $\lambda$:

$$\alpha_t(i) = P(Y_1, Y_2, \dots, Y_t, q_t = S_i \mid \lambda) \tag{2.13}$$

We can now use the forward variable to enumerate through all the possible states up to time $T$ with the following algorithm.

1. Initialisation:
$$\alpha_1(i) = \pi_i b_i(Y_1), \quad 1 \le i \le N \tag{2.14}$$

2. Recursion:
$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(Y_{t+1}), \quad 1 \le j \le N, \; 1 \le t \le T-1 \tag{2.15}$$

3. Termination:
$$P(Y_{1:T} \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) \tag{2.16}$$

The algorithm starts by initialising $\alpha_1(i)$ at time $t = 1$ with the joint probability of the first observation $Y_1$ and the initial state distribution $\pi_i$. To determine $\alpha_{t+1}(j)$ at time $t+1$ we enumerate through all the possible state transitions from time $t$ to $t+1$. As $\alpha_t(i)$ is the joint probability that $Y_1, Y_2, \dots, Y_t$ are observed with state $S_i$ at time $t$, then $\alpha_t(i) a_{ij}$ is the probability of the joint event that $Y_1, Y_2, \dots, Y_t$ are observed and that the state $S_j$ at time $t+1$ is arrived at via state $S_i$ at time $t$. Summing over all possible states $S_i$ at time $t$ accounts for every way of reaching $S_j$ at time $t+1$. It is then a case of determining $\alpha_{t+1}(j)$ by accounting for the observation $Y_{t+1}$ in state $j$, i.e. multiplying by $b_j(Y_{t+1})$. We compute $\alpha_{t+1}(j)$ for all states $j$, $1 \le j \le N$, at each time step, and iterate through $t = 1, 2, \dots, T-1$ until time $T$.

Our desired calculation $P(Y_{1:T} \mid \lambda)$ is then found by summing the terminal forward variables, since

$$\alpha_T(i) = P(Y_1, Y_2, \dots, Y_T, q_T = S_i \mid \lambda) \tag{2.17}$$

**Backward Algorithm.** In a similar fashion to the forward algorithm we can calculate the backward variable $\beta_t(i)$, defined as

$$\beta_t(i) = P(Y_{t+1}, Y_{t+2}, \dots, Y_T \mid q_t = S_i, \lambda) \tag{2.18}$$

the probability of the partial observation sequence from time $t+1$ to $T$, given that the state at time $t$ is $S_i$ with model parameters $\lambda$. As in the forward case, there is a backward algorithm which solves for $\beta_t(i)$ inductively.

1. Initialisation:
$$\beta_T(i) = 1, \quad 1 \le i \le N \tag{2.19}$$

2. Recursion:
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(Y_{t+1}) \, \beta_{t+1}(j), \quad t = T-1, T-2, \dots, 1, \; 1 \le i \le N \tag{2.20}$$

3. Termination:
$$P(Y_{1:T} \mid \lambda) = \sum_{j=1}^{N} \pi_j \, b_j(Y_1) \, \beta_1(j) \tag{2.21}$$

We set $\beta_T(i) = 1$ for all $i$ as we require the sequence of observations to end at time $T$ but do not specify the final state, as it is unknown. We then induct backwards from $t+1$ to $t$ through all possible transitions. To do this we must account for all possible transitions between $q_{t+1} = S_j$ and $q_t = S_i$, as well as the observation $Y_{t+1}$ and all of the observations from time $t+2$ to $T$ (i.e. $\beta_{t+1}(j)$). Both the forward and backward algorithms require approximately $N^2 T$ calculations, which means that in comparison with the brute force approach (which requires $2TN^T$ calculations) the forward-backward algorithm is much faster.

### 2.3 Determining the state sequence

Suppose we wish to know: what is the optimal sequence of hidden states? For example, in the coin toss problem we may wish to know which coin (biased or fair) was used at time $t$ and whether the same coin was used at time $t+1$. There are several ways of answering this question; one possible approach is to choose the state at each time $t$ which is most likely given the observations. To solve this problem we use (2.11) from the forward-backward algorithm, where $P(q_t = S_i, Y_{1:T} \mid \lambda) = \alpha_t(i) \beta_t(i)$:

$$P(q_t = S_i \mid Y_{1:T}, \lambda) = \frac{\alpha_t(i) \beta_t(i)}{P(Y_{1:T} \mid \lambda)} = \frac{\alpha_t(i) \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i) \beta_t(i)} \tag{2.22}$$

where $\alpha_t(i)$ accounts for the observations $Y_{1:t}$ with state $S_i$ at time $t$, and $\beta_t(i)$ accounts for the remaining observations $Y_{t+1:T}$ given state $S_i$ at time $t$. To ensure that $P(q_t = S_i \mid Y_{1:T}, \lambda)$ is a proper probability measure we normalise by $P(Y_{1:T} \mid \lambda)$.
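The forward and backward recursions, together with the state posterior (2.22), can be sketched in a few lines of Python (our own illustration; the function and variable names are not from the report):

```python
def forward_backward(obs, pi, A, B):
    """Forward and backward passes; returns alpha, beta and P(Y_{1:T}|lambda)."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]          # initialisation (2.19)
    for i in range(N):                            # forward initialisation (2.14)
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(T - 1):                        # forward recursion (2.15)
        for j in range(N):
            alpha[t + 1][j] = (sum(alpha[t][i] * A[i][j] for i in range(N))
                               * B[j][obs[t + 1]])
    for t in range(T - 2, -1, -1):                # backward recursion (2.20)
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    likelihood = sum(alpha[T - 1])                # termination (2.16)
    return alpha, beta, likelihood

def state_posterior(alpha, beta, likelihood, t):
    """P(q_t = S_i | Y_{1:T}, lambda) for all i, as in (2.22)."""
    return [alpha[t][i] * beta[t][i] / likelihood for i in range(len(alpha[0]))]
```

Note that for long sequences a practical implementation would rescale `alpha` and `beta` at each step to avoid numerical underflow; the sketch omits this for clarity.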

Once we know $P(q_t = S_i \mid Y_{1:T}, \lambda)$ for all states $1 \le i \le N$ we can calculate the most likely state at time $t$ by finding the state $i$ which maximises this posterior probability:

$$q_t^* = \arg\max_{1 \le i \le N} P(q_t = S_i \mid Y_{1:T}, \lambda), \quad 1 \le t \le T \tag{2.23}$$

While (2.23) allows us to find the most likely state at each time $t$, it is not, however, always a sensible approach. The main disadvantage is that it does not take into account the state transitions. It may be the case that the resulting state sequence includes states $q_{t-1} = S_i$ and $q_t = S_j$ when in fact the transition between the two states is not possible (i.e. $a_{ij} = 0$). This is because (2.23) gives the most likely state at each time $t$ without regard to the state transitions.

A logical solution to this problem is to change the optimality criterion, and instead of seeking the most likely state at each time $t$, find the most likely pairs of states $(q_t, q_{t+1})$. However, a more widely used approach is to find the single optimal state sequence $Q_{1:T}$, i.e. the state sequence path maximising $P(Q_{1:T}, Y_{1:T} \mid \lambda)$. We find this using a dynamic programming algorithm known as the Viterbi algorithm, which chooses the state sequence that maximises the likelihood of the state sequence for a given set of observations.

#### 2.3.1 Viterbi Algorithm

Let $\delta_t(i)$ be the maximum probability over state sequences of length $t$ that end in state $i$ (i.e. $q_t = S_i$) and produce the first $t$ observations:

$$\delta_t(i) = \max_{q_1, q_2, \dots, q_{t-1}} P(q_1, q_2, \dots, q_t = i, Y_1, Y_2, \dots, Y_t \mid \lambda) \tag{2.24}$$

The Viterbi algorithm (Viterbi (1967) and Forney (1973)) is similar to the forward algorithm, except that here we use maximisation instead of summation at the recursion and termination stages. We store the maximising arguments in an $N \times T$ matrix $\psi$; this matrix is later used to retrieve the optimal state sequence path in the backtracking step.

1. Initialisation:
$$\delta_1(i) = \pi_i b_i(Y_1), \quad 1 \le i \le N \tag{2.25}$$
$$\psi_1(i) = 0 \tag{2.26}$$

2. Recursion:
$$\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i) \, a_{ij}] \, b_j(Y_t), \quad 2 \le t \le T, \; 1 \le j \le N \tag{2.27}$$
$$\psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i) \, a_{ij}], \quad 2 \le t \le T, \; 1 \le j \le N \tag{2.28}$$

3. Termination:
$$p^* = \max_{1 \le i \le N} [\delta_T(i)] \tag{2.29}$$
$$q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)] \tag{2.30}$$

4. Path (state sequence) backtracking:
$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \dots, 1 \tag{2.31}$$
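The recursion (2.27)-(2.28) and the backtracking step (2.31) can be sketched as follows (our own illustration, not from the report); working in log space already applies the underflow remedy discussed below:

```python
import math

def viterbi(obs, pi, A, B):
    """Most likely state sequence, computed in log space to avoid underflow."""
    N, T = len(pi), len(obs)
    log = lambda x: math.log(x) if x > 0 else float("-inf")
    delta = [[0.0] * N for _ in range(T)]   # delta[t][j] as in (2.24), logged
    psi = [[0] * N for _ in range(T)]       # backpointers, psi_1(i) = 0 (2.26)
    for i in range(N):                      # initialisation (2.25)
        delta[0][i] = log(pi[i]) + log(B[i][obs[0]])
    for t in range(1, T):                   # recursion (2.27)-(2.28)
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] + log(A[i][j]))
            psi[t][j] = best_i
            delta[t][j] = (delta[t - 1][best_i] + log(A[best_i][j])
                           + log(B[j][obs[t]]))
    q = [0] * T
    q[T - 1] = max(range(N), key=lambda i: delta[T - 1][i])  # termination (2.30)
    for t in range(T - 2, -1, -1):          # backtracking (2.31)
        q[t] = psi[t + 1][q[t + 1]]
    return q
```

In the two-coin setting, a run of heads from a coin that is biased towards heads would be decoded as the biased state throughout, rather than flip-flopping as the pointwise rule (2.23) might.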

The advantage of the Viterbi algorithm is that it does not blindly accept the most likely state at each time $t$, but instead takes a decision based on the whole sequence. This is useful if there is an unlikely event at some point in the sequence: it will not affect the rest of the sequence, provided the remainder is reasonable. This is particularly useful in speech recognition, where a phoneme may be garbled or lost but the overall spoken word is still detectable. One practical problem with the Viterbi algorithm is that multiplying probabilities yields small numbers that can cause underflow errors on a computer. It is therefore recommended that the logarithm of the probabilities is taken, so as to turn the multiplications into summations. Once the algorithm has terminated, the probability of the optimal path can be recovered by exponentiating the result.

### 2.4 Parameter Estimation

The third issue which we shall consider is: how do we determine the model parameters $\lambda = (A, B, \pi)$? We wish to select the model parameters such that they maximise the probability of the observation sequence. There is no analytical way to solve this problem, but we can solve it iteratively using the Baum-Welch algorithm (Baum et al. (1970) and Baum (1972)), which is an Expectation-Maximisation algorithm that finds $\lambda = (A, B, \pi)$ such that $P(Y_{1:T} \mid \lambda)$ is locally maximised.

#### 2.4.1 Baum-Welch Algorithm

The Baum-Welch algorithm calculates the expected number of times each transition ($a_{ij}$) and emission ($b_j(Y_t)$) is used in generating a training sequence. To do this it uses the same forward and backward variables as were used to determine the state sequence.

Firstly, we define the probability of being in state $S_i$ at time $t$ and state $S_j$ at time $t+1$, given the model parameters and observation sequence, as

$$P(q_t = S_i, q_{t+1} = S_j \mid Y_{1:T}, \lambda) = \frac{\alpha_t(i) \, a_{ij} \, b_j(Y_{t+1}) \, \beta_{t+1}(j)}{P(Y_{1:T} \mid \lambda)} \tag{2.32}$$

$$= \frac{\alpha_t(i) \, a_{ij} \, b_j(Y_{t+1}) \, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i) \, a_{ij} \, b_j(Y_{t+1}) \, \beta_{t+1}(j)} \tag{2.33}$$

Equation (2.32) is illustrated in Figure 2.2. From the forward-backward algorithm we have already defined $P(q_t = S_i \mid Y_{1:T}, \lambda)$ as the probability of being in state $S_i$ at time $t$ given the model parameters and sequence of observations. This quantity relates to (2.32) as follows:

$$P(q_t = S_i \mid Y_{1:T}, \lambda) = \sum_{j=1}^{N} P(q_t = S_i, q_{t+1} = S_j \mid Y_{1:T}, \lambda) \tag{2.34}$$

If we sum $P(q_t = S_i \mid Y_{1:T}, \lambda)$ over $t$ we get the expected number of times that state $S_i$ is visited. Similarly, summing $P(q_t = S_i, q_{t+1} = S_j \mid Y_{1:T}, \lambda)$ over $t$ gives the expected number of transitions from state $S_i$ to state $S_j$. Combining the above, we are now able to re-estimate the model parameters $\lambda = (A, B, \pi)$.

1. Initial probabilities:
$$\bar{\pi}_i = P(q_1 = S_i \mid Y_{1:T}, \lambda), \quad \text{the expected frequency of state } S_i \text{ at time } t = 1 \tag{2.35}$$

Figure 2.2: Graphical representation of the computation required for the joint event that the system is in state $S_i$ at time $t$ and state $S_j$ at time $t+1$, as given in (2.32).

2. Transition probabilities:
$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} P(q_t = S_i, q_{t+1} = S_j \mid Y_{1:T}, \lambda)}{\sum_{t=1}^{T-1} P(q_t = S_i \mid Y_{1:T}, \lambda)} \tag{2.36}$$
$$= \frac{\text{expected number of transitions from state } S_i \text{ to state } S_j}{\text{expected number of transitions from state } S_i} \tag{2.37}$$

3. Emission probabilities:
$$\bar{b}_j(k) = \frac{\sum_{t=1,\, Y_t = k}^{T} P(q_t = S_j \mid Y_{1:T}, \lambda)}{\sum_{t=1}^{T} P(q_t = S_j \mid Y_{1:T}, \lambda)} \tag{2.38}$$
$$= \frac{\text{expected number of times in state } j \text{ observing symbol } k}{\text{expected number of times in state } j} \tag{2.39}$$

where the numerator sum in (2.38) is over those $t$ such that $Y_t = k$.

If we start with model parameters $\lambda = (A, B, \pi)$, we can use these to calculate (2.35)-(2.39) and so create a re-estimated model $\bar{\lambda} = (\bar{A}, \bar{B}, \bar{\pi})$. If $\bar{\lambda} = \lambda$ then the initial model $\lambda$ already defines a critical point of the likelihood function. If, however, $\bar{\lambda} \ne \lambda$, then $\bar{\lambda}$ is more likely than $\lambda$ in the sense that $P(Y_{1:T} \mid \bar{\lambda}) > P(Y_{1:T} \mid \lambda)$; that is, the observation sequence is more likely to have been produced by the new model. Once we have found an improved $\bar{\lambda}$ through re-estimation we can repeat this procedure iteratively, each time improving the probability of $Y_{1:T}$ being observed, until convergence to a (local) maximum likelihood estimate of the HMM parameters.
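A single re-estimation sweep of the updates (2.35)-(2.39) for one training sequence can be sketched as follows (our own illustration; the function name and unscaled forward-backward passes are our own simplifications, not the report's implementation):

```python
def baum_welch_step(obs, pi, A, B):
    """One Baum-Welch re-estimation of (pi, A, B) from a single sequence."""
    N, T, M = len(pi), len(obs), len(B[0])
    # forward and backward passes (unscaled, for clarity only)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(T - 1):
        for j in range(N):
            alpha[t + 1][j] = (sum(alpha[t][i] * A[i][j] for i in range(N))
                               * B[j][obs[t + 1]])
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    lik = sum(alpha[T - 1])
    # gamma[t][i] = P(q_t = S_i | Y, lambda); xi[t][i][j] as in (2.32)
    gamma = [[alpha[t][i] * beta[t][i] / lik for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / lik
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # re-estimation (2.35)-(2.39)
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_pi, new_A, new_B, lik
```

Iterating `baum_welch_step` until the returned likelihood stops increasing gives the local maximum described above; in practice the forward and backward variables would be rescaled at each step to avoid underflow on long sequences.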

## Chapter 3: HMM Applied to DNA Sequence Analysis

### 3.1 Introduction to DNA

Deoxyribonucleic acid (DNA) is the genetic material of a cell: a code containing instructions for the make-up of human beings and other organisms. The DNA code is made up of four chemical bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). The sequence of these bases determines the information necessary to build and maintain an organism, similar to the way in which the arrangement of letters determines a word. DNA bases are paired together as (A-T) and (C-G) to form base pairs, which are attached to a sugar-phosphate backbone (deoxyribose). The combination of a base, sugar and phosphate is called a nucleotide, and the nucleotides are arranged in two long strands that form a twisted spiral famously known as the double helix (Figure 3.1).

Figure 3.1: Strand of DNA in the form of a double helix, where the base pairs are joined to a sugar-phosphate backbone (p).

Since the 1960s it has been known that the pattern in which the four bases occur in a DNA sequence is not random. Early research into the composition of DNA relied on indirect methods such as base composition determination or the analysis of nearest neighbour frequencies. It was Elton (1974) who noticed that models assuming a homogeneous DNA structure were inappropriate for modelling the compositional heterogeneity of DNA, and it was thus proposed that DNA should be viewed as a sequence of segments, where each segment follows its own distribution of bases. The seminal paper by Churchill (1989) was the first to apply HMMs to DNA sequence analysis, where a heterogeneous strand of DNA was assumed to comprise homogeneous segments. Using the hidden states of the hidden Markov model it was possible to detect the underlying process generating the individual segments and to describe the entire sequence in terms of shorter segments.

### 3.2 CpG Islands

To illustrate the use of hidden Markov models in DNA sequence analysis we will consider an example given by Durbin et al. (1998).

In the human genome the dinucleotide CG (a sequence of two bases) occurs wherever a cytosine nucleotide is found next to a guanine nucleotide along the linear sequence of bases (Figure 3.1). We use the notation CpG (-C-phosphate-G-) to distinguish the dinucleotide CG from the base pair C-G. Typically, wherever the CG dinucleotide occurs the C nucleotide is modified by the process of methylation, in which the cytosine nucleotide is converted into methyl-C before mutating into T, thus creating the dinucleotide TG. The consequence of this is that CpG dinucleotides are rarer in the genome than would be expected. For biological reasons the methylation process is suppressed in short stretches of the genome, such as around the start regions of genes, and in these regions we see more CpG dinucleotides than elsewhere in the sequence. These regions are referred to as CpG islands (Bird, 1987) and are usually anywhere from a few hundred to a few thousand bases long.

Using a hidden Markov model we can ask: given a short sequence of DNA, is it from a CpG island, and how do we find CpG islands in a longer sequence? In terms of our hidden Markov model we can regard the genomic sequence as a sequence of bases which are either within a CpG island or not. This gives us our two hidden states, {CpG island, non-CpG island}, which we wish to uncover by observing the sequence of bases. As all four bases can occur in both CpG island and non-CpG island regions, we first must define a sensible notation to differentiate between, say, a C in a CpG island region and a C in a non-CpG island region. For A, C, G, T in a CpG island we write $\{A^+, C^+, G^+, T^+\}$, and for those bases that are not in a CpG island we write $\{A^-, C^-, G^-, T^-\}$.

Figure 3.2: Possible transitions between bases in CpG island and non-CpG island regions.

Figure 3.2 illustrates the possible transitions between bases, where it is possible to transition between all bases in both the CpG island state and the non-CpG island state. The transitions occur according to two sets of probabilities, which specify firstly the hidden state and then the chemical base observed from that state. Once we have established the observations $Y_t \in \{A, C, G, T\}$ and the states $S_i \in$ {CpG island, non-CpG island}, we can construct a directed acyclic graph (DAG) to illustrate the dependence structure of the model. The DAG given in Figure 3.3 shows that the observations $Y_t$ are dependent on the hidden states $q_t = S_i$, and that the states and observations are governed by the probability matrices $A$ and $B$, respectively. The matrix $A = \{a_{ij}\}$ represents the transitions between the two hidden states, $P(q_t = S_j \mid q_{t-1} = S_i) = a_{ij}$, and $B$ denotes the observation probabilities for the two hidden states, $B = (p^+, p^-)$. As we have seen in the previous chapter, we can estimate the parameters $A$ and $B$ using the Baum-Welch algorithm. In the CpG island example the observation probabilities $p^+$ and $p^-$ are given in Table 3.1 and Table 3.2.

Figure 3.3: DAG of the hidden Markov model, with $A$ representing the state transition probabilities and $B$ representing the observation probabilities for a given state.

Table 3.1: Transition probabilities between the bases A, C, G, T for the CpG island region ($p^+$).

Table 3.2: Transition probabilities between the bases A, C, G, T for the non-CpG island region ($p^-$).

We notice from the observation probabilities that the transitions from G to C and C to G in the CpG island region are higher than in the non-CpG island region. This difference in observation probabilities between the two regions justifies the use of the hidden Markov model. If the observation probabilities were constant throughout the strand of DNA then the sequence would be homogeneous, and we would be able to model the DNA sequence with a single set of probabilities for the transitions between bases. However, we know that the probability of transition between certain bases is greater in specific regions, so the probability of moving from a G to a C is not constant throughout the sequence. Thus we require an extra stochastic layer, which we model through a hidden Markov model.

### 3.3 Modelling a DNA sequence with a known number of hidden states

Having established the theory of hidden Markov models and how they apply to DNA analysis, we can develop models with which to analyse a DNA sequence. Here we will use the paper of Boys et al. (2000) to illustrate through an example how we can segment a DNA sequence when the number of hidden states is known and each hidden state corresponds to a segment type of the DNA sequence. In this paper the authors analyse the chimpanzee α-fetoprotein gene; this protein is secreted by embryonic liver epithelial cells and is also produced in the yolk sac of mammals. The protein plays an important role in the embryonic development of mammals; in particular, unusual levels of the protein found in pregnant mothers are associated with genetic disorders such as neural tube defects, spina bifida and Down's syndrome.
One approach which can be used to identify the parts of the DNA sequence where a transition between states occurs is a multiple-changepoint model, where inferences about the base transition matrices operating within each hidden state are made conditional on estimates

16 of their location. Carlin et al. (1992) give a solution to the multiple-changepoint problem for Markov chains within a Bayesian framework. However, the drawback to the approach is that it is difficult to specify informative prior knowledge about the location of the changepoints, so Bayesian analysis of changepoint problems tends to assume relatively uninformative priors for the changepoints. The authors also felt that while changepoint models are appropriate in time series analysis, they are perhaps not appropriate for DNA sequence analysis as they fail to capture the evolution of the DNA structure. Therefore, a more flexible approach to modelling both the DNA sequence and the underlying prior information is to use a hidden Markov model. The main advantage to this is that rather than specifying prior information about the precise location of the changepoint, it is perhaps preferable to specify the length of a segment instead. Previous work initially done by Churchill (1989) used a maximum likelihood approach and an EM algorithm to determine base transitions for a given hidden state. Here the authors adopt a Bayesian approach which incorporates prior information in identifying the hidden states. Inferences are made by simulating from the posterior distribution using the Markov chain Monte Carlo (MCMC) technique of Gibbs sampling. The advantage of this technique is that it allows for prior information to be incorporated and permits the detection of segment types allowing for posterior parameter uncertainty. Model As before, we we take our observations Y t {A, C, G, T } to be the four chemical bases and the states to represent the different segment types q t {S 1, S 2,..., S r }, t = 1, 2,..., T. In this case we assume that the number of different segment types are known of which there are r. 
We make the assumption that the transitions between the four bases follow a first-order Markov chain, where P(Y_t | Y_1, Y_2, ..., Y_{t-1}) = P(Y_t | Y_{t-1}); however, as we shall consider later, this is not necessarily a valid assumption. By establishing the same dependence structure as given by the DAG in Figure 3.3, we define the base transition matrices for each segment type as B = {P^1, P^2, ..., P^r}, where the observations follow a multinomial distribution with P^k = (P^k_ij). The base transitions therefore satisfy

P(Y_t = j | q_t = S_k, Y_1, Y_2, ..., Y_{t-1} = i, B) = P(Y_t = j | q_t = S_k, Y_{t-1} = i, B) = P^k_ij,    (3.1)

where i, j ∈ {A, C, G, T} and k ∈ {1, 2, ..., r}.

The hidden states are modelled using a first-order Markov process with transition matrix A = (a_kl), as shown in Figure 3.3. The hidden states at each location are unknown, and we must therefore treat them as unknown parameters in our model. If we assume Y_1 and q_1 follow independent discrete uniform distributions, then given the observed DNA sequence Y_{1:T} and the unobserved segment types Q_{1:T} we can define the likelihood function for the model parameters A and B as

L(A, B | Y_{1:T}, Q_{1:T}) = P(Y_1, q_1 | A, B) ∏_{t=2}^{T} P(Y_t = j | q_t = S_k, Y_{t-1} = i, B) P(q_t = S_l | q_{t-1} = S_k, A)    (3.2)

    = (4r)^{-1} ∏_{t=2}^{T} P^k_ij a_kl,    i, j ∈ {A, C, G, T}, k, l ∈ {1, 2, ..., r},    (3.3)

where we define

P(q_t = S_l | q_1, q_2, ..., q_{t-1} = S_k, A) = P(q_t = S_l | q_{t-1} = S_k, A) = a_kl.    (3.4)
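Since this likelihood will be reused repeatedly in what follows, it may help to see (3.2)–(3.3) as a computation. The sketch below is ours, assuming bases and segment types are integer-coded (bases 0–3, states 0 to r−1), and works on the log scale for numerical stability.

```python
import math

# Complete-data log-likelihood of (3.2)-(3.3), assuming integer-coded data:
# L(A, B | Y, Q) = (4r)^(-1) * prod_{t>=2} P^{Q_t}[Y_{t-1}][Y_t] * A[Q_{t-1}][Q_t].

def log_likelihood(Y, Q, A, B, r):
    """Y: observed bases (0..3); Q: segment types (0..r-1);
    A: r x r state transition matrix; B: list of r 4 x 4 base transition matrices."""
    ll = -math.log(4 * r)                        # uniform start for (Y_1, q_1)
    for t in range(1, len(Y)):
        ll += math.log(B[Q[t]][Y[t - 1]][Y[t]])  # base transition within segment
        ll += math.log(A[Q[t - 1]][Q[t]])        # hidden-state transition
    return ll
```

Accumulating on the log scale avoids the underflow that the raw product in (3.3) would suffer for realistic sequence lengths T.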

Prior distributions

Prior for base transitions

Given the multinomial form of the likelihood, we can take the prior distribution to be the Dirichlet distribution, as this is the conjugate prior. Therefore, if we take a row of a base transition matrix to be p_i = (p_ij), then the prior for p_i will be a Dirichlet distribution,

P(p_i) ∝ ∏_{j=1}^{4} p_ij^{α_ij},    0 ≤ p_ij ≤ 1,    Σ_{j=1}^{4} p_ij = 1,    (3.5)

where α = (α_ij) are the parameters of the distribution.

Prior for the segment types

As the transition matrix A for the hidden states follows in a similar fashion to the base transition matrices, we shall again use a Dirichlet distribution, of dimension r, for the prior of its rows, a_k = (a_kj). In general the prior belief for the hidden states is well defined, particularly with regard to segment length. In practice it is difficult to identify short segments, and so it is assumed that transitions between hidden states are rare, i.e. E(a_ii) is close to 1. The priors are

p^k_i = (p^k_ij) ~ D(c^k_i),    i = 1, 2, 3, 4,    k = 1, 2, ..., r,    (3.6)

a_k = (a_kj) ~ D(d_k),    k = 1, 2, ..., r.    (3.7)

Posterior analysis

The posterior distributions for the parameters A and B and for the hidden states at each time t (i.e. q_t = S_i) are found using Gibbs sampling with data augmentation. This involves simulating the hidden states conditional on the parameters and then simulating the parameters conditional on the hidden states, repeating this process until the parameters converge. The posterior distribution for the parameters, P(A, B | Y_{1:T}, Q_{1:T}), follows from the model likelihood given by (3.2), which, when combined with the conjugate Dirichlet priors given in the previous section, produces independent posterior Dirichlet distributions for the rows of the transition matrices, given by (3.8) and (3.9).
p^k_i | Y_{1:T}, Q_{1:T} ~ D(c^k_i + n^k_i),    i = 1, 2, 3, 4,    k = 1, 2, ..., r,    (3.8)

a_k | Y_{1:T}, Q_{1:T} ~ D(d_k + m_k),    k = 1, 2, ..., r,    (3.9)

where

n^k_i = (n^k_ij),    n^k_ij = Σ_{t=2}^{T} I(Y_{t-1} = i, Y_t = j, q_t = S_k),    (3.10)

m_k = (m_kj),    m_kj = Σ_{t=2}^{T} I(q_{t-1} = S_k, q_t = S_j),    (3.11)

and I(A) = 1 if A is true and 0 otherwise.

The second part of the Gibbs sampler involves simulating the hidden states from P(Q_{1:T} | Y_{1:T}, A, B). This can be done sequentially using the univariate updates P(q_t | Q_{-t}, Y_{1:T}, A, B), t = 1, 2, ..., T, where Q_{-t} = (q_1, ..., q_{t-1}, q_{t+1}, ..., q_T).
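The parameter half of one Gibbs sweep is easy to sketch: count the base transitions implied by the current hidden states, as in (3.10), and draw each row of P^k from its posterior Dirichlet (3.8) by normalising Gamma variates. This is a hedged illustration, not the authors' code; all names are ours.

```python
import random

# One parameter update of the Gibbs sampler (illustrative sketch):
# counts n^k_{ij} as in (3.10), then a posterior draw p^k_i ~ D(c^k_i + n^k_i).

def count_base_transitions(Y, Q, k, n_bases=4):
    """n^k_{ij}: counts of base transitions i -> j made while in state k."""
    n = [[0] * n_bases for _ in range(n_bases)]
    for t in range(1, len(Y)):
        if Q[t] == k:
            n[Y[t - 1]][Y[t]] += 1
    return n

def sample_dirichlet(params, rng=random):
    """Draw from a Dirichlet distribution by normalising Gamma(a, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in params]
    s = sum(g)
    return [x / s for x in g]

def update_row(prior_row, count_row):
    """One posterior draw of a row, p^k_i ~ D(c^k_i + n^k_i), as in (3.8)."""
    return sample_dirichlet([c + n for c, n in zip(prior_row, count_row)])
```

The hidden-state matrix A is updated in exactly the same way, using the counts m_kj from (3.11) in place of n^k_ij.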

Results

In the α-fetoprotein example the authors compared whether the DNA sequence should be segmented into two or three hidden states. Firstly, they consider the case of two hidden states, where the base transitions must follow one of two transition matrices, P^1 or P^2, which describe the transitions between the four bases (A, C, G, T). By selecting the number of hidden states a priori, the segment lengths are also pre-specified: setting E(a_ii) = 0.99 with SD(a_ii) = 0.01 gives a change between segments approximately every 100 bases. The posterior results for the parameters A and B = (P^1, P^2) are given in Figure 3.4, where the mean length of segment type 1 is around 500 bases and the mean length of segment type 2 is around 70 bases. The main difference between the two transition probability matrices can be seen in the transitions to bases A and C: in P^1 there are more transitions to A, and in P^2 there are more transitions to C. The larger variability of the "from A" and "from G" rows in P^2 is due to segment type 2 being rich in C and T with few A's and G's.

Figure 3.4: Boys et al. (2000), posterior summary of transition matrices with two hidden states, E(a_ii) = 0.99 and SD(a_ii) = 0.01

The authors then compared this posterior analysis with the results obtained when the number of hidden states is set to three. Figure 3.5 shows the approximate probabilities of being in each of the three states throughout the DNA sequence. The figure indicates that it is reasonable to assume that the sequence consists of three segments, and not two as was first assumed. It is possible to increase the number of segments up to the point where the posterior standard deviations of the base transition matrices are sufficiently small. In practice, however, the exact number of segments can be chosen using information criteria.
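The prior specification E(a_ii) = 0.99 with SD(a_ii) = 0.01 can be turned into concrete distributional parameters by moment matching. The snippet below is a small illustration of this arithmetic (the `beta_from_moments` helper is ours, treating the self-transition probability marginally as a Beta variable), together with the implied mean segment length 1/(1 − a_ii) ≈ 100 bases.

```python
# Illustrative arithmetic for the prior choice E(a_ii) = 0.99, SD(a_ii) = 0.01:
# moment-match a Beta(alpha, beta) distribution for the self-transition
# probability, and note the implied geometric mean segment length.

def beta_from_moments(mean, sd):
    """Return (alpha, beta) of a Beta distribution with the given mean and sd."""
    var = sd ** 2
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

alpha, beta = beta_from_moments(0.99, 0.01)   # alpha ~ 97.02, beta ~ 0.98
expected_length = 1.0 / (1.0 - 0.99)          # geometric mean duration ~ 100 bases
```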
Figure 3.5: Boys et al. (2000), posterior probability of being in each of the three states at time t: (a) P(q_t = S_1 | Y_{1:T}, A, B), (b) P(q_t = S_2 | Y_{1:T}, A, B) and (c) P(q_t = S_3 | Y_{1:T}, A, B)

In conclusion, segmenting the DNA sequence within a Bayesian framework can be advantageous if sufficient prior information, such as the length and number of segments, is available. In practice this is not usually the case, and so we shall expand upon this approach and show how it is possible to segment the DNA sequence when the number of hidden states is unknown by using a reversible jump MCMC approach.

3.4 DNA sequence analysis with an unknown number of hidden states

In the last example we considered the case where the number of hidden states was known; this is frequently not the case, and the number of hidden states must be determined. Here we will consider how we can infer the number of hidden states by utilising reversible jump MCMC algorithms. The paper Boys and Henderson (2001) uses the reversible jump MCMC approach for DNA sequence segmentation. We shall discuss this paper and the techniques used when the number of hidden states is unknown. We shall also include in this section the paper Boys and Henderson (2004), in which the authors expand on the idea that not only is the number of hidden states unknown, but so too is the order of the Markov dependence which, until now, has been assumed to be first order.

Model

We use a similar notation to the last example, taking our observations Y_t ∈ Y = {A, C, G, T} to be the four bases (Adenine, Cytosine, Guanine and Thymine); to simplify notation we denote the state space as Y = {1, 2, ..., b} (for applications to DNA, b = 4). We denote our hidden states as q_t = S_k, t ∈ {1, 2, ..., T}, k ∈ S = {1, 2, ..., r}, representing the different segment types, and δ represents the order of the Markov chain conditional on the hidden states. When δ = 0 we have the usual independence assumption, but for δ > 0 we can include the short-range dependence structure found in DNA (Churchill, 1992). The HMM can be considered in terms of the observation equation (3.12) and the state equation (3.13):

P(Y_t | Y_{1:t-1}, Q_{1:t}) = P(Y_t = j | Y_{t-δ}, ..., Y_{t-1}, q_t = S_k) = p^k_ij,    (3.12)

where i ∈ Y^δ = {1, 2, ..., b^δ}, j ∈ Y, k ∈ {1, 2, ..., r} and

i = I(Y_{1:T}, t, δ, b) = 1 + Σ_{l=1}^{δ} (Y_{t-l} − 1) b^{l−1};

P(q_t = S_l | q_{t-1} = S_k) = a_kl,    k, l ∈ S_r = {1, 2, ..., r},    (3.13)

where A = (a_kl) is the matrix of hidden-state transition probabilities and B = {P^1, ..., P^r} denotes the collection of observable base transition matrices in the r hidden states, with P^k = (p^k_ij), r ∈ R = {1, 2, ..., r_max} and δ ∈ Q = {0, 1, 2, ..., δ_max}.
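The context index i = I(Y_{1:T}, t, δ, b) in (3.12) simply maps the δ bases preceding position t onto a single row index of P^k. A minimal sketch of this mapping (our own naming; bases coded 1..b as in the text, positions 0-indexed):

```python
# Context index of (3.12): i = 1 + sum_{l=1}^{delta} (Y[t-l] - 1) * b**(l-1),
# mapping the delta preceding bases (coded 1..b) to a row index in 1..b**delta.

def context_index(Y, t, delta, b=4):
    """Row index of P^k for the order-delta context preceding position t."""
    i = 1
    for l in range(1, delta + 1):
        i += (Y[t - l] - 1) * b ** (l - 1)
    return i
```

With δ = 0 the index is always 1, recovering the independence case; with δ = 2 and b = 4 it ranges over 1..16, one row per dinucleotide context.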
While we treat r and δ as unknown, it is necessary for the reversible jump algorithm to bound the unknown number of states and the order of dependence by r_max and δ_max respectively. In this example we consider the case where the

number of hidden states r is unknown. The DAG in Figure 3.6 denotes the unknown quantities with circles and the known quantities with squares; in this case we therefore label the unknown number of states r with a circle.

Figure 3.6: DAG of the hidden Markov model with r hidden states

It is computationally convenient to model the hidden states Q_{1:T} as missing data and work with the complete-data likelihood P(Y_{1:T}, Q_{1:T} | r, δ, A, B), where for a given r the complete-data likelihood is simply the product of the observation and state equations:

P(Y_{1:T}, Q_{1:T} | r, δ, A, B) = P(Y_{1:T} | r, δ, Q_{1:T}, A, B) P(Q_{1:T} | r, δ, A, B)    (3.14)

    = ∏_{k∈S_r} ∏_{i∈Y^δ} ∏_{j∈Y} (p^k_ij)^{n^k_ij} × ∏_{i∈S_r} ∏_{j∈S_r} a_ij^{m_ij},    (3.15)

where

n^k_ij = Σ_{t=δ_max+1}^{T} I(I(Y_{1:T}, t, δ, b) = i, Y_t = j, q_t = S_k),

m_ij = Σ_{t=δ_max+1}^{T} I(q_{t-1} = S_i, q_t = S_j),

and I(·) denotes the indicator function, which equals 1 if its argument is true and 0 otherwise.

Prior distributions

The advantage of using a Bayesian analysis is that it is possible to include a priori uncertainty about the unknown parameters. The aim of this analysis is to make inferences about the unknown number of segments r, the order of dependence δ, the model transition parameters A and B, and also the sequence of hidden states Q_{1:T}. It is possible to quantify the uncertainty in these parameters through the prior distribution

P(r, δ, A, B) = P(r) P(δ) P(A, B | r, δ) = P(r) P(δ) P(A | r) P(B | r, δ).    (3.16)

In reversible jump applications we restrict the number of hidden states r and the order of dependence δ to at most r_max and δ_max respectively. For r and δ we use independent truncated Poisson priors, r ~ Po(α_r), r ∈ {1, 2, ..., r_max}, and δ ~ Po(α_δ), δ ∈ {0, 1, ..., δ_max}, with fixed hyperparameters α_r > 0 and α_δ > 0. As in the last example, we take independent Dirichlet distributions for the priors on the rows of A and B, where a_k = (a_kl) and p^k_i = (p^k_ij) represent the rows of the matrices A and P^k respectively:
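The truncated Poisson prior on r is simple to normalise directly over its finite support. As a small illustration (parameter values and names are ours):

```python
import math

# Truncated Poisson prior on the number of hidden states:
# P(r) is proportional to alpha**r / r!  for r = 1, ..., r_max, normalised
# over the support. The same construction applies to the prior on delta.

def truncated_poisson_pmf(alpha, r_max):
    weights = [alpha ** r / math.factorial(r) for r in range(1, r_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

pmf = truncated_poisson_pmf(alpha=2.0, r_max=8)   # P(r = 1), ..., P(r = 8)
```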

p^k_i = (p^k_ij) | r, δ ~ D(c^k_i),    i ∈ Y^δ, j ∈ Y, k ∈ S_r,    (3.17)

a_k = (a_kl) | r ~ D(d_k),    k, l ∈ S_r,    (3.18)

where the Dirichlet parameters c and d are chosen to reflect the goal of the analysis.

Posterior analysis

In a Bayesian analysis we combine information about the model parameters from the data and the prior distribution to obtain the posterior distribution (3.19), which calibrates the uncertainty about the unknown parameters after observing the data:

P(r, δ, Q_{1:T}, A, B | Y_{1:T}) ∝ P(Y_{1:T}, Q_{1:T} | r, δ, A, B) P(r, δ, A, B).    (3.19)

In the last example it was possible to sample from the posterior distribution using a straightforward MCMC technique, Gibbs sampling. However, in this case the posterior is more complicated, as we have now taken the number of hidden states r and the order of Markov dependence δ to be unknown quantities. The extra complexity this adds means that the MCMC algorithm must now allow the sampler to jump between parameter spaces of different dimensions, corresponding to models with different values of r and δ. This can be achieved using reversible jump techniques (Green, 1995), which are a generalisation of the Metropolis-Hastings algorithm. The term "reversible jump" comes from the fact that the parameter space is explored by a number of move types which all satisfy detailed balance, some of which allow jumps between subspaces of different dimensions. The two most popular categories of reversible jump moves are split/merge and birth/death moves. The basic idea behind a split/merge move is that a hidden state is either split in two or merged with another hidden state according to some probability, whereas birth/death moves, which we focus on here, create or delete a hidden state according to some probability.

MCMC scheme

At each iteration of the MCMC algorithm the following steps are performed:

1.
Update the order of dependence and the transition probability matrices from P(δ, A, B | r, Y_{1:T}, Q_{1:T}).

2. Update the number of hidden states r, and also A, B and Q_{1:T}, conditional on δ.

3. Update the sequence of hidden states Q_{1:T} from P(Q_{1:T} | r, δ, Y_{1:T}, A, B).

Step 3 of the MCMC procedure is simply an implementation of the forward-backward algorithm. In step 1 we update the order of Markov dependence from P(δ | r, Y_{1:T}, Q_{1:T}) and the transition probability parameters from P(A, B | r, δ, Y_{1:T}, Q_{1:T}) in the same step. Choosing a conjugate Dirichlet prior distribution for B allows δ to be updated without the need for a reversible jump move; instead it is updated from a conditional distribution of the form

P(δ | r, Y_{1:T}, Q_{1:T}) ∝ P(δ | r, Q_{1:T}) P(Y_{1:T} | r, δ, Q_{1:T}) = P(δ) P(Y_{1:T} | r, δ, Q_{1:T}),    (3.20)

where P(δ | r, Q_{1:T}) simplifies to P(δ) because we defined δ to be independent of (r, Q_{1:T}) a priori.

In step 2 the number of hidden states r is updated using a birth/death reversible jump move. The birth/death move is computationally simpler than the split/merge move, and the authors found that birth/death moves produce the best-mixing chain.

Birth and death moves

The move begins with a random choice between creating or deleting a hidden state, with probabilities b_r and d_r respectively. In the birth move a new hidden state j is proposed, which increases the number of hidden states from r to r + 1. A set of base transition probabilities u for the new state is generated from the prior distribution (3.17), with P̃^j = u and P̃^{j′} = P^{j′} for j′ ≠ j. We then simulate a row vector v for the state transitions of Ã from the prior distribution (3.18) and set the new row of the proposed transition matrix to ã_j = v. Column j is filled by taking ã_ij = w_i for i ≠ j, where w_i ~ Beta(d_ij, Σ_{j′≠j} d_ij′), with the existing entries of each such row rescaled by (1 − w_i) so that the rows still sum to one. Finally, a new hidden state sequence Q̃_{1:T} is simulated conditional on Ã, B̃ and r + 1 using the forward-backward algorithm. The move is then accepted with probability min(1, A_B), where

A_B = [P(Y_{1:T} | r+1, δ, Ã, B̃) / P(Y_{1:T} | r, δ, A, B)] × [P(r+1) / P(r)]

    × [∏_{k∈S_{r+1}} ∏_{i∈Y^δ} D(p̃^k_i | c^k_i) / ∏_{k∈S_r} ∏_{i∈Y^δ} D(p^k_i | c^k_i)] × [∏_{i∈S_{r+1}} D(ã_i | d̃_i) / ∏_{i∈S_r} D(a_i | d_i)]

    × d_{r+1} / [b_r (r+1) D(v | d_j) ∏_{i∈Y^δ} D(u_i | c^j_i) ∏_{i∈S_{r+1}\j} Beta(w_i | d_ij, Σ_{j′≠j} d_ij′)]

    × ∏_{i∈S_{r+1}\j} (1 − w_i)^{r−1}.    (3.21)

The first two lines of this expression contain the likelihood ratio and the prior ratio, with the remaining lines consisting of the proposal ratio and the Jacobian resulting from the transformations (B, u) → B̃ and (A, v, w) → Ã. We note that the expression does not depend on Q_{1:T} or Q̃_{1:T}, because it simplifies as

P(Y_{1:T} | r+1, δ, Ã, B̃) / P(Y_{1:T} | r, δ, A, B) = [P(Y_{1:T}, Q̃_{1:T} | r+1, δ, Ã, B̃) / P(Y_{1:T}, Q_{1:T} | r, δ, A, B)] × [P(Q_{1:T} | Y_{1:T}, r, δ, A, B) / P(Q̃_{1:T} | Y_{1:T}, r+1, δ, Ã, B̃)].    (3.22)

The death move follows in a similar fashion: a randomly chosen hidden state j is proposed for deletion, after which the remaining parameters are adjusted. Firstly, P^j is deleted, with the remaining base transition probabilities P̃^{j′} = P^{j′} for j′ ≠ j, and row and column j of A are also deleted.
The death of a hidden state is accepted with probability min(1, A_B^{-1}), and thus the birth and death moves form a reversible pair.

Bacteriophage lambda genome

In Boys and Henderson (2004) the authors call upon the example of analysing the genome of bacteriophage lambda, a parasite of the intestinal bacterium Escherichia coli which is often considered a benchmark example for comparing DNA segmentation techniques. Previous analyses of this genome, such as that of Churchill (1989), found the number of hidden states to be r ≤ 6 and the Markov dependence to be δ ≤ 1. However, taking the order of Markov dependence and the number of hidden states as parameters suggests that there are r = 6 hidden states (with a 95% highest density interval of {6, 7, 8}) with Markov dependence of order δ = 2. This is supported by the fact that the bacteriophage lambda genome is predominantly comprised of codons, the coding regions of DNA that occur as triplets (Y_{t-2}, Y_{t-1}, Y_t). However, it has been conjectured by Lawrence and Auger that some of the hidden states are reverse complements of each other, which is an area the authors are exploring further.
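To make the mechanics of the birth proposal concrete, the sketch below builds an enlarged matrix Ã from an existing A: a new row is drawn from a Dirichlet prior, the new column entries w_i are drawn from a Beta distribution, and the existing entries of each row are rescaled by (1 − w_i) so that rows still sum to one. This is only a schematic of the proposal step (the Beta(1, r) parameters are illustrative placeholders, not the d_ij-based choice in the text, and no acceptance probability is computed).

```python
import random

# Schematic birth proposal: append a new hidden state as the last row/column
# of the state transition matrix. Illustrative only; names and the Beta(1, r)
# column draws are our own placeholders.

def sample_dirichlet(params, rng=random):
    g = [rng.gammavariate(a, 1.0) for a in params]
    s = sum(g)
    return [x / s for x in g]

def birth_proposal(A, d_row, rng=random):
    """Propose an (r+1) x (r+1) matrix with the new state appended last."""
    r = len(A)
    v = sample_dirichlet(d_row, rng)            # new row, drawn from its prior
    A_new = []
    for i in range(r):
        w_i = rng.betavariate(1.0, float(r))    # new column entry for row i
        A_new.append([a * (1.0 - w_i) for a in A[i]] + [w_i])
    A_new.append(v)
    return A_new
```

Because each old row is scaled by (1 − w_i) before w_i is appended, every row of the proposed matrix remains a probability vector, which is what makes the Jacobian factor ∏ (1 − w_i)^{r−1} appear in the acceptance ratio.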

Chapter 4

Evaluation

4.1 Conclusions

The use of hidden Markov models for DNA sequence analysis has been well explored over the past two decades, and for even longer in other fields of research. While in this report we have only considered the application of HMMs to DNA, much work has also been done to apply these techniques to RNA and protein sequences. Perhaps the best-known example of these techniques being used in practice is ab initio gene finding, where the DNA sequence is scanned for signs of protein-coding genes.

One of the major drawbacks of the approaches given in this report is that most of the work assumes a first-order Markov dependence for the hidden states, which means that the duration time (i.e. the time spent in a state) follows a geometric distribution. In practice, the duration times for the hidden states of a DNA sequence do not follow a geometric distribution, and so the constraint imposed by the first-order Markov assumption will undoubtedly lead to unreliable results. One solution to this problem, which has been implemented in the GENSCAN algorithm, is the use of hidden semi-Markov models, which follow in a similar fashion to hidden Markov models except that the hidden states are semi-Markov rather than Markov. The advantage of this is that the duration times are no longer geometric; instead, the probability of transitioning to a new state depends on the length of time spent in the current state, so states need no longer have identically distributed duration times.

In terms of DNA sequence analysis, HMMs are not the only statistical approach available for segmenting the sequence. Much work has been done with multiple-changepoint segmentation models which, instead of using a hidden layer to detect a change in the base transitions, observe the sequence of bases directly and identify points in the sequence where the distribution of bases changes.
Compared to HMMs, multiple-changepoint models are computationally more efficient, as a posterior sample can be obtained without the use of MCMC techniques.

The development of HMMs over the past four decades has allowed them to be used in various fields, with many successful applications. In biology in particular, HMMs have been a great success in combining biology and statistics, with both fields reaping the benefits of new areas of research. The theory behind HMMs has expanded to allow for greater flexibility in the models available, including models with higher-order Markov dependence and models which do not require the number of hidden states to be pre-specified. There is still potential for further work with HMMs in terms of improved parameter estimation, unknown Markov dependence and state identifiability, to name a few. There will certainly be further applications to which HMMs will be applied, and with those new applications, new challenges will surely arise and improve upon the theory which has already been established.


### DNA (Deoxyribonucleic Acid)

DNA (Deoxyribonucleic Acid) Genetic material of cells GENES units of genetic material that CODES FOR A SPECIFIC TRAIT Called NUCLEIC ACIDS DNA is made up of repeating molecules called NUCLEOTIDES Phosphate

### CSE8393 Introduction to Bioinformatics Lecture 3: More problems, Global Alignment. DNA sequencing

SE8393 Introduction to Bioinformatics Lecture 3: More problems, Global lignment DN sequencing Recall that in biological experiments only relatively short segments of the DN can be investigated. To investigate

### agucacaaacgcu agugcuaguuua uaugcagucuua

RNA Secondary Structure Prediction: The Co-transcriptional effect on RNA folding agucacaaacgcu agugcuaguuua uaugcagucuua By Conrad Godfrey Abstract RNA secondary structure prediction is an area of bioinformatics

### Direct Methods for Solving Linear Systems. Linear Systems of Equations

Direct Methods for Solving Linear Systems Linear Systems of Equations Numerical Analysis (9th Edition) R L Burden & J D Faires Beamer Presentation Slides prepared by John Carroll Dublin City University

### II. DNA Deoxyribonucleic Acid Located in the nucleus of the cell Codes for your genes

HEREDITY = passing on of characteristics from parents to offspring How?...DNA! I. DNA, Chromosomes, Chromatin, and Genes DNA = blueprint of life (has the instructions for making an organism) Chromatin=

### Introduction to Algorithmic Trading Strategies Lecture 2

Introduction to Algorithmic Trading Strategies Lecture 2 Hidden Markov Trading Model Haksun Li haksun.li@numericalmethod.com www.numericalmethod.com Outline Carry trade Momentum Valuation CAPM Markov chain

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### Genetics. Chapter 9. Chromosome. Genes Three categories. Flow of Genetics/Information The Central Dogma. DNA RNA Protein

Chapter 9 Topics - Genetics - Flow of Genetics/Information - Regulation - Mutation - Recombination gene transfer Genetics Genome - the sum total of genetic information in a organism Genotype - the A's,

### Chapter 4. Simulated Method of Moments and its siblings

Chapter 4. Simulated Method of Moments and its siblings Contents 1 Two motivating examples 1 1.1 Du e and Singleton (1993)......................... 1 1.2 Model for nancial returns with stochastic volatility...........

### Continuous Time Markov Chains

Continuous Time Markov Chains Dan Crytser March 11, 2011 1 Introduction The theory of Markov chains of the discrete form X 1, X 2,... can be adapted to a continuous form X t, t [0, ). This requires use

### 11. Time series and dynamic linear models

11. Time series and dynamic linear models Objective To introduce the Bayesian approach to the modeling and forecasting of time series. Recommended reading West, M. and Harrison, J. (1997). models, (2 nd

### DNA 1. I. Extracting DNA from Your Cells

DNA 1 I. Extracting DNA from Your Cells Each cell in your body has a nucleus with multiple chromosomes. Each chromosome contains a DNA molecule. Each cell is surrounded by a cell membrane that regulates

### Handling attrition and non-response in longitudinal data

Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein

### 3.2 Roulette and Markov Chains

238 CHAPTER 3. DISCRETE DYNAMICAL SYSTEMS WITH MANY VARIABLES 3.2 Roulette and Markov Chains In this section we will be discussing an application of systems of recursion equations called Markov Chains.

### Web Quest: DNA & Protein Synthesis Biology 1

Web Quest: DNA & Protein Synthesis Biology 1 Name: TO ACCESS THE WEBSITES IN THIS WEB QUEST WITHOUT HAVING TO TYPE IN ALL OF THE URLs: 1. Go to alkire.weebly.com 2. Mouse over Biology 1 3. Click on Online

### MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS 1. SYSTEMS OF EQUATIONS AND MATRICES 1.1. Representation of a linear system. The general system of m equations in n unknowns can be written a 11 x 1 + a 12 x 2 +

### Structure of DNA Remember: genes control certain traits, genes are sections of DNA

tructure of DNA Remember: genes control certain traits, genes are sections of DNA I. tructure of DNA (deoxyribonucleic acid) A. Made of nucleotides 1. nucleotides have 3 main parts a. sugar (deoxyribose)

### 1 What Does DNA Look Like?

CHATER 4 1 What Does DNA Look Like? ECTION Genes and DNA BEFORE YOU READ After you read this section, you should be able to answer these questions: What units make up DNA? What does DNA look like? How

### Hardware Implementation of Probabilistic State Machine for Word Recognition

IJECT Vo l. 4, Is s u e Sp l - 5, Ju l y - Se p t 2013 ISSN : 2230-7109 (Online) ISSN : 2230-9543 (Print) Hardware Implementation of Probabilistic State Machine for Word Recognition 1 Soorya Asokan, 2

### INTRODUCTION TO DNA. DNA, CHROMOSOMES AND GENES How do these terms relate to one another?

INTRODUCTION TO DNA You've probably heard the term a million times. You know that DNA is something inside cells; you probably know that DNA has something to do with who we are and how we get to look the

### 2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system

1. Systems of linear equations We are interested in the solutions to systems of linear equations. A linear equation is of the form 3x 5y + 2z + w = 3. The key thing is that we don t multiply the variables

### OUTCOMES. PROTEIN SYNTHESIS IB Biology Core Topic 3.5 Transcription and Translation OVERVIEW ANIMATION CONTEXT RIBONUCLEIC ACID (RNA)

OUTCOMES PROTEIN SYNTHESIS IB Biology Core Topic 3.5 Transcription and Translation 3.5.1 Compare the structure of RNA and DNA. 3.5.2 Outline DNA transcription in terms of the formation of an RNA strand

### Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091)

Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091) Magnus Wiktorsson Centre for Mathematical Sciences Lund University, Sweden Lecture 5 Sequential Monte Carlo methods I February

### Probabilistic Methods for Time-Series Analysis

Probabilistic Methods for Time-Series Analysis 2 Contents 1 Analysis of Changepoint Models 1 1.1 Introduction................................ 1 1.1.1 Model and Notation....................... 2 1.1.2 Example:

### Gaussian Conjugate Prior Cheat Sheet

Gaussian Conjugate Prior Cheat Sheet Tom SF Haines 1 Purpose This document contains notes on how to handle the multivariate Gaussian 1 in a Bayesian setting. It focuses on the conjugate prior, its Bayesian

### Continuous-time Markov Chains

Continuous-time Markov Chains Gonzalo Mateos Dept. of ECE and Goergen Institute for Data Science University of Rochester gmateosb@ece.rochester.edu http://www.ece.rochester.edu/~gmateosb/ October 31, 2016

### DNA, RNA AND PROTEIN SYNTHESIS

DNA, RNA AND PROTEIN SYNTHESIS Evolution of Eukaryotic Cells Eukaryotes are larger, more complex cells that contain a nucleus and membrane bound organelles. Oldest eukarytotic fossil is 1800 million years

### Genetics Notes C. Molecular Genetics

Genetics Notes C Molecular Genetics Vocabulary central dogma of molecular biology Chargaff's rules messenger RNA (mrna) ribosomal RNA (rrna) transfer RNA (trna) Your DNA, or deoxyribonucleic acid, contains

### HMM : Viterbi algorithm - a toy example

MM : Viterbi algorithm - a toy example.5.3.4.2 et's consider the following simple MM. This model is composed of 2 states, (high C content) and (low C content). We can for example consider that state characterizes

### 6 Scalar, Stochastic, Discrete Dynamic Systems

47 6 Scalar, Stochastic, Discrete Dynamic Systems Consider modeling a population of sand-hill cranes in year n by the first-order, deterministic recurrence equation y(n + 1) = Ry(n) where R = 1 + r = 1

### 12.1 Identifying the Substance of Genes

12.1 Identifying the Substance of Genes Lesson Objectives Summarize the process of bacterial transformation. Describe the role of bacteriophages in identifying genetic material. Identify the role of DNA

### Markov chains (cont d)

6.867 Machine learning, lecture 19 (Jaaola) 1 Lecture topics: Marov chains (cont d) Hidden Marov Models Marov chains (cont d) In the contet of spectral clustering (last lecture) we discussed a random wal

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations SCENARIO You have responded, as a result of a call from the police to the Coroner s Office, to the scene of the death of

### Detection of changes in variance using binary segmentation and optimal partitioning

Detection of changes in variance using binary segmentation and optimal partitioning Christian Rohrbeck Abstract This work explores the performance of binary segmentation and optimal partitioning in the

### Senior Secondary Australian Curriculum

Senior Secondary Australian Curriculum Mathematical Methods Glossary Unit 1 Functions and graphs Asymptote A line is an asymptote to a curve if the distance between the line and the curve approaches zero

### What s the Point? --- Point, Frameshift, Inversion, & Deletion Mutations

What s the Point? --- Point, Frameshift, Inversion, & Deletion Mutations http://members.cox.net/amgough/mutation_chromosome_translocation.gif Introduction: In biology, mutations are changes to the base

### MATCH Commun. Math. Comput. Chem. 61 (2009) 781-788

MATCH Communications in Mathematical and in Computer Chemistry MATCH Commun. Math. Comput. Chem. 61 (2009) 781-788 ISSN 0340-6253 Three distances for rapid similarity analysis of DNA sequences Wei Chen,

### Monte Carlo Simulation technique. S. B. Santra Department of Physics Indian Institute of Technology Guwahati

Monte Carlo Simulation technique S. B. Santra Department of Physics Indian Institute of Technology Guwahati What is Monte Carlo (MC)? Introduction Monte Carlo method is a common name for a wide variety

### Methods for big data in medical genomics

Methods for big data in medical genomics Parallel Hidden Markov Models in Population Genetics Chris Holmes, (Peter Kecskemethy & Chris Gamble) Department of Statistics and, Nuffield Department of Medicine

### Statistical Machine Learning from Data

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

### Tagging with Hidden Markov Models

Tagging with Hidden Markov Models Michael Collins 1 Tagging Problems In many NLP problems, we would like to model pairs of sequences. Part-of-speech (POS) tagging is perhaps the earliest, and most famous,

### From DNA to Protein. Chapter 14

From DNA to Protein Chapter 14 Impacts, Issues: Ricin and your Ribosomes Ricin is toxic because it inactivates ribosomes, the organelles which assemble amino acids into proteins, critical to life processes

### Approximating the Coalescent with Recombination. Niall Cardin Corpus Christi College, University of Oxford April 2, 2007

Approximating the Coalescent with Recombination A Thesis submitted for the Degree of Doctor of Philosophy Niall Cardin Corpus Christi College, University of Oxford April 2, 2007 Approximating the Coalescent

### It took a while for biologists to figure out that genetic information was carried on DNA.

DNA Finally, we want to understand how all of the things we've talked about (genes, alleles, meiosis, etc.) come together at the molecular level. Ultimately, what is an allele? What is a gene? How does