Hidden Markov Models with Applications to DNA Sequence Analysis. Christopher Nemeth, STOR-i
May 4, 2011
Contents

1 Introduction
2 Hidden Markov Models
2.1 Introduction
2.2 Determining the observation sequence
2.2.1 Brute Force Approach
2.2.2 Forward-Backward Algorithm
2.3 Determining the state sequence
2.3.1 Viterbi Algorithm
2.4 Parameter Estimation
2.4.1 Baum-Welch Algorithm
3 HMM Applied to DNA Sequence Analysis
3.1 Introduction to DNA
3.2 CpG Islands
3.3 Modelling a DNA sequence with a known number of hidden states
3.4 DNA sequence analysis with an unknown number of hidden states
3.5 Evaluation
4 Conclusions
Abstract

The Hidden Markov Model (HMM) is a model with a finite number of states, each associated with a probability distribution. The transitions between states cannot be measured directly (they are hidden), but in a particular state an observation can be generated, and it is the observations, not the states themselves, which are visible to an outside observer. By applying a series of statistical techniques it is nevertheless possible to gain insight into the hidden states via the observations they generate. In the case of DNA analysis, we observe a strand of DNA which we believe can be segmented into homogeneous regions that identify the specific functions of the strand. Through the use of HMMs we can determine which parts of the strand belong to which segments by matching segments to hidden states. HMMs have been applied to a wide range of applications, including speech recognition, signal processing and econometrics, to name a few. Here we discuss the theory behind HMMs and their applications to DNA sequence analysis. We are specifically interested in three questions: how can we determine from which state our observations are generated, how do we determine the parameters of our model, and how do we determine the sequence of hidden states given our observations? While there are many applications of HMMs, we shall be concerned only with their use in DNA sequence analysis and shall cover examples in the literature where HMMs have been used successfully. In this area we compare two approaches to using HMMs for DNA segmentation: firstly, the approach taken when the number of hidden states is known and, secondly, how it is possible to segment a DNA sequence when the number of hidden states is unknown.
Chapter 1

Introduction

Since the discovery of the structure of DNA by Crick and Watson in 1953, scientists have endeavoured to better understand the basic building blocks of life. By identifying patterns in the DNA structure it is possible not only to categorise different species but, at a more detailed level, to discover subtler characteristics such as gender, eye colour and predisposition to disease. Scientists can gain a better understanding of DNA by segmenting long DNA strands into smaller homogeneous regions which differ in composition from the rest of the sequence. Identifying these homogeneous regions may prove useful to scientists who wish to understand the functional importance of the DNA sequence. There are various statistical techniques available to assist in the segmentation effort, which are covered in Braun and Muller (1998). Here, however, we shall focus only on the use of hidden Markov models (HMMs) as an approach to DNA segmentation. Hidden Markov models were first discussed by Baum and Petrie (1966) and have since been applied in a variety of fields such as speech recognition, handwriting identification, signal processing, bioinformatics, climatology and econometrics (see Cappe (2001) for a detailed list of applications). HMMs offer a way to model the latent structure of temporally dependent data, where we assume that the observed process evolves independently given an unobserved Markov chain. The Markov chain has a discrete, finite number of states which switch between one another with small probabilities. Since these states are unobserved and occur at random, they form a hidden Markov chain. The sequence of state changes that occur in the hidden Markov chain can be modelled via observations which depend on the hidden states.
Since the late 1980s and early 1990s HMMs have been applied to DNA sequence analysis, beginning with the seminal paper by Churchill (1989), which first applied HMMs to DNA segmentation. Since then hidden Markov models have been widely used in the field of DNA sequence analysis, with many papers evolving and updating the original idea laid out by Churchill. Aside from the papers written in this area, the practical use of these techniques has found its way into gene-finding software such as FGENESH+, GENSCAN and SLAM, which can be used to predict the location of genes in a genetic sequence. In this report we shall first outline the theory behind HMMs, covering parameter estimation, identification of hidden states and determination of the sequence of hidden states. We shall then develop this theory through several motivating examples in the field of DNA sequence analysis. Specifically, the theory behind HMMs is applied to practical examples in DNA sequence analysis, showing how to model the hidden states both when their number is known and when it is unknown. We conclude the report by discussing extensions which can be made to the HMM.
Chapter 2

Hidden Markov Models

2.1 Introduction

A Markov chain represents a sequence of random variables q_1, q_2, ..., q_T where the future state q_{t+1} depends only on the current state q_t, as in (2.1).

P(q_{t+1} | q_1, q_2, ..., q_t) = P(q_{t+1} | q_t)   (2.1)

There are a finite number of states which the chain can occupy at time t, which we define as S_1, S_2, ..., S_N, where at any time t, q_t = S_i for some 1 <= i <= N. In the case of a hidden Markov model it is not possible to directly observe the state q_t at any time t. Instead we observe an extra stochastic process Y_t which depends on the unobservable state q_t. We refer to these unobservable states as hidden states, where all inference about the hidden states is made through the observations Y_t, as shown in Figure 2.1.

Figure 2.1: Hidden Markov model with observations Y_t and hidden states q_t.
Example

Imagine we have two coins, one fair and the other biased. If we choose one of the coins at random and toss it, how can we determine whether we are tossing the fair coin or the biased coin, based on the outcomes of the coin tosses? One way of modelling this problem is through a hidden Markov model where we treat the coins as the hidden states q_t (i.e. fair, biased) and the tosses of the coin as the observations Y_t (i.e. heads, tails). As we can see from Figure 2.1, the observation at time t depends on the choice of coin, which can be either fair or biased. The coin used at time t depends on the coin used at time t-1 through the transition probability between the two states.

Formal Definition

Throughout this report we shall use the notation from Rabiner (1989). Hidden Markov models extend simpler Markov models in that the states S_i of the HMM cannot be observed directly. In order to determine the state of the HMM we must make inference from the observation of some random process Y_t which depends on the state at time t. The HMM is characterised by a discrete set of N states, S = {S_1, S_2, ..., S_N}, where the state at time t is denoted q_t (i.e. q_t = S_i). Generally speaking, the states are connected in such a way that it is possible to move from any state to any other state (as in an ergodic model). The movement between states is defined through a matrix of state transition probabilities A = {a_ij} where

a_ij = P(q_{t+1} = S_j | q_t = S_i), for 1 <= i, j <= N   (2.2)

For the special case where any state can reach any other state in a single step, a_ij > 0 for all i, j. The observations Y_t can take M distinct values (i.e. symbols per state). The observation symbols are the observed output of the system and are sometimes referred to as the discrete alphabet.
We define the probability of a given symbol being observed from state j at time t through the probability distribution B = {b_j(k)}, where

b_j(k) = P(Y_t = k | q_t = S_j), 1 <= j <= N, 1 <= k <= M   (2.3)

which is sometimes referred to as the emission probability, as this is the probability that state j generates observation k. Finally, in order to model the beginning of the process we introduce an initial state distribution π = {π_i} where

π_i = P(q_1 = S_i), 1 <= i <= N   (2.4)

Now that we have defined our observations and our states we can model the system. To do this we require the model parameters N (number of states) and M (number of distinct observations), the observation symbols, and the three probability measures A, B and π. For completeness we define the complete set of parameters as λ = (A, B, π). For the remainder of this section we aim to cover three issues associated with HMMs, which are as follows:
Issue 1: How can we calculate P(Y_{1:T} | λ), the probability of observing the sequence Y_{1:T} = {Y_1, Y_2, ..., Y_T} for a given model with parameters λ = (A, B, π)?

Issue 2: Given an observation sequence Y_{1:T} with model parameters λ, how do we determine the sequence of hidden states based on the observations, P(Q_{1:T} | Y_{1:T}, λ) with Q_{1:T} = {q_1, q_2, ..., q_T}?

Issue 3: How can we determine the optimal model parameter values for λ = (A, B, π) so as to maximise P(Y_{1:T} | Q_{1:T}, λ)?

2.2 Determining the observation sequence

2.2.1 Brute Force Approach

Suppose we wish to calculate the probability of observing a given sequence of observations Y_{1:T} = {Y_1, Y_2, ..., Y_T} from a given model. This can be useful as it allows us to test the validity of the model. If there are several candidate models available to choose from, then our aim will be to choose the model which best explains the observations; in other words, the model which maximises P(Y_{1:T} | λ). This problem can be solved by enumerating all of the possible state sequences Q_{1:T} = {q_1, q_2, ..., q_T} which could generate the observations. The probability of observing the sequence Y_{1:T} given a state sequence is

P(Y_{1:T} | Q_{1:T}, λ) = ∏_{t=1}^{T} P(Y_t | q_t, λ)   (2.5)

where we assume independence of observations given the states, which gives

P(Y_{1:T} | Q_{1:T}, λ) = b_{q_1}(Y_1) b_{q_2}(Y_2) ... b_{q_T}(Y_T)   (2.6)

The joint probability of Y_{1:T} and Q_{1:T} given the model parameters λ, P(Y_{1:T}, Q_{1:T} | λ), can be found by multiplying (2.6) by P(Q_{1:T} | λ),

P(Y_{1:T}, Q_{1:T} | λ) = P(Y_{1:T} | Q_{1:T}, λ) P(Q_{1:T} | λ)   (2.7)

where the probability of such a state sequence Q_{1:T} occurring is

P(Q_{1:T} | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T-1} q_T}   (2.8)

In order to calculate the probability of observing Y_{1:T} we simply sum the joint probability given in (2.7) over all possible state sequences,

P(Y_{1:T} | λ) = Σ_{all Q} P(Y_{1:T} | Q_{1:T}, λ) P(Q_{1:T} | λ)   (2.9)

= Σ_{q_1, q_2, ..., q_T} π_{q_1} b_{q_1}(Y_1) a_{q_1 q_2} b_{q_2}(Y_2) ... a_{q_{T-1} q_T} b_{q_T}(Y_T)   (2.10)

Calculating P(Y_{1:T} | λ) through direct enumeration of all the state sequences may seem like the simplest approach. However, while the approach may appear straightforward, the required computation is not: calculating P(Y_{1:T} | λ) in this fashion requires on the order of 2T N^T calculations, which is computationally expensive even for small problems. Given the computational complexity of this approach, an alternative is required.
2.2.2 Forward-Backward Algorithm

A computationally faster way of determining P(Y_{1:T} | λ) is to use the forward-backward algorithm (Baum and Egon (1967) and Baum (1972)), which comprises two parts. Firstly, we compute forwards through the sequence of observations the joint probability of Y_{1:t} and the state q_t = S_i at time t (i.e. P(q_t = S_i, Y_{1:t} | λ)). Secondly, we compute backwards the probability of the observations Y_{t+1:T} given the state at time t (i.e. P(Y_{t+1:T} | q_t = S_i, λ)). We can then combine the forward and backward passes to obtain the probability of a state S_i at any given time t from the entire set of observations,

P(q_t = S_i, Y_{1:T} | λ) = P(Y_1, Y_2, ..., Y_t, q_t = S_i | λ) P(Y_{t+1}, Y_{t+2}, ..., Y_T | q_t = S_i, λ)   (2.11)

The forward-backward algorithm also allows us to define the probability of being in state S_i at time t (q_t = S_i) by combining (2.11) with P(Y_{1:T} | λ) from either the forward or the backward pass of the algorithm,

P(q_t = S_i | Y_{1:T}, λ) = P(q_t = S_i, Y_{1:T} | λ) / P(Y_{1:T} | λ)   (2.12)

In the next section we will see how (2.12) can be used to determine the entire state sequence q_1, q_2, ..., q_T.

Forward Algorithm

We define a forward variable α_t(i) to be the joint probability of the partial observation sequence Y_{1:t} and the state q_t = S_i at time t, given model parameters λ,

α_t(i) = P(Y_1, Y_2, ..., Y_t, q_t = S_i | λ)   (2.13)

We can now use our forward variable to enumerate through all the possible states up to time T with the following algorithm.

Algorithm

1. Initialisation:
α_1(i) = π_i b_i(Y_1), 1 <= i <= N   (2.14)

2. Recursion:
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(Y_{t+1}), 1 <= j <= N, 1 <= t <= T-1   (2.15)

3. Termination:
P(Y_{1:T} | λ) = Σ_{i=1}^{N} α_T(i)   (2.16)

The algorithm starts by initialising α_1(i) at time t = 1 with the joint probability of the first observation Y_1 and the initial state probability π_i. To determine α_{t+1}(j) at time t + 1 we enumerate through all the possible state transitions from time t to t + 1.
As α_t(i) is the joint probability that Y_1, Y_2, ..., Y_t are observed with state S_i at time t, then α_t(i) a_ij is the probability of the joint event that Y_1, Y_2, ..., Y_t are observed and that state S_j at time t + 1 is arrived at via state S_i at time t. Summing over all possible states S_i at time t gives the probability of S_j at time t + 1; it then remains to determine α_{t+1}(j) by accounting for the observation Y_{t+1} in state j, i.e. b_j(Y_{t+1}). We compute α_{t+1}(j) for all states j, 1 <= j <= N, and iterate through t = 1, 2, ..., T-1 until time T. Our desired quantity P(Y_{1:T} | λ) is then calculated by summing α_T(i) over all N possible states, since

α_T(i) = P(Y_1, Y_2, ..., Y_T, q_T = S_i | λ)   (2.17)

Backward Algorithm

In a similar fashion to the forward algorithm, we can calculate the backward variable β_t(i), defined as

β_t(i) = P(Y_{t+1}, Y_{t+2}, ..., Y_T | q_t = S_i, λ)   (2.18)

which is the probability of the partial observations from time t + 1 to T, given that the state at time t is S_i with model parameters λ. As in the forward case, there is a backward algorithm which solves for β_t(i) inductively.

Algorithm

1. Initialisation:
β_T(i) = 1, 1 <= i <= N   (2.19)

2. Recursion:
β_t(i) = Σ_{j=1}^{N} a_ij b_j(Y_{t+1}) β_{t+1}(j), t = T-1, T-2, ..., 1, 1 <= i <= N   (2.20)

3. Termination:
P(Y_{1:T} | λ) = Σ_{j=1}^{N} π_j b_j(Y_1) β_1(j)   (2.21)

We set β_T(i) = 1 for all i as we require the sequence of observations to end at time T but do not specify the final state, as it is unknown. We then induct backwards from t + 1 to t through all possible transition states. To do this we must account for all possible transitions between q_{t+1} = S_j and q_t = S_i, as well as the observation Y_{t+1} and all of the observations from time t + 2 to T (via β_{t+1}(j)). Both the forward and backward algorithms require approximately N^2 T calculations, which means that in comparison with the brute force approach (which requires around 2T N^T calculations) the forward-backward algorithm is much faster.

2.3 Determining the state sequence

Suppose we wish to know: what is the optimal sequence of hidden states? For example, in the coin toss problem we may wish to know which coin (biased or fair) was used at time t and whether the same coin was used at time t + 1. There are several ways of answering this question; one possible approach is to choose the state at each time t which is most likely given the observations. To do so we use (2.11) from the forward-backward algorithm, where P(q_t = S_i, Y_{1:T} | λ) = α_t(i) β_t(i).
P(q_t = S_i | Y_{1:T}, λ) = α_t(i) β_t(i) / P(Y_{1:T} | λ) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)   (2.22)

where α_t(i) accounts for the observations Y_{1:t} with state S_i at time t, and β_t(i) accounts for the remaining observations Y_{t+1:T} given state S_i at time t. To ensure that P(q_t = S_i | Y_{1:T}, λ) is a proper probability measure we normalise by P(Y_{1:T} | λ).
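The forward recursion (2.14)-(2.16) and backward recursion (2.19)-(2.21) can be sketched in vectorised Python as follows (variable and function names are my own):

```python
import numpy as np

def forward(Y, A, B, pi):
    """Forward pass: alpha[t, i] = P(Y_1..Y_t, q_t = S_i | lambda)."""
    T, N = len(Y), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, Y[0]]                        # initialisation (2.14)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, Y[t + 1]]  # recursion (2.15)
    return alpha

def backward(Y, A, B):
    """Backward pass: beta[t, i] = P(Y_{t+1}..Y_T | q_t = S_i, lambda)."""
    T, N = len(Y), A.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                 # initialisation (2.19)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])  # recursion (2.20)
    return beta
```

Dividing α_t(i) β_t(i) by its sum over i then gives the posterior state probabilities of (2.22) directly.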
Once we know P(q_t = S_i | Y_{1:T}, λ) for all states 1 <= i <= N we can calculate the most likely state at time t by finding the state i which maximises P(q_t = S_i | Y_{1:T}, λ),

q_t* = arg max_{1 <= i <= N} P(q_t = S_i | Y_{1:T}, λ), 1 <= t <= T   (2.23)

While (2.23) allows us to find the most likely state at each time t, it is not a fully satisfactory approach. Its main disadvantage is that it does not take the state transitions into account. It may be the case that the resulting state sequence includes states q_{t-1} = S_i and q_t = S_j when in fact the transition between the two states is impossible (i.e. a_ij = 0). This is because (2.23) gives the most likely state at each time t without regard to the state transitions. A logical solution would be to change the optimality criterion and, instead of seeking the most likely state at each time t, find the most likely pairs of states (q_t, q_{t+1}). However, a more widely used approach is to find the single optimal state sequence Q_{1:T}, i.e. the state sequence path maximising P(Q_{1:T}, Y_{1:T} | λ). We find this using a dynamic programming algorithm known as the Viterbi algorithm, which chooses the state sequence that maximises the likelihood of the state sequence for a given set of observations.

2.3.1 Viterbi Algorithm

Let δ_t(i) be the maximum probability over state sequences of length t that end in state i (i.e. q_t = S_i) and produce the first t observations,

δ_t(i) = max_{q_1, q_2, ..., q_{t-1}} P(q_1, q_2, ..., q_t = i, Y_1, Y_2, ..., Y_t | λ)   (2.24)

The Viterbi algorithm (Viterbi (1967) and Forney (1973)) is similar to the forward algorithm except that here we use maximisation instead of summation at the recursion and termination stages. We store the maximising arguments in an N by T matrix ψ; this matrix is later used to retrieve the optimal state sequence path at the backtracking step.

1. Initialisation:
δ_1(i) = π_i b_i(Y_1), 1 <= i <= N   (2.25)
ψ_1(i) = 0   (2.26)

2. Recursion:
δ_t(j) = max_{1 <= i <= N} [δ_{t-1}(i) a_ij] b_j(Y_t), 2 <= t <= T, 1 <= j <= N   (2.27)
ψ_t(j) = arg max_{1 <= i <= N} [δ_{t-1}(i) a_ij], 2 <= t <= T, 1 <= j <= N   (2.28)

3. Termination:
p* = max_{1 <= i <= N} [δ_T(i)]   (2.29)
q_T* = arg max_{1 <= i <= N} [δ_T(i)]   (2.30)

4. Path (state sequence) backtracking:
q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, ..., 1   (2.31)
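A sketch of (2.25)-(2.31) in Python, working with log probabilities so that the products become sums (any zero transition probability would map to -inf, which the maximisation handles correctly):

```python
import numpy as np

def viterbi(Y, A, B, pi):
    """Most likely state path arg max_Q P(Q, Y | lambda), in log space."""
    T, N = len(Y), A.shape[0]
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))            # delta[t, j]: best log-prob ending in j
    psi = np.zeros((T, N), dtype=int)   # psi[t, j]: best predecessor of j
    delta[0] = logpi + logB[:, Y[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # scores[i, j] = delta + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, Y[t]]
    # backtracking, (2.29)-(2.31)
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path, delta[-1].max()
```

With the illustrative two-coin parameters used earlier, a run of four heads is decoded as four consecutive uses of the biased coin, since switching coins and emitting heads from the fair coin are both less likely.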
The advantage of the Viterbi algorithm is that it does not blindly accept the most likely state at each time t, but instead takes a decision based on the whole sequence. This is useful if there is an unlikely event at some point in the sequence: it will not affect the rest of the sequence provided the remainder is reasonable. This is particularly useful in speech recognition, where a phoneme may be garbled or lost but the overall spoken word is still detectable. One practical problem with the Viterbi algorithm is that multiplying probabilities yields small numbers that can cause underflow errors on a computer. It is therefore recommended that logarithms of the probabilities are taken, so that the multiplications become summations; once the algorithm has terminated, the probability can be recovered by exponentiating the result.

2.4 Parameter Estimation

The third issue which we shall consider is: how do we determine the model parameters λ = (A, B, π)? We wish to select the model parameters so as to maximise the probability of the observation sequence. There is no analytical way to solve this problem, but we can solve it iteratively using the Baum-Welch algorithm (Baum et al. (1970) and Baum (1972)), an Expectation-Maximisation algorithm which finds λ = (A, B, π) such that P(Y_{1:T} | λ) is locally maximised.

2.4.1 Baum-Welch Algorithm

The Baum-Welch algorithm calculates the expected number of times each transition (a_ij) and emission (b_j(Y_t)) is used in a training sequence. To do this it uses the same forward and backward variables as were used to determine the state sequence.
Firstly, we define the probability of being in state S_i at time t and state S_j at time t + 1, given the model parameters and observation sequence, as

P(q_t = S_i, q_{t+1} = S_j | Y_{1:T}, λ) = α_t(i) a_ij b_j(Y_{t+1}) β_{t+1}(j) / P(Y_{1:T} | λ)   (2.32)

= α_t(i) a_ij b_j(Y_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(Y_{t+1}) β_{t+1}(j)   (2.33)

Equation (2.32) is illustrated in Figure 2.2. From the forward-backward algorithm we have already defined P(q_t = S_i | Y_{1:T}, λ) as the probability of being in state S_i at time t given the model parameters and sequence of observations. This quantity relates to (2.32) as follows,

P(q_t = S_i | Y_{1:T}, λ) = Σ_{j=1}^{N} P(q_t = S_i, q_{t+1} = S_j | Y_{1:T}, λ)   (2.34)

If we sum P(q_t = S_i | Y_{1:T}, λ) over t we get the expected number of times that state S_i is visited. Similarly, summing P(q_t = S_i, q_{t+1} = S_j | Y_{1:T}, λ) over t gives the expected number of transitions from state S_i to state S_j. Combining the above, we are now able to re-estimate the model parameters λ = (A, B, π).

1. Initial probabilities:
π̄_i = P(q_1 = S_i | Y_{1:T}, λ) = expected number of times in state S_i at time t = 1   (2.35)
Figure 2.2: Graphical representation of the computation required for the joint event that the system is in state S_i at time t and state S_j at time t + 1, as given in (2.32).

2. Transition probabilities:
ā_ij = Σ_{t=1}^{T-1} P(q_t = S_i, q_{t+1} = S_j | Y_{1:T}, λ) / Σ_{t=1}^{T-1} P(q_t = S_i | Y_{1:T}, λ)   (2.36)
= expected number of transitions from state S_i to state S_j / expected number of transitions from state S_i   (2.37)

3. Emission probabilities:
b̄_j(k) = Σ_{t: Y_t = k} P(q_t = S_j | Y_{1:T}, λ) / Σ_{t=1}^{T} P(q_t = S_j | Y_{1:T}, λ)   (2.38)
= expected number of times in state j observing symbol k / expected number of times in state j   (2.39)

where Σ_{t: Y_t = k} denotes the sum over those t such that Y_t = k.

If we start with model parameters λ = (A, B, π), we can use these to calculate (2.35)-(2.39) and create a re-estimated model λ̄ = (Ā, B̄, π̄). If λ̄ = λ then the initial model λ already defines a critical point of the likelihood function. If, however, λ̄ ≠ λ, then λ̄ is more likely than λ in the sense that P(Y_{1:T} | λ̄) > P(Y_{1:T} | λ); that is, the new model parameters are such that the observation sequence is more likely to have been produced by λ̄. Once we have found an improved λ̄ through re-estimation, we can repeat the procedure iteratively, each time improving the probability of Y_{1:T} being observed, until convergence to a limiting point. This finally results in a (local) maximum likelihood estimate of the HMM.
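A single sweep of the re-estimation formulas (2.35)-(2.39), combined with the forward and backward passes it requires, might be sketched as follows (a compact numpy sketch; it is unscaled, so long sequences would need the logarithm or scaling treatment mentioned for the Viterbi algorithm):

```python
import numpy as np

def baum_welch_step(Y, A, B, pi):
    """One Baum-Welch re-estimation sweep; returns updated (A, B, pi)."""
    Y = np.asarray(Y)
    T, N, M = len(Y), A.shape[0], B.shape[1]
    # Forward and backward passes, equations (2.14)-(2.21).
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, Y[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, Y[t + 1]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    # gamma[t, i] = P(q_t = S_i | Y, lambda), as in (2.22).
    gamma = alpha * beta / likelihood
    # xi[t, i, j] = P(q_t = S_i, q_{t+1} = S_j | Y, lambda), as in (2.32).
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, Y[1:]].T * beta[1:])[:, None, :]) / likelihood
    # Re-estimation, (2.35)-(2.39): ratios of expected counts.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.array([gamma[Y == k].sum(axis=0) for k in range(M)]).T
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new, pi_new
```

Each call performs one EM iteration; repeating the call with the returned parameters implements the iterative improvement described above.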
Chapter 3

HMM Applied to DNA Sequence Analysis

3.1 Introduction to DNA

Deoxyribonucleic acid (DNA) is the genetic material of a cell: a code containing the instructions for the make-up of human beings and other organisms. The DNA code is made up of four chemical bases: adenine (A), cytosine (C), guanine (G) and thymine (T). The sequence of these bases encodes the information necessary to build and maintain an organism, much as the arrangement of letters determines a word. DNA bases pair together as (A-T) and (C-G) to form base pairs, which are attached to a sugar-phosphate backbone (deoxyribose). The combination of a base, sugar and phosphate is called a nucleotide, and the nucleotides are arranged in two long strands that form a twisted spiral famously known as the double helix (Figure 3.1).

Figure 3.1: Strand of DNA in the form of a double helix, with the base pairs attached to a sugar-phosphate backbone (p).

Since the 1960s it has been known that the pattern in which the four bases occur in a DNA sequence is not random. Early research into the composition of DNA relied on indirect methods such as base composition determination or the analysis of nearest-neighbour frequencies. It was only when Elton (1974) noticed that models assuming a homogeneous DNA structure were inappropriate for modelling the compositional heterogeneity of DNA that it was proposed that DNA should be viewed as a sequence of segments, where each segment follows its own distribution of bases. The seminal paper by Churchill (1989) was the first to apply HMMs to DNA sequence analysis, where a heterogeneous strand of DNA was assumed to comprise homogeneous segments. Using the hidden states of the hidden Markov model it was possible to detect the underlying process of the individual segments and to categorise the entire sequence in terms of shorter segments.
3.2 CpG Islands

To illustrate the use of hidden Markov models in DNA sequence analysis we will consider an example given by Durbin et al. (1998).
In the human genome, the dinucleotide CG (a sequence of two bases) occurs wherever a cytosine nucleotide is found next to a guanine nucleotide in the linear sequence of bases along a strand (Figure 3.1). We use the notation CpG (-C-phosphate-G-) to distinguish the dinucleotide CG from the base pair C-G. Typically, wherever the CG dinucleotide occurs the C nucleotide is modified by the process of methylation, in which the cytosine nucleotide is converted into methyl-C before mutating into T, thus creating the dinucleotide TG. The consequence of this is that CpG dinucleotides are rarer in the genome than would otherwise be expected. For biological reasons the methylation process is suppressed in short stretches of the genome, such as around the start regions of genes, and in these regions we see more CpG dinucleotides than elsewhere in the gene sequence. These regions are referred to as CpG islands (Bird, 1987) and are usually anywhere from a few hundred to a few thousand bases long. Using a hidden Markov model we can ask, given a short sequence of DNA, whether it comes from a CpG island, and also how we might find CpG islands in a longer sequence. In terms of our hidden Markov model we regard the genomic sequence as a sequence of bases which are either within a CpG island or are not. This gives us our two hidden states, {CpG island, non-CpG island}, which we wish to uncover by observing the sequence of bases. As all four bases can occur in both CpG island and non-CpG island regions, we first define notation to differentiate between, say, a C in a CpG island region and a C in a non-CpG island region: for A, C, G, T in a CpG island we write {A+, C+, G+, T+}, and for bases not in a CpG island we write {A-, C-, G-, T-}.
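To make the two-state idea concrete, here is a small posterior-decoding sketch under assumed, purely illustrative probabilities. Note one simplification: the CpG model described below works with base transition probabilities within each state, whereas this sketch uses independent per-base emission probabilities.

```python
import numpy as np

# Illustrative two-state CpG model; all probability values are assumptions
# chosen only so that C and G are more frequent inside islands.
STATES = ["CpG island", "non-CpG island"]
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
A = np.array([[0.95, 0.05],
              [0.02, 0.98]])              # islands assumed short and rare
B = np.array([[0.15, 0.35, 0.35, 0.15],   # island state: C/G enriched
              [0.30, 0.20, 0.20, 0.30]])  # background state
pi = np.array([0.1, 0.9])

def posterior_island_prob(seq):
    """P(q_t = island | Y_1:T) via the forward-backward identity (2.22)."""
    Y = [BASES[b] for b in seq]
    T = len(Y)
    alpha = np.zeros((T, 2))
    beta = np.zeros((T, 2))
    alpha[0] = pi * B[:, Y[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, Y[t + 1]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])
    gamma = alpha * beta / alpha[-1].sum()
    return gamma[:, 0]                    # posterior probability of "island"
```

Under these assumed values a GC-rich stretch receives a higher posterior probability of lying in the island state than an AT-rich one.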
Figure 3.2: Possible transitions between bases in CpG island and non-CpG island regions.

Figure 3.2 illustrates the possible transitions between bases, where it is possible to move between all bases in both the CpG island states and the non-CpG island states. The transitions occur according to two sets of probabilities, which specify firstly the state and then the observed chemical base from that state. Once we have established the observations Y_t ∈ {A, C, G, T} and the states S_i ∈ {CpG island, non-CpG island} we are able to construct a directed acyclic graph (DAG), which we shall use to illustrate the dependence structure of the model. The DAG given in Figure 3.3 shows that the observations Y_t depend on the hidden states q_t = S_i, and that the states and observations depend on the probability matrices A and B, respectively. The matrix A = {a_ij} represents the transitions between the two hidden states, P(q_t = S_j | q_{t-1} = S_i) = a_ij, and B denotes the observation probabilities for the two hidden states, B = (p+, p-). As we have seen in the previous chapter, the parameters of A and B can be estimated using the Baum-Welch algorithm. In the CpG island example the observation probabilities p+ and p- are given in Table 3.1 and Table 3.2.

Figure 3.3: DAG of the hidden Markov model, with A representing the state transition probabilities and B representing the observation probabilities for a given state.

Table 3.1: Transition probabilities between the bases A, C, G, T within a CpG island region.

Table 3.2: Transition probabilities between the bases A, C, G, T within a non-CpG island region.

We notice from the observation probabilities that the transitions from G to C and from C to G in the CpG island region are higher than in the non-CpG island region. The difference in observation probabilities between the two regions justifies the use of the hidden Markov model. If the observation probabilities were constant throughout the strand of DNA then the sequence would be homogeneous and we would be able to model it with a single set of probabilities for the transitions between bases. However, we know that the probability of transition between certain bases is greater in specific regions, so the probability of moving from a G to a C is not constant throughout the sequence. Thus we require an extra stochastic layer, which we model through a hidden Markov model.

3.3 Modelling a DNA sequence with a known number of hidden states

Having established the theory of hidden Markov models and how it applies to DNA analysis, we can develop models with which to analyse a DNA sequence. Here we use the paper of Boys et al. (2000) to illustrate, through an example, how we can segment a DNA sequence when the number of hidden states is known and each hidden state corresponds to a segment type of the DNA sequence. In this paper the authors analyse the chimpanzee α-fetoprotein gene; this protein is secreted by embryonic liver epithelial cells and is also produced in the yolk sac of mammals. It plays an important role in the embryonic development of mammals; in particular, unusual levels of the protein in pregnant mothers are associated with genetic disorders such as neural tube defects, spina bifida and Down's syndrome.
One approach which can be used to identify the parts of the DNA sequence where a transition between states occurs is a multiple-changepoint model, where inferences about the base transition matrices operating within each hidden state are made conditional on estimates of the changepoint locations. Carlin et al. (1992) give a solution to the multiple-changepoint problem for Markov chains within a Bayesian framework. However, a drawback of this approach is that it is difficult to specify informative prior knowledge about the locations of the changepoints, so Bayesian analyses of changepoint problems tend to assume relatively uninformative priors for the changepoints. The authors also felt that while changepoint models are appropriate in time series analysis, they are perhaps not appropriate for DNA sequence analysis, as they fail to capture the evolution of the DNA structure. A more flexible approach to modelling both the DNA sequence and the underlying prior information is therefore to use a hidden Markov model. The main advantage is that rather than specifying prior information about the precise location of a changepoint, it is preferable to specify prior information about the length of a segment instead. Previous work, initially by Churchill (1989), used a maximum likelihood approach and an EM algorithm to determine the base transitions for a given hidden state. Here the authors adopt a Bayesian approach which incorporates prior information when identifying the hidden states. Inferences are made by simulating from the posterior distribution using the Markov chain Monte Carlo (MCMC) technique of Gibbs sampling. The advantage of this technique is that it allows prior information to be incorporated and permits the detection of segment types while accounting for posterior parameter uncertainty.

Model

As before, we take our observations Y_t ∈ {A, C, G, T} to be the four chemical bases, and the states to represent the different segment types, q_t ∈ {S_1, S_2, ..., S_r}, t = 1, 2, ..., T. In this case we assume that the number of different segment types, r, is known.
We make the assumption that the transitions between the four bases follow a first-order Markov chain, where P(Y_t | Y_1, Y_2, ..., Y_{t-1}) = P(Y_t | Y_{t-1}); however, as we shall consider later, this is not necessarily a valid assumption. By establishing the same dependence structure as given by the DAG in Figure 3.3, we define the base transition matrices for each segment type as B = {P_1, P_2, ..., P_r}, with P_k = (P^k_{ij}), where the observations follow a multinomial distribution. The base transitions therefore satisfy

P(Y_t = j | q_t = S_k, Y_1, Y_2, ..., Y_{t-1} = i, B) = P(Y_t = j | q_t = S_k, Y_{t-1} = i, B) = P^k_{ij},   (3.1)

where i, j ∈ {A, C, G, T} and k ∈ {1, 2, ..., r}.

The hidden states are modelled using a first-order Markov process with transition matrix A = (a_{kl}), as shown in Figure 3.3. The hidden states at each location are unobserved, and therefore we must treat them as unknown parameters in our model. If we assume Y_1 and q_1 follow independent discrete uniform distributions, then given the observed DNA sequence Y_{1:T} and the unobserved segment types Q_{1:T} we can define the likelihood function for the model parameters A and B as

L(A, B | Y_{1:T}, Q_{1:T}) = P(Y_1, q_1 | A, B) ∏_{t=2}^{T} P(Y_t = j | q_t = S_k, Y_{t-1} = i, B) P(q_t = S_l | q_{t-1} = S_k, A)   (3.2)

                          = (4r)^{-1} ∏_{t=2}^{T} P^k_{ij} a_{kl},   i, j ∈ {A, C, G, T},   k, l ∈ {1, 2, ..., r},   (3.3)

where we define

P(q_t = S_l | q_1, q_2, ..., q_{t-1} = S_k, A) = P(q_t = S_l | q_{t-1} = S_k, A) = a_{kl}.   (3.4)
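To make the likelihood concrete, here is a minimal numerical sketch (not from the paper; the parameter values and the 0-based coding of bases and states are illustrative) of evaluating the complete-data log-likelihood (3.2)-(3.3) for a given state path:

```python
import numpy as np

# Toy sketch: evaluate the complete-data log-likelihood of Eq. (3.2)
# for a known state path Q and base sequence Y.
# Bases are coded 0..3 for A, C, G, T; states are coded 0..r-1.

def log_likelihood(Y, Q, A, B):
    """Y: base sequence, Q: hidden-state path, A: r x r state transitions,
    B: r x 4 x 4 base transition matrices, one per segment type."""
    r = A.shape[0]
    ll = -np.log(4 * r)                        # uniform start: P(Y_1, q_1) = (4r)^-1
    for t in range(1, len(Y)):
        ll += np.log(B[Q[t], Y[t - 1], Y[t]])  # base transition within state q_t
        ll += np.log(A[Q[t - 1], Q[t]])        # hidden-state transition
    return ll

rng = np.random.default_rng(0)
r = 2
A = np.array([[0.99, 0.01], [0.01, 0.99]])
B = rng.dirichlet(np.ones(4), size=(r, 4))     # each row of each P_k sums to 1
Y = rng.integers(0, 4, size=50)
Q = rng.integers(0, r, size=50)
ll = log_likelihood(Y, Q, A, B)
print(ll)
```

The log scale is used purely for numerical stability; the product in (3.3) underflows quickly for realistic sequence lengths.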
Prior distributions

Prior for base transitions

Given the multinomial form of the likelihood, we take the prior distribution to be the Dirichlet distribution, as this is the conjugate prior. Therefore, if we take a row of a base transition matrix to be p_i = (p_{ij}), then the prior for p_i is a Dirichlet distribution,

P(p_i) ∝ ∏_{j=1}^{4} p_{ij}^{α_{ij}},   0 ≤ p_{ij} ≤ 1,   ∑_{j=1}^{4} p_{ij} = 1,   (3.5)

where α = (α_{ij}) are the parameters of the distribution.

Prior for the segment types

As the transition matrix for the hidden states, A, follows in a similar fashion to the base transition matrices, we again use a Dirichlet distribution, of dimension r, for the prior on its rows (3.7), a_k = (a_{kj}). In general the prior belief about the hidden states is well defined, particularly with regard to segment length. In practice it is difficult to identify short segments, and so it is assumed that transitions between hidden states are rare, i.e. E(a_{ii}) is close to 1. The priors are

p^k_i = (p_{ij}) ~ D(c^k_i),   i = 1, 2, 3, 4,   k = 1, 2, ..., r,   (3.6)

a_k = (a_{kj}) ~ D(d_k),   k = 1, 2, ..., r.   (3.7)

Posterior analysis

The posterior distributions for the parameters A and B and for the hidden states at each time t (i.e. q_t = S_i) are found using Gibbs sampling with data augmentation. This involves simulating the hidden states conditional on the parameters and then simulating the parameters conditional on the hidden states, repeating the process until the parameters converge. The posterior distribution for the parameters, P(A, B | Y_{1:T}, Q_{1:T}), follows from the model likelihood (3.2), which when combined with the conjugate Dirichlet priors of the previous section produces independent posterior Dirichlet distributions for the rows of the transition matrices, given by (3.8)-(3.9).
p^k_i | Y_{1:T}, Q_{1:T} ~ D(c^k_i + n^k_i),   i = 1, 2, 3, 4,   k = 1, 2, ..., r,   (3.8)

a_k | Y_{1:T}, Q_{1:T} ~ D(d_k + m_k),   k = 1, 2, ..., r,   (3.9)

where

n^k_i = (n^k_{ij}),   n^k_{ij} = ∑_{t=2}^{T} I(Y_{t-1} = i, Y_t = j, q_t = S_k),   (3.10)

m_k = (m_{kj}),   m_{kj} = ∑_{t=2}^{T} I(q_{t-1} = S_k, q_t = S_j),   (3.11)

and I(A) = 1 if A is true and 0 otherwise.

The second part of the Gibbs sampler involves simulating the hidden states from P(Q_{1:T} | Y_{1:T}, A, B). This can be simulated sequentially using the univariate updates P(q_t | Q_{-t}, Y_{1:T}, A, B), t = 1, 2, ..., T, where Q_{-t} = (q_1, ..., q_{t-1}, q_{t+1}, ..., q_T).
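The parameter-update half of the Gibbs sampler can be sketched as follows: tally the transition counts (3.10)-(3.11) given the current state path, then draw each row of A and of every P_k from its posterior Dirichlet (3.8)-(3.9). The hyperparameter values and variable names below are illustrative, not from the paper.

```python
import numpy as np

# Sketch of the parameter-update step of the Gibbs sampler (Eqs 3.8-3.11).

def sample_parameters(Y, Q, r, c, d, rng):
    """c: 4x4 Dirichlet prior for base-transition rows (shared across k here),
    d: r-vector Dirichlet prior for state-transition rows."""
    n = np.zeros((r, 4, 4))          # n[k, i, j]: i -> j base moves inside state k
    m = np.zeros((r, r))             # m[k, l]:    k -> l hidden-state moves
    for t in range(1, len(Y)):
        n[Q[t], Y[t - 1], Y[t]] += 1
        m[Q[t - 1], Q[t]] += 1
    B = np.array([[rng.dirichlet(c[i] + n[k, i]) for i in range(4)]
                  for k in range(r)])
    A = np.array([rng.dirichlet(d + m[k]) for k in range(r)])
    return A, B

rng = np.random.default_rng(1)
Y = rng.integers(0, 4, size=200)
Q = rng.integers(0, 2, size=200)
A, B = sample_parameters(Y, Q, r=2, c=np.ones((4, 4)), d=np.ones(2), rng=rng)
print(A.sum(axis=1))                 # each sampled row of A sums to 1
```

Alternating this step with a draw of Q_{1:T} gives one sweep of the data-augmentation scheme described above.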
Results

In the α-fetoprotein example the authors compared whether the DNA sequence should be segmented into two or three hidden states. Firstly, they consider the case of two hidden states, where the base transitions must follow one of two transition matrices, P_1 or P_2, which govern the transitions between the four bases (A, C, G, T). With the number of hidden states selected a priori, the segment lengths are also pre-specified: setting E(a_ii) = 0.99 with SD(a_ii) = 0.01 gives a change between segments approximately every 100 bases. The posterior results for the parameters A and B = (P_1, P_2) are given in Figure 3.4, where the mean length of segment type 1 is around 500 bases and that of segment type 2 is around 70 bases. The main difference between the two transition probability matrices can be seen in the transitions to bases A and C: in P_1 there are more transitions to A, while in P_2 there are more transitions to C. The larger variability of the "from A" and "from G" rows in P_2 is due to segment type 2 being rich in C and T with few As and Gs.

Figure 3.4: Boys et al. (2000), posterior summary of transition matrices with two hidden states, E(a_ii) = 0.99 and SD(a_ii) = 0.01

The authors then compared the posterior analysis with the results obtained when the number of hidden states is set to three. Figure 3.5 shows the approximate probabilities of being in each of the three states through the DNA sequence. The figure indicates that it is reasonable to assume that the sequence consists of three segments and not two, as was first assumed. It is possible to increase the number of segments up to the point where the posterior standard deviations of the base transition matrices are sufficiently small. In practice, however, the exact number of segments can be determined using information criteria.
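The prior specification E(a_ii) = 0.99, SD(a_ii) = 0.01 can be checked with a short calculation. Assuming a two-state row of A, so that a_ii is marginally Beta-distributed, the implied Dirichlet counts and the prior mean segment length are:

```python
# Back-of-envelope sketch: turn E(a_ii) = 0.99, SD(a_ii) = 0.01 into
# Beta/Dirichlet hyperparameters for a two-state row of A, and check the
# implied geometric mean segment length 1/(1 - E(a_ii)), roughly 100 bases.

mean, sd = 0.99, 0.01

# For a_ii ~ Beta(d1, d2): E = d1/s and Var = E(1-E)/(s+1), with s = d1 + d2.
s = mean * (1 - mean) / sd**2 - 1      # solve the variance equation for s
d1, d2 = mean * s, (1 - mean) * s
print(d1, d2)                           # prior "counts" for (stay, leave)

expected_length = 1 / (1 - mean)        # geometric duration at the prior mean
print(expected_length)                  # mean segment length of about 100 bases
```

The same moment-matching idea extends to rows of a larger A by treating a_ii against the pooled off-diagonal mass.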
In conclusion, segmenting the DNA sequence within a Bayesian framework can be advantageous if sufficient prior information, such as the length and number of segments, is available. However, in practice this is not usually the case, and so we shall expand upon this approach and show how it is possible to segment the DNA sequence when the number of hidden states is unknown by using a reversible jump MCMC approach. 3.4 DNA sequence analysis with an unknown number of hidden states In the last example we considered the case where the number of hidden states was known; this is frequently not the case and the number of hidden states must be determined. Here we will consider how the number of hidden states can be inferred by utilising reversible jump MCMC
Figure 3.5: Boys et al. (2000), posterior probability of being in each of the three states at time t: (a) P(q_t = S_1 | Y_{1:T}, A, B), (b) P(q_t = S_2 | Y_{1:T}, A, B) and (c) P(q_t = S_3 | Y_{1:T}, A, B)

algorithms. The paper Boys and Henderson (2001) uses the reversible jump MCMC approach for DNA sequence segmentation. We shall discuss this paper and the techniques used when the number of hidden states is unknown. We shall also include in this section the paper Boys and Henderson (2004), in which the authors expand on the idea that not only is the number of hidden states unknown but so too is the order of the Markov dependence, which until now has been assumed to be first order.

Model

We use similar notation to the last example, taking our observations Y_t ∈ Y = {A, C, G, T} to be the four bases (Adenine, Cytosine, Guanine and Thymine); to simplify notation we denote the state space as Y = {1, 2, ..., b} (for applications to DNA, b = 4, the size of the alphabet). We denote our hidden states as q_t = S_k, t ∈ {1, 2, ..., T}, k ∈ S_r = {1, 2, ..., r}, representing the different segment types, and δ represents the order of the Markov chain conditional on the hidden states. When δ = 0 we have the usual independence assumption, but for δ > 0 we can include the short-range dependence structure found in DNA (Churchill, 1992). The HMM can be considered in terms of the observation equation (3.12) and the state equation (3.13):

P(Y_t | Y_{1:t-1}, Q_{1:t}) = P(Y_t = j | Y_{t-δ}, ..., Y_{t-1}, q_t = S_k) = p^k_{ij},   (3.12)

for i ∈ Y^δ = {1, 2, ..., b^δ}, j ∈ Y, k ∈ {1, 2, ..., r}, where the context index is

i = I(Y_{1:T}, t, δ, b) = 1 + ∑_{l=1}^{δ} (Y_{t-l} - 1) b^{l-1},

and

P(q_t = S_l | q_{t-1} = S_k) = a_{kl},   k, l ∈ S_r = {1, 2, ..., r},   (3.13)

where A = (a_{kl}) is the matrix of hidden-state transition probabilities and B = {P_1, ..., P_r} denotes the collection of base transition matrices for the r hidden states, with P_k = (p^k_{ij}). Finally, r ∈ R = {1, 2, ..., r_max} and δ ∈ Q = {0, 1, 2, ..., δ_max}.
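The context index I(Y_{1:T}, t, δ, b) simply packs the δ preceding bases into a single row index, so an order-δ chain can be stored as a b^δ × b transition matrix. A small sketch (toy sequence; 1-based base codes as in the text):

```python
# Sketch of the context-index map i = I(Y, t, delta, b) of Eq. (3.12):
# it encodes the delta previous bases as one index in 1..b^delta.

def context_index(Y, t, delta, b=4):
    # Y uses 1-based base codes (1..b); t indexes the current position.
    return 1 + sum((Y[t - l] - 1) * b ** (l - 1) for l in range(1, delta + 1))

Y = [2, 3, 1, 4, 2]                      # e.g. codes for C, G, A, T, C
print(context_index(Y, t=4, delta=0))    # empty context: always index 1 (iid case)
print(context_index(Y, t=4, delta=1))    # depends only on the previous base
print(context_index(Y, t=4, delta=2))    # packs the two previous bases together
```

For δ = 2 and b = 4 the index ranges over 1..16, matching the b^δ rows of each P_k.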
While we treat r and δ as unknown, it is necessary for the reversible jump algorithm to restrict the number of states and the order of dependence to at most r_max and δ_max, respectively. In this example we consider the case where the number of hidden states r is unknown. The DAG in Figure 3.6 denotes the unknown quantities with circles and the known quantities with squares; thus in this case we label the unknown number of states r with a circle.

Figure 3.6: DAG of the hidden Markov model with r hidden states

It is computationally convenient to model the hidden states Q_{1:T} as missing data and work with the complete-data likelihood P(Y_{1:T}, Q_{1:T} | r, δ, A, B), where for a given r the complete-data likelihood is simply the product of the observation and state equations:

P(Y_{1:T}, Q_{1:T} | r, δ, A, B) = P(Y_{1:T} | r, δ, Q_{1:T}, A, B) P(Q_{1:T} | r, δ, A, B)   (3.14)

= ∏_{k ∈ S_r} ∏_{i ∈ Y^δ} ∏_{j ∈ Y} (p^k_{ij})^{n^k_{ij}} ∏_{i ∈ S_r} ∏_{j ∈ S_r} a_{ij}^{m_{ij}},   (3.15)

where

n^k_{ij} = ∑_{t=δ_max+1}^{T} I(I(Y_{1:T}, t, δ, b) = i, Y_t = j, q_t = S_k),

m_{ij} = ∑_{t=δ_max+1}^{T} I(q_{t-1} = S_i, q_t = S_j),

and I(·) denotes the indicator function, which equals 1 if its argument is true and 0 otherwise.

Prior distributions

The advantage of using a Bayesian analysis is that it is possible to include a priori uncertainty about the unknown parameters. The aim of this analysis is to make inferences about the unknown number of segments r, the order of dependence δ, the model transition parameters A and B, and also the sequence of hidden states Q_{1:T}. It is possible to quantify the uncertainty about these parameters through the prior distribution

P(r, δ, A, B) = P(r) P(δ) P(A, B | r, δ) = P(r) P(δ) P(A | r) P(B | r, δ).   (3.16)

In reversible jump applications we restrict the number of hidden states r and the order of dependence δ to be at most r_max and δ_max, respectively. For r and δ we use independent truncated Poisson prior distributions, r ~ Po(α_r) truncated to r ∈ {1, 2, ..., r_max} and δ ~ Po(α_δ) truncated to δ ∈ {0, 1, ..., δ_max}, with fixed hyperparameters α_r > 0 and α_δ > 0. As in the last example, we take independent Dirichlet distributions for the priors on the rows of A and B, where a_k = (a_{kl}) and p^k_i = (p^k_{ij}) represent the rows of the matrices A and B, respectively:
p^k_i = (p^k_{ij}) | r, δ ~ D(c^k_i),   i ∈ Y^δ, k ∈ S_r,   (3.17)

a_k = (a_{kl}) | r ~ D(d_k),   k ∈ S_r,   (3.18)

where the Dirichlet parameters c and d are chosen to reflect the goals of the analysis.

Posterior analysis

In a Bayesian analysis we combine the information about the model parameters from the data with the prior distribution to obtain the posterior distribution (3.19), which quantifies the uncertainty about the unknown parameters after observing the data:

P(r, δ, Q_{1:T}, A, B | Y_{1:T}) ∝ P(Y_{1:T}, Q_{1:T} | r, δ, A, B) P(r, δ, A, B).   (3.19)

In the last example it was possible to determine the posterior distribution using a straightforward MCMC scheme with Gibbs sampling. However, in this case the posterior is more complicated, as we have now taken the number of hidden states r and the order of Markov dependence δ to be unknown quantities. The extra complexity means that the MCMC algorithm must now allow the sampler to jump between parameter spaces of different dimensions, corresponding to models with different values of r and δ. This can be achieved using reversible jump techniques (Green, 1995), which are a generalisation of the Metropolis-Hastings algorithm. The term "reversible jump" comes from the fact that the parameter space is explored by a number of move types which all attain detailed balance, some of which allow jumps between subspaces of different dimensions. The two most popular categories of reversible jump moves are split/merge and birth/death moves. The basic idea behind a split/merge move is that a hidden state is either split in two or combined with another hidden state according to some probability, whereas birth/death moves, which we focus on here, create or delete a hidden state according to some probability.

MCMC scheme

At each iteration of the MCMC algorithm the following steps are performed:

1. Update the order of dependence and the transition probability matrices from P(δ, A, B | r, Y_{1:T}, Q_{1:T}).
2. Update the number of hidden states r, together with A, B and Q_{1:T}, conditional on δ.
3. Update the sequence of hidden states Q_{1:T} from P(Q_{1:T} | r, δ, Y_{1:T}, A, B).

Step 3 of the MCMC procedure is simply an implementation of the forward-backward algorithm. In step 1 we update the order of Markov dependence, P(δ | r, Y_{1:T}, Q_{1:T}), and the transition probability parameters, P(A, B | r, δ, Y_{1:T}, Q_{1:T}), in the same step. Choosing a conjugate Dirichlet prior distribution for B allows δ to be updated without the need for a reversible jump move; instead it is updated from a conditional distribution of the form

P(δ | r, Y_{1:T}, Q_{1:T}) ∝ P(δ | r, Q_{1:T}) P(Y_{1:T} | r, δ, Q_{1:T}) = P(δ) P(Y_{1:T} | r, δ, Q_{1:T}),   (3.20)

where P(δ | r, Q_{1:T}) simplifies to P(δ) because δ was defined to be independent of (r, Q_{1:T}) a priori. In step 2 the number of hidden states r is updated using a birth/death reversible jump move. The birth/death move is computationally simpler than the split/merge move, and the authors found that birth/death moves produce the best-mixing chain.

Birth and death moves
The move begins with a random choice between creating and deleting a hidden state, with probabilities b_r and d_r, respectively. In the birth move a new hidden state j is proposed, increasing the number of hidden states from r to r + 1. A set of base transition probabilities u for the new state is generated from the prior distribution (3.17), with P̃_j = u and P̃_{j'} = P_{j'} for j' ≠ j. A row vector v for the state transitions is then simulated from the prior distribution (3.18) and the corresponding row of the proposed transition matrix Ã is set to ã_j = v. Column j is filled by taking ã_{ij} = w_i for i ≠ j, where w_i ~ Beta(d_{ij}, ∑_{j' ≠ j} d_{ij'}), with the remaining entries of row i rescaled by (1 − w_i) so that the row still sums to one. Finally, a new hidden state sequence Q̃_{1:T} is simulated conditional on Ã, B̃ and r + 1 using the forward-backward algorithm. The move is then accepted with probability min(1, A_B), where

A_B = [P(Y_{1:T} | r + 1, δ, Ã, B̃) / P(Y_{1:T} | r, δ, A, B)] × [P(r + 1) / P(r)]
      × [∏_{i ∈ S_{r+1}} D(ã_i | d̃_i) / ∏_{i ∈ S_r} D(a_i | d_i)] × [∏_{k ∈ S_{r+1}} ∏_{i ∈ Y^δ} D(p̃^k_i | c^k_i) / ∏_{k ∈ S_r} ∏_{i ∈ Y^δ} D(p^k_i | c^k_i)]
      × d_{r+1} / [b_r (r + 1) D(ṽ | d̃_j) ∏_{i ∈ Y^δ} D(u_i | c^j_i) ∏_{i ∈ S_{r+1} \ j} Be(w_i | d_{ij}, ∑_{j' ≠ j} d_{ij'})]
      × ∏_{i ∈ S_{r+1} \ j} (1 − w_i)^{r−1}.   (3.21)

The first two lines of this expression contain the likelihood ratio and the prior ratio, with the remaining lines consisting of the proposal ratio and the Jacobian resulting from the transformations (B, u) → B̃ and (A, v, w) → Ã. We note that the expression does not depend on Q_{1:T} and Q̃_{1:T} because the likelihood ratio simplifies as

P(Y_{1:T} | r + 1, δ, Ã, B̃) / P(Y_{1:T} | r, δ, A, B)
    = [P(Y_{1:T}, Q̃_{1:T} | r + 1, δ, Ã, B̃) / P(Y_{1:T}, Q_{1:T} | r, δ, A, B)] × [P(Q_{1:T} | Y_{1:T}, r, δ, A, B) / P(Q̃_{1:T} | Y_{1:T}, r + 1, δ, Ã, B̃)].   (3.22)

The death move follows in a similar fashion to the birth move: a randomly chosen hidden state j is proposed for deletion, after which the remaining parameters are adjusted. Firstly, P_j is deleted, with the remaining base transition probabilities set as P̃_{j'} = P_{j'} for j' ≠ j, and row and column j of Ã are also deleted.
The death of a hidden state is accepted with probability min(1, A_B^{-1}), and thus the birth and death moves form a reversible pair.

Bacteriophage lambda genome

In Boys and Henderson (2004) the authors analyse the genome of bacteriophage lambda, a parasite of the intestinal bacterium Escherichia coli which is often considered a benchmark example for comparing DNA segmentation techniques. Previous analyses of this genome's structure, such as those of Churchill (1989), had found that the number of hidden states is r = 6 and the Markov dependence is δ = 1. However, treating the order of Markov dependence and the number of hidden states as parameters suggests that there are r = 6 hidden states (with a 95% highest density interval of (6, 7, 8)) with Markov dependence of order δ = 2. This is supported by the fact that the bacteriophage lambda genome is predominantly comprised of codons, the nucleotide triplets (Y_{t-2}, Y_{t-1}, Y_t) that make up the coding regions of DNA. However, it has been conjectured by Lawrence and Auger that some of the hidden states are reverse complements of each other, which is an area the authors are exploring further.
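Both sampling schemes in this chapter draw the hidden state sequence via the forward-backward algorithm (step 3 of the MCMC scheme above). A minimal sketch for the first-order (δ = 1) model, with illustrative parameter values and variable names not taken from the papers:

```python
import numpy as np

# Sketch of sampling Q ~ P(Q | Y, A, B) by forward filtering and backward
# sampling. The emission at position t is the base transition p^k_{Y[t-1],Y[t]}
# of the current state, as in Eq. (3.1); bases and states are coded from 0.

def sample_states(Y, A, B, rng):
    T, r = len(Y), A.shape[0]
    alpha = np.zeros((T, r))
    alpha[0] = 1.0 / r                         # uniform q_1; P(Y_1) is constant
    for t in range(1, T):
        e = B[:, Y[t - 1], Y[t]]               # emission probability per state
        alpha[t] = e * (alpha[t - 1] @ A)      # filter forward
        alpha[t] /= alpha[t].sum()             # normalise for stability
    Q = np.empty(T, dtype=int)
    Q[-1] = rng.choice(r, p=alpha[-1])
    for t in range(T - 2, -1, -1):             # sample backwards
        w = alpha[t] * A[:, Q[t + 1]]
        Q[t] = rng.choice(r, p=w / w.sum())
    return Q

rng = np.random.default_rng(2)
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = rng.dirichlet(np.ones(4), size=(2, 4))
Y = rng.integers(0, 4, size=100)
Q_draw = sample_states(Y, A, B, rng)
print(Q_draw[:10])
```

Each call returns one exact draw from the conditional posterior of the state path, which is what both the Gibbs sampler and the reversible jump scheme require.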
Chapter 4

Evaluation

4.1 Conclusions

The use of hidden Markov models for DNA sequence analysis has been well explored over the past two decades, and for even longer in other fields of research. While in this report we only considered the applications of HMMs to DNA, much work has also been done to apply these techniques to RNA and protein sequences. Perhaps the best known example of these techniques being used in practice is in ab initio gene finding, where the DNA sequence is scanned for signs of protein-coding genes.

One of the major drawbacks of the approaches given in this report is that most of the work assumes a first-order Markov dependence for the hidden states, which means that the duration time (i.e. the time spent in a state) follows a geometric distribution. In practice, the duration times for the hidden states of a DNA sequence do not follow a geometric distribution, and so the constraint imposed by the first-order Markov assumption will undoubtedly lead to unreliable results. One solution to this problem, which has been implemented in the GENSCAN algorithm, is the use of hidden semi-Markov models, which follow in a similar fashion to hidden Markov models except that the hidden states are semi-Markov rather than Markov. The advantage of this is that the duration times are no longer geometric; instead the probability of transitioning to a new state depends on the length of time spent in the current state, so states need no longer have identically distributed duration times.

In terms of DNA sequence analysis, HMMs are not the only statistical approach available for segmenting a sequence. Much work has been done with multiple-changepoint segmentation models which, instead of using a hidden layer to detect a change in the base transitions, observe the sequence of bases directly and identify points in the sequence where the distribution of bases changes.
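The duration-time point above can be illustrated numerically: a first-order hidden state with self-transition probability a_ii has geometric dwell times, whereas a semi-Markov state may draw its duration from any distribution. The sketch below contrasts the two; the Poisson duration is purely illustrative and is not GENSCAN's choice.

```python
import numpy as np

# Geometric dwell times implied by a first-order HMM versus an explicit
# (here Poisson, shifted to be at least 1) semi-Markov duration distribution.

rng = np.random.default_rng(4)

a_ii = 0.99
markov_durations = rng.geometric(1 - a_ii, size=100_000)    # implied by the HMM
semi_markov_durations = 1 + rng.poisson(99, size=100_000)   # an explicit choice

print(markov_durations.mean(), semi_markov_durations.mean())  # both means near 100
print(markov_durations.std(), semi_markov_durations.std())    # very different spread
```

Matching the means while freeing the shape of the duration distribution is exactly the flexibility a hidden semi-Markov model buys.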
Compared to HMMs, multiple-changepoint models are computationally more efficient, as the posterior sample can be obtained without the use of MCMC techniques.

The development of HMMs over the past four decades has allowed them to be used in various fields, with many successful applications. In biology particularly, HMMs have been a great success in combining biology and statistics, with both fields reaping the benefits of new areas of research. The theory behind HMMs has expanded to allow for greater flexibility in the models available, including models with higher-order Markov dependency and models which do not require the number of hidden states to be pre-specified. There is still potential for further work with HMMs in terms of improved parameter estimation, unknown Markov dependency and state identifiability, to name a few. There will certainly be further applications to which HMMs are applied, and with those new applications new challenges will surely develop and improve upon the theory which has already been established.
Bibliography

Baum, L. (1972). An equality and associated maximisation technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1-8.
Baum, L. and Egon, J. A. (1967). An equality with applications to statistical estimation for probabilistic functions of a Markov process and to a model of ecology. Bulletin of the American Mathematical Society, 73(3).
Baum, L. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6).
Baum, L., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1).
Bird, A. (1987). CpG islands as gene markers in the vertebrate nucleus. Trends in Genetics, 3.
Boys, R. and Henderson, D. (2001). A comparison of reversible jump MCMC algorithms for DNA sequence segmentation using hidden Markov models. Comp. Sci. and Statist., 33.
Boys, R. and Henderson, D. (2004). A Bayesian approach to DNA sequence segmentation. Biometrics, 60(3).
Boys, R., Henderson, D., and Wilkinson, D. (2000). Detecting homogeneous segments in DNA sequences by using hidden Markov models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 49(2).
Braun, J. V. and Muller, H.-G. (1998). Statistical methods for DNA sequence segmentation. Statistical Science, 13(2).
Cappe, O. (2001). Ten years of HMM. hmmbib.html.
Carlin, B., Gelfand, A., and Smith, A. (1992). Hierarchical Bayesian analysis of changepoint problems. Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2).
Churchill, G. (1989). Stochastic models for heterogeneous DNA sequences. Bulletin of Mathematical Biology, 51.
Churchill, G. (1992). Hidden Markov chains and the analysis of genome structure. Computers and Chemistry, 16(2).
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge.
Elton, R. A. (1974). Theoretical models for heterogeneity of base composition in DNA. Journal of Theoretical Biology, 45(2).
Forney, G. J. (1973). The Viterbi algorithm. Proceedings of the IEEE, 61(3).
Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4).
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77.
Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2).
Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations
Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations SCENARIO You have responded, as a result of a call from the police to the Coroner s Office, to the scene of the death of
Introduction to Algorithmic Trading Strategies Lecture 2
Introduction to Algorithmic Trading Strategies Lecture 2 Hidden Markov Trading Model Haksun Li [email protected] www.numericalmethod.com Outline Carry trade Momentum Valuation CAPM Markov chain
1 Solving LPs: The Simplex Algorithm of George Dantzig
Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.
Statistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
A simple analysis of the TV game WHO WANTS TO BE A MILLIONAIRE? R
A simple analysis of the TV game WHO WANTS TO BE A MILLIONAIRE? R Federico Perea Justo Puerto MaMaEuSch Management Mathematics for European Schools 94342 - CP - 1-2001 - DE - COMENIUS - C21 University
Language Modeling. Chapter 1. 1.1 Introduction
Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set
Basic Probability Concepts
page 1 Chapter 1 Basic Probability Concepts 1.1 Sample and Event Spaces 1.1.1 Sample Space A probabilistic (or statistical) experiment has the following characteristics: (a) the set of all possible outcomes
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
Tutorial on Markov Chain Monte Carlo
Tutorial on Markov Chain Monte Carlo Kenneth M. Hanson Los Alamos National Laboratory Presented at the 29 th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Technology,
The Basics of Graphical Models
The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures
Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: [email protected] Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
The Trip Scheduling Problem
The Trip Scheduling Problem Claudia Archetti Department of Quantitative Methods, University of Brescia Contrada Santa Chiara 50, 25122 Brescia, Italy Martin Savelsbergh School of Industrial and Systems
An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics
Slide 1 An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics Dr. Christian Asseburg Centre for Health Economics Part 1 Slide 2 Talk overview Foundations of Bayesian statistics
Basics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu [email protected] Modern machine learning is rooted in statistics. You will find many familiar
4. Continuous Random Variables, the Pareto and Normal Distributions
4. Continuous Random Variables, the Pareto and Normal Distributions A continuous random variable X can take any value in a given range (e.g. height, weight, age). The distribution of a continuous random
ECON20310 LECTURE SYNOPSIS REAL BUSINESS CYCLE
ECON20310 LECTURE SYNOPSIS REAL BUSINESS CYCLE YUAN TIAN This synopsis is designed merely for keep a record of the materials covered in lectures. Please refer to your own lecture notes for all proofs.
MATCH Commun. Math. Comput. Chem. 61 (2009) 781-788
MATCH Communications in Mathematical and in Computer Chemistry MATCH Commun. Math. Comput. Chem. 61 (2009) 781-788 ISSN 0340-6253 Three distances for rapid similarity analysis of DNA sequences Wei Chen,
Matrix Differentiation
1 Introduction Matrix Differentiation ( and some other stuff ) Randal J. Barnes Department of Civil Engineering, University of Minnesota Minneapolis, Minnesota, USA Throughout this presentation I have
Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in
DNA, RNA, Protein Synthesis Keystone 1. During the process shown above, the two strands of one DNA molecule are unwound. Then, DNA polymerases add complementary nucleotides to each strand which results
Monte Carlo-based statistical methods (MASM11/FMS091)
Monte Carlo-based statistical methods (MASM11/FMS091) Jimmy Olsson Centre for Mathematical Sciences Lund University, Sweden Lecture 5 Sequential Monte Carlo methods I February 5, 2013 J. Olsson Monte Carlo-based
1 The Brownian bridge construction
The Brownian bridge construction The Brownian bridge construction is a way to build a Brownian motion path by successively adding finer scale detail. This construction leads to a relatively easy proof
1 Prior Probability and Posterior Probability
Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which
Comparison of frequentist and Bayesian inference. Class 20, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom
Comparison of frequentist and Bayesian inference. Class 20, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom 1 Learning Goals 1. Be able to explain the difference between the p-value and a posterior
Inference on Phase-type Models via MCMC
Inference on Phase-type Models via MCMC with application to networks of repairable redundant systems Louis JM Aslett and Simon P Wilson Trinity College Dublin 28 th June 202 Toy Example : Redundant Repairable
ECE 842 Report Implementation of Elliptic Curve Cryptography
ECE 842 Report Implementation of Elliptic Curve Cryptography Wei-Yang Lin December 15, 2004 Abstract The aim of this report is to illustrate the issues in implementing a practical elliptic curve cryptographic
Fairfield Public Schools
Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity
Lab # 12: DNA and RNA
115 116 Concepts to be explored: Structure of DNA Nucleotides Amino Acids Proteins Genetic Code Mutation RNA Transcription to RNA Translation to a Protein Figure 12. 1: DNA double helix Introduction Long
CURVE FITTING LEAST SQUARES APPROXIMATION
CURVE FITTING LEAST SQUARES APPROXIMATION Data analysis and curve fitting: Imagine that we are studying a physical system involving two quantities: x and y Also suppose that we expect a linear relationship
PRACTICE TEST QUESTIONS
PART A: MULTIPLE CHOICE QUESTIONS PRACTICE TEST QUESTIONS DNA & PROTEIN SYNTHESIS B 1. One of the functions of DNA is to A. secrete vacuoles. B. make copies of itself. C. join amino acids to each other.
Statistical Machine Translation: IBM Models 1 and 2
Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation
An Environment Model for N onstationary Reinforcement Learning
An Environment Model for N onstationary Reinforcement Learning Samuel P. M. Choi Dit-Yan Yeung Nevin L. Zhang pmchoi~cs.ust.hk dyyeung~cs.ust.hk lzhang~cs.ust.hk Department of Computer Science, Hong Kong
Chapter 11 Monte Carlo Simulation
Chapter 11 Monte Carlo Simulation 11.1 Introduction The basic idea of simulation is to build an experimental device, or simulator, that will act like (simulate) the system of interest in certain important
1.2 Solving a System of Linear Equations
1.. SOLVING A SYSTEM OF LINEAR EQUATIONS 1. Solving a System of Linear Equations 1..1 Simple Systems - Basic De nitions As noticed above, the general form of a linear system of m equations in n variables
HUMAN PROTEINS FROM GENETIC ENGINEERING OF ORGANISMS
HUMAN PROTEINS FROM GM BACTERIA Injecting insulin is an everyday event for many people with diabetes. GENETIC ENGINEERING OF ORGANISMS involves transferring genes from one species into another. Genetic
Gene Finding and HMMs
6.096 Algorithms for Computational Biology Lecture 7 Gene Finding and HMMs Lecture 1 Lecture 2 Lecture 3 Lecture 4 Lecture 5 Lecture 6 Lecture 7 - Introduction - Hashing and BLAST - Combinatorial Motif
Using simulation to calculate the NPV of a project
Using simulation to calculate the NPV of a project Marius Holtan Onward Inc. 5/31/2002 Monte Carlo simulation is fast becoming the technology of choice for evaluating and analyzing assets, be it pure financial
DERIVATIVES AS MATRICES; CHAIN RULE
DERIVATIVES AS MATRICES; CHAIN RULE 1. Derivatives of Real-valued Functions Let s first consider functions f : R 2 R. Recall that if the partial derivatives of f exist at the point (x 0, y 0 ), then we
PS 271B: Quantitative Methods II. Lecture Notes
PS 271B: Quantitative Methods II Lecture Notes Langche Zeng [email protected] The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.
BayesX - Software for Bayesian Inference in Structured Additive Regression
BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
Comp 14112 Fundamentals of Artificial Intelligence Lecture notes, 2015-16 Speech recognition
Comp 14112 Fundamentals of Artificial Intelligence Lecture notes, 2015-16 Speech recognition Tim Morris School of Computer Science, University of Manchester 1 Introduction to speech recognition 1.1 The
Numerical Methods for Option Pricing
Chapter 9 Numerical Methods for Option Pricing Equation (8.26) provides a way to evaluate option prices. For some simple options, such as the European call and put options, one can integrate (8.26) directly
Name Class Date. Figure 13 1. 2. Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.
13 Multiple Choice RNA and Protein Synthesis Chapter Test A Write the letter that best answers the question or completes the statement on the line provided. 1. Which of the following are found in both
Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary
Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:
4.5 Linear Dependence and Linear Independence
4.5 Linear Dependence and Linear Independence 267 32. {v 1, v 2 }, where v 1, v 2 are collinear vectors in R 3. 33. Prove that if S and S are subsets of a vector space V such that S is a subset of S, then
A disaccharide is formed when a dehydration reaction joins two monosaccharides. This covalent bond is called a glycosidic linkage.
CH 5 Structure & Function of Large Molecules: Macromolecules Molecules of Life All living things are made up of four classes of large biological molecules: carbohydrates, lipids, proteins, and nucleic
Replication Study Guide
Replication Study Guide This study guide is a written version of the material you have seen presented in the replication unit. Self-reproduction is a function of life that human-engineered systems have
FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL
FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL STATIsTICs 4 IV. RANDOm VECTORs 1. JOINTLY DIsTRIBUTED RANDOm VARIABLEs If are two rom variables defined on the same sample space we define the joint
SNP Essentials The same SNP story
HOW SNPS HELP RESEARCHERS FIND THE GENETIC CAUSES OF DISEASE SNP Essentials One of the findings of the Human Genome Project is that the DNA of any two people, all 3.1 billion molecules of it, is more than
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data
Metric Spaces. Chapter 7. 7.1. Metrics
Chapter 7 Metric Spaces A metric space is a set X that has a notion of the distance d(x, y) between every pair of points x, y X. The purpose of this chapter is to introduce metric spaces and give some
Binomial lattice model for stock prices
Copyright c 2007 by Karl Sigman Binomial lattice model for stock prices Here we model the price of a stock in discrete time by a Markov chain of the recursive form S n+ S n Y n+, n 0, where the {Y i }
