Hidden Markov Models with Applications to DNA Sequence Analysis. Christopher Nemeth, STOR-i
May 4, 2011
Contents

1 Introduction
2 Hidden Markov Models
2.1 Introduction
2.2 Determining the observation sequence
2.2.1 Brute Force Approach
2.2.2 Forward-Backward Algorithm
2.3 Determining the state sequence
2.3.1 Viterbi Algorithm
2.4 Parameter Estimation
2.4.1 Baum-Welch Algorithm
3 HMM Applied to DNA Sequence Analysis
3.1 Introduction to DNA
3.2 CpG Islands
3.3 Modelling a DNA sequence with a known number of hidden states
3.4 DNA sequence analysis with an unknown number of hidden states
3.5 Evaluation
4 Conclusions
Abstract

The Hidden Markov Model (HMM) is a model with a finite number of states, each associated with a probability distribution. The transitions between states cannot be measured directly (they are hidden), but in a particular state an observation can be generated, and it is the observations, not the states themselves, which are visible to an outside observer. By applying a series of statistical techniques it is nevertheless possible to gain insight into the hidden states via the observations they generate. In the case of DNA analysis, we observe a strand of DNA which we believe can be segmented into homogeneous regions that identify the specific functions of the strand. Through the use of HMMs we can determine which parts of the strand belong to which segments by matching segments to hidden states. HMMs have been applied to a wide range of applications, including speech recognition, signal processing and econometrics, to name a few. Here we discuss the theory behind HMMs and their applications to DNA sequence analysis. We are specifically interested in three questions: how can we determine from which state our observations are generated, how do we determine the parameters of our model, and how do we determine the sequence of hidden states given our observations? While there are many applications of HMMs, we shall be concerned only with their use in DNA sequence analysis and shall cover examples in the literature where HMMs have been used successfully. In this area we compare two approaches to using HMMs for DNA segmentation: firstly, the approach taken when the number of hidden states is known and, secondly, how it is possible to segment a DNA sequence when the number of hidden states is unknown.
Chapter 1

Introduction

Since the discovery of the structure of DNA by Crick and Watson in 1953, scientists have endeavoured to better understand the basic building blocks of life. By identifying patterns in the DNA structure it is possible not only to categorise different species but, at a more detailed level, to discover subtler characteristics such as gender, eye colour and predisposition to disease. Scientists can gain a better understanding of DNA by segmenting long DNA strands into smaller homogeneous regions which differ in composition from the rest of the sequence. Identifying these homogeneous regions may prove useful to scientists who wish to understand the functional importance of the DNA sequence. There are various statistical techniques available to assist in the segmentation effort, which are covered in Braun and Muller (1998). Here, however, we shall focus only on the use of hidden Markov models (HMMs) as an approach to DNA segmentation. Hidden Markov models were first discussed by Baum and Petrie (1966) and have since been applied in a variety of fields such as speech recognition, handwriting identification, signal processing, bioinformatics, climatology and econometrics (see Cappe (2001) for a detailed list of applications). HMMs offer a way to model the latent structure of temporally dependent data, where we assume that the observed process evolves independently given an unobserved Markov chain. The Markov chain has a discrete, finite number of states which switch between one another with small probabilities. Since these states are unobserved and occur at random, they form a hidden Markov chain. The sequence of state changes that occur in the hidden Markov chain can be modelled via observations which depend on the hidden states.
Since the late 1980s and early 1990s HMMs have been applied to DNA sequence analysis, beginning with the seminal paper by Churchill (1989), which first applied HMMs to DNA segmentation. Since then hidden Markov models have been widely used in the field of DNA sequence analysis, with many papers evolving and updating the original idea laid out by Churchill. Aside from the papers written in this area, the practical use of these techniques has found its way into gene-finding software such as FGENESH+, GENSCAN and SLAM, which can be used to predict the location of genes in a genetic sequence. In this report we shall first outline the theory behind HMMs, covering parameter estimation, identification of hidden states and determination of the sequence of hidden states. We shall then develop this theory through several motivating examples in the field of DNA sequence analysis. Specifically, the theory behind HMMs is applied to practical examples in DNA sequence analysis, showing how to model the hidden states both when their number is known and when it is unknown. We conclude the report by discussing extensions which can be made to the HMM.
Chapter 2

Hidden Markov Models

2.1 Introduction

A Markov chain represents a sequence of random variables q_1, q_2, ..., q_T where the future state q_{t+1} depends only on the current state q_t, as in (2.1).

P(q_{t+1} | q_1, q_2, ..., q_t) = P(q_{t+1} | q_t)   (2.1)

There are a finite number of states which the chain can occupy at time t, which we define as S_1, S_2, ..., S_N, where at any time t, q_t = S_i for some 1 <= i <= N. In the case of a hidden Markov model it is not possible to directly observe the state q_t at any time t. Instead we observe an extra stochastic process Y_t which depends on the unobservable state q_t. We refer to these unobservable states as hidden states, where all inference about the hidden states is made through the observations Y_t, as shown in Figure 2.1.

Figure 2.1: Hidden Markov model with observations Y_t and hidden states q_t.
Example

Imagine we have two coins, one fair and the other biased. If we choose one of the coins at random and toss it, how can we determine whether we are tossing the fair coin or the biased coin, based on the outcomes of the coin tosses? One way of modelling this problem is through a hidden Markov model where we treat the coins as the hidden states q_t (i.e. fair, biased) and the tosses of the coin as the observations Y_t (i.e. heads, tails). As we can see from Figure 2.1, the observation at time t depends on the choice of coin, which can be either fair or biased. The coin used at time t depends on the coin used at time t-1 through the transition probability between the two states.

Formal Definition

Throughout this report we shall use the notation from Rabiner (1989). Hidden Markov models extend simpler Markov models in that the states S_i of the HMM cannot be observed directly. In order to determine the state of the HMM we must make inference from the observation of some random process Y_t which depends on the state at time t. The HMM is characterised by a discrete set of N states, S = {S_1, S_2, ..., S_N}, where the state at time t is denoted q_t (i.e. q_t = S_i). Generally speaking, the states are connected in such a way that it is possible to move from any state to any other state (as in an ergodic model). The movement between states is defined through a matrix of state transition probabilities A = {a_ij} where

a_ij = P(q_{t+1} = S_j | q_t = S_i), for 1 <= i, j <= N   (2.2)

For the special case where any state can reach any other state in a single step, a_ij > 0 for all i, j. The observations Y_t can take M distinct values (i.e. symbols per state). The observation symbols are the observed output of the system and are sometimes referred to as the discrete alphabet.
We define the probability of a given symbol being observed from state j at time t through the probability distribution B = {b_j(k)}, where

b_j(k) = P(Y_t = k | q_t = S_j), 1 <= j <= N, 1 <= k <= M   (2.3)

which is sometimes referred to as the emission probability, as this is the probability that state j generates observation k. Finally, in order to model the beginning of the process we introduce an initial state distribution π = {π_i} where

π_i = P(q_1 = S_i), 1 <= i <= N   (2.4)

Now that we have defined our observations and our states we can model the system. To do this we require the model parameters N (number of states) and M (number of distinct observations), the observation symbols, and the three probability measures A, B and π. For completeness we define the complete set of parameters as λ = (A, B, π). For the remainder of this section we aim to cover three issues associated with HMMs, which are as follows:
Issue 1: How can we calculate P(Y_{1:T} | λ), the probability of observing the sequence Y_{1:T} = {Y_1, Y_2, ..., Y_T} for a given model with parameters λ = (A, B, π)?

Issue 2: Given an observation sequence Y_{1:T} with model parameters λ, how do we determine the sequence of hidden states based on the observations, P(Q_{1:T} | Y_{1:T}, λ) with Q_{1:T} = {q_1, q_2, ..., q_T}?

Issue 3: How can we determine the optimal model parameter values for λ = (A, B, π) so as to maximise P(Y_{1:T} | Q_{1:T}, λ)?

2.2 Determining the observation sequence

2.2.1 Brute Force Approach

Suppose we wish to calculate the probability of observing a given sequence of observations Y_{1:T} = {Y_1, Y_2, ..., Y_T} from a given model. This can be useful as it allows us to test the validity of the model. If there are several candidate models available to choose from, then our aim will be to choose the model which best explains the observations; in other words, the model which maximises P(Y_{1:T} | λ). This problem can be solved by enumerating all of the possible state sequences Q_{1:T} = {q_1, q_2, ..., q_T} which could generate the observations. The probability of observing the sequence Y_{1:T} given a state sequence is

P(Y_{1:T} | Q_{1:T}, λ) = ∏_{t=1}^{T} P(Y_t | q_t, λ)   (2.5)

where we assume independence of observations given the states, which gives

P(Y_{1:T} | Q_{1:T}, λ) = b_{q_1}(Y_1) b_{q_2}(Y_2) ... b_{q_T}(Y_T)   (2.6)

The joint probability of Y_{1:T} and Q_{1:T} given the model parameters λ, P(Y_{1:T}, Q_{1:T} | λ), can be found by multiplying (2.6) by P(Q_{1:T} | λ),

P(Y_{1:T}, Q_{1:T} | λ) = P(Y_{1:T} | Q_{1:T}, λ) P(Q_{1:T} | λ)   (2.7)

where the probability of such a state sequence Q_{1:T} occurring is

P(Q_{1:T} | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T-1} q_T}   (2.8)

In order to calculate the probability of observing Y_{1:T} we simply sum the joint probability given in (2.7) over all possible state sequences,

P(Y_{1:T} | λ) = Σ_{all Q} P(Y_{1:T} | Q_{1:T}, λ) P(Q_{1:T} | λ)   (2.9)

= Σ_{q_1, q_2, ..., q_T} π_{q_1} b_{q_1}(Y_1) a_{q_1 q_2} b_{q_2}(Y_2) ... a_{q_{T-1} q_T} b_{q_T}(Y_T)   (2.10)

Calculating P(Y_{1:T} | λ) through direct enumeration of all the state sequences may seem like the simplest approach. However, while the approach may appear straightforward, the required computation is not: calculating P(Y_{1:T} | λ) in this fashion requires on the order of 2T N^T calculations, which is computationally expensive even for small problems. Given the computational complexity of this approach, an alternative is required.
2.2.2 Forward-Backward Algorithm

A computationally faster way of determining P(Y_{1:T} | λ) is to use the forward-backward algorithm (Baum and Egon (1967) and Baum (1972)), which comprises two parts. Firstly, we compute forwards through the sequence of observations the joint probability of Y_{1:t} and the state q_t = S_i at time t (i.e. P(q_t = S_i, Y_{1:t} | λ)). Secondly, we compute backwards the probability of the observations Y_{t+1:T} given the state at time t (i.e. P(Y_{t+1:T} | q_t = S_i, λ)). We can then combine the forward and backward passes to obtain the probability of a state S_i at any given time t from the entire set of observations,

P(q_t = S_i, Y_{1:T} | λ) = P(Y_1, Y_2, ..., Y_t, q_t = S_i | λ) P(Y_{t+1}, Y_{t+2}, ..., Y_T | q_t = S_i, λ)   (2.11)

The forward-backward algorithm also allows us to define the probability of being in state S_i at time t (q_t = S_i) by combining (2.11) with P(Y_{1:T} | λ) from either the forward or the backward pass of the algorithm,

P(q_t = S_i | Y_{1:T}, λ) = P(q_t = S_i, Y_{1:T} | λ) / P(Y_{1:T} | λ)   (2.12)

In the next section we will see how (2.12) can be used to determine the entire state sequence q_1, q_2, ..., q_T.

Forward Algorithm

We define a forward variable α_t(i) to be the joint probability of the partial observation sequence Y_{1:t} and the state q_t = S_i at time t, given model parameters λ,

α_t(i) = P(Y_1, Y_2, ..., Y_t, q_t = S_i | λ)   (2.13)

We can now use our forward variable to enumerate through all the possible states up to time T with the following algorithm.

Algorithm

1. Initialisation:
α_1(i) = π_i b_i(Y_1), 1 <= i <= N   (2.14)

2. Recursion:
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(Y_{t+1}), 1 <= j <= N, 1 <= t <= T-1   (2.15)

3. Termination:
P(Y_{1:T} | λ) = Σ_{i=1}^{N} α_T(i)   (2.16)

The algorithm starts by initialising α_1(i) at time t = 1 with the joint probability of the first observation Y_1 and the initial state probability π_i. To determine α_{t+1}(j) at time t + 1 we enumerate through all the possible state transitions from time t to t + 1.
As α_t(i) is the joint probability that Y_1, Y_2, ..., Y_t are observed with state S_i at time t, then α_t(i) a_ij is the probability of the joint event that Y_1, Y_2, ..., Y_t are observed and that state S_j at time t + 1 is arrived at via state S_i at time t. Summing over all possible states S_i at time t gives the probability of S_j at time t + 1; it then remains to determine α_{t+1}(j) by accounting for the observation Y_{t+1} in state j, i.e. b_j(Y_{t+1}). We compute α_{t+1}(j) for all states j, 1 <= j <= N, and iterate through t = 1, 2, ..., T-1 until time T. Our desired quantity P(Y_{1:T} | λ) is then calculated by summing α_T(i) over all N possible states, since

α_T(i) = P(Y_1, Y_2, ..., Y_T, q_T = S_i | λ)   (2.17)

Backward Algorithm

In a similar fashion to the forward algorithm, we can calculate the backward variable β_t(i), defined as

β_t(i) = P(Y_{t+1}, Y_{t+2}, ..., Y_T | q_t = S_i, λ)   (2.18)

which is the probability of the partial observations from time t + 1 to T, given that the state at time t is S_i with model parameters λ. As in the forward case, there is a backward algorithm which solves for β_t(i) inductively.

Algorithm

1. Initialisation:
β_T(i) = 1, 1 <= i <= N   (2.19)

2. Recursion:
β_t(i) = Σ_{j=1}^{N} a_ij b_j(Y_{t+1}) β_{t+1}(j), t = T-1, T-2, ..., 1, 1 <= i <= N   (2.20)

3. Termination:
P(Y_{1:T} | λ) = Σ_{j=1}^{N} π_j b_j(Y_1) β_1(j)   (2.21)

We set β_T(i) = 1 for all i as we require the sequence of observations to end at time T but do not specify the final state, as it is unknown. We then induct backwards from t + 1 to t through all possible transition states. To do this we must account for all possible transitions between q_{t+1} = S_j and q_t = S_i, as well as the observation Y_{t+1} and all of the observations from time t + 2 to T (via β_{t+1}(j)). Both the forward and backward algorithms require approximately N^2 T calculations, which means that in comparison with the brute force approach (which requires around 2T N^T calculations) the forward-backward algorithm is much faster.

2.3 Determining the state sequence

Suppose we wish to know: what is the optimal sequence of hidden states? For example, in the coin toss problem we may wish to know which coin (biased or fair) was used at time t and whether the same coin was used at time t + 1. There are several ways of answering this question; one possible approach is to choose the state at each time t which is most likely given the observations. To do so we use (2.11) from the forward-backward algorithm, where P(q_t = S_i, Y_{1:T} | λ) = α_t(i) β_t(i).
P(q_t = S_i | Y_{1:T}, λ) = α_t(i) β_t(i) / P(Y_{1:T} | λ) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)   (2.22)

where α_t(i) accounts for the observations Y_{1:t} with state S_i at time t, and β_t(i) accounts for the remaining observations Y_{t+1:T} given state S_i at time t. To ensure that P(q_t = S_i | Y_{1:T}, λ) is a proper probability measure we normalise by P(Y_{1:T} | λ).
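The forward recursion (2.14)-(2.16) and backward recursion (2.19)-(2.21) can be sketched in vectorised Python as follows (variable and function names are my own):

```python
import numpy as np

def forward(Y, A, B, pi):
    """Forward pass: alpha[t, i] = P(Y_1..Y_t, q_t = S_i | lambda)."""
    T, N = len(Y), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, Y[0]]                        # initialisation (2.14)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, Y[t + 1]]  # recursion (2.15)
    return alpha

def backward(Y, A, B):
    """Backward pass: beta[t, i] = P(Y_{t+1}..Y_T | q_t = S_i, lambda)."""
    T, N = len(Y), A.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                 # initialisation (2.19)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])  # recursion (2.20)
    return beta
```

Dividing α_t(i) β_t(i) by its sum over i then gives the posterior state probabilities of (2.22) directly.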
Once we know P(q_t = S_i | Y_{1:T}, λ) for all states 1 <= i <= N we can calculate the most likely state at time t by finding the state i which maximises P(q_t = S_i | Y_{1:T}, λ),

q_t* = arg max_{1 <= i <= N} P(q_t = S_i | Y_{1:T}, λ), 1 <= t <= T   (2.23)

While (2.23) allows us to find the most likely state at each time t, it is not a fully satisfactory approach. Its main disadvantage is that it does not take the state transitions into account. It may be the case that the resulting state sequence includes states q_{t-1} = S_i and q_t = S_j when in fact the transition between the two states is impossible (i.e. a_ij = 0). This is because (2.23) gives the most likely state at each time t without regard to the state transitions. A logical solution would be to change the optimality criterion and, instead of seeking the most likely state at each time t, find the most likely pairs of states (q_t, q_{t+1}). However, a more widely used approach is to find the single optimal state sequence Q_{1:T}, i.e. the state sequence path maximising P(Q_{1:T}, Y_{1:T} | λ). We find this using a dynamic programming algorithm known as the Viterbi algorithm, which chooses the state sequence that maximises the likelihood of the state sequence for a given set of observations.

2.3.1 Viterbi Algorithm

Let δ_t(i) be the maximum probability over state sequences of length t that end in state i (i.e. q_t = S_i) and produce the first t observations,

δ_t(i) = max_{q_1, q_2, ..., q_{t-1}} P(q_1, q_2, ..., q_t = i, Y_1, Y_2, ..., Y_t | λ)   (2.24)

The Viterbi algorithm (Viterbi (1967) and Forney (1973)) is similar to the forward algorithm except that here we use maximisation instead of summation at the recursion and termination stages. We store the maximising arguments in an N by T matrix ψ; this matrix is later used to retrieve the optimal state sequence path at the backtracking step.

1. Initialisation:
δ_1(i) = π_i b_i(Y_1), 1 <= i <= N   (2.25)
ψ_1(i) = 0   (2.26)

2. Recursion:
δ_t(j) = max_{1 <= i <= N} [δ_{t-1}(i) a_ij] b_j(Y_t), 2 <= t <= T, 1 <= j <= N   (2.27)
ψ_t(j) = arg max_{1 <= i <= N} [δ_{t-1}(i) a_ij], 2 <= t <= T, 1 <= j <= N   (2.28)

3. Termination:
p* = max_{1 <= i <= N} [δ_T(i)]   (2.29)
q_T* = arg max_{1 <= i <= N} [δ_T(i)]   (2.30)

4. Path (state sequence) backtracking:
q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, ..., 1   (2.31)
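A sketch of (2.25)-(2.31) in Python, working with log probabilities so that the products become sums (any zero transition probability would map to -inf, which the maximisation handles correctly):

```python
import numpy as np

def viterbi(Y, A, B, pi):
    """Most likely state path arg max_Q P(Q, Y | lambda), in log space."""
    T, N = len(Y), A.shape[0]
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))            # delta[t, j]: best log-prob ending in j
    psi = np.zeros((T, N), dtype=int)   # psi[t, j]: best predecessor of j
    delta[0] = logpi + logB[:, Y[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # scores[i, j] = delta + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, Y[t]]
    # backtracking, (2.29)-(2.31)
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path, delta[-1].max()
```

With the illustrative two-coin parameters used earlier, a run of four heads is decoded as four consecutive uses of the biased coin, since switching coins and emitting heads from the fair coin are both less likely.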
The advantage of the Viterbi algorithm is that it does not blindly accept the most likely state at each time t, but instead takes a decision based on the whole sequence. This is useful if there is an unlikely event at some point in the sequence: it will not affect the rest of the sequence provided the remainder is reasonable. This is particularly useful in speech recognition, where a phoneme may be garbled or lost but the overall spoken word is still detectable. One practical problem with the Viterbi algorithm is that multiplying probabilities yields small numbers that can cause underflow errors on a computer. It is therefore recommended that logarithms of the probabilities are taken, so that the multiplications become summations; once the algorithm has terminated, the probability can be recovered by exponentiating the result.

2.4 Parameter Estimation

The third issue which we shall consider is: how do we determine the model parameters λ = (A, B, π)? We wish to select the model parameters so as to maximise the probability of the observation sequence. There is no analytical way to solve this problem, but we can solve it iteratively using the Baum-Welch algorithm (Baum et al. (1970) and Baum (1972)), an Expectation-Maximisation algorithm which finds λ = (A, B, π) such that P(Y_{1:T} | λ) is locally maximised.

2.4.1 Baum-Welch Algorithm

The Baum-Welch algorithm calculates the expected number of times each transition (a_ij) and emission (b_j(Y_t)) is used in a training sequence. To do this it uses the same forward and backward variables as were used to determine the state sequence.
Firstly, we define the probability of being in state S_i at time t and state S_j at time t + 1, given the model parameters and observation sequence, as

P(q_t = S_i, q_{t+1} = S_j | Y_{1:T}, λ) = α_t(i) a_ij b_j(Y_{t+1}) β_{t+1}(j) / P(Y_{1:T} | λ)   (2.32)

= α_t(i) a_ij b_j(Y_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(Y_{t+1}) β_{t+1}(j)   (2.33)

Equation (2.32) is illustrated in Figure 2.2. From the forward-backward algorithm we have already defined P(q_t = S_i | Y_{1:T}, λ) as the probability of being in state S_i at time t given the model parameters and sequence of observations. This quantity relates to (2.32) as follows,

P(q_t = S_i | Y_{1:T}, λ) = Σ_{j=1}^{N} P(q_t = S_i, q_{t+1} = S_j | Y_{1:T}, λ)   (2.34)

If we sum P(q_t = S_i | Y_{1:T}, λ) over t we get the expected number of times that state S_i is visited. Similarly, summing P(q_t = S_i, q_{t+1} = S_j | Y_{1:T}, λ) over t gives the expected number of transitions from state S_i to state S_j. Combining the above, we are now able to re-estimate the model parameters λ = (A, B, π).

1. Initial probabilities:
π̄_i = P(q_1 = S_i | Y_{1:T}, λ) = expected number of times in state S_i at time t = 1   (2.35)
Figure 2.2: Graphical representation of the computation required for the joint event that the system is in state S_i at time t and state S_j at time t + 1, as given in (2.32).

2. Transition probabilities:
ā_ij = Σ_{t=1}^{T-1} P(q_t = S_i, q_{t+1} = S_j | Y_{1:T}, λ) / Σ_{t=1}^{T-1} P(q_t = S_i | Y_{1:T}, λ)   (2.36)
= expected number of transitions from state S_i to state S_j / expected number of transitions from state S_i   (2.37)

3. Emission probabilities:
b̄_j(k) = Σ_{t: Y_t = k} P(q_t = S_j | Y_{1:T}, λ) / Σ_{t=1}^{T} P(q_t = S_j | Y_{1:T}, λ)   (2.38)
= expected number of times in state j observing symbol k / expected number of times in state j   (2.39)

where Σ_{t: Y_t = k} denotes the sum over those t such that Y_t = k.

If we start with model parameters λ = (A, B, π), we can use these to calculate (2.35)-(2.39) and create a re-estimated model λ̄ = (Ā, B̄, π̄). If λ̄ = λ then the initial model λ already defines a critical point of the likelihood function. If, however, λ̄ ≠ λ, then λ̄ is more likely than λ in the sense that P(Y_{1:T} | λ̄) > P(Y_{1:T} | λ); that is, the new model parameters are such that the observation sequence is more likely to have been produced by λ̄. Once we have found an improved λ̄ through re-estimation, we can repeat the procedure iteratively, each time improving the probability of Y_{1:T} being observed, until convergence to a limiting point. This finally results in a (local) maximum likelihood estimate of the HMM.
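A single sweep of the re-estimation formulas (2.35)-(2.39), combined with the forward and backward passes it requires, might be sketched as follows (a compact numpy sketch; it is unscaled, so long sequences would need the logarithm or scaling treatment mentioned for the Viterbi algorithm):

```python
import numpy as np

def baum_welch_step(Y, A, B, pi):
    """One Baum-Welch re-estimation sweep; returns updated (A, B, pi)."""
    Y = np.asarray(Y)
    T, N, M = len(Y), A.shape[0], B.shape[1]
    # Forward and backward passes, equations (2.14)-(2.21).
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, Y[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, Y[t + 1]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    # gamma[t, i] = P(q_t = S_i | Y, lambda), as in (2.22).
    gamma = alpha * beta / likelihood
    # xi[t, i, j] = P(q_t = S_i, q_{t+1} = S_j | Y, lambda), as in (2.32).
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, Y[1:]].T * beta[1:])[:, None, :]) / likelihood
    # Re-estimation, (2.35)-(2.39): ratios of expected counts.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.array([gamma[Y == k].sum(axis=0) for k in range(M)]).T
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new, pi_new
```

Each call performs one EM iteration; repeating the call with the returned parameters implements the iterative improvement described above.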
Chapter 3

HMM Applied to DNA Sequence Analysis

3.1 Introduction to DNA

Deoxyribonucleic acid (DNA) is the genetic material of a cell: a code containing the instructions for the make-up of human beings and other organisms. The DNA code is made up of four chemical bases: adenine (A), cytosine (C), guanine (G) and thymine (T). The sequence of these bases encodes the information necessary to build and maintain an organism, much as the arrangement of letters determines a word. DNA bases pair together as (A-T) and (C-G) to form base pairs, which are attached to a sugar-phosphate backbone (deoxyribose). The combination of a base, sugar and phosphate is called a nucleotide, and the nucleotides are arranged in two long strands that form a twisted spiral famously known as the double helix (Figure 3.1).

Figure 3.1: Strand of DNA in the form of a double helix, with the base pairs attached to a sugar-phosphate backbone (p).

Since the 1960s it has been known that the pattern in which the four bases occur in a DNA sequence is not random. Early research into the composition of DNA relied on indirect methods such as base composition determination or the analysis of nearest-neighbour frequencies. It was only when Elton (1974) noticed that models assuming a homogeneous DNA structure were inappropriate for modelling the compositional heterogeneity of DNA that it was proposed that DNA should be viewed as a sequence of segments, where each segment follows its own distribution of bases. The seminal paper by Churchill (1989) was the first to apply HMMs to DNA sequence analysis, where a heterogeneous strand of DNA was assumed to comprise homogeneous segments. Using the hidden states of the hidden Markov model it was possible to detect the underlying process of the individual segments and to categorise the entire sequence in terms of shorter segments.
3.2 CpG Islands

To illustrate the use of hidden Markov models in DNA sequence analysis we will consider an example given by Durbin et al. (1998).
In the human genome, the dinucleotide CG (a sequence of two bases) occurs wherever a cytosine nucleotide is found next to a guanine nucleotide in the linear sequence of bases along a strand (Figure 3.1). We use the notation CpG (-C-phosphate-G-) to distinguish the dinucleotide CG from the base pair C-G. Typically, wherever the CG dinucleotide occurs the C nucleotide is modified by the process of methylation, in which the cytosine nucleotide is converted into methyl-C before mutating into T, thus creating the dinucleotide TG. The consequence of this is that CpG dinucleotides are rarer in the genome than would otherwise be expected. For biological reasons the methylation process is suppressed in short stretches of the genome, such as around the start regions of genes, and in these regions we see more CpG dinucleotides than elsewhere in the gene sequence. These regions are referred to as CpG islands (Bird, 1987) and are usually anywhere from a few hundred to a few thousand bases long. Using a hidden Markov model we can ask, given a short sequence of DNA, whether it comes from a CpG island, and also how we might find CpG islands in a longer sequence. In terms of our hidden Markov model we regard the genomic sequence as a sequence of bases which are either within a CpG island or are not. This gives us our two hidden states, {CpG island, non-CpG island}, which we wish to uncover by observing the sequence of bases. As all four bases can occur in both CpG island and non-CpG island regions, we first define notation to differentiate between, say, a C in a CpG island region and a C in a non-CpG island region: for A, C, G, T in a CpG island we write {A+, C+, G+, T+}, and for bases not in a CpG island we write {A-, C-, G-, T-}.
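To make the two-state idea concrete, here is a small posterior-decoding sketch under assumed, purely illustrative probabilities. Note one simplification: the CpG model described below works with base transition probabilities within each state, whereas this sketch uses independent per-base emission probabilities.

```python
import numpy as np

# Illustrative two-state CpG model; all probability values are assumptions
# chosen only so that C and G are more frequent inside islands.
STATES = ["CpG island", "non-CpG island"]
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
A = np.array([[0.95, 0.05],
              [0.02, 0.98]])              # islands assumed short and rare
B = np.array([[0.15, 0.35, 0.35, 0.15],   # island state: C/G enriched
              [0.30, 0.20, 0.20, 0.30]])  # background state
pi = np.array([0.1, 0.9])

def posterior_island_prob(seq):
    """P(q_t = island | Y_1:T) via the forward-backward identity (2.22)."""
    Y = [BASES[b] for b in seq]
    T = len(Y)
    alpha = np.zeros((T, 2))
    beta = np.zeros((T, 2))
    alpha[0] = pi * B[:, Y[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, Y[t + 1]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])
    gamma = alpha * beta / alpha[-1].sum()
    return gamma[:, 0]                    # posterior probability of "island"
```

Under these assumed values a GC-rich stretch receives a higher posterior probability of lying in the island state than an AT-rich one.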
Figure 3.2: Possible transitions between bases in CpG island and non-CpG island regions.

Figure 3.2 illustrates the possible transitions between bases, where it is possible to move between all bases in both the CpG island states and the non-CpG island states. The transitions occur according to two sets of probabilities, which specify firstly the state and then the observed chemical base from that state. Once we have established the observations Y_t ∈ {A, C, G, T} and the states S_i ∈ {CpG island, non-CpG island} we are able to construct a directed acyclic graph (DAG), which we shall use to illustrate the dependence structure of the model. The DAG given in Figure 3.3 shows that the observations Y_t depend on the hidden states q_t = S_i, and that the states and observations depend on the probability matrices A and B, respectively. The matrix A = {a_ij} represents the transitions between the two hidden states, P(q_t = S_j | q_{t-1} = S_i) = a_ij, and B denotes the observation probabilities for the two hidden states, B = (p+, p-). As we have seen in the previous chapter, the parameters of A and B can be estimated using the Baum-Welch algorithm. In the CpG island example the observation probabilities p+ and p- are given in Table 3.1 and Table 3.2.

Figure 3.3: DAG of the hidden Markov model, with A representing the state transition probabilities and B representing the observation probabilities for a given state.

Table 3.1: Transition probabilities between the bases A, C, G, T within a CpG island region.

Table 3.2: Transition probabilities between the bases A, C, G, T within a non-CpG island region.

We notice from the observation probabilities that the transitions from G to C and from C to G in the CpG island region are higher than in the non-CpG island region. The difference in observation probabilities between the two regions justifies the use of the hidden Markov model. If the observation probabilities were constant throughout the strand of DNA then the sequence would be homogeneous and we would be able to model it with a single set of probabilities for the transitions between bases. However, we know that the probability of transition between certain bases is greater in specific regions, so the probability of moving from a G to a C is not constant throughout the sequence. Thus we require an extra stochastic layer, which we model through a hidden Markov model.

3.3 Modelling a DNA sequence with a known number of hidden states

Having established the theory of hidden Markov models and how it applies to DNA analysis, we can develop models with which to analyse a DNA sequence. Here we use the paper of Boys et al. (2000) to illustrate, through an example, how we can segment a DNA sequence when the number of hidden states is known and each hidden state corresponds to a segment type of the DNA sequence. In this paper the authors analyse the chimpanzee α-fetoprotein gene; this protein is secreted by embryonic liver epithelial cells and is also produced in the yolk sac of mammals. It plays an important role in the embryonic development of mammals; in particular, unusual levels of the protein in pregnant mothers are associated with genetic disorders such as neural tube defects, spina bifida and Down's syndrome.
One approach which can be used to identify the parts of the DNA sequence where a transition between states occurs is a multiple-changepoint model, where inferences about the base transition matrices operating within each hidden state are made conditional on estimates of the changepoint locations. Carlin et al. (1992) give a solution to the multiple-changepoint problem for Markov chains within a Bayesian framework. However, a drawback of this approach is that it is difficult to specify informative prior knowledge about the locations of the changepoints, so Bayesian analyses of changepoint problems tend to assume relatively uninformative priors for the changepoints. The authors also felt that while changepoint models are appropriate in time series analysis, they are perhaps not appropriate for DNA sequence analysis, as they fail to capture the evolution of the DNA structure. A more flexible approach to modelling both the DNA sequence and the underlying prior information is therefore to use a hidden Markov model. The main advantage is that rather than specifying prior information about the precise location of a changepoint, it is preferable to specify prior information about the length of a segment instead. Previous work, initially by Churchill (1989), used a maximum likelihood approach and an EM algorithm to determine the base transitions for a given hidden state. Here the authors adopt a Bayesian approach which incorporates prior information when identifying the hidden states. Inferences are made by simulating from the posterior distribution using the Markov chain Monte Carlo (MCMC) technique of Gibbs sampling. The advantage of this technique is that it allows prior information to be incorporated and permits the detection of segment types while accounting for posterior parameter uncertainty.

Model

As before, we take our observations Y_t ∈ {A, C, G, T} to be the four chemical bases, and the states to represent the different segment types, q_t ∈ {S_1, S_2, ..., S_r}, t = 1, 2, ..., T. In this case we assume that the number of different segment types, r, is known.
We make the assumption that the transitions between the four bases follow a first-order Markov chain, where P(Y_t | Y_1, Y_2, ..., Y_{t-1}) = P(Y_t | Y_{t-1}); however, as we shall consider later, this is not necessarily a valid assumption. By establishing the same dependence structure as given by the DAG in Figure 3.3, we define the base transition matrices for each segment type as B = {P_1, P_2, ..., P_r}, with P_k = (P^k_{ij}), where the observations follow a multinomial distribution. The base transitions therefore satisfy

P(Y_t = j | q_t = S_k, Y_1, Y_2, ..., Y_{t-1} = i, B) = P(Y_t = j | q_t = S_k, Y_{t-1} = i, B) = P^k_{ij},   (3.1)

where i, j ∈ {A, C, G, T} and k ∈ {1, 2, ..., r}.

The hidden states are modelled using a first-order Markov process with transition matrix A = (a_{kl}), as shown in Figure 3.3. The hidden states at each location are unobserved, and therefore we must treat them as unknown parameters in our model. If we assume Y_1 and q_1 follow independent discrete uniform distributions, then given the observed DNA sequence Y_{1:T} and the unobserved segment types Q_{1:T} we can define the likelihood function for the model parameters A and B as

L(A, B | Y_{1:T}, Q_{1:T}) = P(Y_1, q_1 | A, B) ∏_{t=2}^{T} P(Y_t = j | q_t = S_k, Y_{t-1} = i, B) P(q_t = S_l | q_{t-1} = S_k, A)   (3.2)

                          = (4r)^{-1} ∏_{t=2}^{T} P^k_{ij} a_{kl},   i, j ∈ {A, C, G, T},   k, l ∈ {1, 2, ..., r},   (3.3)

where we define

P(q_t = S_l | q_1, q_2, ..., q_{t-1} = S_k, A) = P(q_t = S_l | q_{t-1} = S_k, A) = a_{kl}.   (3.4)
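To make the likelihood concrete, here is a minimal numerical sketch (not from the paper; the parameter values and the 0-based coding of bases and states are illustrative) of evaluating the complete-data log-likelihood (3.2)-(3.3) for a given state path:

```python
import numpy as np

# Toy sketch: evaluate the complete-data log-likelihood of Eq. (3.2)
# for a known state path Q and base sequence Y.
# Bases are coded 0..3 for A, C, G, T; states are coded 0..r-1.

def log_likelihood(Y, Q, A, B):
    """Y: base sequence, Q: hidden-state path, A: r x r state transitions,
    B: r x 4 x 4 base transition matrices, one per segment type."""
    r = A.shape[0]
    ll = -np.log(4 * r)                        # uniform start: P(Y_1, q_1) = (4r)^-1
    for t in range(1, len(Y)):
        ll += np.log(B[Q[t], Y[t - 1], Y[t]])  # base transition within state q_t
        ll += np.log(A[Q[t - 1], Q[t]])        # hidden-state transition
    return ll

rng = np.random.default_rng(0)
r = 2
A = np.array([[0.99, 0.01], [0.01, 0.99]])
B = rng.dirichlet(np.ones(4), size=(r, 4))     # each row of each P_k sums to 1
Y = rng.integers(0, 4, size=50)
Q = rng.integers(0, r, size=50)
ll = log_likelihood(Y, Q, A, B)
print(ll)
```

The log scale is used purely for numerical stability; the product in (3.3) underflows quickly for realistic sequence lengths.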
Prior distributions

Prior for base transitions

Given the multinomial form of the likelihood, we take the prior distribution to be the Dirichlet distribution, as this is the conjugate prior. Therefore, if we take a row of a base transition matrix to be p_i = (p_{ij}), then the prior for p_i is a Dirichlet distribution,

P(p_i) ∝ ∏_{j=1}^{4} p_{ij}^{α_{ij}},   0 ≤ p_{ij} ≤ 1,   ∑_{j=1}^{4} p_{ij} = 1,   (3.5)

where α = (α_{ij}) are the parameters of the distribution.

Prior for the segment types

As the transition matrix for the hidden states, A, follows in a similar fashion to the base transition matrices, we again use a Dirichlet distribution, of dimension r, for the prior on its rows (3.7), a_k = (a_{kj}). In general the prior belief about the hidden states is well defined, particularly with regard to segment length. In practice it is difficult to identify short segments, and so it is assumed that transitions between hidden states are rare, i.e. E(a_{ii}) is close to 1. The priors are

p^k_i = (p_{ij}) ~ D(c^k_i),   i = 1, 2, 3, 4,   k = 1, 2, ..., r,   (3.6)

a_k = (a_{kj}) ~ D(d_k),   k = 1, 2, ..., r.   (3.7)

Posterior analysis

The posterior distributions for the parameters A and B and for the hidden states at each time t (i.e. q_t = S_i) are found using Gibbs sampling with data augmentation. This involves simulating the hidden states conditional on the parameters and then simulating the parameters conditional on the hidden states, repeating the process until the parameters converge. The posterior distribution for the parameters, P(A, B | Y_{1:T}, Q_{1:T}), follows from the model likelihood (3.2), which when combined with the conjugate Dirichlet priors of the previous section produces independent posterior Dirichlet distributions for the rows of the transition matrices, given by (3.8)-(3.9).
p^k_i | Y_{1:T}, Q_{1:T} ~ D(c^k_i + n^k_i),   i = 1, 2, 3, 4,   k = 1, 2, ..., r,   (3.8)

a_k | Y_{1:T}, Q_{1:T} ~ D(d_k + m_k),   k = 1, 2, ..., r,   (3.9)

where

n^k_i = (n^k_{ij}),   n^k_{ij} = ∑_{t=2}^{T} I(Y_{t-1} = i, Y_t = j, q_t = S_k),   (3.10)

m_k = (m_{kj}),   m_{kj} = ∑_{t=2}^{T} I(q_{t-1} = S_k, q_t = S_j),   (3.11)

and I(A) = 1 if A is true and 0 otherwise.

The second part of the Gibbs sampler involves simulating the hidden states from P(Q_{1:T} | Y_{1:T}, A, B). This can be simulated sequentially using the univariate updates P(q_t | Q_{-t}, Y_{1:T}, A, B), t = 1, 2, ..., T, where Q_{-t} = (q_1, ..., q_{t-1}, q_{t+1}, ..., q_T).
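The parameter-update half of the Gibbs sampler can be sketched as follows: tally the transition counts (3.10)-(3.11) given the current state path, then draw each row of A and of every P_k from its posterior Dirichlet (3.8)-(3.9). The hyperparameter values and variable names below are illustrative, not from the paper.

```python
import numpy as np

# Sketch of the parameter-update step of the Gibbs sampler (Eqs 3.8-3.11).

def sample_parameters(Y, Q, r, c, d, rng):
    """c: 4x4 Dirichlet prior for base-transition rows (shared across k here),
    d: r-vector Dirichlet prior for state-transition rows."""
    n = np.zeros((r, 4, 4))          # n[k, i, j]: i -> j base moves inside state k
    m = np.zeros((r, r))             # m[k, l]:    k -> l hidden-state moves
    for t in range(1, len(Y)):
        n[Q[t], Y[t - 1], Y[t]] += 1
        m[Q[t - 1], Q[t]] += 1
    B = np.array([[rng.dirichlet(c[i] + n[k, i]) for i in range(4)]
                  for k in range(r)])
    A = np.array([rng.dirichlet(d + m[k]) for k in range(r)])
    return A, B

rng = np.random.default_rng(1)
Y = rng.integers(0, 4, size=200)
Q = rng.integers(0, 2, size=200)
A, B = sample_parameters(Y, Q, r=2, c=np.ones((4, 4)), d=np.ones(2), rng=rng)
print(A.sum(axis=1))                 # each sampled row of A sums to 1
```

Alternating this step with a draw of Q_{1:T} gives one sweep of the data-augmentation scheme described above.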
Results

In the α-fetoprotein example the authors compared whether the DNA sequence should be segmented into two or three hidden states. Firstly, they consider the case of two hidden states, where the base transitions must follow one of two transition matrices, P_1 or P_2, which govern the transitions between the four bases (A, C, G, T). With the number of hidden states selected a priori, the segment lengths are also pre-specified: setting E(a_ii) = 0.99 with SD(a_ii) = 0.01 gives a change between segments approximately every 100 bases. The posterior results for the parameters A and B = (P_1, P_2) are given in Figure 3.4, where the mean length of segment type 1 is around 500 bases and that of segment type 2 is around 70 bases. The main difference between the two transition probability matrices can be seen in the transitions to bases A and C: in P_1 there are more transitions to A, while in P_2 there are more transitions to C. The larger variability of the "from A" and "from G" rows in P_2 is due to segment type 2 being rich in C and T with few As and Gs.

Figure 3.4: Boys et al. (2000), posterior summary of transition matrices with two hidden states, E(a_ii) = 0.99 and SD(a_ii) = 0.01

The authors then compared the posterior analysis with the results obtained when the number of hidden states is set to three. Figure 3.5 shows the approximate probabilities of being in each of the three states through the DNA sequence. The figure indicates that it is reasonable to assume that the sequence consists of three segments and not two, as was first assumed. It is possible to increase the number of segments up to the point where the posterior standard deviations of the base transition matrices are sufficiently small. In practice, however, the exact number of segments can be determined using information criteria.
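The prior specification E(a_ii) = 0.99, SD(a_ii) = 0.01 can be checked with a short calculation. Assuming a two-state row of A, so that a_ii is marginally Beta-distributed, the implied Dirichlet counts and the prior mean segment length are:

```python
# Back-of-envelope sketch: turn E(a_ii) = 0.99, SD(a_ii) = 0.01 into
# Beta/Dirichlet hyperparameters for a two-state row of A, and check the
# implied geometric mean segment length 1/(1 - E(a_ii)), roughly 100 bases.

mean, sd = 0.99, 0.01

# For a_ii ~ Beta(d1, d2): E = d1/s and Var = E(1-E)/(s+1), with s = d1 + d2.
s = mean * (1 - mean) / sd**2 - 1      # solve the variance equation for s
d1, d2 = mean * s, (1 - mean) * s
print(d1, d2)                           # prior "counts" for (stay, leave)

expected_length = 1 / (1 - mean)        # geometric duration at the prior mean
print(expected_length)                  # mean segment length of about 100 bases
```

The same moment-matching idea extends to rows of a larger A by treating a_ii against the pooled off-diagonal mass.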
In conclusion, segmenting the DNA sequence within a Bayesian framework can be advantageous if sufficient prior information, such as the length and number of segments, is available. However, in practice this is not usually the case, and so we shall expand upon this approach and show how it is possible to segment the DNA sequence when the number of hidden states is unknown by using a reversible jump MCMC approach. 3.4 DNA sequence analysis with an unknown number of hidden states In the last example we considered the case where the number of hidden states was known; this is frequently not the case and the number of hidden states must be determined. Here we will consider how the number of hidden states can be inferred by utilising reversible jump MCMC
Figure 3.5: Boys et al. (2000), posterior probability of being in each of the three states at time t: (a) P(q_t = S_1 | Y_{1:T}, A, B), (b) P(q_t = S_2 | Y_{1:T}, A, B) and (c) P(q_t = S_3 | Y_{1:T}, A, B)

algorithms. The paper Boys and Henderson (2001) uses the reversible jump MCMC approach for DNA sequence segmentation. We shall discuss this paper and the techniques used when the number of hidden states is unknown. We shall also include in this section the paper Boys and Henderson (2004), in which the authors expand on the idea that not only is the number of hidden states unknown but so too is the order of the Markov dependence, which until now has been assumed to be first order.

Model

We use similar notation to the last example, taking our observations Y_t ∈ Y = {A, C, G, T} to be the four bases (Adenine, Cytosine, Guanine and Thymine); to simplify notation we denote the state space as Y = {1, 2, ..., b} (for applications to DNA, b = 4, the size of the alphabet). We denote our hidden states as q_t = S_k, t ∈ {1, 2, ..., T}, k ∈ S_r = {1, 2, ..., r}, representing the different segment types, and δ represents the order of the Markov chain conditional on the hidden states. When δ = 0 we have the usual independence assumption, but for δ > 0 we can include the short-range dependence structure found in DNA (Churchill, 1992). The HMM can be considered in terms of the observation equation (3.12) and the state equation (3.13):

P(Y_t | Y_{1:t-1}, Q_{1:t}) = P(Y_t = j | Y_{t-δ}, ..., Y_{t-1}, q_t = S_k) = p^k_{ij},   (3.12)

for i ∈ Y^δ = {1, 2, ..., b^δ}, j ∈ Y, k ∈ {1, 2, ..., r}, where the context index is

i = I(Y_{1:T}, t, δ, b) = 1 + ∑_{l=1}^{δ} (Y_{t-l} - 1) b^{l-1},

and

P(q_t = S_l | q_{t-1} = S_k) = a_{kl},   k, l ∈ S_r = {1, 2, ..., r},   (3.13)

where A = (a_{kl}) is the matrix of hidden-state transition probabilities and B = {P_1, ..., P_r} denotes the collection of base transition matrices for the r hidden states, with P_k = (p^k_{ij}). Finally, r ∈ R = {1, 2, ..., r_max} and δ ∈ Q = {0, 1, 2, ..., δ_max}.
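The context index I(Y_{1:T}, t, δ, b) simply packs the δ preceding bases into a single row index, so an order-δ chain can be stored as a b^δ × b transition matrix. A small sketch (toy sequence; 1-based base codes as in the text):

```python
# Sketch of the context-index map i = I(Y, t, delta, b) of Eq. (3.12):
# it encodes the delta previous bases as one index in 1..b^delta.

def context_index(Y, t, delta, b=4):
    # Y uses 1-based base codes (1..b); t indexes the current position.
    return 1 + sum((Y[t - l] - 1) * b ** (l - 1) for l in range(1, delta + 1))

Y = [2, 3, 1, 4, 2]                      # e.g. codes for C, G, A, T, C
print(context_index(Y, t=4, delta=0))    # empty context: always index 1 (iid case)
print(context_index(Y, t=4, delta=1))    # depends only on the previous base
print(context_index(Y, t=4, delta=2))    # packs the two previous bases together
```

For δ = 2 and b = 4 the index ranges over 1..16, matching the b^δ rows of each P_k.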
While we treat r and δ as unknown, it is necessary for the reversible jump algorithm to restrict the number of states and the order of dependence to at most r_max and δ_max, respectively. In this example we consider the case where the number of hidden states r is unknown. The DAG in Figure 3.6 denotes the unknown quantities with circles and the known quantities with squares; thus in this case we label the unknown number of states r with a circle.

Figure 3.6: DAG of the hidden Markov model with r hidden states

It is computationally convenient to model the hidden states Q_{1:T} as missing data and work with the complete-data likelihood P(Y_{1:T}, Q_{1:T} | r, δ, A, B), where for a given r the complete-data likelihood is simply the product of the observation and state equations:

P(Y_{1:T}, Q_{1:T} | r, δ, A, B) = P(Y_{1:T} | r, δ, Q_{1:T}, A, B) P(Q_{1:T} | r, δ, A, B)   (3.14)

= ∏_{k ∈ S_r} ∏_{i ∈ Y^δ} ∏_{j ∈ Y} (p^k_{ij})^{n^k_{ij}} ∏_{i ∈ S_r} ∏_{j ∈ S_r} a_{ij}^{m_{ij}},   (3.15)

where

n^k_{ij} = ∑_{t=δ_max+1}^{T} I(I(Y_{1:T}, t, δ, b) = i, Y_t = j, q_t = S_k),

m_{ij} = ∑_{t=δ_max+1}^{T} I(q_{t-1} = S_i, q_t = S_j),

and I(·) denotes the indicator function, which equals 1 if its argument is true and 0 otherwise.

Prior distributions

The advantage of using a Bayesian analysis is that it is possible to include a priori uncertainty about the unknown parameters. The aim of this analysis is to make inferences about the unknown number of segments r, the order of dependence δ, the model transition parameters A and B, and also the sequence of hidden states Q_{1:T}. It is possible to quantify the uncertainty about these parameters through the prior distribution

P(r, δ, A, B) = P(r) P(δ) P(A, B | r, δ) = P(r) P(δ) P(A | r) P(B | r, δ).   (3.16)

In reversible jump applications we restrict the number of hidden states r and the order of dependence δ to be at most r_max and δ_max, respectively. For r and δ we use independent truncated Poisson prior distributions, r ~ Po(α_r) truncated to r ∈ {1, 2, ..., r_max} and δ ~ Po(α_δ) truncated to δ ∈ {0, 1, ..., δ_max}, with fixed hyperparameters α_r > 0 and α_δ > 0. As in the last example, we take independent Dirichlet distributions for the priors on the rows of A and B, where a_k = (a_{kl}) and p^k_i = (p^k_{ij}) represent the rows of the matrices A and B, respectively:
p^k_i = (p^k_{ij}) | r, δ ~ D(c^k_i),   i ∈ Y^δ, k ∈ S_r,   (3.17)

a_k = (a_{kl}) | r ~ D(d_k),   k ∈ S_r,   (3.18)

where the Dirichlet parameters c and d are chosen to reflect the goals of the analysis.

Posterior analysis

In a Bayesian analysis we combine the information about the model parameters from the data with the prior distribution to obtain the posterior distribution (3.19), which quantifies the uncertainty about the unknown parameters after observing the data:

P(r, δ, Q_{1:T}, A, B | Y_{1:T}) ∝ P(Y_{1:T}, Q_{1:T} | r, δ, A, B) P(r, δ, A, B).   (3.19)

In the last example it was possible to determine the posterior distribution using a straightforward MCMC scheme with Gibbs sampling. However, in this case the posterior is more complicated, as we have now taken the number of hidden states r and the order of Markov dependence δ to be unknown quantities. The extra complexity means that the MCMC algorithm must now allow the sampler to jump between parameter spaces of different dimensions, corresponding to models with different values of r and δ. This can be achieved using reversible jump techniques (Green, 1995), which are a generalisation of the Metropolis-Hastings algorithm. The term "reversible jump" comes from the fact that the parameter space is explored by a number of move types which all attain detailed balance, some of which allow jumps between subspaces of different dimensions. The two most popular categories of reversible jump moves are split/merge and birth/death moves. The basic idea behind a split/merge move is that a hidden state is either split in two or combined with another hidden state according to some probability, whereas birth/death moves, which we focus on here, create or delete a hidden state according to some probability.

MCMC scheme

At each iteration of the MCMC algorithm the following steps are performed:

1. Update the order of dependence and the transition probability matrices from P(δ, A, B | r, Y_{1:T}, Q_{1:T}).
2. Update the number of hidden states r, together with A, B and Q_{1:T}, conditional on δ.
3. Update the sequence of hidden states Q_{1:T} from P(Q_{1:T} | r, δ, Y_{1:T}, A, B).

Step 3 of the MCMC procedure is simply an implementation of the forward-backward algorithm. In step 1 we update the order of Markov dependence, P(δ | r, Y_{1:T}, Q_{1:T}), and the transition probability parameters, P(A, B | r, δ, Y_{1:T}, Q_{1:T}), in the same step. Choosing a conjugate Dirichlet prior distribution for B allows δ to be updated without the need for a reversible jump move; instead it is updated from a conditional distribution of the form

P(δ | r, Y_{1:T}, Q_{1:T}) ∝ P(δ | r, Q_{1:T}) P(Y_{1:T} | r, δ, Q_{1:T}) = P(δ) P(Y_{1:T} | r, δ, Q_{1:T}),   (3.20)

where P(δ | r, Q_{1:T}) simplifies to P(δ) because δ was defined to be independent of (r, Q_{1:T}) a priori. In step 2 the number of hidden states r is updated using a birth/death reversible jump move. The birth/death move is computationally simpler than the split/merge move, and the authors found that birth/death moves produce the best-mixing chain.

Birth and death moves
The move begins with a random choice between creating and deleting a hidden state, with probabilities b_r and d_r, respectively. In the birth move a new hidden state j is proposed, increasing the number of hidden states from r to r + 1. A set of base transition probabilities u for the new state is generated from the prior distribution (3.17), with P̃_j = u and P̃_{j'} = P_{j'} for j' ≠ j. A row vector v for the state transitions is then simulated from the prior distribution (3.18) and the corresponding row of the proposed transition matrix Ã is set to ã_j = v. Column j is filled by taking ã_{ij} = w_i for i ≠ j, where w_i ~ Beta(d_{ij}, ∑_{j' ≠ j} d_{ij'}), with the remaining entries of row i rescaled by (1 − w_i) so that the row still sums to one. Finally, a new hidden state sequence Q̃_{1:T} is simulated conditional on Ã, B̃ and r + 1 using the forward-backward algorithm. The move is then accepted with probability min(1, A_B), where

A_B = [P(Y_{1:T} | r + 1, δ, Ã, B̃) / P(Y_{1:T} | r, δ, A, B)] × [P(r + 1) / P(r)]
      × [∏_{i ∈ S_{r+1}} D(ã_i | d̃_i) / ∏_{i ∈ S_r} D(a_i | d_i)] × [∏_{k ∈ S_{r+1}} ∏_{i ∈ Y^δ} D(p̃^k_i | c^k_i) / ∏_{k ∈ S_r} ∏_{i ∈ Y^δ} D(p^k_i | c^k_i)]
      × d_{r+1} / [b_r (r + 1) D(ṽ | d̃_j) ∏_{i ∈ Y^δ} D(u_i | c^j_i) ∏_{i ∈ S_{r+1} \ j} Be(w_i | d_{ij}, ∑_{j' ≠ j} d_{ij'})]
      × ∏_{i ∈ S_{r+1} \ j} (1 − w_i)^{r−1}.   (3.21)

The first two lines of this expression contain the likelihood ratio and the prior ratio, with the remaining lines consisting of the proposal ratio and the Jacobian resulting from the transformations (B, u) → B̃ and (A, v, w) → Ã. We note that the expression does not depend on Q_{1:T} and Q̃_{1:T} because the likelihood ratio simplifies as

P(Y_{1:T} | r + 1, δ, Ã, B̃) / P(Y_{1:T} | r, δ, A, B)
    = [P(Y_{1:T}, Q̃_{1:T} | r + 1, δ, Ã, B̃) / P(Y_{1:T}, Q_{1:T} | r, δ, A, B)] × [P(Q_{1:T} | Y_{1:T}, r, δ, A, B) / P(Q̃_{1:T} | Y_{1:T}, r + 1, δ, Ã, B̃)].   (3.22)

The death move follows in a similar fashion to the birth move: a randomly chosen hidden state j is proposed for deletion, after which the remaining parameters are adjusted. Firstly, P_j is deleted, with the remaining base transition probabilities set as P̃_{j'} = P_{j'} for j' ≠ j, and row and column j of Ã are also deleted.
The death of a hidden state is accepted with probability min(1, A_B^{-1}), and thus the birth and death moves form a reversible pair.

Bacteriophage lambda genome

In Boys and Henderson (2004) the authors analyse the genome of bacteriophage lambda, a parasite of the intestinal bacterium Escherichia coli which is often considered a benchmark example for comparing DNA segmentation techniques. Previous analyses of this genome's structure, such as those of Churchill (1989), had found that the number of hidden states is r = 6 and the Markov dependence is δ = 1. However, treating the order of Markov dependence and the number of hidden states as parameters suggests that there are r = 6 hidden states (with a 95% highest density interval of (6, 7, 8)) with Markov dependence of order δ = 2. This is supported by the fact that the bacteriophage lambda genome is predominantly comprised of codons, the nucleotide triplets (Y_{t-2}, Y_{t-1}, Y_t) that make up the coding regions of DNA. However, it has been conjectured by Lawrence and Auger that some of the hidden states are reverse complements of each other, which is an area the authors are exploring further.
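Both sampling schemes in this chapter draw the hidden state sequence via the forward-backward algorithm (step 3 of the MCMC scheme above). A minimal sketch for the first-order (δ = 1) model, with illustrative parameter values and variable names not taken from the papers:

```python
import numpy as np

# Sketch of sampling Q ~ P(Q | Y, A, B) by forward filtering and backward
# sampling. The emission at position t is the base transition p^k_{Y[t-1],Y[t]}
# of the current state, as in Eq. (3.1); bases and states are coded from 0.

def sample_states(Y, A, B, rng):
    T, r = len(Y), A.shape[0]
    alpha = np.zeros((T, r))
    alpha[0] = 1.0 / r                         # uniform q_1; P(Y_1) is constant
    for t in range(1, T):
        e = B[:, Y[t - 1], Y[t]]               # emission probability per state
        alpha[t] = e * (alpha[t - 1] @ A)      # filter forward
        alpha[t] /= alpha[t].sum()             # normalise for stability
    Q = np.empty(T, dtype=int)
    Q[-1] = rng.choice(r, p=alpha[-1])
    for t in range(T - 2, -1, -1):             # sample backwards
        w = alpha[t] * A[:, Q[t + 1]]
        Q[t] = rng.choice(r, p=w / w.sum())
    return Q

rng = np.random.default_rng(2)
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = rng.dirichlet(np.ones(4), size=(2, 4))
Y = rng.integers(0, 4, size=100)
Q_draw = sample_states(Y, A, B, rng)
print(Q_draw[:10])
```

Each call returns one exact draw from the conditional posterior of the state path, which is what both the Gibbs sampler and the reversible jump scheme require.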
Chapter 4

Evaluation

4.1 Conclusions

The use of hidden Markov models for DNA sequence analysis has been well explored over the past two decades, and for even longer in other fields of research. While in this report we only considered the applications of HMMs to DNA, much work has also been done to apply these techniques to RNA and protein sequences. Perhaps the best known example of these techniques being used in practice is in ab initio gene finding, where the DNA sequence is scanned for signs of protein-coding genes.

One of the major drawbacks of the approaches given in this report is that most of the work assumes a first-order Markov dependence for the hidden states, which means that the duration time (i.e. the time spent in a state) follows a geometric distribution. In practice, the duration times for the hidden states of a DNA sequence do not follow a geometric distribution, and so the constraint imposed by the first-order Markov assumption will undoubtedly lead to unreliable results. One solution to this problem, which has been implemented in the GENSCAN algorithm, is the use of hidden semi-Markov models, which follow in a similar fashion to hidden Markov models except that the hidden states are semi-Markov rather than Markov. The advantage of this is that the duration times are no longer geometric; instead the probability of transitioning to a new state depends on the length of time spent in the current state, so states need no longer have identically distributed duration times.

In terms of DNA sequence analysis, HMMs are not the only statistical approach available for segmenting a sequence. Much work has been done with multiple-changepoint segmentation models which, instead of using a hidden layer to detect a change in the base transitions, observe the sequence of bases directly and identify points in the sequence where the distribution of bases changes.
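The duration-time point above can be illustrated numerically: a first-order hidden state with self-transition probability a_ii has geometric dwell times, whereas a semi-Markov state may draw its duration from any distribution. The sketch below contrasts the two; the Poisson duration is purely illustrative and is not GENSCAN's choice.

```python
import numpy as np

# Geometric dwell times implied by a first-order HMM versus an explicit
# (here Poisson, shifted to be at least 1) semi-Markov duration distribution.

rng = np.random.default_rng(4)

a_ii = 0.99
markov_durations = rng.geometric(1 - a_ii, size=100_000)    # implied by the HMM
semi_markov_durations = 1 + rng.poisson(99, size=100_000)   # an explicit choice

print(markov_durations.mean(), semi_markov_durations.mean())  # both means near 100
print(markov_durations.std(), semi_markov_durations.std())    # very different spread
```

Matching the means while freeing the shape of the duration distribution is exactly the flexibility a hidden semi-Markov model buys.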
Compared to HMMs, multiple-changepoint models are computationally more efficient, as the posterior sample can be obtained without the use of MCMC techniques.

The development of HMMs over the past four decades has allowed them to be used in various fields, with many successful applications. In biology particularly, HMMs have been a great success in combining biology and statistics, with both fields reaping the benefits of new areas of research. The theory behind HMMs has expanded to allow for greater flexibility in the models available, including models with higher-order Markov dependency and models which do not require the number of hidden states to be pre-specified. There is still potential for further work with HMMs in terms of improved parameter estimation, unknown Markov dependency and state identifiability, to name a few. There will certainly be further applications to which HMMs are applied, and with those new applications new challenges will surely develop and improve upon the theory which has already been established.
Bibliography

Baum, L. (1972). An equality and associated maximisation technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1-8.
Baum, L. and Egon, J. A. (1967). An equality with applications to statistical estimation for probabilistic functions of a Markov process and to a model of ecology. Bulletin of the American Mathematical Society, 73(3).
Baum, L. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6).
Baum, L., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1).
Bird, A. (1987). CpG islands as gene markers in the vertebrate nucleus. Trends in Genetics, 3.
Boys, R. and Henderson, D. (2001). A comparison of reversible jump MCMC algorithms for DNA sequence segmentation using hidden Markov models. Comp. Sci. and Statist., 33.
Boys, R. and Henderson, D. (2004). A Bayesian approach to DNA sequence segmentation. Biometrics, 60(3).
Boys, R., Henderson, D., and Wilkinson, D. (2000). Detecting homogeneous segments in DNA sequences by using hidden Markov models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 49(2).
Braun, J. V. and Muller, H.-G. (1998). Statistical methods for DNA sequence segmentation. Statistical Science, 13(2).
Cappe, O. (2001). Ten years of HMM. hmmbib.html.
Carlin, B., Gelfand, A., and Smith, A. (1992). Hierarchical Bayesian analysis of changepoint problems. Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2).
Churchill, G. (1989). Stochastic models for heterogeneous DNA sequences. Bulletin of Mathematical Biology, 51.
Churchill, G. (1992). Hidden Markov chains and the analysis of genome structure. Computers and Chemistry, 16(2).
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge.
Elton, R. A. (1974). Theoretical models for heterogeneity of base composition in DNA. Journal of Theoretical Biology, 45(2).
Forney, G. J. (1973). The Viterbi algorithm. Proceedings of the IEEE, 61(3).
Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4).
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77.
Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2).
Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations
Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations SCENARIO You have responded, as a result of a call from the police to the Coroner s Office, to the scene of the death of
Introduction to Algorithmic Trading Strategies Lecture 2
Introduction to Algorithmic Trading Strategies Lecture 2 Hidden Markov Trading Model Haksun Li [email protected] www.numericalmethod.com Outline Carry trade Momentum Valuation CAPM Markov chain
1 Solving LPs: The Simplex Algorithm of George Dantzig
Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.
Statistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
A simple analysis of the TV game WHO WANTS TO BE A MILLIONAIRE? R
A simple analysis of the TV game WHO WANTS TO BE A MILLIONAIRE? R Federico Perea Justo Puerto MaMaEuSch Management Mathematics for European Schools 94342 - CP - 1-2001 - DE - COMENIUS - C21 University
Language Modeling. Chapter 1. 1.1 Introduction
Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set
Basic Probability Concepts
page 1 Chapter 1 Basic Probability Concepts 1.1 Sample and Event Spaces 1.1.1 Sample Space A probabilistic (or statistical) experiment has the following characteristics: (a) the set of all possible outcomes
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
Tutorial on Markov Chain Monte Carlo
Tutorial on Markov Chain Monte Carlo Kenneth M. Hanson Los Alamos National Laboratory Presented at the 29 th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Technology,
The Basics of Graphical Models
The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures
Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: [email protected] Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
The Trip Scheduling Problem
The Trip Scheduling Problem Claudia Archetti Department of Quantitative Methods, University of Brescia Contrada Santa Chiara 50, 25122 Brescia, Italy Martin Savelsbergh School of Industrial and Systems
An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics
Slide 1 An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics Dr. Christian Asseburg Centre for Health Economics Part 1 Slide 2 Talk overview Foundations of Bayesian statistics
Basics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu [email protected] Modern machine learning is rooted in statistics. You will find many familiar
4. Continuous Random Variables, the Pareto and Normal Distributions
4. Continuous Random Variables, the Pareto and Normal Distributions A continuous random variable X can take any value in a given range (e.g. height, weight, age). The distribution of a continuous random
ECON20310 LECTURE SYNOPSIS REAL BUSINESS CYCLE
ECON20310 LECTURE SYNOPSIS REAL BUSINESS CYCLE YUAN TIAN This synopsis is designed merely for keep a record of the materials covered in lectures. Please refer to your own lecture notes for all proofs.
MATCH Commun. Math. Comput. Chem. 61 (2009) 781-788
MATCH Communications in Mathematical and in Computer Chemistry MATCH Commun. Math. Comput. Chem. 61 (2009) 781-788 ISSN 0340-6253 Three distances for rapid similarity analysis of DNA sequences Wei Chen,
Matrix Differentiation
1 Introduction Matrix Differentiation ( and some other stuff ) Randal J. Barnes Department of Civil Engineering, University of Minnesota Minneapolis, Minnesota, USA Throughout this presentation I have
Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in
DNA, RNA, Protein Synthesis Keystone 1. During the process shown above, the two strands of one DNA molecule are unwound. Then, DNA polymerases add complementary nucleotides to each strand which results
Monte Carlo-based statistical methods (MASM11/FMS091)
Monte Carlo-based statistical methods (MASM11/FMS091) Jimmy Olsson Centre for Mathematical Sciences Lund University, Sweden Lecture 5 Sequential Monte Carlo methods I February 5, 2013 J. Olsson Monte Carlo-based
1 The Brownian bridge construction
The Brownian bridge construction The Brownian bridge construction is a way to build a Brownian motion path by successively adding finer scale detail. This construction leads to a relatively easy proof
1 Prior Probability and Posterior Probability
Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which
Comparison of frequentist and Bayesian inference. Class 20, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom
Comparison of frequentist and Bayesian inference. Class 20, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom 1 Learning Goals 1. Be able to explain the difference between the p-value and a posterior
Inference on Phase-type Models via MCMC
Inference on Phase-type Models via MCMC with application to networks of repairable redundant systems Louis JM Aslett and Simon P Wilson Trinity College Dublin 28 th June 202 Toy Example : Redundant Repairable
ECE 842 Report Implementation of Elliptic Curve Cryptography
ECE 842 Report Implementation of Elliptic Curve Cryptography Wei-Yang Lin December 15, 2004 Abstract The aim of this report is to illustrate the issues in implementing a practical elliptic curve cryptographic
Fairfield Public Schools
Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity
Lab # 12: DNA and RNA
115 116 Concepts to be explored: Structure of DNA Nucleotides Amino Acids Proteins Genetic Code Mutation RNA Transcription to RNA Translation to a Protein Figure 12. 1: DNA double helix Introduction Long
CURVE FITTING LEAST SQUARES APPROXIMATION
CURVE FITTING LEAST SQUARES APPROXIMATION Data analysis and curve fitting: Imagine that we are studying a physical system involving two quantities: x and y Also suppose that we expect a linear relationship
PRACTICE TEST QUESTIONS
PART A: MULTIPLE CHOICE QUESTIONS PRACTICE TEST QUESTIONS DNA & PROTEIN SYNTHESIS B 1. One of the functions of DNA is to A. secrete vacuoles. B. make copies of itself. C. join amino acids to each other.
Statistical Machine Translation: IBM Models 1 and 2
Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation
An Environment Model for N onstationary Reinforcement Learning
An Environment Model for N onstationary Reinforcement Learning Samuel P. M. Choi Dit-Yan Yeung Nevin L. Zhang pmchoi~cs.ust.hk dyyeung~cs.ust.hk lzhang~cs.ust.hk Department of Computer Science, Hong Kong
Chapter 11 Monte Carlo Simulation
Chapter 11 Monte Carlo Simulation 11.1 Introduction The basic idea of simulation is to build an experimental device, or simulator, that will act like (simulate) the system of interest in certain important
1.2 Solving a System of Linear Equations
1.. SOLVING A SYSTEM OF LINEAR EQUATIONS 1. Solving a System of Linear Equations 1..1 Simple Systems - Basic De nitions As noticed above, the general form of a linear system of m equations in n variables
HUMAN PROTEINS FROM GENETIC ENGINEERING OF ORGANISMS
HUMAN PROTEINS FROM GM BACTERIA Injecting insulin is an everyday event for many people with diabetes. GENETIC ENGINEERING OF ORGANISMS involves transferring genes from one species into another. Genetic
Gene Finding and HMMs
6.096 Algorithms for Computational Biology Lecture 7 Gene Finding and HMMs Lecture 1 Lecture 2 Lecture 3 Lecture 4 Lecture 5 Lecture 6 Lecture 7 - Introduction - Hashing and BLAST - Combinatorial Motif
Using simulation to calculate the NPV of a project
Using simulation to calculate the NPV of a project Marius Holtan Onward Inc. 5/31/2002 Monte Carlo simulation is fast becoming the technology of choice for evaluating and analyzing assets, be it pure financial
DERIVATIVES AS MATRICES; CHAIN RULE
DERIVATIVES AS MATRICES; CHAIN RULE 1. Derivatives of Real-valued Functions Let s first consider functions f : R 2 R. Recall that if the partial derivatives of f exist at the point (x 0, y 0 ), then we
PS 271B: Quantitative Methods II. Lecture Notes
PS 271B: Quantitative Methods II Lecture Notes Langche Zeng [email protected] The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.
BayesX - Software for Bayesian Inference in Structured Additive Regression
BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
Comp 14112 Fundamentals of Artificial Intelligence Lecture notes, 2015-16 Speech recognition
Comp 14112 Fundamentals of Artificial Intelligence Lecture notes, 2015-16 Speech recognition Tim Morris School of Computer Science, University of Manchester 1 Introduction to speech recognition 1.1 The
Numerical Methods for Option Pricing
Chapter 9 Numerical Methods for Option Pricing Equation (8.26) provides a way to evaluate option prices. For some simple options, such as the European call and put options, one can integrate (8.26) directly
Name Class Date. Figure 13 1. 2. Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.
13 Multiple Choice RNA and Protein Synthesis Chapter Test A Write the letter that best answers the question or completes the statement on the line provided. 1. Which of the following are found in both
Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary
Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:
4.5 Linear Dependence and Linear Independence
4.5 Linear Dependence and Linear Independence 267 32. {v 1, v 2 }, where v 1, v 2 are collinear vectors in R 3. 33. Prove that if S and S are subsets of a vector space V such that S is a subset of S, then
A disaccharide is formed when a dehydration reaction joins two monosaccharides. This covalent bond is called a glycosidic linkage.
CH 5 Structure & Function of Large Molecules: Macromolecules Molecules of Life All living things are made up of four classes of large biological molecules: carbohydrates, lipids, proteins, and nucleic
Replication Study Guide
Replication Study Guide This study guide is a written version of the material you have seen presented in the replication unit. Self-reproduction is a function of life that human-engineered systems have
FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL
FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL STATIsTICs 4 IV. RANDOm VECTORs 1. JOINTLY DIsTRIBUTED RANDOm VARIABLEs If are two rom variables defined on the same sample space we define the joint
SNP Essentials The same SNP story
HOW SNPS HELP RESEARCHERS FIND THE GENETIC CAUSES OF DISEASE SNP Essentials One of the findings of the Human Genome Project is that the DNA of any two people, all 3.1 billion molecules of it, is more than
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data
Metric Spaces. Chapter 7. 7.1. Metrics
Chapter 7 Metric Spaces A metric space is a set X that has a notion of the distance d(x, y) between every pair of points x, y X. The purpose of this chapter is to introduce metric spaces and give some
Binomial lattice model for stock prices
Copyright c 2007 by Karl Sigman Binomial lattice model for stock prices Here we model the price of a stock in discrete time by a Markov chain of the recursive form S n+ S n Y n+, n 0, where the {Y i }
