Hidden Markov Models in Bioinformatics. By Máthé Zoltán and Kőrösi Zoltán, 2006




Outline
- Markov Chain
- HMM (Hidden Markov Model)
- Hidden Markov Models in Bioinformatics
- Gene Finding
- Gene Finding Model
- Viterbi Algorithm
- HMM Advantages
- HMM Disadvantages
- Conclusions

Markov Chain
Definition: A Markov chain is a triplet (Q, {p(x_1 = s)}, A), where:
- Q is a finite set of states. Each state corresponds to a symbol in the alphabet.
- p gives the initial state probabilities.
- A gives the state transition probabilities, denoted by a_st for each s, t in Q. For each s, t in Q the transition probability is: a_st = P(x_i = t | x_{i-1} = s).
Output: The output of the model is the sequence of states at each instant of time => the states are observable.
Property: The probability of each symbol x_i depends only on the value of the preceding symbol x_{i-1}: P(x_i | x_{i-1}, ..., x_1) = P(x_i | x_{i-1}).
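To make the definition concrete, here is a minimal sketch that is not from the original slides: a first-order Markov chain over the DNA alphabet, with invented transition values, used to score a sequence via the Markov property.

# Hypothetical transition probabilities a_st = P(x_i = t | x_{i-1} = s)
# over the DNA alphabet; the numbers are made up for illustration.
transition = {
    'A': {'A': 0.4, 'C': 0.2, 'G': 0.2, 'T': 0.2},
    'C': {'A': 0.2, 'C': 0.4, 'G': 0.2, 'T': 0.2},
    'G': {'A': 0.2, 'C': 0.2, 'G': 0.4, 'T': 0.2},
    'T': {'A': 0.2, 'C': 0.2, 'G': 0.2, 'T': 0.4},
}
initial = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}

def sequence_probability(seq):
    """P(x_1) times the product of P(x_i | x_{i-1}): the Markov property."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p

print(sequence_probability("ACGT"))  # 0.25 * 0.2 * 0.2 * 0.2 = 0.002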

Example of a Markov Model

HMM (Hidden Markov Model)
Definition: An HMM is a 5-tuple (Q, V, p, A, E), where:
- Q is a finite set of states, |Q| = N.
- V is a finite set of observation symbols per state, |V| = M.
- p gives the initial state probabilities.
- A gives the state transition probabilities, denoted by a_st for each s, t in Q. For each s, t in Q the transition probability is: a_st = P(x_i = t | x_{i-1} = s).
- E is a probability emission matrix, e_s(k) = P(v_k at time t | q_t = s).
Output: Only the emitted symbols are observable by the system, not the underlying random walk between states -> hidden.
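To make the hidden/observable distinction concrete, here is a small sketch that is not from the slides: a hypothetical two-state coin HMM, sampled forward. All names and probabilities are invented for illustration.

import random

# A hypothetical two-state HMM: a 'Fair' and a 'Biased' coin,
# where only the flips (H/T) are observable.
coin_states = ('Fair', 'Biased')            # the state set Q
trans = {'Fair':   {'Fair': 0.9, 'Biased': 0.1},
         'Biased': {'Fair': 0.1, 'Biased': 0.9}}
emit  = {'Fair':   {'H': 0.5, 'T': 0.5},    # e_s(k) = P(symbol k | state s)
         'Biased': {'H': 0.9, 'T': 0.1}}

def sample(n, state='Fair'):
    """Sample n steps; return the hidden state path and the emissions."""
    path, obs = [], []
    for _ in range(n):
        path.append(state)
        obs.append(random.choices(list(emit[state]), weights=emit[state].values())[0])
        state = random.choices(list(trans[state]), weights=trans[state].values())[0]
    return path, obs

hidden, visible = sample(10)
print(visible)  # an observer sees only this emission sequence;
                # 'hidden' (the random walk between states) stays unobserved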

Example of a Hidden Markov Model

HMMs can be applied efficiently to well-known biological problems. That is why HMMs gained popularity in bioinformatics, where they are used for a variety of biological problems such as:
- protein secondary structure recognition
- multiple sequence alignment
- gene finding

What do HMMs do? An HMM is a statistical model for sequences of discrete symbols. HMMs have been used for many years in speech recognition, and they are well suited to the gene finding task: categorizing nucleotides within a genomic sequence can be interpreted as a classification problem over a set of ordered observations that possess hidden structure, which is exactly the kind of problem hidden Markov models were designed for.

Hidden Markov Models in Bioinformatics
One of the most challenging and interesting problems in computational biology at the moment is finding genes in DNA sequences. With so many genomes being sequenced so rapidly, it remains important to begin by identifying genes computationally.

Gene Finding
Gene finding refers to identifying stretches of nucleotide sequence in genomic DNA that are biologically functional. Computational gene finding deals with algorithmically identifying protein-coding genes. Gene finding is not an easy task, as gene structure can be very complex.

Objective: To find the coding and non-coding regions of an unlabeled string of DNA nucleotides.
Motivation:
- Assist in the annotation of genomic data produced by genome sequencing methods.
- Gain insight into the mechanisms involved in transcription, splicing and other processes.

Structure of a gene

The gene is discontinuous, containing both:
- exons (regions that encode a sequence of amino acids), and
- introns (non-coding polynucleotide sequences that interrupt the coding sequences, the exons, of a gene).

In gene finding there are some important biological rules:
- Translation starts with a start codon (ATG).
- Translation ends with a stop codon (TAG, TGA, TAA).
- An exon can never follow another exon without an intron in between.
- Complete genes can never end with an intron.
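As a minimal illustration of the first two rules (not part of the original slides), here is a sketch that scans one reading frame for a start codon and the next in-frame stop codon; real gene finders are of course far more involved.

def find_coding_stretch(dna, frame=0):
    """Return (start, end) of the first ATG...stop stretch in the given
    reading frame, or None. A toy illustration of the codon rules only."""
    stop_codons = {'TAG', 'TGA', 'TAA'}
    start = None
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None:
            if codon == 'ATG':           # translation starts with ATG
                start = i
        elif codon in stop_codons:       # translation ends with a stop codon
            return (start, i + 3)
    return None

print(find_coding_stretch("CCATGAAATTTTAGGG", frame=2))  # (2, 14)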

Gene Finding Models
When using HMMs, we first have to specify a model. When choosing the model we have to take its complexity into consideration, namely:
- the number of states and allowed transitions,
- how sophisticated the learning methods are,
- the learning time.

The model consists of a finite set of states, each of which can emit a symbol from a finite alphabet with a fixed probability distribution over those symbols, and a set of transitions between states, which allow the model to change state after each symbol is emitted. Models can differ in complexity and in how much biological knowledge is built in.

The model for the Viterbi algorithm

states = ('Begin', 'Exon', 'Donor', 'Intron')
observations = ('A', 'C', 'G', 'T')

The Model Probabilities
Transition probability:

transition_probability = {
    'Begin'  : {'Begin' : 0.0, 'Exon' : 1.0, 'Donor' : 0.0, 'Intron' : 0.0},
    'Exon'   : {'Begin' : 0.0, 'Exon' : 0.9, 'Donor' : 0.1, 'Intron' : 0.0},
    'Donor'  : {'Begin' : 0.0, 'Exon' : 0.0, 'Donor' : 0.0, 'Intron' : 1.0},
    'Intron' : {'Begin' : 0.0, 'Exon' : 0.0, 'Donor' : 0.0, 'Intron' : 1.0}
}

Emission probability:

emission_probability = {
    'Begin'  : {'A' : 0.00, 'C' : 0.00, 'G' : 0.00, 'T' : 0.00},
    'Exon'   : {'A' : 0.25, 'C' : 0.25, 'G' : 0.25, 'T' : 0.25},
    'Donor'  : {'A' : 0.05, 'C' : 0.00, 'G' : 0.95, 'T' : 0.00},
    'Intron' : {'A' : 0.40, 'C' : 0.10, 'G' : 0.10, 'T' : 0.40}
}
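The slides never show the start_probability argument that the Viterbi call below expects. Since the reported Viterbi path in the example slide begins directly in 'Exon', a start distribution concentrated on 'Exon' is consistent with that result; the dictionary below is therefore an assumption, not part of the original model.

start_probability = {'Begin' : 0.0, 'Exon' : 1.0, 'Donor' : 0.0, 'Intron' : 0.0}  # assumed, not shown in the slides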

Viterbi algorithm
A dynamic programming algorithm for finding the most likely sequence of hidden states. The Viterbi algorithm finds the most probable path, called the Viterbi path.

The main idea of the Viterbi algorithm is to find the most probable path to each intermediate state, continuing until it reaches the end of the sequence. At each time step, only the most likely path leading to each state survives.
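In the usual textbook formulation (not spelled out on these slides), that pruning is captured by the recurrence

v_t(s) = e_s(o_t) * max over s' of ( v_{t-1}(s') * a_{s',s} )

where v_t(s) is the probability of the most likely path ending in state s after the first t observations o_1, ..., o_t, a is the transition matrix and e is the emission matrix. Note that the implementation described on the following slides is a combined forward/Viterbi variant: alongside the maximization it also sums path probabilities to obtain the total probability of the observation sequence.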

The steps of the Viterbi algorithm

The arguments of the Viterbi algorithm:
viterbi(observations, states, start_probability, transition_probability, emission_probability)

The working of the Viterbi algorithm
The algorithm works on two mappings, T and U. For each state it calculates the triple (prob, v_path, v_prob), where prob is the total probability of all paths from the start to the current state, v_path is the Viterbi path to that state, and v_prob is the probability of the Viterbi path. The mapping T holds this information for a given point t in time, and the main loop constructs U, which holds similar information for time t+1.

The algorithm computes the triple (prob, v_path, v_prob) for each possible next state. The total probability of a given next state, total, is obtained by adding up the probabilities of all paths reaching that state. More precisely, the algorithm iterates over all possible source states. For each source state, T holds the total probability of all paths to that state. This probability is then multiplied by the emission probability of the current observation and the transition probability from the source state to the next state. The resulting probability prob is then added to total.

For each source state, the probability of the Viterbi path to that state is also known. This too is multiplied by the emission and transition probabilities, and it replaces valmax if the result is greater than the current value. The Viterbi path itself is computed as the corresponding argmax of that maximization, by extending the Viterbi path that leads to the source state with the next state. The triple (prob, v_path, v_prob) computed in this fashion is stored in U, and once U has been computed for all possible next states it replaces T, ensuring that the loop invariant holds at the end of the iteration.
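The slides describe the implementation in prose but never show its code. The sketch below reconstructs it from that description (it follows the forward_viterbi function from the Wikipedia article cited in the bibliography); states, transition_probability and emission_probability are the dictionaries from the model slides, and start_probability is the assumed distribution introduced earlier.

def viterbi(observations, states, start_probability,
            transition_probability, emission_probability):
    # T maps each state to (prob, v_path, v_prob) at the current time step.
    T = {}
    for state in states:
        T[state] = (start_probability[state], [state], start_probability[state])
    for output in observations:
        U = {}
        for next_state in states:
            total = 0
            argmax = None
            valmax = 0
            for source_state in states:
                (prob, v_path, v_prob) = T[source_state]
                # emission of the current symbol, transition to next_state
                p = (emission_probability[source_state][output] *
                     transition_probability[source_state][next_state])
                prob *= p      # contribution to the total probability
                v_prob *= p    # probability of the extended Viterbi path
                total += prob
                if v_prob > valmax:
                    argmax = v_path + [next_state]
                    valmax = v_prob
            U[next_state] = (total, argmax, valmax)
        T = U   # U becomes T for the next time step
    # Combine the final states: sum for total, max for the Viterbi path.
    total = 0
    argmax = None
    valmax = 0
    for state in states:
        (prob, v_path, v_prob) = T[state]
        total += prob
        if v_prob > valmax:
            argmax = v_path
            valmax = v_prob
    return (total, argmax, valmax)

total, v_path, v_prob = viterbi("CTTCATGTGAAAGCAGACGTAAGTCA", states,
                                start_probability, transition_probability,
                                emission_probability)

Unlike the textbook recurrence, this variant multiplies in the emission probability of the source state, and it returns the total (forward) probability of the sequence alongside the Viterbi path and its probability.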

Example
Input DNA sequence: CTTCATGTGAAAGCAGACGTAAGTCA
Result:
Total: 2.6339193049977711e-17 (the sum of all the calculated path probabilities)

Viterbi path: ['Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Exon', 'Donor', 'Intron', 'Intron', 'Intron', 'Intron', 'Intron', 'Intron', 'Intron', 'Intron']
Viterbi probability: 7.0825171238258092e-18

HMM Advantages
Statistics:
- HMMs are very powerful modeling tools.
- Statisticians are comfortable with the theory behind hidden Markov models.
- They allow mathematical/theoretical analysis of the results and processes.
Modularity:
- HMMs can be combined into larger HMMs.

Transparency:
- People can read the model and make sense of it.
- The model itself can help increase understanding.
Prior knowledge:
- Prior knowledge can be incorporated into the architecture.
- The model can be initialized close to something believed to be correct.
- Prior knowledge can be used to constrain the training process.

HMM Disadvantages
State independence:
- States are assumed to be independent: P(y) must be independent of P(x), and vice versa. This usually isn't true.
- One can get around this when the relationships are local.
- Not good for RNA folding problems.

Over-fitting:
- You're only as good as your training set.
- More training is not always good.
Local maxima:
- The model may not converge to a truly optimal parameter set for a given training set.
Speed:
- Almost everything one does with an HMM involves enumerating all possible paths through the model.
- Still slow in comparison to other methods.

Conclusions
HMMs have problems where they excel and problems where they do not. You should consider using one if:
- the problem can be phrased as classification,
- the observations are ordered,
- the observations follow some sort of grammatical structure.
If an HMM does not fit, there are other methods to try: neural networks and decision trees, among others, have been applied to bioinformatics.

Bibliography
- Pierre Baldi and Søren Brunak: Bioinformatics: The Machine Learning Approach. MIT Press.
- http://www1.imim.es/courses/bioinformaticaupf/ttreballs/programming/donorsitemodel/index.html
- http://en.wikipedia.org/wiki/Viterbi_algorithm

Thank you.