Hidden Markov models in biological sequence analysis
|
|
|
- Eileen Miles
- 9 years ago
- Views:
Transcription
1 Hidden Markov models in biological sequence analysis by E. Birney The vast increase of data in biology has meant that many aspects of computational science have been drawn into the field. Two areas of crucial importance are large-scale data management and machine learning. The field between computational science and biology is varyingly described as computational biology or bioinformatics. This paper reviews machine learning techniques based on the use of hidden Markov models (HMMs) for investigating biomolecular sequences. The approach is illustrated with brief descriptions of gene-prediction HMMs and protein family HMMs. Introduction There has been a revolution in molecular biology over the last decade due to a simple economic fact: The price of data gathering has fallen drastically. Nowhere is this better illustrated than in large-scale DNA sequencing. At current costs, it is economical to determine the DNA sequence of the entire genome of a species (the genome is all of the DNA sequence passed from one generation to the next), even for species with large genomes, such as humans. The basic information of interest in bioinformatics pertains to DNA, RNA, and proteins. Molecules of DNA are usually designated by different sequences of the letters A, T, G, and C, representing their four different types of bases. RNA molecules are usually designated by similar sequences, but with the Ts replaced by Us, representing a different type of base. Proteins are represented by 20 letters, corresponding to the 20 amino acids of which they are composed. A one-to-one letter mapping occurs between a DNA molecule and its associated RNA molecule; and a three-to-one letter mapping occurs between the RNA molecule and its associated protein molecule. A protein sequence folds in a defined threedimensional structure, for which, in a small number of cases, the coordinates are known. The defined structure is what actually provides the molecular function of the protein sequence. The basic paradigm of biology is shown graphically in Figure 1. Depicted in the figure is a region of DNA that produces a single RNA molecule, which subsequently produces a single protein having a well-defined biological function. Roughly speaking, the time and cost of determining information increases from the top of the diagram to the bottom. Determining DNA and RNA sequences is relatively cheap; determining protein sequences and protein Copyright 2001 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor /01/$ IBM 449 IBM J. RES. & DEV. VOL. 45 NO. 3/4 MAY/JULY 2001 E. BIRNEY
2 450 DNA Pre-mRNA mrna Protein Structure Figure 1 Gene X Similarity measure Gene Y Function A B?? The basic paradigm of biology (DNA produces RNA, which produces protein) and how this relates to bioinformatics data processing. The two separate panels represent different genes. The top line is genomic DNA: Going from the top to the bottom of the diagram are the primary processes which transform the information in the genomic DNA to the functional aspects of the organism. The exons are shown in blue, and the introns and intergenic DNA appear as thin lines. The start and stop codons are represented as thin vertical lines. The magenta rectangle represents a linear protein sequence, and the red circles its three-dimensional structure. The process of comparing two genes at any one of these levels provides evidence for homology between the two genes. structures is far more expensive; many person-years can be spent trying to elucidate the function of a single protein. A clear goal for bioinformatics is to provide a way to convert the cheaper information at the top to the more valuable information at the bottom. Two steps have proven to be difficult. For unknown reasons, large organisms deliberately process the RNA sequence that is derived from the DNA sequence by a method known as pre-mrna splicing. This removes specific pieces of the RNA (called introns) and fuses the remaining pieces (called exons). The exons remain collinear with their original layout in the DNA sequence. The ratio of exon sequence to intron sequence is around 1:50 in human DNA, and the intron sequence appears to be extremely random in nature, making effective discrimination difficult. Despite this challenge, bioinformatics has developed a reasonably successful solution using HMMs (see below). The second problem is deducing protein structure from a linear protein sequence. This folding problem has resisted concerted attack from researchers over the last twenty years. Although there have been many exciting advances in the area of protein folding, it seems likely that there will not be a solution to this problem in the next five or more years. Bioinformatics can thankfully sidestep both of these problems by using arguments of evolution. Imagine the proto-rodent that represents the common ancestor between mouse and human. This creature had a region of its DNA sequence which made a protein with a specific function (for example, catalyzing the reduction of ethanol to acetaldehyde). At some point there was a speciation event which led eventually to man and mouse. In the two lineages, the DNA sequences were maintained from generation to generation, sometimes suffering a mutation that changed the DNA sequence. As long as the mutation did not disadvantage the individual, in general preserving the function of the protein, the mutation would be passed on to its descendants. In the extant species of man and mouse, one ends up with two similar but not identical regions of DNA sequence which form two similar proteins with similar structures and functions. This argument of common ancestry, or homology, is illustrated pictorially in Figure 1 by the horizontal arrows. Arguments of homology are the bedrock of bioinformatics. It is relatively easy to find a clearly homologous DNA sequence presupposed to exist at the first cellular organism and observable in all living organisms for example, the DNA sequence which produces the proteins found in the ribosome. This conservation in the face of potentially billions of random mutations in the DNA sequence shows how much selection (i.e., an individual with a deleterious mutation is unlikely to pass on this mutation) occurs in biology. Given that two proteins are homologous, one can deduce that, at the very least, portions of the 3D structure of the two proteins are similar, if not some functional aspects of the protein. Since there exist large databases of known proteins with known functions, a considerable amount of bioinformatics pertains to transferring knowledge from known to unknown proteins using arguments of homology. This process is very efficient. Despite the millions of organisms, each containing thousands of genes, researchers have estimated that there are only around 4000 unique protein parts which have been reused by evolution over time (although these 4000 unique protein parts form many more than 4000 molecular functions). The degree of homology is determined by calculating some metric indicating how similar two sequences are. The observed similarity can be due either to homology between the two sequences or simply the by chance score created by matching two unrelated sequences. In more advanced formalisms, a single sequence is scored against a mathematical model of a particular type of conserved sequence region, again using a hidden Markov model, as discussed later. In general, the farther down the E. BIRNEY IBM J. RES. & DEV. VOL. 45 NO. 3/4 MAY/JULY 2001
3 information flow of Figure 1, the better the measure of similarity, because it is easier to deduce that two protein sequences are homologous than two DNA sequences. The end result is that there are two everyday tasks for bioinformatics: deducing the protein sequence from the DNA sequence and comparing protein sequences to an existing database of protein sequences, both of which utilize hidden Markov models. The rest of this paper describes these two methods in broad detail. The reader should be aware that the author deliberately ignores three large areas of probabilistic models that are being used in biology. One is the use of probabilistic models to represent the alignment of two sequences (often proteins). The basic HMM used here is quite rare in other fields but ubiquitous in bioinformatics; some recent papers are a fully Bayesian approach to sequence alignment [1] and a novel accuracy-based a posteriori decoding method [2]. The second area is the use of probabilistic models for evolutionary tree analysis, concerning which there is a long-established research interest; some recent papers include the integration of an HMM with tree methods [3]. The final area is the use of stochastic-context-free grammars (SCFGs) for RNA analysis. An SCFG is to yacc what a hidden Markov model is to lex, and they are ideally suited to RNA analysis, since RNA forms stem-loop structures analogous to the nestedbracket structure found in context-free grammars. Some recent papers discuss the exciting ability to push into more-context-dependent grammars [4]. Readers should also be aware that many of these probabilistic methods have nonprobabilistic parameterized counterparts, in some cases predating the probabilistic method by more than a decade and providing very effective techniques. The author s own prejudice is to view nonprobabilistic parameterized systems as being interpretable as some sort of probabilistic model. Hidden Markov models An HMM is a graph of connected states, each state potentially able to emit a series of observations. The process evolves in some dimension, often time, though not necessarily. The model is parameterized with probabilities governing the state at a time t 1, given that one knows the previous states. Markov assumptions are used to truncate the dependency of having to know the entire history of states up to this point in order to assess the next state: Instead, only one step back is required. As the process evolves in time through the states, each state can potentially emit observations, which are regarded as a stream of observations over time. These models are often illustrated graphically as shown in Figure 2, with the states being circles and transitions as arrows between the states. Given a particular set of parameterized models, two questions can be answered: For a given observed Figure 2 Intergenic DNA Exon Exon 1 2 Gene on forward strand Gene on reverse strand Abbreviated gene HMM model. The HMM is split into two symmetrical parts: genes on the forward or reverse strand of the DNA sequence (DNA sequence can be read in two directions). Each gene model contains a central exon state which has an emission of nucleotides tuned to recognize protein coding regions. Interrupting the exons are introns; three intron states are used, since there are three relative positions at which an intron can interrupt a coding triplet of DNA bases. These introns are distinguished by their phase 0, 1, or 2. sequence, which model is the most likely to explain this data, and for a given sequence and a given model, what is the most likely reconstruction of the path through the states. In addition, models can be learned from the data, in which parameters are estimated by expectationmaximization techniques. It is very natural to use a Bayesian statistical framework with HMMs. This is because the likelihood of, for example, observing a sequence given a particular model is a natural calculation for HMMs. Bayesian statistics provide a framework for converting this likelihood into an a posteriori probability (the probability of the model, given the observed sequence) that includes the ability to integrate prior knowledge about, for example, the way in which proteins evolve. For biological sequences, the time dimension is replaced by the position in the sequence. Hidden Markov models prove so successful in this field because they can naturally accommodate variable-length models of regions of sequence. This is generally achieved by having a state which has a transition back to itself. Because most biological data has variable-length properties, machine learning techniques which require a fixed-length input, such as neural networks or support vector machines, are less successful in biological sequence analysis IBM J. RES. & DEV. VOL. 45 NO. 3/4 MAY/JULY 2001 E. BIRNEY
4 452 B Figure 3 M I D A profile HMM, which has a repetitive structure of three states (M, I, and D). Each set of three states represents a single column in the alignment of protein sequences. Gene-prediction HMMs Gene-prediction HMMs model the process of pre-mrna splicing followed by protein translation. The input of this process is the genomic DNA sequence and the output is the parse tree of exons and introns on the DNA sequence, from which the protein sequence of the gene prediction can be predicted. The gene-prediction HMMs are relatively standard: An abbreviated HMM is shown in Figure 2. There are states representing exons and introns, with specific states to model aspects of the gene parse; in particular, the crossover points between exons and introns (denoted as the 5 and 3 splice sites) have strong sequence biases. The exemplar program for the field is Genscan by Chris Burge and Samuel Karlin [5], with other good examples being Genie by David Kulp and colleagues [6] and HMMGene by Anders Krogh [7]. Depending on one s outlook, these programs either do well, with base-pair specificity in the 80% range in well-defined test sets, or badly, in the sense that 50% gene predictions appear to be completely wrong in large-scale genomic tests. Many of the new approaches are hoping to integrate additional information from similar sequences at the RNA or protein level. All of the authors mentioned above are integrating this information, and there are other approaches, such as the present author s own work, which provide a formal integration of protein similarity with gene prediction [8, 9]. Profile HMMs Anders Krogh and colleagues have developed a hidden Markov model equivalent of profile analysis for investigating protein families [10]. Profile analysis provided an ad hoc way to represent the consensus M I D N times M I D E profile of amino acids for a set of protein sequences belonging to the same family. The hidden Markov model applied was deliberately modeled on this successful technique, but introduced the notion of using probabilitybased parameterization, allowing both a principled way of setting the gap penalty scores and also more novel techniques such as expectation maximization to learn parameters from unaligned data. The architecture of the HMM is shown in Figure 3. It has a simple left-to-right structure in which there is a repetitive set of three states, designated as match, delete, and insert (M, D, and I). The match state represents a consensus amino acid for this position in the protein family. The delete state is a non-emitting state, and represents skipping this consensus position in the multiple alignment. Finally, the insert state models the insertion of any number of residues after this consensus position. This type of repetitive HMM is also common in speech recognition, where it is sometimes called a timedependent HMM or time-parameterized HMM. The use of profile HMMs was greatly enhanced in the HMMER package by Sean Eddy [11]. HMMER provided a free, stable, and effective software package to build, manipulate, and use HMMs, as well as a number of important improvements to the use of HMMs. First, HMMER provided log-odds likelihood of the model compared to a random model to indicate the relative likelihood that a new sequence belongs to this family. In the second iteration of the package, HMMER2, the HMM architecture was improved, in particular reducing the number of parameters to learn and in addition deliberately modeling repeated occurrences of a single protein domain in one protein sequence. HMMER2 also introduced a frequentist interpretation of the log-odds likelihood statistic by providing the ability to calibrate an HMM against a random distribution of sequences and fitting a distribution under the assumption that it was an extreme value distribution. This calibration and curvefitting approach produced a statistic that is far more powerful than, but still as accurate as, that produced by the Bayesian a posteriori probability approach. The success of HMMER in providing a stable, robust way to analyze protein families gave rise to a number of databases of hidden Markov models. Such databases are similar in many ways to the databases of phonemes and longer words used in speech recognition: Since biology has a limited number of protein families in existence, sheer enumeration of these protein domains is achievable. Despite the early promise of using unsupervised training approaches to derive these HMMs, highly supervised approaches by bioinformatics experts have always outperformed the more automatic approaches. The databases of profile HMMs are therefore focused around manual adjustment of the profile HMMs followed by an E. BIRNEY IBM J. RES. & DEV. VOL. 45 NO. 3/4 MAY/JULY 2001
5 automatic gathering of complete datasets from large protein sets, as illustrated by the use of the Pfam (protein family database) [12] and SMART (simple modular architecture research tool) [13] approaches. Pfam in particular now possesses coverage significant enough that 67% of proteins contain at least one Pfam profile HMM and 45% of residues in the protein database are covered in total by the HMMs. This extent of coverage, coupled with the good statistical behavior of the profile HMMs, has made Pfam an automatic protein classification system without peer. Theoretical contributions from bioinformatics It is easy to think that much of bioinformatics is the rather mundane application of existing machine learning techniques to yet another set of data. However, like any real-life problem, bioinformatics stretches the methods in ways in which other datasets have not. This has led to a number of advances in machine learning from bioinformatics. Here are some selected highlights: Small dataset usage. Bioinformatics, like some other fields, uses small sets of data but comprises a large body of theory about how certain distributions are presumed to behave. By integrating some of this theory into more standard prior distribution style methods, a number of novel methods have been developed, such as applying multiple Dirichlet priors [14] and maximizing the use of small datasets [15]. Novel decoding methods. Much of bioinformatics is less interested in the question of which model a particular set of observations come from than in the path taken through a particular model. The standard maximumlikelihood path (also called the Viterbi path) is not always the best path for a particular problem. Novel methods include a posteriori decoding to maximize the accuracy of the path [2] and decoding methods for integrated probabilistic methods [7]. General extensions of techniques. Some interesting work has occurred in bioinformatics in the integration of machine learning techniques. In the author s own work with Richard Durbin, we were led to derive a formal process to combine two separate HMMs into one [9]. David Haussler and Tommi Jaakkola provided a way of combining a discriminant method (support vector machines) with a generative HMM for providing better performance in a stringent class-distinction test [16]. Open areas for research in hidden Markov models in biology Open areas for research in HMMs in biology include the following: Integration of structural information into profile HMMs. Despite the almost obvious application of using structural information on a member protein family when one exists to better the parameterization of the HMM, this has been extremely hard to achieve in practice. Model architecture. The architectures of HMMs have largely been chosen to be the simplest architectures that can fit the observed data. Is this the best architecture to use? Can one use protein structure knowledge to make better architecture decisions, or, in limited regions, to learn the architecture directly from the data? Will these implied architectures have implications for our structural understanding? Biological mechanism. In gene prediction, the HMMs may be getting close to replicating the same sort of accuracy as the biological machine (the HMMs have the additional task of finding the gene in the genomic DNA context, which is not handled by the biological machine that processes the RNA). What constraints does our statistical model place on the biological mechanism in particular, can we consider a biological mechanism that could use the same information as the HMM? There are many other topics, both in probabilistic modeling and more generally in bioinformatics as a discipline waiting for enthusiastic machine-learning researchers. The author looks forward to the field growing over the coming decade. References 1. J. Zhu, J. S. Liu, and C. E. Lawrence, Bayesian Adaptive Sequence Alignment Algorithms, Bioinformatics 14, (1998). 2. I. Holmes and R. Durbin, Dynamic Programming Alignment Accuracy, J. Comput. Biol. 5, (1998). 3. P. Lio, J. L. Thorne, N. Goldman, and D. T. Jones, Passml: Combining Evolutionary Inference and Protein Secondary Structure Prediction, Bioinformatics 14, (1999). 4. Eleanor Rivas and Sean Eddy, A Dynamic Programming Algorithm for RNA Structure Prediction Including Pseudoknots, J. Mol. Biol. 285, (1999). 5. C. Burge and S. Karlin, Prediction of Complete Gene Structures in Human Genomic DNA, J. Mol. Biol. 268, (1997). 6. D. Kulp, D. Haussler, M. G. Reese, and F. H. Eeckman, A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA, Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, D. J. States, P. Agarwal, T. Gaasterland, L. Hunter, and R. F. Smith, Eds., AAAI Press, Menlo Park, CA, 1996, pp A. Krogh, Two Methods for Improving Performance of a HMM and Their Application for Gene Finding, Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, T. Gaasterland, P. Karp, K. Karplus, C. Ouzounis, C. Sander, and A. Valencia, Eds., AAAI Press, Menlo Park, CA, 1997, pp E. Birney and R. Durbin, Dynamite: A Flexible Code Generating Language for Dynamic Programming Methods 453 IBM J. RES. & DEV. VOL. 45 NO. 3/4 MAY/JULY 2001 E. BIRNEY
6 Used in Sequence Comparison, Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, 1997, pp E. Birney, Sequence Alignment in Bioinformatics, Ph.D. thesis, The Sanger Centre, Cambridge, U.K., 2000; available from ftp://ftp.sanger.ac.uk/pub/birney/thesis. 10. A. Krogh, M. Brown, I. S. Mian, K. Sjölander, and D. Haussler, Hidden Markov Models in Computational Biology: Applications to Protein Modeling, J. Mol. Biol. 235, (1994). 11. S. R. Eddy, HMMER: A Profile Hidden Markov Modelling Package, available from A. Bateman, E. Birney, R. Durbin, S. R. Eddy, K. L. Howe, and E. L. L. Sonnhammer, The Pfam Protein Families Database, Nucleic Acids Res. 28, (2000). 13. J. Schultz, R. R. Copley, T. Doerks, C. P. Ponting, and P. Bork, SMART: A Web-Based Tool for the Study of Genetically Mobile Domains, Nucleic Acids Res. 28, (2000). 14. K. Sjölander, K. Karplus, M. Brown, R. Hughey, A. Krogh, I. S. Mian, and D. Haussler, Dirichlet Mixtures: A Method for Improved Detection of Weak but Significant Protein Sequence Homology, Comput. Appl. Biosci. 12, (1996). 15. S. R. Eddy, G. J. Mitchison, and R. Durbin, Maximum Discrimination Hidden Markov Models of Sequence Consensus, J. Comput. Biol. 2, 9 23 (1995). 16. T. Jaakkola, M. Diekhans, and D. Haussler, A Discriminative Framework for Detecting Remote Protein Homologies, Comput. Biol. 7, (2000). Ewan Birney European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, U.K. CB10 1SD ([email protected]). Dr. Birney is the Team Leader for Genomic Annotation at the European Bioinformatics Institute (EBI). He received a B.A. degree in biochemistry from Oxford University in 1996 and a Ph.D. degree in genetics from Cambridge University in Dr. Birney has pursued a career in bioinformatics since his undergraduate days, writing the GeneWise software package and becoming the Bioperl coordinator in He is now one of the leaders for the Ensembl project, which is providing an open annotation of the human genome ( Received August 2, 2000; accepted for publication December 19, E. BIRNEY IBM J. RES. & DEV. VOL. 45 NO. 3/4 MAY/JULY 2001
Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006
Hidden Markov Models in Bioinformatics By Máthé Zoltán Kőrösi Zoltán 2006 Outline Markov Chain HMM (Hidden Markov Model) Hidden Markov Models in Bioinformatics Gene Finding Gene Finding Model Viterbi algorithm
RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison
RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the
Final Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources
1 of 8 11/7/2004 11:00 AM National Center for Biotechnology Information About NCBI NCBI at a Glance A Science Primer Human Genome Resources Model Organisms Guide Outreach and Education Databases and Tools
Bioinformatics Resources at a Glance
Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences
Sequence homology search tools on the world wide web
44 Sequence Homology Search Tools Sequence homology search tools on the world wide web Ian Holmes Berkeley Drosophila Genome Project, Berkeley, CA email: [email protected] Introduction Sequence homology
Algorithms in Computational Biology (236522) spring 2007 Lecture #1
Algorithms in Computational Biology (236522) spring 2007 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: Tuesday 11:00-12:00/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office
Guide for Bioinformatics Project Module 3
Structure- Based Evidence and Multiple Sequence Alignment In this module we will revisit some topics we started to look at while performing our BLAST search and looking at the CDD database in the first
Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions
SMA 50: Statistical Learning and Data Mining in Bioinformatics (also listed as 5.077: Statistical Learning and Data Mining ()) Spring Term (Feb May 200) Faculty: Professor Roy Welsch Wed 0 Feb 7:00-8:0
Bioinformatics Grid - Enabled Tools For Biologists.
Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis
Searching Nucleotide Databases
Searching Nucleotide Databases 1 When we search a nucleic acid databases, Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from the forward strand and 3 reading frames
SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD
White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper
Pairwise Sequence Alignment
Pairwise Sequence Alignment [email protected] SS 2013 Outline Pairwise sequence alignment global - Needleman Wunsch Gotoh algorithm local - Smith Waterman algorithm BLAST - heuristics What
FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem
FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem Elsa Bernard Laurent Jacob Julien Mairal Jean-Philippe Vert September 24, 2013 Abstract FlipFlop implements a fast method for de novo transcript
Structure and Function of DNA
Structure and Function of DNA DNA and RNA Structure DNA and RNA are nucleic acids. They consist of chemical units called nucleotides. The nucleotides are joined by a sugar-phosphate backbone. The four
PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: [email protected]
BIOINFTool: Bioinformatics and sequence data analysis in molecular biology using Matlab Mai S. Mabrouk 1, Marwa Hamdy 2, Marwa Mamdouh 2, Marwa Aboelfotoh 2,Yasser M. Kadah 2 1 Biomedical Engineering Department,
Hidden Markov Models
8.47 Introduction to omputational Molecular Biology Lecture 7: November 4, 2004 Scribe: Han-Pang hiu Lecturer: Ross Lippert Editor: Russ ox Hidden Markov Models The G island phenomenon The nucleotide frequencies
14.3 Studying the Human Genome
14.3 Studying the Human Genome Lesson Objectives Summarize the methods of DNA analysis. State the goals of the Human Genome Project and explain what we have learned so far. Lesson Summary Manipulating
Focusing on results not data comprehensive data analysis for targeted next generation sequencing
Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes
Transcription and Translation of DNA
Transcription and Translation of DNA Genotype our genetic constitution ( makeup) is determined (controlled) by the sequence of bases in its genes Phenotype determined by the proteins synthesised when genes
Translation Study Guide
Translation Study Guide This study guide is a written version of the material you have seen presented in the replication unit. In translation, the cell uses the genetic information contained in mrna to
Genome Explorer For Comparative Genome Analysis
Genome Explorer For Comparative Genome Analysis Jenn Conn 1, Jo L. Dicks 1 and Ian N. Roberts 2 Abstract Genome Explorer brings together the tools required to build and compare phylogenies from both sequence
A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques
Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007 402 A Multiple DNA Sequence Translation Tool Incorporating Web
Learning outcomes. Knowledge and understanding. Competence and skills
Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges
DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!!
DNA Replication & Protein Synthesis This isn t a baaaaaaaddd chapter!!! The Discovery of DNA s Structure Watson and Crick s discovery of DNA s structure was based on almost fifty years of research by other
RNA & Protein Synthesis
RNA & Protein Synthesis Genes send messages to cellular machinery RNA Plays a major role in process Process has three phases (Genetic) Transcription (Genetic) Translation Protein Synthesis RNA Synthesis
Introduction. Keywords: Hidden Markov model, gene finding, imum likelihood, statistical sequence analysis.
From: ISMB-97 Proceedings. Copyright 1997, AAAI (www.aaai.org). All rights reserved. Two methods for improving performance of an HMMand their application for gene finding Anders Krogh" Center for Biological
Master's projects at ITMO University. Daniil Chivilikhin PhD Student @ ITMO University
Master's projects at ITMO University Daniil Chivilikhin PhD Student @ ITMO University General information Guidance from our lab's researchers Publishable results 2 Research areas Research at ITMO Evolutionary
Lecture 19: Proteins, Primary Struture
CPS260/BGT204.1 Algorithms in Computational Biology November 04, 2003 Lecture 19: Proteins, Primary Struture Lecturer: Pankaj K. Agarwal Scribe: Qiuhua Liu 19.1 The Building Blocks of Protein [1] Proteins
Linear Sequence Analysis. 3-D Structure Analysis
Linear Sequence Analysis What can you learn from a (single) protein sequence? Calculate it s physical properties Molecular weight (MW), isoelectric point (pi), amino acid content, hydropathy (hydrophilic
High Throughput Network Analysis
High Throughput Network Analysis Sumeet Agarwal 1,2, Gabriel Villar 1,2,3, and Nick S Jones 2,4,5 1 Systems Biology Doctoral Training Centre, University of Oxford, Oxford OX1 3QD, United Kingdom 2 Department
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering
When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want
1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very
Basic Concepts of DNA, Proteins, Genes and Genomes
Basic Concepts of DNA, Proteins, Genes and Genomes Kun-Mao Chao 1,2,3 1 Graduate Institute of Biomedical Electronics and Bioinformatics 2 Department of Computer Science and Information Engineering 3 Graduate
Frequently Asked Questions Next Generation Sequencing
Frequently Asked Questions Next Generation Sequencing Import These Frequently Asked Questions for Next Generation Sequencing are some of the more common questions our customers ask. Questions are divided
Hidden Markov models in gene finding. Bioinformatics research group David R. Cheriton School of Computer Science University of Waterloo
Hidden Markov models in gene finding Broňa Brejová Bioinformatics research group David R. Cheriton School of Computer Science University of Waterloo 1 Topics for today What is gene finding (biological
DNA and the Cell. Version 2.3. English version. ELLS European Learning Laboratory for the Life Sciences
DNA and the Cell Anastasios Koutsos Alexandra Manaia Julia Willingale-Theune Version 2.3 English version ELLS European Learning Laboratory for the Life Sciences Anastasios Koutsos, Alexandra Manaia and
From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains
Proteins From DNA to Protein Chapter 13 All proteins consist of polypeptide chains A linear sequence of amino acids Each chain corresponds to the nucleotide base sequence of a gene The Path From Genes
BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS
BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS NEW YORK CITY COLLEGE OF TECHNOLOGY The City University Of New York School of Arts and Sciences Biological Sciences Department Course title:
Introduction to Bioinformatics 3. DNA editing and contig assembly
Introduction to Bioinformatics 3. DNA editing and contig assembly Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 [email protected]
Overview of Eukaryotic Gene Prediction
Overview of Eukaryotic Gene Prediction CBB 231 / COMPSCI 261 W.H. Majoros What is DNA? Nucleus Chromosome Telomere Centromere Cell Telomere base pairs histones DNA (double helix) DNA is a Double Helix
AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS
AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS PIERRE LANCHANTIN, ANDREW C. MORRIS, XAVIER RODET, CHRISTOPHE VEAUX Very high quality text-to-speech synthesis can be achieved by unit selection
Learning is a very general term denoting the way in which agents:
What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);
Gene Models & Bed format: What they represent.
GeneModels&Bedformat:Whattheyrepresent. Gene models are hypotheses about the structure of transcripts produced by a gene. Like all models, they may be correct, partly correct, or entirely wrong. Typically,
The world of non-coding RNA. Espen Enerly
The world of non-coding RNA Espen Enerly ncrna in general Different groups Small RNAs Outline mirnas and sirnas Speculations Common for all ncrna Per def.: never translated Not spurious transcripts Always/often
GenBank, Entrez, & FASTA
GenBank, Entrez, & FASTA Nucleotide Sequence Databases First generation GenBank is a representative example started as sort of a museum to preserve knowledge of a sequence from first discovery great repositories,
Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure 3.11 3.15 enzymes control cell chemistry ( metabolism )
Biology 1406 Exam 3 Notes Structure of DNA Ch. 10 Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure 3.11 3.15 enzymes control cell chemistry ( metabolism ) Proteins
Lecture Series 7. From DNA to Protein. Genotype to Phenotype. Reading Assignments. A. Genes and the Synthesis of Polypeptides
Lecture Series 7 From DNA to Protein: Genotype to Phenotype Reading Assignments Read Chapter 7 From DNA to Protein A. Genes and the Synthesis of Polypeptides Genes are made up of DNA and are expressed
BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16
Course Director: Dr. Barry Grant (DCM&B, [email protected]) Description: This is a three module course covering (1) Foundations of Bioinformatics, (2) Statistics in Bioinformatics, and (3) Systems
The Basics of Graphical Models
The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures
Current Motif Discovery Tools and their Limitations
Current Motif Discovery Tools and their Limitations Philipp Bucher SIB / CIG Workshop 3 October 2006 Trendy Concepts and Hypotheses Transcription regulatory elements act in a context-dependent manner.
Distributed Bioinformatics Computing System for DNA Sequence Analysis
Global Journal of Computer Science and Technology: A Hardware & Computation Volume 14 Issue 1 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc.
Introduction to Genome Annotation
Introduction to Genome Annotation AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT
RNA Structure and folding
RNA Structure and folding Overview: The main functional biomolecules in cells are polymers DNA, RNA and proteins For RNA and Proteins, the specific sequence of the polymer dictates its final structure
Protein Synthesis How Genes Become Constituent Molecules
Protein Synthesis Protein Synthesis How Genes Become Constituent Molecules Mendel and The Idea of Gene What is a Chromosome? A chromosome is a molecule of DNA 50% 50% 1. True 2. False True False Protein
GA as a Data Optimization Tool for Predictive Analytics
GA as a Data Optimization Tool for Predictive Analytics Chandra.J 1, Dr.Nachamai.M 2,Dr.Anitha.S.Pillai 3 1Assistant Professor, Department of computer Science, Christ University, Bangalore,India, [email protected]
Biological Sciences Initiative. Human Genome
Biological Sciences Initiative HHMI Human Genome Introduction In 2000, researchers from around the world published a draft sequence of the entire genome. 20 labs from 6 countries worked on the sequence.
14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)
Overview Kyrre Glette kyrrehg@ifi INF3490 Swarm Intelligence Particle Swarm Optimization Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) 3 Swarms in nature Fish, birds,
Core Bioinformatics. Degree Type Year Semester. 4313473 Bioinformàtica/Bioinformatics OB 0 1
Core Bioinformatics 2014/2015 Code: 42397 ECTS Credits: 12 Degree Type Year Semester 4313473 Bioinformàtica/Bioinformatics OB 0 1 Contact Name: Sònia Casillas Viladerrams Email: [email protected]
Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data
Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data The Illumina TopHat Alignment and Cufflinks Assembly and Differential Expression apps make RNA data analysis accessible to any user, regardless
MS1b Statistical Data Mining
MS1b Statistical Data Mining Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Administrivia and Introduction Course Structure Syllabus Introduction to
RNA and Protein Synthesis
Name lass Date RN and Protein Synthesis Information and Heredity Q: How does information fl ow from DN to RN to direct the synthesis of proteins? 13.1 What is RN? WHT I KNOW SMPLE NSWER: RN is a nucleic
Module 1. Sequence Formats and Retrieval. Charles Steward
The Open Door Workshop Module 1 Sequence Formats and Retrieval Charles Steward 1 Aims Acquaint you with different file formats and associated annotations. Introduce different nucleotide and protein databases.
Mitochondrial DNA Analysis
Mitochondrial DNA Analysis Lineage Markers Lineage markers are passed down from generation to generation without changing Except for rare mutation events They can help determine the lineage (family tree)
MASCOT Search Results Interpretation
The Mascot protein identification program (Matrix Science, Ltd.) uses statistical methods to assess the validity of a match. MS/MS data is not ideal. That is, there are unassignable peaks (noise) and usually
Bio-Informatics Lectures. A Short Introduction
Bio-Informatics Lectures A Short Introduction The History of Bioinformatics Sanger Sequencing PCR in presence of fluorescent, chain-terminating dideoxynucleotides Massively Parallel Sequencing Massively
REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf])
820 REGULATIONS FOR THE DEGREE OF BACHELOR OF SCIENCE IN BIOINFORMATICS (BSc[BioInf]) (See also General Regulations) BMS1 Admission to the Degree To be eligible for admission to the degree of Bachelor
Course: Model, Learning, and Inference: Lecture 5
Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 [email protected] Abstract Probability distributions on structured representation.
SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications
Product Bulletin Sequencing Software SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications Comprehensive reference sequence handling Helps interpret the role of each
Biological Sequence Data Formats
Biological Sequence Data Formats Here we present three standard formats in which biological sequence data (DNA, RNA and protein) can be stored and presented. Raw Sequence: Data without description. FASTA
FINDING RELATION BETWEEN AGING AND
FINDING RELATION BETWEEN AGING AND TELOMERE BY APRIORI AND DECISION TREE Jieun Sung 1, Youngshin Joo, and Taeseon Yoon 1 Department of National Science, Hankuk Academy of Foreign Studies, Yong-In, Republic
A Learning Based Method for Super-Resolution of Low Resolution Images
A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 [email protected] Abstract The main objective of this project is the study of a learning based method
Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in
DNA, RNA, Protein Synthesis Keystone 1. During the process shown above, the two strands of one DNA molecule are unwound. Then, DNA polymerases add complementary nucleotides to each strand which results
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 [email protected] Genomics A genome is an organism s
MAKING AN EVOLUTIONARY TREE
Student manual MAKING AN EVOLUTIONARY TREE THEORY The relationship between different species can be derived from different information sources. The connection between species may turn out by similarities
1 Mutation and Genetic Change
CHAPTER 14 1 Mutation and Genetic Change SECTION Genes in Action KEY IDEAS As you read this section, keep these questions in mind: What is the origin of genetic differences among organisms? What kinds
Scottish Qualifications Authority
National Unit specification: general information Unit code: FH2G 12 Superclass: RH Publication date: March 2011 Source: Scottish Qualifications Authority Version: 01 Summary This Unit is a mandatory Unit
REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])
299 REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc]) (See also General Regulations) Any publication based on work approved for a higher degree should contain a reference
Gene Finding CMSC 423
Gene Finding CMSC 423 Finding Signals in DNA We just have a long string of A, C, G, Ts. How can we find the signals encoded in it? Suppose you encountered a language you didn t know. How would you decipher
a. Ribosomal RNA rrna a type ofrna that combines with proteins to form Ribosomes on which polypeptide chains of proteins are assembled
Biology 101 Chapter 14 Name: Fill-in-the-Blanks Which base follows the next in a strand of DNA is referred to. as the base (1) Sequence. The region of DNA that calls for the assembly of specific amino
Data, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
A Review of Data Mining Techniques
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,
Protein Analysis and WEKA Data Mining
GenMiner: a Data Mining Tool for Protein Analysis Gerasimos Hatzidamianos 1, Sotiris Diplaris 1,2, Ioannis Athanasiadis 1,2, Pericles A. Mitkas 1,2 1 Department of Electrical and Computer Engineering Aristotle
Human Genome Organization: An Update. Genome Organization: An Update
Human Genome Organization: An Update Genome Organization: An Update Highlights of Human Genome Project Timetable Proposed in 1990 as 3 billion dollar joint venture between DOE and NIH with 15 year completion
Clone Manager. Getting Started
Clone Manager for Windows Professional Edition Volume 2 Alignment, Primer Operations Version 9.5 Getting Started Copyright 1994-2015 Scientific & Educational Software. All rights reserved. The software
Amino Acids and Their Properties
Amino Acids and Their Properties Recap: ss-rrna and mutations Ribosomal RNA (rrna) evolves very slowly Much slower than proteins ss-rrna is typically used So by aligning ss-rrna of one organism with that
AP Biology Essential Knowledge Student Diagnostic
AP Biology Essential Knowledge Student Diagnostic Background The Essential Knowledge statements provided in the AP Biology Curriculum Framework are scientific claims describing phenomenon occurring in
To be able to describe polypeptide synthesis including transcription and splicing
Thursday 8th March COPY LO: To be able to describe polypeptide synthesis including transcription and splicing Starter Explain the difference between transcription and translation BATS Describe and explain
The Steps. 1. Transcription. 2. Transferal. 3. Translation
Protein Synthesis Protein synthesis is simply the "making of proteins." Although the term itself is easy to understand, the multiple steps that a cell in a plant or animal must go through are not. In order
Graph theoretic approach to analyze amino acid network
Int. J. Adv. Appl. Math. and Mech. 2(3) (2015) 31-37 (ISSN: 2347-2529) Journal homepage: www.ijaamm.com International Journal of Advances in Applied Mathematics and Mechanics Graph theoretic approach to
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Data Isn't Everything
June 17, 2015 Innovate Forward Data Isn't Everything The Challenges of Big Data, Advanced Analytics, and Advance Computation Devices for Transportation Agencies. Using Data to Support Mission, Administration,
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
