Gene Finding CMSC 423




Finding Signals in DNA. We just have a long string of As, Cs, Gs, and Ts. How can we find the signals encoded in it? Suppose you encountered a language you didn't know. How would you decipher it? Idea #1: Based on some external information, build a model (like an HMM) for how particular features are encoded. Idea #2: Find patterns that appear more often than you would expect by chance. ("the" occurs a lot in English, so it may be a word.) Today we explore methods based mostly on idea #1. Next time, we will explore idea #2.

Central Dogma of Biology. Genome (DNA) → transcription → mRNA (T replaced by U) → translation → proteins. DNA is a double-stranded, linear molecule: each strand is a string over {A,C,G,T}, and the strands are complements of each other (A pairs with T; C pairs with G). Substrings encode genes, most of which encode proteins.
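The two string operations implied by this slide (taking the complementary strand and transcribing T to U) can be sketched in Python as follows. This is a minimal illustration that assumes the input contains only upper-case A, C, G, T; the function names are hypothetical and not part of the course materials.

```python
# Illustrative helpers; the names are assumptions made for this sketch.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.translate(_COMPLEMENT)[::-1]

def transcribe(dna):
    """Return the mRNA spelling of a coding-strand sequence (T replaced by U)."""
    return dna.replace("T", "U")
```

A gene on the opposite strand can then be searched for by running the same analysis on `reverse_complement(genome)`.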

The Genetic Code. There are 20 different amino acids and 64 different codons, so most amino acids can be encoded in several different ways. The 3rd base is typically less important for determining the amino acid. Three different stop codons signal the end of the gene. Start codons differ depending on the organism, but AUG is often used.
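To make the 64-codon, 20-amino-acid mapping concrete, here is a small Python sketch that builds the standard genetic code as a dictionary and translates an in-frame DNA string. It assumes the standard code (NCBI translation table 1) written over DNA letters; `CODON_TABLE` and `translate` are names chosen for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate in-frame DNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

Counting the values of `CODON_TABLE` shows the redundancy directly: leucine (L) has six codons, while methionine (M) and tryptophan (W) have one each.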

The Gene Finding Problem. Genes are subsequences of DNA that (generally) tell the cell how to make specific proteins. How can we find which subsequences of DNA are genes? Start codon: ATG. Stop codons: TGA, TAG, TAA.
ATAGAGGGTATGGGGGACCCGGACACGATGGCAGATGACGATGACGATGACGATGACGGGTGAAGTGAGTCAACACATGAC
Challenges: The start codon can occur in the middle of a gene. A stop codon can occur in non-coding DNA between genes. A stop codon can occur out of frame inside a gene. We don't know what phase (reading frame) the gene starts in.

A Simple Gene Finder. 1. Find all stop codons in the genome. 2. For each stop codon, find the in-frame start codon farthest upstream of the stop codon, without crossing another in-frame stop codon.
GGC TAG ATG AGG GCT CTA ACT ATG GGC GCG TAA
Each substring between a start codon and its stop codon is called an ORF (open reading frame). 3. Return the long ORFs as predicted genes. Since 3 of the 64 possible codons are stop codons, in random DNA about 1 in every 21 codons (64/3 ≈ 21.3) is expected to be a stop, so long ORFs are unlikely to arise by chance.
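The steps of the simple gene finder can be sketched in Python as below. This is only an illustration under stated assumptions: it scans each of the three reading frames forward, remembering the first ATG seen since the last in-frame stop, which yields the same ORFs as walking upstream from each stop codon; it looks at one strand only; and `find_orfs` and `min_codons` are hypothetical names, with the length cutoff left for the caller to choose.

```python
STOPS = {"TGA", "TAG", "TAA"}
START = "ATG"

def find_orfs(seq, min_codons=0):
    """Return sorted (start, end) pairs of ORFs on the given strand.

    `end` is exclusive and includes the stop codon. For each stop, the ORF
    begins at the farthest-upstream in-frame ATG that follows the previous
    in-frame stop.
    """
    orfs = []
    for frame in range(3):
        farthest_start = None  # first in-frame ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == START and farthest_start is None:
                farthest_start = i
            elif codon in STOPS:
                if farthest_start is not None:
                    n_codons = (i + 3 - farthest_start) // 3
                    if n_codons >= min_codons:
                        orfs.append((farthest_start, i + 3))
                farthest_start = None
    return sorted(orfs)
```

On the slide's example, `find_orfs("GGCTAGATGAGGGCTCTAACTATGGGCGCGTAA")` reports a single ORF that starts at the first ATG (index 6) rather than the second, and ends after the final TAA.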

Gene Finding as a Machine Learning Problem. Given training examples of some known genes, can we distinguish ORFs that are genes from those that are not? Idea: use the distribution of codons to find genes. Every codon should be about equally likely in non-gene DNA. Every organism has a slightly different bias in how often it uses certain codons. We could also use frequencies of longer strings (k-mers).
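One way to turn training genes into a codon distribution is sketched below in Python. It assumes each training gene is an in-frame DNA string and normalizes over all codons, which is one possible choice (a codon usage table may instead normalize within each amino acid); `codon_frequencies` is a name invented for this example.

```python
from collections import Counter

def codon_frequencies(genes):
    """Estimate Pr(codon) as the fraction of all codons seen in the training genes."""
    counts = Counter()
    for gene in genes:
        usable = len(gene) - len(gene) % 3  # ignore a trailing partial codon
        counts.update(gene[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}
```

The same counting loop extends to longer k-mers by changing the step and slice length.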

Bacillus anthracis (anthrax) codon usage
(each entry: codon, amino acid, fraction of that amino acid's codons; * = stop)
UUU F 0.76    UCU S 0.27    UAU Y 0.77    UGU C 0.73
UUC F 0.24    UCC S 0.08    UAC Y 0.23    UGC C 0.27
UUA L 0.49    UCA S 0.23    UAA * 0.66    UGA * 0.14
UUG L 0.13    UCG S 0.06    UAG * 0.20    UGG W 1.00
CUU L 0.16    CCU P 0.28    CAU H 0.79    CGU R 0.26
CUC L 0.04    CCC P 0.07    CAC H 0.21    CGC R 0.06
CUA L 0.14    CCA P 0.49    CAA Q 0.78    CGA R 0.16
CUG L 0.05    CCG P 0.16    CAG Q 0.22    CGG R 0.05
AUU I 0.57    ACU T 0.36    AAU N 0.76    AGU S 0.28
AUC I 0.15    ACC T 0.08    AAC N 0.24    AGC S 0.08
AUA I 0.28    ACA T 0.42    AAA K 0.74    AGA R 0.36
AUG M 1.00    ACG T 0.15    AAG K 0.26    AGG R 0.11
GUU V 0.32    GCU A 0.34    GAU D 0.81    GGU G 0.30
GUC V 0.07    GCC A 0.07    GAC D 0.19    GGC G 0.09
GUA V 0.43    GCA A 0.44    GAA E 0.75    GGA G 0.41
GUG V 0.18    GCG A 0.15    GAG E 0.25    GGG G 0.20

An Improved Simple Gene Finder. Score each ORF using the product of the probabilities of its codons: GFScore(g) = Pr(codon_1) × Pr(codon_2) × Pr(codon_3) × ... × Pr(codon_n). But as genes get longer, GFScore(g) will decrease. So we should calculate GFScore(g[i...i+k]) for some window size k. The final GFScore(g) is the average of the scores of the windows in it.
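The windowed score can be sketched in Python as follows. This is a rough illustration, not the course's implementation: it assumes windows of k consecutive codons that slide one codon at a time, scores an ORF shorter than the window as a single window, and substitutes a small `floor` probability for codons missing from the table. All names and the default values are assumptions.

```python
from math import prod

def split_codons(orf):
    """Split an in-frame ORF into codons, dropping a trailing partial codon."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]

def window_score(codons, prob, floor=1e-6):
    """Product of codon probabilities; unseen codons get the assumed floor value."""
    return prod(prob.get(c, floor) for c in codons)

def gf_score(orf, prob, k=3):
    """Average the product score over all windows of k consecutive codons."""
    codons = split_codons(orf)
    if len(codons) <= k:
        return window_score(codons, prob)
    windows = [codons[i:i + k] for i in range(len(codons) - k + 1)]
    return sum(window_score(w, prob) for w in windows) / len(windows)
```

Because every window has the same number of codons, the averaged score no longer shrinks simply because the ORF is long. In practice, summing log-probabilities instead of multiplying probabilities avoids numerical underflow for large k.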

Eukaryotic Genes & Exon Splicing. Prokaryotic (bacterial) genes look like this: ATG ... TAG. Eukaryotic genes usually look like this: ATG exon intron exon intron exon intron exon TAG. Introns are thrown away, and exons are concatenated together to give the mRNA: AUG ... UAG. This spliced RNA is what is translated into a protein.
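Splicing can be pictured as a simple string operation. The Python sketch below assumes the exon boundaries are already known and are given as 0-based, end-exclusive intervals on the coding strand; finding those boundaries is the hard part of eukaryotic gene finding and is not attempted here. The function name `splice` is hypothetical.

```python
def splice(dna, exons):
    """Concatenate the exon intervals of `dna` and transcribe the result to mRNA."""
    mature = "".join(dna[start:end] for start, end in exons)
    return mature.replace("T", "U")
```

For a gene made of two exons separated by one intron, passing the two exon intervals returns the mRNA with the intron removed.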

A (Bad) HMM Eukaryotic Gene Finder. [Figure: HMM state diagram with states START, Start 1, Start 2, Start 3 (emitting A, T, and G respectively with probability 1), pos 1, pos 2, pos 3, donor 1, donor 2, intron, acceptor 1, acceptor 2, Stop 1, Stop 2, Stop 3, and END. Arrows show transitions with nonzero probabilities.] What are some reasons this HMM gene finder is likely to do poorly?

Bad Eukaryotic Gene Finder. The positions in the codons are treated independently: the probability of emitting a base can't depend on which base was emitted previously. Only one strand of the DNA is considered at a time. Length distributions of introns and exons are not considered.

A Generalized HMM-based Gene Finder. [Figure: GlimmerHMM model, with states for the + strand and the - strand.]

GlimmerHMM Performance. [Results table not recovered; the metrics reported were:] % of predicted in-gene nucleotides that are correct; % of predicted exons that are true exons; % of true gene nucleotides that GlimmerHMM predicts as part of genes; % of true exons that GlimmerHMM found; % of genes perfectly found.

Comparison with GENSCAN on 963 human genes. [Results table not recovered.] Note that overall accuracy is pretty low.

Recap. Simple gene finding approaches use codon bias and long ORFs to identify genes. Many top gene finding programs are based on generalizations of Hidden Markov Models. Basic HMMs must be generalized to emit variable-sized strings.