Introduction to Genome Annotation

Similar documents
Bioinformatics Resources at a Glance

Lecture 1 MODULE 3 GENE EXPRESSION AND REGULATION OF GENE EXPRESSION. Professor Bharat Patel Office: Science 2, b.patel@griffith.edu.

Overview of Eukaryotic Gene Prediction

Gene Models & Bed format: What they represent.

Genomes and SNPs in Malaria and Sickle Cell Anemia

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

The world of non-coding RNA. Espen Enerly

Final Project Report

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

GenBank, Entrez, & FASTA

1 Mutation and Genetic Change

Gene Finding CMSC 423

How To Understand How Gene Expression Is Regulated

Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure enzymes control cell chemistry ( metabolism )

Searching Nucleotide Databases

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

Module 3 Questions. 7. Chemotaxis is an example of signal transduction. Explain, with the use of diagrams.

Lecture Series 7. From DNA to Protein. Genotype to Phenotype. Reading Assignments. A. Genes and the Synthesis of Polypeptides

Genome Viewing. Module 2. Using Genome Browsers to View Annotation of the Human Genome

13.4 Gene Regulation and Expression

A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques

Introduction to Bioinformatics 3. DNA editing and contig assembly

Name Class Date. Figure Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.

Guide for Bioinformatics Project Module 3

BUDAPEST: Bioinformatics Utility for Data Analysis of Proteomics using ESTs

Human Genome Organization: An Update. Genome Organization: An Update

PrimePCR Assay Validation Report

Chapter 18 Regulation of Gene Expression

Transcription and Translation of DNA

Protein Synthesis How Genes Become Constituent Molecules

Gene Regulation -- The Lac Operon

Protein Protein Interactions (PPI) APID (Agile Protein Interaction DataAnalyzer)

DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!!

mrna EDITING Watson et al., BIOLOGIA MOLECOLARE DEL GENE, Zanichelli editore S.p.A. Copyright 2005

Linear Sequence Analysis. 3-D Structure Analysis

Yale Pseudogene Analysis as part of GENCODE Project

CCR Biology - Chapter 8 Practice Test - Summer 2012

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Control of Gene Expression

Concluding lesson. Student manual. What kind of protein are you? (Basic)

GENE REGULATION. Teacher Packet

Teaching Bioinformatics to Undergraduates

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

Coding sequence the sequence of nucleotide bases on the DNA that are transcribed into RNA which are in turn translated into protein

AP BIOLOGY 2009 SCORING GUIDELINES

Lecture 3: Mutations

INTERNATIONAL CONFERENCE ON HARMONISATION OF TECHNICAL REQUIREMENTS FOR REGISTRATION OF PHARMACEUTICALS FOR HUMAN USE Q5B

( TUTORIAL. (July 2006)

PrimePCR Assay Validation Report

Basic Concepts of DNA, Proteins, Genes and Genomes

Protein Synthesis. Page 41 Page 44 Page 47 Page 42 Page 45 Page 48 Page 43 Page 46 Page 49. Page 41. DNA RNA Protein. Vocabulary

MUTATION, DNA REPAIR AND CANCER

Complex multicellular organisms are produced by cells that switch genes on and off during development.

Biological Databases and Protein Sequence Analysis

Molecular Genetics. RNA, Transcription, & Protein Synthesis

The sequence of bases on the mrna is a code that determines the sequence of amino acids in the polypeptide being synthesized:

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

A Primer of Genome Science THIRD

Frequently Asked Questions Next Generation Sequencing

Structure and Function of DNA

An Overview of Cells and Cell Research

RNA & Protein Synthesis

Lecture Outline. Introduction to Databases. Introduction. Data Formats Sample databases How to text search databases. Shifra Ben-Dor Irit Orr

Pairwise Sequence Alignment

4. DNA replication Pages: Difficulty: 2 Ans: C Which one of the following statements about enzymes that interact with DNA is true?

Integration of data management and analysis for genome research

From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains

Sample Questions for Exam 3

NO CALCULATORS OR CELL PHONES ALLOWED

Problem Set 3 KEY

Name: Date: Period: DNA Unit: DNA Webquest

org.rn.eg.db December 16, 2015 org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.

Human Genome and Human Genome Project. Louxin Zhang

Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in

Biological Sequence Data Formats

DNA Sequence Alignment Analysis

Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals

An Introduction to Bioinformatics Algorithms Gene Prediction

Analisi in silicoe relazione tra enterotossine stafilococciche e tossine ipotetiche

Recombinant DNA Technology

Activity 7.21 Transcription factors

BIOL 3200 Spring 2015 DNA Subway and RNA-Seq Data Analysis

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Bioinformatics Grid - Enabled Tools For Biologists.

13.2 Ribosomes & Protein Synthesis

2. The number of different kinds of nucleotides present in any DNA molecule is A) four B) six C) two D) three

To be able to describe polypeptide synthesis including transcription and splicing

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Figure 1: Genome sizes of different organisms.

Homologous Gene Finding with a Hidden Markov Model

Gene Finding and HMMs

Bio 102 Practice Problems Recombinant DNA and Biotechnology

Transcription:

Introduction to Genome Annotation AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT AGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGT GCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATG AGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTGCTTGAGT TCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTG GGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGC CGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGCTAGTAT ATCTCTTTCTCTGTCGTGCTGCTTGAGATCGTTCGTTTTTTTATG GATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTC AGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGA AGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGC

What is Annotation? dictionary definition of to annotate : to make or furnish critical or explanatory notes or comment some of what this includes for genomics gene product names functional characteristics of gene products physical characteristics of gene/protein/genome overall metabolic profile of the organism elements of the annotation process gene finding homology searches functional assignment ORF management data availability manual vs. automatic automatic = computer makes the decisions good on easy ones bad on hard ones manual = human makes the decisions highest quality **Due to the VOLUMES of genome data today, most genome projects are annotated primarily using automated methods with limited manual annotation

Annotation pipeline Generation of Open Reading Frames Homology Searches Putative ID Frameshift Detection Ambiguity Report Role Assignment Metabolic Pathways Gene Families DNA Motifs Regulatory Elements Repetitive Sequences Comparative Genomics

Genome Structure Prokaryote Intergenic Region Eukaryote Monocistronic Gene Intergenic Region Polycistronic Genes in an Operon Gene Intergenic Region Gene Intergenic Region Gene

Prokaryotic Gene Structure and Transcript Processing http://pps00.cryst.bbk.ac.uk/course/section6/henryb/genestrp.htm

Eukaryotic Gene Structure and Transcript Processing

Structural Annotation: Finding the Genes in Genomic DNA Two main types of data used in defining gene structure: Prediction based: algorithms designed to find genes/gene structures based on nucleotide sequence and composition Sequence similarity (DNA and protein): alignment to mrna sequences (ESTs) and proteins from the same species or related species; identification of domains and motifs

Finding Genes (ORFs) Gene finders a programs that can identify genes computationally Running a Gene-finder is a two-part process 1) Train Gene finder for the organism you have sequenced. 2) Run the trained Gene finder on the completed sequence.

Candidate Genes 6-frame ORF map Stop codons (TAA, TAG, TGA) (long hash marks) Start codons (ATG, GTG, TTG) (short hash marks) +3 +2 +1-1 -2-3 Minimum ORF Length ORFs over minimum length highlighted

Annotating ORFs Possible translations represented by arrows, moving from start to stop, the dotted line represents an ORF with no start site. Glimmer chooses the set of likely genes. ORF00001 ORF00002 ORF00003 ORF00004

Eukaryotic Gene Finding Identifying the protein coding region of genes AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG Gene finder (many different programs) AAAGC ATG CAT TTA ACG A GT GCATC AG GA CTC CAT ACG TAA TGCCG *This is a eukaryotic gene as evidenced by the intron

Signals Within DNA Splice sites to identify intron/exon junctions Transcription start and stop codons Promoter regions PolyA signals

Experimental Evidence DNA sequence evidence: Transcript sequence (EST, full length cdna, other expression types); more restrictive in evolutionary terms Protein Evidence: alignment to protein that suggests structural similarity at the amino acid level; can be more distant evolutionarily

Experimental Evidence Transcript evidence: -Demonstrates gene is transcribed -Delineates exon boundaries -Defines splice sites and alternative transcripts -If EST based, indicates expression patterns

Functional Assignments Name Descriptive common name for the protein, with as much specificity as the evidence supports; gene symbol. Role Describe what the protein is doing in the cell and why. Associated information: Supporting evidence:domain and motifs EC number if protein is an enzyme. Paralogous family membership.

Evidence for Gene Function PROSITE Motifs collection of protein motifs associated with active sites, binding sites, etc. help in classifying genes into functional families when HMMs for that family have not been built InterPro Brings together HMMs (both TIGR and Pfam) Prosite motifs and other forms of motif/domain clustering Results in motif signatures for families or functions GO terms have been assigned to many of these

Sequence Alignments Compare sequence against other databases

Gene function evidence

Functional Annotation: Gene Product Names Gene Name Assignment: Based on similarity to known proteins in nraa database Categories: Known or Putative: Identical or strong similarity to documented gene(s) in Genbank or has high similarity to a Pfam domain; e.g. kinase, Rubisco Expressed Protein: Only match is to an EST with an unknown function; thus have confirmation that the gene is expressed but still do not know what the gene does Hypothetical Protein: Predicted solely by gene prediction programs and matches another hypothetical or expressed protein Hypothetical Protein: Predicted solely by gene prediction programs; no database match

Annotation example A good example of this is seen with transporters, what you ll see: Multiple hits to a specific type of transporter -The substrate identified for the proteins your protein matches are not all the same, but fall into a group, for example they are all sugars. Give the protein a name with specific function but a more general substrate specificity: sugar ABC transporter, permease protein Sometimes it will not be possible to identify particular substrate group, in that case: ABC transporter, permease protein

Automated Annotation is Not a Solved Problem What you are getting is output from a series of prediction tools or alignment programs Manual curation is often used to assess various types of evidence and improve upon automated gene calls and alignment output Ultimately, experimental verification is the only way to be sure that a gene structure is correct

Structural Annotation: Graphic Viewer Annotation Station Sequence Database Hits Top: Protein matches Bottom: EST matches Not shown graphically: gene name, nucleotide and protein sequence, MW, pi, organellar targeting sequence, membrane spanning regions, other domains. Gene Predictions Annotated Gene Top: editing panel Bottom: final curation Splice site predictions: red: acceptor sites blue: donor sites Screenshot of a component within Neomorphic s annotation station: www.neomorphic.com

Features Typically Resolved During Manual Annotation incorrect exon boundaries merged, split, missing genes missing untranslated regions (UTRs) missing alternative splicing isoform annotations degenerate transposons annotated as proteincoding genes

Increasing Complexity of Genome Annotation # Genes bp Mycoplasma pulmonis 780 964,000 Escherichia coli K-12 Saccharomyces cerevisiae Plasmodium falciparum Caenorhabditis elegans Drosophila melanogaster Arabidopsis thaliana Fugu rubripes Homo sapiens 4,300 6,300 5,400 19,000 16,000 25,000 35,000 30,000 4,641,000 12,100,000 22,850,000 97,000,000 120,000,000 115,400,000 365,000,000 2,910,000,000 Decrease in gene density and the presence of more, larger introns

Caveats of Genome Annotation -Greatly impacted by the quality of the sequence; the impact of draft sequencing on whole genome annotation has yet to be seen by Joe/Jane Scientist. There will be disappointment when the research communities realize that they don t have the gold standard of sequence as present in Arabidopsis and rice. -Annotation is challenging, highly UNDER-estimated in difficulty, highly UNDERvalued until a community goes to use its genome sequence -Annotation can be done to high accuracy on a single gene level by single investigators with expertise in gene families. The challenge is how to extrapolate this to the whole genome -Blends of automated, semi-automated, and manual annotation is perhaps the best way to approach genomes in which there are not large communities -Iterative, never perfect, can always be improved with new evidence and improved algorithms