Improving MAKER Gene Annotations in Grasses through the Use of GC Specific Hidden Markov Models


Megan Bowman
Childs Lab
Bioinformatics Seminar
22 April 2015

Outline
- GC content in plant genomes
- Codon usage
- Bioinformatics strategy
- MAKER
- Creation of GC specific HMM training datasets
- Results
- Ongoing work

GC content of plant genomes
- GC content varies among species, and phylogeny impacts the variation
- Recombination, expression level and replication may correlate with GC content
- Positive selection should result in high levels of codon bias and low rates of synonymous substitution
- Bimodal GC distribution in grasses, with a 5' to 3' GC gradient
- Plants are much more complicated: monocots have higher GC content than dicots, and they show variation in codon usage
Serres-Giardi et al., Plant Cell 2012

Why look at codon usage?
Codon usage is linked to:
- Transcriptional selection
- Translation efficiency
- Gene expression
- GC content
- AA conservation
- Protein hydropathicity
- RNA stability
- Adaptation to habitat
Jane Pulman

CodonW
- Written by John Peden as part of his thesis
- Command line and GUI menu based
- Several issues
- Creates codon tables, determines optimal codons, COA analysis, CAI, Nc, CBI, Fop, GRAVY, GC, GC3
Jane Pulman

Codon table Jane Pulman

GC3
GC content at the wobble (third) position of codons is a marker for GC richness
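As a rough illustration of what GC3 measures, the sketch below computes the fraction of G or C bases at the third position of each complete codon in an in-frame coding sequence. It is written in Python for illustration only; the function name `gc3` and the example sequence are hypothetical and do not come from the talk or from CodonW. It counts every codon, whereas tools may restrict the count to synonymous codons.

```python
def gc3(cds: str) -> float:
    """Fraction of G/C at the third position of each complete codon.

    Assumes `cds` is an in-frame coding sequence (hypothetical helper,
    not the CodonW implementation).
    """
    seq = cds.upper()
    # Ignore any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    # Third codon positions are string indices 2, 5, 8, ...
    thirds = [seq[i] for i in range(2, usable, 3)]
    if not thirds:
        raise ValueError("sequence has no complete codon")
    return sum(base in "GC" for base in thirds) / len(thirds)


if __name__ == "__main__":
    # Codons ATG, GCC, AAA have third bases G, C, A.
    print(gc3("ATGGCCAAA"))
```

A gene with a high value here is GC rich at the position where most synonymous changes occur, which is why GC3 is commonly used as a proxy for GC richness.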

How does GC impact genome annotation?
- Very likely that we miss gene models
- Prediction is based on ab initio programs (HMMs)
- Ab initio programs require training data sets for accurate gene prediction
- Nucleotide content of the training set influences prediction
- We may be missing gene models with varying GC content that have important functions in plants

Rapid divergence of codon usage patterns
Wang and Hickey, BMC Evolutionary Biology 2007, 7(Suppl 1):S6
Jane Pulman

Hidden Markov Models (HMMs)
"HMMs are the Legos of computational sequence analysis"
Sean Eddy, Nature Biotechnology 2004

AUGUSTUS HMM

Bioinformatics strategy
✓ Reannotation of rice genome (Oryza sativa L. ssp. japonica cv. Nipponbare) MSU v7 genome using MAKER
✓ Calculate distribution of GC content across all evidence-supported predicted gene models
✓ Develop code to create new HMM training sets with extreme high and low GC content
✓ Retrain HMMs for ab initio prediction programs with high and low GC sets
✓ MAKER annotations using high and low GC HMMs
✓ Identify newly predicted gene models and improve existing gene models
✓ Combine original, high and low GC annotations for new annotation

MAKER
What are the advantages of using MAKER for structural genome annotation?
- Masks repeats
- Aligns ESTs/RNA-seq assemblies and proteins to the genome
- Guides ab initio gene predictors (e.g. SNAP and AUGUSTUS)
- Combines these into a final annotation
- Provides quality values for predicted gene models (AED score)

Annotation Edit Distance (AED)

Annotation vs. collapsed evidence    SN     SP     AED
Perfect Accuracy                     1.0    1.0    0.0
Poor Specificity                     1.0    0.5    0.25
Poor Sensitivity                     0.5    1.0    0.25
Poor Specificity and Sensitivity     0.5    0.5    0.5

Kevin Childs
Eilbeck et al., BMC Bioinformatics 2009, 10:67
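The SN, SP and AED values on this slide are consistent with AED = 1 - (SN + SP) / 2, where SN is sensitivity and SP is specificity of the annotation against the evidence. The Python sketch below reproduces the four cases; the function name `aed` is a hypothetical helper, not part of MAKER.

```python
def aed(sn: float, sp: float) -> float:
    """Annotation Edit Distance from sensitivity (SN) and specificity (SP).

    Uses AED = 1 - (SN + SP) / 2; both inputs must lie in [0, 1].
    """
    if not (0.0 <= sn <= 1.0 and 0.0 <= sp <= 1.0):
        raise ValueError("SN and SP must lie in [0, 1]")
    return 1.0 - (sn + sp) / 2.0


if __name__ == "__main__":
    cases = [
        ("Perfect Accuracy", 1.0, 1.0),
        ("Poor Specificity", 1.0, 0.5),
        ("Poor Sensitivity", 0.5, 1.0),
        ("Poor Specificity and Sensitivity", 0.5, 0.5),
    ]
    for label, sn, sp in cases:
        print(f"{label}: SN={sn} SP={sp} AED={aed(sn, sp)}")
```

An AED of 0 means the annotation agrees exactly with the evidence, and an AED of 1 means there is no supporting overlap, so lower scores indicate better-supported gene models.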

[Figure: MAKER gene model sets (MAKER Default, MAKER Standard, MAKER Max) classified by evidence support (+/- Evidence) and Pfam domain content (+/- Pfam)]
Michael Campbell

Training HMMs and GC content
- Does GC content impact the training of the SNAP and AUGUSTUS HMMs?
- Would we predict new gene models if the average GC content of the training set used for HMM training was either higher or lower than normal?
- Can we add additional evidence to gene models using these new HMMs, or identify new gene models not previously found?

MAKER Annotation
- Reannotation of the MSU Rice v7 genome assembly
- Transcript evidence: fifty SRA datasets
- Read cleaning and QC with FastQC and Cutadapt
- Transcript assembly with Trinity
- Trained SNAP and AUGUSTUS HMMs
- MAKER with default parameters
- Iprscan and hmmscan against Pfam for MAKER Standard

Improving MAKER models based on GC distribution
[Figure: GC distribution of MAKER Standard models; x-axis: GC Content, y-axis: Gene Frequency]

Creation of GC Specific HMM Training Datasets
Perl scripts:
- Input: MAKER Standard GFF3 and genome FASTA file
- Determine GC percentage and bin GC to integers over a user-defined window on each side
- Create FASTA files and GFF3s of high and low GC MAKER Standard genes for HMM training
- GFF3 is the input for the SNAP ab initio gene prediction program (maker2zff)
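The GC calculation and integer binning can be sketched roughly as follows. This is a Python illustration, not the Perl scripts from the talk. It assumes that the "window on each side" means a flank of bases added before and after each gene, that gene coordinates are 1-based and inclusive as in GFF3, and that binning truncates to a whole percent. The names `gc_percent` and `gene_gc_bin` are hypothetical.

```python
def gc_percent(seq: str) -> float:
    """GC percentage of a sequence, ignoring ambiguous bases such as N."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence has no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def gene_gc_bin(chrom_seq: str, start: int, end: int, flank: int = 0) -> int:
    """Integer GC bin for a gene plus `flank` bases on each side.

    `start` and `end` are 1-based inclusive coordinates (GFF3 style).
    The flank is clipped at the ends of the chromosome sequence.
    """
    lo = max(0, start - 1 - flank)       # convert to 0-based slice start
    hi = min(len(chrom_seq), end + flank)
    return int(gc_percent(chrom_seq[lo:hi]))  # truncate to a whole percent


if __name__ == "__main__":
    chrom = "AAAA" + "GGCC" + "TTTT"      # toy sequence, gene at 5..8
    for flank in (0, 2, 4):
        print(flank, gene_gc_bin(chrom, 5, 8, flank))
```

Genes whose bin falls at either extreme of the distribution would then be written out as the high and low GC training sets.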

Workflow (three parallel tracks: lower GC | regular GC | upper GC)
1. Align transcripts to genome with MAKER
2. Calculate lower threshold GC content of aligned transcripts | Use transcripts with regular GC distribution | Calculate upper threshold GC content of aligned transcripts
3. Train SNAP/run MAKER with lower GC content | Train SNAP/run MAKER | Train SNAP/run MAKER with upper GC content
4. Use SNAP gene predictions with lower GC content | Use SNAP gene predictions with regular GC distribution | Use SNAP gene predictions with upper GC content
5. Train Augustus with lower GC gene predictions | Train Augustus with regular GC gene predictions | Train Augustus with upper GC gene predictions
6. MAKER annotation with lower GC HMMs | MAKER annotation with regular GC HMMs | MAKER annotation with upper GC HMMs
7. Rerun MAKER annotation with regular GC HMMs and predictions from lower and upper GC HMMs; MAKER retains the best annotations
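The threshold step of the workflow can be sketched as a simple selection of the extreme genes for the lower and upper GC tracks. This is a Python illustration under stated assumptions: the threshold values, gene identifiers and the function name `extreme_gc_sets` are hypothetical, and the talk does not state how the thresholds were chosen or exactly which transcripts make up the regular track, so only the two extreme sets are produced here.

```python
def extreme_gc_sets(gc_by_gene, lower, upper):
    """Return (low_ids, high_ids) for genes at or beyond the GC thresholds.

    `gc_by_gene` maps a gene identifier to its GC percentage.
    `lower` and `upper` are user-chosen thresholds (hypothetical values).
    """
    if lower >= upper:
        raise ValueError("lower threshold must be below upper threshold")
    low_ids = sorted(g for g, gc in gc_by_gene.items() if gc <= lower)
    high_ids = sorted(g for g, gc in gc_by_gene.items() if gc >= upper)
    return low_ids, high_ids


if __name__ == "__main__":
    # Toy GC percentages; identifiers are made up for the example.
    example = {"geneA": 38.0, "geneB": 52.5, "geneC": 71.0, "geneD": 44.0}
    low, high = extreme_gc_sets(example, lower=40.0, upper=65.0)
    print("low GC training set:", low)
    print("high GC training set:", high)
```

Each returned list would feed its own SNAP and Augustus training run, giving one HMM per track before the final combined MAKER run.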

MAKER Annotation Comparative AED Curves
[Figure: AED curves; x-axis: AED (0.00 to 1.00), y-axis: Total Number of Transcripts (0 to 40000); series: combo, high, low, original]
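A curve of this kind can be tabulated by counting, for each AED cutoff, how many transcripts score at or below it. The Python sketch below assumes the plotted curves are cumulative counts, which the slide does not state explicitly; the function name `cumulative_aed_counts` and the toy scores are hypothetical.

```python
from bisect import bisect_right


def cumulative_aed_counts(aed_scores, cutoffs):
    """Number of transcripts with AED <= each cutoff.

    Sorting once and using binary search keeps this fast for the
    tens of thousands of transcripts in a genome annotation.
    """
    ordered = sorted(aed_scores)
    return [bisect_right(ordered, cutoff) for cutoff in cutoffs]


if __name__ == "__main__":
    scores = [0.0, 0.1, 0.25, 0.5, 1.0]          # toy AED scores
    cutoffs = [0.0, 0.25, 0.5, 0.75, 1.0]
    for cutoff, count in zip(cutoffs, cumulative_aed_counts(scores, cutoffs)):
        print(cutoff, count)
```

Under this reading, an annotation whose curve rises faster at low AED has more well-supported transcripts, which is how the combo, high, low and original runs would be compared.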

Unique gene models
[Figure: two bar charts, "AED Scores of New High GC Gene Models" and "AED Scores of New Low GC Gene Models"; x-axis: AED Scores, binned as AED <= .25, .25 < AED <= .50, .50 < AED <= .75, .75 < AED < 1, AED = 1; y-axis: Count]
1598 new gene models, 25,168 improved models (decrease in AED score)
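The five AED categories used on the bar charts can be reproduced with a small binning function. This Python sketch uses the bin labels from the slide; the function names `aed_bin` and `count_bins` and the toy scores are hypothetical.

```python
from collections import Counter

# Bin labels in the order shown on the slide's x-axis.
BIN_LABELS = [
    "AED <= .25",
    ".25 < AED <= .50",
    ".50 < AED <= .75",
    ".75 < AED < 1",
    "AED = 1",
]


def aed_bin(score: float) -> str:
    """Return the slide's category label for one AED score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("AED must lie in [0, 1]")
    if score <= 0.25:
        return BIN_LABELS[0]
    if score <= 0.50:
        return BIN_LABELS[1]
    if score <= 0.75:
        return BIN_LABELS[2]
    if score < 1.0:
        return BIN_LABELS[3]
    return BIN_LABELS[4]


def count_bins(scores):
    """Count scores per category, keeping empty categories as zero."""
    counts = Counter(aed_bin(s) for s in scores)
    return {label: counts.get(label, 0) for label in BIN_LABELS}


if __name__ == "__main__":
    print(count_bins([0.1, 0.25, 0.3, 0.8, 1.0, 1.0]))
```

Keeping AED = 1 as its own category separates models with no evidence support at all from those with only weak support.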

GC Distribution
[Figure: GC distribution; x-axis: GC Content, y-axis: Gene Frequency; series: high, high only, low, low only, original]

Codon usage original Jane Pulman

Codon usage low new models Jane Pulman

Codon usage high new models Jane Pulman

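The effective number of codons
$$N_c = 2 + \frac{9}{\bar{F}_2} + \frac{1}{\bar{F}_3} + \frac{5}{\bar{F}_4} + \frac{3}{\bar{F}_6}$$
$\bar{F}_i$ is the average homozygosity estimated for synonymous family (SF) type $i$, that is, for the amino acids with $i$ synonymous codons. The constant 2 is the contribution of Met and Trp, since each has a single codon and its $N_c$ is always 1. The coefficients 9, 1, 5 and 3 are the numbers of amino acids of each type; for example, 9 amino acids have 2 synonymous codons.
Jane Pulman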
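The effective number of codons can be evaluated directly once the average homozygosities are known. The Python sketch below uses the standard form of the formula, Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, and takes the four averages as inputs rather than estimating them from codon counts. The function name `effective_number_of_codons` is hypothetical and is not CodonW's implementation.

```python
def effective_number_of_codons(f2: float, f3: float, f4: float, f6: float) -> float:
    """Nc from the average homozygosity of each synonymous family type.

    f2, f3, f4, f6 are the averages for amino acids with 2, 3, 4 and 6
    synonymous codons; each must lie in (0, 1].
    """
    for name, value in (("f2", f2), ("f3", f3), ("f4", f4), ("f6", f6)):
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} must lie in (0, 1]")
    # 2 covers Met and Trp; 9, 1, 5, 3 are the amino acid counts per type.
    return 2.0 + 9.0 / f2 + 1.0 / f3 + 5.0 / f4 + 3.0 / f6


if __name__ == "__main__":
    # Extreme bias: one codon per amino acid, so every homozygosity is 1.
    print(effective_number_of_codons(1.0, 1.0, 1.0, 1.0))
    # No bias: synonymous codons used equally, so F_i = 1 / i.
    print(effective_number_of_codons(1 / 2, 1 / 3, 1 / 4, 1 / 6))
```

The two calls bracket the range of the statistic: 20 when each amino acid uses a single codon, and 61 when all synonymous codons are used equally.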

Nc Plot Original Jane Pulman

Nc Plot Low New Models Jane Pulman

Nc Plot High New Models Jane Pulman

Position on first and second axis for High and Low GC new models Jane Pulman

Ongoing work
- Functional annotation of new genes
- Orthologs of new genes in other species
- Paralogs of new genes
- Comparison of MAKER annotations with full-length cDNAs
- Comparison of results to randomized training sets
- Expression analysis of high and low GC genes
- Tissue-specific GC content and codon usage analyses

Conclusions
- High and low GC content gene models may be missed by current HMM training methods
- Developed a pipeline for creating high and low GC content training data sets for existing gene prediction programs
- MAKER reannotation of the MSU rice genome with high and low GC HMMs
- Identified over 1,500 new gene models specific to the high and low GC annotations
- Codon usage bias in high and low GC gene models