Improving MAKER Gene Annotations in Grasses through the Use of GC Specific Hidden Markov Models



1 Improving MAKER Gene Annotations in Grasses through the Use of GC Specific Hidden Markov Models
Megan Bowman, Childs Lab
Bioinformatics Seminar, 22 April 2015

2 Outline
- GC content in plant genomes
- Codon usage
- Bioinformatics strategy
- MAKER
- Creation of GC specific HMM training datasets
- Results
- Ongoing work

3 GC content of plant genomes
- GC content varies among species, and phylogeny impacts variation
- Recombination, expression level and replication may correlate with GC content
- Positive selection should result in high levels of codon bias and low rates of synonymous substitution
- Bimodal GC distribution in grasses, 5′-3′ GC gradient
- Plants are much more complicated: monocots have higher GC content than dicots, and they show variation in codon usage
Serres-Giardi et al., Plant Cell 2012

4 Why look at codon usage?
Codon usage is linked to:
- Transcriptional selection
- Translation efficiency
- Gene expression
- GC content
- AA conservation
- Protein hydropathicity
- RNA stability
- Adaptation to habitat
(Jane Pulman)

5 CodonW
- Written by John Peden as part of his thesis
- Command line and GUI menu based
- Several issues
- Creates codon tables, determines optimal codons, COA analysis, CAI, Nc, CBI, Fop, GRAVY, GC, GC3
(Jane Pulman)

6 Codon table Jane Pulman

7 GC3 Wobble position in codons is a marker for GC richness
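As a rough illustration of the GC3 measure, the sketch below computes overall GC and third-position (wobble) GC for a coding sequence. The function names are hypothetical and are not taken from CodonW; the input is assumed to be an in-frame CDS.

```python
def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def gc3_fraction(cds):
    """GC fraction at the third (wobble) position of each complete codon.

    Assumes the CDS starts in frame; a trailing partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    third_positions = cds[2:usable:3]
    return gc_fraction(third_positions)
```

For example, `gc3_fraction("ATGGCC")` looks only at the third bases `G` and `C`, while `gc_fraction("ATGGCC")` uses all six bases.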

8 How does GC impact genome annotation?
- Very likely that we miss gene models
- Prediction is based on ab initio programs, which use HMMs
- Ab initio programs require training data sets for accurate gene prediction
- Nucleotide content of the training set influences prediction
- We may be missing gene models with varying GC content that have important functions in plants

9 Rapid divergence of codon usage patterns
Wang and Hickey, BMC Evolutionary Biology 2007, 7(Suppl 1):S6
(Jane Pulman)

10 Hidden Markov Models (HMMs)
"HMMs are the Legos of computational sequence analysis"
Sean Eddy, Nature Biotechnology 2004

11 AUGUSTUS HMM

12 Bioinformatics strategy
✓ Reannotation of the rice genome (Oryza sativa L. ssp. japonica cv. Nipponbare), MSU v7, using MAKER
✓ Calculate the distribution of GC content across all evidence-supported predicted gene models
✓ Develop code to create new HMM training sets with extreme high and low GC content
✓ Retrain HMMs for ab initio prediction programs with the high and low GC sets
✓ MAKER annotations using high and low GC HMMs
✓ Identify newly predicted gene models and improve existing gene models
✓ Combine original, high and low GC annotations into a new annotation

13 MAKER
What are the advantages of using MAKER for structural genome annotation?
- Masks repeats
- Aligns ESTs/RNA-seq assemblies and proteins to the genome
- Guides ab initio gene predictors (e.g. SNAP and AUGUSTUS)
- Combines these into a final annotation
- Provides quality values for predicted gene models (AED score)

14 Annotation Edit Distance (AED)
[Figure: annotations compared with collapsed evidence, with SN, SP and AED values for four cases: perfect accuracy, poor specificity, poor sensitivity, and poor specificity and sensitivity.]
(Kevin Childs)
Eilbeck et al BMC Bioinformatics :67
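To make the relationship between SN, SP and AED concrete, here is a minimal sketch that follows the usual nucleotide-level definition, AED = 1 - (SN + SP) / 2. It is illustrative only and is not MAKER's own implementation; the function names and the exon representation (1-based inclusive start/end pairs) are assumptions.

```python
def covered_positions(exons):
    """Set of nucleotide positions covered by (start, end) exons, 1-based inclusive."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def annotation_edit_distance(annotation_exons, evidence_exons):
    """AED = 1 - (SN + SP) / 2 at the nucleotide level.

    SN: fraction of evidence nucleotides covered by the annotation.
    SP: fraction of annotation nucleotides covered by the evidence.
    Returns 1.0 when either side is empty (no support at all).
    """
    annotation = covered_positions(annotation_exons)
    evidence = covered_positions(evidence_exons)
    if not annotation or not evidence:
        return 1.0
    overlap = len(annotation & evidence)
    sensitivity = overlap / len(evidence)
    specificity = overlap / len(annotation)
    return 1.0 - (sensitivity + specificity) / 2.0
```

An AED of 0 means the annotation and the evidence agree exactly; an AED of 1 means there is no overlapping evidence.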

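15 MAKER Default, MAKER Standard and MAKER Max
[Figure: gene models grouped by support (+ Evidence + Pfam, + Evidence - Pfam, - Evidence + Pfam, - Evidence - Pfam), with the MAKER Default, MAKER Standard and MAKER Max sets drawn over these groups.]
(Michael Campbell)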

16 Training HMMs and GC content
- Does GC content impact the training of the SNAP and AUGUSTUS HMMs?
- Would we predict new gene models if the average GC content of the training set used for HMM training was either higher or lower than normal?
- Can we add additional evidence to gene models using these new HMMs, or identify new gene models not previously found?

17 MAKER Annotation
- Reannotation of the MSU Rice v7 genome assembly
- Transcript evidence: fifty SRA datasets
- Read cleaning and QC with FastQC and Cutadapt
- Transcript assembly with Trinity
- Trained SNAP and AUGUSTUS HMMs
- MAKER with default parameters
- Iprscan, hmmscan (Pfam) for MAKER Standard

18 Improving MAKER models based on GC distribution
[Figure: GC distribution of MAKER Standard models; x-axis: GC content, y-axis: gene frequency.]

19 Creation of GC Specific HMM Training Datasets
- Perl scripts take the MAKER Standard GFF3 and the genome FASTA file as input
- Determine the GC percentage and bin GC to integers over a user-defined window on each side
- Create FASTA files and GFF3s of high and low GC MAKER Standard genes for HMM training
- The GFF3 is the input for the SNAP ab initio gene prediction program (via maker2zff)
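The talk's scripts were written in Perl and are not shown; the Python sketch below only illustrates the idea of binning each gene's GC percentage (gene plus a flanking window on each side) to an integer and then pulling out the low and high GC genes. All function names, the rounding rule and the cutoff parameters are assumptions made for illustration.

```python
def gc_percent(seq):
    """GC percentage (0-100) over unambiguous bases."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return 100.0 * sum(base in "GC" for base in counted) / len(counted)


def gene_gc_bin(chrom_seq, start, end, flank=0):
    """Integer GC bin for a gene at 1-based inclusive [start, end].

    `flank` bases are added on each side and clipped at the sequence ends.
    Rounding to the nearest integer is an assumption; the original scripts
    may bin differently.
    """
    left = max(0, start - 1 - flank)
    right = min(len(chrom_seq), end + flank)
    return int(round(gc_percent(chrom_seq[left:right])))


def split_by_gc(gene_bins, low_cutoff, high_cutoff):
    """Return (low_gc_genes, high_gc_genes) from a {gene_id: gc_bin} mapping.

    The cutoffs are hypothetical user-supplied thresholds.
    """
    low = sorted(g for g, b in gene_bins.items() if b <= low_cutoff)
    high = sorted(g for g, b in gene_bins.items() if b >= high_cutoff)
    return low, high
```

The gene IDs returned by `split_by_gc` would then be used to subset the GFF3 and FASTA records that go into HMM training.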

20 Workflow
1. Align transcripts to the genome with MAKER
2. Split the aligned transcripts into three tracks:
   - Lower GC: calculate the lower threshold GC content of the aligned transcripts
   - Regular GC: use transcripts with the regular GC distribution
   - Upper GC: calculate the upper threshold GC content of the aligned transcripts
3. Train SNAP and run MAKER on each track (lower GC content, regular, upper GC content)
4. Use the SNAP gene predictions from each track (lower GC content, regular GC distribution, upper GC content)
5. Train Augustus with the lower GC, regular GC and upper GC gene predictions
6. Run the MAKER annotation with the lower GC, regular GC and upper GC HMMs
7. Rerun the MAKER annotation with the regular GC HMMs and the predictions from the lower and upper GC HMMs; MAKER retains the best annotations

21 MAKER Annotation Comparative AED Curves
[Figure: comparative AED curves; x-axis: AED, y-axis: total number of transcripts; series: combo, high, low, original.]
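Assuming the curves are cumulative (the number of transcripts with an AED at or below each threshold), the points for one annotation set could be computed along these lines. The function name and the step size are hypothetical.

```python
def cumulative_aed_counts(aed_scores, step=0.05):
    """Return (threshold, count) pairs: transcripts with AED <= threshold.

    Thresholds run from 0 to 1 in increments of `step`.
    """
    n_steps = int(round(1.0 / step))
    # Round thresholds to avoid floating-point drift such as 0.15000000000000002.
    thresholds = [round(i * step, 10) for i in range(n_steps + 1)]
    return [(t, sum(1 for score in aed_scores if score <= t)) for t in thresholds]
```

Running this once per annotation set (original, low, high, combo) gives one curve each; a curve that rises faster at low AED indicates more transcripts that agree well with the evidence.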

22 Unique gene models
[Figures: AED Scores of New High GC Gene Models and AED Scores of New Low GC Gene Models; x-axis: AED score bins (from AED <= .25 through AED < 1 and AED = 1), y-axis: count.]
1598 new gene models, 25,168 improved models (decrease in AED score)

23 GC Distribution
[Figure: GC distribution; x-axis: GC content, y-axis: gene frequency; series: high, high only, low, low only, original.]

24 Codon usage original Jane Pulman

25 Codon usage low new models Jane Pulman

26 Codon usage high new models Jane Pulman

27 The effective number of codons
$$N_c = 2 + \frac{9}{\bar{F}_2} + \frac{1}{\bar{F}_3} + \frac{5}{\bar{F}_4} + \frac{3}{\bar{F}_6}$$
- $\bar{F}_i$ is the average homozygosity estimated for synonymous family (SF) type $i$.
- 2 is the Nc for Met and Trp, as each is always 1.
- 9, 1, 5 and 3 are the numbers of amino acids of each type; for example, there are 9 amino acids that have 2 synonymous codons.
(Jane Pulman)
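A minimal sketch of the effective number of codons calculation for a single coding sequence is given below, assuming the standard genetic code and the homozygosity estimate F = (n Σp² - 1) / (n - 1) for each amino acid. This is not CodonW's code; the function name is hypothetical, and the handling of missing families (raising an error) is a simplification.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# Number of synonymous codons per amino acid (stop codons excluded).
DEGENERACY = Counter(aa for aa in CODON_TABLE.values() if aa != "*")


def effective_number_of_codons(cds):
    """Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6 for an in-frame CDS."""
    cds = cds.upper()
    counts = defaultdict(Counter)
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        aa = CODON_TABLE.get(codon)  # None for codons with ambiguous bases
        if aa is not None and aa != "*":
            counts[aa][codon] += 1

    # Homozygosity F per amino acid, grouped by synonymous family size.
    f_by_family = defaultdict(list)
    for aa, codon_counts in counts.items():
        family_size = DEGENERACY[aa]
        n = sum(codon_counts.values())
        if family_size == 1 or n < 2:
            continue  # Met/Trp, or too few codons to estimate F
        sum_p2 = sum((c / n) ** 2 for c in codon_counts.values())
        f_by_family[family_size].append((n * sum_p2 - 1) / (n - 1))

    nc = 2.0  # contribution of Met and Trp
    for family_size, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        values = f_by_family.get(family_size)
        if not values:
            raise ValueError(f"no usable amino acids with {family_size} codons")
        mean_f = sum(values) / len(values)
        if mean_f <= 0:
            raise ValueError(f"non-positive homozygosity for family size {family_size}")
        nc += n_amino_acids / mean_f
    return min(nc, 61.0)  # Nc cannot exceed the 61 sense codons
```

When every amino acid uses a single codon, each F is 1 and the result is the minimum value of 20; values near 61 indicate nearly uniform use of synonymous codons.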

28 Nc Plot Original Jane Pulman

29 Nc Plot Low New Models Jane Pulman

30 Nc Plot High New Models Jane Pulman

31 Position on first and second axis for High and Low GC new models Jane Pulman

32 Ongoing work
- Functional annotation of new genes
- Orthologs of new genes in other species
- Paralogs of new genes
- Comparison of MAKER annotations with full-length cDNAs
- Comparison of results to randomized training sets
- Expression analysis of high and low GC genes
- Tissue-specific GC content and codon usage analyses

33 Conclusions
- High and low GC content gene models may be missed by current HMM training methods
- Developed a pipeline for creating high and low GC content training data sets for existing gene prediction programs
- MAKER reannotation of the MSU rice genome with high and low GC HMMs
- Identified over 1,500 new gene models specific to the high and low GC annotations
- Codon usage bias in high and low GC gene models
