Improving MAKER Gene Annotations in Grasses through the Use of GC Specific Hidden Markov Models Megan Bowman Childs Lab Bioinformatics Seminar 22 April 2015
Outline GC content in plant genomes Codon usage Bioinformatics strategy MAKER Creation of GC specific HMM training datasets Results Ongoing work
GC content of plant genomes GC content varies among species and phylogeny impacts variation Recombination, expression level and replication may correlate with GC content Positive selection should result in high levels of codon bias and low rates of synonymous substitution Bimodal GC distribution in grasses, 5' to 3' GC gradient Plants are much more complicated; monocots have higher GC content than dicots and they show variation in codon usage Serres-Giardi et al., Plant Cell 2012
Why look at codon usage? Codon usage is linked to: Transcriptional selection Translation efficiency Gene expression GC content AA conservation Protein hydropathicity RNA stability Adaptation to habitat Jane Pulman
CodonW Written by John Peden as part of his thesis Command line and GUI menu based Several issues Creates codon tables, determines optimal codons, COA analysis, CAI, Nc, CBI, Fop, GRAVY, GC, GC3. Jane Pulman
Codon table Jane Pulman
GC3 Wobble position in codons is a marker for GC richness
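As a minimal sketch of what GC3 measures, the Python below computes the fraction of G or C at third codon positions and, for comparison, the overall GC fraction. The function names `gc3` and `gc` are illustrative and are not part of CodonW; the sketch assumes the input is an in-frame coding sequence and simply ignores any trailing partial codon.

```python
def gc(seq: str) -> float:
    """Overall fraction of G or C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def gc3(cds: str) -> float:
    """Fraction of G or C at third (wobble) codon positions.

    Assumes the sequence starts in frame; a trailing partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # drop any incomplete final codon
    third_positions = cds[2:usable:3]  # every third base, starting at index 2
    if not third_positions:
        raise ValueError("sequence shorter than one codon")
    return sum(base in "GC" for base in third_positions) / len(third_positions)


# Codons: ATG GCC GCG TAA, third positions G, C, G, A
example = "ATGGCCGCGTAA"
print(gc3(example))  # 0.75
print(gc(example))   # 7 of 12 bases are G or C
```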
How does GC impact genome annotation? Very likely that we miss gene models Prediction based on ab initio programs = HMMs Ab initio programs require training data sets for accurate gene prediction Nucleotide content of training set influences prediction Missing gene models with varying GC content that may have important functions in plants
Rapid divergence of codon usage patterns Wang and Hickey, BMC Evolutionary Biology 2007, 7(Suppl 1):S6 Jane Pulman
Hidden Markov Models (HMMs) HMMs are the Legos of computational sequence analysis Sean Eddy, Nature Biotechnology 2004
AUGUSTUS HMM
Bioinformatics strategy
Reannotation of rice genome (Oryza sativa L. ssp japonica cv. Nipponbare) MSU v7 genome using MAKER
Calculate distribution of GC content across all evidence supported predicted gene models
Develop code to create new HMM training sets with extreme high and low GC content
Retrain HMMs for ab initio prediction programs with high and low GC sets
MAKER annotations using high and low GC HMMs
Identify newly predicted gene models and improve existing gene models
Combine original, high and low GC annotations for new annotation
MAKER What are advantages to using MAKER for structural genome annotation? Masks repeats Aligns ESTs/RNA-seq assemblies and proteins to genome Guides ab initio gene predictors (Ex: SNAP and AUGUSTUS) Combines these into a final annotation Provides quality values for predicted gene models (AED Score)
Annotation Edit Distance (AED)
Collapsed Evidence                  SN    SP    AED
Perfect Accuracy                    1.0   1.0   0.0
Poor Specificity                    1.0   0.5   0.25
Poor Sensitivity                    0.5   1.0   0.25
Poor Specificity and Sensitivity    0.5   0.5   0.5
Kevin Childs; Eilbeck et al., BMC Bioinformatics 2009, 10:67
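The SN, SP and AED values on this slide are consistent with AED = 1 - (SN + SP) / 2. The sketch below reproduces the four cases; the function names are illustrative and are not MAKER's API. The set-based helper assumes that sensitivity and specificity are computed from overlapping nucleotide positions, which is an assumption made here for illustration.

```python
def aed(sn: float, sp: float) -> float:
    """Annotation edit distance from sensitivity (SN) and specificity (SP)."""
    if not (0.0 <= sn <= 1.0 and 0.0 <= sp <= 1.0):
        raise ValueError("SN and SP must lie in [0, 1]")
    return 1.0 - (sn + sp) / 2.0


def aed_from_positions(predicted: set, evidence: set) -> float:
    """AED from sets of nucleotide positions (assumed definition of SN and SP)."""
    if not predicted or not evidence:
        return 1.0  # no overlap is possible, so treat as no support
    shared = len(predicted & evidence)
    sn = shared / len(evidence)   # fraction of the evidence that is annotated
    sp = shared / len(predicted)  # fraction of the annotation that is supported
    return aed(sn, sp)


cases = {
    "Perfect Accuracy": (1.0, 1.0),
    "Poor Specificity": (1.0, 0.5),
    "Poor Sensitivity": (0.5, 1.0),
    "Poor Specificity and Sensitivity": (0.5, 0.5),
}
for name, (sn, sp) in cases.items():
    print(name, aed(sn, sp))
```

Lower AED means closer agreement between a gene model and its evidence, so 0.0 is a perfect match and 1.0 means no supporting evidence.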
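MAKER-Standard [Figure: gene models grouped by evidence support and Pfam domain content (+ Evidence + Pfam, + Evidence - Pfam, - Evidence + Pfam, - Evidence - Pfam), showing the MAKER Default, MAKER Standard and MAKER Max sets] Michael Campbell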
Training HMMs and GC content Does GC content impact the training of the SNAP and AUGUSTUS HMMs? Would we predict new gene models if the average GC content of the training set used for HMM training was either higher or lower than normal? Can we add additional evidence to gene models using these new HMMs, or identify new gene models not previously found?
MAKER Annotation Reannotation of the MSU Rice v7 genome assembly Transcript Evidence Fifty SRA datasets Read cleaning and QC with FastQC and Cutadapt Transcript assembly with Trinity Trained SNAP and AUGUSTUS HMMs MAKER with default parameters Iprscan hmmscan Pfam for MAKER standard
Improving MAKER models based on GC distribution [Figure: GC distribution of MAKER Standard models; x-axis: GC content (20 to 80), y-axis: gene frequency]
Creation of GC Specific HMM Training Datasets Perl Scripts: MAKER Standard GFF3 and genome FASTA file Determines GC percentage and bins GC to integers over a user defined window on each side Creates FASTA files and GFF3s of high and low GC MAKER Standard genes for HMM training GFF3 is input for SNAP ab initio gene prediction program (maker2zff)
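The binning step can be sketched in Python as follows. This is not the Perl code used in the project: the function names, the rounding rule, the example cutoffs of 35 and 65, and the toy sequences are all assumptions for illustration. Gene coordinates are treated as 1-based and inclusive, as in GFF3.

```python
def gene_region(chrom_seq: str, start: int, end: int, flank: int) -> str:
    """Gene sequence plus a window of `flank` bases on each side.

    `start` and `end` are 1-based inclusive (GFF3 style); the window is
    clipped at the ends of the chromosome.
    """
    left = max(0, start - 1 - flank)
    right = min(len(chrom_seq), end + flank)
    return chrom_seq[left:right]


def gc_bin(seq: str) -> int:
    """GC percentage rounded to an integer bin; non-ACGT bases are ignored."""
    seq = seq.upper()
    acgt = sum(base in "ACGT" for base in seq)
    if acgt == 0:
        raise ValueError("no unambiguous bases in sequence")
    return round(100 * sum(base in "GC" for base in seq) / acgt)


def split_by_gc(bins: dict, low_cutoff: int, high_cutoff: int):
    """Names of genes at or below the low cutoff and at or above the high cutoff."""
    low = sorted(name for name, b in bins.items() if b <= low_cutoff)
    high = sorted(name for name, b in bins.items() if b >= high_cutoff)
    return low, high


# Toy data: one chromosome and three hypothetical genes (start, end)
chrom = "AAAAGGGGCCCCTTTT" + "ATATATATAT" + "ATGCATGCAT"
genes = {"gene_high": (5, 12), "gene_low": (17, 26), "gene_mid": (27, 36)}
bins = {name: gc_bin(gene_region(chrom, s, e, flank=0)) for name, (s, e) in genes.items()}
low_genes, high_genes = split_by_gc(bins, low_cutoff=35, high_cutoff=65)
print(bins, low_genes, high_genes)
```

The selected gene names would then be used to subset the MAKER Standard GFF3 and FASTA files before conversion for SNAP training.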
Workflow
Align Transcripts to Genome with MAKER, then follow three parallel tracks:
Lower GC track: Calculate Lower Threshold GC Content of Aligned Transcripts; Train SNAP/Run MAKER with Lower GC Content; Use SNAP Gene Predictions with Lower GC Content; Train Augustus with Lower GC Gene Predictions; MAKER Annotation with Lower GC HMMs
Regular GC track: Use Transcripts with Regular GC Distribution; Train SNAP/Run MAKER; Use SNAP Gene Predictions with Regular GC Distribution; Train Augustus with Regular GC Gene Predictions; MAKER Annotation with Regular GC HMMs
Upper GC track: Calculate Upper Threshold GC Content of Aligned Transcripts; Train SNAP/Run MAKER with Upper GC Content; Use SNAP Gene Predictions with Upper GC Content; Train Augustus with Upper GC Gene Predictions; MAKER Annotation with Upper GC HMMs
Final step: Rerun MAKER Annotation with Regular GC HMMs and Predictions from Lower and Upper GC HMMs; MAKER Retains the Best Annotations
MAKER Annotation [Figure: Comparative AED Curves; x-axis: AED (0.00 to 1.00), y-axis: total number of transcripts (0 to 40000); series: combo, high, low, original]
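Assuming the curves are cumulative (the number of transcripts with AED at or below each value), a comparison like this one can be computed from per-transcript AED scores as sketched below. The function name and the toy scores are illustrative only and are not the data behind the figure.

```python
from bisect import bisect_right


def cumulative_aed_curve(aed_scores, thresholds):
    """Number of transcripts with AED <= t for each threshold t."""
    ordered = sorted(aed_scores)
    # bisect_right returns how many sorted values are <= the threshold
    return [bisect_right(ordered, t) for t in thresholds]


thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
toy_scores = {
    "original": [0.0, 0.1, 0.3, 0.5, 1.0],
    "combo": [0.0, 0.05, 0.2, 0.4, 0.6, 1.0],
}
for label, scores in toy_scores.items():
    print(label, cumulative_aed_curve(scores, thresholds))
```

A curve that rises faster at low AED indicates an annotation set in which more transcripts agree closely with their evidence.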
Unique gene models [Figure, two panels: AED Scores of New High GC Gene Models; AED Scores of New Low GC Gene Models; x-axis: AED score bins (AED <= .25, .25 < AED <= .50, .50 < AED <= .75, .75 < AED < 1, AED = 1), y-axis: count] 1598 new gene models, 25,168 improved models (decrease in AED score)
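The AED categories used on this slide can be assigned with a small helper such as the one below. This is a sketch with toy scores; the function name `aed_bin` is illustrative and the counts it prints are not the slide's data.

```python
from collections import Counter

AED_BINS = [
    "AED <= .25",
    ".25 < AED <= .50",
    ".50 < AED <= .75",
    ".75 < AED < 1",
    "AED = 1",
]


def aed_bin(score: float) -> str:
    """Assign an AED score to one of the five categories."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("AED must lie in [0, 1]")
    if score == 1.0:
        return AED_BINS[4]
    if score <= 0.25:
        return AED_BINS[0]
    if score <= 0.50:
        return AED_BINS[1]
    if score <= 0.75:
        return AED_BINS[2]
    return AED_BINS[3]


toy_scores = [0.0, 0.25, 0.26, 0.5, 0.75, 0.9, 1.0]
counts = Counter(aed_bin(s) for s in toy_scores)
for label in AED_BINS:
    print(label, counts[label])
```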
GC Distribution [Figure: GC distribution; x-axis: GC content (20 to 80), y-axis: gene frequency; series: high, high only, low, low only, original]
Codon usage original Jane Pulman
Codon usage low new models Jane Pulman
Codon usage high new models Jane Pulman
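The effective number of codons
$$N_c = 2 + \frac{9}{\bar{F}_2} + \frac{1}{\bar{F}_3} + \frac{5}{\bar{F}_4} + \frac{3}{\bar{F}_6}$$
$\bar{F}_i$ is the average homozygosity estimated for synonymous family (SF) type $i$. The constant 2 is the Nc contribution of Met and Trp, whose Nc is always 1 each. 9, 1, 5 and 3 are the numbers of amino acids of each SF type; for example, there are 9 amino acids that have 2 synonymous codons. Jane Pulman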
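A minimal Python sketch of this calculation is given below, assuming the standard genetic code and the usual homozygosity estimate F = (n Σp² - 1) / (n - 1) for an amino acid observed n times with codon frequencies p. It is not CodonW's implementation: the function name is illustrative, amino acids seen fewer than twice are skipped, and the sketch raises an error when a synonymous family class has no usable data instead of applying any correction.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous codon families, stop codons excluded
FAMILIES = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    if aa != "*":
        FAMILIES[aa].append(codon)


def effective_number_of_codons(cds_list):
    """Nc for a set of in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TO_AA:  # skips codons with ambiguous bases
                counts[codon] += 1

    # Homozygosity F for each amino acid, grouped by family size
    f_by_size = defaultdict(list)
    for aa, codons in FAMILIES.items():
        size = len(codons)
        n = sum(counts[c] for c in codons)
        if size == 1 or n < 2:  # Met and Trp are covered by the constant 2
            continue
        sum_p2 = sum((counts[c] / n) ** 2 for c in codons)
        f_by_size[size].append((n * sum_p2 - 1) / (n - 1))

    nc = 2.0
    for size, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        values = f_by_size.get(size)
        if not values:
            raise ValueError(f"no usable amino acids with {size} synonymous codons")
        mean_f = sum(values) / len(values)
        if mean_f <= 0:
            raise ValueError(f"homozygosity not estimable for family size {size}")
        nc += n_amino_acids / mean_f
    return nc


# Extreme bias: one codon per amino acid, so every F is 1 and Nc is 20
biased = "".join(codons[0] * 2 for codons in FAMILIES.values())
print(effective_number_of_codons([biased]))
```

Nc ranges from 20, when a single codon is used for each amino acid, to about 61, when all synonymous codons are used equally.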
Nc Plot Original Jane Pulman
Nc Plot Low New Models Jane Pulman
Nc Plot High New Models Jane Pulman
Position on first and second axis for High and Low GC new models Jane Pulman
Ongoing work Functional annotation of new genes Orthologs of new genes in other species Paralogs of new genes Comparison of MAKER annotations with full length cDNAs Comparison of results to randomized training sets Expression analysis of high and low GC genes Tissue specific GC content and codon usage analyses
Conclusions High and low GC content gene models may be missed by using current HMM training methods Developed a pipeline for the creation of high and low GC content training data sets for existing gene prediction programs MAKER reannotation of MSU rice genome with high and low GC HMMs Identified over 1500 new gene models specific to the high and low GC annotations Codon usage bias with high and low GC gene models