Gene annotation pipeline. Note: Code is highlighted in green, changes to parameter files are highlighted in gray.

Similar documents
CD-HIT User s Guide. Last updated: April 5,

Bioinformatics Resources at a Glance

Database manager does something that sounds trivial. It makes it easy to setup a new database for searching with Mascot. It also makes it easy to

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Version 5.0 Release Notes

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

This document presents the new features available in ngklast release 4.4 and KServer 4.2.

Step by Step Guide to Importing Genetic Data into JMP Genomics

Creating a DUO MFA Service in AWS

Version Control with Subversion and Xcode

DNA Sequencing Overview

Copyright Texthelp Limited All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a retrieval

Microsoft Business Contact Manager Complete

WS_FTP Professional 12

Importing Contacts to Outlook

Data Mining. SPSS Clementine Clementine Overview. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine

ProSightPC 3.0 Quick Start Guide

Getting Started with HPC

Sequence Database Administration

IMPORTANT Please Read Me First

Searching Nucleotide Databases

Exercises for the UCSC Genome Browser Introduction

BUDAPEST: Bioinformatics Utility for Data Analysis of Proteomics using ESTs

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Bioinformatics Grid - Enabled Tools For Biologists.

Moxa Device Manager 2.3 User s Manual

Introduction to Bioinformatics 3. DNA editing and contig assembly

Prepare the environment Practical Part 1.1

12 SPS DATABASE ADMINISTRATION

Installation Guide for WebSphere Application Server (WAS) and its Fix Packs on AIX V5.3L

Introduction to Operating Systems

Linux command line. An introduction to the Linux command line for genomics. Susan Fairley

LifeScope Genomic Analysis Software 2.5

PROGRAMMING FOR BIOLOGISTS. BIOL 6297 Monday, Wednesday 10 am -12 pm

MetaPathways v1.0 Installation

How To Use The Assembly Database In A Microarray (Perl) With A Microarcode) (Perperl 2) (For Macrogenome) (Genome 2)

3 Setting up Databases on a Microsoft SQL 7.0 Server

Oracle 10g Feature: RMAN Incrementally Updated Backups

Turning HTE Agilent LC/MS Data to Knowledge in 1hr

Business Portal for Microsoft Dynamics GP. Key Performance Indicators Release 10.0

-> Integration of MAPHiTS in Galaxy

BackTrack Hard Drive Installation

Ruby On Rails. CSCI 5449 Submitted by: Bhaskar Vaish

Data formats and file conversions

Installing IBM Websphere Application Server 7 and 8 on OS4 Enterprise Linux

Exchange Account (Outlook) Mail Cleanup

1 System requirements (minimum)

Section 5: Preparing the Data Analysis Environment

FileMaker Pro and Microsoft Office Integration

A Tutorial in Genetic Sequence Classification Tools and Techniques

Client Marketing: Sets

RAST Automated Analysis. What is RAST for?

File Server Migration

TM SysAid Chat Guide Document Updated: 10 November 2009

Version Control Script

Lab 2 : Basic File Server. Introduction

ORACLE USER PRODUCTIVITY KIT USAGE TRACKING ADMINISTRATION & REPORTING RELEASE 3.6 PART NO. E

Contents. Welcome to the Priority Zoom System Version 17 for Windows. This document contains instructions for installing the system.

Tutorial for proteome data analysis using the Perseus software platform

Backup and Restore with 3 rd Party Applications

Chapter 2. imapper: A web server for the automated analysis and mapping of insertional mutagenesis sequence data against Ensembl genomes

Server & Workstation Installation of Client Profiles for Windows

Click Studios. Passwordstate. Upgrade Instructions to V7 from V5.xx

CPAS Overview. Josh Eckels LabKey Software

PCRecruiter Resume Inhaler

Setting Up Windows Perfmon to Collect Performance Data

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

IceWarp to IceWarp Server Migration

Resources You can find more resources for Sync & Save at our support site:

Package hoarder. June 30, 2015

Mascot Search Results FAQ

SQL Server Replication Guide

AWS Schema Conversion Tool. User Guide Version 1.0

Niche Market Finder. User Guide

Job Streaming User Guide

TCB No September Technical Bulletin. GS FLX+ System & GS FLX System. Installation of 454 Sequencing System Software v2.

Together with SAP MaxDB database tools, you can use third-party backup tools to backup and restore data. You can use third-party backup tools for the

Bubble Full Page Cache for Magento

Fruit Machine. Level. Activity Checklist Follow these INSTRUCTIONS one by one. Test Your Project Click on the green flag to TEST your code

Vector HelpDesk - Administrator s Guide

CanReg5 Webinar 6: Customization and Management

Partek Flow Installation Guide


Exchange Brick-level Backup and Restore

Finding and Opening Documents

AFN-FixedAssets

Getting Census Data into ArcMap or ArcView. Obtaining Shapefiles from ESRI and Data from the Census Bureau

A Program for PCB Estimation with Altium Designer

How to output SpoolFlex files directly to your Windows server

Bio-Informatics Lectures. A Short Introduction

Installing Sun's VirtualBox on Windows XP and setting up an Ubuntu VM

Transcription:

Gene annotation pipeline Note: Code is highlighted in green, changes to parameter files are highlighted in gray. References: http://gmod.org/wiki/maker_tutorial_2012 https://groups.google.com/forum/#!forum/maker-devel Training SNAP (ab initio gene prediction) Computation time: ~8 hours total (you may be able to do each step on standby) You will need: 1) A fasta file with chicken, turkey, zebra finch and pigeon proteins (my file was called avianuniprot.fasta) 2) Approximately 10 Mb of your longest contigs (I used all scaffolds > 1200000 bp, my file was called kmer70_min1200000_scaffolds) module load MAKER/2.28b maker CTL nano maker_opts.ctl Edit the file so the following applies and save/exit: genome=kmer70_min1200000_scaffolds.fasta protein=avianuniprot.fasta protein2genome=1 mpiexec -n 48 maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl &> log_file.txt cd kmer70_min1200000_scaffolds.maker.output maker2zff n d kmer70_min1200000_scaffolds_master_datastore_index.log This will create two files: genome.ann and genome.dna The following commands give you an idea of what your data looks like:

fathom genome.ann genome.dna -gene-stats fathom genome.ann genome.dna -validate To train SNAP: cd.. mkdir snap copy genome.ann and genome.dna files to snap folder cd snap fathom -categorize 1000 genome.ann genome.dna fathom -export 1000 -plus uni.ann uni.dna forge export.ann export.dna hmm-assembler.pl Pult. > Pult.hmm cd.. nano maker_opts.ctl Edit the file so that the following applies and save/exit: snaphmm=snap/pult.hmm protein2genome=0 mpiexec -n 48 maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl &> log_file.txt Wait for MAKER to finish running and manually end job. This will be much faster than the first run.

cd kmer70_min1200000_scaffolds.maker.output delete the previous genome.ann and genome.dna files that are in the maker.output directory maker2zff n d kmer70_min1200000_scaffolds_master_datastore_index.log mkdir snap2 copy genome.ann and genome.dna files to snap2 folder fathom -categorize 1000 genome.ann genome.dna fathom -export 1000 -plus uni.ann uni.dna forge export.ann export.dna hmm-assembler.pl Pult. > Pult2.hmm cd.. SNAP is now trained. 1 st MAKER run Computation time: 120-200 hours (run on fnrgenetics, I usually set the run time for the job as 300 hrs) You will need: 1) A fasta file with chicken, turkey, zebra finch and pigeon proteins (my file was called avianuniprot.fasta) 2) The SNAP file you just generated (my file was called Pult2.hmm) 3) EST sequences from a closely-related species (I used falcon ESTs, my file was called falcontrinity.fasta) 4) All of your scaffolds with a minimum length of 10000 bp (my file was called kmer70_min10000_scaffolds.fasta) 5) MIKE_compare_annotations_3.1.pl script

module load MAKER/2.28b maker CTL nano maker_opts.ctl Edit the file so the following applies and save/exit: genome=kmer70_min10000_scaffolds.fasta alttest=falcontrinity.fasta protein=avianuniprot.fasta snaphmm=snap2/pult2.hmm augustus_species=chicken keep_preds=1 tries=1 Note that alttest is used when you re providing ests from a closely-related species. If you have est sequences for your own species, use est= instead of alttest=. This will allow MAKER to run MUCH faster, as it won t be translating in 6 different reading frames. mpiexec -n 48 maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl &> log_file.txt You have finished your first MAKER run. Use fasta_merge to create genome-level fasta files with annotations based on different types of evidence (snap, augustus, ab initio + blast evidence, etc.). Use gff3_merge to create a genome-level gff file. I also make a zff file just so I can generate summary statistics. fasta_merge -d kmer70_min10000_scaffolds_revisedassembly_master_datastore_index.log gff3_merge -d kmer70_min10000_scaffolds_revisedassembly_master_datastore_index.log maker2zff n d kmer70_min1200000_scaffolds_master_datastore_index.log I use several scripts to generate summary statistics for the genome-level fasta and gff files. It is possible for incomplete gff files to be generated if there isn t enough tmp room, so it is important to make sure everything matches up before moving onto the next step. perl MIKE_compare_annotations_3.1.pl kmer70_min10000_scaffolds_revisedassembly.all.gff

fathom genome.ann genome.dna -gene-stats perl fastastats.pl I kmer70_min10000_scaffolds_revisedassembly.fasta InterProScan and downstream processing You will need: 1) The all.maker.proteins.fasta file from you MAKER run (mine was called kmer70_min10000_scaffolds.all.maker.proteins.fasta) 2) The.gff file from your MAKER run (mine was called kmer70_min10000_scaffolds_revisedassembly.all.gff) 3) InterProScan installed locally in your scratch drive (you might check the bioinformatics modules as well, InterProScan is supposed to be added at some point) 4) quality_filter.pl and iprscan_parser.pl scripts./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv I kmer70_min10000_scaffolds.all.maker.proteins.fasta If needed, modify the merged tsv file so that the first column is only the maker id perl -lane 'my @array = split(/\t/, $_); $array[0] = $F[0]; print join("\t", @array);' merged_iprscan_output.tsv > merged_iprscan_output_fixed_ids.tsv Update the gff3 file module load MAKER/2.28b ipr_update_gff kmer70_min10000_scaffolds_revisedassembly.all.gff merged_iprscan_output_fixed_ids.tsv NOTE: This modifies the gff3 file in place so if you kill it before it is done you will truncate your gff3 file, so it is a good idea to make a backup copy. Now you can use the quality_filter_2.pl script to make gff files with genes annotated with different types of evidence. The updated gff file is the max build with all possible gene annotations (including ab initio predictions for which there is no other evidence). The maker default build includes gene annotations generated from ab initio predictions + protein or EST evidence. The maker standard build includes gene annotations generated from ab initio predictions + protein or EST evidence or an InterProScan protein domain hit.

Make the default, standard, and max builds quality_filter.pl -d <ipr_updated_gff3_file> > maker_default.gff quality_filter.pl -s <ipr_updated_gff3_file> > maker_standard.gff The maker max is the original file 2 nd MAKER run: Computation time: 10-20 hours (run on fnrgenetics, I usually set the run time for the job as 48 hrs) This second MAKER retains your MAKER standard annotations while discarding the ab initio predictions that don t have any other evidence. Start by getting a gff3 file with only the gene models and none of the evidence or appended fasta sequence. You can use gff3_merge for this. gff3_merge -g -n MAKER_standard.gff The output will be a file named genome.all.gff. This file only has the MAKER gene models and the appended fasta sequence has been removed. You need to copy this file to the same directory that holds the maker_opts.ctl, genome, protein, etc. files from your first MAKER run. Next edit the maker_opts.ctl file and turn off all of the gene predictors (i.e. get rid of your snap2 file and augustus chicken setting) and put the genome.all.gff file in as model_gff in the gene prediction section and set map_forward=1. snaphmm= augustus_species= model_gff=genome.all.gff #annotated gene models from an external GFF3 file (annotation pass-through) map_forward=1 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no keep_preds=0 #Add unsupported gene prediction to final annotation set, 1 = yes, 0 = no If you run this in the same directory you did your original run in MAKER will be able to find its kmer70_min10000_scaffolds_revisedassembly.maker.output directory and won't have to realign the evidence and will make a directory for each scaffold with a gff3 file that contains the gene models you passed in and the evidence. This will run much faster than a de novo annotation because it doesn't have to realign the evidence or redo the repeat masking.

mpiexec -n 48 maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl &> log_file_2ndrun.txt You have finished your second MAKER run. Use fasta_merge to create genome-level fasta files. fasta_merge -d kmer70_min10000_scaffolds_revisedassembly_master_datastore_index.log