RAST Automated Analysis. What is RAST for?


RAST Automated Analysis
Gordon D. Pusch
Fellowship for Interpretation of Genomes

What is RAST for?
- RAST is designed to rapidly call and annotate the genes of a complete or essentially complete prokaryotic genome.
- RAST uses a "Highest Confidence First" assignment propagation strategy based on manually curated subsystems and subsystem-based protein families, which automatically guarantees a high degree of assignment consistency.
- RAST returns an analysis of the genes and subsystems in your genome, as supported by comparative and other forms of evidence.

The RAST Strategy
How does RAST work?
- RAST applies FIG's "Subsystem Approach" using a "Highest Reliability First" strategy based on FIG's collection of manually curated Subsystems and subsystem-derived Protein Families (FIGfams).
- RAST's subsystem approach automatically ensures a high degree of annotation consistency.
- RAST also computes various derived data (sims, BBHs, PCHs, Scenarios, etc.) to support high-throughput genome annotation projects.

RAST Strategy - Calling Genes
1. Find RNAs (rRNAs, tRNAs).
2. Find gene candidates for "Special Proteins" (selenos, pyrros).
3. Find gene candidates for membership in:
   - "Universal" FIGfam Protein Families
   - FIGfams already seen in the neighboring genomes
   - FIGfams other than those found in the neighboring genomes
4. Repair frameshift errors.
5. Promote remaining non-FIGfam gene candidates:
   - with similarity to genes in neighbors
   - without similarity to genes in neighbors
6. Examine suspiciously long gaps for possible "missing" genes previously found in neighboring genomes (AKA "Backfilling").

Gene candidates found during all previous stages become the "training set" for the current stage. Gene candidates are only retained if they do not overlap too much.
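The rule that gene candidates are retained only if they do not overlap too much can be illustrated with a small sketch. This is not RAST's implementation: the `Candidate` record, the greedy best-score-first ordering, and the `max_overlap` threshold of 30 bp are all assumptions made for illustration, since the slides do not say how much overlap counts as "too much".

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical gene candidate on one contig."""
    name: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float  # hypothetical confidence; higher is better


def overlap(a: Candidate, b: Candidate) -> int:
    """Number of positions shared by two intervals (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def retain_candidates(cands: List[Candidate], max_overlap: int = 30) -> List[Candidate]:
    """Greedy filter: visit candidates from highest to lowest score and keep
    each one only if it overlaps every already-kept candidate by at most
    max_overlap positions (an assumed threshold)."""
    kept: List[Candidate] = []
    for cand in sorted(cands, key=lambda c: c.score, reverse=True):
        if all(overlap(cand, k) <= max_overlap for k in kept):
            kept.append(cand)
    # Report the survivors in genomic order.
    return sorted(kept, key=lambda c: c.start)


if __name__ == "__main__":
    demo = [
        Candidate("a", 1, 300, 0.9),
        Candidate("b", 250, 600, 0.5),
        Candidate("c", 290, 700, 0.7),
    ]
    for cand in retain_candidates(demo):
        print(cand.name, cand.start, cand.end)
```

Visiting the most confident candidates first mirrors the "Highest Confidence First" idea only loosely; the real pipeline works stage by stage, as listed above.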

I/O - What input formats does RAST accept?
- Sequence data in FASTA format (.fna) or GenBank format (.gbk), uploaded as plain text files with no special characters.
- RAST does not yet support other upload formats, such as EMBL, GFF3, or GTF (although it can generate output in these formats).
- RAST will reject any file that is not plain text; e.g., it will not accept genomes encoded as HTML, PDF, RTF, or Microsoft Word.

I/O - Genes reannotated or recalled?
- If you want to keep the original gene coordinates, you must upload a GenBank file and select the "Keep existing gene calls" option. RAST will then assign functions and perform a subsystem analysis without recalling the genes of your genome.
- RAST cannot preserve existing gene calls if FASTA contig data are uploaded, because the FASTA format cannot specify gene locations.

I/O - Viewing Results
- You can browse your results and graphically compare them to other genomes using the SEED Viewer.
- You can also download the analysis of your genome in various formats:
  - GenBank
  - EMBL
  - GFF3
  - GTF
  - SEED genome directory (as tarfile)

Input Data Quality
What is the poorest quality of data that RAST can handle?
- We recommend a mean contig length >2 kbp, with <1% ambiguity characters. If your assembly quality is worse than this, RAST will most likely fail.
- It is possible that the metagenomic version of RAST may be able to do something with extremely low quality assemblies; however, MG-RAST is not really designed for this job.
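Before uploading, the two recommended thresholds (mean contig length >2 kbp, <1% ambiguity characters) can be checked locally. The following is a minimal sketch, not part of RAST: the function names are hypothetical, and it assumes that any character other than A, C, G, or T counts as an ambiguity character.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {contig_id: sequence}.
    The contig ID is the header text up to the first whitespace."""
    parts: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            parts[current] = []
        elif current is not None:
            parts[current].append(line.upper())
    return {cid: "".join(chunks) for cid, chunks in parts.items()}


def assembly_quality(contigs: Dict[str, str]) -> Dict[str, float]:
    """Mean contig length and fraction of non-ACGT characters (assumed
    here to be what the slides call 'ambiguity characters')."""
    lengths = [len(seq) for seq in contigs.values()]
    total = sum(lengths)
    ambiguous = sum(1 for seq in contigs.values() for ch in seq if ch not in "ACGT")
    return {
        "mean_contig_length": total / len(lengths) if lengths else 0.0,
        "ambiguity_fraction": ambiguous / total if total else 0.0,
    }


def meets_recommendation(stats: Dict[str, float],
                         min_mean_len: int = 2000,
                         max_ambiguity: float = 0.01) -> bool:
    """True if both thresholds quoted on the slide are satisfied."""
    return (stats["mean_contig_length"] > min_mean_len
            and stats["ambiguity_fraction"] < max_ambiguity)
```

To use it on a file, pass an open text file handle to `read_fasta` and feed the result to `assembly_quality`.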

Input Data Quality
- RAST is designed for and performs best on complete or essentially complete genomes. Conversely, RAST's performance degrades substantially when presented with only a small fragment of a genome.
- Even if you are only interested in a few genes in a small region, it is recommended that you upload as much of your genome as possible, and at minimum 100 kbp of contig data.
- The probability that RAST will abort with errors increases rapidly below the 100 kbp threshold, and is well in excess of 50% below 40 kbp.

What is meant by an "essentially complete" genome?
- We consider a genome to be "essentially complete" at about 99% coverage, since beyond that point the expected number of missing genes due to sequencing gaps becomes less than the expected number of "false negatives" from the genefinder.
- From a Subsystem Analysis standpoint, >99% completeness is the point of diminishing returns.
- In terms of sequence redundancy: at least 5x coverage for Sanger sequencing, or at least 10x coverage using 454.
- In terms of contig length: at least 70% of the assembled sequence data are in contigs longer than 20 kbp.
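The size-based criteria above (100 kbp and 40 kbp totals, and at least 70% of the sequence in contigs longer than 20 kbp) are simple arithmetic on contig lengths. A hedged sketch follows; the function and key names are hypothetical, and the coverage criteria are not checked because they cannot be derived from contig lengths alone.

```python
from typing import Dict, List, Union


def long_contig_fraction(lengths: List[int], min_len: int = 20_000) -> float:
    """Fraction of all assembled bases that lie in contigs longer than min_len."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(n for n in lengths if n > min_len) / total


def size_report(lengths: List[int]) -> Dict[str, Union[int, float, bool]]:
    """Compare an assembly against the size thresholds quoted on the slides."""
    total = sum(lengths)
    fraction = long_contig_fraction(lengths)
    return {
        "total_bp": total,
        "below_100_kbp": total < 100_000,   # failures become more likely
        "below_40_kbp": total < 40_000,     # failure is more likely than not
        "long_contig_fraction": fraction,
        "meets_contig_length_criterion": fraction >= 0.70,
    }


if __name__ == "__main__":
    print(size_report([50_000, 30_000, 15_000, 5_000]))
```

In the example, 80,000 of 100,000 bases sit in contigs longer than 20 kbp, so the contig-length criterion is met and the total just reaches the recommended minimum.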

Input Sequence Types
Will RAST handle just a plasmid?
- RAST is not designed to handle only plasmids or small fragments. We recommend that you upload the entire genome, even if you intend to view only your plasmid. (Extension of RAST to plasmids proposed)

What about Eukaryotes?
- No, not even small ones, and not even organelles! Currently, RAST requires you to specify whether your genome is a bacterium or archaeon. If you try to submit a eukaryote, RAST will most likely abort with errors. (Extension of RAST to [called!] eukaryotes proposed)

What about ESTs?
- RAST is not designed to analyze ESTs, and will most likely abort with errors. You can try submitting EST data to the metagenomic version of RAST, but again, it is not really designed for them.

What about Metagenomes?
- As previously mentioned, there is a special metagenomic version of RAST designed specifically to analyze the sort of massive, low-quality datasets typically generated by metagenomics projects.

FAQs and Common Problems
Who do I contact if I have questions about or problems using RAST?
- All questions or problems regarding RAST should be sent to rast@mcs.anl.gov
- All questions or problems regarding MG-RAST should be sent to mg-rast@mcs.anl.gov

Will RAST assemble my reads into contigs?
- No. You will need to assemble your reads into contigs yourself, using some other tool.

Why does RAST complain that it can't find the "phylogenetic neighborhood" of my submission?
- Usually, this is because the submitted sequence data are too small. Experience suggests that RAST needs at least 40 kbp of sequence data to reliably place a submission's phylogenetic neighborhood. (100 kbp is better.)

FAQs and Common Problems
RAST is complaining about "Duplicate contig IDs," but all my contig IDs appear unique to me. What's going on?
- Your contig IDs may contain "whitespace" characters. The FASTA standard specifies no "whitespace" between the ">" symbol and the contig ID, and that everything after the first "whitespace" character is a "comment," and not part of the identifier.
- Thus, the first FASTA header below is invalid (no ID, just comment), while the following two will be interpreted as a pair of "duplicate IDs" that are both named "B.":

    > E. coli main chromosome
    >B. subtilis main chromosome
    >B. subtilis plasmid

Why does RAST complain about "invalid characters" in my FASTA input file?
Most likely one of two reasons:
- Your contig sequences contain characters other than the standard IUPAC ambiguity characters [ACGTUMRWSYKBDHVN] or the "vector masking" character "X" (e.g., because you uploaded protein, not DNA sequences).
- Your contig file uses nonstandard line terminators, is missing line terminators before or after a record header, or is otherwise malformed in some way.
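Both complaints can be reproduced locally with a few lines of Python. This is a sketch of the rules as stated on the slide, not RAST's own validator: `check_fasta` is a hypothetical name, and treating lowercase letters as equivalent to uppercase is an assumption.

```python
from collections import Counter
from typing import List, Tuple

# Characters listed on the slide: IUPAC nucleotide codes plus the masking character X.
VALID_CHARS = set("ACGTUMRWSYKBDHVNX")


def check_fasta(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, message) pairs; line 0 marks file-level problems."""
    problems: List[Tuple[int, str]] = []
    ids: List[str] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.startswith(">"):
            header = line[1:]
            # Whitespace directly after ">" means there is no ID, only a comment.
            if not header or header[0].isspace():
                problems.append((lineno, "missing contig ID"))
                continue
            # Everything after the first whitespace is comment, not ID.
            ids.append(header.split()[0])
        else:
            # Assumption: lowercase sequence characters are accepted.
            bad = sorted(set(line.strip().upper()) - VALID_CHARS)
            if bad:
                problems.append((lineno, "invalid characters: " + "".join(bad)))
    for contig_id, count in Counter(ids).items():
        if count > 1:
            problems.append((0, f"duplicate contig ID {contig_id!r} ({count} times)"))
    return problems


if __name__ == "__main__":
    sample = (
        "> E. coli main chromosome\nACGT\n"
        ">B. subtilis main chromosome\nACGTN\n"
        ">B. subtilis plasmid\nMKLV\n"
    )
    for lineno, message in check_fasta(sample):
        print(lineno, message)
```

Run on the three headers from the slide, the sketch flags the first header as having no ID and reports "B." as a duplicate; the last record, which contains the amino-acid letter L, is flagged for invalid characters. Note that many amino-acid letters are also valid IUPAC nucleotide codes, so a short protein sequence can slip through a character check like this one.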

FAQs and Common Problems
How do I get a more detailed explanation of why my job failed?
- If the RAST webpage describing the error is insufficient to help you diagnose the problem, please send e-mail to <rast@mcs.anl.gov>; we will consult the error logs for your job and recommend a solution.

I selected "Keep existing gene calls" and uploaded a GenBank file, but RAST failed with the cryptic error "Zero-size or non-existent FASTA file". What does this mean?
Most likely your GenBank file either has:
- Gene entries but no CDS entries.
- CDS entries lacking a /translation= field.

RAST's GenBank parser expects CDS entries with /translation= fields.
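A quick way to see whether a GenBank file falls into either case is to count its gene and CDS features and the CDS features that carry a /translation= qualifier. The sketch below is a plain line scanner, not RAST's parser: the function name is hypothetical, and it assumes the standard flat-file layout in which feature keys start in column 6 and qualifiers start in column 22.

```python
from typing import Dict


def cds_translation_report(genbank_text: str) -> Dict[str, int]:
    """Count gene features, CDS features, and CDS features that have a
    /translation= qualifier, across all records in the text."""
    genes = cds = cds_with_translation = 0
    in_features = False
    in_cds = False
    for line in genbank_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # A keyword in column 1 (LOCUS, FEATURES, ORIGIN, //) opens or
            # closes the feature table.
            in_features = line.startswith("FEATURES")
            in_cds = False
            continue
        if not in_features:
            continue
        key = line[5:21].strip()
        if key:
            # A new feature begins: the key sits in columns 6-20.
            in_cds = key == "CDS"
            if key == "gene":
                genes += 1
            if in_cds:
                cds += 1
        elif in_cds and line[21:].startswith("/translation="):
            cds_with_translation += 1
    return {
        "gene": genes,
        "CDS": cds,
        "CDS_with_translation": cds_with_translation,
        "CDS_missing_translation": cds - cds_with_translation,
    }
```

A file that reports zero CDS features, or a nonzero `CDS_missing_translation` count, matches one of the two failure causes listed above. The scanner could miscount in the unlikely case that a wrapped qualifier value has a continuation line beginning with "/translation=".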

Conclusion
- RAST is designed to automatically call and annotate complete or near-complete prokaryotic genomes.
- RAST uses a "Highest Confidence First" assignment propagation strategy.
- RAST assignments are based on manually curated subsystems and subsystem-based protein families.
- RAST's subsystem-based annotations automatically guarantee a high degree of assignment consistency.