RAST Automated Analysis
Gordon D. Pusch
Fellowship for Interpretation of Genomes

What is RAST for?

- RAST is designed to rapidly call and annotate the genes of a complete or essentially complete prokaryotic genome.
- RAST uses a "Highest Confidence First" assignment propagation strategy based on manually curated subsystems and subsystem-based protein families, which automatically guarantees a high degree of assignment consistency.
- RAST returns an analysis of the genes and subsystems in your genome, as supported by comparative and other forms of evidence.
The RAST Strategy

How does RAST work?

- RAST applies FIG's "Subsystem Approach" using a "Highest Reliability First" strategy based on FIG's collection of manually curated Subsystems and subsystem-derived Protein Families (FIGfams).
- RAST's subsystem approach automatically ensures a high degree of annotation consistency.
- RAST also computes various derived data (sims, BBHs, PCHs, Scenarios, etc.) to support high-throughput genome annotation projects.

RAST Strategy - Calling Genes

1. Find RNAs (rRNAs, tRNAs).
2. Find gene candidates for "Special Proteins" (selenoproteins, pyrroproteins).
3. Find gene candidates for membership in:
   - "Universal" FIGfam protein families.
   - FIGfams already seen in the neighboring genomes.
   - FIGfams other than those found in the neighboring genomes.
4. Repair frameshift errors.
5. Promote remaining non-FIGfam gene candidates:
   - With similarity to genes in neighbors.
   - Without similarity to genes in neighbors.
6. Examine suspiciously long gaps for possible "missing" genes previously found in neighboring genomes (also known as "backfilling").

Gene candidates found during all previous stages become the "training set" for the current stage.
Gene candidates are only retained if they do not overlap too much.
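The rule that candidates are kept only if they do not overlap too much, combined with the highest-confidence-first ordering, can be illustrated with a small greedy filter. This is only a sketch and not RAST's actual implementation: the `Candidate` class, the `confidence` score, and the 30 bp `max_overlap` threshold are all hypothetical, since the slides do not say how overlap is measured or what limit is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    start: int         # 1-based inclusive start on the contig
    end: int           # 1-based inclusive end on the contig
    confidence: float  # hypothetical score; higher means more reliable


def overlap_len(a, b):
    """Number of positions shared by two candidates (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def retain_non_overlapping(candidates, max_overlap=30):
    """Greedy filter: visit candidates from most to least confident and keep
    one only if it overlaps every already-kept candidate by at most
    max_overlap positions (an assumed threshold)."""
    kept = []
    for cand in sorted(candidates, key=lambda c: c.confidence, reverse=True):
        if all(overlap_len(cand, k) <= max_overlap for k in kept):
            kept.append(cand)
    # Report the survivors in genome order.
    return sorted(kept, key=lambda c: c.start)
```

Because more reliable candidates are placed first, a weaker candidate can never displace a stronger one; it can only fill space that is still free.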
I/O - What input formats does RAST accept?

- Sequence data in FASTA (.fna) or GenBank (.gbk) format, uploaded as plain text files with no special characters.
- RAST does not yet support other upload formats, such as EMBL, GFF3, or GTF (although it can generate output in these formats).
- RAST will reject any file that is not plain text; for example, it will not accept genomes encoded as HTML, PDF, RTF, or Microsoft Word.

I/O - Genes reannotated or recalled?

- If you want to keep the original gene coordinates, you must upload a GenBank file and select the "Keep existing gene calls" option.
- RAST will then assign functions and perform a subsystem analysis without recalling the genes of your genome.
- RAST cannot preserve existing gene calls if FASTA contig data are uploaded, because the FASTA format cannot specify gene locations.
I/O - Viewing Results

- You can browse your results and graphically compare them to other genomes using the SEED Viewer.
- You can also download the analysis of your genome in various formats:
  - GenBank
  - EMBL
  - GFF3
  - GTF
  - SEED genome directory (as a tarfile)

Input Data Quality

What is the poorest quality of data that RAST can handle?

- We recommend a mean contig length >2 kbp, with <1% ambiguity characters.
- If your assembly quality is worse than this, RAST will most likely fail.
- The metagenomic version of RAST (MG-RAST) may be able to do something with extremely low-quality assemblies; however, MG-RAST is not really designed for this job.
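The two recommended thresholds above (mean contig length >2 kbp, <1% ambiguity characters) can be checked before uploading. The following is a minimal sketch, not part of RAST: the function names are hypothetical, and it assumes that every character other than A, C, G, T, or U counts as an ambiguity character, which the slides do not spell out.

```python
def read_fasta(text):
    """Yield (contig_id, sequence) pairs from FASTA-formatted text."""
    contig_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if contig_id is not None:
                yield contig_id, "".join(chunks)
            fields = line[1:].split()
            contig_id = fields[0] if fields else ""
            chunks = []
        elif contig_id is not None:
            chunks.append(line.upper())
    if contig_id is not None:
        yield contig_id, "".join(chunks)


def quality_report(text, min_mean_len=2000, max_ambig_frac=0.01):
    """Summarize mean contig length and ambiguity fraction for FASTA text."""
    seqs = [seq for _, seq in read_fasta(text)]
    total = sum(len(s) for s in seqs)
    mean_len = total / len(seqs) if seqs else 0.0
    # Assumption: anything that is not an unambiguous base is "ambiguous".
    ambiguous = sum(1 for s in seqs for ch in s if ch not in "ACGTU")
    ambig_frac = ambiguous / total if total else 1.0
    return {
        "contigs": len(seqs),
        "mean_contig_length": mean_len,
        "ambiguity_fraction": ambig_frac,
        "meets_recommendation": mean_len > min_mean_len
        and ambig_frac < max_ambig_frac,
    }
```

Calling `quality_report(open("contigs.fna").read())` (with a hypothetical file name) returns a dictionary whose `meets_recommendation` entry is `True` only when both thresholds are satisfied.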
Input Data Quality

- RAST is designed for and performs best on complete or essentially complete genomes.
- Conversely, RAST's performance degrades substantially when presented with only a small fragment of a genome.
- Even if you are only interested in a few genes in a small region, it is recommended that you upload as much of your genome as possible, and at minimum 100 kbp of contig data.
- The probability that RAST will abort with errors increases rapidly below the 100 kbp threshold, and is well in excess of 50% below 40 kbp.

Input Data Quality

What is meant by an "essentially complete" genome?

- We consider a genome to be "essentially complete" at about 99% coverage, since beyond that point the expected number of genes missing due to sequencing gaps is smaller than the expected number of "false negatives" from the genefinder.
- From a subsystem-analysis standpoint, completeness above 99% is the point of diminishing returns.
- In terms of sequence redundancy: at least 5x coverage for Sanger sequencing, or at least 10x coverage using 454.
- In terms of contig length: at least 70% of the assembled sequence data are in contigs longer than 20 kbp.
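The size and contig-length criteria above reduce to simple arithmetic on the list of contig lengths. The sketch below is illustrative only: the function name and the returned keys are hypothetical, and it checks just the 100 kbp minimum and the "70% of sequence in contigs longer than 20 kbp" rule, not sequencing coverage.

```python
def completeness_checks(contig_lengths, min_total=100_000,
                        long_contig=20_000, min_long_fraction=0.70):
    """Apply the size and contig-length rules of thumb to a list of
    contig lengths given in base pairs."""
    total = sum(contig_lengths)
    # Bases that sit in contigs strictly longer than the 20 kbp cutoff.
    in_long = sum(n for n in contig_lengths if n > long_contig)
    long_fraction = in_long / total if total else 0.0
    return {
        "total_bp": total,
        "enough_sequence": total >= min_total,
        "long_fraction": long_fraction,
        "essentially_complete_by_length": long_fraction >= min_long_fraction,
    }
```

For example, contigs of 50, 30, 15, and 5 kbp total exactly 100 kbp, with 80 kbp (80%) in contigs longer than 20 kbp, so both checks pass.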
Input Sequence Types

Will RAST handle just a plasmid?

- RAST is not designed to handle only plasmids or small fragments.
- We recommend that you upload the entire genome, even if you intend to view only your plasmid.
- (Extension of RAST to plasmids proposed)

What about Eukaryotes?

- No, not even small ones, and not even organelles!
- Currently, RAST requires you to specify whether your genome is a bacterium or an archaeon.
- If you try to submit a eukaryote, RAST will most likely abort with errors.
- (Extension of RAST to [called!] eukaryotes proposed)

Input Sequence Types

What about ESTs?

- RAST is not designed to analyze ESTs, and will most likely abort with errors.
- You can try submitting EST data to the metagenomic version of RAST, but again, it is not really designed for them.

What about Metagenomes?

- As previously mentioned, there is a special metagenomic version of RAST designed specifically to analyze the sort of massive, low-quality datasets typically generated by metagenomics projects.
FAQs and Common Problems

Who do I contact if I have questions about or problems using RAST?

- All questions or problems regarding RAST should be sent to rast@mcs.anl.gov
- All questions or problems regarding MG-RAST should be sent to mg-rast@mcs.anl.gov

FAQs and Common Problems

Will RAST assemble my reads into contigs?

- No. You will need to assemble your reads into contigs yourself, using some other tool.

Why does RAST complain that it can't find the "phylogenetic neighborhood" of my submission?

- Usually, this is because the submitted sequence data are too small.
- Experience suggests that RAST needs at least 40 kbp of sequence data to reliably place a submission's phylogenetic neighborhood. (100 kbp is better.)
FAQs and Common Problems

RAST is complaining about "Duplicate contig IDs," but all my contig IDs appear unique to me. What's going on?

- Your contig IDs may contain "whitespace" characters.
- The FASTA standard specifies no "whitespace" between the ">" symbol and the contig ID, and that everything after the first "whitespace" character is a "comment," and not part of the identifier.
- Thus, the first FASTA header below is invalid (no ID, just comment), while the following two will be interpreted as a pair of "duplicate IDs" that are both named "B.":

    > E. coli main chromosome
    >B. subtilis main chromosome
    >B. subtilis plasmid

FAQs and Common Problems

Why does RAST complain about "invalid characters" in my FASTA input file?

Most likely one of two reasons:

- Your contig sequences contain characters other than the standard IUPAC ambiguity characters [ACGTUMRWSYKBDHVN] or the "vector masking" character "X" (e.g., because you uploaded protein, not DNA sequences).
- Your contig file uses nonstandard line terminators, is missing line terminators before or after a record header, or is otherwise malformed in some way.
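Both problems above can be caught locally before submission. The sketch below applies the header rule as described (the ID is the text between ">" and the first whitespace) and the allowed character set [ACGTUMRWSYKBDHVN] plus "X". It is not RAST's own validator: the function names are hypothetical, and treating lowercase letters as acceptable is an assumption.

```python
from collections import Counter

# IUPAC nucleotide codes listed above, plus the vector-masking character X.
ALLOWED = set("ACGTUMRWSYKBDHVNX")


def contig_id(header):
    """Return the ID of a FASTA header line: the text between '>' and the
    first whitespace. An empty string means the header has no ID."""
    body = header[1:]
    if not body or body[0].isspace():
        return ""
    return body.split()[0]


def check_fasta(text):
    """Report missing IDs, duplicate IDs, and invalid sequence characters."""
    ids = []
    bad_chars = set()
    for line in text.splitlines():
        if line.startswith(">"):
            ids.append(contig_id(line))
        else:
            # Assumption: case is ignored when validating sequence lines.
            bad_chars.update(
                ch for ch in line.strip().upper() if ch not in ALLOWED
            )
    counts = Counter(ids)
    return {
        "missing_ids": counts.get("", 0),
        "duplicate_ids": sorted(i for i, n in counts.items() if i and n > 1),
        "invalid_characters": sorted(bad_chars),
    }
```

Run on the three example headers, this reports one missing ID and the duplicate ID "B.", matching the interpretation given above.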
FAQs and Common Problems

How do I get a more detailed explanation of why my job failed?

- If the RAST webpage describing the error is insufficient to help you diagnose the problem, please send e-mail to <rast@mcs.anl.gov>; we will consult the error logs for your job and recommend a solution.

FAQs and Common Problems

I selected "Keep existing gene calls" and uploaded a GenBank file, but RAST failed with the cryptic error "Zero-size or non-existent FASTA file." What does this mean?

Most likely your GenBank file has either:

- Gene entries but no CDS entries.
- CDS entries lacking a /translation= field.

RAST's GenBank parser expects CDS entries with /translation= fields.
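A GenBank file can be screened for the two conditions above with a plain-text scan of its feature table. This is a rough sketch rather than a real GenBank parser, and it is not the parser RAST uses: the function name is hypothetical, and it assumes the standard flat-file layout in which feature keys begin in column 6 and qualifiers such as /translation= begin in column 22.

```python
def cds_translation_report(genbank_text):
    """Count gene and CDS features and find CDS entries lacking a
    /translation= qualifier (standard GenBank column layout assumed)."""
    in_features = False
    current_is_cds = False
    n_gene = 0
    cds_has_translation = []  # one boolean per CDS feature, in file order
    for line in genbank_text.splitlines():
        if line and not line[0].isspace():
            # A top-level keyword (LOCUS, FEATURES, ORIGIN, //, ...).
            in_features = line.startswith("FEATURES")
            current_is_cds = False
            continue
        if not in_features:
            continue
        key = line[5:21].strip()
        if line[:5] == "     " and key:
            # A feature key in columns 6-21 starts a new feature.
            current_is_cds = key == "CDS"
            if key == "gene":
                n_gene += 1
            if current_is_cds:
                cds_has_translation.append(False)
        elif current_is_cds and line.strip().startswith("/translation="):
            cds_has_translation[-1] = True
    missing = cds_has_translation.count(False)
    return {
        "gene_features": n_gene,
        "cds_features": len(cds_has_translation),
        "cds_without_translation": missing,
        "looks_usable": bool(cds_has_translation) and missing == 0,
    }
```

A result with `cds_features` equal to 0, or with a nonzero `cds_without_translation`, corresponds to the two likely causes listed above.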
Conclusion

- RAST is designed to automatically call and annotate complete or near-complete prokaryotic genomes.
- RAST uses a "Highest Confidence First" assignment propagation strategy.
- RAST assignments are based on manually curated subsystems and subsystem-based protein families.
- RAST's subsystem-based annotations automatically guarantee a high degree of assignment consistency.