GenBank, Entrez, & FASTA
Nucleotide Sequence Databases, first generation: GenBank is a representative example; started as a sort of museum to preserve knowledge of a sequence from its first discovery; great repositories, particularly for long-term study of bioinformatic data; flat files, not built for (and not great at) querying.
Nucleotide Sequence Databases, second generation: Entrez Gene is an example; information is gene-centric (not just sequence-centric); all sequence information for a given gene can be found in one place.
Nucleotide Sequence Databases, third generation: Ensembl is a good example (http://uswest.ensembl.org/index.html). Information is organized around whole genomes; it covers not only a specific gene's structure, but its context: position of this gene relative to others; strand orientation; how the gene relates to the presence or absence of biochemical functions in the organism.
GenBank, first example: prokaryotic gene. Point your browser to http://www.ncbi.nlm.nih.gov/entrez; choose Nucleotide from the Search pull-down menu; in the For box, type X01714 and click Go; click the link labeled X01714. You can Send To Text if you want to save the file.
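The same record can also be requested without a browser. The sketch below is not part of the original walkthrough: it assumes NCBI's E-utilities `efetch` endpoint and the `nuccore` database name, and the function names `genbank_url` and `fetch_record` are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the NCBI E-utilities efetch endpoint (not named in the slides).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_url(accession, rettype="gb"):
    """Build an efetch URL for a nucleotide record.

    rettype "gb" asks for the GenBank flat file, "fasta" for FASTA.
    """
    params = {
        "db": "nuccore",       # assumed name of the Nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_record(accession, rettype="gb"):
    """Download the record as text (needs network access)."""
    with urlopen(genbank_url(accession, rettype)) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_record("X01714")` should return the same flat file that the browser displays, which can then be written to disk in place of the Send To step.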
GenBank fields. LOCUS: size of sequence (in base pairs); nature of molecule (e.g. DNA or RNA); topology (linear or circular). DEFINITION: brief description of gene. ACCESSION: unique identifier for this (and some other) databases. VERSION: lists synonymous or past ID numbers.
GenBank fields. KEYWORDS: list of terms related to the entry; can be used for keyword searching for related data. SOURCE: common name of the relevant organism. ORGANISM: complete identification, with taxonomic classification. Note that ORGANISM is indented under SOURCE; this indicates that ORGANISM is a subordinate term, or subsection, of SOURCE.
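The indentation rule can be turned into a small parser. This is a minimal sketch, assuming the usual flat-file layout in which the keyword occupies the first 12 columns and the value follows. The sample lines are illustrative, not copied from any record, and `parse_header` is a hypothetical helper name.

```python
# Illustrative header lines (hypothetical, laid out like a GenBank flat file).
SAMPLE = [
    "SOURCE      Escherichia coli",
    "  ORGANISM  Escherichia coli",
    "            Bacteria; Proteobacteria.",
]


def parse_header(lines):
    """Return (keyword, parent, text) tuples.

    A keyword starting in column 1 is top-level (parent is None).
    An indented keyword is a subsection of the last top-level keyword.
    A line with a blank keyword area continues the previous entry.
    """
    entries = []
    parent = None
    for line in lines:
        key, text = line[:12], line[12:].strip()
        if not key.strip():
            # Continuation line: append its text to the previous entry.
            if entries:
                name, par, old = entries[-1]
                entries[-1] = (name, par, (old + " " + text).strip())
            continue
        name = key.strip()
        if key[0] != " ":
            entries.append((name, None, text))
            parent = name
        else:
            entries.append((name, parent, text))
    return entries
```

For `SAMPLE`, ORGANISM comes back with SOURCE as its parent, and the taxonomy line is folded into the ORGANISM text.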
GenBank fields. REFERENCE: credits the author(s) who initially determined the sequence; includes subsections AUTHORS, TITLE, JOURNAL, PUBMED. COMMENT: free-formatted text that doesn't fit in another category.
GenBank fields. FEATURES: table describing gene regions and associated biological properties. source: origin of specific regions of the sequence; useful for distinguishing cloning vectors from host sequences. promoter: precise coordinates of a promoter element in the sequence; there may be more than one of these. misc_feature: in this example, indicates the (putative) location of the transcription start (mRNA synthesis). RBS (ribosome binding site): location of the last upstream element. CDS (coding sequence): describes the ORF.
GenBank fields, FEATURES: CDS. Gives coordinates from the initial nucleotide (ATG) to the last nucleotide of the stop codon (TAA); several lines follow, listing protein products, the reading frame to use, the genetic code to apply, and several IDs for the protein sequence; the /translation section gives a computer translation of the sequence into an amino acid sequence.
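To make the /translation idea concrete, here is a minimal sketch of translating a coding sequence codon by codon. It assumes the standard genetic code and reading frame 1; a real record states which code and frame to apply. The name `translate` is illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon.

    Codons containing unexpected characters are reported as 'X'.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":          # stop codon such as TAA
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # prints MA
```

The short input starts with ATG and ends with TAA, mirroring the CDS coordinates described above.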
Last section: the sequence itself. This is the most important section in terms of analysis using other tools. You can isolate just this section and save the file, using either the Download pull-down on the FASTA format page or the more general method discussed later.
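Isolating the sequence can also be done in a few lines of code. The sketch below assumes the sequence section of the flat file begins with an `ORIGIN` line and ends with `//`, and that each sequence line carries a position number followed by blocks of bases. The function name `origin_to_sequence` is hypothetical.

```python
def origin_to_sequence(record_text):
    """Return the bare sequence from the ORIGIN section of a GenBank record."""
    pieces = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):   # end-of-record marker
            break
        if in_origin:
            # Keep letters only; this drops position numbers and spaces.
            pieces.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(pieces)
```

The result is a raw sequence string that can be saved directly or wrapped in a definition line to make FASTA.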
Example 2: eukaryotic mRNA. Can obtain this example by searching the Nucleotide database for U90223. Similar to the prokaryote example, because we're looking at a direct coding sequence for a protein (not DNA, in other words). Notes on the example: the KEYWORDS field is empty; this is an example of an incomplete annotation (remember, you're looking at a primary database!). The FEATURES field contains some new terms: sig_peptide: location of the mitochondrial targeting sequence; mat_peptide: exact boundaries of the mature peptide.
Example 3: eukaryotic gene. Can obtain this record by searching Nucleotide for AF018430. General information: LOCUS: same info as previous examples; note the locus name is different from the accession number this time. DEFINITION: specifies exon; remember, protein-coding regions in eukaryotes are not contiguous as in prokaryotes. SEGMENT: indicates this is the second of 4; you'd need all 4 to reconstruct the mRNA that codes for the protein.
Eukaryotic gene, FEATURES section: the source subsection includes a /map section that indicates the chromosome (15), the arm (q means long arm), and the cytogenetic band (q21.1).
Eukaryotic gene, FEATURES section: the gene subsection describes how to reconstruct the mRNAs found in this and separate entries. The strings that begin with AF refer to GenBank entries (remember, this one was AF018430), and the numbers represent nucleotide positions within those entries. If a set of numbers (example: 1..1177) is NOT preceded by an entry indicator, it's from the current entry. The < and > signs indicate that the feature extends beyond the given position, so the stated start and stop points are only approximate.
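Reading such a location string can be sketched with a regular expression. The example string in the test and usage below is hypothetical (it is not copied from AF018430), and the pattern only covers the simple `entry:start..end` form described above. `parse_join` and the `fuzzy_*` keys are illustrative names.

```python
import re

# One segment: optional "ENTRY:" prefix, optional "<" before the start,
# optional ">" before the end.
SEGMENT = re.compile(
    r"(?:(?P<entry>[A-Z]+\d+(?:\.\d+)?):)?"
    r"(?P<lt><)?(?P<start>\d+)\.\.(?P<gt>>)?(?P<end>\d+)"
)


def parse_join(location, current_entry):
    """Split a join(...) location into segments.

    Segments without an entry prefix are assigned to current_entry.
    """
    segments = []
    for match in SEGMENT.finditer(location):
        segments.append({
            "entry": match.group("entry") or current_entry,
            "start": int(match.group("start")),
            "end": int(match.group("end")),
            "fuzzy_start": match.group("lt") is not None,   # "<" present
            "fuzzy_end": match.group("gt") is not None,     # ">" present
        })
    return segments
```

For a hypothetical input such as `join(AF018429.1:<1..1735,1..1177,AF018431.1:1..>140)` with `current_entry="AF018430"`, the middle segment is attributed to the current entry and the outer two are flagged as fuzzy.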
Eukaryotic gene, FEATURES section: the mRNA section can be read in a similar manner to the gene section. Note that there are two mRNA sections (each followed by a CDS section): the first describes mitochondrial RNA, the second describes nuclear RNA. The exon section indicates the position of exon(s) in the sequence.
Retrieving GenBank entries without accession numbers: search Nucleotide for the specific product you're interested in; for example: human[organism] AND dutpase[protein name]. This search yields several entries; you can click the Links link to the right of one of these (AF018432) and choose Related Sequences from the pull-down that appears. This retrieves several more entries, some DNA and some mRNA. Terms used in the titles of these entries can give us additional search criteria: human[organism] AND dutp pyrophosphatase[Title] yields a somewhat different set of entries.
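Fielded queries like these are easy to assemble in code. The sketch below builds the query string and, as an assumption beyond the slides, wraps it in a URL for NCBI's E-utilities `esearch` endpoint with the `nuccore` database. `build_term` and `search_url` are hypothetical helper names.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities esearch endpoint (not named in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(pairs):
    """Join (value, field) pairs into 'value[field] AND value[field]'."""
    return " AND ".join(f"{value}[{field}]" for value, field in pairs)


def search_url(term, db="nuccore"):
    """Build an esearch URL; urlencode escapes the spaces and brackets."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


term = build_term([("human", "organism"), ("dutpase", "protein name")])
print(term)  # prints human[organism] AND dutpase[protein name]
```

Swapping the second pair for `("dutp pyrophosphatase", "Title")` reproduces the follow-up search.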
FASTA format: standard data format for sequence data (nucleotide & protein). Includes: a definition line (starts with the > character); sequence data, with residues represented as single letters, all caps. Default input format for many sequence analysis tools. For raw format, use FASTA without the definition line(s).
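A minimal sketch of reading this format follows. It assumes nothing beyond the description above: a `>` line starts a record and every other non-blank line is sequence. `parse_fasta` and `to_raw` are illustrative names.

```python
def parse_fasta(text):
    """Return a list of (definition, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []   # drop the leading ">"
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_raw(text):
    """Raw format: the sequence data with the definition line(s) removed."""
    return "".join(seq for _, seq in parse_fasta(text))


example = ">demo hypothetical sequence\nATGGCC\nTAA\n"
print(parse_fasta(example))  # prints [('demo hypothetical sequence', 'ATGGCCTAA')]
```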
Obtaining FASTA from a GenBank record: click the FASTA link (near the top of the page); copy the entire entry (from > to the last letter) and paste it into Notepad; save the file under a relevant name.
Obtaining a protein sequence from a GenBank record: scroll down the record until you find the CDS section; look for the label protein_id; click the link next to this label. You can obtain FASTA format for the protein just as you did for the nucleotide sequence.
Working with non-FASTA data: since most sequence tools expect FASTA format, a dirty sequence (one with extraneous characters) can pose a problem. The Sequence Manipulation Suite (SMS) is a tool that can be used to clean up such a sequence. SMS can be found at http://www.bioinformatics.org/sms2
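The cleanup idea can be illustrated in a few lines. This is a sketch of the general technique (keep only DNA letters), not a reproduction of how SMS works; the allowed alphabet and the name `filter_dna` are assumptions.

```python
def filter_dna(text, allowed="ACGTN"):
    """Uppercase the input and drop every character that is not a DNA letter.

    Digits, spaces, line breaks and punctuation are removed.
    """
    return "".join(ch for ch in text.upper() if ch in allowed)


dirty = "1 gatc ctcc\n 11 at-g!"
print(filter_dna(dirty))  # prints GATCCTCCATG
```

The cleaned string can then be given a definition line and saved as FASTA.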
SMS tool categories include: format conversion, sequence analysis, and others. The figure to the right is a screen capture showing the Format conversion menu.
Data conversion with Filter DNA tool
Example of results