GenBank, Entrez, & FASTA




Nucleotide Sequence Databases
First generation: GenBank is a representative example
- started as a sort of museum, to preserve knowledge of a sequence from its first discovery
- great repositories, particularly for long-term study of bioinformatic data
- flat files; not built for (and not great at) querying

Nucleotide Sequence Databases
Second generation: Entrez Gene is an example
- information is gene-centric (not just sequence-centric)
- all sequence information for a given gene can be found in one place

Nucleotide Sequence Databases
Third generation: Ensembl is a good example
- http://uswest.ensembl.org/index.html
- Information is organized around whole genomes; not only a specific gene's structure, but its context:
  - position of this gene relative to others
  - strand orientation
  - how the gene relates to the presence or absence of biochemical functions in the organism

GenBank
First example: prokaryotic gene
- point your browser to: http://www.ncbi.nlm.nih.gov/entrez
- choose Nucleotide from the Search pull-down menu
- in the For box, type X01714 and click Go
- Click the link labeled X01714
- Can Send To Text if you want to save the file
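The same record can also be requested without the browser. The sketch below builds a request URL for NCBI's E-utilities `efetch` service; this route is not part of the original slides, and the helper name `efetch_url` is a made-up illustration. It only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, rettype="gb"):
    """Build a URL for one nucleotide record (hypothetical helper).

    rettype="gb" asks for the GenBank flat file, rettype="fasta" for FASTA.
    """
    params = {
        "db": "nuccore",      # the Nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)

url = efetch_url("X01714")
print(url)
# To download the text, one option is:
#   import urllib.request
#   record = urllib.request.urlopen(url).read().decode()
```

Passing rettype="fasta" to the same helper would request the FASTA view of the record instead.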

GenBank fields
- LOCUS: locus name, followed by
  - size of sequence (in base pairs)
  - nature of molecule (e.g. DNA or RNA)
  - topology (linear or circular)
- DEFINITION: brief description of gene
- ACCESSION: unique identifier for this (and some other) databases
- VERSION: the accession number with a version suffix (accession.version); may also list other ID numbers for the same sequence

GenBank fields
- KEYWORDS: list of terms related to the entry; can be used for keyword searching for related data
- SOURCE: common name of relevant organism
- ORGANISM: complete ID, with taxonomic classification
  - note that ORGANISM is indented under SOURCE; this indicates that ORGANISM is a subordinate term, or subsection, of SOURCE

GenBank fields
- REFERENCE: credits the author(s) who initially determined the sequence; includes subsections:
  - AUTHORS
  - TITLE
  - JOURNAL
  - PUBMED
- COMMENT: free-formatted text that doesn't fit in another category
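Because a GenBank entry is a flat file, the header fields can be pulled apart with plain string handling. The sketch below is a minimal illustration, assuming the usual flat-file layout in which the keyword occupies the first 12 columns and indented keywords belong to the keyword above them. The record in `SAMPLE` is an invented placeholder, not a real GenBank entry, and `parse_header` is a hypothetical helper name.

```python
# Invented placeholder record; every value here is made up for illustration.
SAMPLE = """\
LOCUS       DEMO0001                 120 bp    DNA     linear   BCT 01-JAN-2000
DEFINITION  Placeholder gene, complete cds.
ACCESSION   DEMO0001
VERSION     DEMO0001.1
KEYWORDS    .
SOURCE      placeholder organism
  ORGANISM  Genus species
            Bacteria; Placeholder lineage.
FEATURES             Location/Qualifiers
"""

def parse_header(text):
    """Map each header keyword to its text; sub-keywords become 'PARENT/CHILD'."""
    fields = {}
    parent = key = None
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            break                      # the header ends where the feature table begins
        label, value = line[:12].strip(), line[12:].strip()
        if label:
            if line[0] != " ":         # keyword starting in column 1: top level
                parent = key = label
            else:                      # indented keyword: subsection of the parent
                key = parent + "/" + label
            fields[key] = value
        elif key:                      # no keyword: continuation of the previous field
            fields[key] += " " + value
    return fields

fields = parse_header(SAMPLE)
name, length, _, molecule, topology = fields["LOCUS"].split()[:5]
print(name, length, molecule, topology)
print(fields["SOURCE/ORGANISM"])
```

Storing ORGANISM under the key SOURCE/ORGANISM mirrors the indentation rule described above.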

GenBank fields
FEATURES: table describing gene regions and associated biological properties
- source: origin of specific regions of the sequence; useful for distinguishing cloning vectors from host sequences
- promoter: precise coordinates of a promoter element in the sequence; there may be more than one of these
- misc_feature: in this example, indicates the (putative) location of the transcription start (mRNA synthesis)
- RBS (ribosome binding site): location of the last upstream element
- CDS (coding sequence): describes the ORF

GenBank fields: FEATURES: CDS
- gives coordinates from the initial nucleotide (ATG) to the last nucleotide of the stop codon (TAA)
- several lines follow, listing protein products, the reading frame to use, the genetic code to apply, and several IDs for the protein sequence
- the /translation section gives a computer translation of the sequence into an amino acid sequence
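The computer translation in /translation can be reproduced in a few lines. This is a sketch under stated assumptions: it uses the standard genetic code only, reads the sequence in frame 1, and stops at the first stop codon. The input sequence is a made-up example, not taken from X01714, and `translate` is a hypothetical helper name.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(cds):
    """Translate a coding sequence codon by codon until the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":                  # stop codon (e.g. TAA) ends the protein
            break
        protein.append(aa)
    return "".join(protein)

# Made-up CDS: start codon ATG, four more codons, stop codon TAA.
print(translate("ATGAAACGCATTAGCTAA"))  # MKRIS
```

A record whose CDS names a different genetic code would need a different amino-acid string in place of `AMINO`.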

Last section: the sequence itself
- This is the most important section in terms of analysis using other tools
- Can isolate just this section and save the file, using either the Download pull-down on the FASTA format page or the more general method discussed later

Example 2: eukaryotic mRNA
- Can obtain this example by searching the Nucleotide database for U90223
- Similar to the prokaryote example, because we're looking at a direct coding sequence for a protein (not DNA, in other words)
- Notes on example:
  - KEYWORDS field is empty: this is an example of an incomplete annotation; remember, you're looking at a primary database!
  - FEATURES field contains some new terms:
    - sig_peptide: location of mitochondrial targeting sequence
    - mat_peptide: exact boundaries of mature peptide

Example 3: eukaryotic gene
- Can obtain this record by searching Nucleotide for AF018430
- General information:
  - LOCUS: same info as previous examples; note the locus name is different from the accession number this time
  - DEFINITION: specifies exon; remember, protein-coding regions in eukaryotes are not contiguous as in prokaryotes
  - SEGMENT: indicates this is the second of 4; you'd need all 4 to reconstruct the mRNA that codes for the protein

Eukaryotic gene: FEATURES section
source subsection includes a /map section, which indicates:
- chromosome (15)
- arm (q means long arm)
- cytogenetic band (q21.1)
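A /map value of this kind can be split into its three parts mechanically. The sketch below assumes the value is written in the compact form 15q21.1, which is an assumption about the layout and not something quoted from the record. `parse_map` is a hypothetical helper name.

```python
import re

def parse_map(location):
    """Split a cytogenetic location such as '15q21.1' into chromosome, arm, and band."""
    m = re.fullmatch(r"(\d+|X|Y)([pq])(\d+(?:\.\d+)?)", location)
    if m is None:
        raise ValueError(f"unrecognized map location: {location!r}")
    chromosome, arm, band = m.groups()
    return {
        "chromosome": chromosome,
        "arm": "short" if arm == "p" else "long",   # p = short arm, q = long arm
        "band": arm + band,
    }

print(parse_map("15q21.1"))
# {'chromosome': '15', 'arm': 'long', 'band': 'q21.1'}
```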

Eukaryotic gene: FEATURES section
gene subsection: describes how to reconstruct the mRNAs found in this and separate entries:
- the strings that begin AF refer to GenBank entries (remember, this one was AF018430), and the numbers represent nucleotide positions within those entries
- if a set of numbers (example: 1..1177) is NOT preceded by an entry indicator, it's from the current entry
- The < and > signs mark a partial feature: the true start lies before (<) or the true stop lies beyond (>) the stated position, so the coordinate shown is only a bound
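The reading rules above can be turned into a small parser. This is a sketch, not a full feature-location parser: it handles only spans of the form start..end, optionally prefixed by an entry and optionally marked with < or >. In the example string, the accessions AF000001.1 and AF000002.1 and every coordinate except 1..1177 are made up for illustration, and `parse_location` is a hypothetical helper name.

```python
import re

# One span: optional "ENTRY:" prefix, optional "<", start, "..", optional ">", end.
SPAN = re.compile(
    r"(?:(?P<entry>[A-Z]+\d+(?:\.\d+)?):)?"
    r"(?P<lt><)?(?P<start>\d+)\.\.(?P<gt>>)?(?P<end>\d+)"
)

def parse_location(location, current_entry):
    """Return one dict per span; spans without a prefix belong to current_entry."""
    spans = []
    for m in SPAN.finditer(location):
        spans.append({
            "entry": m["entry"] or current_entry,
            "start": int(m["start"]),
            "end": int(m["end"]),
            "open_start": m["lt"] is not None,   # "<": true start lies before this position
            "open_end": m["gt"] is not None,     # ">": true stop lies beyond this position
        })
    return spans

# Illustrative location string (accessions and most coordinates are invented).
example = "join(AF000001.1:<1..50,1..1177,AF000002.1:1..>200)"
for span in parse_location(example, current_entry="AF018430"):
    print(span)
```

The middle span has no entry prefix, so it is attributed to the current entry, AF018430.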

Eukaryotic gene: FEATURES section
- mRNA section: can be read in a similar manner to the gene section
  - note that there are two mRNA sections (each followed by a CDS section)
  - first section describes mitochondrial RNA
  - second section describes nuclear RNA
- exon section: indicates position of exon(s) in sequence

Retrieving GenBank entries without accession numbers
- Search Nucleotide for the specific product you're interested in; for example: human[organism] AND dUTPase[protein name]
- this search yields several entries; can click the Links link to the right of one of these (AF018432) and choose Related Sequences from the pull-down that appears
- this retrieves several more entries, some DNA and some mRNA
- terms used in the titles of these entries can give us additional search criteria: human[organism] AND dUTP pyrophosphatase[Title] yields a somewhat different set of entries
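Field-qualified queries like these can be assembled in code and sent to NCBI's E-utilities `esearch` service. That service is not covered in the original slides, and `entrez_term` is a hypothetical helper name. The sketch only builds the query string and URL; it does not run the search.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def entrez_term(**fields):
    """Join field-qualified terms with AND (hypothetical helper).

    Underscores in a keyword name stand for spaces, so protein_name="dUTPase"
    becomes 'dUTPase[protein name]'.
    """
    parts = []
    for field, value in fields.items():
        parts.append(f"{value}[{field.replace('_', ' ')}]")
    return " AND ".join(parts)

term = entrez_term(organism="human", protein_name="dUTPase")
params = {"db": "nuccore", "term": term}
url = ESEARCH + "?" + urlencode(params)

print(term)   # human[organism] AND dUTPase[protein name]
print(url)
```

Swapping the second term for Title="dUTP pyrophosphatase" gives the title-based search from the last bullet.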

FASTA format
- Standard data format for sequence data (nucleotide & protein)
- Includes:
  - definition line (starts with > character)
  - sequence data: residues represented as single letters, all caps
- Default input format for many sequence analysis tools
- For raw format, use FASTA without definition line(s)
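The layout described above is simple enough to read with a short function. A minimal sketch: `read_fasta` is a hypothetical helper name, and the two records in `sample` are invented sequences, not database entries.

```python
def read_fasta(text):
    """Return a list of (definition, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                       # ignore blank lines
        if line.startswith(">"):           # definition line opens a new record
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:           # sequence line: join wrapped lines together
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Invented two-record example.
sample = """>demo1 made-up nucleotide sequence
ATGAAACGCA
TTAGCTAA
>demo2 made-up protein sequence
MKRIS
"""

for definition, sequence in read_fasta(sample):
    print(definition, len(sequence))
```

Dropping the definition from each pair leaves the raw format mentioned in the last bullet.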

Obtaining FASTA from GenBank record
- Click the FASTA link (near top of page)
- Copy the entire entry (from > to last letter) and paste it into Notepad
- Save the file under a relevant name

Obtaining protein sequence from GenBank record
- Scroll down the record until you find the CDS section
- Look for the label protein_id
- Click the link next to this label
- You can obtain FASTA format for the protein just as you did for the nucleotide sequence

Working with non-FASTA data
- Since most sequence tools expect FASTA format, a dirty sequence (one with extraneous characters) can pose a problem
- The Sequence Manipulation Suite (SMS) is a tool that can be used to clean up such a sequence
- SMS can be found at http://www.bioinformatics.org/sms2
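The kind of clean-up involved can be sketched in a few lines of Python. This is not SMS's own code, only a simple analogue that keeps the characters listed in `keep` and drops everything else. `filter_dna` is a hypothetical helper name, and the dirty input is a made-up example.

```python
def filter_dna(raw, keep="ACGTN"):
    """Upper-case the input and drop every character that is not in `keep`."""
    return "".join(ch for ch in raw.upper() if ch in keep)

# Made-up "dirty" sequence with position numbers, spaces, and a terminator.
dirty = "  1 atgaaacgca ttagc\n 16 taa //"
print(filter_dna(dirty))  # ATGAAACGCATTAGCTAA
```

The `keep` set here treats N as a valid base; a stricter filter would pass keep="ACGT".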

SMS
- Tool categories include:
  - Format conversion
  - Sequence analysis
  - and others
- The figure to the right is a screen capture showing the Format conversion menu

Data conversion with Filter DNA tool

Example of results