Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011

Similar documents
GenBank, Entrez, & FASTA

Bioinformatics Resources at a Glance

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Biological Sequence Data Formats

Lecture Outline. Introduction to Databases. Introduction. Data Formats Sample databases How to text search databases. Shifra Ben-Dor Irit Orr

A Tutorial in Genetic Sequence Classification Tools and Techniques

Database manager does something that sounds trivial. It makes it easy to setup a new database for searching with Mascot. It also makes it easy to

Guide for Bioinformatics Project Module 3

Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure enzymes control cell chemistry ( metabolism )

This class deals with the fundamental structural features of proteins, which one can understand from the structure of amino acids, and how they are

GenBank: A Database of Genetic Sequence Data

Bioinformatics Grid - Enabled Tools For Biologists.

org.rn.eg.db December 16, 2015 org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.

Next Generation Sequencing Data Visualization

RNA & Protein Synthesis

Pairwise Sequence Alignment

Lab 2/Phylogenetics/September 16, PHYLOGENETICS

Module 1. Sequence Formats and Retrieval. Charles Steward

Basic Concepts of DNA, Proteins, Genes and Genomes

NO CALCULATORS OR CELL PHONES ALLOWED

Biological Databases and Protein Sequence Analysis

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

DNA Sequencing Overview

From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains

Library page. SRS first view. Different types of database in SRS. Standard query form

The Steps. 1. Transcription. 2. Transferal. 3. Translation

Translation Study Guide

Protein Protein Interactions (PPI) APID (Agile Protein Interaction DataAnalyzer)

CSC 2427: Algorithms for Molecular Biology Spring Lecture 16 March 10

Problem Set 1 KEY

Tutorial. Reference Genome Tracks. Sample to Insight. November 27, 2015

COMPARING DNA SEQUENCES TO DETERMINE EVOLUTIONARY RELATIONSHIPS AMONG MOLLUSKS

INTRODUCTION TO PROTEIN STRUCTURE

Searching Nucleotide Databases

Section I Using Jmol as a Computer Visualization Tool

Transcription and Translation of DNA

Module 10: Bioinformatics

DNA Sequence formats

Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question.

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

This document presents the new features available in ngklast release 4.4 and KServer 4.2.

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Protein Sequence Analysis - Overview -

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Clone Manager. Getting Started

BIOINFORMATICS TUTORIAL

CD-HIT User s Guide. Last updated: April 5,

Built from 20 kinds of amino acids

Disulfide Bonds at the Hair Salon

Introduction to Proteins and Enzymes

Specify the location of an HTML control stored in the application repository. See Using the XPath search method, page 2.

Myoglobin and Hemoglobin

A Primer of Genome Science THIRD

A disaccharide is formed when a dehydration reaction joins two monosaccharides. This covalent bond is called a glycosidic linkage.

UGENE Quick Start Guide

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

SICKLE CELL ANEMIA & THE HEMOGLOBIN GENE TEACHER S GUIDE

Sickle cell anemia: Altered beta chain Single AA change (#6 Glu to Val) Consequence: Protein polymerizes Change in RBC shape ---> phenotypes

4. Which carbohydrate would you find as part of a molecule of RNA? a. Galactose b. Deoxyribose c. Ribose d. Glucose

Structures of Proteins. Primary structure - amino acid sequence

Structure and Function of DNA

Introduction to Bioinformatics 3. DNA editing and contig assembly

Peptide Bond Amino acids are linked together by peptide bonds to form polypepetide chain.

Processing Genome Data using Scalable Database Technology. My Background

Carbohydrates, proteins and lipids

Activity 7.21 Transcription factors

Transcription in prokaryotes. Elongation and termination

An agent-based layered middleware as tool integration

4. DNA replication Pages: Difficulty: 2 Ans: C Which one of the following statements about enzymes that interact with DNA is true?

Molecular Genetics. RNA, Transcription, & Protein Synthesis

DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!!

Supervised DNA barcodes species classification: analysis, comparisons and results. Tutorial. Citations

Data formats and file conversions

17 July 2014 WEB-SERVER MANUAL. Contact: Michael Hackenberg

Personal Portfolios on Blackboard

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

Genomes and SNPs in Malaria and Sickle Cell Anemia

Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison

Pipe Cleaner Proteins. Essential question: How does the structure of proteins relate to their function in the cell?

13.2 Ribosomes & Protein Synthesis

Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in

Helices From Readily in Biological Structures

Genome Explorer For Comparative Genome Analysis

The Molecules of Cells

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

Chapter 5. The Structure and Function of Macromolecule s

Transcription:

Sequence Formats and Sequence Database Searches Gloria Rendon SC11 Education June, 2011

Sequence A is the primary structure of a biological molecule. It is a chain of residues that form a precise linear order.

Other types of structures Secondary structure: is the general 3D form of local segments (alpha helices, beta sheets) and their interactions mediated by hydrogen bonds.

Other types of structures Tertiary structure: is the 3D structure as defined by its atomic coordinates

Other types of structures Quaternary structure: many proteins are actually assembles of more than one polypeptide chain. The assemblage of those protein subunits are known as the quaternary structure of the protein.

Plain Sequence Format ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG TTTAATTACAGACCTGAA Plain format. It is an ORDERED string of characters where each character represents not only the residue but also its location. The characters must belong to any of these three alphabets depending on the type of, DNA, RNA or protein. DNA s alphabet: { A, C, T, G } RNA s alphabet: { A, C, U, G } Amino acid alphabet: { A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V }

Fasta Sequence Format > gi 455952 gb S67398.1 18S rrna {3' region} [Lemna minor, entire plant, rrna Partial, 111 nt] CTCCTACCGATTGAATGGTCCGGTGAAGCGCTCGGATCGCGGCGACGAGGGCGGTCCCCCGCCCGCGACG TCGCGAGAAGTCCGTTGAACCTTATCATTTAGAGGAAGGAG Fasta format. For each there is a description line and the contents (or body) of the. The description line is always the the first line of the. The description line begins with the > symbol followed by the description itself; all in one line The description must include the identifier or name of the. But there is no universal convention for what else to specify and in which order. The body of the can occupy as many lines as necessary. Like in the plain format; this segment is an ORDERED string of characters where each character represents not only the residue but also its location.

GenBank Sequence Format There exist many other formats which arrange information about the in a more verbose manner. Here we see the same of the previous page but in GenBank format. LOCUS S67398 111 bp rrna linear PLN 18-FEB-1994 DEFINITION 18S rrna {3' region} [Lemna minor, entire plant, rrna Partial, 111 nt]. ACCESSION S67398 VERSION S67398.1 GI:455952 KEYWORDS. SOURCE Lemna minor ORGANISM Lemna minor Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; Liliopsida; Araceae; Lemnoideae; Lemna. REFERENCE 1 (bases 1 to 111) AUTHORS Senecoff,J.F. and Meagher,R.B. TITLE In vivo analysis of plant 18 S ribosomal RNA structure JOURNAL Meth. Enzymol. 224, 357-372 (1993) PUBMED 8264398 REMARK GenBank staff at the National Library of Medicine created this entry [NCBI gibbsq 140993] from the original journal article. FEATURES Location/Qualifiers source 1..111 /organism="lemna minor" /mol_type="rrna" /db_xref="taxon:4472" gene 1..111 /gene="18s rrna" rrna 1..111 /gene="18s rrna" ORIGIN 1 ctcctaccga ttgaatggtc cggtgaagcg ctcggatcgc ggcgacgagg gcggtccccc 61 gcccgcgacg tcgcgagaag tccgttgaac cttatcattt agaggaagga g //

Sequence Databases The GOOD news: There are many Repositories of s. Some of them are publicly available and can be freely obtained (no fee). Megasites like NCBI (at NIH), UniProt (at EBI), KEGG (Japan) provide storage of and access to their databases. They also provide additional services like curation, regular updates, software tools for mining the databases, etc. The BAD news: There is great proliferation of formats and conventions that are site and/or database specific There is little effort at each site to provide cross-referencing with the rest of the world Be prepared to spend a great deal of time dealing with format conversions and cross-referencing of databases

Searching Sequence Databases First a word of warning! Most publicly available databases provide web servers for users to search and access their content. Most of them, if not all, are still a long way from being user-friendly during search, retrieval and download activities. These sites do not yet have the capability of conducting meaningful searches based on natural language queries. In other words, do not expect the search engine to behave like Google.

Searching Sequence Databases These three scenarios will illustrate the most common approaches to performing searches for s in databases. They are ordered by level of difficulty. Scenario 1: you know the database and the identifier; for example, the database is GenBank, the identifier is FJ985763 Scenario 2: you know the name of the gene or protein and the organism; for example, 16s ribosomal RNA in E. Coli Scenario 3: you have a segment of the but nothing else; for example, a segment from an assembly after shotgun sequencing on a soil sample.

Searching Sequence Databases > gi 455952 gb S67398.1 18S rrna {3' region} [Lemna minor, entire plant, rrna Partial, 111 nt] CTCCTACCGATTGAATGGTCCGGTGAAGCGCTCGGATCGCGGCGACGAGGGCGGTCCCCCGCCCGCGACG TCGCGAGAAGTCCGTTGAACCTTATCATTTAGAGGAAGGAG For scenarios 1 and 2, we will be using search tools that try to match your query to the information in the description line of the. These tools assume your query will be specified in plain English but with control vocabulary. For Scenario 3 we will be using search tools that try to match your query to the body of the. These tools, like Blast, perform the match by doing alignment.

Searching Sequence Databases Scenario 1 (seq-id, db) known Let us suppose that you were extremely lucky and read a paper that included a full reference of the it was reporting on. That is, it provided the database name and the identifier. For example: Smith GJD, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, Pybus OG, Ma SK, Cheung CL, Raghwani J, Bhatt S, Peiris JSM, Guan Y, Rambaut A. Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature 2009; 459:1122-5. The supplementary material includes a table with ids on page 31. http://www.nature.com/nature/journal/v459/n7250/extref/nature08182-s1.pdf Let us search and retrieve any of the s in that table from the NCBI url

Scenario 1 (seq-id,db) known 1. Extract the id and database from the paper 2. Go to the web server of the database it belongs to 3. In the search box type the id and run the search 4. Examine the 5. Done

Scenario 1 (seq-id, db) known 1. Extract the id and DB from the paper 2. Go to the web server of the database it belongs to 3. In the search box type the id and run the search 4. Examine the 5. Done FJ985763 GenBank

Scenario 1 (seq-id, db) known 1. Extract the id and DB from the paper 2. Go to the web server of the database it belongs to 3. In the search box type the id and run the search 4. Examine the 5. Done http://www.ncbi.nlm.nih.gov/

Scenario 1 (seq-id, db) known 1. Extract the id and DB from the paper 2. Go to the web server of the database it belongs to 3. In the search box type the id and run the search 4. Examine the 5. Done FJ985763

Scenario 1 (seq-id, db) known 1. Extract the id and DB from the paper 2. Go to the web server of the database it belongs to 3. In the search box type the id and run the search 4. Examine the 5. Done In the Results page, click here to retrieve the result found in this particular database of nucleotide s

Scenario 1 (seq-id, db) known 1. Extract the id and DB from the paper 2. Go to the web server of the database it belongs to 3. In the search box type the id and run the search 4. Examine the 5. Done This is the result displayed in GenBank format. To change to FASTA format click on the fasta link

Scenario 1 (seq-id, db) known 1. Extract the id and DB from the paper 2. Go to the web server of the database it belongs to 3. In the search box type the id and run the search 4. Examine the 5. Done This is the result in FASTA format. To download it to your computer click on Send

Scenario 1 (seq-id, db) known 1. Extract the id and DB from the paper 2. Go to the web server of the database it belongs to 3. In the search box type the id and run the search 4. Examine the 5. Done You can also access the Taxonomy page of the Organism that contains this gene by clicking this link

Scenario 1 (seq-id, db) known 1. Extract the id and DB from the paper 2. Go to the web server of the database it belongs to 3. In the search box type the id and run the search 4. Examine the 5. Done From the box select File as destination; FASTA format Then click on Create File

Searching Sequence Databases Scenario 2 (gene name) known Let us suppose that you read a paper written some time ago and gives the name of the gene; but it does not give any additional information like database it is found in and id. For example: Woese CR, Fox GE. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proceedings of the National Academy of Sciences of the United States of America 1977; 74:5088-90. http://www.pnas.org/content/74/11/5088.full.pdf+html We can choose a likely database that can contain the gene and formulate a query with as much information about the gene as we can gather from the paper. In this case, the database is GenBank at NCBI and the query could be: 16s/18S ribosomal RNA of each one the species listed on the table of p 5089

Scenario 2 (gene name) known 1. Extract the gene name from the paper 2. Go to the web server of a megasite like NCBI,EBI,etc. 3. In the search box type a query with the gene name and run the search 4. Select the MOST likely hit(s) from the list of results. 5. Examine the 6. Done

Scenario 2 (gene name) known 1. Extract the gene name from the paper 2. Go to the web server of a megasite like NCBI,EBI,etc. 3. In the search box type a query with the gene name and run the search 4. Select the MOST likely hit(s) from the list of results. 5. Examine the 6. Done Escherichia coli 16s ribosomal RNA

Scenario 2 (gene name) known 1. Extract the gene name from the paper 2. Go to the web server of a megasite like NCBI,EBI,etc. 3. In the search box type a query with the gene name and run the search 4. Select the MOST likely hit(s) from the list of results. 5. Examine the 6. Done Type the information in the search box The advantage of going to a megasite, like NCBI, is that their repository has good coverage and is updated regularly.

Scenario 2 (gene name) known 1. Extract the gene name from the paper 2. Go to the web server of a megasite like NCBI,EBI,etc. 3. In the search box type a query with the gene name and run the search 4. Select the MOST likely hit(s) from the list of results. 5. Examine the 6. Done Alternatively, you can go to specialized databases like this one for ribosomal RNA s only http://www.arb-silva.de/search/

Scenario 2 (gene name) known 1. Extract the gene name from the paper 2. Go to the web server of a megasite like NCBI,EBI,etc. 3. In the search box type a query with the gene name and run the search 4. Select the MOST likely hit(s) from the list of results. 5. Examine the 6. Done This is a partial view of the results page. As you can see from the figures (representing number of hits); it could easily turn into looking for a needle in a haystack.

Scenario 2 (gene name) known 1. Extract the gene name from the paper 2. Go to the web server of a megasite like NCBI,EBI,etc. 3. In the search box type a query with the gene name and run the search 4. Select the MOST likely hit(s) from the list of results. 5. Examine the 6. Done In the Nucleotide results page we will examine the first hit. To do that, you can click on it directly. If you want to examine several entries, select them first and then click on any of the selected items to retrieve the corresponding record

Scenario 2 (gene name) known 1. Extract the gene name from the paper 2. Go to the web server of a megasite like NCBI,EBI,etc. 3. In the search box type a query with the gene name and run the search 4. Select the MOST likely hit(s) from the list of results. 5. Examine the 6. Done Notice that the description is occupying two lines rather than a single line for this type of format. Make sure that the file you download contains the (s) in the proper format.

Searching Sequence Databases Scenario 3 (partial seq) known Let us suppose that you read a paper that presents only a segment of the ; but does not provide clear details about the source (database and id) For example: Peters JA, Cooper MA, Carland JE, Livesey MR, Hales TG, Lambert JJ. Novel structural determinants of single channel conductance and ion selectivity in 5- hydroxytryptamine type 3 and nicotinic acetylcholine receptors. The Journal of Physiology; 588:587-96. Figure 2 shows the () segment of interest. We have the choice of using a) the GENE name or b) the SEGMENT of the to find the entire of the gene. a) It will be similar to what we did in Scenario 2 b) It requires us to use BLAST to search for the complete by matching the segment of the to one contained in the database.

Scenario 3 (partial seq) known 1. Extract the partial from the paper 2. Go to a BLAST web server. 3. In the box type the partial 4. Select a database and run the search 5. Select the MOST likely hit(s) from the list of results. 6. Examine the 7. Done

Scenario 3 (partial seq) known 1. Extract the partial from the paper 2. Go to a BLAST web server. 3. In the box type the partial 4. Select a database and run the search 5. Select the MOST likely hit(s) from the list of results. 6. Examine the 7. Done This is the partial for the first gene 5-HT3A as it appears on Figure 2 of the paper. NSGERVSFKITLLLGYSVFLIIVSDTLPATA

Scenario 3 (partial seq) known 1. Extract the partial from the paper 2. Go to a BLAST web server. 3. In the box type the partial 4. Select a database and run the search 5. Select the MOST likely hit(s) from the list of results. 6. Examine the 7. Done www.uniprot.org

Scenario 3 (partial seq) known 1. Extract the partial from the paper 2. Go to a BLAST web server. 3. In the box type the partial 4. Select a database and run the search 5. Select the MOST likely hit(s) from the list of results. 6. Examine the 7. Done 1. Select the Blast tab first 2. Paste the partial in the box 3. Leave default parameters unchanged 4. Then click on the Blast button

Scenario 3 (partial seq) known 1. Extract the partial from the paper 2. Go to a BLAST web server. 3. In the box type the partial 4. Select a database and run the search 5. Select the MOST likely hit(s) from the list of results. 6. Examine the 7. Done The most likely hit is NOT always the top hit on the list of results. We will come back to this results page after we learn how to read global and local alignments, which are marked here by the green rectangles in columns 3 and 4 respectively.

Scenario 3 (partial seq) known 1. Extract the partial from the paper 2. Go to a BLAST web server. 3. In the box type the partial 4. Select a database and run the search 5. Select the MOST likely hit(s) from the list of results. 6. Examine the 7. Done 1. Check that the identity is 100% 2. Check that it is the right gene 3. A perfect match (of the local alignment) 4. Retrieve the by clicking on its Accession

Scenario 3 (partial seq) known 1. Extract the partial from the paper 2. Go to a BLAST web server. 3. In the box type the partial 4. Select a database and run the search 5. Select the MOST likely hit(s) from the list of results. 6. Examine the 7. Done

Scenario 3 (partial seq) known 1. Extract the partial from the paper 2. Go to a BLAST web server. 3. In the box type the partial 4. Select a database and run the search 5. Select the MOST likely hit(s) from the list of results. 6. Examine the 7. Done From the previous page, click on fasta, written in yellow, to get to this page which displays the in fasta format. The name of the gene is 5HT3A_HUMAN Its id in this database (called UniProt) is P46098 You can by saving this page to a text file. You can also bookmark this url for future reference. We will revisit this page in the next section.

Additional Resources Sequence Formats http://emboss.sourceforge.net/docs/themes/sequenceformats.html NCBI introductory video http://www.youtube.com/watch?v=x7s1zewqeli UniProt introductory video http://www.youtube.com/watch?v=tcf3qwn7sii