Biological Sequence Data Formats

Similar documents
Bioinformatics Resources at a Glance

GenBank, Entrez, & FASTA

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Translation Study Guide

Gene Models & Bed format: What they represent.

Name: Date: Period: DNA Unit: DNA Webquest

To be able to describe polypeptide synthesis including transcription and splicing

Transcription and Translation of DNA

MAKING AN EVOLUTIONARY TREE

Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011

DNA and the Cell. Version 2.3. English version. ELLS European Learning Laboratory for the Life Sciences

A Tutorial in Genetic Sequence Classification Tools and Techniques

A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques

The Steps. 1. Transcription. 2. Transferal. 3. Translation

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

RNA & Protein Synthesis

Protein Synthesis How Genes Become Constituent Molecules

Introduction to The DNA Discovery Kit

Searching Nucleotide Databases

Ms. Campbell Protein Synthesis Practice Questions Regents L.E.

Molecular Genetics. RNA, Transcription, & Protein Synthesis

Replication Study Guide

CCR Biology - Chapter 9 Practice Test - Summer 2012

DNA Sequence formats

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

Name Class Date. Figure Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.

Bioinformatics Grid - Enabled Tools For Biologists.

Lab 2/Phylogenetics/September 16, PHYLOGENETICS

Control of Gene Expression

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

Sequencing the Human Genome

Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison

DNA Sequence Analysis Software

Lecture Series 7. From DNA to Protein. Genotype to Phenotype. Reading Assignments. A. Genes and the Synthesis of Polypeptides

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

Lecture 1 MODULE 3 GENE EXPRESSION AND REGULATION OF GENE EXPRESSION. Professor Bharat Patel Office: Science 2, b.patel@griffith.edu.

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

CCR Biology - Chapter 8 Practice Test - Summer 2012

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

Committee on WIPO Standards (CWS)

Lecture Outline. Introduction to Databases. Introduction. Data Formats Sample databases How to text search databases. Shifra Ben-Dor Irit Orr

13.2 Ribosomes & Protein Synthesis

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!!

From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains

Module 1. Sequence Formats and Retrieval. Charles Steward

Activity 7.21 Transcription factors

Biological Databases and Protein Sequence Analysis

Basic Concepts of DNA, Proteins, Genes and Genomes

13.4 Gene Regulation and Expression

The molecules of life. The molecules that make up living things are really big They are called macromolecules

Cancer Genomics: What Does It Mean for You?

2. The number of different kinds of nucleotides present in any DNA molecule is A) four B) six C) two D) three

Human Genome Organization: An Update. Genome Organization: An Update

Module 10: Bioinformatics

Teaching Bioinformatics to Undergraduates

1 Mutation and Genetic Change

Bio-Informatics Lectures. A Short Introduction

Overview of Eukaryotic Gene Prediction

SUBMITTING DNA SEQUENCES TO THE DATABASES

Bob Jesberg. Boston, MA April 3, 2014

Biotechnology: DNA Technology & Genomics

Integration of data management and analysis for genome research

Gene Switches Teacher Information

Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question.

Integrating Bioinformatics, Medical Sciences and Drug Discovery

Frequently Asked Questions Next Generation Sequencing

Regents Biology REGENTS REVIEW: PROTEIN SYNTHESIS

Pairwise Sequence Alignment

CD-HIT User s Guide. Last updated: April 5,

Custom TaqMan Assays For New SNP Genotyping and Gene Expression Assays. Design and Ordering Guide

GenBank: A Database of Genetic Sequence Data

Structure and Function of DNA

UGENE Quick Start Guide

Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure enzymes control cell chemistry ( metabolism )

a. Ribosomal RNA rrna a type ofrna that combines with proteins to form Ribosomes on which polypeptide chains of proteins are assembled

A Web Based Software for Synonymous Codon Usage Indices

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

COMPARING DNA SEQUENCES TO DETERMINE EVOLUTIONARY RELATIONSHIPS AMONG MOLLUSKS

14.3 Studying the Human Genome

Introduction to GCG and SeqLab

Genetics Module B, Anchor 3

ISTEP+: Biology I End-of-Course Assessment Released Items and Scoring Notes

DNA is found in all organisms from the smallest bacteria to humans. DNA has the same composition and structure in all organisms!

LESSON 4. Using Bioinformatics to Analyze Protein Sequences. Introduction. Learning Objectives. Key Concepts

FINDING RELATION BETWEEN AGING AND

Bioinformatics, Sequences and Genomes

Subject Area(s) Biology. Associated Unit Engineering Nature: DNA Visualization and Manipulation. Associated Lesson Imaging the DNA Structure

Lecture 3: Mutations

SICKLE CELL ANEMIA & THE HEMOGLOBIN GENE TEACHER S GUIDE

Introduction to Bioinformatics 3. DNA editing and contig assembly

Transcription:

Biological Sequence Data Formats Here we present three standard formats in which biological sequence data (DNA, RNA and protein) can be stored and presented. Raw Sequence: Data without description. FASTA Format: One line of description, then sequence. GenBank Record: Lots of detailed description about the sequence.

Raw Sequence: DNA In the text box below, is an example of raw DNA sequence. Just the four different nucleotides of one particular DNA strand. No information on the gene or the organisms the sequence came from. (See Biology in the Computer.) Raw Sequence: Protein Below is an example of a raw protein sequence. The letters indicate one of the twenty different amino acids and the order tells how they are put together.

FASTA Format The fasta format (originally created for a program called, you guessed it, FASTA ) is a ubiquitous format in bioinformatics and is accepted as input to many bioinformatics analysis tools. It is almost as simple as the raw format, but has a Title Line that provides some information about the sequence. FASTA formats always have a title line, and it always begins with a > and ends with a return character. FASTA Format: DNA Below is a FASTA file for the DNA sequence that codes for the G- gamma-globin protein of a spider monkey, Ateles geoffroyi.

FASTA Format: Protein Below is a fasta file for the Protein sequence for the G-gamma-globin protein of a spider monkey, Ateles geoffroyi. This is the FASTA sequence record from GenBank, a major database of biological sequence information. The codes at the beginning of the title are tracking identifiers used by GenBank to organize and find sequences in the database. FASTA Format: Make up your own titles You do not have to have complicated titles. It is easy to make up your own titles. For example: > Seq1 CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAA

WARNING: Some programs have difficulty with titles that are too long, include spaces or non-letter or number characters. Avoid (1) Names longer than 15 character; (2) Spaces; and (3) Characters other than letters or numbers. FASTA Format: Multiple Entries Sometimes you need to input many sequences at the same time to a program, such as a multiple sequence alignment program. This is easy in FASTA format see below. (Note: These sequences are all the same length, but this does not have to be the case.) > HumanGlobin CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAA ATCTTTAAATCCACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAAT > MonkeyGlobin GTATATAATGATAATTTTATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTT TTTGAGGTCAATACAAATCCTATTTCTTGTGGTTTTCTTTCCTTCACTTAGCTATGGA > HorseGlobin ATTTGTTATATTGGATACAAGCTTTGCTACGATCTACATTTGGGAATGTGAGTCTCTT GGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTATGTTTGGACATTT

GenBank Record The GenBank format is an example of a data-rich format. It is used by The National Center for Biotechnology Information (NCBI) and each record is given a unique identification code. (Actually more than one.) The full biological sequence of the record is always at the end of the record. To the right is the GenBank record for the Spider Monkey globin gene. Read below for more details on the types of information in a GenBank file.

GenBank Record: Background on the sequence The beginning of the GenBank file contains background information such as the source of the biological molecule (what organism) and the scientists who discovered the sequence. On the le( side of the file are keywords indica5ng the details present in the file. The accession number and the GI (GenBank Iden5fica5on) number are unique to this record. The organism, or other source, of the biological sequence. Informa5on on the discoverers of the sequence, publica5ons with the sequence, and more.

GenBank Record: The sequence At the end of the file is the biological sequence. In this case it is a DNA sequence, but it may also be RNA or protein. The nucleo5des (or amino acids in the case of protein) are numbered. The c is the first nucleo5de of the file. The end of the file is demarcated by two back slashes.

GenBank Record: Feature section In the middle of the file, the FEATURES section describe the various molecular features of the sequence and some of the biological activity. The posi5ons (length) of the sequence at the end of the file. The exon and intron features indicate their posi5ons in the sequence below. Biological informa5on. In this case the DNA was derived from spider monkey skin cells. CDS stands for CoDing Sequence. Indicates if part or all of the sequence is translated into protein. Accession/GI numbers of a different GenBank file with the protein sequence. The amino acids of the translated protein. Very convenient!