Sequence formats and databases in bioinformatics




Dinesh Gupta, Structural and Computational Biology Group, ICGEB (dinesh@icgeb.res.in)

Topics: Definitions/Basics, Sequence formats, Databases in Biology

What is Bioinformatics?
- Bioinformatics is the use of computers to solve biological and biomedical problems.
- Bioinformatics is the application of information technology to mine, visualize, analyze, integrate, and manage biological and genetic information, which can then be applied to, among other things, accelerating drug discovery and development.
- Application of tools of computation and analysis to the capture and interpretation of biological data.
- Biological data management and analysis.

NIH definition of Bioinformatics (http://www.bisti.nih.gov/compubiodef.pdf): Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Use of Bioinformatics: DNA analysis
- Genome sequencing
- Sequence assembly
- Sequence/gene annotations
- Gene finding/sequence translation tools
- Sequence similarity searching (e.g. BLAST, ClustalW)
- Comparison between genomes
- Evolution of sequences (phylogenetic analysis)
- Gene expression

Use of Bioinformatics (contd.): Protein analysis
- Structure
  - X-ray crystallography
  - Homology-based models
  - Drug designing
- Sequence
  - Sequence similarity
  - Protein family assignments
  - Conserved motifs
- Proteomics data analysis
- Protein evolution

Use of Bioinformatics (contd.): Other uses
- Drug designing
- Vaccine development
- Dairy technology
- Forensics
- Crop improvement
- Designing enzymes for detergents
- Genetic counseling

Bioinformatics: integration of several fields
- Physics
- Computer science
- Biological science
- Mathematics
- Statistics
- Chemistry

Recent events making bioinformatics more important
- Exponential expansion of biological information
- Expansion of multiple types of information
- Cheaper high-throughput technologies
- Improvement in computation power
- Lack of standards/quality
- Need for micro and macro analysis
- Need for better algorithms

Vast growth in (structural) data, but the number of fundamentally new (fold) parts is not increasing that fast.
[Figure series: Total in Databank, New Submissions, New Folds]

Bioinformatics analysis? It is like any other lab analysis!
- You need to know your data/input sources
- You need to understand your methods and their assumptions
- You need a plan to get from point A to point B
- You need to understand your equipment
- You need to be critical and understand potential sources of error
- You need to interpret your results
- Your results need to be reproducible
- Your results should be testable

References (not limited to):
- http://www.ncbi.nlm.nih.gov/about/primer/bioinformatics.html
- http://icgeb.res.in/whotdr
- http://en.wikipedia.org/wiki/bioinformatics
- Baxevanis & Ouellette 2001. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd Edition. John Wiley Publishing.
- Gibas & Jambeck 2001. Developing Bioinformatics Computer Skills. O'Reilly.
- Mount 2001. Bioinformatics: Genome Sequence Analysis.
- Claverie & Notredame 2003. Bioinformatics For Dummies.
- Lesk 2002. Introduction to Bioinformatics.

Sequence formats: Basics
Why different formats?
- Type of information
- Software requirements
- Database requirements

Main file formats used in Bioinformatics
- ASN.1
- EMBL, Swiss-Prot
- FASTA
- GCG
- GenBank/GenPept
- PHYLIP
- PIR

ASN.1: Abstract Syntax Notation 1, used by NCBI

Seq-entry ::= set {
  class phy-set,
  descr {
    pub {
      pub {
        article {
          title {
            name "Cross-species infection of blood parasites between resident and migratory songbirds in Africa" },
          authors {
            names std {
              { name name { last "Waldenstroem", first "Jonas", initials "J." } },
              { name name { last "Bensch", first "Staffan", initials "S." } },
              { name name { last "Kiboi", first "Sam", initials "S." } },
              { name name { last "Hasselquist", first "Dennis", initials "D." } },
              { name name {

EMBL/Swiss-Prot (http://www.ebi.ac.uk/help/formats_frame.html)
- The first line of each sequence entry is the ID line, which contains the entry name, data class, molecule, division and sequence length.
- An XX line contains no data; it is just a separator.
- The AC line lists the accession number.
- The DE lines give a description of the sequence.
- The FT lines give precise annotation for the sequence.
- The sequence information follows the SQ line, which has the two characters SQ in the first two spaces.
- The last line of each sequence entry in the file is a terminator line, which has the two characters // in the first two spaces.

ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
RX   MEDLINE; 94303342.
RX   PUBMED; 8030378.
XX
FT   rRNA            <1..20
FT                   /product="18s ribosomal RNA"
FT   misc_RNA        21..205
FT                   /standard_name="internal transcribed spacer 1 (ITS1)"
FT   rRNA            206..>237
FT                   /product="5.8s ribosomal RNA"
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
     aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc        60
     tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg       120
     ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc       180
     tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc          237
//
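The line-code layout of an EMBL-style entry lends itself to a very small parser. The following is a minimal sketch, not an official EMBL reader: the function name `parse_embl_entry` and the toy entry `TOY_ENTRY` are hypothetical and exist only for illustration. It groups lines by their two-letter code, skips XX separators, and collects the sequence between the SQ line and the // terminator.

```python
from collections import defaultdict

# Hypothetical toy entry (not a real database record), laid out like the example above.
TOY_ENTRY = """\
ID   TOY0001    standard; DNA; FUN; 20 BP.
XX
AC   X00001;
XX
DE   Hypothetical toy entry for illustration.
XX
SQ   Sequence 20 BP; 7 A; 4 C; 5 G; 4 T; 0 other;
     aacctgcgga aggatcatta        20
//
"""


def parse_embl_entry(text):
    """Return (fields, sequence) for a single EMBL-style entry.

    fields maps each two-letter line code (ID, AC, DE, ...) to its lines;
    sequence is the concatenated residue string found after the SQ line.
    """
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # terminator line ends the entry
            break
        if in_sequence:
            # Keep letters only: drops the spaces and the running count.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, rest = line[:2], line[2:].strip()
        if code == "XX" or not code.strip():  # separator or blank line
            continue
        fields[code].append(rest)
        if code == "SQ":                   # sequence data start on the next line
            in_sequence = True
    return dict(fields), "".join(sequence)


fields, sequence = parse_embl_entry(TOY_ENTRY)
print(fields["AC"], len(sequence))
```

For real work an established parser is the safer choice; this sketch ignores multi-entry files and does not interpret the FT feature table.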

FASTA
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.

>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
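Because the only structural rule is the ">" in the first column, a FASTA reader fits in a few lines. This is a minimal sketch under that assumption; `parse_fasta` and the two-record sample are hypothetical names and data chosen for illustration (the second record is made up).

```python
# Two short records; the first reuses the start of the U03518 example,
# the second is a made-up record for illustration.
SAMPLE_FASTA = """\
>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTA
CCGAGTGCGG
>toy2 hypothetical second record
TTCAACAATG
"""


def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence data may span many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


for description, seq in parse_fasta(SAMPLE_FASTA):
    accession = description.split()[0]    # first word is used as the identifier
    print(accession, len(seq))
```

Taking the first word of the description as the identifier is a common convention, not a requirement of the format.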

GCG
- Contains exactly one sequence
- Begins with annotation lines
- The start of the sequence is marked by a line ending with ".."
- This line also contains the sequence identifier, the sequence length and a checksum

ID   AA03518    standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
AA03518  Length: 237  Check: 4514  ..
       1  aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
      61  tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
     121  ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
     181  tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
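The ".." divider is what a reader has to look for in a GCG file. The sketch below assumes the divider line carries `Length:` and `Check:` fields as in the example; `parse_gcg` and the toy file are hypothetical, and the checksum value 1234 is a placeholder that is read back, not verified.

```python
import re

# Hypothetical toy file; the Check value is a placeholder, not a real checksum.
TOY_GCG = """\
DE   Hypothetical toy entry for illustration.
TOY0001  Length: 20  Check: 1234  ..
       1  aacctgcgga aggatcatta
"""


def parse_gcg(text):
    """Return (name, length, check, sequence) from single-sequence GCG text."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):   # divider between annotation and sequence
            divider = index
            break
    else:
        raise ValueError("no line ending with '..' found")

    match = re.search(r"(\S+)\s+Length:\s*(\d+).*?Check:\s*(\d+)", lines[divider])
    if match is None:
        raise ValueError("divider line lacks Length/Check fields")
    name, length, check = match.group(1), int(match.group(2)), int(match.group(3))

    # Everything after the divider is sequence; drop position numbers and spaces.
    sequence = "".join(ch for line in lines[divider + 1:] for ch in line if ch.isalpha())
    return name, length, check, sequence


name, length, check, sequence = parse_gcg(TOY_GCG)
print(name, length, len(sequence) == length)
```

Comparing the declared length with the number of residues actually read is a cheap sanity check after any format conversion.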

GenBank/GenPept
- The nucleotide (GenBank) and protein (GenPept) database entries are available from Entrez in this format
- A file can contain several sequences
- Each sequence entry starts with "LOCUS"
- The sequence data start after "ORIGIN"
- The entry ends with "//"

LOCUS       AAU03518     237 bp    DNA     PLN     04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
            rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
BASE COUNT  41 a 77 c 67 g 52 t
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
      181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
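The three markers (LOCUS, ORIGIN, //) are enough to pull the sequences out of a multi-record GenBank file. The following is a minimal sketch that relies only on those markers; `parse_genbank` and the two toy records are hypothetical, and the annotation lines between LOCUS and ORIGIN are simply skipped.

```python
# Two hypothetical toy records, laid out like the flat-file example above.
TOY_GENBANK = """\
LOCUS       TOY0001      20 bp    DNA
DEFINITION  Hypothetical toy record one.
ORIGIN
        1 aacctgcgga aggatcatta
//
LOCUS       TOY0002      10 bp    DNA
DEFINITION  Hypothetical toy record two.
ORIGIN
        1 ttcaacaatg
//
"""


def parse_genbank(text):
    """Return a list of (locus_name, sequence) tuples from GenBank flat-file text."""
    records = []
    locus, chunks, in_origin = None, [], False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            locus = line.split()[1]        # second field is the locus name
            chunks, in_origin = [], False
        elif line.startswith("ORIGIN"):
            in_origin = True               # sequence lines follow
        elif line.startswith("//"):
            if locus is not None:
                records.append((locus, "".join(chunks)))
            locus, chunks, in_origin = None, [], False
        elif in_origin:
            # Keep letters only: drops position numbers and spaces.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return records


for locus, seq in parse_genbank(TOY_GENBANK):
    print(locus, len(seq))
```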

Phylip format
- The first line of the input file contains the number of sequences and their length (all should have the same length), separated by blanks.
- The next line contains a sequence name, followed by the sequence itself in blocks of 10 characters.
- The rest of the sequences follow in the same way.

Example (excerpt, shown in the interleaved layout: the first block of each sequence comes first, then the continuation blocks follow in the same sequence order):

2 2000
G019uabh  ATACATCATA ACACTACTTC CTACCCATAA GCTCCTTTTA ACTTGTTAAA
G028uaah  CATAAGCTCC TTTTAACTTG TTAAAGTCTT GCTTGAATTA AAGACTTGTT

          GTCTTGCTTG AATTAAAGAC TTGTTTAAAC ACAAAAATTT AGAGTTTTAC
          TAAACACAAA ATTTAGACTT TTACTCAACA AAAGTGATTG ATTGATTGAT

          TCAACAAAAG TGATTGATTG ATTGATTGAT TGATTGATGG TTTACAGTAG
          TGATTGATTG ATGGTTTACA GTAGGACTTC ATTCTAGTCA TTATAGCTGC
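The header rule (number of sequences, then their common length) and the blocks of 10 characters can be sketched as a small writer. This example produces the sequential layout, where each sequence is written out completely before the next one starts. The function name `to_phylip` is hypothetical, and padding names to 10 characters is an assumption based on the classic strict PHYLIP convention.

```python
def to_phylip(records, blocks_per_line=5):
    """Format (name, sequence) pairs as sequential PHYLIP text.

    All sequences must have the same non-zero length. Names are cut or
    padded to 10 characters (assumed strict PHYLIP convention).
    """
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("sequences must all have the same non-zero length")
    length = lengths.pop()

    out = [f"{len(records)} {length}"]     # header: count and length
    for name, seq in records:
        blocks = [seq[i:i + 10] for i in range(0, length, 10)]
        rows = [" ".join(blocks[i:i + blocks_per_line])
                for i in range(0, len(blocks), blocks_per_line)]
        out.append(f"{name[:10]:<10}{rows[0]}")   # name shares the first row
        out.extend(rows[1:])
    return "\n".join(out) + "\n"


# Made-up aligned sequences, for illustration only.
print(to_phylip([("seqA", "ACGTACGTACGT"), ("seqB", "TTTTACGTACGA")]))
```

The equal-length check matters because PHYLIP input is an alignment: a length mismatch usually means the sequences were not aligned before conversion.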

Other formats: MEGA

#mega
Title: infile.fasta

#G019uabh
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG
ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGCA
AAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC

#G028uaah
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAACACAAA
ATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACA
GTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATACA
TTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCT
ATAGCCTCCTTCCCCATCCCATCAGTCT

Format conversion tools

ReadSeq
Don Gilbert, software@bio.indiana.edu, May 2001, Indiana University, Bloomington, Indiana
WWW:
- http://www.ebi.ac.uk/cgi-bin/readseq.cgi
- http://bioportal.bic.nus.edu.sg/readseq/readseq.html
- http://www-bimas.cit.nih.gov/molbio/readseq/

Seqret
A program in the EMBOSS suite

The Readseq package can read most common formats; examples of all these formats are included in the readseq directory. The formats include:
- IG/Stanford, used by Intelligenetics and others
- GenBank/GB, GenBank flat-file format
- NBRF format (SAM modifications cause this to break when sequences do not have a terminating asterisk)
- EMBL, EMBL flat-file format
- GCG, single-sequence format of GCG software
- DNAStrider, for the common Mac program
- Fitch format, limited use
- Pearson/Fasta, a common format used by Fasta programs and others
- Zuker format, limited use. Input only.
- Olsen, format printed by Olsen VMS sequence editor. Input only.
- Phylip3.2, sequential format for Phylip programs
- Plain/Raw, sequence data only (no name, document, numbering)
- MSF, multi-sequence format used by GCG software
- PAUP's multiple sequence (NEXUS) format
- PIR/CODATA format used by PIR
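A conversion tool first has to work out which format it has been given. The sketch below is a naive first-line heuristic covering only the formats illustrated earlier; it is not how Readseq or Seqret detect formats, and `guess_format` is a hypothetical name. A GCG file cannot be recognized from its first line alone, since its annotation lines may look like any other format, so it is not covered.

```python
def guess_format(text):
    """Guess a sequence file format from its first non-blank line (naive heuristic)."""
    for line in text.splitlines():
        if line.strip():
            first = line
            break
    else:
        return "unknown"                   # empty input

    if first.startswith(">"):
        return "fasta"
    if first.startswith("LOCUS"):
        return "genbank"
    if first.startswith("ID "):
        return "embl"
    if first.lower().startswith("#mega"):
        return "mega"
    parts = first.split()
    if len(parts) == 2 and all(part.isdigit() for part in parts):
        return "phylip"                    # "number-of-sequences length" header
    return "unknown"


print(guess_format(">U03518 Aspergillus awamori\nAACCTGCGGA\n"))
print(guess_format("2 2000\nG019uabh  ATACATCATA\n"))
```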

Databases in Biology

Need for databases in Biology?
- The need for storing and communicating large datasets has grown.
- Need to disseminate biological information.
- Need to provide organized data for analysis-friendly retrieval.
- Need to make biological data available in computer-readable form.

Different classifications of databases: Type of data
- Nucleotide sequences
- Protein sequences
- Protein sequence patterns or motifs
- Macromolecular 3D structure
- Gene expression data
- Metabolic pathways
- Proteomics data

Different classifications of databases: Primary or derived databases
- Primary databases: experimental results entered directly into the database
- Secondary databases: results of analysis of primary databases
- Aggregate of many databases
  - Links to other data items
  - Combination of data
  - Consolidation of data

Different classifications of databases: Technical design
- Flat files
- Relational database (SQL)
- Exchange/publication technologies (HTML, CORBA, XML, ...)
Each of the above can be converted into the others.
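As an illustration of moving between technical designs, records parsed from a flat file can be loaded into a relational table and queried with SQL. This is a minimal sketch using Python's built-in `sqlite3` module; the table and column names are hypothetical, the first record holds only the first 20 bases of the U03518 example for brevity, and the second record is made up.

```python
import sqlite3

# In-memory database; table and column names are illustrative choices.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence ("
    "accession TEXT PRIMARY KEY, description TEXT, residues TEXT)"
)

records = [
    # First 20 bases only, for brevity.
    ("U03518", "Aspergillus awamori internal transcribed spacer 1 (ITS1)",
     "AACCTGCGGAAGGATCATTA"),
    # Made-up record for illustration.
    ("X00001", "Hypothetical toy entry", "TTCAACAATG"),
]
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?)", records)

# A text search that would need a custom scan over a flat file is one query here.
rows = conn.execute(
    "SELECT accession, length(residues) FROM sequence WHERE description LIKE ?",
    ("%Aspergillus%",),
).fetchall()
print(rows)
conn.close()
```

The flat file stays the better exchange format, while the relational form makes searching and joining easier; that trade-off is why the same data are often kept in both.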

Different classifications of databases: Availability
- Publicly available, no restrictions
- Available, but with copyright
- Accessible, but not downloadable
- Academic, but not freely available
- Proprietary, commercial; possibly free for academics

Different classifications of databases: Content
- Protein/DNA/RNA/miRNA etc.
- Family: kinases
- Common physical properties: membrane-bound, mitochondrial proteins
- Common chemical properties: proteases, reductases etc.
- Sequences of a particular genome/species: e.g. influenza sequences, Plasmodium sequences etc.
- Motifs/domains

Where to look for databases?
- Search engines
- Journals related to bioinformatics
- Websites like:
  - http://www.biophys.uni-duesseldorf.de/bionet/pedro/rt_all.html
  - www.expasy.ch
- Several other websites

NAR DB issue 2010
- 58 new databases since last year; total >1230
- http://www.oxfordjournals.org/nar/database/a/ (complete list, searchable)
- http://nar.oxfordjournals.org/cgi/content/full/gkm1037/dc1/1 (HTML format; also available as a downloadable Word file)

http://www3.oup.co.uk/nar/database/c/

Database searching tips
- Look for links to Help or Examples
- Always check update dates
- Check the level of curation
- Try Boolean searches
- Be careful with UK/US spelling differences:
  - leukaemia vs leukemia
  - haemoglobin vs hemoglobin
  - colour vs color
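The Boolean and spelling tips combine naturally: join spelling variants with OR and separate concepts with AND. The sketch below only builds the query string. The helper names `or_group` and `build_query` are hypothetical, and the uppercase AND/OR syntax is an assumption, since operators differ between databases (check each database's Help page).

```python
def or_group(variants):
    """Join spelling variants with OR; quote any variant containing a space."""
    terms = [f'"{v}"' if " " in v else v for v in variants]
    return "(" + " OR ".join(terms) + ")"


def build_query(*groups):
    """Combine groups of variants with AND (assumed generic Boolean syntax)."""
    return " AND ".join(or_group(group) for group in groups)


query = build_query(["leukaemia", "leukemia"], ["haemoglobin", "hemoglobin"])
print(query)  # (leukaemia OR leukemia) AND (haemoglobin OR hemoglobin)
```

Searching with both spellings avoids silently missing records that were annotated with the other variant.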

Exercise
- Retrieve sequences from sequence databases
- Convert sequence formats
- Study different formats and flow of information