Lecture Outline. Introduction to Databases. Introduction. Data Formats Sample databases How to text search databases. Shifra Ben-Dor Irit Orr




Introduction to Databases
Shifra Ben-Dor, Irit Orr

Lecture Outline
  Introduction
  Data and Database types
  Database components
  Data Formats
  Sample databases
  How to text search databases

What units of information do we deal with in bioinformatics?
  SNPs
  Nucleotide sequence
  Genes
  mRNA
  AAGTGCCACTGCATAAATGACCATGAGTGGGCACCGGTAAGGGAGGGTGATGCTATCTGGTCTGAAG
  DNA, RNA, Protein
  Sequence, Structure, Evolution, Pathways, Interactions, Mutations
  Protein primary sequence
  Protein 3D structure
  Protein function

Protein function, two examples:
  Acts as a tumor suppressor in many tumor types. Induces growth arrest or apoptosis depending on the physiological circumstances or cell type, but both activities are involved in tumor suppression.
  Involved in the transport of chloride ions. Defects in CFTR are the cause of cystic fibrosis. It is the most common genetic disease in the Caucasian population, with a prevalence of about 1 in 2000 live births. CF, an autosomal recessive disorder, is a common generalized disorder of exocrine gland function.
Slide provided by Dr. Vered Caspi

All of these have databases and tools that were created to work with them.

What do we want from databases?
Information retrieval from sequence databases
  Biological databases contain enormous amounts of data.
  Databases need to be well annotated.
  Databases need to be easily searched.
  Data found in databases should be easily retrieved.
  Data in databases should be in standard formats.

Integrated Information Retrieval
  Many databases contain logical relations between specific entries.
  One interface can connect many biological databases.
  For example: a database that connects protein sequence, protein domain, protein structure and reference databases (InterPro).
  Another example: a connection between reference, protein sequence, DNA sequence and structure databases (Entrez).
Slide provided by Dr. Vered Caspi

Core Data and Annotation
Databases generally have (at least) two types of data:
  Core data: the data the database was generated to organize
  Annotation: extra information that rounds out our picture of the core data
For example, in a genome database the sequence is the core data, and the location of genes is the annotation.

Database Issues
  Printed journals vs. databases
  Direct submission to databases (e.g. GenBank, GDB, PDB)
  Archival vs. curated databases
  Databases that publish experimental results of large genomic centers
  Public vs. private databases
  Database scope

For Example: Classification of Genomic Databases
  Database scope: many genomes; one genome; one subject; one gene
  Information source: direct submission from scientific community; scientific literature; genome center's experimental results; other databases
  Information type: mapping; sequence & annotation; protein structure & function; variations; comparative genomics; gene networks
Slide provided by Dr. Vered Caspi

Database search
  free text
  field-specific
  sequence-based
Database output
  text
  graphics
  dynamic
User Interface
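The split between core data and annotation, and the difference between a free-text and a field-specific search, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entry` class, the two search functions and the field names `organism` and `definition` are hypothetical, the two entries reuse accessions and descriptions from the sample records in this lecture, and each `core_data` string holds just the first ten bases of the corresponding sequence. Sequence-based search is not covered by this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    """One database entry: core data plus annotation (hypothetical layout)."""
    accession: str
    core_data: str                                   # the sequence itself
    annotation: Dict[str, str] = field(default_factory=dict)


def free_text_search(entries: List[Entry], term: str) -> List[Entry]:
    """Return entries where the term occurs in ANY annotation field."""
    needle = term.lower()
    return [e for e in entries
            if any(needle in value.lower() for value in e.annotation.values())]


def field_search(entries: List[Entry], field_name: str, term: str) -> List[Entry]:
    """Return entries where the term occurs in ONE named annotation field."""
    needle = term.lower()
    return [e for e in entries
            if needle in e.annotation.get(field_name, "").lower()]


# Two toy entries; core_data is shortened to the first ten bases.
entries = [
    Entry("NM_000394", "acactgcgct", {
        "organism": "Homo sapiens",
        "definition": "Homo sapiens crystallin, alpha A (CRYAA), mRNA.",
    }),
    Entry("AJ279484", "ccatttagag", {
        "organism": "ascomycota sp. 4/97-9",
        "definition": "Unidentified ascomycota sp. 4/97-9 5.8S rRNA gene and ITS 1 and 2",
    }),
]

# Free text: "rna" matches "mRNA" in one entry and "rRNA" in the other.
free_hits = [e.accession for e in free_text_search(entries, "rna")]
# Field-specific: restricting the search to the organism field narrows it.
field_hits = [e.accession for e in field_search(entries, "organism", "sapiens")]
print(free_hits, field_hits)
```

A free-text query tends to return more hits, some of them irrelevant, while a field-specific query depends on the database being consistently annotated.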

Data Formats
There are many data formats used for sequences (both nucleic and amino acid):
  Fasta Format
  GenBank Format
  EMBL Format
  GCG Format

Fasta Format
  Simplest format
  Least information
  Starts with a > and the sequence name on one line
  The sequence in plain text follows

>OB2T2
GTGACAACATGTACAGCTGTGAGCGGTGTAAGAAGCTGCGGAACGGAGTGAAGTACTGCA
AAGTCCTGCGGTTGCCCGAGATCCTGTGCATTCACCTAAAGCGCTTTCGGCACGAGGTGA
TGTACTCATTCAAGATCAACAGCCACGTCTCCTTGCCCTCGAGGGGCTCGACCTGCGCCC
CTTCCTTGCCAAGGAGTGCACATCCCAGATCACCACCTACGACCTCCTCTCGGTCATCTG
CCACCACGGCACGGCAGGCA
>TNRC_HUMAN P36941 (tumor necrosis factor c receptor)
MLLPWATSAPGLAWGPLVLGLFGLLAASQPQAVPPYASENQTCRDQEKEYYEPQHRICCS
RCPPGTYVSAKCSRIRDTVCATCAENSYNEHWNYLTICQLCRPCDPVMGLEEIAPCTSKR
KTQCRCQPGMFCAAWALECTHCELLSDCPPGTEAELKDEVGKGNNHCVPCKAGHFQNTSS
PSARCQPHTRCENQGLVEAAPGTAQSDTTCKNPLEPLPPEMSGTMLMLAVLLPLAFFLLL
ATVFSCIWKSHPSLCRKLGSLLKRRPQGEGPNPVAGSWEPPKAHPYFPDLVQPLLPISGD
VSPVSTGLPAAPVLEAGVPQQQSPLDLTREPQLEPGEQSQVAHGTNGIHVTGGSMTITGN
IYIYNGPVLGGPPGPGDLPATPEPPYPIPEEGDPGPPGLSTPHQEDGKAWHLAETEHCGA
TPSNRGPRNQFITHD
>TNRC_MOUSE P50284 lymphotoxin-beta receptor precursor
MRLPRASSPCGLAWGPLLLGLSGLLVASQPQLVPPYRIENQTCWDQDKEYYEPMHDVCCS
RCPPGEFVFAVCSRSQDTVCKTCPHNSYNEHWNHLSTCQLCRPCDIVLGFEEVAPCTSDR
KAECRCQPGMSCVYLDNECVHCEEERLVLCQPGTEAEVTDEIMDTDVNCVPCKPGHFQNT
SSPRARCQPHTRCEIQGLVEAAPGTSYSDTICKNPPEPGAMLLLAILLSLVLFLLFTTVL
ACAWMRHPSLCRKLGTLLKRHPEGEESPPCPAPRADPHFPDLAEPLLPMSGDLSPSPAGP
PTAPSLEEVVLQQQSPLVQARELEAEPGEHGQVAHGANGIHVTGGSVTVTGNIYIYNGPV
LGGTRGPGDPPAPPEPPYPTPEEGAPGPSELSTPYQEDGKAWHLAETETLGCQDL
>TNR1_RAT P22934 tumor necrosis factor receptor 1 precursor (p60)
MGLPIVPGLLLSLVLLALLMGIHPSGVTGLVPSLGDREKRDNLCPQGKYAHPKNNSICCT
KCHKGTYLVSDCPSPGQETVCEVCDKGTFTASQNHVRQCLSCKTCRKEMFQVEISPCKAD
MDTVCGCKKNQFQRYLSETHFQCVDCSPCFNGTVTIPCKEKQNTVCNCHAGFFLSGNECT
PCSHCKKNQECMKLCLPPVANVTNPQDSGTAVLLPLVIFLGLCLLFFICISLLCRYPQWR
PRVYSIICRDSAPVKEVEGEGIVTKPLTPASIPAFSPNPGFNPTLGFSTTPRFSHPVSST
PISPVFGPSNWHNFVPPVREVVPTQGADPLLYGSLNPVPIPAPVRKWEDVVAAQPQRLDT
ADPAMLYAVVDGVPPTRWKEFMRLLGLSEHEIERLELQNGRCLREAHYSMLEAWRRRTPR
HEATLDVVGRVLCDMNLRGCLENIRETLESPAHSSTTHLPR
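Because the Fasta rules are so simple (a ">" line carrying the name, then plain sequence lines), a reader for the format fits in a few lines. The following is a minimal sketch in Python, not a reference implementation: the function name `parse_fasta` is hypothetical, and the example text is a shortened excerpt of the entries shown above (only the first and last lines of OB2T2 and the first line of TNRC_HUMAN).

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a Fasta file."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # a new entry closes the previous one
                yield header, "".join(chunks)
            header = line[1:].strip()     # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)           # sequence line of the current entry
    if header is not None:                # emit the last entry
        yield header, "".join(chunks)


# Shortened excerpt of the entries above (not the full sequences).
example = """>OB2T2
GTGACAACATGTACAGCTGTGAGCGGTGTAAGAAGCTGCGGAACGGAGTGAAGTACTGCA
CCACCACGGCACGGCAGGCA
>TNRC_HUMAN P36941 (tumor necrosis factor c receptor)
MLLPWATSAPGLAWGPLVLGLFGLLAASQPQAVPPYASENQTCRDQEKEYYEPQHRICCS
"""

records = list(parse_fasta(example.splitlines()))
for header, sequence in records:
    # The first word of the header is the sequence name.
    print(header.split()[0], len(sequence))
```

The sketch joins the wrapped sequence lines into one string per entry, which is usually the form that downstream tools expect.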

GenBank sequence format
NM_000394. Homo sapiens crys...[gi:14043059]

LOCUS       NM_000394    1114 bp    mRNA    PRI    15-MAY-2001
DEFINITION  Homo sapiens crystallin, alpha A (CRYAA), mRNA.
ACCESSION   NM_000394
VERSION     NM_000394.2  GI:14043059
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1114)
  AUTHORS   Jaworski,C.J. and Piatigorsky,J.
  TITLE     A pseudo-exon in the functional human alpha A-crystallin gene
  JOURNAL   Nature 337 (6209), 752-754 (1989)
  MEDLINE   89143747
   PUBMED   2918909
REFERENCE   2  (bases 1 to 1114)
  AUTHORS   Jaworski,C.J.
  TITLE     A reassessment of mammalian alpha A-crystallin sequences using
            DNA sequencing: implications for anthropoid affinities of
            tarsier
FEATURES             Location/Qualifiers
     source          1..1114
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="21"
                     /map="21q22.3"
     gene            1..1114
                     /gene="CRYAA"
                     /note="CRYA1"
                     /db_xref="LocusID:1409"
                     /db_xref="MIM:123580"
     misc_feature    70..234
                     /note="crystallin; Region: Alpha crystallin A chain"
     CDS             70..591
                     /gene="CRYAA"
                     /note="human alphaA-crystallin; crystallin, alpha-1"
                     /codon_start=1
                     /db_xref="LocusID:1409"
                     /db_xref="MIM:123580"
                     /product="crystallin, alpha A"
                     /protein_id="NP_000385.1"
                     /db_xref="GI:4503055"
                     /translation="MDVTIQHPWFKRTLGPFYPSRLFDQFFGEGLFEYDLLPFL
                     SSTISPYYRQSLFRTVLDSGISEVRSDRDKFVIFLDVKHFSP
                     EDLTVKVQDDFVEIHGKHNERQDDHGYISREFHRRYRLPS
                     NVDQSALSCSLSADGMLTFCGPKIQTGLDATHAERAIPVSR
                     EEKPTSAPSS"
     misc_feature    244..555
                     /note="hsp20; Region: Hsp20/alpha crystallin family"
     polyA_signal    1092..1097

BASE COUNT      183 a    400 c    309 g    222 t
ORIGIN
        1 acactgcgct gcccagaggc cccgctgact cctgccagcc tccaggtccc cgtggtacca
       61 aagctgaaca tggacgtgac catccagcac ccctggttca agcgcaccct ggggcccttc
      121 taccccagcc ggctgttcga ccagtttttc ggcgagggcc tttttgagta tgacctgctg
      181 cccttcctgt cgtccaccat cagcccctac taccgccagt ccctcttccg caccgtgctg
      241 gactccggca tctctgaggt tcgatccgac cgggacaagt tcgtcatctt cctcgatgtg
      301 aagcacttct ccccggagga cctcaccgtg aaggtgcagg acgactttgt ggagatccac
      361 ggaaagcaca acgagcgcca ggacgaccac ggctacattt cccgtgagtt ccaccgccgc
      421 taccgcctgc cgtccaacgt ggaccagtcg gccctctctt gctccctgtc tgccgatggc
      481 atgctgacct tctgtggccc caagatccag actggcctgg atgccaccca cgccgagcga
      541 gccatccccg tgtcgcggga ggagaagccc acctcggctc cctcgtccta agcaggcatt
      601 gcctcggctg gctcccctgc agccctggcc catcatgggg ggagcaccct gagggcgggg
      661 tgtctgtctt cctttgcttc ccttttttcc tttccacctt ctcacatgga atgagggttt
      721 gagagagcag ccaggagagc ttagggtctc agggtgtccc agaccccgac accggccagt
      781 ggcggaagtg accgcacctc acactccttt agatagcagc ctggctcccc tggggtgcag
      841 gcgcctcaac tctgctgagg gtccagaagg agggggtgac ctccggccag gtgcctcctg
      901 acacacctgc agcctccctc cgcggcgggc cctgcccaca cctcctgggg cgcgtgaggc
      961 ccgtggggcc ggggcttctg tgcacctggg ctctcgcggc ctcttctctc agaccgtctt
     1021 cctccaaccc ctctatgtag tgccgctctt ggggacatgg gtcgcccatg agagcgcagc
     1081 ccgcggcaat caataaacag caggtgatac aagc
//
Revised: October 24, 2001.

EMBL sequence format
ID   A4279484 standard; DNA; FUN; 581 BP.
AC   AJ279484;
SV   AJ279484.1
DT   14-JAN-2000 (Rel. 62, Created)
DT   14-JAN-2000 (Rel. 62, Last updated, Version 2)
DE   Unidentified ascomycota sp. 4/97-9 5.8S rRNA gene and ITS 1 and 2
KW   5.8S ribosomal RNA; 5.8S rRNA gene; internal transcribed spacer 1;
KW   internal transcribed spacer 2; ITS1; ITS2.
OS   ascomycota sp. 4/97-9
OC   Eukaryota; Fungi; Ascomycota.
RN   [1]
RP   1-581
RA   Wirsel S.G.R.;
RT   ;
RL   Submitted (21-DEC-1999) to the EMBL/GenBank/DDBJ databases.
RL   Wirsel S.G.R., Fakultaet fuer Biologie, Universitaet Konstanz,
RL   Universitaetsstr. 10, Konstanz 78434, Germany.
RN   [2]
RA   Wirsel S.G.R., Leibinger W., Mendgen K.W.;
RT   "Genetic diversity of fungi associated with common reed (Phragmites
RT   australis)";
RL   Unpublished.
FH   Key             Location/Qualifiers
FH
FT   source          1..581
FT                   /db_xref="taxon:112223"
FT                   /organism="ascomycota sp. 4/97-9"
FT                   /isolate="4/97-9"

FT   misc_feature    64..226
FT                   /note="internal transcribed spacer 1, ITS1"
FT   rRNA            227..385
FT                   /gene="5.8S rRNA"
FT                   /product="5.8S ribosomal RNA"
FT   misc_feature    386..529
FT                   /note="internal transcribed spacer 2, ITS2"
SQ   Sequence 581 BP; 132 A; 164 C; 145 G; 140 T; 0 other;
     ccatttagag gaagtaaaag tcgtaacaag gtctccgttg gtgaaccagggagggatc        60
     ttacgagagt gtcaccactc ccaacccact gtttacctac ccgtccaccg tgcttcggca       120
     ggcagtcctg tgggacaggg cctcgccccc ctccgggggg tgcctgccgc

EMBL entry
Each line in the entry begins with a two-character line code, which indicates the type of information contained in the line. The currently used line types, along with their respective line codes, are listed below:
  ID - identification (begins each entry; 1 per entry)
  AC - accession number (>=1 per entry)
  SV - sequence version (1 per entry)
  DT - date (2 per entry)
  DE - description (>=1 per entry)
  KW - keyword (>=1 per entry)
  OS - organism species (>=1 per entry)
  OC - organism classification (>=1 per entry)
  OG - organelle (0 or 1 per entry)
  RN - reference number (>=1 per entry)
  RC - reference comment (>=0 per entry)
  RP - reference positions (>=1 per entry)
  RX - reference cross-reference (>=0 per entry)
  RA - reference author(s) (>=1 per entry)
  RT - reference title (>=1 per entry)
  RL - reference location (>=1 per entry)
  DR - database cross-reference (>=0 per entry)
  FH - feature table header (0 or 2 per entry)
  FT - feature table data (>=0 per entry)
  CC - comments or notes (>=0 per entry)
  XX - spacer line (many per entry)
  SQ - sequence header (1 per entry)
  bb - (blanks) sequence data (>=1 per entry)
  // - termination line (ends each entry; 1 per entry)
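The two-character line codes make an EMBL entry easy to take apart mechanically: read the first two characters of each line and file the rest of the line under that code. Below is a minimal sketch in Python under a few assumptions: the function name `group_embl_lines` and the key `"sequence"` (used for the blank-coded sequence lines) are hypothetical, XX spacer lines are simply skipped, and the sample text is a shortened excerpt of the entry shown above with only the first forty bases of the sequence.

```python
from collections import defaultdict
from typing import Dict, List


def group_embl_lines(entry_text: str) -> Dict[str, List[str]]:
    """Group the lines of one EMBL entry by their two-character line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if not line.strip():
            continue                          # skip empty lines
        if line.startswith("//"):
            break                             # termination line ends the entry
        code, content = line[:2], line[2:].strip()
        if code == "XX":
            continue                          # spacer line carries no data
        if code.strip() == "":
            # Blank line code: sequence data. Keep the letters only,
            # dropping the group spaces and the trailing base count.
            fields["sequence"].append("".join(ch for ch in content if ch.isalpha()))
        else:
            fields[code].append(content)
    return dict(fields)


# Shortened excerpt of the entry above (sequence cut to forty bases).
sample_entry = """\
ID   A4279484 standard; DNA; FUN; 581 BP.
AC   AJ279484;
DT   14-JAN-2000 (Rel. 62, Created)
DT   14-JAN-2000 (Rel. 62, Last updated, Version 2)
OS   ascomycota sp. 4/97-9
SQ   Sequence 581 BP; 132 A; 164 C; 145 G; 140 T; 0 other;
     ccatttagag gaagtaaaag tcgtaacaag gtctccgttg
//
"""

fields = group_embl_lines(sample_entry)
print(fields["AC"], len(fields["DT"]), "".join(fields["sequence"]))
```

Line codes that may occur several times per entry, such as DT or KW, end up as lists with one item per line, in the order they appear.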

GCG Format
  Has space for comments and space for data, separated by two dots (..)
  Can contain full sequence data like GenBank or EMBL
  Has a minimum of sequence name, length, date, type (nucleic or amino acid) and checksum

!!NA_SEQUENCE 1.0
5B3.seq  Length: 744  March 18, 1999 10:43  Type: N  Check: 2586  ..

       1  TCTAGAGGAG AYATYGTWAT GACCCAGTCT CCATCCTCCC TGAGTGTGTC
      51  AGCAGGAGAG AAGGTCACTA TGAGCTGCAA GTCCAGTCAG AGTCTGTTAA
     101  ACAGTAGAAA TCAAAAGAAC TACTTGGCCT GGTACCAGCA GAAACCAGGA
     151  CAGCCTCCTA AACTTTTGAT CTACGGGGTA TTTATTAGGG ATTCTGGGGT
     201  CCCTGATCGC TTCACAGGCA GTGGATCTGG AACCGATTTC ACTCTTACCA
     251  TCAGCAGTGT GCAGGCTGAA GACCTGGCAG TTTATTACTG TCAGAATGAT
     301  CATATTTATC CGTACACGTT CGGAGGGGGC ACWAAGCTGG AAATTAAAGG
     351  GTCGACTTCC GGTAGCGGCA AATCCTCTGA AGGCAAAGGT SAGGTSCAGC
     401  TGCAGGAGTC TGGACCTGGC CTGGTGAAGC CTTCCCAGTC TCTGTCCCTC
     451  ACCTGCTCTG TCACTGGTTA CTCAATCACC AGTGGTTATG CCTGGAACTG
     501  GATCCGGCAG TTTCCAGGAA ACAAACTGGA GTGGATGGGC TACATAAGCT
     551  ACAGTGGTTT CACTAGCTAC AACCCATCTC TCAGAAGTCG AATCTCTTTC