Introduction to Databases Shifra Ben-Dor Irit Orr Lecture Outline Introduction Data and Database types Database components Data Formats Sample databases How to text search databases What units of information do we deal with in bioinformatics? SNPs Nucleotide sequence Genes mrna AAGTGCCACTGCATAAATGACCATGAGTGGGCACCGGTAAGGGAGGGTGATGCTATCTGGTCTGAAG DNA RNA Protein Sequence Structure Evolution Pathways Interactions Mutations Protein primary sequence Protein 3D structure Protein Function Acts as a tumor suppressor in many tumor types. induces growth arrest or apoptosis depending on the physiological circumstances or cell type, but both activities are involved in tumor suppression. Slide provided by Dr. Vered Caspi Involved in the transport of chloride ions. Defects in CFTR are the cause of cystic fibrosis. It is the most common genetic disease in the caucasian population, with a prevalence of about 1 in 2000 live births. cf, an autosomal recessive disorder, is a common generalized disorder of exocrine gland function
All of these have databases and tools that were created to work with them What do we want from databases? Information retrieval from sequence databases Biological databases contain enormous amounts of data. Databases need to be well annotated. Databases need to be easily searched. Data found in databases should be easily retrieved. Data in databases should be in standard formats. Integrated Information Retrieval Many databases contain logical relations between specific entries. One interface - connecting many biological databases. For example: a database that connects between protein sequence, protein domain, protein structure and reference databases. (Interpro) Another example: Connection between references, protein sequence, DNA sequence, and structure databases. (Entrez) Slide provided by Dr. Vered Caspi
Core Data and Annotation Databases generally have (at least) two types of data: Core data: The data the database was generated to organize Annotation: Extra information that rounds out our picture of the core data For example in a genome database, the sequence is the core data, and the location of genes is the annotation Database Issues Printed journals vs. databases Direct submission to databases (e.g. GenBank, GDB, PDB) Archival vs. curated databases Databases that publish experimental results of large genomic centers. Public vs. private databases. Database scope For Example: Classification of Genomic Databases Information source Information type Many genomes One Genome One Subject One Gene Direct submission from scientific community Scientific literature Genome center s experimental results Other databases Mapping Sequence & annotation Protein structure & function Variations Comparative genomics gene networks Slide provided by Dr. Vered Caspi Database search free text field-specific sequence-based Database output text graphics dynamic User Interface
Data Formats There are many data formats used for sequences (both nucleic and amino acid) Fasta Format GenBank Format EMBL Format GCG Format Simplest format Least information Fasta Format Starts with a > and sequence name on one line The sequence in plain text follows >OB2T2 GTGACAACATGTACAGCTGTGAGCGGTGTAAGAAGCTGCGGAACGGAGTGAAGTACTGCA AAGTCCTGCGGTTGCCCGAGATCCTGTGCATTCACCTAAAGCGCTTTCGGCACGAGGTGA TGTACTCATTCAAGATCAACAGCCACGTCTCCTTGCCCTCGAGGGGCTCGACCTGCGCCC CTTCCTTGCCAAGGAGTGCACATCCCAGATCACCACCTACGACCTCCTCTCGGTCATCTG CCACCACGGCACGGCAGGCA >TNRC_HUMAN P36941 (tumor necrosis factor c receptor) MLLPWATSAPGLAWGPLVLGLFGLLAASQPQAVPPYASENQTCRDQEKEYYEPQHRICCS RCPPGTYVSAKCSRIRDTVCATCAENSYNEHWNYLTICQLCRPCDPVMGLEEIAPCTSKR KTQCRCQPGMFCAAWALECTHCELLSDCPPGTEAELKDEVGKGNNHCVPCKAGHFQNTSS PSARCQPHTRCENQGLVEAAPGTAQSDTTCKNPLEPLPPEMSGTMLMLAVLLPLAFFLLL ATVFSCIWKSHPSLCRKLGSLLKRRPQGEGPNPVAGSWEPPKAHPYFPDLVQPLLPISGD VSPVSTGLPAAPVLEAGVPQQQSPLDLTREPQLEPGEQSQVAHGTNGIHVTGGSMTITGN IYIYNGPVLGGPPGPGDLPATPEPPYPIPEEGDPGPPGLSTPHQEDGKAWHLAETEHCGA TPSNRGPRNQFITHD >TNRC_MOUSE P50284 lymphotoxin-beta receptor precursor MRLPRASSPCGLAWGPLLLGLSGLLVASQPQLVPPYRIENQTCWDQDKEYYEPMHDVCCS RCPPGEFVFAVCSRSQDTVCKTCPHNSYNEHWNHLSTCQLCRPCDIVLGFEEVAPCTSDR KAECRCQPGMSCVYLDNECVHCEEERLVLCQPGTEAEVTDEIMDTDVNCVPCKPGHFQNT SSPRARCQPHTRCEIQGLVEAAPGTSYSDTICKNPPEPGAMLLLAILLSLVLFLLFTTVL ACAWMRHPSLCRKLGTLLKRHPEGEESPPCPAPRADPHFPDLAEPLLPMSGDLSPSPAGP PTAPSLEEVVLQQQSPLVQARELEAEPGEHGQVAHGANGIHVTGGSVTVTGNIYIYNGPV LGGTRGPGDPPAPPEPPYPTPEEGAPGPSELSTPYQEDGKAWHLAETETLGCQDL >TNR1_RAT P22934 tumor necrosis factor receptor 1 precursor (p60) MGLPIVPGLLLSLVLLALLMGIHPSGVTGLVPSLGDREKRDNLCPQGKYAHPKNNSICCT KCHKGTYLVSDCPSPGQETVCEVCDKGTFTASQNHVRQCLSCKTCRKEMFQVEISPCKAD MDTVCGCKKNQFQRYLSETHFQCVDCSPCFNGTVTIPCKEKQNTVCNCHAGFFLSGNECT PCSHCKKNQECMKLCLPPVANVTNPQDSGTAVLLPLVIFLGLCLLFFICISLLCRYPQWR PRVYSIICRDSAPVKEVEGEGIVTKPLTPASIPAFSPNPGFNPTLGFSTTPRFSHPVSST PISPVFGPSNWHNFVPPVREVVPTQGADPLLYGSLNPVPIPAPVRKWEDVVAAQPQRLDT ADPAMLYAVVDGVPPTRWKEFMRLLGLSEHEIERLELQNGRCLREAHYSMLEAWRRRTPR HEATLDVVGRVLCDMNLRGCLENIRETLESPAHSSTTHLPR
Genbank sequence format NM_000394. Homo sapiens crys...[gi:14043059] LOCUS NM_000394 1114 bp mrna PRI 15-MAY-2001 DEFINITION Homo sapiens crystallin, alpha A (CRYAA), mrna. ACCESSION NM_000394 VERSION NM_000394.2 GI:14043059 KEYWORDS. SOURCE human. ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. REFERENCE 1 (bases 1 to 1114) AUTHORS Jaworski,C.J. and Piatigorsky,J. TITLE A pseudo-exon in the functional human alpha A- crystallin gene Genbank sequence format JOURNAL Nature 337 (6209), 752-754 (1989) MEDLINE 89143747 PUBMED 2918909 REFERENCE 2 (bases 1 to 1114) AUTHORS Jaworski,C.J. TITLE A reassessment of mammalian alpha A-crystallin sequences using DNA sequencing: implications for anthropoid affinities of tarsier FEATURES Location/Qualifiers source 1..1114 /organism="homo sapiens" /db_xref="taxon:9606" /chromosome="21" /map="21q22.3" gene 1..1114 /gene="cryaa" /note="crya1" /db_xref="locusid:1409" /db_xref="mim:123580" misc_feature 70..234 /note="crystallin; Region: Alpha crystallin A chain" CDS 70..591 /gene="cryaa" /note="human alphaa-crystallin; crystallin, alpha-1" /codon_start=1 /db_xref="locusid:1409" /db_xref="mim:123580" /product="crystallin, alpha A" /protein_id="np_000385.1" /db_xref="gi:4503055" /translation="mdvtiqhpwfkrtlgpfypsrlfdqffgeglfeydllpfl SSTISPYYRQSLFRTVLDSGISEVRSDRDKFVIFLDVKHFSP EDLTVKVQDDFVEIHGKHNERQDDHGYISREFHRRYRLPS NVDQSALSCSLSADGMLTFCGPKIQTGLDATHAERAIPVSR EEKPTSAPSS" misc_feature 244..555 /note="hsp20; Region: Hsp20/alpha crystallin family" polya_signal 1092..1097
BASE COUNT 183 a 400 c 309 g 222 t ORIGIN 1 acactgcgct gcccagaggc cccgctgact cctgccagcc tccaggtccc cgtggtacca 61 aagctgaaca tggacgtgac catccagcac ccctggttca agcgcaccct ggggcccttc 121 taccccagcc ggctgttcga ccagtttttc ggcgagggcc tttttgagta tgacctgctg 181 cccttcctgt cgtccaccat cagcccctac taccgccagt ccctcttccg caccgtgctg 241 gactccggca tctctgaggt tcgatccgac cgggacaagt tcgtcatctt cctcgatgtg 301 aagcacttct ccccggagga cctcaccgtg aaggtgcagg acgactttgt ggagatccac 361 ggaaagcaca acgagcgcca ggacgaccac ggctacattt cccgtgagtt ccaccgccgc 421 taccgcctgc cgtccaacgt ggaccagtcg gccctctctt gctccctgtc tgccgatggc 481 atgctgacct tctgtggccc caagatccag actggcctgg atgccaccca cgccgagcga 541 gccatccccg tgtcgcggga ggagaagccc acctcggctc cctcgtccta agcaggcatt 601 gcctcggctg gctcccctgc agccctggcc catcatgggg ggagcaccct gagggcgggg 661 tgtctgtctt cctttgcttc ccttttttcc tttccacctt ctcacatgga atgagggttt 721 gagagagcag ccaggagagc ttagggtctc agggtgtccc agaccccgac accggccagt 781 ggcggaagtg accgcacctc acactccttt agatagcagc ctggctcccc tggggtgcag 841 gcgcctcaac tctgctgagg gtccagaagg agggggtgac ctccggccag gtgcctcctg 901 acacacctgc agcctccctc cgcggcgggc cctgcccaca cctcctgggg cgcgtgaggc 961 ccgtggggcc ggggcttctg tgcacctggg ctctcgcggc ctcttctctc agaccgtctt 1021 cctccaaccc ctctatgtag tgccgctctt ggggacatgg gtcgcccatg agagcgcagc 1081 ccgcggcaat caataaacag caggtgatac aagc // Revised: October 24, 2001. EMBL sequence format ID A4279484 standard; DNA; FUN; 581 BP. AC AJ279484; SV AJ279484.1 DT 14-JAN-2000 (Rel. 62, Created) DT 14-JAN-2000 (Rel. 62, Last updated, Version 2) DE Unidentified ascomycota sp. 4/97-9 5.8S rrna gene and ITS 1 and 2 KW 5.8S ribosomal RNA; 5.8S rrna gene; internal transcribed spacer 1; EMBL sequence format KW internal transcribed spacer 2; ITS1; ITS2. OS ascomycota sp. 4/97-9 OC Eukaryota; Fungi; Ascomycota. RN [1] RP 1-581 RA Wirsel S.G.R.; RT ; RL Submitted (21-DEC-1999) to the EMBL/GenBank/DDBJ databases. RL Wirsel S.G.R., Fakultaet fuer Biologie, Universitaet Konstanz, RL Universitaetsstr. 10, Konstanz 78434, Germany. EMBL sequence format RN [2] RA Wirsel S.G.R., Leibinger W., Mendgen K.W.; RT "Genetic diversity of fungi associated with common reed (Phragmites RT australis)"; RL Unpublished. FH Key Location/Qualifiers FH FT source 1..581 FT /db_xref="taxon:112223" FT /organism="ascomycota sp. 4/97-9" FT /isolate="4/97-9"
EMBL sequence format FT misc_feature 64..226 FT /note="internal transcribed spacer 1, ITS1" FT rrna 227..385 FT /gene="5.8s rrna" FT /product="5.8s ribosomal RNA" FT misc_feature 386..529 FT /note="internal transcribed spacer 2, ITS2" SQ Sequence 581 BP; 132 A; 164 C; 145 G; 140 T; 0 other; ccatttagag gaagtaaaag tcgtaacaag gtctccgttg gtgaaccagggagggatc 60 ttacgagagt gtcaccactc ccaacccact gtttacctac ccgtccaccg tgcttcggca 120 ggcagtcctg tgggacaggg cctcgccccc ctccgggggg tgcctgccgc EMBL entry Each line in the entry begins with a two-character line code, which indicates the type of information contained in the line. The currently used line types, along with their respective line codes, are listed below: ID - identification (begins each entry; 1 per entry) AC - accession number (>=1 per entry) SV - sequence version (1 per entry) DT - date (2 per entry) DE - description (>=1 per entry) KW - keyword (>=1 per entry) EMBL entry OS - organism species (>=1 per entry) OC - organism classification (>=1 per entry) OG - organelle (0 or 1 per entry) RN - reference number (>=1 per entry) RC - reference comment (>=0 per entry) RP - reference positions (>=1 per entry) RX - reference cross-reference (>=0 per entry) RA - reference author(s) (>=1 per entry) RT - reference title (>=1 per entry) RL - reference location (>=1 per entry) DR - database cross-reference (>=0 per entry) EMBL entry FH - feature table header (0 or 2 per entry) FT - feature table data (>=0 per entry) CC - comments or notes (>=0 per entry) - spacer line (many per entry) SQ - sequence header (1 per entry) bb - (blanks) sequence data (>=1 per entry) // - termination line (ends each entry; 1 per entry )
GCG Format Has space for comments and space for data, separated by two dots.. Can contain full sequence data like GenBank or EMBL Has a minimum of sequence name, length, date, type (nucleic or amino acid) and checksum!!na_sequence 1.0 5B3.seq Length: 744 March 18, 1999 10:43 Type: N Check: 2586.. 1 TCTAGAGGAG AYATYGTWAT GACCCAGTCT CCATCCTCCC TGAGTGTGTC 51 AGCAGGAGAG AAGGTCACTA TGAGCTGCAA GTCCAGTCAG AGTCTGTTAA 101 ACAGTAGAAA TCAAAAGAAC TACTTGGCCT GGTACCAGCA GAAACCAGGA 151 CAGCCTCCTA AACTTTTGAT CTACGGGGTA TTTATTAGGG ATTCTGGGGT 201 CCCTGATCGC TTCACAGGCA GTGGATCTGG AACCGATTTC ACTCTTACCA 251 TCAGCAGTGT GCAGGCTGAA GACCTGGCAG TTTATTACTG TCAGAATGAT 301 CATATTTATC CGTACACGTT CGGAGGGGGC ACWAAGCTGG AAATTAAAGG 351 GTCGACTTCC GGTAGCGGCA AATCCTCTGA AGGCAAAGGT SAGGTSCAGC 401 TGCAGGAGTC TGGACCTGGC CTGGTGAAGC CTTCCCAGTC TCTGTCCCTC 451 ACCTGCTCTG TCACTGGTTA CTCAATCACC AGTGGTTATG CCTGGAACTG 501 GATCCGGCAG TTTCCAGGAA ACAAACTGGA GTGGATGGGC TACATAAGCT 551 ACAGTGGTTT CACTAGCTAC AACCCATCTC TCAGAAGTCG AATCTCTTTC