Module 10: Bioinformatics

1.) Goal: To understand the general approaches for basic in silico (computer) analysis of DNA and protein sequences. We will discuss the sequence formatting required prior to analysis, DNA restriction mapping, translation of DNA to find protein-coding regions (open reading frames, ORFs), protein sequence analysis, sequence comparison, and database searching.

2.) Introduction
DNA sequencing has become easy, fast and cheap. The elucidation of the complete human genome sequence (about 3 x 10^9 base pairs) was only possible because of these technical advances. Protein sequencing is also possible, but it is technically much harder and slower. Because a DNA sequence predicts the encoded protein sequence through the rules of the genetic code, we can sequence a piece of DNA and deduce its encoded protein sequence instead of painstakingly purifying and then sequencing the corresponding protein. Given the large size of some of the genomes sequenced to date, powerful in silico approaches must go hand in hand with wet lab procedures.

3.) Background information
- Sequence input files can be generated in Word, or copied from other source documents, websites, etc.
- Formatting: for some applications, files first have to be converted into a specific format (FASTA being a popular choice). Non-standard characters must be removed from the files (example: DNA sequence files can only contain the characters G, A, T and C).
- Many free sequence analysis programs are available online; you simply copy and paste your sequence(s) into a browser window and run the analysis. Remember that in some cases the correct file format must be used.
- Tip: when working with sequence files in Word, use a monospaced font such as Courier, in which every character takes up the same amount of space, so that sequences line up properly.

4.) Steps in an in silico exercise
Note: different exercises may use a different set of steps.
1. Open the sequence file supplied in Word.
2. Check for the absence of non-standard characters and convert the file into the appropriate format.
3. Copy the sequence.
4. Open a browser window with the particular application.
5. Paste the sequence into the window.
6. Choose the specific analysis parameters.
7. Run the analysis.
8. Copy the results and files out of the browser window and paste/save them back into the original Word file.

5.) Materials supplied: general plasmid map (Appendix A), related DHFR protein sequences for sequence alignment (Appendix B), complete plasmid sequence with the Bacillus thermophilus DHFR (Appendix C)

6.) Boyer book chapter: #2
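The character check and FASTA conversion described in sections 3 and 4 can also be scripted. The following is a minimal Python sketch, not part of the course tools: the function names `clean_sequence` and `to_fasta` and the record name `plasmid_fragment` are made up for illustration, and the 60-character line width is a common convention rather than a requirement.

```python
def clean_sequence(raw, alphabet="GATC"):
    """Keep only characters from `alphabet`; drop digits, spaces, line breaks.

    Note: anything outside the alphabet is silently removed, so ambiguity
    codes such as N would be lost. Pass a wider alphabet if you need them.
    """
    allowed = set(alphabet.upper())
    return "".join(ch for ch in raw.upper() if ch in allowed)


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record: a '>' header line, then wrapped lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Text pasted from a document often carries position numbers and spaces.
pasted = "1 tggcgaatgg gacgcgccct 21\n"
dna = clean_sequence(pasted)
print(len(dna), "bp")
print(to_fasta("plasmid_fragment", dna))
```

The same `clean_sequence` call works for protein sequences if the 20 one-letter amino acid codes are passed as the alphabet.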
7.) Basic examples of things that can be done with in silico analyses

With a DNA sequence:
- sequence generation: type or copy-paste in Word format
- sequence formatting, sequence length: http://www.ebi.ac.uk/tools/sfc/emboss_seqret/
- restriction digestion and mapping (linear vs. circular maps): http://www.restrictionmapper.org/
- translation of open reading frames (ORFs): http://web.expasy.org/translate/

With a protein sequence:
- determine amino acid (AA) sequence length, AA composition, molecular weight, pI, and molar extinction coefficient: http://www.ebi.ac.uk/tools/seqstats/emboss_pepstats/

DNA or protein sequence comparison (these programs can be used to compare complete sequences to establish evolutionary relationships or to find single point mutations):
- Pairwise DNA alignment: http://www.ebi.ac.uk/tools/psa/emboss_needle/nucleotide.html
- Pairwise protein alignment: http://www.ebi.ac.uk/tools/services/web/toolform.ebi?tool=emboss_needle&context=protein
- Multiple sequence alignment: http://www.ebi.ac.uk/tools/msa/clustalo/

Sequence databases:
- DNA and proteins @ PubMed: http://www.ncbi.nlm.nih.gov/pubmed
- Also: www.uniprot.org (well-curated protein database; can run BLAST searches and other alignments)

8.) Protocol: Do the following:
1. Convert the complete plasmid sequence in Appendix C to GCG and EMBL format, and indicate the length of the plasmid DNA in bp. Include properly labeled copies of both formats in your report.
2. Perform restriction mapping for the complete plasmid sequence supplied in Appendix C, using the restriction enzymes NdeI and BamHI. Show the result table in your report.
3. In your report, show the translation of all 6 reading frames and indicate the frame with the DHFR ORF (open reading frame). The Bacillus thermophilus ORF starts with MISHI. Show the amino acid sequence of the complete B. thermophilus ORF in your report.
4. Protein analysis: use the B. thermophilus DHFR protein sequence. In your report, include only the molecular weight, number of amino acids, pI, and molar extinction coefficient.
5. Sequence comparison: use the B. thermophilus DHFR protein sequence from above and align it, one at a time, to each of the three DHFR protein sequences supplied. Show the sequence alignment and % identity for all three alignments (B. thermophilus with human; B. thermophilus with Bacillus amyloliquefaciens; B. thermophilus with Geobacillus thermodenitrificans) in your report.
6. Sequence comparison: align all four DHFR protein sequences. Show the sequence alignment in your report.

Hand in via e-mail, as a Word document, one per group.
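As a cross-check on the web tool used for protocol step 3, six-frame translation can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the standard genetic code, the function names (`translate`, `six_frames`, `longest_orf`) are made up here, and the short test fragment is the stretch of Appendix C that starts at the ATG inside the NdeI site.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (upper case)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate the three forward and three reverse-complement frames."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def longest_orf(protein):
    """Longest stretch that starts at the first M of a stop-free segment."""
    best = ""
    for segment in protein.split("*"):
        start = segment.find("M")
        if start != -1 and len(segment) - start > len(best):
            best = segment[start:]
    return best


orf_start = "atgatttcgcacattgtggcaatggatgaaaaccgg"  # from Appendix C
print(translate(orf_start))  # MISHIVAMDENR
```

For the full exercise, the cleaned plasmid sequence would be passed to `six_frames`, and `longest_orf` applied to each frame to locate the one that begins with MISHI.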
9.) Materials

Appendix A: Plasmid Map

Appendix B: DHFR sequences to be used for sequence alignment

This is the sequence for human DHFR:
  1 mvgslnciva vsqnmgigkn gdlpwpplrn efryfqrmtt tssvegkqnl vimgkktwfs
 61 ipeknrplkg rinlvlsrel keppqgahfl srslddalkl teqpelankv dmvwivggss
121 vykeamnhpg hlklfvtrim qdfesdtffp eidlekykll peypgvlsdv qeekgikykf
181 evyeknd

This is the DHFR sequence from Bacillus amyloliquefaciens:
  1 misfifamde nrligkdndl pwhlpddlay fkkvttghti vmgrktfesi grplpnrrni
 61 vvtsrdeslf pgcitadsae evlklippde ecfviggaql ysalfpyadr lymtkihhvf
121 egdrffpefn eaeweltsrk qgvkdeknpy dyeylvyekk n

This is the DHFR sequence from Geobacillus thermodenitrificans:
  1 mnmtilkssv mtlirrlkrq wrckgektmi shivamdenr vigkdnqlpw hlpadlayfk
 61 rvtmghaivm grktfeaigr plpgrdnvvv trnpqfrpeg clvlhsleev kqwiaargee
121 vfiiggaelf katmpiadrl yvtnifasfp gdtfyppise kewkvvsytp gvkdeknpye
181 hafliyerk
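To make the idea behind a pairwise alignment and its % identity concrete, here is a small Python sketch of global (Needleman-Wunsch) alignment. It is a teaching sketch under simplifying assumptions: it scores +1 for a match, -1 for a mismatch and -2 per gap position, whereas EMBOSS Needle by default uses a substitution matrix (EBLOSUM62 for proteins) and affine gap penalties, so results for real sequences can differ from the web tool. The function names are made up for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return the two gapped strings."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical positions divided by alignment length, as a percentage."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


# Residues 1-10 of B. amyloliquefaciens vs. residues 29-38 of
# G. thermodenitrificans, both taken from Appendix B.
top, bottom = needleman_wunsch("MISFIFAMDE", "MISHIVAMDE")
print(top)
print(bottom)
print(round(percent_identity(top, bottom), 1), "% identity")  # 80.0
```

Dividing by the full alignment length (gaps included) is one common convention for % identity; other tools may divide by the length of the shorter sequence instead.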
Appendix C: Complete Plasmid Sequence with the Bacillus thermophilus DHFR

The DHFR ORF is situated between position 5205 (NdeI) and position 5699 (BamHI).

tggcgaatgggacgcgccctgtagcggcgcattaagcgcggcgggtgtgg tggttacgcgcagcgtgaccgctacacttgccagcgccctagcgcccgct cctttcgctttcttcccttcctttctcgccacgttcgccggctttccccg tcaagctctaaatcgggggctccctttagggttccgatttagtgctttac ggcacctcgaccccaaaaaacttgattagggtgatggttcacgtagtggg ccatcgccctgatagacggtttttcgccctttgacgttggagtccacgtt ctttaatagtggactcttgttccaaactggaacaacactcaaccctatct cggtctattcttttgatttataagggattttgccgatttcggcctattgg ttaaaaaatgagctgatttaacaaaaatttaacgcgaattttaacaaaat attaacgtttacaatttcaggtggcacttttcggggaaatgtgcgcggaa cccctatttgtttatttttctaaatacattcaaatatgtatccgctcatg agacaataaccctgataaatgcttcaataatattgaaaaaggaagagtat gagtattcaacatttccgtgtcgcccttattcccttttttgcggcatttt gccttcctgtttttgctcacccagaaacgctggtgaaagtaaaagatgct gaagatcagttgggtgcacgagtgggttacatcgaactggatctcaacag cggtaagatccttgagagttttcgccccgaagaacgttttccaatgatga gcacttttaaagttctgctatgtggcgcggtattatcccgtattgacgcc gggcaagagcaactcggtcgccgcatacactattctcagaatgacttggt tgagtactcaccagtcacagaaaagcatcttacggatggcatgacagtaa gagaattatgcagtgctgccataaccatgagtgataacactgcggccaac ttacttctgacaacgatcggaggaccgaaggagctaaccgcttttttgca caacatgggggatcatgtaactcgccttgatcgttgggaaccggagctga atgaagccataccaaacgacgagcgtgacaccacgatgcctgcagcaatg gcaacaacgttgcgcaaactattaactggcgaactacttactctagcttc ccggcaacaattaatagactggatggaggcggataaagttgcaggaccac ttctgcgctcggcccttccggctggctggtttattgctgataaatctgga gccggtgagcgtgggtctcgcggtatcattgcagcactggggccagatgg taagccctcccgtatcgtagttatctacacgacggggagtcaggcaacta tggatgaacgaaatagacagatcgctgagataggtgcctcactgattaag cattggtaactgtcagaccaagtttactcatatatactttagattgattt aaaacttcatttttaatttaaaaggatctaggtgaagatcctttttgata atctcatgaccaaaatcccttaacgtgagttttcgttccactgagcgtca gaccccgtagaaaagatcaaaggatcttcttgagatcctttttttctgcg cgtaatctgctgcttgcaaacaaaaaaaccaccgctaccagcggtggttt gtttgccggatcaagagctaccaactctttttccgaaggtaactggcttc agcagagcgcagataccaaatactgtccttctagtgtagccgtagttagg
ccaccacttcaagaactctgtagcaccgcctacatacctcgctctgctaa tcctgttaccagtggctgctgccagtggcgataagtcgtgtcttaccggg ttggactcaagacgatagttaccggataaggcgcagcggtcgggctgaac ggggggttcgtgcacacagcccagcttggagcgaacgacctacaccgaac tgagatacctacagcgtgagctatgagaaagcgccacgcttcccgaaggg agaaaggcggacaggtatccggtaagcggcagggtcggaacaggagagcg cacgagggagcttccagggggaaacgcctggtatctttatagtcctgtcg ggtttcgccacctctgacttgagcgtcgatttttgtgatgctcgtcaggg gggcggagcctatggaaaaacgccagcaacgcggcctttttacggttcct ggccttttgctggccttttgctcacatgttctttcctgcgttatcccctg attctgtggataaccgtattaccgcctttgagtgagctgataccgctcgc cgcagccgaacgaccgagcgcagcgagtcagtgagcgaggaagcggaaga gcgcctgatgcggtattttctccttacgcatctgtgcggtatttcacacc gcatatatggtgcactctcagtacaatctgctctgatgccgcatagttaa gccagtatacactccgctatcgctacgtgactgggtcatggctgcgcccc gacacccgccaacacccgctgacgcgccctgacgggcttgtctgctcccg gcatccgcttacagacaagctgtgaccgtctccgggagctgcatgtgtca gaggttttcaccgtcatcaccgaaacgcgcgaggcagctgcggtaaagct catcagcgtggtcgtgaagcgattcacagatgtctgcctgttcatccgcg tccagctcgttgagtttctccagaagcgttaatgtctggcttctgataaa gcgggccatgttaagggcggttttttcctgtttggtcactgatgcctccg tgtaagggggatttctgttcatgggggtaatgataccgatgaaacgagag aggatgctcacgatacgggttactgatgatgaacatgcccggttactgga acgttgtgagggtaaacaactggcggtatggatgcggcgggaccagagaa aaatcactcagggtcaatgccagcgcttcgttaatacagatgtaggtgtt ccacagggtagccagcagcatcctgcgatgcagatccggaacataatggt gcagggcgctgacttccgcgtttccagactttacgaaacacggaaaccga agaccattcatgttgttgctcaggtcgcagacgttttgcagcagcagtcg cttcacgttcgctcgcgtatcggtgattcattctgctaaccagtaaggca accccgccagcctagccgggtcctcaacgacaggagcacgatcatgcgca cccgtggggccgccatgccggcgataatggcctgcttctcgccgaaacgt ttggtggcgggaccagtgacgaaggcttgagcgagggcgtgcaagattcc gaataccgcaagcgacaggccgatcatcgtcgcgctccagcgaaagcggt cctcgccgaaaatgacccagagcgctgccggcacctgtcctacgagttgc
atgataaagaagacagtcataagtgcggcgacgatagtcatgccccgcgc ccaccggaaggagctgactgggttgaaggctctcaagggcatcggtcgag atcccggtgcctaatgagtgagctaacttacattaattgcgttgcgctca ctgcccgctttccagtcgggaaacctgtcgtgccagctgcattaatgaat cggccaacgcgcggggagaggcggtttgcgtattgggcgccagggtggtt tttcttttcaccagtgagacgggcaacagctgattgcccttcaccgcctg gccctgagagagttgcagcaagcggtccacgctggtttgccccagcaggc gaaaatcctgtttgatggtggttaacggcgggatataacatgagctgtct tcggtatcgtcgtatcccactaccgagatatccgcaccaacgcgcagccc ggactcggtaatggcgcgcattgcgcccagcgccatctgatcgttggcaa ccagcatcgcagtgggaacgatgccctcattcagcatttgcatggtttgt tgaaaaccggacatggcactccagtcgccttcccgttccgctatcggctg aatttgattgcgagtgagatatttatgccagccagccagacgcagacgcg ccgagacagaacttaatgggcccgctaacagcgcgatttgctggtgaccc aatgcgaccagatgctccacgcccagtcgcgtaccgtcttcatgggagaa aataatactgttgatgggtgtctggtcagagacatcaagaaataacgccg gaacattagtgcaggcagcttccacagcaatggcatcctggtcatccagc ggatagttaatgatcagcccactgacgcgttgcgcgagaagattgtgcac cgccgctttacaggcttcgacgccgcttcgttctaccatcgacaccacca cgctggcacccagttgatcggcgcgagatttaatcgccgcgacaatttgc gacggcgcgtgcagggccagactggaggtggcaacgccaatcagcaacga ctgtttgcccgccagttgttgtgccacgcggttgggaatgtaattcagct ccgccatcgccgcttccactttttcccgcgttttcgcagaaacgtggctg gcctggttcaccacgcgggaaacggtctgataagagacaccggcatactc tgcgacatcgtataacgttactggtttcacattcaccaccctgaattgac tctcttccgggcgctatcatgccataccgcgaaaggttttgcgccattcg atggtgtccgggatctcgacgctctcccttatgcgactcctgcattagga agcagcccagtagtaggttgaggccgttgagcaccgccgccgcaaggaat ggtgcatgcaaggagatggcgcccaacagtcccccggccacggggcctgc caccatacccacgccgaaacaagcgctcatgagcccgaagtggcgagccc gatcttccccatcggtgatgtcggcgatataggcgccagcaaccgcacct gtggcgccggtgatgccggccacgatgcgtccggcgtagaggatcgagat ctcgatcccgcgaaattaatacgactcactataggggaattgtgagcgga taacaattcccctctagaaataattttgtttaactttaagaaggagatat acatatgatttcgcacattgtggcaatggatgaaaaccgggtgatcggca aagacaaccgcttgccttggcatttgccggccgatttggcgtattttaaa cgggtgacaatgggccatgccatcgtgatggggcgcaagacgtttgaagc gatcggccggccgcttcccggccgcgataacgtcgttgtcacgcgcaacc gctcgtttcgtccggaaggctgccttgtgcttcattcgctcgaggaagtc
aagcaatggatcgcatcgcgcgctgatgaagtgtttatcatcggcggggc cgaactgtttcgggcgacgatgccgattgtcgaccggctgtatgtgacaa aaatttttgcttccttccccggcgatacgttttatccgcccatttctgac gatgaatgggaaatcgtttcctatacgccaggagggaaagatgaaaagaa tccgtatgaacacgcctttatcatttatgagcggaaaaaggcgaaataat GGATCCgaattcgagctccgtcgacaagcttgcggccgcactcgagcacc accaccaccaccactgagatccggctgctaacaaagcccgaaaggaagct gagttggctgctgccaccgctgagcaataactagcataaccccttggggc ctctaaacgggtcttgaggggttttttgctgaaaggaggaactatatccg gat
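
The restriction mapping in protocol step 2 comes down to locating the recognition sequences of NdeI (CATATG) and BamHI (GGATCC) in the plasmid. The Python sketch below illustrates this under stated simplifications: it searches only the top strand (sufficient here because both sites are palindromic), it reports the first base of each recognition site rather than the exact cut position within the site, and the function names and the short demo fragment (pieces of Appendix C around the two sites, joined together) are made up for illustration.

```python
def find_sites(dna, site, circular=True):
    """Return 1-based start positions of `site` in `dna` (top strand only)."""
    dna, site = dna.upper(), site.upper()
    # For a circular molecule, also catch sites that span the origin.
    search = dna + dna[:len(site) - 1] if circular else dna
    positions = []
    start = search.find(site)
    while start != -1:
        positions.append(start + 1)
        start = search.find(site, start + 1)
    return positions


def fragment_lengths(cut_positions, total_length, circular=True):
    """Fragment sizes if the molecule is cut just before each position."""
    cuts = sorted(cut_positions)
    if not cuts:
        return [total_length]
    inner = [b - a for a, b in zip(cuts, cuts[1:])]
    if circular:
        # The last and first cut enclose the fragment that spans the origin.
        return inner + [total_length - cuts[-1] + cuts[0]]
    return [cuts[0] - 1] + inner + [total_length - cuts[-1] + 1]


ENZYMES = {"NdeI": "CATATG", "BamHI": "GGATCC"}

# Demo fragment: two short pieces of Appendix C joined for illustration.
demo = "gatatacatatgatttcg" + "aaataatGGATCCgaattc"
cuts = []
for name, site in ENZYMES.items():
    found = find_sites(demo, site, circular=False)
    print(name, found)
    cuts.extend(found)
print(fragment_lengths(cuts, len(demo), circular=False))  # [6, 19, 12]
```

A circular plasmid cut at two sites yields two fragments, while the same sequence treated as linear yields three, which is why the mapping tool asks whether the map is linear or circular.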