Sequence information - lectures
* Pairwise alignment
* Alignments in database searches
* Multiple alignments
* Profiles
* Patterns
* RNA secondary structure / Transformational grammars
* Genome organisation / Gene prediction / HMMs
* Phylogeny
Dates in the history of sequence information

Experimental
    1953   Amino acid sequence of insulin
    1965   Base sequence of a tRNA molecule (75 bases)
    1976   Sequence of MS2 RNA (3569 bases)
    1970s  Birth of cloning
    1977   DNA sequencing (Sanger, F.)
    1978   Bacteriophage fd genome (6408 bases)
    1995   Genomes of H. influenzae and M. genitalium (1.8 / 0.6 million bases)
    2000   >95% of the human genome (3000 million bases)

Sequence analysis
    1970   Global alignment (Needleman-Wunsch)
    1975   PAM matrix (Dayhoff)
    1981   Local alignment (Smith-Waterman)
    1982   EMBL database
    1985   FASTA
    1985   Multiple sequence alignment, first attempts
    1988   Neural networks (protein secondary structure)
    1989   Clustal
    1989   Profilesearch
    1990   BLAST
    1994   Profile HMMs
    1994   Context-free grammars (RNA secondary structure)
    1997   PSI-BLAST
Sequences - The analysis of raw data from the laboratory
* Base-calling (sequence from raw data): Phred
* Sequence assembly: Phrap, CAP3 / CAP4
* Cleaning up sequences: quality filtering, vector filtering
Sequence formats / ASCII-Binary formats

In ASCII text format (= human readable) each character is stored as one byte. For instance, the ASCII code of 'A' is 65 as a decimal number, or 01000001 as a binary number.

However, sequence data is often stored in a binary format. For instance, the four bases may be stored with two bits each:

    A = 00    T = 01    C = 10    G = 11

In this way there is room for 4 bases in one byte (= 8 bits).

Typically the bioinformatician downloads databases in ASCII format and then reformats them into a binary format for use with different sequence analysis tools. One example is databases for BLAST searches, which may be formatted with the NCBI utility 'formatdb'.
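The 2-bit idea can be tried out with a small Python sketch. It assumes the code assignment shown on the slide (A = 00, T = 01, C = 10, G = 11); the helper names pack and unpack are made up for illustration, and real tools do not necessarily use this exact layout.

```python
# 2-bit codes as on the slide: A=00, T=01, C=10, G=11 (illustrative only).
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BASE = {value: base for base, value in CODE.items()}

def pack(seq):
    """Pack a DNA string into bytes, four bases per byte (hypothetical helper)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].upper()
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpack() can read from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack(data, length):
    """Recover `length` bases; the sequence length has to be stored separately."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])

seq = "ATCGGA"
packed = pack(seq)
print(len(seq), "bases stored in", len(packed), "bytes")
print(unpack(packed, len(seq)))
```

Note that two bits leave no room for ambiguity codes such as N, so a real binary format needs some extra bookkeeping for those.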
How do you know if a file contains ASCII text or has a binary format?

Using cat, more, or less on a binary file, for instance

    % cat /usr/local/bin/less

produces unreadable material, like:

    ^?ELF^A^B^A^@^@^@^@^@^@^@^@^@^@^B^^@^@^@^A^@@:\260^@^@^@4^@^Az0^@^@^@^D^@4^@
    ^@ ^G^@(^@^Q^@^P^@^@^@^F^@^@^@4^@@^@4^@@^@4^@^@^@\340^@^@^@^@^@^@^@^D^@^@^@^D^@^@^@
    ^C^@^@^A ^@@^A ^@@^A ^@^@^@^S^@^@^@^S^@^@^@^D^@^@^@^Dp^@^@^@^@^@^A@^@@^A@^@@^A@
    ^@^@^@^X^@^@^@^X^@^@^@^D^@^@^@^D^@^@^@^B^@^@^A\200^@@^A\200^@@^A\200^@^@00^@^@00
    ^@^@^@^D^@^@^@^Pp^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^D
    ^@^@^@^A^@^@^@^@^@@^@^@^@@^@^@^@^A@^@^@^A@^@^@^@^@^E^@^A^@^@^@^@^@^A^@^A@^@^P^@
    ^@^@^P^@^@^@^@^@ ^@^@^@;\200^@^@^@^F^@^A^

The unix utility file will attempt to determine the file type, for instance

    % file README.txt
    ascii text
    % file /usr/local/bin/less
    ELF 32-bit MSB dynamic executable MIPS - version 1
    % file data.z
    compressed data
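When the file utility is not at hand, a rough check can be done in Python. This is only a heuristic sketch, not what file actually does: it assumes that a NUL byte or any non-ASCII byte in the first few kilobytes means "binary". The function name looks_binary and the sample size are made up for illustration.

```python
def looks_binary(path, sample_size=8192):
    """Heuristic guess (hypothetical helper): True if the file looks binary."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    if b"\x00" in sample:
        # NUL bytes practically never occur in plain text files.
        return True
    try:
        sample.decode("ascii")   # pure 7-bit ASCII decodes without error
    except UnicodeDecodeError:
        return True
    return False

if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        print(name, "binary" if looks_binary(name) else "ASCII text")
```

A text file in another encoding (for example UTF-8 with accented characters) would be reported as binary by this simple test.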
Sequence formats
Examples

1. EMBL

    ID   LISOD      standard; DNA; PRO; 756 BP.
    XX
    AC   X64011; S78972;
    XX
    SV   X64011.1
    XX
    DT   28-APR-1992 (Rel. 31, Created)
    DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
    XX
    DE   L.ivanovii sod gene for superoxide dismutase
    ......
    SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
         cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
         gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
         ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
    ......
2. Fasta

    >LISOD L.ivanovii sod gene for superoxide dismutase
    cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
    gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa
    ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg
    ......

3. GCG

    lisod.seq  Length: 756  October 27, 2000 13:17  Type: N  Check: 5188  ..
    ......
           1  cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc
          51  gccttacaat gtaatttctt ttcacataaa taataaacaa tccgaggagg
         101  aatttttaat gacttacgaa ttaccaaaat taccttatac ttatgatgct
Multiple sequence format of GCG

    1.msf  MSF: 44  Type: P  October 24, 2002 12:41  Check: 9117 ..

     Name: ftsy_bucai  Len: 44  Check: 4221  Weight: 1.00
     Name: ftsy_ecoli  Len: 44  Check: 2326  Weight: 1.00
     Name: ftsy_aquae  Len: 44  Check: 2339  Weight: 1.00
     Name: ftsy_bacsu  Len: 44  Check: 6177  Weight: 1.00
     Name: sr54_aciam  Len: 44  Check: 6296  Weight: 1.00
     Name: sr54_aerpe  Len: 44  Check: 7291  Weight: 1.00
     Name: sr54_arcfu  Len: 44  Check:  122  Weight: 1.00
     Name: sr54_aquae  Len: 44  Check:  345  Weight: 1.00

    //

                1                                             44
    ftsy_bucai  KNS.EKLYFL LKRKMFNILK KVEIP...LE ISSHSPFVIL VVGV
    ftsy_ecoli  RDA.EALYGL LKEEMGEILA KVDEP...LN VEGKAPFVIL MVGV
    ftsy_aquae  KEG.EKIKEL LKKELKELLK NCQ...GELK IPEKVGAVLL FVGV
    ftsy_bacsu  QDP.KEVQSV ISEKLVEIYN SGDEQISELN IQDGRLNVIL LVGV
    sr54_aciam  PTYIERREWF IKIVYDELSN LFGGDKEPKV IPDKIPYVIM LVGV
    sr54_aerpe  PPGVTRRDWM IKIVYEELVK LFGGDQEPQV DPPKTPWIVL LVGV
    sr54_arcfu  LPALNAKEQI LKIVYEELLR GVGEGLEIPL KKAK....IM LVGL
    sr54_aquae  PKNLSPAEFV IKTVYEELVD ILGGEK.... .ADLKKGTVL FVGL
FASTA multiple sequence format

    >ftsy_bucai, 44 bases, 52FF1180 checksum.
    KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
    >ftsy_ecoli, 44 bases, B14506B6 checksum.
    RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
    >ftsy_aquae, 44 bases, 27571BEE checksum.
    KEG-EKIKELLKKELKELLKNCQ---GELKIPEKVGAVLLFVGV
    >ftsy_bacsu, 44 bases, FA023A4F checksum.
    QDP-KEVQSVISEKLVEIYNSGDEQISELNIQDGRLNVILLVGV
    >sr54_aciam, 44 bases, 2FC13632 checksum.
    PTYIERREWFIKIVYDELSNLFGGDKEPKVIPDKIPYVIMLVGV
    >sr54_aerpe, 44 bases, 37AFB895 checksum.
    PPGVTRRDWMIKIVYEELVKLFGGDQEPQVDPPKTPWIVLLVGV
    >sr54_arcfu, 44 bases, 8294461 checksum.
    LPALNAKEQILKIVYEELLRGVGEGLEIPLKKAK----IMLVGL
    >sr54_aquae, 44 bases, 794768A2 checksum.
    PKNLSPAEFVIKTVYEELVDILGGEK-----ADLKKGTVLFVGL
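Because the FASTA format is so simple (a header line starting with '>' followed by sequence lines), it is easy to read with a few lines of code. The following Python sketch is one possible way to do it; the function name read_fasta is made up for illustration, and the two example records are copied from the alignment above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Remove the blanks that some programs insert every 10 residues.
            chunks.append(line.replace(" ", ""))
    if header is not None:
        yield header, "".join(chunks)

example = """\
>ftsy_bucai, 44 bases, 52FF1180 checksum.
KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
>ftsy_ecoli, 44 bases, B14506B6 checksum.
RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
"""

records = list(read_fasta(example.splitlines()))
for header, seq in records:
    print(header.split(",")[0], len(seq))
```

For real files the same function can be given an open file handle instead of a list of lines.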
CLUSTAL multiple sequence format

    CLUSTAL W (1.81) multiple sequence alignment

    ftsy_bucai      KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
    ftsy_ecoli      RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
    ftsy_aquae      KEG-EKIKELLKKELKELLKNCQ---GELKIPEKVGAVLLFVGV
    ftsy_bacsu      QDP-KEVQSVISEKLVEIYNSGDEQISELNIQDGRLNVILLVGV
    sr54_aciam      PTYIERREWFIKIVYDELSNLFGGDKEPKVIPDKIPYVIMLVGV
    sr54_aerpe      PPGVTRRDWMIKIVYEELVKLFGGDQEPQVDPPKTPWIVLLVGV
    sr54_arcfu      LPALNAKEQILKIVYEELLRGVGEGLEIPLKKAK----IMLVGL
    sr54_aquae      PKNLSPAEFVIKTVYEELVDILGGEK-----ADLKKGTVLFVGL
                             .:.    ::                    ::.**:
Phylip multiple sequence format

     8 44
    ftsy_bucai KNS-EKLYFL LKRKMFNILK KVEIP---LE ISSHSPFVIL VVGV
    ftsy_ecoli RDA-EALYGL LKEEMGEILA KVDEP---LN VEGKAPFVIL MVGV
    ftsy_aquae KEG-EKIKEL LKKELKELLK NCQ---GELK IPEKVGAVLL FVGV
    ftsy_bacsu QDP-KEVQSV ISEKLVEIYN SGDEQISELN IQDGRLNVIL LVGV
    sr54_aciam PTYIERREWF IKIVYDELSN LFGGDKEPKV IPDKIPYVIM LVGV
    sr54_aerpe PPGVTRRDWM IKIVYEELVK LFGGDQEPQV DPPKTPWIVL LVGV
    sr54_arcfu LPALNAKEQI LKIVYEELLR GVGEGLEIPL KKAK----IM LVGL
    sr54_aquae PKNLSPAEFV IKTVYEELVD ILGGEK---- -ADLKKGTVL FVGL
Sequence editors SeqLab / GCG
Genedoc
Conversion of sequence formats - Readseq
(one useful version of readseq is part of the SAM package, http://www.cse.ucsc.edu/research/compbio/sam2src/)

Readseq can convert between the following formats:

     1. IG/Stanford         10. Olsen (in-only)
     2. GenBank/GB          11. Phylip3.2
     3. NBRF                12. Phylip
     4. EMBL                13. Plain/Raw
     5. GCG                 14. PIR/CODATA
     6. DNAStrider          15. MSF
     7. Fitch               16. ASN.1
     8. Pearson/Fasta       17. PAUP/NEXUS
     9. Zuker (in-only)     18. Pretty (out-only)
GCG package utilities:

                      Convert from    to
    * Reformat        text            GCG
    * FromEMBL        EMBL            GCG
    * FromGenBank     GenBank         GCG
    * FromFasta       Fasta           GCG
    * ToFastA         GCG             Fasta

Clustal:

    clustalw my_alignment -convert -output=gcg
    clustalw my_alignment -convert -output=phylip
Downloading bioinformatics data and software for local use
Data and software downloaded from the net are typically compressed in one way or another. The most common formats are *.Z and *.gz. These are binary files that cannot be read with normal editors.

To uncompress a *.Z file:

    % uncompress [file]

To uncompress a *.gz file:

    % gunzip [file]

A set of files is often downloaded as a compressed tar archive. For instance, if you download a file called dna.tar.gz you can do

    % gunzip dna.tar.gz

This will yield a file called dna.tar. Then do

    % tar -xvf dna.tar

This will unpack all files contained in dna.tar.
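The same two steps can also be done from a script. The following Python sketch uses the standard library modules gzip and tarfile; the function names gunzip and unpack_tar_gz are made up for illustration. The old *.Z (compress) format is not covered here, since the Python standard library has no module for it.

```python
import gzip
import shutil
import tarfile

def gunzip(src, dst):
    """Write the decompressed contents of a *.gz file to dst."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)

def unpack_tar_gz(archive, dest="."):
    """Decompress and unpack a *.tar.gz archive in one step; return member names."""
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)   # only unpack archives from sources you trust
    return names

if __name__ == "__main__":
    import sys
    for archive in sys.argv[1:]:
        for name in unpack_tar_gz(archive):
            print(name)
```

Called as a script with dna.tar.gz as argument, this would list and unpack the members, much like gunzip followed by tar -xvf.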
Alignments and database searches

Common biological problem: We have a novel protein sequence. What can we infer from this sequence about the biological function of the protein?
* Pattern search - PROSITE
* Profile search - Pfam
* Prediction of transmembrane domains (~25% of all proteins are membrane bound!)
* Sequence homology - BLAST, FASTA, SSEARCH

Simple example: an unknown human protein is highly similar in sequence to a protein of known function from another organism => the human protein probably has the same function (it is an ortholog or a paralog).
BLAST
1. Heuristic step where short word matches are identified
2. The matches are extended using a dynamic programming method, as for local pairwise alignments (substitution matrix, gap penalties)

- The nature of protein sequence and structure evolution - what is the sensitivity of BLAST searches?
- General principles of database searches
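Step 1 can be illustrated with a strongly simplified Python sketch. It only reports exact word matches of length w between a query and one database sequence; the real BLAST program is more elaborate (for proteins it also accepts similar words that score above a threshold with the substitution matrix). The function name word_hits, the default w=3, and the two test sequences are assumptions made for illustration.

```python
from collections import defaultdict

def word_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a word of length w matches exactly."""
    # Index every word of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    # Scan the subject and look each of its words up in the index.
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits

query = "MAKLEKLNQAGLMVAG"
subject = "MVRLEKINQAGLLVAG"
for i, j in word_hits(query, subject):
    print(i, j, query[i:i + 3])
```

Each hit would then serve as a seed for the extension in step 2; sequences without any hit are never aligned, which is where the speed of the heuristic comes from.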
Evolution of protein genes

Percentages give the identity to the first sequence at the same level (DNA or protein).

    ATGGCAAAACTTGAAAAACTGAATCAAGCAGGCCTGATGGTCGCTGGT
    M A K L E K L N Q A G L M V A G

    ATGGCTAGGTTGGAGAAGATAAACCAAGCTGGGATAATAGTTGCAGGA    60%
    M V R L E K I N Q A G L L V A G                     69%

    M V R I Q K I N E K G A L L A G                     38%
    Q V R I Q K I Y E K G A L L A A                     19% (twilight zone)
    Q V R I Q K I Y E K T A L L F A                      6% (midnight zone)
Blast report

                                                                        Score   E
    Sequences producing significant alignments:                         (bits)  Value

    pir F69494 (R)-hydroxyglutaryl-CoA dehydratase activator (hgdc)...  462  e-129
    gb AAD31675.1 (AF123384) (R)-2-hydroxyglutaryl-CoA dehydratase...   233  1e-060
    sp P39383 YJIL_ECOLI HYPOTHETICAL 27.4 KD PROTEIN IN IADA-MCRD I... 184  9e-046
    emb CAA67409.1 (X98916) orf6 [Methanopyrus kandleri]                170  1e-041
    gb AAF13150.1 AF156260_1 (AF156260) unknown [Methanosarcina bark... 143  2e-033
    pir A69117 activator of (R)-2-hydroxyglutaryl-CoA - Methanobact...  132  4e-030
    pir A72369 (R)-2-hydroxyglutaryl-CoA dehydratase activator-rela...  129  4e-029
    gb AAC23928.1 (U75363) benzoyl-coa reductase subunit [Rhodopseu...  117  1e-025
    pir S04476 hypothetical protein (hdga 5' region) - Acidaminococ...  104  1e-021
    sp P27542 DNAK_CHLPN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  42  0.005
    gb AAC15473.1 (AF016711) heat shock protein 70 [Burkholderia ps...   39  0.036
    pir F75029 o-sialoglycoprotein endopeptidase (gcp) PAB1159 - Py...   38  0.082
    pir F72514 probable glucokinase APE2091 - Aeropyrum pernix (str...   37  0.18
    sp P42373 DNAK_BURCE DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  37  0.18
    emb CAA10035.1 (AJ012470) mitochondrial-type hsp70 [Encephalito...   36  0.31
    sp P56836 DNAK_CHLMU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  36  0.41
    gb AAF39496.1 (AE002336) dnak protein [Chlamydia muridarum]          36  0.41
    pir B70189 rod shape-determining protein (mreb-1) homolog - Lym...   36  0.41
    sp O57716 GCP_PYRHO PUTATIVE O-SIALOGLYCOPROTEIN ENDOPEPTIDASE (...  36  0.54
    sp O33522 DNAK_ALCEU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  36  0.54
    ref NP_012874.1 Ykl050cp >gi 549677 sp P35736 YKF0_YEAST HYPOTH...   36  0.54
    emb CAA53420.1 (X75781) D513 [Saccharomyces cerevisiae] >gi 158...   36  0.54
    sp P30722 DNAK_PAVLU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) >gi 99...  36  0.54
    pir A40158 dnak-type molecular chaperone - Chlamydia trachomati...   34  1.2
    gb AAF07742.1 AE001584_39 (AE001584) hypothetical protein [Borre...  34  1.6
    gb AAF07521.1 AE001577_35 (AE001577) hypothetical protein [Borre...  34  1.6
    gb AAF38963.1 (AE002276) cell shape-determining protein MreB [C...   34  2.1
    gb AAG08147.1 AE004889_10 (AE004889) DnaK protein [Pseudomonas a...  33  2.7
    dbj BAB03215.1 (AB017035) dnak [Bacillus thermoglucosidasius]        33  2.7
    sp P43736 DNAK_HAEIN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  33  2.7
    sp P45554 DNAK_STAAU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  33  2.7
    sp Q58303 FLA3_METJA FLAGELLIN B3 PRECURSOR                          32  4.7
    gb AAG08239.1 AE004898_10 (AE004898) phosphoribosylaminoimidazol...  32  6.1
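The summary lines of such a report are easy to filter by E-value in a script. The Python sketch below assumes the layout shown above, where the last two columns of each line are the bit score and the E-value; the function name parse_hit and the cutoff of 0.001 are arbitrary choices made for illustration, not a recommended threshold.

```python
def parse_hit(line):
    """Split one summary line into (description, bit_score, e_value)."""
    description, score, evalue = line.rsplit(None, 2)
    if evalue.startswith("e"):
        # Reports of this kind print 'e-129' where 1e-129 is meant.
        evalue = "1" + evalue
    return description, float(score), float(evalue)

report_lines = [
    "pir F69494 (R)-hydroxyglutaryl-CoA dehydratase activator (hgdc)... 462 e-129",
    "gb AAD31675.1 (AF123384) (R)-2-hydroxyglutaryl-CoA dehydratase... 233 1e-060",
    "sp P27542 DNAK_CHLPN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70... 42 0.005",
    "sp Q58303 FLA3_METJA FLAGELLIN B3 PRECURSOR 32 4.7",
]

CUTOFF = 1e-3   # arbitrary illustrative cutoff
significant = [hit for hit in map(parse_hit, report_lines) if hit[2] <= CUTOFF]
for description, score, evalue in significant:
    print(f"{score:6.0f}  {evalue:8.1e}  {description}")
```

In the report above, the E-values jump from 1e-021 to 0.005 between the dehydratase activator family and the DnaK (HSP70) hits, which is the kind of gap such a filter makes visible.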
Conclusions / comments
* If two proteins have significant sequence homology, it is highly likely that they have the same 3D structure (and the same function).
* If two proteins have the same 3D structure, it does not necessarily mean that their sequences are related.
* Very remote evolutionary relationships are difficult or impossible to detect with normal BLAST / pairwise alignment.
* Database sequence searches involving proteins should be carried out at the protein level and not at the DNA level.
More rules of database searches
- Compare sequences as proteins and not as DNA*
- Use the smallest possible database (not too small though)
- Sequence statistics should be used rather than percent identity/similarity as the criterion for homology
- Consider different scoring matrices and gap penalties

* 1) DNA sequences encoding the same protein sequence can be very different, due to the degeneracy of the genetic code.

        TTTCGATTCTCAACAAGAAGC
        **  * **    **  *   *
        TTCAGGTTTAGCACGCGGTCC
         F  R  F  S  T  R  S

  2) Amino acid substitution matrices may be taken into account.
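The codon example above can be checked with a short Python sketch. It builds the standard genetic code from the usual compact 64-letter string (bases in the order T, C, A, G), translates both DNA sequences, and counts the identical DNA positions. The function name translate is made up for illustration.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate a DNA string in frame 1; an incomplete last codon is ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

dna1 = "TTTCGATTCTCAACAAGAAGC"
dna2 = "TTCAGGTTTAGCACGCGGTCC"
same = sum(x == y for x, y in zip(dna1, dna2))

print(translate(dna1), translate(dna2))
print(f"DNA identity: {same}/{len(dna1)} positions")
```

Both sequences translate to the same peptide although fewer than half of the DNA positions are identical, which is the point of rule 1.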
PAM matrices

The starting point of the PAM matrices is closely related orthologs; full sequences are taken into account. In the calculation of the PAM matrices, extrapolation is done to more distant relationships.

    Neurospora_crassa          GSVDGYAYTD ANKQKGITWD ENTLFEYLEN PKKYIPGTKM AFGGLKKDKD
    Stellaria_longipes         GSVEGFSYTD ANKAKGIEWN KDTLFEYLEN PKKYIPGTKM AFGGLKKDKD
    Thermomyces_lanuginosus    GSVEGYSYTD ANKQAGITWN EDTLFEYLEN PKKFIPGTKM AFGGLKKNKD
    Arabidopsis_thaliana       GSVAGYSYTD ANKQKGIEWK DDTLFEYLEN PKKYIPGTKM AFGGLKKPKD
    Aspergillus_niger          GQSEGYAYTD ANKQAGVTWD ENTLFSYLEN PKKFIPGTKM AFGGLKKGKE
    Debaryomyces_occidentalis  GQAAGYSYTD ANKKKGVEWT EQTMSDYLEN PKKYIPGTKM AFGGLKKPKD
    Schizosaccharomyces_pombe  GQAEGFSYTE ANRDKGITWD EETLFAYLEN PKKYIPGTKM AFAGFKKPAD
    Fagopyrum_esculentum       GTTAGYSYSA ANKNKAVTWG EDTLYEYLLN PKKYIPGTKM VFPGLKKPQE
    Sesamum_indicum            GTTPGYSYSA ANKNMAVIWG ENTLYDYLLN PKKYIPGTKM VFPGLKKPQE
    Haematobia_irritans        GQAAGFAYTN ANKAKGITWQ DDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
    Lucilia_cuprina            GQAPGFAYTN ANKAKGITWQ DDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
    Ceratitis_capitata         GQAAGFAYTD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
    Sarcophaga_peregrina       GQAPGFAYTD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
    Manduca_sexta              GQAPGFSYSD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM VFAGLKKANE
    Samia_cynthia              GQAPGFSYSN ANKAKGITWG DDTLFEYLEN PKKYIPGTKM VFAGLKKANE
    Schistocerca_gregaria      GQAPGFSYTD ANKSKGITWD ENTLFIYLEN PKKYIPGTKM VFAGLKKPEE
    Apis_mellifera             GQAPGYSYTD ANKGKGITWN KETLFEYLEN PKKYIPGTKM VFAGLKKPQE
    Macaca_mulatta             GQAPGYSYTA ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE
    Pan_troglodytes            GQAPGYSYTA ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE
    Anas_platyrhynchos         GQAEGFSYTD ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKSE
    Aptenodytes_patagonicus    GQAEGFSYTD ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKSE
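The extrapolation is usually described as repeated multiplication of the 1-PAM mutation probability matrix with itself, so that for instance PAM250 corresponds to the 250th power. The Python sketch below shows the principle on a toy alphabet of only two residue classes. The numbers in TOY_PAM1 are invented for illustration and are not Dayhoff's values; the real matrix is 20 x 20.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = matmul(result, m)
    return result

# Hypothetical 1-PAM matrix: each row holds the probabilities that a residue
# stays the same or changes to the other class over one PAM unit.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = matpow(TOY_PAM1, 250)
for row in toy_pam250:
    print(" ".join(f"{p:.4f}" for p in row))
```

Each row still sums to 1 after the extrapolation, and the off-diagonal probabilities grow with the PAM distance. A scoring matrix is then derived from such probabilities as log-odds values, a step not shown here.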
BLOSUM matrices

The starting point is conserved elements (blocks) in protein families. The protein sequences are more distantly related than the sequences used for the PAM matrices.

Example of an entry from the BLOCKS database:

    Block BL00134A
    ID   TRYPSIN_HIS; BLOCK
    AC   BL00134A; distance from previous block=(23,4353)
    DE   Serine proteases, trypsin family, histidine proteins.
    BL   CHC; width=17; seqs=259; 99.5%=867; strength=1439
    CTR2_VESCR  P00769  (  24) FCGGSISKRYVLTAAHC  63
    CTRL_HALRU  P35003  (  53) CGCVLYTTSKALTAAHC  80
    EAST_DROME  P13582  ( 158) CGGSLISTRYVITASHC  18
    GILX_HELHO  P43685  (  26) WSGVLLNRDWILTAAHC  45
    NDL_DROME   P98159  (1170) CGGTIYSDRWIISAAHC  10
    PCE_TACTR   P21902  ( 157) CGGALVTNRHVITASHC  25
    TRYP_ASTFL  P00765  (  30) CGASIYNENYAITAGHC  36
    TRYU_DROME  P42279  (  59) CGGCILDAVTIATAAHC  31
    ACRO_HUMAN  P10323  (  73) CGGSLLNSRWVLTAAHC   4
    ACRO_MOUSE  P23578  (  74) CGGSLLNSHWVLTAAHC   5
    ACRO_PIG    P08001  (  71) CGGILLNSHWVLTAAHC   7
    ACRO_RABIT  P48038  (  71) CGGVLLNAHWVLTAAHC   6
    ACRO_RAT    P29293  (  74) CGGSLLNSHWVLTAAHC   5
    ANC1_AGKRH  P26324  (  28) CGGVLIHPEWVITAEHC  25
    ANC2_AGKRH  P47797  (  52) CGGVLIHPEWVITAKHC  16
    ANCR_AGKCO  P09872  (  25) CGGTLINQEWVLTARHC  25
    BATX_BOTAT  P04971  (  50) CGMTLINQEWVLTAAHC  49
Variants of BLAST / FASTA

    Query/database   DNA/DNA     P/P         DNA/P            P/DNA                     DNA/DNA
    blastall         -p blastn   -p blastp   -p blastx        -p tblastn                -p tblastx
    fasta            fasta       fasta       fastx, fasty*    tfasta, tfastx, tfasty

* Compare a DNA sequence to a protein sequence database, by comparing the translated DNA sequence in three frames and allowing gaps and frameshifts. fastx3 uses a simpler, faster algorithm for alignments that allows frameshifts only between codons; fasty3 is slower but produces better alignments with poor quality sequences because frameshifts are allowed within codons.