Sequence information - lectures

1 Sequence information - lectures
* Pairwise alignment
* Alignments in database searches
* Multiple alignments
* Profiles
* Patterns
* RNA secondary structure / Transformational grammars
* Genome organisation / Gene prediction / HMMs
* Phylogeny

2 Dates in the history of sequence information
Experimental
1953   Amino acid sequence of insulin
       Base sequence of a tRNA molecule, 75 bases
1976   Sequence of MS2 RNA, 3569 bases
1970s  Birth of cloning
1977   DNA sequencing (Sanger, F.)
1978   Bacteriophage genome fd (6408 bases)
1995   Genomes of H. influenzae & M. genitalium (1.8 / 0.6 million bases)
2000   >95% of human genome (3000 million bases)
Sequence analysis
1970   Global alignment, Needleman-Wunsch
1975   PAM matrix (Dayhoff)
1981   Local alignment, Smith-Waterman
1982   EMBL database
1985   FASTA
1985   Multiple sequence alignment, first attempts
1988   Neural networks (protein secondary structure)
1989   Clustal
1989   Profilesearch
1991   BLAST
1994   Profile HMMs
1994   Context-free grammars (RNA secondary structure)
1997   PSI-BLAST

3 Sequences - The analysis of raw data from the laboratory
* Base-calling (sequence from raw data): Phred
* Sequence assembly: Phrap, CAP3 / CAP4
* Cleaning up sequences: quality filtering, vector filtering

4 Sequence formats / ASCII-Binary formats
In ASCII text format (= human readable) each character is stored as one byte. For instance, the ASCII code of 'A' is 65 as a decimal number, or 01000001 as a binary number.
However, sequence data is often stored in a binary format. For instance, the four bases may be stored with two bits each:
A = 00   T = 01   C = 10   G = 11
In this way there is room in one byte (= 8 bits) for 4 bases.
Typically databases are downloaded by the bioinformatician in ASCII format but then reformatted into a binary format for use with different sequence analysis tools. One example is the databases for BLAST searches, which may be formatted with the NCBI utility 'formatdb'.
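The two-bit encoding above can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the A/T/C/G codes given on the slide; the function names `pack` and `unpack` are made up here, and real tools such as formatdb use their own file layouts.

```python
# Two-bit codes as given on the slide: A = 00, T = 01, C = 10, G = 11.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BASES = "ATCG"  # position in this string = two-bit code


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte (hypothetical helper)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk by padding with zero bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


packed = pack("ATCGA")
print(len(packed), unpack(packed, 5))
```

The sequence length has to be stored separately, because the padding bits in the last byte are indistinguishable from 'A'.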

5 How do you know if a file contains ASCII text or has a binary format?
Using cat, more, or less on a binary file
% cat /usr/local/bin/less
produces unreadable material, like:
^?ELFÂ^BÂ^@^@^@^@^@^@^@^@^@^@^B^^@^@^@Â^@@:\260^@^@^@4^@Âz0^@^@^@^D^@4^@ ^@ ^G^@(^@^Q^@^P^@^@^@^F^@^@^@4^@@^@4^@@^@4^@^@^@\340^@^@^@^@^@^@^@^D^@^@^@^D^@^@^@
^C^@^@Â ^@@Â ^@@Â ^@^@^@^S^@^@^@^S^@^@^@^D^@^@^@^Dp^@^@^@^@^@Â@^@@Â@^@@Â@ ^@^@^@^X^@^@^@^X^@^@^@^D^@^@^@^D^@^@^@^B^@^@Â\200^@@Â\200^@@Â\200^@^@00^@^@00
^@^@^@^D^@^@^@^Pp^@^@Â^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^D ^@^@^@Â^@^@^@^@^@@^@^@^@@^@^@^@Â@^@^@Â@^@^@^@^@Ê^@Â^@^@^@^@^@Â^@Â@^@^P^@
^@^@^P^@^@^@^@^@ ^@^@^@;\200^@^@^@^F^@Â^
The unix utility file will attempt to determine the file type, for instance
% file README.txt
ascii text
% file /usr/local/bin/less
ELF 32-bit MSB dynamic executable MIPS - version 1
% file data.z
compressed data
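A very crude version of the same check can be written in Python. This is a sketch, not what the file utility actually does: it only looks for NUL bytes and non-ASCII bytes in the first part of the file, so UTF-8 text with accented characters would be reported as binary. The function names are hypothetical.

```python
def looks_binary(chunk: bytes) -> bool:
    """Heuristic: NUL bytes or non-ASCII bytes suggest a binary file."""
    if b"\x00" in chunk:
        return True
    try:
        chunk.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False


def file_looks_binary(path, nbytes=1024):
    """Apply the heuristic to the first `nbytes` bytes of a file."""
    with open(path, "rb") as handle:
        return looks_binary(handle.read(nbytes))


print(looks_binary(b"\x7fELF\x01\x02\x00"))   # start of an ELF executable
print(looks_binary(b">LISOD\ncgttatttaa\n"))  # start of a FASTA file
```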

6 Sequence formats
Examples
1. EMBL
ID   LISOD standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg

7 2. FASTA
>LISOD L.ivanovii sod gene for superoxide dismutase
cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa
ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg
3. GCG
lisod.seq Length: 756 October 27, :17 Type: N Check:
  1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc
 51 gccttacaat gtaatttctt ttcacataaa taataaacaa tccgaggagg
101 aatttttaat gacttacgaa ttaccaaaat taccttatac ttatgatgct
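Because the FASTA format is so simple (a '>' header line followed by sequence lines), it is easy to read in a script. The following is a minimal sketch of a reader; `read_fasta` is a name chosen here for illustration, and for real work a library parser is usually the safer choice.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence lines may contain blanks between blocks of ten.
            chunks.append(line.replace(" ", ""))
    if header is not None:
        yield header, "".join(chunks)


example = """>LISOD L.ivanovii sod gene for superoxide dismutase
cgttatttaa ggtgttacat
agttctatgg aaatagggtc
"""
records = list(read_fasta(example.splitlines()))
for name, seq in records:
    print(name, len(seq))
```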

8 Multiple sequence format of GCG
1.msf MSF: 44 Type: P October 24, :41 Check:
 Name: ftsy_bucai Len: 44 Check: 4221 Weight: 1.00
 Name: ftsy_ecoli Len: 44 Check: 2326 Weight: 1.00
 Name: ftsy_aquae Len: 44 Check: 2339 Weight: 1.00
 Name: ftsy_bacsu Len: 44 Check: 6177 Weight: 1.00
 Name: sr54_aciam Len: 44 Check: 6296 Weight: 1.00
 Name: sr54_aerpe Len: 44 Check: 7291 Weight: 1.00
 Name: sr54_arcfu Len: 44 Check: 122 Weight: 1.00
 Name: sr54_aquae Len: 44 Check: 345 Weight: 1.00
//
            1                                             44
ftsy_bucai  KNS.EKLYFL LKRKMFNILK KVEIP...LE ISSHSPFVIL VVGV
ftsy_ecoli  RDA.EALYGL LKEEMGEILA KVDEP...LN VEGKAPFVIL MVGV
ftsy_aquae  KEG.EKIKEL LKKELKELLK NCQ...GELK IPEKVGAVLL FVGV
ftsy_bacsu  QDP.KEVQSV ISEKLVEIYN SGDEQISELN IQDGRLNVIL LVGV
sr54_aciam  PTYIERREWF IKIVYDELSN LFGGDKEPKV IPDKIPYVIM LVGV
sr54_aerpe  PPGVTRRDWM IKIVYEELVK LFGGDQEPQV DPPKTPWIVL LVGV
sr54_arcfu  LPALNAKEQI LKIVYEELLR GVGEGLEIPL KKAK....IM LVGL
sr54_aquae  PKNLSPAEFV IKTVYEELVD ILGGEK.... .ADLKKGTVL FVGL

9 FASTA multiple sequence format
>ftsy_bucai, 44 bases, 52FF1180 checksum.
KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
>ftsy_ecoli, 44 bases, B14506B6 checksum.
RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
>ftsy_aquae, 44 bases, 27571BEE checksum.
KEG-EKIKELLKKELKELLKNCQ---GELKIPEKVGAVLLFVGV
>ftsy_bacsu, 44 bases, FA023A4F checksum.
QDP-KEVQSVISEKLVEIYNSGDEQISELNIQDGRLNVILLVGV
>sr54_aciam, 44 bases, 2FC13632 checksum.
PTYIERREWFIKIVYDELSNLFGGDKEPKVIPDKIPYVIMLVGV
>sr54_aerpe, 44 bases, 37AFB895 checksum.
PPGVTRRDWMIKIVYEELVKLFGGDQEPQVDPPKTPWIVLLVGV
>sr54_arcfu, 44 bases, checksum.
LPALNAKEQILKIVYEELLRGVGEGLEIPLKKAK----IMLVGL
>sr54_aquae, 44 bases, A2 checksum.
PKNLSPAEFVIKTVYEELVDILGGEK-----ADLKKGTVLFVGL

10 CLUSTAL multiple sequence format
CLUSTAL W (1.81) multiple sequence alignment

ftsy_bucai      KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
ftsy_ecoli      RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
ftsy_aquae      KEG-EKIKELLKKELKELLKNCQ---GELKIPEKVGAVLLFVGV
ftsy_bacsu      QDP-KEVQSVISEKLVEIYNSGDEQISELNIQDGRLNVILLVGV
sr54_aciam      PTYIERREWFIKIVYDELSNLFGGDKEPKVIPDKIPYVIMLVGV
sr54_aerpe      PPGVTRRDWMIKIVYEELVKLFGGDQEPQVDPPKTPWIVLLVGV
sr54_arcfu      LPALNAKEQILKIVYEELLRGVGEGLEIPLKKAK----IMLVGL
sr54_aquae      PKNLSPAEFVIKTVYEELVDILGGEK-----ADLKKGTVLFVGL
                         .:.    ::                    ::.**:

11 Phylip multiple sequence format
 8 44
ftsy_bucai KNS-EKLYFL LKRKMFNILK KVEIP---LE ISSHSPFVIL VVGV
ftsy_ecoli RDA-EALYGL LKEEMGEILA KVDEP---LN VEGKAPFVIL MVGV
ftsy_aquae KEG-EKIKEL LKKELKELLK NCQ---GELK IPEKVGAVLL FVGV
ftsy_bacsu QDP-KEVQSV ISEKLVEIYN SGDEQISELN IQDGRLNVIL LVGV
sr54_aciam PTYIERREWF IKIVYDELSN LFGGDKEPKV IPDKIPYVIM LVGV
sr54_aerpe PPGVTRRDWM IKIVYEELVK LFGGDQEPQV DPPKTPWIVL LVGV
sr54_arcfu LPALNAKEQI LKIVYEELLR GVGEGLEIPL KKAK----IM LVGL
sr54_aquae PKNLSPAEFV IKTVYEELVD ILGGEK---- -ADLKKGTVL FVGL

12 Sequence editors SeqLab / GCG

13 Genedoc

14 Conversion of sequence formats - Readseq
(one useful version of readseq is part of the SAM package)
Readseq can convert between the following formats:
 1. IG/Stanford        10. Olsen (in-only)
 2. GenBank/GB         11. Phylip3.2
 3. NBRF               12. Phylip
 4. EMBL               13. Plain/Raw
 5. GCG                14. PIR/CODATA
 6. DNAStrider         15. MSF
 7. Fitch              16. ASN.1
 8. Pearson/Fasta      17. PAUP/NEXUS
 9. Zuker (in-only)    18. Pretty (out-only)

15 GCG package utilities:
                 Convert from   to
* Reformat       text           GCG
* FromEMBL       EMBL           GCG
* FromGenBank    GenBank        GCG
* FromFasta      Fasta          GCG
* ToFastA        GCG            Fasta
Clustal:
clustalw my_alignment -convert -output=gcg
clustalw my_alignment -convert -output=phylip
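To show what such a conversion involves, here is a small sketch that writes an alignment in the Phylip layout shown on the earlier slide (a count line, then a 10-character name followed by the sequence in blocks of ten). The function `to_phylip` is hypothetical and only covers the simple sequential case; readseq and clustalw handle many more details.

```python
def to_phylip(records, block=10):
    """Format (name, aligned_sequence) pairs as sequential Phylip text."""
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("all aligned sequences must have the same length")
    lines = [f"{len(records)} {length}"]
    for name, seq in records:
        blocks = " ".join(seq[i:i + block] for i in range(0, length, block))
        # Names are truncated or padded to exactly 10 characters.
        lines.append(f"{name[:10]:<10} {blocks}")
    return "\n".join(lines)


alignment = [
    ("ftsy_bucai", "KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV"),
    ("ftsy_ecoli", "RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV"),
]
print(to_phylip(alignment))
```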

16 Downloading bioinformatics data and software for local use

17

18 Downloading bioinformatics data and software for local use
Data and software downloaded from the net are typically compressed in one way or another. The most common formats are *.Z and *.gz. These are binary files that cannot be read with normal text editors.
To uncompress a *.Z file:
% uncompress [file]
To uncompress a *.gz file:
% gunzip [file]
A set of files is often distributed as a compressed tar archive. For instance, if you download a file called dna.tar.gz you can do
% gunzip dna.tar.gz
This will yield a file called dna.tar. Then do
% tar -xvf dna.tar
This will unpack all files contained in dna.tar.
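The same two steps (gunzip, then tar -xvf) can also be done from a script with Python's standard tarfile module. A minimal sketch, where `unpack_tar_gz` is a hypothetical helper name and the archive name is the dna.tar.gz example from above:

```python
import tarfile


def unpack_tar_gz(archive_path, dest_dir="."):
    """Decompress and unpack a .tar.gz archive; return the member names."""
    # Mode "r:gz" reads a gzip-compressed tar archive in one step.
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names


# Example call (assumes dna.tar.gz exists in the current directory):
# print(unpack_tar_gz("dna.tar.gz"))
```

As with tar itself, only unpack archives from sources you trust, since member paths are chosen by whoever built the archive.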

19 Alignments and database searches
Common biological problem: We have a novel protein sequence. What can we infer from this sequence about the biological function of the protein?
* Pattern search - PROSITE
* Profile search - Pfam
* Prediction of transmembrane domains (~25% of all proteins are membrane bound!)
* Sequence homology - BLAST, FASTA, SSEARCH
Simple example: an unknown human protein is highly homologous to a protein with known function from another organism => the human protein most likely has the same function (it's an ortholog or a paralog)

20 BLAST
1. Heuristic step where short word matches are identified
2. Extension of the matches using a dynamic programming method, as for local pairwise alignments (substitution matrix, gap penalties)
- The nature of protein sequence and structure evolution - what is the sensitivity of BLAST searches?
- General principles of database searches
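Step 1 can be illustrated with a simplified word search. This sketch finds only exact shared words of length w between a query and a database sequence; BLAST itself also accepts similar ("neighbourhood") words that score above a threshold with the substitution matrix, which is not done here. The function name and the two example peptides are chosen for illustration.

```python
from collections import defaultdict


def word_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    # Index all words of the query by their start positions.
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    # Scan the subject and look each word up in the index.
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


print(word_hits("MAKLEKLNQ", "MVRLEKINQ"))  # [(3, 3)]: the shared word LEK
```

Each hit would then be the seed for the extension in step 2.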

21 Evolution of protein genes
ATGGCAAAACTTGAAAAACTGAATCAAGCAGGCCTGATGGTCGCTGGT
 M  A  K  L  E  K  L  N  Q  A  G  L  M  V  A  G
ATGGCTAGGTTGGAGAAGATAAACCAAGCTGGGATAATAGTTGCAGGA      60%
 M  V  R  L  E  K  I  N  Q  A  G  L  L  V  A  G       69%
 M  V  R  I  Q  K  I  N  E  K  G  A  L  L  A  G       38%
 Q  V  R  I  Q  K  I  Y  E  K  G  A  L  L  A  A       19% (twilight zone)
 Q  V  R  I  Q  K  I  Y  E  K  T  A  L  L  F  A        6% (midnight zone)
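Percent identity figures like the ones on this slide are computed from an alignment by counting identical positions. A minimal sketch follows; note that the choice of denominator (here: all columns that are not gap-in-both) is an assumption, and programs differ on this, so values from different tools are not always comparable. The example strings are made up.

```python
def percent_identity(a, b):
    """Percent identical positions between two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both sequences have a gap.
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


print(percent_identity("ACGT", "ACGA"))  # 75.0
print(percent_identity("AC-T", "ACGT"))  # 75.0
```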

22 Blast report
Sequences producing significant alignments: (bits) Value
pir F69494 (R)-hydroxyglutaryl-CoA dehydratase activator (hgdc) e-129
gb AAD (AF123384) (R)-2-hydroxyglutaryl-CoA dehydratase e-060
sp P39383 YJIL_ECOLI HYPOTHETICAL 27.4 KD PROTEIN IN IADA-MCRD I e-046
emb CAA (X98916) orf6 [Methanopyrus kandleri] 170 1e-041
gb AAF AF156260_1 (AF156260) unknown [Methanosarcina bark e-033
pir A69117 activator of (R)-2-hydroxyglutaryl-CoA - Methanobact e-030
pir A72369 (R)-2-hydroxyglutaryl-CoA dehydratase activator-rela e-029
gb AAC (U75363) benzoyl-coa reductase subunit [Rhodopseu e-025
pir S04476 hypothetical protein (hdga 5' region) - Acidaminococ e-021
sp P27542 DNAK_CHLPN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP
gb AAC (AF016711) heat shock protein 70 [Burkholderia ps
pir F75029 o-sialoglycoprotein endopeptidase (gcp) PAB Py
pir F72514 probable glucokinase APE Aeropyrum pernix (str
sp P42373 DNAK_BURCE DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP
emb CAA (AJ012470) mitochondrial-type hsp70 [Encephalito
sp P56836 DNAK_CHLMU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP
gb AAF (AE002336) dnak protein [Chlamydia muridarum]
pir B70189 rod shape-determining protein (mreb-1) homolog - Lym
sp O57716 GCP_PYRHO PUTATIVE O-SIALOGLYCOPROTEIN ENDOPEPTIDASE (
sp O33522 DNAK_ALCEU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP
ref NP_ Ykl050cp >gi sp P35736 YKF0_YEAST HYPOTH
emb CAA (X75781) D513 [Saccharomyces cerevisiae] >gi
sp P30722 DNAK_PAVLU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) >gi
pir A40158 dnak-type molecular chaperone - Chlamydia trachomati
gb AAF AE001584_39 (AE001584) hypothetical protein [Borre
gb AAF AE001577_35 (AE001577) hypothetical protein [Borre
gb AAF (AE002276) cell shape-determining protein MreB [C
gb AAG AE004889_10 (AE004889) DnaK protein [Pseudomonas a
dbj BAB (AB017035) dnak [Bacillus thermoglucosidasius]
sp P43736 DNAK_HAEIN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP
sp P45554 DNAK_STAAU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP
sp Q58303 FLA3_METJA FLAGELLIN B3 PRECURSOR
gb AAG AE004898_10 (AE004898) phosphoribosylaminoimidazol

23 Conclusions / comments
* If two proteins have significant sequence homology, it is highly likely that the two proteins have the same 3D structure (and the same function)
* If two proteins have the same 3D structure, it does not necessarily mean that the sequences are related
* Very remote evolutionary relationships are difficult or impossible to detect with normal BLAST / pairwise alignment
* Database sequence searches involving proteins should be carried out at the protein level and not at the DNA level

24 More rules of database searches
* Compare sequences as proteins and not as DNA (see note below)
* Use the smallest possible database (not too small though)
* Sequence statistics should be used rather than percent identity/similarity as the criterion for homology
* Consider different scoring matrices and gap penalties
Note on comparing at the protein level:
1) DNA sequences encoding the same protein sequence can be very different, due to the degeneracy of the genetic code.
TTTCGATTCTCAACAAGAAGC
**  * **    **  *   *
TTCAGGTTTAGCACGCGGTCC
 F  R  F  S  T  R  S
2) Amino acid substitution matrices may be taken into account.
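The degeneracy example can be checked directly. The sketch below builds the standard genetic code table, translates the two DNA sequences from the slide, and counts identical positions; the helper names are chosen here for illustration.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 1; a trailing partial codon is ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identities(a, b):
    """Count positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b))


dna1 = "TTTCGATTCTCAACAAGAAGC"
dna2 = "TTCAGGTTTAGCACGCGGTCC"
print(translate(dna1), translate(dna2))      # FRFSTRS FRFSTRS
print(identities(dna1, dna2), "of", len(dna1))  # 9 of 21
```

So the two sequences are identical as proteins but share only 9 of 21 positions as DNA.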

25 PAM matrices
The starting point of the PAM matrices is closely related orthologs; full sequences are taken into account. In the calculation of the PAM matrix, extrapolation is done to more distant relationships.
Neurospora_crassa          GSVDGYAYTD ANKQKGITWD ENTLFEYLEN PKKYIPGTKM AFGGLKKDKD
Stellaria_longipes         GSVEGFSYTD ANKAKGIEWN KDTLFEYLEN PKKYIPGTKM AFGGLKKDKD
Thermomyces_lanuginosus    GSVEGYSYTD ANKQAGITWN EDTLFEYLEN PKKFIPGTKM AFGGLKKNKD
Arabidopsis_thaliana       GSVAGYSYTD ANKQKGIEWK DDTLFEYLEN PKKYIPGTKM AFGGLKKPKD
Aspergillus_niger          GQSEGYAYTD ANKQAGVTWD ENTLFSYLEN PKKFIPGTKM AFGGLKKGKE
Debaryomyces_occidentalis  GQAAGYSYTD ANKKKGVEWT EQTMSDYLEN PKKYIPGTKM AFGGLKKPKD
Schizosaccharomyces_pombe  GQAEGFSYTE ANRDKGITWD EETLFAYLEN PKKYIPGTKM AFAGFKKPAD
Fagopyrum_esculentum       GTTAGYSYSA ANKNKAVTWG EDTLYEYLLN PKKYIPGTKM VFPGLKKPQE
Sesamum_indicum            GTTPGYSYSA ANKNMAVIWG ENTLYDYLLN PKKYIPGTKM VFPGLKKPQE
Haematobia_irritans        GQAAGFAYTN ANKAKGITWQ DDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Lucilia_cuprina            GQAPGFAYTN ANKAKGITWQ DDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Ceratitis_capitata         GQAAGFAYTD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Sarcophaga_peregrina       GQAPGFAYTD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM IFAGLKKPNE
Manduca_sexta              GQAPGFSYSD ANKAKGITWN EDTLFEYLEN PKKYIPGTKM VFAGLKKANE
Samia_cynthia              GQAPGFSYSN ANKAKGITWG DDTLFEYLEN PKKYIPGTKM VFAGLKKANE
Schistocerca_gregaria      GQAPGFSYTD ANKSKGITWD ENTLFIYLEN PKKYIPGTKM VFAGLKKPEE
Apis_mellifera             GQAPGYSYTD ANKGKGITWN KETLFEYLEN PKKYIPGTKM VFAGLKKPQE
Macaca_mulatta             GQAPGYSYTA ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE
Pan_troglodytes            GQAPGYSYTA ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE
Anas_platyrhynchos         GQAEGFSYTD ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKSE
Aptenodytes_patagonicus    GQAEGFSYTD ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKSE

26 BLOSUM matrices
The starting point is conserved elements in protein families. The protein sequences are more distantly related than the sequences used for PAM matrices.
Example of an entry from the BLOCKS database:
Block BL00134A
ID   TRYPSIN_HIS; BLOCK
AC   BL00134A; distance from previous block=(23,4353)
DE   Serine proteases, trypsin family, histidine proteins.
BL   CHC; width=17; seqs=259; 99.5%=867; strength=1439
CTR2_VESCR  P00769 (  24) FCGGSISKRYVLTAAHC 63
CTRL_HALRU  P35003 (  53) CGCVLYTTSKALTAAHC 80
EAST_DROME  P13582 ( 158) CGGSLISTRYVITASHC 18
GILX_HELHO  P43685 (  26) WSGVLLNRDWILTAAHC 45
NDL_DROME   P98159 (1170) CGGTIYSDRWIISAAHC 10
PCE_TACTR   P21902 ( 157) CGGALVTNRHVITASHC 25
TRYP_ASTFL  P00765 (  30) CGASIYNENYAITAGHC 36
TRYU_DROME  P42279 (  59) CGGCILDAVTIATAAHC 31
ACRO_HUMAN  P10323 (  73) CGGSLLNSRWVLTAAHC 4
ACRO_MOUSE  P23578 (  74) CGGSLLNSHWVLTAAHC 5
ACRO_PIG    P08001 (  71) CGGILLNSHWVLTAAHC 7
ACRO_RABIT  P48038 (  71) CGGVLLNAHWVLTAAHC 6
ACRO_RAT    P29293 (  74) CGGSLLNSHWVLTAAHC 5
ANC1_AGKRH  P26324 (  28) CGGVLIHPEWVITAEHC 25
ANC2_AGKRH  P47797 (  52) CGGVLIHPEWVITAKHC 16
ANCR_AGKCO  P09872 (  25) CGGTLINQEWVLTARHC 25
BATX_BOTAT  P04971 (  50) CGMTLINQEWVLTAAHC 49
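How scores can be derived from such a block may be sketched as follows: count the residue pairs observed in each column, compare the observed pair frequency with the frequency expected by chance, and take the log-odds. This is a simplified illustration only. It omits the clustering of similar sequences that real BLOSUM matrices apply before counting (e.g. at 62% identity for BLOSUM62), and with only four sequences from the block above the numbers are not meaningful as scores. The function name is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(block):
    """Log-odds scores (half-bit units) from an ungapped block of equal-length segments."""
    # Count every residue pair within every column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        expected = background[a] ** 2 if a == b else 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


block = [
    "CGGSLLNSRWVLTAAHC",  # ACRO_HUMAN
    "CGGSLLNSHWVLTAAHC",  # ACRO_MOUSE
    "CGGILLNSHWVLTAAHC",  # ACRO_PIG
    "CGGVLLNAHWVLTAAHC",  # ACRO_RABIT
]
scores = blosum_like_scores(block)
print(scores[("C", "C")])
```

Only pairs that actually occur in the block get a score here; a real matrix needs enough data to cover all 210 residue pairs.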

27

28 Variants of BLAST / FASTA
             DNA/DNA      P/P          DNA/P            P/DNA                     DNA/DNA
blastall     -p blastn    -p blastp    -p blastx        -p tblastn                -p tblastx
fasta        fasta        fasta        fastx, fasty*    tfasta, tfastx, tfasty
* Compare a DNA sequence to a protein sequence database by comparing the translated DNA sequence in three frames and allowing gaps and frameshifts. fastx3 uses a simpler, faster algorithm for alignments that allows frameshifts only between codons; fasty3 is slower but produces better alignments with poor-quality sequences, because frameshifts are allowed within codons.