Protein & DNA Sequence Analysis Bobbie-Jo Webb-Robertson May 3, 2004
Sequence Analysis
Anything connected to identifying higher biological meaning from raw sequence data.
Genomic & Proteomic Data
Sequence alignment was one of the first bioinformatics techniques (~1970); it pre-dates high-throughput sequencing techniques.
[Figure: growth of GenBank and proteomic data in petabytes (log scale, 0.00001 to 1000) by year, 1988 to 2016]
Motivation
Structure Prediction Challenge
Guiding Principle
The basic guiding principle is EVOLUTION.
Regions of the DNA/protein sequence that are functional units are less tolerant of mutation.
Examples: myoglobin vs. hemoglobin; zinc finger.
Outline
Pairwise Sequence Alignment
  Dynamic Programming
  Heuristic
  Bayesian
Multiple Sequence Alignment
  Heuristic
  Statistics (Gibbs Sampling)
Conclusions
Pairwise Sequence Alignment
One of the most commonly performed tasks in bioinformatics.
A method to compare two sequences and make inferences about the relationship between them.
Homologs = two molecules that share a common ancestor.
Searching for Homology
Query (unknown function/structure):
>d1npx_2 3.4.1.4.5 (120-242) NADH peroxidase
IPGKDLDNIYLMRGRQWAIKLKQKTVDPEVNNVVVIGSGYIGIEAAEAFAKAGKKVTVID
ILDRPlGVYLDKEFTDVLTEEMEANNITIATGETVERYEGDGRVQKVVTDKNAYDADLVV
VAV
Target (known function/structure):
>d3lada2 3.4.1.4.6 (159-227) Dihydrolipoamide dehydrogenase
PAPVDQDVIVDSTGALDFQNVPGKLGVIGAGVIGLELGSVWARLGAEVTVLEAMDKFLPA
VDEQVAKEAQKILTKQGLKILLGARVTGTEVKNKQVTVKFVDAEGEKSQAFDKLIVAVG
Pairwise Sequence Alignment
A one-to-one correspondence between the residues of two sequences:
$R^{(1)} = \{R^{(1)}_1, \ldots, R^{(1)}_I\} = \{\mathrm{HEAGAWGHE}\}$
$R^{(2)} = \{R^{(2)}_1, \ldots, R^{(2)}_J\} = \{\mathrm{GHEE}\}$
Global:
HEAGAWGHE-
------GHEE
Local:
GHE
GHE
Almost all alignment is done at the local level.
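Why Sequence Alignment Algorithms?
There are a huge number of possible alignments:
  Length 10: about 185,000 alignments
  Length 20: about 138 billion alignments
The number of alignments grows exponentially; dynamic programming finds the best one in $O(n^2)$ time, where n is the length of the longest sequence.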
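The two counts quoted on this slide agree with the central binomial coefficient C(2n, n), a common rough estimate of the number of ways to align two sequences of length n. The sketch below assumes that this is the estimate intended; the function name `alignment_count` is hypothetical.

```python
from math import comb


def alignment_count(n: int) -> int:
    """Central binomial coefficient C(2n, n).

    Used here as a rough count of the alignments of two length-n
    sequences (an assumption about which estimate the slide uses).
    """
    return comb(2 * n, n)


for n in (10, 20):
    print(n, alignment_count(n))
```

Even at length 20 the count is far beyond what exhaustive enumeration could handle for database-scale searches, which is the motivation for dynamic programming.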
Evolution (mutation)
Coded for at the DNA level.
Captured in two parameters:
  Scoring matrices
  Gap penalties
Scoring Matrices
Characterize the probability that one residue was substituted for another (log odds ratio).
BLOSUM 62 (1992):
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
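To make the log-odds idea concrete, the sketch below computes a rounded score from a pair frequency and two background frequencies. The function name `log_odds_score` and the frequencies are hypothetical; the default scale of 2 (half-bit units) is the convention usually cited for BLOSUM62 and is stated here as an assumption.

```python
import math


def log_odds_score(p_ab: float, q_a: float, q_b: float, scale: float = 2.0) -> int:
    """Rounded log-odds score: scale * log2(p_ab / (q_a * q_b)).

    p_ab     -- observed frequency of residues a and b being aligned
    q_a, q_b -- background frequencies of a and b
    scale    -- 2.0 gives half-bit units (assumed convention)
    """
    return round(scale * math.log2(p_ab / (q_a * q_b)))


# Hypothetical frequencies, chosen only to show the sign of the score.
print(log_odds_score(0.02, 0.1, 0.1))   # pair seen twice as often as chance
print(log_odds_score(0.005, 0.1, 0.1))  # pair seen half as often as chance
```

A positive entry therefore means the substitution is observed more often than chance would predict, and a negative entry means it is observed less often.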
Gap Penalties
Characterize the probability that a residue was inserted or deleted. For a gap of length g:
Linear: $\gamma(g) = -g d$
Affine: $\gamma(g) = -d - (g - 1) e$
d = gap opening penalty
e = gap extension penalty
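The two gap models can be written as small functions. This is a sketch assuming the usual sign convention, in which a gap of length g contributes -g·d under the linear model and -d - (g - 1)·e under the affine model; the parameter values in the usage lines are hypothetical.

```python
def linear_gap(g: int, d: float) -> float:
    """Linear gap score: every gapped position costs d."""
    return -g * d


def affine_gap(g: int, d: float, e: float) -> float:
    """Affine gap score: d to open the gap, e for each further position."""
    return -d - (g - 1) * e


# Hypothetical penalties: a long gap is cheaper under the affine model.
print(linear_gap(3, d=8))       # -24
print(affine_gap(3, d=12, e=2)) # -16
```

The affine model makes one long gap cheaper than several short ones, which matches the intuition that a single insertion or deletion event can span many residues.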
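Typical Output
In the sequence alignment, an identity marker means that two residues are identical; ":" means that two residues have similar physicochemical properties.
Typical Objective Function
$\mathrm{Score} = \sum_{\mathrm{start}}^{\mathrm{end}} \mathrm{ScoreMatrix} - \sum_{\mathrm{start}}^{\mathrm{end}} \mathrm{GapPenalties}$
(with the gap penalties d and e counted as positive costs)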
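As an illustration of the objective function, the sketch below scores an alignment that has already been written out as two gapped rows. It assumes a simple match/mismatch score in place of a full scoring matrix and an affine gap cost; the function name and all parameter values are hypothetical.

```python
def score_alignment(row1: str, row2: str, match: int = 1, mismatch: int = -1,
                    d: int = 2, e: int = 1) -> int:
    """Sum of substitution scores minus gap penalties for two gapped rows.

    A run of '-' in the same row costs d for its first column and e for
    each further column (affine penalty).
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    previous_gap_row = None  # which row held '-' in the previous column
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gap_row = 1 if a == "-" else 2
            score -= e if gap_row == previous_gap_row else d
            previous_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            previous_gap_row = None
    return score


print(score_alignment("GHE", "GHE"))                # local example
print(score_alignment("HEAGAWGHE-", "------GHEE"))  # global example
```

With these toy parameters the local alignment scores higher than the global one, because the global alignment must pay for the long leading gap.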
Pairwise Sequence Alignment Algorithms
Exhaustive (dynamic programming)
  Needleman-Wunsch (1970)
  Smith-Waterman (1981)
Approximate (heuristic methods)
  BLAST, PSI-BLAST (1990, 1997)
  FASTA (1988)
Statistical
  Bayes Block Aligner (1998)
  BALSA (2002)
Dynamic Programming
An optimization method that solves the problem through a sequence of decisions, each built on the best solutions to smaller subproblems.
Global: Needleman-Wunsch (1970)
Local: Smith-Waterman (1981)
Almost all alignment is done at the local level.
Dynamic Programming Algorithms
The optimal alignment, $A^*$, is found by fixing the scoring matrix, $\Theta_0$, and the gap penalties, $\Lambda_0$, and maximizing the log-likelihood:
$\log P(R^{(1)}, R^{(2)}, A^* \mid \Theta_0, \Lambda_0) = \max_A \{ \log P(R^{(1)}, R^{(2)} \mid A, \Theta_0) + \log P(A \mid \Lambda_0) \}$
Scoring matrix: $s(R^{(1)}_i, R^{(2)}_j)$
Gap penalties: d = gap opening penalty, e = gap extension penalty
Alignment: $A_{i,j} = 1$ if $R^{(1)}_i$ is aligned with $R^{(2)}_j$, and 0 otherwise.
Standard Dynamic Programming Choices
Match: $R^{(1)}_i$ aligned with $R^{(2)}_j$
Insertion into Sequence 1: $R^{(1)}_i$ aligned with a gap (-)
Deletion from Sequence 1: a gap (-) aligned with $R^{(2)}_j$
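Algorithm (Smith-Waterman, simple gap penalty)
$F(i, j) = \max \{ F(i-1, j-1) + s(R^{(1)}_i, R^{(2)}_j),\; F(i-1, j) - d,\; F(i, j-1) - d,\; 0 \}$
Scoring matrix: +1 on the diagonal (match), -1 elsewhere (mismatch).
Gap penalty: d = 1.
Example, GAG (rows) against GCG (columns):
      G  C  G
   0  0  0  0
G  0  1  0  1
A  0  0  0  0
G  0  1  0  1
The first cell is the maximum of the four candidates (+1), (-1), (-1), and (0), which gives 1.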
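The recurrence can be checked on the slide's toy example. This is a minimal sketch of the matrix fill only, assuming the +1/-1 match/mismatch scores and d = 1 used on the slide; the function name is hypothetical.

```python
def smith_waterman_matrix(seq1: str, seq2: str, match: int = 1,
                          mismatch: int = -1, d: int = 1) -> list:
    """Fill the Smith-Waterman matrix F with a linear gap penalty d.

    Rows correspond to seq1 and columns to seq2; row 0 and column 0
    are the zero boundary.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # match or mismatch
                          F[i - 1][j] - d,      # gap in seq2
                          F[i][j - 1] - d,      # gap in seq1
                          0)                    # start a new local alignment
    return F


for row in smith_waterman_matrix("GAG", "GCG"):
    print(row)
```

The zero in the maximum is what makes the alignment local: a poorly scoring prefix is dropped rather than carried forward.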
Reconstructing the Alignment
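Reconstruction is usually done by tracing back from the highest-scoring cell until a zero is reached. The sketch below assumes the same toy scoring as before (+1 match, -1 mismatch, linear gap penalty d = 1) and re-derives each step from the matrix instead of storing pointers; it is one possible arrangement, not necessarily the one shown on the original slide.

```python
def smith_waterman(seq1: str, seq2: str, match: int = 1,
                   mismatch: int = -1, d: int = 1):
    """Return (best score, aligned part of seq1, aligned part of seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    F = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d,
                          F[i][j - 1] - d, 0)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:    # came from the diagonal
            top.append(seq1[i - 1])
            bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:      # gap in seq2
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:                                 # gap in seq1
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("HEAGAWGHE", "GHEE"))
```

On the HEAGAWGHE / GHEE pair used earlier in the talk, this recovers the local alignment GHE over GHE.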
Smith-Waterman Availability
http://www.dna.affrc.go.jp/htdocs/swsrch/
http://www.ebi.ac.uk/mpsrch/
http://www-hto.usc.edu/software/seqaln/seqaln-query.html
Results from SWsrch with hemoglobin
Title: >hemoglobin
mvlsaddktnikncwgkigghggeygeealqrmfaafpttktyfshidvspgsaqvkahg
kkvadalakaadhvedlpgalstlsdlhahklrvdpvnfkflshcllvtlachhpgdftp
amhasldkflasvstvltskyr
Perfect Score: 750
Sequence: 1 MVLSADDKTNIKNCWGKIGG...HASLDKFLASVSTVLTSKYR 142

SUMMARIES
Result         Query
No.    Score   Match %  Length  DB  ID / Description
----------------------------------------------------------------------------
 1     750     100.0    142     1   HART1(I54239;I68531;A26903;A93047;A90284;A90285;A02268) hemoglobin alpha-1 chain - rat &RATHBAM_1(M17083 pid:
 2     745      99.3    141     1   (P01946) Hemoglobin alpha-1 and alpha-2 c
 3     742      98.9    142     1   RN2A1GL_1(X56325 pid:g3367722) R.norvegicus
 4     632      84.3    142     1   AK003077_1(AK003077 pid:none) Mus musculus a
 5     632      84.3    142     1   HAMS(A90791;I49720;A45964;I49722;I49721;B43560;A92945;) hemoglobin alpha chains - mouse &MMAGL1_1(V00714 pid:
 6     630      84.0    142     1   AK011076_1(AK011076 pid:none) Mus musculus 1
 7     627      83.6    141     1   (P01942) Hemoglobin alpha chain.
 8     623      83.1    141     1   (P20854) Hemoglobin alpha chain. &HARTNG
 9     623      83.1    142     1   L75940_1(L75940 pid:none) Mus musculus alpha
10     623      83.1    142     1   AK010422_1(AK010422 pid:none) Mus musculus E
11     615      82.0    141     1   HASL1W(S10481)hemoglobin alpha-i chain - Wed
12     611      81.5    141     1   (P01938) Hemoglobin alpha chain. &HALRN(
13     610      81.3    141     1   (P09420) Hemoglobin alpha chain. &A25359
14     609      81.2    141     1   (P01930) Hemoglobin alpha chain. &HAMQB(
15     607      80.9    141     1   (P18969) Hemoglobin alpha chain. &HAFQL(
16     605      80.7    141     1   (P01974) Hemoglobin alpha chain. &HACMA(
17     605      80.7    141     1   HAMN2F(S11533)hemoglobin alpha-ii chain - do
18     605      80.7    141     1   (P01945) Hemoglobin alpha chain. &HAHY(A
19     604      80.5    141     1   (P15163) Hemoglobin alpha-i and alpha-ii
20     603      80.4    141     1   (P01928) Hemoglobin alpha chain. &HAMQA(
Results from SWsrch with hemoglobin cont.
RESULT 9
>L75940_1(L75940 pid:none) Mus musculus alpha-globin mrna, complete cds. &MUSALGL_1(L75940 pid:g1162945)
Query Match 83.1%; Score 623; DB 1; Length 142;
Best Local Similarity 83.8%; Pred. No. 1.80e-63;
Matches 119; Conservative 7; Mismatches 16; Indels 0; Gaps 0;
Inserts 0; InsGaps 0; Deletes 0; DelGaps 0;

Db   1 mvlsgedksnikaawgkigghgaeyvaealermfasfpttktyfphfdvshgsaqvkghg 60
       + + + +
Qy   1 mvlsaddktnikncwgkigghggeygeealqrmfaafpttktyfshidvspgsaqvkahg 60

Db  61 kkvadalasaaghlddlpgalsalsdlhahklrvdpvnfkllshcllvtlashhpadftp 120
       ++
Qy  61 kkvadalakaadhvedlpgalstlsdlhahklrvdpvnfkflshcllvtlachhpgdftp 120

Db 121 avhasldkflasvstvltskyr 142
       +
Qy 121 amhasldkflasvstvltskyr 142
BLAST (Basic Local Alignment Search Tool)
Indexes the database.
Calculates the neighborhood of each word in the query using a scoring matrix and a score threshold.
Looks up all words and neighbors from the query in the database index.
Extends high-scoring segment pairs (HSPs) left and right to maximal length.
Finds maximal segment pairs (MSPs) between query and database.
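The indexing, neighborhood, and lookup steps can be sketched on a toy example. The code below is not BLAST itself: it assumes a DNA alphabet, a +1/-1 match/mismatch score, and a brute-force neighborhood search, and it leaves out the extension step. All function names and parameter values are hypothetical.

```python
from itertools import product


def neighborhood(word, alphabet, score, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    return ["".join(candidate)
            for candidate in product(alphabet, repeat=len(word))
            if sum(score(a, b) for a, b in zip(word, candidate)) >= threshold]


def index_words(seq, w):
    """Map every length-w word in `seq` to the positions where it occurs."""
    index = {}
    for i in range(len(seq) - w + 1):
        index.setdefault(seq[i:i + w], []).append(i)
    return index


def seed_hits(query, database_seq, w, alphabet, score, threshold):
    """(query position, database position) pairs where a neighbor word occurs."""
    index = index_words(database_seq, w)
    hits = []
    for q in range(len(query) - w + 1):
        for word in neighborhood(query[q:q + w], alphabet, score, threshold):
            hits.extend((q, pos) for pos in index.get(word, []))
    return sorted(hits)


def toy_score(a, b):
    """Toy substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


# With w = 3 and threshold 1, words with at most one mismatch are neighbors.
print(len(neighborhood("GAT", "ACGT", toy_score, 1)))
print(seed_hits("GAT", "TTGCTA", 3, "ACGT", toy_score, 1))
```

Each seed hit would then be extended in both directions to form an HSP; the threshold trades sensitivity (more neighbors) against speed (fewer lookups).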
BLAST database search
PSI-BLAST (Position-Specific Iterated BLAST)
A profile (position-specific scoring matrix, PSSM) is constructed from a multiple alignment of the highest-scoring hits in a BLAST search.
The PSSM is generated by calculating position-specific scores for each position in the alignment.
The profile is used to perform a second (and further) BLAST search, and the results of each "iteration" are used to refine the profile.
This iterative searching strategy results in increased sensitivity.
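A position-specific scoring matrix can be sketched as per-column log-odds scores computed from an alignment. The version below assumes a uniform background and a flat pseudocount; PSI-BLAST's actual sequence weighting and pseudocount scheme is more elaborate, so this is an illustration of the idea only. Function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def build_pssm(aligned, alphabet, background=None, pseudocount=1.0):
    """Per-column log2-odds scores from equal-length aligned sequences."""
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in alphabet)  # gaps are skipped
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({a: math.log2(((counts[a] + pseudocount) / total) / background[a])
                     for a in alphabet})
    return pssm


def score_window(pssm, window):
    """Score a window of the same width as the profile."""
    return sum(column[c] for column, c in zip(pssm, window))


profile = build_pssm(["ACG", "ACG", "ATG"], "ACGT")
print(score_window(profile, "ACG"), score_window(profile, "TTT"))
```

In an iterative search, the hits found with the profile would be added to the alignment and the profile rebuilt before the next round.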
Types of BLAST
BLASTP: searches a protein sequence against a protein database.
BLASTN: searches a nucleotide sequence against a nucleotide database.
TBLASTN: searches a protein sequence against a nucleotide database by translating each database nucleotide sequence in all 6 reading frames.
BLASTX: searches a nucleotide sequence against a protein database by first translating the query nucleotide sequence in all 6 reading frames.
PSI-BLAST: a profile is generated from identified homologs, used, and iteratively updated (especially good for identifying remote homologies).
PHI-BLAST: enforces the presence of a motif in addition to the usual PSI-BLAST criteria for matching.
Assessing Evidence for Homology
Is the score higher than expected from two random sequences (non-homologs)?
BLAST scores are not independent of the length of the sequences being aligned.
Fit an extreme value distribution to the scores of randomly shuffled sequences:
  BLAST returns the maximum score.
  The maximum of a large number of i.i.d. random variables tends to an extreme value distribution.
Expected number of HSPs with score at least S for two sequences of lengths m and n:
$E = K m n e^{-\lambda S}$
A similar approach applies to Smith-Waterman.
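The expected-count formula E = K·m·n·exp(-λS) is easy to evaluate once K and λ are known. In the sketch below the K and λ values are hypothetical placeholders (real values depend on the scoring system), and the bit-score conversion S' = (λS - ln K) / ln 2 is the standard normalization under which E = m·n·2^(-S').

```python
import math


def expected_hsps(score: float, m: int, n: int, K: float, lam: float) -> float:
    """Expected number of HSPs with at least this raw score: K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * score)


def bit_score(score: float, K: float, lam: float) -> float:
    """Normalized score in bits: (lam*S - ln K) / ln 2."""
    return (lam * score - math.log(K)) / math.log(2)


# Hypothetical parameters for illustration only.
K, lam = 0.1, 0.3
E = expected_hsps(50, m=142, n=1_000_000, K=K, lam=lam)
print(E, bit_score(50, K, lam))
```

Because E scales with m·n, the same raw score is less significant against a larger database, which is why raw scores alone are not comparable across searches.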
BLAST Availability
http://www.ncbi.nlm.nih.gov/education/BLASTinfo/information3.html
http://www.ch.embnet.org/software/bottomBLAST.html
http://hits.isb-sib.ch/cgi-bin/hits_psi_blast
Results from BLAST with hemoglobin
[Figure: distribution of 100 BLAST hits on the query sequence]
Results from BLAST cont.
Sequences producing significant alignments:                          Score (bits)  E-Value
gi 6981010 ref NP_037228.1 hemoglobin, alpha 1 [Rattus nor...        214   4e-55
gi 34870607 ref XP_340780.1 similar to hemoglobin alpha ch...        211   4e-54
gi 12845853 dbj BAB26925.1 unnamed protein product [Mus mu...        184   4e-46
gi 12833511 dbj BAB22552.1 unnamed protein product [Mus mu...        183   8e-46
gi 26345020 dbj BAC36159.1 unnamed protein product [Mus mu...        183   9e-46
gi 12846963 dbj BAB27381.1 unnamed protein product [Mus mu...        183   1e-45
gi 122280 sp P11755 HBA1_TADBR Hemoglobin alpha-1 chain >gi...       182   1e-45
gi 6680175 ref NP_032244.1 hemoglobin alpha, adult chain 1...        182   1e-45
gi 1162945 gb AAB59723.1 alpha-globin [Mus musculus]                 182   2e-45
gi 122352 sp P14387 HBA_ANTPA Hemoglobin alpha chain >gi 28...       180   6e-45
gi 70217 pir HASL1W hemoglobin alpha-i chain - Weddell seal          180   7e-45
gi 122341 sp P18969 HBA_AILFU Hemoglobin alpha chain >gi 70...       179   1e-44
gi 418658 pir HASHR2 hemoglobin alpha-ii chain - aoudad (t...        179   2e-44
gi 122491 sp P01950 HBA_SUNMU Hemoglobin alpha chain >gi 70...       179   2e-44
gi 122446 sp P26915 HBA_NASNA Hemoglobin alpha chain >gi 10...       178   2e-44
gi 14194808 sp Q9XSN3 HBA1_EQUBU Hemoglobin alpha-1 chain >...       178   2e-44
Results from BLAST cont.
>gi 122352 sp P14387 HBA_ANTPA Hemoglobin alpha chain
 gi 281094 pir A29702 hemoglobin alpha chain - pallid bat
Length = 141
Score = 180 bits (457), Expect = 6e-45
Identities = 96/141 (68%), Positives = 104/141 (73%)

Query: 2   VLSADDKTNIKNCWXXXXXXXXXXXXXALQRMFAAFPTTKTYFSHIDVSPGSAQVKAHGX 61
           VLS DKTN+K W AL+RMF +FPTTKTYF H D+SGSAQVK HG
Sbjct: 1   VLSPADKTNVKAAWDKVGGHAGDYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK 60

Query: 62  XXXXXXXXXXXHVEDLPGALSTLSDLHAHKLRVDPVNFKFLSHCLLVTLACHHPGDFTPA 121
           H++DLPGALS LSDLHA+KLRVDPVNFK LSHCLLVTLACHHPGDFTPA
Sbjct: 61  KVGDALGNAVAHMDDLPGALSALSDLHAYKLRVDPVNFKLLSHCLLVTLACHHPGDFTPA 120

Query: 122 MHASLDKFLASVSTVLTSKYR 142
           +HASLDKFLASVSTVL SKYR
Sbjct: 121 VHASLDKFLASVSTVLVSKYR 141
BALSA (Bayesian Algorithm for Local Sequence Alignment)
Smith-Waterman recursion with sums instead of maxima.
Formulates sequence alignment as a Bayesian inference problem.
Everything is a random variable.
Allows multiple scoring matrices.
Makes inferences on the parameters.
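The idea of a recursion "with sums" can be illustrated by replacing the maximum over predecessor cells with a sum, so that each cell accumulates the total weight of all alignments reaching it. The sketch below is a simplified global version with a single per-position gap weight, not the BALSA algorithm itself; the function name and weights are hypothetical.

```python
def sum_over_alignments(n, m, pair_weight, gap_weight):
    """Total weight of all alignment paths between prefixes of length n and m.

    pair_weight(i, j) -- weight for aligning residue i of sequence 1 with
                         residue j of sequence 2 (0-based indices)
    gap_weight        -- weight for each gapped position
    """
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                total += pair_weight(i - 1, j - 1) * F[i - 1][j - 1]
            if i > 0:
                total += gap_weight * F[i - 1][j]
            if j > 0:
                total += gap_weight * F[i][j - 1]
            F[i][j] = total
    return F[n][m]


# With all weights equal to 1 the sum simply counts alignment paths.
print(sum_over_alignments(2, 2, lambda i, j: 1.0, 1.0))
```

Summing rather than maximizing is what lets a Bayesian method average over alignments instead of committing to a single best one.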
BALSA Methodology
Joint = likelihood × priors:
$P(R^{(1)}, R^{(2)}, A, \Theta, \Lambda) = P(R^{(1)}, R^{(2)} \mid A, \Theta)\, P(A \mid \Lambda)\, P(\Theta, \Lambda)$
Priors:
$P(\Theta, \Lambda) = \frac{1}{N_{\Theta,\Lambda}}$
A priori distribution of alignments:
$P(A \mid \Lambda) = P(A \mid \lambda_o, \lambda_e) = \frac{\lambda_o^{k_g(A)} \lambda_e^{l_g(A) - k_g(A)}}{\sum_{A'} \lambda_o^{k_g(A')} \lambda_e^{l_g(A') - k_g(A')}}$
Algorithm:
$P(R^{(1)}, R^{(2)} \mid \Theta, \Lambda) = \sum_A P(R^{(1)}, R^{(2)} \mid A, \Theta)\, P(A \mid \Lambda) = \frac{\sum_A P(R^{(1)}, R^{(2)} \mid A, \Theta)\, \lambda_o^{k_g(A)} \lambda_e^{l_g(A) - k_g(A)}}{\sum_A \lambda_o^{k_g(A)} \lambda_e^{l_g(A) - k_g(A)}}$
Assessing Evidence for Homology
The scores are independent of the length of the sequences being aligned.
$\mathrm{Score} = \frac{P(R^{(1)}, R^{(2)} \mid H)}{P(R^{(1)})\, P(R^{(2)})}$
The probability that the sequences are not homologous is calculated directly from the score:
$P(\bar{H} \mid R^{(1)}, R^{(2)}) = \frac{1}{\mathrm{Score} \cdot \frac{P(H)}{P(\bar{H})} + 1}$
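Reading the score as a Bayes factor, the posterior probability of non-homology follows from Bayes' rule as 1 / (Score · P(H)/(1 - P(H)) + 1). The sketch below assumes that reading of the slide; the function name and the prior values are hypothetical.

```python
def posterior_not_homologous(score: float, prior_h: float) -> float:
    """P(not H | R1, R2) from the Bayes factor `score` and the prior P(H)."""
    prior_odds = prior_h / (1.0 - prior_h)
    return 1.0 / (score * prior_odds + 1.0)


# Hypothetical values: an uninformative score leaves the prior unchanged,
# while a large score with a small prior can still leave real doubt.
print(posterior_not_homologous(1.0, 0.5))
print(posterior_not_homologous(1000.0, 0.001))
```

The second call shows why the prior matters in database searches, where true homologs are rare among the sequences compared.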
BALSA Availability
http://bayesweb.wadsworth.org/balsa/balsa.html
Conclusions on Pairwise Sequence Alignment
Choice of algorithm is based on need.
Trade-off between sensitivity and speed.
SCOP40 sensitivity at 1% EPQ (errors per query), listed from fastest to most sensitive:
  BLAST      14.8%
  FASTA      16.7%
  SSEARCH    18.4%
  BALSA w/1  19.2%
  BALSA w/4  19.8%
Multiple Sequence Alignment
Multiple sequence alignment is generally concerned with finding structural or functional patterns between sequences:
  Develop relationships and phylogenies
  Determine a consensus sequence
  Build gene families
  Model protein structure for threading and fold prediction
Motivation: Example Motif Discovery
Motif -> Scoring Matrix
Motif Finding Programs (search against a database of motifs)
GCG SEQWEB programs: http://pmgm.stanford.edu
  STRINGSEARCH
  FINDPATTERNS
  MOTIFS
PROSITE web programs
  PROSITE: http://www.expasy.ch/prosite
  PROSITE SCAN: http://www.expasy.ch/tools/scanprosite
emotif web programs
  emotif: http://motif.stanford.edu/emotif
  emotif-search: http://motif.stanford.edu/emotif-search
  emotif-scan: http://motif.stanford.edu/emotif-scan
3MOTIF: http://3motif.stanford.edu
Multiple Sequence Alignment Challenges
Computational complexity: O(n^k) for k sequences of length n.
Space requirements: O(n^k) for k sequences of length n.
Sequence clusters require a weighting function.
Weighted alignments tend to overweight erroneous sequences.
Approximations must be used for real-world data:
  Linked lists are used to find exact words shared between k sequences.
  BLAST can find inexact shared words between k sequences.
  FASTA can be used to do progressive pairwise alignments.
  Gibbs sampling finds the best overall alignment stochastically.
The final alignment often depends on the order in which the data are presented.
Gaps make alignments unnaturally long.
Multiple Alignment
Multiple Sequence Aligner (MSA): builds a linked list of words.
GenAlign: iteratively adds sequences.
ClustalW: progressively adds sequences using clusters (a dendrogram).
Gibbs Sampling: generates a random alignment of size k, then iteratively samples and updates the alignment until convergence.
ClustalW Step 1
Generate all pairwise alignments.
Generate a dendrogram from the alignment scores.
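The step from pairwise scores to a dendrogram can be sketched with a simple clustering loop. The code below swaps in two simplifications that should not be mistaken for ClustalW's own method: the distance is the ungapped fraction of mismatched positions rather than a score from a real pairwise alignment, and the tree is built by greedy average-linkage merging, whereas ClustalW builds its guide tree by neighbor-joining. All names are hypothetical.

```python
def identity_distance(a: str, b: str) -> float:
    """Fraction of mismatched positions over the shorter length (crude stand-in)."""
    n = min(len(a), len(b))
    return 1.0 - sum(x == y for x, y in zip(a, b)) / n


def guide_tree(seqs):
    """Greedy average-linkage clustering; returns a nested tuple of indices."""
    members = {i: [i] for i in range(len(seqs))}  # cluster id -> sequence indices
    trees = {i: i for i in range(len(seqs))}      # cluster id -> subtree
    next_id = len(seqs)

    def average_distance(p, q):
        pairs = [(i, j) for i in members[p] for j in members[q]]
        return sum(identity_distance(seqs[i], seqs[j]) for i, j in pairs) / len(pairs)

    while len(members) > 1:
        # Merge the closest pair of clusters (ties broken by cluster id).
        _, p, q = min((average_distance(p, q), p, q)
                      for p in members for q in members if p < q)
        members[next_id] = members.pop(p) + members.pop(q)
        trees[next_id] = (trees.pop(p), trees.pop(q))
        next_id += 1
    return next(iter(trees.values()))


print(guide_tree(["AAAA", "AAAT", "GGGG", "GGGC"]))
```

The nested tuple gives the order of progressive alignment: the innermost pairs are aligned first, and their alignments are then combined.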
ClustalW Step 2
Align the most similar pair.
Align the next most similar pair.
Combine the two alignments.
ClustalW General Approach
ClustalW Availability
http://www.ebi.ac.uk/clustalw/
http://bioweb.pasteur.fr/seqanal/interfaces/clustalw.html
http://www.ch.embnet.org/software/clustalw.html
Gibbs Sampling
Traditional Gibbs sampling:
1. Sample an alignment given the parameters: $P(A \mid \Theta, R)$
2. Sample the parameters given an alignment: $P(\Theta \mid A, R)$
Phylogenetic Footprinting
Find DNA functional elements and signals in the non-coding region surrounding a gene.
Transcription Regulation
Gene Transcription and Regulation
Transcription is initiated by RNA polymerase binding.
Enhancers and repressors: binding of transcription factors inhibits or enhances expression.
[Figure: RNA polymerase at the promoter region, 5' to 3', upstream of the start codon AUG]
Example: Corepressor
[Figure: transcription in process vs. transcription inhibited]
Motif Alignment Model
[Figure: k sequences with motif start positions $a_1, a_2, \ldots, a_k$; motif width = w; sequence k has length $n_k$]
The missing data is the alignment variable: $A = \{a_1, a_2, \ldots, a_k\}$
Alignment: the starting positions of the binding sites in each sequence.
A priori, all positions are equally likely.
The final alignment depends on the DNA sequence.
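Given an alignment variable A, the motif model is just a table of residue counts at each of the w motif positions. The sketch below assumes 0-based start positions and a DNA alphabet; the function name and the toy sequences are hypothetical.

```python
def motif_counts(seqs, starts, w, alphabet="ACGT"):
    """Count residues at each of the w motif positions.

    seqs   -- list of sequences
    starts -- 0-based start position of the motif in each sequence (A)
    w      -- motif width
    """
    counts = [{a: 0 for a in alphabet} for _ in range(w)]
    for seq, a in zip(seqs, starts):
        for offset in range(w):
            counts[offset][seq[a + offset]] += 1
    return counts


# Toy example with the site GACA placed at a different offset in each sequence.
sequences = ["TTGACATT", "GACAGGGG", "CCCGACA"]
for column in motif_counts(sequences, starts=[2, 0, 3], w=4):
    print(column)
```

Normalizing each column of counts (usually with pseudocounts) gives the position-specific probabilities that the sampler uses to score candidate sites.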
Gibbs Sampler Algorithm
Initialize by choosing random starting positions $a_1^{(0)}, a_2^{(0)}, \ldots, a_k^{(0)}$.
Iterate the following steps many times:
  Randomly or systematically choose a sequence, say sequence k, to exclude.
  Carry out the predictive-updating step to update $a_k$.
Stop when changes are infrequent, or when some other criterion is met.
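The loop above can be sketched as a basic site sampler. The code makes several assumptions that are not on the slide: exactly one site per sequence, a uniform background, a flat pseudocount, random choice of the excluded sequence, and a fixed iteration count in place of a convergence test. The function name and parameters are hypothetical.

```python
import random


def gibbs_motif_sampler(seqs, w, iterations=200, alphabet="ACGT",
                        pseudocount=1.0, seed=0):
    """Return one sampled start position per sequence for a width-w motif."""
    rng = random.Random(seed)
    # Initialize with random starting positions.
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    background = 1.0 / len(alphabet)
    for _ in range(iterations):
        k = rng.randrange(len(seqs))  # sequence to exclude
        # Build the motif model from every sequence except k.
        counts = [{a: pseudocount for a in alphabet} for _ in range(w)]
        for idx, (s, a) in enumerate(zip(seqs, starts)):
            if idx == k:
                continue
            for offset in range(w):
                counts[offset][s[a + offset]] += 1
        total = (len(seqs) - 1) + pseudocount * len(alphabet)
        # Predictive update: weight each candidate start in sequence k by
        # the ratio of motif probability to background probability.
        weights = []
        for pos in range(len(seqs[k]) - w + 1):
            ratio = 1.0
            for offset in range(w):
                ratio *= (counts[offset][seqs[k][pos + offset]] / total) / background
            weights.append(ratio)
        starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


sequences = ["TTGACATT", "GACAGGGG", "CCCGACAC", "ATGACATA"]
print(gibbs_motif_sampler(sequences, w=4))
```

Because the new position is sampled rather than chosen greedily, the sampler can escape poor starting alignments; in practice several runs from different seeds are compared.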
Gibbs Sampler Availability
http://bayesweb.wadsworth.org/gibbs/gibbs.html
http://www.bioinformatics.ubc.ca/resources/tools/index.php?name=gibbs
Conclusions
Sequence analysis is the most commonly performed task in bioinformatics.
The choice of algorithm depends on needs:
  Pairwise
    Homology detection
  Multiple
    Motif detection
    Building gene families
    Phylogenetic footprinting
The future is in whole-genome comparisons.
Other Sources of Information
http://www.ncbi.nlm.nih.gov/blast/ (extensive tutorials)
Bioinformatics books
  Biological Sequence Analysis (Durbin et al.)
  Bioinformatics: The Machine Learning Approach (Baldi & Brunak)
  Computational Molecular Biology (Pevzner)
Journal articles
  Altschul et al. Journal of Molecular Biology, 1990. 215; p403-410 (original BLAST paper)
  Altschul et al. Nucleic Acids Research, 1997. 25; p3389-3402 (PSI-BLAST paper)
  McCue et al. Nucleic Acids Research, 2001. 29; p774-782 (phylogenetic footprinting)