Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004


1 Protein & DNA Sequence Analysis Bobbie-Jo Webb-Robertson May 3, 2004

2 Sequence Analysis
Anything connected to extracting higher biological meaning from raw sequence data.

3 Genomic & Proteomic Data
Sequence alignment was one of the first bioinformatics techniques (~1970); it pre-dates high-throughput sequencing techniques.
[Figure: GenBank and proteomic data volume; axes: Years vs. PetaBytes]

4 Motivation: Structure Prediction Challenge

5 Guiding Principle
The basic guiding principle is EVOLUTION.
Regions of a DNA/protein sequence that are functional units are less tolerant of mutation, so they tend to be conserved.
Examples: Myoglobin vs. Hemoglobin; Zinc Finger

6 Outline
Pairwise Sequence Alignment
  Dynamic Programming
  Heuristic
  Bayesian
Multiple Sequence Alignment
  Heuristic
  Statistics (Gibbs Sampling)
Conclusions

7 Pairwise Sequence Alignment
One of the most commonly performed tasks in bioinformatics.
A method to compare two sequences and make inferences about the relationship between them.
Homologs = two molecules that share a common ancestor.

8 Searching for Homology
Query (unknown function/structure):
>d1npx_ ( ) NADH peroxidase
IPGKDLDNIYLMRGRQWAIKLKQKTVDPEVNNVVVIGSGYIGIEAAEAFAKAGKKVTVID
ILDRPlGVYLDKEFTDVLTEEMEANNITIATGETVERYEGDGRVQKVVTDKNAYDADLVV
VAV
Target (known function/structure):
>d3lada ( ) Dihydrolipoamide dehydrogenase
PAPVDQDVIVDSTGALDFQNVPGKLGVIGAGVIGLELGSVWARLGAEVTVLEAMDKFLPA
VDEQVAKEAQKILTKQGLKILLGARVTGTEVKNKQVTVKFVDAEGEKSQAFDKLIVAVG

9 Pairwise Sequence Alignment
One-to-one correspondence between the residues of two sequences:
$R^{(1)} = \{R^{(1)}_1, \ldots, R^{(1)}_I\} = \{\text{HEAGAWGHE}\}$
$R^{(2)} = \{R^{(2)}_1, \ldots, R^{(2)}_J\} = \{\text{GHEE}\}$
Global: HEAGAWGHE aligned with GHEE
Local: GHE aligned with GHE
Almost all alignment is done at the local level.

10 Why Sequence Alignment Algorithms?
There are a huge number of possible alignments:
  length ,000 alignments
  length billion
Alignment algorithms reduce this exponential search to $O(n^2)$, where n is the length of the longest sequence.
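
To make the exponential growth concrete, here is a minimal sketch that counts global alignments of two sequences. It assumes one particular counting convention (every column is a residue pair, a gap in sequence 1, or a gap in sequence 2, and different gap orderings count as different alignments); the counts on the slide may have used a different convention. The function name `count_alignments` is illustrative.

```python
def count_alignments(m: int, n: int) -> int:
    """Count global alignments of sequences of lengths m and n.

    Assumed convention: each alignment column is either a residue pair,
    a gap in sequence 1, or a gap in sequence 2.
    """
    # table[i][j] = number of alignments of the first i and first j residues.
    # Aligning anything against an empty prefix can be done in exactly one way.
    table = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = (
                table[i - 1][j - 1]  # last column pairs residue i with residue j
                + table[i - 1][j]    # last column pairs residue i with a gap
                + table[i][j - 1]    # last column pairs a gap with residue j
            )
    return table[m][n]


if __name__ == "__main__":
    for length in (1, 2, 3, 10, 20):
        print(length, count_alignments(length, length))
```

The table itself is filled in O(mn) steps, which is the same trick dynamic programming uses to avoid enumerating the alignments one by one.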

11 Evolution (mutation)
Coded for at the DNA level.
Captured in 2 parameters:
  Scoring matrices
  Gap penalties

12 Scoring Matrices
Characterize the probability that one residue was substituted for another (log odds-ratio).
[Table: BLOSUM substitution matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V; values not recoverable]
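
A log odds-ratio score compares how often two residues are seen aligned in related sequences with how often they would pair by chance. The sketch below shows the arithmetic only; the frequencies are made-up illustrative numbers, not values from BLOSUM, and `log_odds_score` is a hypothetical helper name.

```python
import math


def log_odds_score(p_ab: float, q_a: float, q_b: float, base: float = 2.0) -> float:
    """Log odds-ratio for aligning residues a and b.

    p_ab: probability of seeing a aligned with b in related sequences.
    q_a, q_b: background frequencies of a and b.
    """
    return math.log(p_ab / (q_a * q_b), base)


if __name__ == "__main__":
    # Illustrative numbers only: a pair seen twice as often as chance scores +1 bit,
    # a pair seen half as often as chance scores -1 bit.
    print(log_odds_score(0.02, 0.1, 0.1))
    print(log_odds_score(0.005, 0.1, 0.1))
```

Published matrices typically scale and round these values to integers, which is why their entries are whole numbers.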

13 Gap Penalties
Characterize the probability that a residue was inserted or deleted.
Linear: $\gamma(g) = -g d$
Affine: $\gamma(g) = -d - (g - 1) e$
where g is the gap length, d = gap opening penalty, and e = gap extension penalty.
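
The two gap models can be written as small functions. This is a sketch assuming the sign convention in which a gap contributes a negative amount to the score; the penalty values in the example are illustrative, not taken from the slides.

```python
def linear_gap(g: int, d: float) -> float:
    """Linear gap score: every gapped position costs d."""
    return -g * d


def affine_gap(g: int, d: float, e: float) -> float:
    """Affine gap score: opening costs d, each further position costs e."""
    return -d - (g - 1) * e


if __name__ == "__main__":
    # With e < d, one long gap is cheaper under the affine model than under
    # a linear model that charges the opening penalty at every position.
    print(linear_gap(3, 8))
    print(affine_gap(3, 12, 2))
```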

14 Typical Output
Sequence alignment: one marker means that two residues are identical; ":" means that two residues have similar physicochemical properties.
Typical objective function:
$\text{Score} = \sum_{\text{start}}^{\text{end}} \text{ScoreMatrix} - \sum_{\text{start}}^{\text{end}} \text{GapPenalties}$
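
As a rough illustration of this objective function, the sketch below scores an alignment that has already been written out as two gapped rows. It assumes a toy match/mismatch score in place of a real scoring matrix and an affine gap penalty; all parameter values and the name `score_alignment` are illustrative.

```python
def score_alignment(row1: str, row2: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = 2, gap_extend: int = 1) -> int:
    """Sum substitution scores and subtract gap penalties along an alignment."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    score = 0
    open_gap_row = None  # which row (1 or 2) currently has an open gap, if any
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            gap_row = 1 if a == "-" else 2
            # Continuing a gap in the same row is an extension; otherwise an opening.
            score -= gap_extend if open_gap_row == gap_row else gap_open
            open_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            open_gap_row = None
    return score


if __name__ == "__main__":
    print(score_alignment("GHE", "GHE"))
    print(score_alignment("GA--TC", "GATTTC"))
```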

15 Pairwise Sequence Alignment Algorithms
Exhaustive: Dynamic Programming
  Needleman-Wunsch (1970)
  Smith-Waterman (1981)
Approximate: Heuristic Methods
  BLAST (1990), PSI-BLAST (1997)
  FASTA (1998)
Statistical: Bayes
  Bayes Block Aligner (1998)
  BALSA (2002)

16 Dynamic Programming
Optimization method that uses sequential decisions to solve the problem.
Global: Needleman-Wunsch
Local: Smith-Waterman
Almost all alignment is done at the local level.

17 Dynamic Programming Algorithms
The optimal alignment, $A^*$, is found by fixing the scoring matrix, $\Theta_0$, and the gap penalties, $\Lambda_0$, and maximizing the log-likelihood:
$\log P(R^{(1)}, R^{(2)}, A^* \mid \Theta_0, \Lambda_0) = \max_A \{ \log P(R^{(1)}, R^{(2)} \mid A, \Theta_0) + \log P(A \mid \Lambda_0) \}$
Scoring matrix: $s(R^{(1)}_i, R^{(2)}_j)$
Gap penalties: d = gap opening penalty, e = gap extension penalty
Alignment: $A_{i,j} = 1$ if $R^{(1)}_i$ is aligned with $R^{(2)}_j$, and $A_{i,j} = 0$ otherwise.

18 Standard Dynamic Programming Choices
Match: $R^{(1)}_i$ aligned with $R^{(2)}_j$
Insertion into Sequence 1: $R^{(1)}_i$ aligned with a gap (-)
Deletion from Sequence 1: a gap (-) aligned with $R^{(2)}_j$
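
The three choices map directly onto the three terms of a global (Needleman-Wunsch style) recursion. The sketch below computes only the optimal global score; it assumes a toy match/mismatch score and a linear gap penalty, and the function name is illustrative.

```python
def needleman_wunsch_score(seq1: str, seq2: str, match: int = 1,
                           mismatch: int = -1, gap: int = 1) -> int:
    """Optimal global alignment score with a linear gap penalty."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    F = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        F[i][0] = -i * gap
    for j in range(1, cols):
        F[0][j] = -j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # residue i aligned with residue j
                F[i - 1][j] - gap,    # residue i aligned with a gap
                F[i][j - 1] - gap,    # a gap aligned with residue j
            )
    return F[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GCG", "GAG"))
    print(needleman_wunsch_score("ACGT", "AGT"))
```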

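19 Algorithm
Smith-Waterman with a simple gap penalty:
$F(i, j) = \max \{ F(i-1, j-1) + s(R^{(1)}_i, R^{(2)}_j),\; F(i-1, j) - d,\; F(i, j-1) - d,\; 0 \}$
where s is the scoring matrix and d is the gap penalty.
[Figure: worked example filling the matrix for GCG against GAG, with annotated step scores such as (+1), (-1), and (0); cell values not recoverable]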
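
The recursion, together with the traceback that reconstructs the alignment, can be sketched as follows. It assumes a toy match/mismatch score of +1/-1 and a linear gap penalty of 1 rather than a real scoring matrix; `smith_waterman` is an illustrative name.

```python
def smith_waterman(seq1: str, seq2: str, match: int = 1, mismatch: int = -1,
                   gap: int = 1):
    """Return (best score, aligned part of seq1, aligned part of seq2)."""

    def s(a: str, b: str) -> int:
        return match if a == b else mismatch

    rows, cols = len(seq1) + 1, len(seq2) + 1
    F = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + s(seq1[i - 1], seq2[j - 1]),
                F[i - 1][j] - gap,
                F[i][j - 1] - gap,
                0,  # a local alignment may start fresh at any cell
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback: walk back from the best cell until the score drops to 0.
    i, j = best_pos
    out1, out2 = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(seq1[i - 1], seq2[j - 1]):
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - gap:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(out1)), "".join(reversed(out2))


if __name__ == "__main__":
    print(smith_waterman("HEAGAWGHE", "GHEE"))
```

With these toy scores, the HEAGAWGHE and GHEE pair from the earlier slide yields the local alignment GHE against GHE.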

20 Reconstructing the Alignment

21 Smith-Waterman Availability
affrc.go.jp/htdocs/swsrch/
ac.uk/mpsrch/
tware/seqaln/seqaln-query.html

22 Results from SWsrch with hemoglobin
Title: >hemoglobin
mvlsaddktnikncwgkigghggeygeealqrmfaafpttktyfshidvspgsaqvkahg
kkvadalakaadhvedlpgalstlsdlhahklrvdpvnfkflshcllvtlachhpgdftp
amhasldkflasvstvltskyr
Perfect Score: 750
Sequence: 1 MVLSADDKTNIKNCWGKIGG...HASLDKFLASVSTVLTSKYR 142
SUMMARIES
Columns: Result No., Score, % Query Match, Length, DB, ID, Description (numeric columns not recoverable)
HART1(I54239;I68531;A26903;A93047;A90284;A90285;A02268) hemoglobin alpha-1 chain - rat &RATHBAM_1(M17083 pid:
(P01946) Hemoglobin alpha-1 and alpha-2 c
RN2A1GL_1(X56325 pid:g ) R.norvegicus
AK003077_1(AK pid:none) Mus musculus a
HAMS(A90791;I49720;A45964;I49722;I49721;B43560;A92945;) hemoglobin alpha chains - mouse &MMAGL1_1(V00714 pid:
AK011076_1(AK pid:none) Mus musculus
(P01942) Hemoglobin alpha chain
(P20854) Hemoglobin alpha chain. &HARTNG
L75940_1(L75940 pid:none) Mus musculus alpha
AK010422_1(AK pid:none) Mus musculus E
HASL1W(S10481)hemoglobin alpha-i chain - Wed
(P01938) Hemoglobin alpha chain. &HALRN(
(P09420) Hemoglobin alpha chain. &A
(P01930) Hemoglobin alpha chain. &HAMQB(
(P18969) Hemoglobin alpha chain. &HAFQL(
(P01974) Hemoglobin alpha chain. &HACMA(
HAMN2F(S11533)hemoglobin alpha-ii chain - do
(P01945) Hemoglobin alpha chain. &HAHY(A
(P15163) Hemoglobin alpha-i and alpha-ii
(P01928) Hemoglobin alpha chain. &HAMQA(

23 Results from SWsrch with hemoglobin cont.
RESULT 9
>L75940_1(L75940 pid:none) Mus musculus alpha-globin mrna, complete cds. &MUSALGL_1(L75940 pid:g )
Query Match 83.1%; Score 623; DB 1; Length 142;
Best Local Similarity 83.8%; Pred. No. 1.80e-63;
Matches 119; Conservative 7; Mismatches 16; Indels 0; Gaps 0;
Inserts 0; InsGaps 0; Deletes 0; DelGaps 0;
Db 1 mvlsgedksnikaawgkigghgaeyvaealermfasfpttktyfphfdvshgsaqvkghg
Qy 1 mvlsaddktnikncwgkigghggeygeealqrmfaafpttktyfshidvspgsaqvkahg 60
Db 61 kkvadalasaaghlddlpgalsalsdlhahklrvdpvnfkllshcllvtlashhpadftp
Qy 61 kkvadalakaadhvedlpgalstlsdlhahklrvdpvnfkflshcllvtlachhpgdftp 120
Db 121 avhasldkflasvstvltskyr
Qy 121 amhasldkflasvstvltskyr

24 BLAST (Basic Local Alignment Search Tool)
Indexes the database.
Calculates the neighborhood of each word in the query using a scoring matrix and a probability threshold.
Looks up all words and neighbors from the query in the database index.
Extends High-scoring Segment Pairs (HSPs) left and right to maximal length.
Finds maximal segment pairs (MSPs) between query and database.
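
The indexing, neighborhood, and lookup steps can be sketched on a toy alphabet. This is not BLAST's implementation: it assumes a simple match/mismatch score instead of a substitution matrix, a plain score threshold, and a brute-force enumeration of candidate words; all function names are illustrative. The extension step is left out.

```python
from itertools import product


def word_score(w1: str, w2: str, match: int = 2, mismatch: int = -1) -> int:
    """Ungapped score of two equal-length words (toy scoring)."""
    return sum(match if a == b else mismatch for a, b in zip(w1, w2))


def neighborhood(word: str, alphabet: str, threshold: int) -> set:
    """All words of the same length that score at least `threshold` against `word`."""
    candidates = ("".join(chars) for chars in product(alphabet, repeat=len(word)))
    return {cand for cand in candidates if word_score(word, cand) >= threshold}


def index_database(db_seq: str, k: int) -> dict:
    """Map every length-k word in the database sequence to its start positions."""
    index = {}
    for pos in range(len(db_seq) - k + 1):
        index.setdefault(db_seq[pos:pos + k], []).append(pos)
    return index


def find_seeds(query: str, db_seq: str, k: int, alphabet: str, threshold: int) -> list:
    """Return sorted (query position, database position) pairs of word hits."""
    index = index_database(db_seq, k)
    seeds = []
    for qpos in range(len(query) - k + 1):
        for neighbor in neighborhood(query[qpos:qpos + k], alphabet, threshold):
            for dpos in index.get(neighbor, []):
                seeds.append((qpos, dpos))
    return sorted(seeds)


if __name__ == "__main__":
    print(sorted(neighborhood("ABC", "ABC", 3)))
    print(find_seeds("ABC", "CABCA", 3, "ABC", 3))
```

Each seed returned here would then be extended left and right into an HSP.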

25 BLAST database search

26 PSI-BLAST (Position-Specific Iterative BLAST)
A profile (position-specific scoring matrix, PSSM) is constructed from a multiple alignment of the highest-scoring hits in a BLAST search.
The PSSM is generated by calculating position-specific scores for each position in the alignment.
The profile is used to perform a second (etc.) BLAST search, and the results of each "iteration" are used to refine the profile.
This iterative searching strategy results in increased sensitivity.
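
A position-specific scoring matrix can be sketched as per-column log-odds scores. The version below is a simplification, not PSI-BLAST's procedure: it assumes a gap-free alignment block, a uniform background unless one is supplied, a fixed pseudocount, and no sequence weighting. Function names are illustrative.

```python
import math


def build_pssm(aligned_seqs, alphabet, background=None, pseudocount=1.0):
    """Return a list (one dict per column) of log2 odds scores per residue."""
    if background is None:
        # Assumption: uniform background frequencies.
        background = {c: 1.0 / len(alphabet) for c in alphabet}
    n_seqs = len(aligned_seqs)
    length = len(aligned_seqs[0])
    total = n_seqs + pseudocount * len(alphabet)
    pssm = []
    for pos in range(length):
        column = [seq[pos] for seq in aligned_seqs]
        scores = {}
        for c in alphabet:
            freq = (column.count(c) + pseudocount) / total
            scores[c] = math.log2(freq / background[c])
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment of the same length as the profile."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


if __name__ == "__main__":
    profile = build_pssm(["ACG", "ACG", "ACT"], "ACGT")
    print(score_with_pssm(profile, "ACG"), score_with_pssm(profile, "TTT"))
```

In an iterative search, the profile would be rebuilt from the new hits after each round and used as the query for the next.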

27 Types of BLAST
BLASTP: search a protein sequence against a protein database.
BLASTN: search a nucleotide sequence against a nucleotide database.
TBLASTN: search a protein sequence against a nucleotide database, by translating each database nucleotide sequence in all 6 reading frames.
BLASTX: search a nucleotide sequence against a protein database, by first translating the query nucleotide sequence in all 6 reading frames.
PSI-BLAST: profile generated from identified homologs, used, and iteratively updated (especially good for identifying remote homologies).
PHI-BLAST: enforces the presence of a motif in addition to the usual PSI-BLAST criteria for matching.

28 Assessing Evidence for Homology
Is the score higher than expected from 2 random sequences (non-homologs)?
BLAST scores are not independent of the length of the sequences being aligned.
Fit an extreme value distribution to randomly shuffled sequences:
  BLAST returns the maximum score.
  The maximum of a large number of i.i.d. random variables tends to an extreme value distribution.
Expected number of HSPs with score at least S for 2 sequences of lengths m and n:
$E = K m n e^{-\lambda S}$
A similar approach is used for Smith-Waterman.
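
The expected-count formula is easy to evaluate directly. The sketch below also converts a raw score to a bit score and an expected count to a p-value, which are standard companions to the formula but are not stated on the slide; the K and lambda values in the example are placeholders, not fitted parameters.

```python
import math


def expected_hsps(score: float, m: int, n: int, K: float, lam: float) -> float:
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * score)


def bit_score(score: float, K: float, lam: float) -> float:
    """Normalized score S' = (lambda * S - ln K) / ln 2, so that E = m * n * 2**(-S')."""
    return (lam * score - math.log(K)) / math.log(2)


def p_value(e_value: float) -> float:
    """Probability of at least one HSP at this score, assuming Poisson-distributed counts."""
    return 1.0 - math.exp(-e_value)


if __name__ == "__main__":
    K, lam = 0.1, 0.3  # placeholder parameters for illustration only
    e = expected_hsps(50, 100, 1000, K, lam)
    print(e, bit_score(50, K, lam), p_value(e))
```

Because E scales with m times n, the same raw score is less significant against a larger database, which is the length dependence noted above.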

29 BLAST Availability
h.gov/education/blaSTinfo/information3.html
BLAST.html

30 Results from BLAST with hemoglobin
[Figure: distribution of 100 BLAST hits on the query sequence]

31 Results from BLAST cont.
Sequences producing significant alignments: (Score bits) E-Value
gi ref NP_ hemoglobin, alpha 1 [Rattus nor e-55
gi ref XP_ similar to hemoglobin alpha ch e-54
gi dbj BAB unnamed protein product [Mus mu e-46
gi dbj BAB unnamed protein product [Mus mu e-46
gi dbj BAC unnamed protein product [Mus mu e-46
gi dbj BAB unnamed protein product [Mus mu e-45
gi sp P11755 HBA1_TADBR Hemoglobin alpha-1 chain >gi e-45
gi ref NP_ hemoglobin alpha, adult chain e-45
gi gb AAB alpha-globin [Mus musculus] 182 2e-45
gi sp P14387 HBA_ANTPA Hemoglobin alpha chain >gi e-45
gi pir HASL1W hemoglobin alpha-i chain - Weddell seal 180 7e-45
gi sp P18969 HBA_AILFU Hemoglobin alpha chain >gi e-44
gi pir HASHR2 hemoglobin alpha-ii chain - aoudad (t e-44
gi sp P01950 HBA_SUNMU Hemoglobin alpha chain >gi e-44
gi sp P26915 HBA_NASNA Hemoglobin alpha chain >gi e-44
gi sp Q9XSN3 HBA1_EQUBU Hemoglobin alpha-1 chain > e-44

32 Results from BLAST cont.
>gi sp P14387 HBA_ANTPA Hemoglobin alpha chain gi pir A29702 hemoglobin alpha chain - pallid bat
Length = 141
Score = 180 bits (457), Expect = 6e-45
Identities = 96/141 (68%), Positives = 104/141 (73%)
Query: 2 VLSADDKTNIKNCWXXXXXXXXXXXXXALQRMFAAFPTTKTYFSHIDVSPGSAQVKAHGX 61
VLS DKTN+K W AL+RMF +FPTTKTYF H D+SGSAQVK HG
Sbjct: 1 VLSPADKTNVKAAWDKVGGHAGDYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK 60
Query: 62 XXXXXXXXXXXHVEDLPGALSTLSDLHAHKLRVDPVNFKFLSHCLLVTLACHHPGDFTPA 121
H++DLPGALS LSDLHA+KLRVDPVNFK LSHCLLVTLACHHPGDFTPA
Sbjct: 61 KVGDALGNAVAHMDDLPGALSALSDLHAYKLRVDPVNFKLLSHCLLVTLACHHPGDFTPA 120
Query: 122 MHASLDKFLASVSTVLTSKYR 142
+HASLDKFLASVSTVL SKYR
Sbjct: 121 VHASLDKFLASVSTVLVSKYR

33 BALSA (Bayesian Algorithm for Local Sequence Alignment)
Smith-Waterman recursion with sums.
Formulates sequence alignment as a Bayesian inference problem:
  Everything is a random variable.
  Allows multiple scoring matrices.
  Makes inferences on the parameters.

34 BALSA Methodology
Joint = likelihood * priors:
$P(R^{(1)}, R^{(2)}, A, \Theta, \Lambda) = P(R^{(1)}, R^{(2)} \mid A, \Theta)\, P(A \mid \Lambda)\, P(\Theta, \Lambda)$
Priors:
$P(\Theta, \Lambda) = \frac{1}{N_{\Theta, \Lambda}}$
A priori:
$P(A \mid \Lambda) = P(A \mid \lambda_o, \lambda_e) = \frac{\lambda_o^{k_g(A)} \lambda_e^{l_g(A) - k_g(A)}}{\sum_{A} \lambda_o^{k_g(A)} \lambda_e^{l_g(A) - k_g(A)}}$
Algorithm:
$P(R^{(1)}, R^{(2)} \mid \Theta, \Lambda) = \sum_{A} P(R^{(1)}, R^{(2)} \mid A, \Theta)\, P(A \mid \Lambda) = \frac{\sum_{A} P(R^{(1)}, R^{(2)} \mid A, \Theta)\, \lambda_o^{k_g(A)} \lambda_e^{l_g(A) - k_g(A)}}{\sum_{A} \lambda_o^{k_g(A)} \lambda_e^{l_g(A) - k_g(A)}}$

35 Assessing Evidence for Homology
The scores are independent of the length of the sequences being aligned.
$\text{Score} = \frac{P(R^{(1)}, R^{(2)} \mid H)}{P(R^{(1)})\, P(R^{(2)})}$
The probability of not being homologous can be calculated directly from the score:
$P(\bar{H} \mid R^{(1)}, R^{(2)}) = \left[ 1 + \text{Score} \cdot \frac{P(H)}{P(\bar{H})} \right]^{-1}$
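
Turning the score into a posterior probability is a direct application of Bayes' rule. The sketch below assumes the score is the ratio of the likelihood under homology to the likelihood of two independent sequences, and that a prior probability of homology is supplied by the user; the function names and the example prior are illustrative.

```python
def posterior_homolog(score: float, prior_h: float) -> float:
    """P(H | R1, R2) from the likelihood-ratio score and the prior P(H)."""
    return score * prior_h / (score * prior_h + (1.0 - prior_h))


def posterior_not_homolog(score: float, prior_h: float) -> float:
    """P(not H | R1, R2), the complement of the posterior above."""
    return 1.0 - posterior_homolog(score, prior_h)


if __name__ == "__main__":
    # A score of 1 carries no evidence, so the posterior equals the prior.
    print(posterior_homolog(1.0, 0.3))
    # A score of 9 with a 10% prior gives even posterior odds.
    print(posterior_homolog(9.0, 0.1))
```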

36 BALSA Availability
/balsa/balsa.html

37 Conclusions on Pairwise Sequence Alignment
Choice of algorithm is based on need: there is a trade-off between sensitivity and speed.
SCOP40 sensitivity at 1% EPQ:
  BLAST      14.8%
  FASTA      16.7%
  SSEARCH    18.4%
  BALSA w/1  19.2%
  BALSA w/4  19.8%
[Figure axis: Speed]

38 Outline
Pairwise Sequence Alignment
  Dynamic Programming
  Heuristic
  Bayesian
Multiple Sequence Alignment
  Heuristic
  Statistics (Gibbs Sampling)
Conclusions

39 Multiple Sequence Alignment
Multiple sequence alignment is generally concerned with finding structural or functional patterns among sequences:
  Develop relationships and phylogenies
  Determine a consensus sequence
  Build gene families
  Model protein structure for threading and fold prediction

40 Motivation: Example Motif Discovery

41 Motif -> Scoring Matrix

42 Motif Finding Programs (search against a database of motifs)
GCG SEQWEB programs: STRINGSEARCH, FINDPATTERNS, MOTIFS
PROSITE web programs: PROSITE, PROSITE SCAN
emotif web programs: emotif, emotif-search, emotif-scan
3MOTIF

43 Multiple Sequence Alignment Challenges
Computational complexity: O(n^k) for k sequences of length n.
Space requirements: O(n^k) for k sequences of length n.
Sequence clusters require a weighting function.
Weighted alignments tend to overweight erroneous sequences.
Approximations must be used for real-world data:
  Linked lists are used to find exact words shared between k sequences.
  BLAST can find inexact shared words between k sequences.
  FASTA can be used to do progressive pairwise alignments.
  Gibbs sampling can find the best overall alignment stochastically.
The final alignment often depends on the order in which the data are presented.
Gaps make alignments unnaturally long.

44 Multiple Alignment
Multiple Sequence Aligner (MSA): builds a linked list of words.
GenAlign: iteratively adds sequences.
ClustalW: progressively adds sequences using clusters (a dendrogram).
Gibbs Sampling: generates a random alignment of size k and iteratively samples and updates the alignment until convergence.

45 ClustalW Step 1
Generate all pairwise alignments.
Generate a dendrogram from the alignment scores.

46 ClustalW Step 2
Align the most similar pair.
Align the next most similar pair.
Combine the 2 alignments.
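
The order in which a progressive method joins sequences can be sketched with a small clustering routine. This is not ClustalW's algorithm: it assumes equal-length sequences, uses the fraction of mismatched positions as a stand-in for an alignment-based distance, and builds the tree by simple average linkage. All names are illustrative.

```python
from itertools import combinations


def mismatch_fraction(a: str, b: str) -> float:
    """Toy distance: fraction of positions that differ (equal lengths assumed)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_order(seqs):
    """Return the merge order as a list of (members_a, members_b) index tuples."""
    clusters = {i: (i,) for i in range(len(seqs))}  # cluster id -> member indices
    next_id = len(seqs)
    merges = []

    def average_distance(members_a, members_b):
        pairs = [(i, j) for i in members_a for j in members_b]
        return sum(mismatch_fraction(seqs[i], seqs[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the closest pair of clusters; ties go to the lowest cluster ids.
        id_a, id_b = min(
            combinations(sorted(clusters), 2),
            key=lambda ids: average_distance(clusters[ids[0]], clusters[ids[1]]),
        )
        merges.append((clusters[id_a], clusters[id_b]))
        clusters[next_id] = clusters.pop(id_a) + clusters.pop(id_b)
        next_id += 1
    return merges


if __name__ == "__main__":
    print(guide_order(["AAAA", "AAAT", "CCCC", "CCGG"]))
```

Each merge in the returned list corresponds to one alignment step: first the most similar pair, then the next most similar pair, and finally the combination of the two alignments.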

47 ClustalW General Approach

48 ClustalW Availability
clustalw/
r.fr/seqanal/interfaces/clustalw.html
et.org/software/clustalw.html

49 Gibbs Sampling
Traditional Gibbs sampling alternates two steps:
1. Sample an alignment given the parameters: $P(A \mid \Theta, R)$
2. Sample the parameters given an alignment: $P(\Theta \mid A, R)$

50 Phylogenetic Footprinting
Find DNA functional elements and signals in the non-coding region surrounding a gene.

51 Transcription Regulation
Gene transcription and regulation:
  Transcription is initiated by RNA polymerase binding.
  Enhancers and repressors.
  Binding of transcription factors inhibits or enhances expression.
[Figure labels: RNA polymerase, promoter region, start codon AUG, 5' and 3' ends]

52 Example: Corepressor
[Figure panels: Transcription in process; Transcription inhibited]

53 Motif Alignment Model
[Figure: k sequences with motif sites starting at $a_1, a_2, \ldots, a_k$; motif width = w; sequence k has length $n_k$]
The missing data is the alignment variable $A = \{a_1, a_2, \ldots, a_k\}$.
Alignment: the starting positions of the binding sites in each sequence.
A priori, all positions are equally likely.
The final alignment depends on the DNA sequence.

54 Gibbs Sampler Algorithm
Initialize by choosing random starting positions $a_1^{(0)}, a_2^{(0)}, \ldots, a_k^{(0)}$.
Iterate the following steps many times:
  Randomly or systematically choose a sequence, say sequence k, to exclude.
  Carry out the predictive-updating step to update $a_k$.
Stop when changes are infrequent or some other criterion is met.
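
The loop above can be sketched as a small site sampler. This is an illustrative simplification, not the distributed Gibbs sampler program: it assumes exactly one site per sequence, a fixed motif width, uppercase DNA over the alphabet ACGT, a fixed pseudocount, and a fixed number of sweeps in place of a convergence test. The function name and parameters are hypothetical.

```python
import random


def gibbs_motif_sampler(seqs, w, iterations=200, pseudocount=1.0,
                        alphabet="ACGT", seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Initialization: random starting positions a_1, ..., a_k.
    positions = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iterations):
        for z in range(len(seqs)):  # systematically exclude each sequence in turn
            # Motif counts and background counts from all other sequences.
            counts = [{c: pseudocount for c in alphabet} for _ in range(w)]
            background = {c: pseudocount for c in alphabet}
            for idx, (s, a) in enumerate(zip(seqs, positions)):
                if idx == z:
                    continue
                for offset in range(w):
                    counts[offset][s[a + offset]] += 1
                for pos, c in enumerate(s):
                    if not a <= pos < a + w:
                        background[c] += 1
            site_total = (len(seqs) - 1) + pseudocount * len(alphabet)
            bg_total = sum(background.values())
            # Predictive update: weight each candidate start in the excluded
            # sequence by its motif-to-background likelihood ratio, then sample.
            excluded = seqs[z]
            weights = []
            for a in range(len(excluded) - w + 1):
                ratio = 1.0
                for offset in range(w):
                    c = excluded[a + offset]
                    ratio *= (counts[offset][c] / site_total) / (background[c] / bg_total)
                weights.append(ratio)
            positions[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions


if __name__ == "__main__":
    toy = ["TTGACGTCATT", "CCCTGACGTCA", "TGACGTCAGGG", "AATGACGTCAA"]
    starts = gibbs_motif_sampler(toy, w=8)
    print([s[a:a + 8] for s, a in zip(toy, starts)])
```

Because the update samples rather than maximizes, repeated runs with different seeds can return different alignments; in practice several runs are compared.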

55 Gibbs Sampler Availability
adsworth.org/gibbs/gibbs.html
matics.ubc.ca/resources/tools/index.php?name=gibbs

56 Conclusions
Sequence analysis is the most commonly performed task in bioinformatics.
The choice of algorithm depends on needs:
  Pairwise: homology detection
  Multiple: motif detection, building gene families, phylogenetic footprinting
The future is in whole-genome comparisons.

57 Other Sources of Information
Extensive tutorials
Bioinformatics books:
  Biological Sequence Analysis (Durbin et al.)
  Bioinformatics: The Machine Learning Approach (Baldi & Brunak)
  Computational Molecular Biology (Pevzner)
Journal articles:
  Altschul et al. Journal of Molecular Biology, ; p (original BLAST paper)
  Altschul et al. Nucleic Acids Research, ; p (PSI-BLAST paper)
  McCue et al. Nucleic Acids Research, ; p (phylogenetic footprinting)