Pairwise Sequence Alignment carolin.kosiol@vetmeduni.ac.at SS 2013
Outline Pairwise sequence alignment global - Needleman Wunsch Gotoh algorithm local - Smith Waterman algorithm BLAST - heuristics
What is a Sequence Alignment? Quite simply, the comparison of two or more DNA or protein sequences to each other. The purpose of alignment is to highlight similarity between sequences. Alignment is the procedure of writing two (or more) sequences so that a maximum of identical or similar characters are placed in the same column, by inserting gaps ("-") where needed.
Word Alignment
Species 1: SOMEONE
Species 2: AWESOME

Species 1: - - - SOMEONE
Species 2: AWESOME - - -
Less trivial
Species 1: ACGTTAGA
Species 2: CGTTGAA

Scoring: gap = -1, match = 1

Species 1: - - - - - - - ACGTTAGA
Species 2: CGTTGAA - - - - - - - -
score: -15

Species 1: ACGTTAGA -
Species 2: - CGTT - GAA
score: 3
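The two scores can be checked by summing over alignment columns. The following is a minimal sketch: the function name is hypothetical, the mismatch score of 0 is an assumption (the slide only gives gap and match), and the second alignment is written with a trailing gap in Species 1 so that both rows have equal length, which is consistent with the stated score of 3.

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap=-1):
    """Sum the column scores of two gapped rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # column containing a gap
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # assumed value, not given on the slide
    return total


print(score_alignment("-------ACGTTAGA", "CGTTGAA--------"))  # -15
print(score_alignment("ACGTTAGA-", "-CGTT-GAA"))              # 3
```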
FASTA Format - Input
Standard input format for alignment programs

>Name1
ASEQUENCE1
>Name2 comments
SEQU CE2

Strictly speaking, should not contain gaps
FASTA Format - Output
Increasingly, multiple alignments are returned in a FASTA-like format

>Name1
ASEQUENCE1
>Name2 comments
-SEQU--CE2
etc...

The order of sequences in the output may differ from the input.
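Reading such a file takes only a few lines. This is a minimal sketch, not a full FASTA parser: the function names are hypothetical, and it assumes the header is everything after ">" with the name ending at the first whitespace.

```python
def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples."""
    records = []
    name, desc, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, desc, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            desc = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        records.append((name, desc, "".join(chunks)))
    return records


def ungap(sequence):
    """Remove gap characters, e.g. to reuse aligned output as input."""
    return sequence.replace("-", "")


aligned = """>Name1
ASEQUENCE1
>Name2 comments
-SEQU--CE2
"""
for name, desc, seq in parse_fasta(aligned):
    print(name, seq, ungap(seq))
```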
Relatedness of residues in same column
Making these alignments is EASY when we know where and which evolutionary events occurred. In practice we do not know this, and must infer it.
Quiz Which alignment (X, Y or Z) shows only residues related by substitution events in the same column?
Types of alignment methods
We cannot enumerate all possible alignments. Approaches are:
Dot matrix
Dynamic programming
Word-based or k-tuple methods (database searches)
Dot Matrix Given two
In a dot matrix we can identify:
Existing alignable parts of sequences
Possible indels
Duplicated sequences and repeats
Self-complementarity
Gene-order differences among genomes
Dot plots
a) A continuous main diagonal shows perfect similarity for symbols with the same indices.
b) Parallels to the main diagonal indicate repeated regions in the same reading direction on different parts of the sequences. In this case a region D is found twice in the sequence (D1, D2).
c) Lines perpendicular to the main diagonal indicate palindromic areas. In this case the sequence is completely palindromic in the displayed area.
d) Partially palindromic sequence. (For DNA sequences this refers to a perfect match of the normal strand with its reverse complement, which is frequently found in transposable elements.)
e) Bold blocks on the main diagonal indicate repetition of the same symbol in both sequences, e.g. (G)50, so-called microsatellite repeats.
f) Parallel lines indicate tandem repeats of a larger motif in both sequences, e.g. (AGCTCTGAC)20, so-called minisatellite patterns. The distance between the diagonals equals the length of the motif.
g) A discontinuous diagonal indicates that the sequences T1 and T2 share a common source. In text analyses this may point to plagiarism; in DNA analyses the sequences may be homologous because of a common ancestor. The number of interruptions increases with modifications to the text, or with the time of independent evolution and the mutation rate.
h) Indel sequences: these can often be observed for many different types of domains that were lost or substituted during evolution.
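A dot matrix of this kind can be sketched in a few lines. The sketch assumes the usual definition (a dot at position (i, j) wherever symbol i of the first sequence equals symbol j of the second); the function names are hypothetical, and no window or threshold filtering is applied.

```python
def dot_matrix(seq1, seq2):
    """matrix[i][j] is True where seq1[i] equals seq2[j]."""
    return [[a == b for b in seq2] for a in seq1]


def render(seq1, seq2):
    """Plain-text dot plot: '*' marks a match, '.' a mismatch."""
    lines = ["  " + " ".join(seq2)]
    for a, row in zip(seq1, dot_matrix(seq1, seq2)):
        lines.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# The shared word SOME shows up as a short diagonal run of '*'.
print(render("SOMEONE", "AWESOME"))
```

A sequence plotted against itself gives the continuous main diagonal of case a); a palindromic string such as "ABCBA" additionally fills the perpendicular diagonal of case c).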
Aligning a pair of sequences
gap = -15, match = +10, mismatch = 0
Aim: get from one corner to the other
Moves have a cost
Choose the cheapest way
Fill in the table
Trace the route backwards to find the alignment
Aligning a pair of sequences (Dynamic Programming)
Aim: get from one corner to the other
Moves have a cost
Choose the cheapest way
Fill in the table
Trace the route backwards to find the alignment

A G G G A - - G C
A G G G T T T G C
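The fill-in and trace-back steps can be sketched as a Needleman-Wunsch-style global alignment. This is a minimal sketch under stated assumptions: a linear gap penalty, default scores gap = -15, match = +10, mismatch = 0, a hypothetical function name, and an arbitrary tie-breaking order in the trace-back. With scores rather than costs, the "cheapest way" is the path with the highest total score.

```python
def global_align(a, b, match=10, mismatch=0, gap=-15):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    # (n+1) x (m+1) table; row 0 and column 0 represent leading gaps.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    # Fill in the table: each cell takes the best of three moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    # Trace the route backwards from the bottom-right corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = global_align("AGGGAGC", "AGGGTTTGC")
print(score)
print(top)
print(bottom)
```

Several alignments can share the best score; which one is returned depends on the order in which the three moves are tested during the trace-back.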
Needleman-Wunsch Algorithm
Initialize NxM matrix with the sequences A and B of length N and M
Starting at the top left corner set the intermediate scoring value =
Substitution matrices
Some amino acids are more similar than others
Adjust cost according to some similarity matrix, e.g. BLOSUM62:
Leu -> Leu: 4
Leu -> Met: 2
Leu -> Pro: -3
... etc.
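In code, a substitution matrix is simply a lookup table that replaces the fixed match/mismatch scores. The sketch below holds only the three BLOSUM62 entries quoted above; the names are hypothetical, and a real matrix has an entry for every pair of amino acids.

```python
# Only the three BLOSUM62 values quoted on the slide (L = Leu, M = Met, P = Pro).
PARTIAL_BLOSUM62 = {
    ("L", "L"): 4,
    ("L", "M"): 2,
    ("L", "P"): -3,
}


def substitution_score(a, b, matrix=PARTIAL_BLOSUM62):
    """Symmetric lookup: score(a, b) equals score(b, a)."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]  # raises KeyError for pairs missing from the table


print(substitution_score("L", "L"))  # 4
print(substitution_score("M", "L"))  # 2
print(substitution_score("P", "L"))  # -3
```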
Gap penalties
Gaps tend to occur together, so one penalty per gap position is unrealistic: a gap of length three should not cost three times as much as a gap of length one
Use affine gap cost: make extending an already existing gap cheaper
Gap opening (G) / gap extension (E)
Total cost for gap length x: G + x E
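The difference between the two schemes is easy to see numerically. This sketch uses the formula above, G + x E; the function names and the values G = 10 and E = 1 are purely illustrative assumptions.

```python
def linear_gap_cost(x, per_position):
    """Every gap position costs the same."""
    return x * per_position


def affine_gap_cost(x, G, E):
    """Formula from the slide: G + x * E for a gap of length x (0 if no gap)."""
    return G + x * E if x > 0 else 0


G, E = 10, 1  # illustrative values only
for x in (1, 2, 3):
    # Linear cost is calibrated so that a gap of length 1 costs the same.
    print(x, linear_gap_cost(x, G + E), affine_gap_cost(x, G, E))
```

With these values a gap of length three costs 13 under the affine scheme, against 33 when each position is charged the full single-gap price.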
Global vs Local Alignment
Global: Find the best overall alignment between sequences.
Local: Find short regions of highly conserved sequence.
Global vs Local
Species 1: SOMEONE
Species 2: AWESOME

Global:
Species 1: - - - SOMEONE
Species 2: AWESOME - - -

Local:
Species 1: SOME
Species 2: SOME
Smith-Waterman Algorithm
Instead of looking at each sequence in its entirety, this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximizes the similarity.
For every cell the algorithm calculates ALL possible paths leading to it. These paths can be of any length and can contain insertions and deletions.
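A minimal sketch of the local alignment recurrence follows. The scores (match = 1, mismatch = -1, gap = -1), the linear gap penalty and the function name are assumptions chosen for illustration. Compared with global alignment, each cell is floored at 0, and the trace-back starts at the highest-scoring cell and stops at the first 0.

```python
def local_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(local_align("SOMEONE", "AWESOME"))  # (4, 'SOME', 'SOME')
```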
Calculating significance
We have calculated the optimal alignment: the alignment with the best score, whether the sequences are related or not
Call this the maximum segment pair (MSP)
How many MSPs do we expect with at least the same score by chance?
Calculating significance
We make use of the extreme value distribution (EVD) to calculate the number of alignments between random sequences that we expect with our score or better
This is known as the e-value: $E(S) = K m n e^{-\lambda S}$
$K$ and $\lambda$ = scaling parameters calculated based on the search space ($K$) and scoring scheme ($\lambda$)
$m$, $n$ = size of the search space (lengths of the query and of the database)
The probability of finding at least one match with our score or better (the p-value): $P = 1 - e^{-E(S)}$
As both the e-value and the p-value decrease, the significance of the match increases
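The two quantities can be computed directly from the Karlin-Altschul form of the e-value, E(S) = K m n exp(-lambda S), and the p-value 1 - exp(-E). In this sketch the function names and the values of K and lambda are illustrative assumptions; in practice both parameters are estimated for the scoring scheme in use.

```python
import math


def e_value(S, m, n, K, lam):
    """Expected number of chance alignments scoring at least S."""
    return K * m * n * math.exp(-lam * S)


def p_value(E):
    """Probability of at least one chance hit: 1 - exp(-E)."""
    return -math.expm1(-E)  # numerically stable for very small E


K, lam = 0.1, 0.3  # illustrative values only
for n in (1_000_000, 10_000_000):
    E = e_value(S=60, m=100, n=n, K=K, lam=lam)
    print(n, E, p_value(E))
```

For small e-values the two numbers are nearly identical, and the e-value grows in proportion to the database size n.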
BLAST
Basic Local Alignment Search Tool: used to find local sequence alignments between protein or nucleotide sequences (Altschul et al., 1990, cited over 43,000 times)
Heuristic, so it finds an approximate best match (Smith-Waterman guarantees the optimal one)
Calculates the high scoring matches instead of the maximum scoring matches (HSP instead of MSP)
BLAST
28, we will look at 4)
GTTCACATCATCCTGC
GTTC TTCA TCAC CACA ACAT CATC ATCA...
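The word list for the example query can be reproduced with a sliding window of length 4. This is a minimal sketch with a hypothetical function name.

```python
def words(query, w=4):
    """All overlapping words of length w, in order of position."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


query_words = words("GTTCACATCATCCTGC", 4)
print(len(query_words))   # 13
print(query_words[:7])    # ['GTTC', 'TTCA', 'TCAC', 'CACA', 'ACAT', 'CATC', 'ATCA']
```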
BLAST
on scoring matrices) you could call this the neighborhood
GTTCACATCATCCTGC
GTTC: CTTC, GTTC, GATC...
TTCA: TTCT, TTGA, TTGT...
TCAC: AGAC, CCAC, TCTG...
CACA: ...
ACAT: ...
CATC: ...
ATCA: ...
...
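A neighborhood like the one listed can be generated by scoring every possible word against the query word and keeping those at or above a threshold. This sketch is a simplification under stated assumptions: a plain match = +1 / mismatch = -1 score stands in for a scoring matrix, the threshold values are arbitrary, and the function names are hypothetical.

```python
from itertools import product


def word_score(w1, w2, match=1, mismatch=-1):
    """Ungapped score of two words of equal length."""
    return sum(match if a == b else mismatch for a, b in zip(w1, w2))


def neighborhood(word, threshold, alphabet="ACGT"):
    """All words of the same length scoring at least threshold against word."""
    hits = []
    for letters in product(alphabet, repeat=len(word)):
        candidate = "".join(letters)
        if word_score(word, candidate) >= threshold:
            hits.append(candidate)
    return hits


# Threshold 2 keeps words with at most one mismatch (scores 4 or 2).
print(neighborhood("GTTC", threshold=2))
# Threshold 0 also admits two mismatches, e.g. AGAC and TCTG for TCAC.
print(len(neighborhood("TCAC", threshold=0)))
```

Enumerating all words is only practical for short words and small alphabets, which is the situation on the slide.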
BLAST
Calculate e-values: the expectation that you would get that alignment by chance, given the database of sequences
Return significant results
We already talked about these e-values and p-values with Smith-Waterman significance
BLAST Types:
Nucleotide vs Nucleotide: blastn
Protein vs Protein: blastp
Translated Nucleotide vs Protein: blastx
Protein vs Translated Nucleotide: tblastn
Translated Nucleotide vs Translated Nucleotide database: tblastx
DNA vs protein
Should you use blastn or blastp?
There are four potential nucleotides (A, C, G, T) and therefore four potential states
There are 20 standard amino acids and therefore 20 potential states
blastp should be more sensitive than blastn: the larger state space gives a lower chance of a random hit
If there is the possibility of highly similar sequences, DNA works well
  intergenic spacers
  RNA genes
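The effect of the state space can be illustrated with a back-of-the-envelope calculation. The sketch assumes uniform, independent letters, which real sequences do not follow, and uses alphabet sizes of 4 (DNA) and 20 (standard amino acids); the function name is hypothetical.

```python
def random_word_match_probability(alphabet_size, word_length):
    """Chance that two random words of the given length are identical,
    assuming uniform and independent letters (a simplification)."""
    return (1 / alphabet_size) ** word_length


for k in (1, 3, 4):
    dna = random_word_match_probability(4, k)
    protein = random_word_match_probability(20, k)
    print(k, dna, protein, dna / protein)
```

Already at word length 4, an exact chance match is several hundred times less likely for protein words than for DNA words under these assumptions.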
Things to consider
Nothing is "90% homologous"; there may be a degree of your belief in homology
Statistical significance depends on the size of the alignments and of the database
  e-value increases as the database gets bigger: more chance for a random hit
  e-value decreases as alignments get longer: the longer the alignment, the more significant
Therefore
Sequence similarity can suggest homology
A significant alignment over the length of both sequences strongly suggests homology
Homologous sequences do not always produce significant alignments!
Regions with low complexity (that are not cleaned out by the initial steps in BLAST) can produce significant alignments with no homology
Rules
There are no hard and fast rules (too bad)
Nucleotides
  it has been suggested that sequence identity of more than 70% suggests homology
  e-values of 10^-6 or less
Proteins
  25% or more sequence identity
  e-values of 10^-3 or less
Nope: you have to verify somehow, and if you are working high-throughput, there will be errors
Next
We will go over some examples in lab:
Needleman-Wunsch
BLAST