Sequence Alignment (Inexact Matching) Overview

CSI/BINF 5330 Sequence Alignment (Inexact Matching) Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment 1

Sequence Similarity Homologs similar sequence + common ancestor (divergent evolution) Orthlogs: homologs in different species by species divergence Paralogs: homologs in the same species by gene duplication Analogs similar sequence + no common ancestor (convergent evolution) How to measure sequence similarity? Sequence Alignment Definition Arranging two or more DNA or protein sequences by inserting gaps to maximize their sequence similarity e.g., DNA1 = AGA, DNA2 = CGAC Applications Given gene sequences, infer their evolutionary history (phylogenetics) Given gene sequences of known functions, infer the functions of newly sequenced genes Given genes of known functions in one organism, infer the functions of genes in another organism 2

Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment Longest Common Subsequences (1) Subsequence of x An ordered sequence of letters from x Not necessarily consecutive e.g., x= AGCA, AGCA?, CG?, AC?, GA? Common Subsequence of x and y e.g., x= ACGA and y= GCAA, CA?, GA?, AA? Longest Common Subsequence (LCS) of x and y? 3

Longest Common Subsequences (2) Definition of LCS Given two sequences, v = v 1 v 2 v m and w = w 1 w 2 w n, LCS of v and w is a sequence of positions in v: 1 i 1 < i 2 < < i t m and a sequence of positions in w: 1 j 1 < j 2 < < j t n such that i t -th letter of v equals to j t -letter of w, and t is maximal LCS in 2-Row Representation (1) Example x= ACGAG (m=8), y= GCAAC (n=7) x y 1 2 3 4 5 6 7 8 A C G A G G C A A C 1 2 3 4 5 6 7 Position in x: Position in y: LCS: CA 2 < 3 < 4 < 6 1 < 3 < 5 < 6 4

LCS in 2-Row Representation (2) Example - Continued x= ACGAG (m=8), y= GCAAC (n=7) x y 1 2 3 4 5 6 7 8 A C G A G G C A A C 1 2 3 4 5 6 7 x y 1 2 3 4 5 6 7 8 A C G A G G C A A C 1 2 3 4 5 6 7 LCS in 2-D Grid Representation Edit Graph 2-D grid structure having diagonals on the position of the same letter Example x= AGA (m=7) y= ACGAC (n=7) Strongest path in edit graph source A A C G A C (0,0) (1,1) (2,2) G (2,3) (3,4) (4,5) (5,5) (6,6) (7,6) (7,7) A sink 5

Formulation of LCS Problem Goal Finding the longest common subsequence (LCS) of two sequences (length-m, length-n) Finding the strongest path from a source to a sink in a weighted edit graph he path strength is measured by summing the weights on the path Input A weighted edit graph G with source (0,0) and sink (m,n) Output A strongest path in G from the source to the sink Solving by Exhaustive Search Algorithm (1) Enumerate all possible paths from the source to the sink (2) Compute the path strength for all possible paths (3) Find the strongest path Problems? 6

Solving by (Iterative) Greedy Algorithm Algorithm (1) Start from the source (2) Select the edge having the highest weight (i.e., if there is a diagonal edge, select it. Otherwise, select one of the other edges.) (3) Repeat (2) until it reaches the sink Problems? Runtime? Solving by Dynamic Programming Formula Algorithm 7

Example of LCS Example x= AGA (m=7), y= ACGAC (n=7) source A C G A C 0 0 0 0 0 0 0 0 A 0 1 1 1 1 1 1 1 0 1 2 2 2 2 2 2 G 0 1 2 2 3 3 3 3 0 0 A 0 0 1 2 2 3 4 4 4 1 2 2 3 4 4 4 1 2 2 3 4 5 5 1 2 2 3 4 5 5 sink Finding LCS Storing Directions Backtracking 8

Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment from LCS to Sequence Alignment LCS problem Allows only insertions and deletions - no substitutions Scores 1 for a match and 0 for an insertion or deletion Sequence Alignment Problem Allows gaps (insertions and deletions) and mismatches (substitutions) Uses any scoring schemes 9

Formulation of Global Alignment Problem Goal Finding the best alignment of two sequences under a given scoring schema Input wo sequences x (length-m) and y (length-n), and a scoring schema Output An alignment of x and y with the maximal score Basic Scoring Scheme Simplest Scoring Scheme Match premium: + Mismatch penalty: -μ Insertion and deletion (gap) penalty: -σ Score = #matches μ #mismatches σ (#insertions + #deletions) Formula 10

Example of Sequence Alignment Example x= AGA (m=7), y= ACGAC (n=7) A C G A C A G A + -μ -μ -μ -μ + -μ -μ + -μ -μ + -μ -μ -μ -μ -μ + -μ -μ -μ -μ + -μ -μ + -μ -μ -μ + -μ -μ + -μ -μ + -μ -μ -μ -μ + -μ -μ + -μ -μ + -μ -μ All horizontal and vertical edges: -σ Solving by Dynamic Programming Algorithm 11

Advanced Scoring Scheme Percent Identity Percentage of identical matches Percent Similarity Percentage of nucleotide pairs with the same type Percentage of similar amino acid pairs in biochemical structure Advanced Scoring Scheme Varying scores in similarity Varying, strong penalties for mismatches Relative likelihood of evolutionary relationship Probability of mutations Define scoring matrix for DNA or protein sequences Scoring Matrices (1) Scoring Matrix δ Also called substitution matrix 4 4 array representation for DNA sequences or (4) (4) array 20 20 array representation for protein sequences or (20) (20) array Entry of δ (i,j) has the score between i and j, i.e., the rate at which i is substituted with j over time Formula 12

Scoring Matrices (2) PAM (Point Accepted Mutations) For protein sequence alignment Amino acid substitution frequency in mutations Logarithmic matrix of mutation probabilities PAM120: Results from 120 mutations per 100 residues PAM120 vs. PAM240 BLOSUM (Block Substitution Matrix) For protein sequence alignment Applied for local sequence alignments Substitution frequencies between clustered groups BLOSUM-62: Results with a threshold (cut-off) of 62% identity BLOSUM-62 vs. BLOSUM-50 Scoring Matrices (3) Substitution Matrix Examples BLOSUM-62 PAM120 13

Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment Local Sequence Alignment (1) Example x = CAGGCGAAGA y = AGGCAGCAGGG Global Alignment C A G G C G A A G A A G G C A G C A G G G Local Alignment C A G G C G A A G A A G G C A G C A G G G 14

Local Sequence Alignment (2) Example - Continued C A G G C G A A G A A G G C A G C A G G G scoring in this range for local alignment Global vs. Local Alignment Global Alignment Problem Finds the path having the largest weight between vertices (0,0) and (m,n) in the edit graph Local Alignment Problem Finds the path having the largest weight between two arbitrary vertices, (i,j) and (i, j ), in the edit graph Score Comparison he score of local alignment must be greater than (or equal to) the score of global alignment 15

Formulation of Local Alignment Problem Goal Finding the best local alignment between two sequences Input wo sequences x and y, and a scoring matrix δ Output An alignment of substrings of x and y with the maximal score among all possible substrings of them Implementation of Local Alignment Strategy C A G G C G A A G A A G G C A G C A G G A Compute mini global alignment for local alignment 16

Solving by Exhaustive Search Process (1) Enumeration of all possible pairs of substrings (2) Global alignment for each pair of substrings (1) Enumeration of all substrings starting from (i,j) (2) Global alignment from each start position Runtime Suppose two sequences have the same length n Global alignment : otal runtime : Solution Free ride! Solving by Dynamic Programming Free Ride Assigns 0 weights from (0,0) to any other nodes (i,j) Yeah, a free ride! (0,0) Formula 0 Runtime? 17

Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment Scoring Insertions/Deletions Naïve Approach -σ for 1 insertion/deletion, -2σ for 2 consecutive insertions/deletions -3σ for 3 consecutive insertions/deletions, etc. too severe penalty for a series of 100 consecutive insertions/deletions Example x= AAGC, y= AAGC single event x= AAGGC, y= AGGC 18

Scoring Gaps of Insertions/Deletions Gap Contiguous sequence of spaces in one of the rows Contiguous sequence of insertions or deletions in 2-row representation Linear Gap Penalty Score for a gap of length x : -σ x ( Naïve approach ) Constant Gap Penalty Score for a gap of length x : -ρ Affine Gap Penalty Score for a gap of length x : - (ρ + σ x) -ρ : gap opening penalty / -σ : gap extension penalty ( ρ σ ) Solving Constant/Affine Gap Penalty Edit Graph Update Add long horizontal or vertical edges with ρ weights to the edit graph Runtime? 19

Improved Solution for Constant/Affine Gap Penalty 3-Layer Grid Structure Middle layer ( Main layer ) for diagonal edges Extends matches and mismatches Upper layer for horizontal edges Creates/extends gaps in a sequence x Lower layer for vertical edges Creates/extends gaps in a sequence y Gap opening penalty (-ρ) for jumping from middle layer to upper/lower layer Gap extension penalty (-σ) for extending on upper/lower layer Example of 3-Layer Grid upper layer ( gaps in x ) main layer (matches/mismatches) lower layer ( gaps in y ) 20

Solving by Dynamic Programming Formula Dynamic programming with recursion match / mismatch continuing gap in y starting gap in y continuing gap in x starting gap in x Runtime? Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment 21

Searching Databases Homolog Search Searching similar sequences to a query sequence in a database (e.g. GenBank) Computational issues Dynamic programming are rigorous But inefficient in searching a huge database Need heuristic approaches Homolog Search ools FASA BLAS FASA (1) FASA DNA / protein sequence alignment tool (local alignment) Applies dynamic programming in scoring selected sequences Heuristic method in candidate sequence search Process Finding all pairwise k-tuples (at least k contiguous matching residues) Scoring the k-tuples by a substitution matrix Selecting sequences with high scores for alignment 22

FASA (2) FASA package query sequence database fasta protein protein fasta DNA DNA fastx / fasty DNA (all reading frames) protein tfastx / tfasty protein DNA (all reading frames) fastf mixed peptide protein tfastf mixed peptide DNA ssearch : applies dynamic programming BLAS (1) BLAS (Basic Local Alignment Search ool) DNA / protein sequence alignment tool Finds local alignments Heuristic method in sequence search Runs faster than FASA Process Makes a list of words (word pairs) from the query sequence Chooses high-scoring words Searches database for matches (hits) with the high-scoring words Extends the matches in both directions to find high-scoring segments (HSP) 23

BLAS (2) BLAS programs query sequence database blastp protein protein blastn DNA DNA blastx DNA (all reading frames) protein tblastn protein DNA (all reading frames) tblastx DNA (all reading frames) DNA (all reading frames) Search Results BLAS Search Results FASA Search Results 24

E-value E-value Average number of alignments with a score of at least S that would be expected by chance alone in searching a database of n sequences High alignment score S Low E-value Low alignment score S High E-value Ranges of E-value: 0 ~ n Factors Alignment score he number of sequences in the database Sequence length Default E-value threshold: 0.01 ~ 0.001 Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment 25

Pairwise Alignment vs. Multiple Alignment Pairwise Alignment Alignment of two sequences Sometimes two sequences are functionally similar or have common ancestor although they have weak sequence similarity Multiple Alignment Alignment of more than two sequences Finds invisible similarity in pairwise alignment Alignment of 3 Sequences Alignment of 2 Sequences Described in a 2-row representation Best alignment is found in a 2-D matrix by dynamic programming Alignment of 3 Sequences Described in a 3-row representation x= AGG, y= ACGA, z= ACG x : A - G G - y : A - C G - A z : A C - G - Best alignment is found in a 3-D matrix by dynamic programming 26

Alignment in 3-D Grid 3-D Edit Graph 3-D grid structure (cube) with diagonals in each cell Example source x : y : z : 0 1 2 2 3 4 5 5 A - G G - 0 1 1 2 3 4 4 5 A - C G - A 0 1 2 3 3 4 5 5 A C - G - Path in 3-D grid : sink (0,0,0) (1,1,1) (2,1,2) (2,2,3) (3,3,3) (4,4,4) (5,4,5) (5,5,5) 3-D Grid Unit 2-D Grid Unit Maximum 3 edges in each unit of 2-D grid 3-D Grid Cell Maximum 7 edges in each unit of 3-D grid ( i-1, j-1, k-1 ) ( i-1, j-1, k ) ( i-1, j, k-1 ) ( i-1, j, k ) ( i, j-1, k-1 ) ( i, j, k-1 ) ( i, j-1, k ) ( i, j, k ) 27

Solving by Dynamic Programming Formula δ (x, y, z) is the entry of 3-D scoring matrix Runtime? from 3-D Alignment to Multiple Alignment Alignment of k Sequences Able to be solved by dynamic programming in k-d grid Runtime? Conclusion Dynamic programming for pairwise alignment can be extended to multiple alignment However, computationally impractical How can we solve this problem? 28

Heuristics of Multiple Alignment Intuition Implementing pairwise alignment (2-D alignment) many times is better than implementing k-d multiple alignment once Heuristic Process (1) Implementing all possible pairwise alignments (2) Combining the most similar pair iteratively Multiple Alignment to Pairwise Alignment Multiple Alignment Pairwise Alignments Multiple alignment induces pairwise alignments x: A C G C G G C y: A C G C G A G x: A C G C G G C y: A C G C G A G z: G C C G C G A G x: A C G C G G C z: G C C G C G A G y: A C G C G A G z: G C C G C G A G 29

Multiple Alignment Projection Multiple Alignment Pairwise Alignments x: A C G C G G C y: A C G C G A G z: G C C G C G A G x: A C G C G G C y: A C G C G A G x: A C G C G G C z: G C C G C G A G y: A C G C G A G z: G C C G C G A G Projection Pairwise Alignment to Multiple Alignment (1) Pairwise Alignments Multiple Alignment x: A C G C G G C y: A C G C G A G x: A C G C G G C z: G C C G C G A G x: A C G C G G C y: A C G C G A G z: G C C G C G A G y: A C G C G A G z: G C C G C G A G Can we construct a multiple alignment that induces pairwise alignments? 30

Pairwise Alignment to Multiple Alignment (2) Examples x= AAA, y= GGG, z= AAAGGG x= AAA, y= GGG, z= GGGAAA Conclusion Difficult to infer optimal multiple alignment from all possible optimal pairwise alignments Greedy Approach (1) Process (1) Implement pairwise alignment for all possible pairs of sequences (2) Choose the most similar pair of sequences (3) Merge them into a new sequence (4) Choose the most similar sequence to the new sequence (5) Repeat (3) and (4) until choosing all sequences Example Step 1 s1: GACA s2: GCGA s3: GAA s4: GCAGC s2 GCGA s4 GCAGC s1 GA-CA s2 G-CGA s1 GA-CA s3 GAA- s1 GACA-- s4 G -CAGC s2 G-CGA s3 GAA- s3 GA-A s4 G-CAGC 31

Greedy Approach (2) Example - continued Step 2 s2 GCGA s4 GCAGC Step 3 s 1 s 3 s 2,4 GACA GAA GCt/aGa/c s 2,4 GCt/aGa/c ( called profile or consensus sequence ) Features k-way alignment (alignment of k sequences) Runtime? Greedy algorithm Not optimal multiple alignment Progressive Alignment (1) Features A variation of greedy algorithm (more intelligent strategy on each step) Also called a hierarchical method Uses profiles to compare sequences Gaps are permanent ( once a gap, always a gap ) Works well for close sequences Process Stage 1 Computes similarity of all possible pairs of sequences ( similarity = sequence identity = # exact match / sequence length ) Makes a similarity matrix v 1 v 2 v 3 v 4 v 1 v 2 v 3 v 4 -.17.87.28.59.33.62 32

Progressive Alignment (2) Process - continued Stage 2 Creates a guide tree using the similarity matrix Stage 3 Applies a series of pairwise alignment following the guide tree Application of Progressive Alignment ClustalW Popular multiple alignment tool Adopts the progressive multiple alignment FOS_RA FOS_MOUSE FOS_CHICK FOSB_MOUSE FOSB_HUMAN PEEMSVS-LDLGGLPEAPESEEAFLPLLNDPEPK-PSLEPVKNISNMELKAEPFD PEEMSVAS-LDLGGLPEASPESEEAFLPLLNDPEPK-PSLEPVKSISNVELKAEPFD SEELAAAALDLG----APSPAAAEEAFALPLMEAPPAVPPKEPSG--SGLELKAEPFD PGPGPLAEVRDLPG-----SSAKEDGFGWLLPPPPPPP-----------------LPFQ PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ.. : **. :.. *:.* *. * **: Dots and stars show how well-conserved a column is 33

Scoring Schemes Number of Matches Multiple longest common subsequence score A column is a match if all the letters in the column are the same AAA AAG AA AC Only good for very similar sequences Entropy-Based Scoring Sum-of-Pair Scoring Entropy-Based Scoring (1) Entropy in Information heory A measure of the uncertainty associated with a random variable Entropy-Based Scoring in Multiple Alignment (1) Define frequencies for the occurrence of each letter on each column (2) Compute entropy of each column (3) Sum all entropies over all columns 34

Entropy-Based Scoring (2) Example AAA AAG AA AC Frequency 1 st column: p(a) = 1, p() = p(g) = p(c) = 0 2 nd column: p(a) = 0.75, p() = 0.25, p(g) = p(c) = 0 3 rd column: p(a) = 0.25, p() = 0.25, p(c) = 0.25, p(g) = 0.25 Entropy A A A A A 3 3 1 1 G 1 1 H 0 H log log 0. 244 A H A 4 4 4 4 log 4 0. 602 4 4 A C Entropy-based score in multiple alignment: 0 + 0.244 + 0.602 Sum-of-Pair Scoring Sum-of-Pairs Scoring in Multiple Alignment Consider pairwise alignment of sequences, a i and a j, imposed by a multiple alignment of k sequences Denote the score of the pairwise alignment as S*(a i, a j ) Sum up the pairwise scores for a multiple alignment: Example Aligning 4 sequences, a 1, a 2, a 3, and a 4, by 35

Advanced Multiple Alignment Background Progressive sequence alignment has loss of information not optimal Multi-domain proteins evolve not only through point mutations but also through domain duplications and domain re-combinations Rearrangement might be meaningful for aligning multi-domain protein sequences Examples Partial Order Multiple Sequence Alignment (PO-MSA) A-Bruijn Alignment (ABA) Alignment as a Graph Conventional pairwise alignment A path of a sequence Combining two paths Combined directed acyclic graph (partial order) 36

PO-MSA Algorithm Partial Order Multiple Sequence Alignment (PO-MSA) Considers a set of sequences S as a directed acyclic graph G such that each sequence in S is a path in G ( Each sequence is mapped into the graph.) Focuses on ordering rather than positions Algorithm (1) Construct a guide tree (2) Apply progressive alignment following the guide tree (3) Align two directed acyclic graphs (Partial Order Alignment) using dynamic programming algorithm at each step Partial Order Alignment (1) Schematic View Conventional pairwise alignment PO-MSA 37

Partial Order Alignment (2) Advantages Scalability Linear increase of time complexity as the increment of predecessors Accuracy Homologous recombination for multi-domain protein sequences ( he graph represents domain structure.) Questions? Lecture Slides are found on the Course Website, web.ecs.baylor.edu/faculty/cho/5330 38