Sequence Alignment (Inexact Matching) Overview

Similar documents
BLAST. Anders Gorm Pedersen & Rasmus Wernersson

Pairwise Sequence Alignment

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Bio-Informatics Lectures. A Short Introduction

Amino Acids and Their Properties

Protein Protein Interaction Networks

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST

Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment

Bioinformatics Resources at a Glance

Network Protocol Analysis using Bioinformatics Algorithms

Introduction to Bioinformatics 3. DNA editing and contig assembly

BIOINFORMATICS TUTORIAL

Bioinformatics Grid - Enabled Tools For Biologists.

How To Cluster Of Complex Systems

Dynamic Programming. Lecture Overview Introduction

Introduction to Bioinformatics AS Laboratory Assignment 6

CD-HIT User s Guide. Last updated: April 5,

A Tutorial in Genetic Sequence Classification Tools and Techniques

Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance?

Web Data Extraction: 1 o Semestre 2007/2008

Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations

T cell Epitope Prediction

Genetic Algorithms and Sudoku

Database searching with DNA and protein sequences: An introduction Clare Sansom Date received (in revised form): 12th November 1999

Practical Graph Mining with R. 5. Link Analysis

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

Phylogenetic Trees Made Easy

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

Algorithms in Bioinformatics I, WS06/07, C.Dieterich 47. This lecture is based on the following, which are all recommended reading:

MORPHEUS. Prediction of Transcription Factors Binding Sites based on Position Weight Matrix.

Multiple Sequence Alignment. Hot Topic 5/24/06 Kim Walker

Clone Manager. Getting Started

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

DNA Insertions and Deletions in the Human Genome. Philipp W. Messer

Genome Explorer For Comparative Genome Analysis

The PageRank Citation Ranking: Bring Order to the Web

A linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form

2) Write in detail the issues in the design of code generator.

A permutation can also be represented by describing its cycles. What do you suppose is meant by this?

Hierarchical Bayesian Modeling of the HIV Response to Therapy

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Hidden Markov Models

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching

Graph Mining and Social Network Analysis

Gold (Genetic Optimization for Ligand Docking) G. Jones et al. 1996

DATA ANALYSIS II. Matrix Algorithms

Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)

A greedy algorithm for the DNA sequencing by hybridization with positive and negative errors and information about repetitions

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Clustering & Visualization

D A T A M I N I N G C L A S S I F I C A T I O N

IE 680 Special Topics in Production Systems: Networks, Routing and Logistics*

Scheduling Shop Scheduling. Tim Nieberg

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

LabGenius. Technical design notes. The world s most advanced synthetic DNA libraries. hi@labgeni.us V1.5 NOV 15

Lecture 1: Systems of Linear Equations

Introduction to Phylogenetic Analysis

Analyzing A DNA Sequence Chromatogram

Welcome to the Plant Breeding and Genomics Webinar Series

Clustering and scheduling maintenance tasks over time

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Learning from Diversity

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

3. Interpolation. Closing the Gaps of Discretization... Beyond Polynomials

A Review And Evaluations Of Shortest Path Algorithms

Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm

Data Preprocessing. Week 2

From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains

MASCOT Search Results Interpretation

Technology, Kolkata, INDIA,

Molecular Databases and Tools

Empirically Identifying the Best Genetic Algorithm for Covering Array Generation

Gene Finding CMSC 423

SYSM 6304: Risk and Decision Analysis Lecture 5: Methods of Risk Analysis

Three Effective Top-Down Clustering Algorithms for Location Database Systems

Hierarchical Data Visualization

Chapter 19. General Matrices. An n m matrix is an array. a 11 a 12 a 1m a 21 a 22 a 2m A = a n1 a n2 a nm. The matrix A has n row vectors

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Compact Representations and Approximations for Compuation in Games

Part 2: Community Detection

Effect of Using Neural Networks in GA-Based School Timetabling

A Step-by-Step Tutorial: Divergence Time Estimation with Approximate Likelihood Calculation Using MCMCTREE in PAML

An Introduction to the Use of Bayesian Network to Analyze Gene Expression Data

Arbres formels et Arbre(s) de la Vie

Computational searches of biological sequences

Lecture 3. Linear Programming. 3B1B Optimization Michaelmas 2015 A. Zisserman. Extreme solutions. Simplex method. Interior point method

Analysis of Algorithms, I

UCHIME in practice Single-region sequencing Reference database mode

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

6. Cholesky factorization

Vector storage and access; algorithms in GIS. This is lecture 6

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Mining Social-Network Graphs

Lecture 1: Course overview, circuits, and formulas

Principles of Evolution - Origin of Species

A Brief Study of the Nurse Scheduling Problem (NSP)

Transcription:

CSI/BINF 5330 Sequence Alignment (Inexact Matching) Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment 1

Sequence Similarity Homologs similar sequence + common ancestor (divergent evolution) Orthlogs: homologs in different species by species divergence Paralogs: homologs in the same species by gene duplication Analogs similar sequence + no common ancestor (convergent evolution) How to measure sequence similarity? Sequence Alignment Definition Arranging two or more DNA or protein sequences by inserting gaps to maximize their sequence similarity e.g., DNA1 = AGA, DNA2 = CGAC Applications Given gene sequences, infer their evolutionary history (phylogenetics) Given gene sequences of known functions, infer the functions of newly sequenced genes Given genes of known functions in one organism, infer the functions of genes in another organism 2

Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment Longest Common Subsequences (1) Subsequence of x An ordered sequence of letters from x Not necessarily consecutive e.g., x= AGCA, AGCA?, CG?, AC?, GA? Common Subsequence of x and y e.g., x= ACGA and y= GCAA, CA?, GA?, AA? Longest Common Subsequence (LCS) of x and y? 3

Longest Common Subsequences (2) Definition of LCS Given two sequences, v = v 1 v 2 v m and w = w 1 w 2 w n, LCS of v and w is a sequence of positions in v: 1 i 1 < i 2 < < i t m and a sequence of positions in w: 1 j 1 < j 2 < < j t n such that i t -th letter of v equals to j t -letter of w, and t is maximal LCS in 2-Row Representation (1) Example x= ACGAG (m=8), y= GCAAC (n=7) x y 1 2 3 4 5 6 7 8 A C G A G G C A A C 1 2 3 4 5 6 7 Position in x: Position in y: LCS: CA 2 < 3 < 4 < 6 1 < 3 < 5 < 6 4

LCS in 2-Row Representation (2) Example - Continued x= ACGAG (m=8), y= GCAAC (n=7) x y 1 2 3 4 5 6 7 8 A C G A G G C A A C 1 2 3 4 5 6 7 x y 1 2 3 4 5 6 7 8 A C G A G G C A A C 1 2 3 4 5 6 7 LCS in 2-D Grid Representation Edit Graph 2-D grid structure having diagonals on the position of the same letter Example x= AGA (m=7) y= ACGAC (n=7) Strongest path in edit graph source A A C G A C (0,0) (1,1) (2,2) G (2,3) (3,4) (4,5) (5,5) (6,6) (7,6) (7,7) A sink 5

Formulation of LCS Problem Goal Finding the longest common subsequence (LCS) of two sequences (length-m, length-n) Finding the strongest path from a source to a sink in a weighted edit graph he path strength is measured by summing the weights on the path Input A weighted edit graph G with source (0,0) and sink (m,n) Output A strongest path in G from the source to the sink Solving by Exhaustive Search Algorithm (1) Enumerate all possible paths from the source to the sink (2) Compute the path strength for all possible paths (3) Find the strongest path Problems? 6

Solving by (Iterative) Greedy Algorithm Algorithm (1) Start from the source (2) Select the edge having the highest weight (i.e., if there is a diagonal edge, select it. Otherwise, select one of the other edges.) (3) Repeat (2) until it reaches the sink Problems? Runtime? Solving by Dynamic Programming Formula Algorithm 7

Example of LCS Example x= AGA (m=7), y= ACGAC (n=7) source A C G A C 0 0 0 0 0 0 0 0 A 0 1 1 1 1 1 1 1 0 1 2 2 2 2 2 2 G 0 1 2 2 3 3 3 3 0 0 A 0 0 1 2 2 3 4 4 4 1 2 2 3 4 4 4 1 2 2 3 4 5 5 1 2 2 3 4 5 5 sink Finding LCS Storing Directions Backtracking 8

Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment from LCS to Sequence Alignment LCS problem Allows only insertions and deletions - no substitutions Scores 1 for a match and 0 for an insertion or deletion Sequence Alignment Problem Allows gaps (insertions and deletions) and mismatches (substitutions) Uses any scoring schemes 9

Formulation of Global Alignment Problem Goal Finding the best alignment of two sequences under a given scoring schema Input wo sequences x (length-m) and y (length-n), and a scoring schema Output An alignment of x and y with the maximal score Basic Scoring Scheme Simplest Scoring Scheme Match premium: + Mismatch penalty: -μ Insertion and deletion (gap) penalty: -σ Score = #matches μ #mismatches σ (#insertions + #deletions) Formula 10

Example of Sequence Alignment Example x= AGA (m=7), y= ACGAC (n=7) A C G A C A G A + -μ -μ -μ -μ + -μ -μ + -μ -μ + -μ -μ -μ -μ -μ + -μ -μ -μ -μ + -μ -μ + -μ -μ -μ + -μ -μ + -μ -μ + -μ -μ -μ -μ + -μ -μ + -μ -μ + -μ -μ All horizontal and vertical edges: -σ Solving by Dynamic Programming Algorithm 11

Advanced Scoring Scheme Percent Identity Percentage of identical matches Percent Similarity Percentage of nucleotide pairs with the same type Percentage of similar amino acid pairs in biochemical structure Advanced Scoring Scheme Varying scores in similarity Varying, strong penalties for mismatches Relative likelihood of evolutionary relationship Probability of mutations Define scoring matrix for DNA or protein sequences Scoring Matrices (1) Scoring Matrix δ Also called substitution matrix 4 4 array representation for DNA sequences or (4) (4) array 20 20 array representation for protein sequences or (20) (20) array Entry of δ (i,j) has the score between i and j, i.e., the rate at which i is substituted with j over time Formula 12

Scoring Matrices (2) PAM (Point Accepted Mutations) For protein sequence alignment Amino acid substitution frequency in mutations Logarithmic matrix of mutation probabilities PAM120: Results from 120 mutations per 100 residues PAM120 vs. PAM240 BLOSUM (Block Substitution Matrix) For protein sequence alignment Applied for local sequence alignments Substitution frequencies between clustered groups BLOSUM-62: Results with a threshold (cut-off) of 62% identity BLOSUM-62 vs. BLOSUM-50 Scoring Matrices (3) Substitution Matrix Examples BLOSUM-62 PAM120 13

Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment Local Sequence Alignment (1) Example x = CAGGCGAAGA y = AGGCAGCAGGG Global Alignment C A G G C G A A G A A G G C A G C A G G G Local Alignment C A G G C G A A G A A G G C A G C A G G G 14

Local Sequence Alignment (2) Example - Continued C A G G C G A A G A A G G C A G C A G G G scoring in this range for local alignment Global vs. Local Alignment Global Alignment Problem Finds the path having the largest weight between vertices (0,0) and (m,n) in the edit graph Local Alignment Problem Finds the path having the largest weight between two arbitrary vertices, (i,j) and (i, j ), in the edit graph Score Comparison he score of local alignment must be greater than (or equal to) the score of global alignment 15

Formulation of Local Alignment Problem Goal Finding the best local alignment between two sequences Input wo sequences x and y, and a scoring matrix δ Output An alignment of substrings of x and y with the maximal score among all possible substrings of them Implementation of Local Alignment Strategy C A G G C G A A G A A G G C A G C A G G A Compute mini global alignment for local alignment 16

Solving by Exhaustive Search Process (1) Enumeration of all possible pairs of substrings (2) Global alignment for each pair of substrings (1) Enumeration of all substrings starting from (i,j) (2) Global alignment from each start position Runtime Suppose two sequences have the same length n Global alignment : otal runtime : Solution Free ride! Solving by Dynamic Programming Free Ride Assigns 0 weights from (0,0) to any other nodes (i,j) Yeah, a free ride! (0,0) Formula 0 Runtime? 17

Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment Scoring Insertions/Deletions Naïve Approach -σ for 1 insertion/deletion, -2σ for 2 consecutive insertions/deletions -3σ for 3 consecutive insertions/deletions, etc. too severe penalty for a series of 100 consecutive insertions/deletions Example x= AAGC, y= AAGC single event x= AAGGC, y= AGGC 18

Scoring Gaps of Insertions/Deletions Gap Contiguous sequence of spaces in one of the rows Contiguous sequence of insertions or deletions in 2-row representation Linear Gap Penalty Score for a gap of length x : -σ x ( Naïve approach ) Constant Gap Penalty Score for a gap of length x : -ρ Affine Gap Penalty Score for a gap of length x : - (ρ + σ x) -ρ : gap opening penalty / -σ : gap extension penalty ( ρ σ ) Solving Constant/Affine Gap Penalty Edit Graph Update Add long horizontal or vertical edges with ρ weights to the edit graph Runtime? 19

Improved Solution for Constant/Affine Gap Penalty 3-Layer Grid Structure Middle layer ( Main layer ) for diagonal edges Extends matches and mismatches Upper layer for horizontal edges Creates/extends gaps in a sequence x Lower layer for vertical edges Creates/extends gaps in a sequence y Gap opening penalty (-ρ) for jumping from middle layer to upper/lower layer Gap extension penalty (-σ) for extending on upper/lower layer Example of 3-Layer Grid upper layer ( gaps in x ) main layer (matches/mismatches) lower layer ( gaps in y ) 20

Solving by Dynamic Programming Formula Dynamic programming with recursion match / mismatch continuing gap in y starting gap in y continuing gap in x starting gap in x Runtime? Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment 21

Searching Databases Homolog Search Searching similar sequences to a query sequence in a database (e.g. GenBank) Computational issues Dynamic programming are rigorous But inefficient in searching a huge database Need heuristic approaches Homolog Search ools FASA BLAS FASA (1) FASA DNA / protein sequence alignment tool (local alignment) Applies dynamic programming in scoring selected sequences Heuristic method in candidate sequence search Process Finding all pairwise k-tuples (at least k contiguous matching residues) Scoring the k-tuples by a substitution matrix Selecting sequences with high scores for alignment 22

FASA (2) FASA package query sequence database fasta protein protein fasta DNA DNA fastx / fasty DNA (all reading frames) protein tfastx / tfasty protein DNA (all reading frames) fastf mixed peptide protein tfastf mixed peptide DNA ssearch : applies dynamic programming BLAS (1) BLAS (Basic Local Alignment Search ool) DNA / protein sequence alignment tool Finds local alignments Heuristic method in sequence search Runs faster than FASA Process Makes a list of words (word pairs) from the query sequence Chooses high-scoring words Searches database for matches (hits) with the high-scoring words Extends the matches in both directions to find high-scoring segments (HSP) 23

BLAS (2) BLAS programs query sequence database blastp protein protein blastn DNA DNA blastx DNA (all reading frames) protein tblastn protein DNA (all reading frames) tblastx DNA (all reading frames) DNA (all reading frames) Search Results BLAS Search Results FASA Search Results 24

E-value E-value Average number of alignments with a score of at least S that would be expected by chance alone in searching a database of n sequences High alignment score S Low E-value Low alignment score S High E-value Ranges of E-value: 0 ~ n Factors Alignment score he number of sequences in the database Sequence length Default E-value threshold: 0.01 ~ 0.001 Overview Backgrounds Longest Common Subsequence Problem Global Sequence Alignment Local Sequence Alignment Alignment with Gap Penalty Homolog Search Multiple Sequence Alignment 25

Pairwise Alignment vs. Multiple Alignment Pairwise Alignment Alignment of two sequences Sometimes two sequences are functionally similar or have common ancestor although they have weak sequence similarity Multiple Alignment Alignment of more than two sequences Finds invisible similarity in pairwise alignment Alignment of 3 Sequences Alignment of 2 Sequences Described in a 2-row representation Best alignment is found in a 2-D matrix by dynamic programming Alignment of 3 Sequences Described in a 3-row representation x= AGG, y= ACGA, z= ACG x : A - G G - y : A - C G - A z : A C - G - Best alignment is found in a 3-D matrix by dynamic programming 26

Alignment in 3-D Grid 3-D Edit Graph 3-D grid structure (cube) with diagonals in each cell Example source x : y : z : 0 1 2 2 3 4 5 5 A - G G - 0 1 1 2 3 4 4 5 A - C G - A 0 1 2 3 3 4 5 5 A C - G - Path in 3-D grid : sink (0,0,0) (1,1,1) (2,1,2) (2,2,3) (3,3,3) (4,4,4) (5,4,5) (5,5,5) 3-D Grid Unit 2-D Grid Unit Maximum 3 edges in each unit of 2-D grid 3-D Grid Cell Maximum 7 edges in each unit of 3-D grid ( i-1, j-1, k-1 ) ( i-1, j-1, k ) ( i-1, j, k-1 ) ( i-1, j, k ) ( i, j-1, k-1 ) ( i, j, k-1 ) ( i, j-1, k ) ( i, j, k ) 27

Solving by Dynamic Programming Formula δ (x, y, z) is the entry of 3-D scoring matrix Runtime? from 3-D Alignment to Multiple Alignment Alignment of k Sequences Able to be solved by dynamic programming in k-d grid Runtime? Conclusion Dynamic programming for pairwise alignment can be extended to multiple alignment However, computationally impractical How can we solve this problem? 28

Heuristics of Multiple Alignment Intuition Implementing pairwise alignment (2-D alignment) many times is better than implementing k-d multiple alignment once Heuristic Process (1) Implementing all possible pairwise alignments (2) Combining the most similar pair iteratively Multiple Alignment to Pairwise Alignment Multiple Alignment Pairwise Alignments Multiple alignment induces pairwise alignments x: A C G C G G C y: A C G C G A G x: A C G C G G C y: A C G C G A G z: G C C G C G A G x: A C G C G G C z: G C C G C G A G y: A C G C G A G z: G C C G C G A G 29

Multiple Alignment Projection Multiple Alignment Pairwise Alignments x: A C G C G G C y: A C G C G A G z: G C C G C G A G x: A C G C G G C y: A C G C G A G x: A C G C G G C z: G C C G C G A G y: A C G C G A G z: G C C G C G A G Projection Pairwise Alignment to Multiple Alignment (1) Pairwise Alignments Multiple Alignment x: A C G C G G C y: A C G C G A G x: A C G C G G C z: G C C G C G A G x: A C G C G G C y: A C G C G A G z: G C C G C G A G y: A C G C G A G z: G C C G C G A G Can we construct a multiple alignment that induces pairwise alignments? 30

Pairwise Alignment to Multiple Alignment (2) Examples x= AAA, y= GGG, z= AAAGGG x= AAA, y= GGG, z= GGGAAA Conclusion Difficult to infer optimal multiple alignment from all possible optimal pairwise alignments Greedy Approach (1) Process (1) Implement pairwise alignment for all possible pairs of sequences (2) Choose the most similar pair of sequences (3) Merge them into a new sequence (4) Choose the most similar sequence to the new sequence (5) Repeat (3) and (4) until choosing all sequences Example Step 1 s1: GACA s2: GCGA s3: GAA s4: GCAGC s2 GCGA s4 GCAGC s1 GA-CA s2 G-CGA s1 GA-CA s3 GAA- s1 GACA-- s4 G -CAGC s2 G-CGA s3 GAA- s3 GA-A s4 G-CAGC 31

Greedy Approach (2) Example - continued Step 2 s2 GCGA s4 GCAGC Step 3 s 1 s 3 s 2,4 GACA GAA GCt/aGa/c s 2,4 GCt/aGa/c ( called profile or consensus sequence ) Features k-way alignment (alignment of k sequences) Runtime? Greedy algorithm Not optimal multiple alignment Progressive Alignment (1) Features A variation of greedy algorithm (more intelligent strategy on each step) Also called a hierarchical method Uses profiles to compare sequences Gaps are permanent ( once a gap, always a gap ) Works well for close sequences Process Stage 1 Computes similarity of all possible pairs of sequences ( similarity = sequence identity = # exact match / sequence length ) Makes a similarity matrix v 1 v 2 v 3 v 4 v 1 v 2 v 3 v 4 -.17.87.28.59.33.62 32

Progressive Alignment (2) Process - continued Stage 2 Creates a guide tree using the similarity matrix Stage 3 Applies a series of pairwise alignment following the guide tree Application of Progressive Alignment ClustalW Popular multiple alignment tool Adopts the progressive multiple alignment FOS_RA FOS_MOUSE FOS_CHICK FOSB_MOUSE FOSB_HUMAN PEEMSVS-LDLGGLPEAPESEEAFLPLLNDPEPK-PSLEPVKNISNMELKAEPFD PEEMSVAS-LDLGGLPEASPESEEAFLPLLNDPEPK-PSLEPVKSISNVELKAEPFD SEELAAAALDLG----APSPAAAEEAFALPLMEAPPAVPPKEPSG--SGLELKAEPFD PGPGPLAEVRDLPG-----SSAKEDGFGWLLPPPPPPP-----------------LPFQ PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ.. : **. :.. *:.* *. * **: Dots and stars show how well-conserved a column is 33

Scoring Schemes Number of Matches Multiple longest common subsequence score A column is a match if all the letters in the column are the same AAA AAG AA AC Only good for very similar sequences Entropy-Based Scoring Sum-of-Pair Scoring Entropy-Based Scoring (1) Entropy in Information heory A measure of the uncertainty associated with a random variable Entropy-Based Scoring in Multiple Alignment (1) Define frequencies for the occurrence of each letter on each column (2) Compute entropy of each column (3) Sum all entropies over all columns 34

Entropy-Based Scoring (2) Example AAA AAG AA AC Frequency 1 st column: p(a) = 1, p() = p(g) = p(c) = 0 2 nd column: p(a) = 0.75, p() = 0.25, p(g) = p(c) = 0 3 rd column: p(a) = 0.25, p() = 0.25, p(c) = 0.25, p(g) = 0.25 Entropy A A A A A 3 3 1 1 G 1 1 H 0 H log log 0. 244 A H A 4 4 4 4 log 4 0. 602 4 4 A C Entropy-based score in multiple alignment: 0 + 0.244 + 0.602 Sum-of-Pair Scoring Sum-of-Pairs Scoring in Multiple Alignment Consider pairwise alignment of sequences, a i and a j, imposed by a multiple alignment of k sequences Denote the score of the pairwise alignment as S*(a i, a j ) Sum up the pairwise scores for a multiple alignment: Example Aligning 4 sequences, a 1, a 2, a 3, and a 4, by 35

Advanced Multiple Alignment Background Progressive sequence alignment has loss of information not optimal Multi-domain proteins evolve not only through point mutations but also through domain duplications and domain re-combinations Rearrangement might be meaningful for aligning multi-domain protein sequences Examples Partial Order Multiple Sequence Alignment (PO-MSA) A-Bruijn Alignment (ABA) Alignment as a Graph Conventional pairwise alignment A path of a sequence Combining two paths Combined directed acyclic graph (partial order) 36

PO-MSA Algorithm Partial Order Multiple Sequence Alignment (PO-MSA) Considers a set of sequences S as a directed acyclic graph G such that each sequence in S is a path in G ( Each sequence is mapped into the graph.) Focuses on ordering rather than positions Algorithm (1) Construct a guide tree (2) Apply progressive alignment following the guide tree (3) Align two directed acyclic graphs (Partial Order Alignment) using dynamic programming algorithm at each step Partial Order Alignment (1) Schematic View Conventional pairwise alignment PO-MSA 37

Partial Order Alignment (2) Advantages Scalability Linear increase of time complexity as the increment of predecessors Accuracy Homologous recombination for multi-domain protein sequences ( he graph represents domain structure.) Questions? Lecture Slides are found on the Course Website, web.ecs.baylor.edu/faculty/cho/5330 38