Lecture 2 Sequence Alignment. Burr Settles IBS Summer Research Program 2008 bsettles@cs.wisc.edu www.cs.wisc.edu/~bsettles/ibs08/


Sequence Alignment: Task Definition
- given:
  - a pair of sequences (DNA or protein)
  - a method for scoring a candidate alignment
- do: determine the correspondences between substrings in the sequences such that the similarity score is maximized

Why Do Alignment?
- homology: similarity due to descent from a common ancestor
- often we can infer homology from similarity
- thus we can sometimes infer structure/function from sequence similarity

Homology Example: Evolution of the Globins

Homology
- homologous sequences can be divided into two groups
  - orthologous sequences: sequences that differ because they are found in different species (e.g. human α-globin and mouse α-globin)
  - paralogous sequences: sequences that differ because of a gene duplication event (e.g. human α-globin and human β-globin, various versions of both)

Issues in Sequence Alignment
- the sequences we're comparing probably differ in length
- there may be only a relatively small region in the sequences that match
- we want to allow partial matches (i.e. some amino acid pairs are more substitutable than others)
- variable length regions may have been inserted/deleted from the common ancestral sequence

Sequence Variations
- sequences may have diverged from a common ancestor through various types of mutations:
  - substitutions (ACGA → AGGA)
  - insertions (ACGA → ACCGGAGA)
  - deletions (ACGGAGA → AGA)
- the latter two will result in gaps in alignments

Insertions, Deletions and Protein Structure
- Why is it that two similar sequences may have large insertions/deletions?
  - some insertions and deletions may not significantly affect the structure of a protein
- loop structures: insertions/deletions here not so significant

Example Alignment: Globins
- figure at right shows prototypical structure of globins
- figure below shows part of alignment for 8 globins (-'s indicate gaps)

Three Key Questions
- Q1: what do we want to align?
- Q2: how do we score an alignment?
- Q3: how do we find the best alignment?

Q1: What Do We Want to Align?
- global alignment: find best match of both sequences in their entirety
- local alignment: find best subsequence match
- semi-global alignment: find best match without penalizing gaps on the ends of the alignment

The Space of Global Alignments
- some possible global alignments for ELV and VIS:

    ELV    -ELV    --ELV    ELV-    E-LV    ELV--    EL-V
    VIS    VIS-    VIS--    -VIS    VIS-    --VIS    -VIS

Q2: How Do We Score Alignments?
- gap penalty function: $w(k)$ indicates cost of a gap of length $k$
- substitution matrix: $s(a, b)$ indicates score of aligning character $a$ with character $b$

Linear Gap Penalty Function
- different gap penalty functions require somewhat different dynamic programming algorithms
- the simplest case is when a linear gap function is used: $w(k) = gk$, where $g$ is a constant
- we'll start by considering this case

Scoring an Alignment
- the score of an alignment is the sum of the scores for pairs of aligned characters plus the scores for gaps
- example: given the following alignment

    VAHV---D--DMPNALSALSDLHAHKL
    AIQLQVTGVVVTDATLKNLGSVHVSKG

- we would score it by $s(V,A) + s(A,I) + s(H,Q) + s(V,L) + 3g + s(D,G) + 2g + \dots$
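The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming a linear gap penalty and a toy substitution function `simple_s` (+1 for a match, -1 for a mismatch); the names `score_alignment` and `simple_s` are hypothetical, and a real protein alignment would use a substitution matrix instead.

```python
def score_alignment(row_x, row_y, s, g):
    """Score two equal-length alignment rows; '-' marks a gap position."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += g        # linear gap penalty: every gap position costs g
        else:
            total += s(a, b)  # substitution score for an aligned pair
    return total


def simple_s(a, b):
    """Toy substitution score (an assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


# A/A (+1), A/G (-1), A/- (-2), C/C (+1)
print(score_alignment("AAAC", "AG-C", simple_s, g=-2))  # -1
```

Passing `s` as a function keeps the same loop usable with any substitution scheme.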

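Q3: How Do We Find the Best Alignment?
- simple approach: compute & score all possible alignments
- but there are
  $$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$
  possible global alignments for 2 sequences of length $n$
- e.g. by this formula, two sequences of length 100 have $\binom{200}{100} \approx 9 \times 10^{58}$ possible alignments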
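To get a feel for how fast this count grows, the binomial coefficient and its approximation can be evaluated directly. The sketch below uses only the standard library; the function names are hypothetical.

```python
import math


def count_alignments(n):
    """Exact value of the binomial coefficient C(2n, n)."""
    return math.comb(2 * n, n)


def approx_count(n):
    """Approximation 2^(2n) / sqrt(pi * n) of the same quantity."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (5, 10, 100):
    exact = count_alignments(n)
    print(n, f"{exact:.3e}", f"{approx_count(n):.3e}")
```

Even for modest sequence lengths the count is far too large to enumerate, which is the motivation for dynamic programming.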

Pairwise Alignment Via Dynamic Programming
- dynamic programming: solve an instance of a problem by taking advantage of solutions for subparts of the problem
  - reduce problem of best alignment of two sequences to best alignment of all prefixes of the sequences
  - avoid recalculating the scores already considered
- example: Fibonacci sequence 1, 1, 2, 3, 5, 8, 13, 21, 34
- first used in alignment by Needleman & Wunsch, Journal of Molecular Biology, 1970
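The Fibonacci example shows the core idea in miniature: each value is computed once and reused. A minimal sketch using memoization from the standard library (the function name `fib` is just for illustration):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(k):
    """k-th Fibonacci number, with fib(1) = fib(2) = 1."""
    if k <= 2:
        return 1
    # Each subproblem is solved once; repeated calls hit the cache.
    return fib(k - 1) + fib(k - 2)


print([fib(k) for k in range(1, 10)])  # [1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Without the cache the same recursion recomputes subproblems exponentially often; with it, the work is linear in `k`.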

Dynamic Programming Idea
- consider last step in computing alignment of AAAC with AGC
- three possible options; in each we'll choose a different pairing for end of alignment, and add this to best alignment of previous characters

    AAA  C      AAAC  -      AAA  C
    AG   C      AG    C      AGC  -

- in each option, the left part is the best alignment of these prefixes and the right part is the pair whose score is added

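Dynamic Programming Idea
- given an n-character sequence x, and an m-character sequence y
- construct an (n+1) × (m+1) matrix F
- F(i, j) = score of the best alignment of x[1..i] with y[1..j]
- [Figure: matrix with rows labeled by x = AAAC and columns by y = AGC; one cell is annotated as the score of the best alignment of a prefix of x to AG]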

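Needleman-Wunsch Algorithm
- one way to specify the DP is in terms of its recurrence relation:

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) & \text{match } x_i \text{ with } y_j \\ F(i-1, j) + g & \text{insertion in } x \\ F(i, j-1) + g & \text{insertion in } y \end{cases}$$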

DP Algorithm Sketch: Global Alignment
- initialize first row and column of matrix
- fill in rest of matrix from top to bottom, left to right
- for each F(i, j), save pointer(s) to cell(s) that resulted in best score
- F(n, m) holds the optimal alignment score; trace pointers back from F(n, m) to F(0, 0) to recover alignment
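The steps above can be sketched as follows. This is a minimal illustration assuming a linear gap penalty `g` and a substitution function `s`; the function names are hypothetical, and ties are broken in the order diagonal, up, left, which is only one of several valid choices.

```python
def needleman_wunsch(x, y, s, g):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    # Initialize first column and first row with cumulative gap penalties.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = i * g, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = j * g, "left"
    # Fill top to bottom, left to right, saving one pointer per cell.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] + g, "up"),     # x[i] aligned with a gap
                (F[i][j - 1] + g, "left"),   # y[j] aligned with a gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Trace pointers back from F(n, m) to F(0, 0).
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


def simple_s(a, b):
    """Toy scoring scheme: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(needleman_wunsch("AAAC", "AGC", simple_s, g=-2))
```

Both the fill and the traceback touch each cell at most a constant number of times, so the running time is proportional to the matrix size.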

Initializing Matrix

             A     G     C
        0    g    2g    3g
    A   g
    A  2g
    A  3g
    C  4g

Global Alignment Example
- suppose we choose the following scoring scheme:
  $$s(x_i, y_j) = \begin{cases} +1 & \text{when } x_i = y_j \\ -1 & \text{when } x_i \ne y_j \end{cases}$$
- $g$ (penalty for aligning with a gap) $= -2$

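Global Alignment Example
- scoring scheme: $s(x_i, y_j) = +1$ when $x_i = y_j$, $-1$ when $x_i \ne y_j$; $g = -2$
- [Figure: empty DP matrix with rows labeled by x = AAAC and columns by y = AGC]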

Global Alignment Example

             A    G    C
        0   -2   -4   -6
    A  -2    1   -1   -3
    A  -4   -1    0   -2
    A  -6   -3   -2   -1
    C  -8   -5   -4   -1

- one optimal alignment:

    x: AAAC
    y: AG-C

Equally Optimal Alignments
- many optimal alignments may exist for a given pair of sequences
- can use preference ordering over paths when doing traceback
- [Figure: the three traceback pointers ranked 1 to 3 for highroad, and in the reverse order for lowroad]
- highroad and lowroad alignments show the two most different optimal alignments

Highroad & Lowroad Alignments

             A    G    C
        0   -2   -4   -6
    A  -2    1   -1   -3
    A  -4   -1    0   -2
    A  -6   -3   -2   -1
    C  -8   -5   -4   -1

- highroad alignment:

    x: AAAC
    y: AG-C

- lowroad alignment:

    x: AAAC
    y: -AGC
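A preference ordering can be applied at traceback time by trying the three possible predecessor cells in a fixed order. The sketch below is an illustration under assumptions: rows index x, the mapping of "highroad" to the order up, diagonal, left (and "lowroad" to the reverse) is inferred from the example above rather than stated in the slides, and all function names are hypothetical.

```python
def fill(x, y, s, g):
    """Fill the global-alignment matrix F for a linear gap penalty g."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * g
    for j in range(1, m + 1):
        F[0][j] = j * g
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] + g,
                          F[i][j - 1] + g)
    return F


def traceback(x, y, F, s, g, order):
    """Recover one optimal alignment, trying moves in the given order."""
    i, j = len(x), len(y)
    ax, ay = [], []
    while i > 0 or j > 0:
        for move in order:
            if move == "up" and i > 0 and F[i][j] == F[i - 1][j] + g:
                ax.append(x[i - 1]); ay.append("-"); i -= 1
                break
            if (move == "diag" and i > 0 and j > 0
                    and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1])):
                ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
                break
            if move == "left" and j > 0 and F[i][j] == F[i][j - 1] + g:
                ax.append("-"); ay.append(y[j - 1]); j -= 1
                break
    return "".join(reversed(ax)), "".join(reversed(ay))


def simple_s(a, b):
    """Toy scoring scheme: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


HIGHROAD = ("up", "diag", "left")   # assumed ordering
LOWROAD = ("left", "diag", "up")    # assumed ordering

F = fill("AAAC", "AGC", simple_s, -2)
print(traceback("AAAC", "AGC", F, simple_s, -2, HIGHROAD))  # ('AAAC', 'AG-C')
print(traceback("AAAC", "AGC", F, simple_s, -2, LOWROAD))   # ('AAAC', '-AGC')
```

Recomputing which moves are consistent with the matrix avoids storing multiple pointers per cell, at the cost of a few extra comparisons during traceback.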

DP Comments
- works for either DNA or protein sequences, although the substitution matrices used differ
- finds an optimal alignment
- the exact algorithm (and computational complexity) depends on gap penalty function (we'll come back to this)

Local Alignment
- so far we have discussed global alignment, where we are looking for best match between sequences from one end to the other
- more commonly, we will want a local alignment, the best match between subsequences of x and y

Local Alignment Motivation
- useful for comparing protein sequences that share a common motif (conserved pattern) or domain (independently folded unit) but differ elsewhere
- useful for comparing DNA sequences that share a similar motif but differ elsewhere
- useful for comparing protein sequences against genomic DNA sequences (long stretches of uncharacterized sequence)
- more sensitive when comparing highly diverged sequences

Local Alignment DP Algorithm
- original formulation: Smith & Waterman, Journal of Molecular Biology, 1981
- interpretation of array values is somewhat different
  - F(i, j) = score of the best alignment of a suffix of x[1..i] and a suffix of y[1..j]

Local Alignment DP Algorithm
- the recurrence relation is slightly different than for global algorithm:

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) + g \\ F(i, j-1) + g \\ 0 \end{cases}$$

Local Alignment DP Algorithm
- initialization: first row and first column initialized with 0's
- traceback:
  - find maximum value of F(i, j); can be anywhere in matrix
  - stop when we get to a cell with value 0
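Putting the local recurrence, initialization and traceback together gives roughly the following. It is a sketch assuming a linear gap penalty and a toy +1/-1 substitution function; the function names are hypothetical, and when several cells share the maximum the first one found is used.

```python
def smith_waterman(x, y, s, g):
    """Local alignment of x and y; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    # First row and first column stay at 0.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] + g,
                          F[i][j - 1] + g)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the maximum cell until a cell with value 0 is reached.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + g:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


def simple_s(a, b):
    """Toy scoring scheme: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(smith_waterman("AAGA", "TTAAG", simple_s, g=-2))  # (3, 'AAG', 'AAG')
```

The only changes relative to the global algorithm are the extra 0 in the maximum, the all-zero borders, and the start and stop conditions of the traceback.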

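Local Alignment Example
- scoring scheme: $s(x_i, y_j) = +1$ when $x_i = y_j$, $-1$ when $x_i \ne y_j$; $g = -2$
- [Figure: empty DP matrix with rows labeled TTAAG and columns labeled AAGA]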

Local Alignment Example

            A   A   G   A
        0   0   0   0   0
    T   0   0   0   0   0
    T   0   0   0   0   0
    A   0   1   1   0   1
    A   0   1   2   0   1
    G   0   0   0   3   1

- best local alignment (score 3):

    x: AAG
    y: AAG

More On Gap Penalty Functions
- a gap of length k is more probable than k gaps of length 1
  - a gap may be due to a single mutational event that inserted/deleted a stretch of characters
  - separated gaps are probably due to distinct mutational events
- a linear gap penalty function treats these cases the same
- it is more common to use an affine gap penalty function, which involves two terms:
  - a penalty h associated with opening a gap
  - a smaller penalty g for extending the gap

Gap Penalty Functions
- linear: $w(k) = gk$
- affine: $w(k) = \begin{cases} h + gk, & k \ge 1 \\ 0, & k = 0 \end{cases}$

Dynamic Programming for the Affine Gap Penalty Case
- to do in $O(n^2)$ time, need 3 matrices instead of 1
  - $M(i, j)$: best score given that x[i] is aligned to y[j]
  - $I_x(i, j)$: best score given that x[i] is aligned to a gap
  - $I_y(i, j)$: best score given that y[j] is aligned to a gap

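Global Alignment DP for the Affine Gap Penalty Case

match $x_i$ with $y_j$:
$$M(i, j) = \max \begin{cases} M(i-1, j-1) + s(x_i, y_j) \\ I_x(i-1, j-1) + s(x_i, y_j) \\ I_y(i-1, j-1) + s(x_i, y_j) \end{cases}$$

insertion in x:
$$I_x(i, j) = \max \begin{cases} M(i-1, j) + h + g & \text{open gap in } x \\ I_x(i-1, j) + g & \text{extend gap in } x \end{cases}$$

insertion in y:
$$I_y(i, j) = \max \begin{cases} M(i, j-1) + h + g & \text{open gap in } y \\ I_y(i, j-1) + g & \text{extend gap in } y \end{cases}$$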

Global Alignment DP for the Affine Gap Penalty Case
- initialization
  - $M(0, 0) = 0$
  - $I_x(i, 0) = h + g \cdot i$
  - $I_y(0, j) = h + g \cdot j$
  - other cells in top row and leftmost column $= -\infty$
- traceback
  - start at largest of $M(n, m)$, $I_x(n, m)$, $I_y(n, m)$
  - stop at any of $M(0, 0)$, $I_x(0, 0)$, $I_y(0, 0)$
  - note that pointers may traverse all three matrices
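The three-matrix recurrences and the initialization above translate fairly directly into code. The following is a score-only sketch (no traceback), assuming negative values for `h` and `g` and a toy +1/-1 substitution function; the function names are hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, s, h, g):
    """Optimal global alignment score with affine gap penalty w(k) = h + g*k."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    # Initialization: only M(0,0), Ix(i,0) and Iy(0,j) are finite on the border.
    M[0][0] = 0
    for i in range(n + 1):
        Ix[i][0] = h + g * i
    for j in range(m + 1):
        Iy[0][j] = h + g * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = s(x[i - 1], y[j - 1])
            # x[i] aligned to y[j]
            M[i][j] = sub + max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            # x[i] aligned to a gap: open a new gap or extend the current one
            Ix[i][j] = max(M[i - 1][j] + h + g, Ix[i - 1][j] + g)
            # y[j] aligned to a gap: open a new gap or extend the current one
            Iy[i][j] = max(M[i][j - 1] + h + g, Iy[i][j - 1] + g)
    return max(M[n][m], Ix[n][m], Iy[n][m])


def simple_s(a, b):
    """Toy scoring scheme: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(affine_global_score("AAT", "ACACT", simple_s, h=-3, g=-1))  # -4
```

Using `float("-inf")` for the unreachable border cells lets the `max` calls ignore them without special cases.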

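Global Alignment Example (Affine Gap Penalty)
- h = -3, g = -1; rows are indexed by x = AAT and columns by y = ACACT

    M:
              A    C    A    C    T
         0   -∞   -∞   -∞   -∞   -∞
    A   -∞    1   -5   -4   -7   -8
    A   -∞   -3    0   -2   -5   -6
    T   -∞   -6   -4   -1   -3   -4

    I_x:
              A    C    A    C    T
        -3   -∞   -∞   -∞   -∞   -∞
    A   -4   -∞   -∞   -∞   -∞   -∞
    A   -5   -3   -9   -8  -11  -12
    T   -6   -4   -4   -6   -9  -10

    I_y:
              A    C    A    C    T
        -3   -4   -5   -6   -7   -8
    A   -∞   -∞   -3   -4   -5   -6
    A   -∞   -∞   -7   -4   -5   -6
    T   -∞   -∞  -10   -8   -5   -6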

Global Alignment Example (Continued)
- the matrices M, I_x and I_y are the same as on the previous slide
- three optimal alignments (score -4):

    ACACT    ACACT    ACACT
    AA--T    A--AT    --AAT

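Local Alignment DP for the Affine Gap Penalty Case

$$M(i, j) = \max \begin{cases} M(i-1, j-1) + s(x_i, y_j) \\ I_x(i-1, j-1) + s(x_i, y_j) \\ I_y(i-1, j-1) + s(x_i, y_j) \\ 0 \end{cases}$$

$$I_x(i, j) = \max \begin{cases} M(i-1, j) + h + g \\ I_x(i-1, j) + g \end{cases}$$

$$I_y(i, j) = \max \begin{cases} M(i, j-1) + h + g \\ I_y(i, j-1) + g \end{cases}$$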

Local Alignment DP for the Affine Gap Penalty Case
- initialization
  - $M(0, 0) = 0$
  - $M(i, 0) = 0$
  - $M(0, j) = 0$
  - cells in top row and leftmost column of $I_x$, $I_y$ $= -\infty$
- traceback
  - start at largest $M(i, j)$
  - stop at $M(i, j) = 0$
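For the local case only the `M` recurrence and the borders change. A score-only sketch under the same assumptions as before (negative `h` and `g`, toy +1/-1 substitution scores, hypothetical function names):

```python
NEG_INF = float("-inf")


def affine_local_score(x, y, s, h, g):
    """Best local alignment score with affine gap penalty w(k) = h + g*k."""
    n, m = len(x), len(y)
    # M is 0 on the border; Ix and Iy are -infinity on the border.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = s(x[i - 1], y[j - 1])
            # The extra 0 lets a local alignment start at any cell.
            M[i][j] = max(0, sub + max(M[i - 1][j - 1],
                                       Ix[i - 1][j - 1],
                                       Iy[i - 1][j - 1]))
            Ix[i][j] = max(M[i - 1][j] + h + g, Ix[i - 1][j] + g)
            Iy[i][j] = max(M[i][j - 1] + h + g, Iy[i][j - 1] + g)
            best = max(best, M[i][j])
    return best


def simple_s(a, b):
    """Toy scoring scheme: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(affine_local_score("ACGT", "ACGT", simple_s, h=-3, g=-1))  # 4
```

The best score is tracked over `M` only, since a local alignment that ended in a gap could always be improved by dropping that gap.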

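Gap Penalty Functions
- linear: $w(k) = gk$
- affine: $w(k) = \begin{cases} h + gk, & k \ge 1 \\ 0, & k = 0 \end{cases}$
- concave: a function for which the following holds for all $k, l, m \ge 0$:
  $$w(k + m + l) - w(k + m) \le w(k + l) - w(k)$$
  e.g. $w(k) = h + g \log(k)$
- the inequality is stated for $w$ read as a cost (positive $h$, $g$); with the negative penalties used in the earlier examples (e.g. $g = -2$) its direction is reversed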
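The three families can be compared side by side with a few small functions. This is a sketch, assuming negative penalties as in the examples (h = -3, g = -1), the natural logarithm for the concave case, and w(0) = 0 for every family; the function names are hypothetical.

```python
import math


def linear_gap(k, g):
    """Linear penalty: every gap position costs g."""
    return g * k


def affine_gap(k, h, g):
    """Affine penalty: h for opening a gap plus g per position."""
    return 0 if k == 0 else h + g * k


def concave_gap(k, h, g):
    """Concave example from the text: h + g*log(k) (natural log assumed)."""
    return 0 if k == 0 else h + g * math.log(k)


h, g = -3, -1
# One gap of length 3 versus three separate gaps of length 1.
print(linear_gap(3, g), 3 * linear_gap(1, g))        # -3 -3
print(affine_gap(3, h, g), 3 * affine_gap(1, h, g))  # -6 -12
# Cost of each additional position shrinks for the concave penalty.
print([round(concave_gap(k + 1, h, g) - concave_gap(k, h, g), 3)
       for k in range(1, 5)])
```

The linear penalty cannot distinguish one long gap from several short ones, while the affine and concave penalties both favor the single long gap.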

Concave Gap Penalty Functions
- [Figure: plot of a concave gap penalty w(k) for k = 0 to 10, annotated with the increments w(k + m + l) - w(k + m) and w(k + l) - w(k)]

More On Scoring Matches
- so far, we've discussed multiple gap penalty functions, but only one match-scoring scheme:
  $$s(x_i, y_j) = \begin{cases} +1 & \text{when } x_i = y_j \\ -1 & \text{when } x_i \ne y_j \end{cases}$$
- for protein sequence alignment, some amino acids have similar structures and can be substituted in nature:
  - aspartic acid (D)
  - glutamic acid (E)

Substitution Matrices
- two popular sets of matrices for protein sequences
  - PAM matrices [Dayhoff et al., 1978]
  - BLOSUM matrices [Henikoff & Henikoff, 1992]
- both try to capture the relative substitutability of amino acid pairs in the context of evolution
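In code, a substitution matrix is usually just a symmetric lookup table. The sketch below uses made-up toy scores for three amino acids purely to show the mechanics; the numbers are NOT taken from PAM or BLOSUM, and the names `TOY_SCORES` and `toy_s` are hypothetical.

```python
# Toy, made-up scores (NOT real PAM/BLOSUM values): D and E are treated as
# similar, W as dissimilar to both.
TOY_SCORES = {
    ("D", "D"): 4, ("E", "E"): 4, ("W", "W"): 8,
    ("D", "E"): 2, ("D", "W"): -3, ("E", "W"): -3,
}


def toy_s(a, b):
    """Symmetric lookup: s(a, b) == s(b, a)."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]


# A conservative substitution scores higher than a dissimilar one.
print(toy_s("E", "D"), toy_s("D", "W"))
```

A function like `toy_s` can be passed wherever the earlier examples used the simple +1/-1 scheme.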

BLOSUM62 Matrix

Heuristic Methods
- the algorithms we learned today take $O(nm)$ time to align sequences, which is too slow for searching large databases
  - imagine an internet search engine, but where queries and results are protein sequences
- heuristic methods do fast approximation to dynamic programming
- example: BLAST [Altschul et al., 1990; Altschul et al., 1997]
  - break sequence into small (e.g. 3 base pair) words
  - scan database for word matches
  - extend all matches to seek high-scoring alignments
- tradeoff: sensitivity for speed
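The first two steps of the word-based heuristic (break the query into words, scan for word matches) can be sketched as below. This is a simplified illustration of exact-match seeding only, not BLAST's actual implementation; the function names are hypothetical, and the extension step is omitted.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map every length-k word in seq to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def find_seeds(query, target, k=3):
    """Return (query_pos, target_pos) pairs where a length-k word matches exactly."""
    index = index_words(target, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            seeds.append((q, t))
    return seeds


print(find_seeds("ACGTA", "TTACGTT", k=3))  # [(0, 2), (1, 3)]
```

Each seed would then be extended in both directions to look for a high-scoring alignment; indexing the target once is what makes the scan fast compared with filling a full DP matrix.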

Multiple Sequence Alignment
- we've only discussed aligning 2 sequences, but we may want to do more
  - discover common motifs in a set of sequences (e.g. DNA sequences that bind the same protein)
  - characterize a set of sequences (e.g. a protein family)
- much more complex
- Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences

Next Time
- basic molecular biology
- sequence alignment
- probabilistic sequence models
- gene expression analysis
- protein structure prediction (by Ameet Soni)