Lecture 2: Sequence Alignment
Burr Settles
IBS Summer Research Program 2008
bsettles@cs.wisc.edu
www.cs.wisc.edu/~bsettles/ibs08/
Sequence Alignment: Task Definition
given:
  a pair of sequences (DNA or protein)
  a method for scoring a candidate alignment
do:
  determine the correspondences between substrings in the sequences such that the similarity score is maximized
Why Do Alignment?
homology: similarity due to descent from a common ancestor
often we can infer homology from similarity
thus we can sometimes infer structure/function from sequence similarity
Homology Example: Evolution of the Globins
Homology
homologous sequences can be divided into two groups
  orthologous sequences: sequences that differ because they are found in different species (e.g. human α-globin and mouse α-globin)
  paralogous sequences: sequences that differ because of a gene duplication event (e.g. human α-globin and human β-globin, various versions of both)
Issues in Sequence Alignment
the sequences we're comparing probably differ in length
there may be only a relatively small region in the sequences that match
we want to allow partial matches (i.e. some amino acid pairs are more substitutable than others)
variable length regions may have been inserted/deleted from the common ancestral sequence
Sequence Variations
sequences may have diverged from a common ancestor through various types of mutations:
  substitutions (CG → GG)
  insertions (CG → CCGGG)
  deletions (CGGG → G)
the latter two will result in gaps in alignments
Insertions, Deletions and Protein Structure
Why is it that two similar sequences may have large insertions/deletions?
  some insertions and deletions may not significantly affect the structure of a protein
  loop structures: insertions/deletions here are not so significant
Example Alignment: Globins
figure at right shows prototypical structure of globins
figure below shows part of alignment for 8 globins (dashes indicate gaps)
Three Key Questions
Q1: what do we want to align?
Q2: how do we score an alignment?
Q3: how do we find the best alignment?
Q1: What Do We Want to Align?
global alignment: find best match of both sequences in their entirety
local alignment: find best subsequence match
semi-global alignment: find best match without penalizing gaps on the ends of the alignment
The Space of Global Alignments
some possible global alignments for ELV and VIS:
  ELV    -ELV    --ELV    ELV-    E-LV    ELV--    EL-V
  VIS    VIS-    VIS--    -VIS    VIS-    --VIS    -VIS
Q2: How Do We Score Alignments?
gap penalty function: $w(k)$ indicates cost of a gap of length $k$
substitution matrix: $s(a, b)$ indicates score of aligning character $a$ with character $b$
Linear Gap Penalty Function
different gap penalty functions require somewhat different dynamic programming algorithms
the simplest case is when a linear gap function is used: $w(k) = g \times k$, where $g$ is a constant
we'll start by considering this case
Scoring an Alignment
the score of an alignment is the sum of the scores for pairs of aligned characters plus the scores for gaps
example: given the following alignment
  VAHV---D--DMPNALSALSDLHAHKL
  AIQLQVTGVVVTDATLKNLGSVHVSKG
we would score it by
  $s(V,A) + s(A,I) + s(H,Q) + s(V,L) + 3g + s(D,G) + 2g + \ldots$
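The scoring rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `score_alignment` and `match_score`, the +1/-1 substitution scores, and the short example sequences are hypothetical choices, not part of the slide.

```python
def score_alignment(top, bottom, s, g):
    """Score a gapped pairwise alignment under a linear gap penalty.

    top, bottom: equal-length strings in which '-' marks a gap.
    s: function returning the substitution score for two characters.
    g: score added for each gap character (usually negative).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        # A column is either a gap (scores g) or a character pair (scores s).
        total += g if "-" in (a, b) else s(a, b)
    return total


def match_score(a, b):
    # Assumed toy scheme: +1 for identical characters, -1 otherwise.
    return 1 if a == b else -1


print(score_alignment("AAAC", "AG-C", match_score, g=-2))
```

With a real protein alignment, `s` would look up a substitution matrix instead of comparing characters for equality.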
Q3: How Do We Find the Best Alignment?
simple approach: compute & score all possible alignments
but there are
  $\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$
possible global alignments for 2 sequences of length $n$
e.g. by this formula, two sequences of length 100 have about $9 \times 10^{58}$ possible alignments
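To get a feel for how quickly this count grows, the binomial count and its approximation can be evaluated directly. A minimal sketch, assuming Python 3.8+ for `math.comb`; the function names are hypothetical.

```python
import math


def count_alignments(n):
    # Exact value of C(2n, n) = (2n)! / (n!)^2.
    return math.comb(2 * n, n)


def approx_alignments(n):
    # Approximation 2^(2n) / sqrt(pi * n).
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (5, 10, 100):
    exact = count_alignments(n)
    print(n, f"{exact:.3e}", f"{approx_alignments(n):.3e}")
```

Even for short sequences the count is far too large to enumerate, which is the motivation for dynamic programming.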
Pairwise Alignment Via Dynamic Programming
dynamic programming: solve an instance of a problem by taking advantage of solutions for subparts of the problem
  reduce problem of best alignment of two sequences to best alignment of all prefixes of the sequences
  avoid recalculating the scores already considered
  example: Fibonacci sequence 1, 1, 2, 3, 5, 8, 13, 21, 34, ...
first used in alignment by Needleman & Wunsch, Journal of Molecular Biology, 1970
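The Fibonacci example shows the core idea in miniature: each value is built from already-solved subproblems, and caching avoids recomputing them. A small sketch using the standard-library `functools.lru_cache`; the function name `fib` is just an illustrative choice.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n):
    # Base cases: the first two Fibonacci numbers are both 1.
    if n <= 2:
        return 1
    # Each subproblem is computed once and then served from the cache.
    return fib(n - 1) + fib(n - 2)


print([fib(k) for k in range(1, 10)])  # [1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Without the cache the same recursion takes exponential time; with it, each `fib(k)` is evaluated once. Alignment DP applies the same trick to prefixes of the two sequences.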
Dynamic Programming Idea
consider last step in computing alignment of AAAC with AGC
three possible options; in each we'll choose a different pairing for end of alignment, and add this to best alignment of previous characters
  x:  AAA | C      AAA | C      AAAC | -
  y:  AG  | C      AGC | -      AG   | C
left of the bar: consider best alignment of these prefixes
right of the bar: score of aligning this pair
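Dynamic Programming Idea
given an $n$-character sequence $x$, and an $m$-character sequence $y$
construct an $(n+1) \times (m+1)$ matrix $F$
$F(i, j)$ = score of the best alignment of $x[1 \ldots i]$ with $y[1 \ldots j]$
(figure: the matrix for x = AAAC and y = AGC, with one cell annotated as the score of the best alignment of a prefix of x to a prefix of y)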
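Needleman-Wunsch Algorithm
one way to specify the DP is in terms of its recurrence relation:
$$
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + s(x_i, y_j) & \text{match } x_i \text{ with } y_j \\
F(i-1, j) + g & \text{insertion in } x \\
F(i, j-1) + g & \text{insertion in } y
\end{cases}
$$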
DP Algorithm Sketch: Global Alignment
initialize first row and column of matrix
fill in rest of matrix from top to bottom, left to right
for each $F(i, j)$, save pointer(s) to cell(s) that resulted in best score
$F(n, m)$ holds the optimal alignment score (n and m are the lengths of x and y); trace pointers back from $F(n, m)$ to $F(0, 0)$ to recover alignment
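The initialization and fill steps above might look like this in Python. It is a sketch, not a reference implementation: `nw_matrix` and `match_score` are hypothetical names, the +1/-1 scores with g = -2 are assumed for the demonstration, and the sequences AAAC and AGC are used purely as a small test case. Pointers are omitted here; only scores are stored.

```python
def nw_matrix(x, y, s, g):
    """Fill the global-alignment score matrix F for a linear gap score g."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        F[i][0] = i * g
    for j in range(1, m + 1):
        F[0][j] = j * g
    # Fill top to bottom, left to right.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # x_i paired with y_j
                F[i - 1][j] + g,                          # x_i against a gap
                F[i][j - 1] + g,                          # y_j against a gap
            )
    return F


def match_score(a, b):
    # Assumed toy scheme: +1 match, -1 mismatch.
    return 1 if a == b else -1


for row in nw_matrix("AAAC", "AGC", match_score, g=-2):
    print(row)
```

The bottom-right entry is the optimal global score; both time and memory are proportional to n × m.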
Initializing Matrix
            A     G     C
       0    g    2g    3g
  A    g
  A   2g
  A   3g
  C   4g
Global Alignment Example
suppose we choose the following scoring scheme:
$$
s(x_i, y_j) =
\begin{cases}
+1 & \text{when } x_i = y_j \\
-1 & \text{when } x_i \neq y_j
\end{cases}
$$
$g$ (penalty for aligning with a gap) $= -2$
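Global Alignment Example
(figure: empty DP matrix for x = AAAC and y = AGC, to be filled in using $s(x_i, y_j) = +1$ when $x_i = y_j$, $-1$ when $x_i \neq y_j$, and $g = -2$)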
Global Alignment Example
            A     G     C
       0   -2    -4    -6
  A   -2    1    -1    -3
  A   -4   -1     0    -2
  A   -6   -3    -2    -1
  C   -8   -5    -4    -1
one optimal alignment
  x:  A A A C
  y:  A G - C
Equally Optimal Alignments
many optimal alignments may exist for a given pair of sequences
can use preference ordering over paths when doing traceback
(figure: the three traceback pointers numbered 1, 2, 3 for highroad and 3, 2, 1 for lowroad)
highroad and lowroad alignments show the two most different optimal alignments
Highroad & Lowroad Alignments
            A     G     C
       0   -2    -4    -6
  A   -2    1    -1    -3
  A   -4   -1     0    -2
  A   -6   -3    -2    -1
  C   -8   -5    -4    -1
highroad alignment
  x:  A A A C
  y:  A G - C
lowroad alignment
  x:  A A A C
  y:  - A G C
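A traceback with a preference ordering could be sketched as below. Everything here is an assumption made for illustration: the move names "up", "diag", "left", the function `global_align`, and the claim that the ordering up > diag > left corresponds to the highroad (it does for this small example, where x labels the rows). Scores are assumed to be integers so that equality tests on cell values are exact.

```python
def global_align(x, y, s, g, prefer=("up", "diag", "left")):
    """Needleman-Wunsch with linear gap score g; ties broken by `prefer`."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * g
    for j in range(1, m + 1):
        F[0][j] = j * g
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] + g,
                          F[i][j - 1] + g)

    # Trace back from F(n, m) to F(0, 0), recomputing which moves are optimal.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        valid = set()
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            valid.add("diag")
        if i > 0 and F[i][j] == F[i - 1][j] + g:
            valid.add("up")
        if j > 0 and F[i][j] == F[i][j - 1] + g:
            valid.add("left")
        move = next(mv for mv in prefer if mv in valid)
        if move == "diag":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def match_score(a, b):
    # Assumed toy scheme: +1 match, -1 mismatch.
    return 1 if a == b else -1


print(global_align("AAAC", "AGC", match_score, -2, prefer=("up", "diag", "left")))
print(global_align("AAAC", "AGC", match_score, -2, prefer=("left", "diag", "up")))
```

Reversing the preference order is enough to move from one extreme optimal path to the other; both alignments have the same score.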
DP Comments
works for either DNA or protein sequences, although the substitution matrices used differ
finds an optimal alignment
the exact algorithm (and computational complexity) depends on gap penalty function (we'll come back to this)
Local Alignment
so far we have discussed global alignment, where we are looking for best match between sequences from one end to the other
more commonly, we will want a local alignment, the best match between subsequences of x and y
Local Alignment Motivation
useful for comparing protein sequences that share a common motif (conserved pattern) or domain (independently folded unit) but differ elsewhere
useful for comparing DNA sequences that share a similar motif but differ elsewhere
useful for comparing protein sequences against genomic DNA sequences (long stretches of uncharacterized sequence)
more sensitive when comparing highly diverged sequences
Local Alignment DP Algorithm
original formulation: Smith & Waterman, Journal of Molecular Biology, 1981
interpretation of array values is somewhat different
$F(i, j)$ = score of the best alignment of a suffix of $x[1 \ldots i]$ and a suffix of $y[1 \ldots j]$
Local Alignment DP Algorithm
the recurrence relation is slightly different than for global algorithm
$$
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + s(x_i, y_j) \\
F(i-1, j) + g \\
F(i, j-1) + g \\
0
\end{cases}
$$
Local Alignment DP Algorithm
initialization: first row and first column initialized with 0's
traceback:
  find maximum value of $F(i, j)$; can be anywhere in matrix
  stop when we get to a cell with value 0
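Putting the local recurrence, initialization and traceback together gives roughly the following. This is a hedged sketch: `local_align` and `match_score` are hypothetical names, the +1/-1 scores and g = -2 are assumed, the sequences AAGA and TTAAG are just a small test case, and only one optimal local alignment is returned even if several exist.

```python
def local_align(x, y, s, g):
    """Smith-Waterman style local alignment with a linear gap score g."""
    n, m = len(x), len(y)
    # First row and column stay 0.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,                                        # start a new alignment
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] + g,
                F[i][j - 1] + g,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Trace back from the maximum cell until a cell with value 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + g:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


def match_score(a, b):
    # Assumed toy scheme: +1 match, -1 mismatch.
    return 1 if a == b else -1


print(local_align("AAGA", "TTAAG", match_score, -2))  # (3, 'AAG', 'AAG')
```

The only changes from the global version are the extra 0 in the max, the zero initialization, and where the traceback starts and stops.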
Local Alignment Example
$s(x_i, y_j) = +1$ when $x_i = y_j$, $-1$ when $x_i \neq y_j$; $g = -2$
(figure: empty DP matrix for x = AAGA and y = TTAAG)
Local Alignment Example
           T    T    A    A    G
      0    0    0    0    0    0
  A   0    0    0    1    1    0
  A   0    0    0    1    2    0
  G   0    0    0    0    0    3
  A   0    0    0    1    1    1
best local alignment
  x:  A A G
  y:  A A G
More On Gap Penalty Functions
a gap of length k is more probable than k gaps of length 1
  a gap may be due to a single mutational event that inserted/deleted a stretch of characters
  separated gaps are probably due to distinct mutational events
a linear gap penalty function treats these cases the same
it is more common to use an affine gap penalty function, which involves two terms:
  a penalty h associated with opening a gap
  a smaller penalty g for extending the gap
Gap Penalty Functions
linear: $w(k) = gk$
affine:
$$
w(k) =
\begin{cases}
h + gk, & k \geq 1 \\
0, & k = 0
\end{cases}
$$
Dynamic Programming for the Affine Gap Penalty Case
to do in $O(n^2)$ time, need 3 matrices instead of 1
  $M(i, j)$: best score given that x[i] is aligned to y[j]
  $I_x(i, j)$: best score given that x[i] is aligned to a gap
  $I_y(i, j)$: best score given that y[j] is aligned to a gap
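Global Alignment DP for the Affine Gap Penalty Case
$$
M(i, j) = \max
\begin{cases}
M(i-1, j-1) + s(x_i, y_j) & \text{match } x_i \text{ with } y_j \\
I_x(i-1, j-1) + s(x_i, y_j) & \text{insertion in } x \\
I_y(i-1, j-1) + s(x_i, y_j) & \text{insertion in } y
\end{cases}
$$
$$
I_x(i, j) = \max
\begin{cases}
M(i-1, j) + h + g & \text{open gap in } x \\
I_x(i-1, j) + g & \text{extend gap in } x
\end{cases}
$$
$$
I_y(i, j) = \max
\begin{cases}
M(i, j-1) + h + g & \text{open gap in } y \\
I_y(i, j-1) + g & \text{extend gap in } y
\end{cases}
$$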
Global Alignment DP for the Affine Gap Penalty Case
initialization
  $M(0, 0) = 0$
  $I_x(i, 0) = h + g \times i$
  $I_y(0, j) = h + g \times j$
  other cells in top row and leftmost column $= -\infty$
traceback
  start at largest of $M(n, m)$, $I_x(n, m)$, $I_y(n, m)$ (n and m are the lengths of x and y)
  stop at any of $M(0, 0)$, $I_x(0, 0)$, $I_y(0, 0)$
  note that pointers may traverse all three matrices
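The three-matrix scheme can be sketched as follows. The function name `affine_global`, the use of `float("-inf")` for the unreachable cells, and the +1/-1 substitution scores in the demonstration are assumptions made for illustration; the sequences AAT and ACACT with h = -3, g = -1 are just a small test case. Traceback is left out to keep the sketch short.

```python
NEG_INF = float("-inf")


def affine_global(x, y, s, h, g):
    """Global alignment score with affine gap score h + g*k for a gap of length k."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(n + 1):
        Ix[i][0] = h + g * i      # x[1..i] aligned entirely to gaps
    for j in range(m + 1):
        Iy[0][j] = h + g * j      # y[1..j] aligned entirely to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = s(x[i - 1], y[j - 1])
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + sub
            Ix[i][j] = max(M[i - 1][j] + h + g,   # open a gap
                           Ix[i - 1][j] + g)      # extend a gap
            Iy[i][j] = max(M[i][j - 1] + h + g,
                           Iy[i][j - 1] + g)
    score = max(M[n][m], Ix[n][m], Iy[n][m])
    return score, M, Ix, Iy


def match_score(a, b):
    # Assumed toy scheme: +1 match, -1 mismatch.
    return 1 if a == b else -1


score, M, Ix, Iy = affine_global("AAT", "ACACT", match_score, h=-3, g=-1)
print(score)
```

Keeping three matrices lets each cell remember whether the alignment currently ends in a match or in an open gap, which is what makes the open/extend distinction possible in O(nm) time.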
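Global Alignment Example (Affine Gap Penalty)
$h = -3$, $g = -1$; x = AAT (rows), y = ACACT (columns); match +1, mismatch -1
M:
             A     C     A     C     T
       0    -∞    -∞    -∞    -∞    -∞
  A   -∞     1    -5    -4    -7    -8
  A   -∞    -3     0    -2    -5    -6
  T   -∞    -6    -4    -1    -3    -4
I_x:
             A     C     A     C     T
      -3    -∞    -∞    -∞    -∞    -∞
  A   -4    -∞    -∞    -∞    -∞    -∞
  A   -5    -3    -9    -8   -11   -12
  T   -6    -4    -4    -6    -9   -10
I_y:
             A     C     A     C     T
      -3    -4    -5    -6    -7    -8
  A   -∞    -∞    -3    -4    -5    -6
  A   -∞    -∞    -7    -4    -5    -6
  T   -∞    -∞   -10    -8    -5    -6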
Global Alignment Example (Continued)
three optimal alignments (each with score -4 under $h = -3$, $g = -1$):
  ACACT    ACACT    ACACT
  AA--T    A--AT    --AAT
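Local Alignment DP for the Affine Gap Penalty Case
$$
M(i, j) = \max
\begin{cases}
M(i-1, j-1) + s(x_i, y_j) \\
I_x(i-1, j-1) + s(x_i, y_j) \\
I_y(i-1, j-1) + s(x_i, y_j) \\
0
\end{cases}
$$
$$
I_x(i, j) = \max
\begin{cases}
M(i-1, j) + h + g \\
I_x(i-1, j) + g
\end{cases}
$$
$$
I_y(i, j) = \max
\begin{cases}
M(i, j-1) + h + g \\
I_y(i, j-1) + g
\end{cases}
$$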
Local Alignment DP for the Affine Gap Penalty Case
initialization
  $M(0, 0) = 0$
  $M(i, 0) = 0$
  $M(0, j) = 0$
  cells in top row and leftmost column of $I_x$, $I_y$ $= -\infty$
traceback
  start at largest $M(i, j)$
  stop at $M(i, j) = 0$
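Gap Penalty Functions
linear: $w(k) = gk$
affine:
$$
w(k) =
\begin{cases}
h + gk, & k \geq 1 \\
0, & k = 0
\end{cases}
$$
concave: a function for which the following holds for all $k, l, m \geq 0$ (with $w$ written as a non-negative cost)
$$
w(k + m + l) - w(k + m) \leq w(k + l) - w(k)
$$
e.g. $w(k) = h + g \times \log(k)$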
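The three families of gap functions, and a brute-force check of the concavity condition, might be written as below. This is a sketch under stated assumptions: the function names are hypothetical, penalties are treated as non-negative costs (h = 3, g = 1) for the concavity check, w(0) is taken to be 0 for the logarithmic case, and the check only covers small integer ranges rather than proving the property.

```python
import math


def linear_gap(k, g):
    return g * k


def affine_gap(k, h, g):
    # No gap costs nothing; otherwise pay h once plus g per position.
    return 0 if k == 0 else h + g * k


def log_gap(k, h, g):
    # Assumed convention: w(0) = 0, since log(0) is undefined.
    return 0 if k == 0 else h + g * math.log(k)


def is_concave(w, limit=12, tol=1e-12):
    """Check w(k+m+l) - w(k+m) <= w(k+l) - w(k) for 0 <= k, l, m < limit."""
    for k in range(limit):
        for l in range(limit):
            for m in range(limit):
                lhs = w(k + m + l) - w(k + m)
                rhs = w(k + l) - w(k)
                if lhs > rhs + tol:
                    return False
    return True


print(is_concave(lambda k: linear_gap(k, 1)))
print(is_concave(lambda k: affine_gap(k, 3, 1)))
print(is_concave(lambda k: log_gap(k, 3, 1)))
print(is_concave(lambda k: k * k))
```

Intuitively, the condition says that extending an already long gap by l positions never costs more than extending a shorter one by the same amount.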
Concave Gap Penalty Functions
(figure: plot of a concave gap penalty $w(k)$ for gap lengths $k$ from 0 to 10, marking the increments $w(k + m + l) - w(k + m)$ and $w(k + l) - w(k)$ over an interval of length $l$)
More On Scoring Matches
so far, we've discussed multiple gap penalty functions, but only one match-scoring scheme:
$$
s(x_i, y_j) =
\begin{cases}
+1 & \text{when } x_i = y_j \\
-1 & \text{when } x_i \neq y_j
\end{cases}
$$
for protein sequence alignment, some amino acids have similar structures and can be substituted in nature, e.g. aspartic acid (D) and glutamic acid (E)
Substitution Matrices
two popular sets of matrices for protein sequences
  PAM matrices [Dayhoff et al., 1978]
  BLOSUM matrices [Henikoff & Henikoff, 1992]
both try to capture the relative substitutability of amino acid pairs in the context of evolution
BLOSUM62 Matrx
Heuristic Methods
the algorithms we learned today take $O(nm)$ time to align sequences, which is too slow for searching large databases
  imagine an internet search engine, but where queries and results are protein sequences
heuristic methods do fast approximation to dynamic programming
  example: BLAST [Altschul et al., 1990; Altschul et al., 1997]
  break sequence into small (e.g. 3 base pair) words
  scan database for word matches
  extend all matches to seek high-scoring alignments
tradeoff: sensitivity for speed
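The word-match idea can be illustrated with a deliberately simplified sketch. This is not BLAST itself: it uses exact word matches and a naive ungapped extension that stops at the first mismatch, whereas the real tool scores words and extensions with a substitution matrix. All function names and the example sequences are hypothetical.

```python
from collections import defaultdict


def index_words(sequence, w=3):
    """Map every length-w word in `sequence` to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - w + 1):
        index[sequence[pos:pos + w]].append(pos)
    return index


def seed_hits(query, target, w=3):
    """Return (query_pos, target_pos) pairs where a length-w word matches exactly."""
    index = index_words(target, w)
    hits = []
    for qpos in range(len(query) - w + 1):
        for tpos in index.get(query[qpos:qpos + w], []):
            hits.append((qpos, tpos))
    return hits


def extend_hit(query, target, qpos, tpos, w=3):
    """Extend a seed in both directions, without gaps, while characters match."""
    q_start, t_start = qpos, tpos
    while q_start > 0 and t_start > 0 and query[q_start - 1] == target[t_start - 1]:
        q_start -= 1
        t_start -= 1
    q_end, t_end = qpos + w, tpos + w
    while q_end < len(query) and t_end < len(target) and query[q_end] == target[t_end]:
        q_end += 1
        t_end += 1
    return query[q_start:q_end]


hits = seed_hits("AAGA", "TTAAG")
print(hits)
print([extend_hit("AAGA", "TTAAG", q, t) for q, t in hits])
```

Because the index is built once per database sequence, each query only touches positions that share a word with it, which is where the speedup over full dynamic programming comes from; the price is that alignments with no exact word match are missed.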
Multiple Sequence Alignment
we've only discussed aligning 2 sequences, but we may want to do more
  discover common motifs in a set of sequences (e.g. DNA sequences that bind the same protein)
  characterize a set of sequences (e.g. a protein family)
much more complex
Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences
Next Time
basic molecular biology
sequence alignment
probabilistic sequence models
gene expression analysis
protein structure prediction
  by Ameet Soni