String Edit Distance (and intro to dynamic programming) Lecture #4 Computational Linguistics CMPSCI 591N, Spring 2006

String Edit Distance (and intro to dynamic programming) Lecture # omputational Linguistics MPSI 59N, Spring 6 University of Massachusetts mherst ndrew Mcallum

Dynamic Programming (Not much to do with programming in the S sense.) Dynamic programming is efficient in finding optimal solutions for cases with lots of overlapping sub-problems. It solves problems by recombining solutions to sub-problems, when the sub-problems themselves may share sub-sub-problems.

Fibonacci Numbers 5 8...

alculating Fibonacci Numbers F(n) = F(n-) + F(n-), where F()=, F()=. Non-Dynamic Programming implementation def fib(n): if n == or n == : return n else: return fib(n-) + fib(n-) For fib(8), how many calls to function fib(n)?

DP Example: alculating Fibonacci Numbers Dynamic Programming: avoid repeated calls by remembering function values already calculated. table = {} def fib(n): global table if table.has_key(n): return table[n] if n == or n == : table[n] = n return n else: value = fib(n-) + fib(n-) table[n] = value return value

DP Example: alculating Fibonacci Numbers...or alternately, in a list instead of a dictionary... def fib(n): table = [] * (n+) table[] = table[] = for i in range(,n+): table[i] = table[i-] + table[i-] return table[n] We will see this pattern many more times in this course:. reate a table (of the right dimensions to describe our problem.. Fill the table, re-using solutions to previous sub-problems.

String Edit Distance Given two strings (sequences) return the distance between the two strings as measured by......the minimum number of character edit operations needed to turn one sequence into the other. ndrew mdrewz. substitute m to n. delete the z Distance =

String distance metrics: Levenshtein Given strings s and t Distance is shortest sequence of edit commands that transform s to t, (or equivalently s to t). Simple set of operations: opy character from s over to t (cost ) Delete a character in s (cost ) Insert a character in t (cost ) Substitute one character for another (cost ) This is Levenshtein distance

Levenshtein distance - example distance( William ohen, Willliam ohon ) S O E I N H O _ M I L L L I W N H O _ M I L L I W s t edit op cost so far... alignment gap lignment is a little bit like a parse.

Finding the Minimum What is the minimum number of operations for...? nother fine day in the park nybody can see him pick the ball Not so easy... not so clear. Not only are the strings, longer, but is isn t immediately obvious where the alignments should happen. What if we consider all possible alignments by brute force? How many alignments are there?

Dynamic Program Table for String Edit Measure distance between strings PRK SPKE P R K S P K E c ij c ij = the number of edit operations needed to align P with SP.

Dynamic Programming to the Rescue! How to take our big problem and chop it into building-block pieces. Given some partial solution, it isn t hard to figure out what a good next immediate step is. Partial solution = This is the cost for aligning s up to position i with t up to position j. Next step = In order to align up to positions x in s and y in t, should the last operation be a substitute, insert, or delete?

Dynamic Program Table for String Edit Measure distance between strings PRK SPKE Edit operations for turning SPKE into PRK S P delete P R K insert K E substitute

Dynamic Program Table for String Edit Measure distance between strings PRK SPKE P R K c c c c c 5 S c c c c c P c c c c c c c??? K E

Dynamic Program Table for String Edit P R K c c c c c 5 S c c c c c P c c subst c delete c c c insert c??? K E D(i,j) = score of best alignment from s..si to t..tj = min D(i-,j-), if si=tj //copy D(i-,j-)+, if si!=tj //substitute D(i-,j)+ //insert D(i,j-)+ //delete

omputing Levenshtein distance - D(i,j) = score of best alignment from s..si to t..tj = min D(i-,j-) + d(si,tj) //subst/copy D(i-,j)+ //insert D(i,j-)+ //delete (simplify by letting d(c,d)= if c=d, else) also let D(i,)=i (for i inserts) and D(,j)=j

Dynamic Program Table Initialized P R K S P K E 5 D(i,j) = score of best alignment from s..si to t..tj D(i-,j-)+d(si,tj) //substitute D(i-,j)+ //insert = min D(i,j-)+ //delete

Dynamic Program Table... filling in P R K S P K E 5 D(i,j) = score of best alignment from s..si to t..tj D(i-,j-)+d(si,tj) //substitute D(i-,j)+ //insert = min D(i,j-)+ //delete

Dynamic Program Table... filling in S P K E 5 P D(i,j) = score of best alignment from s..si to t..tj = min R D(i-,j-)+d(si,tj) D(i-,j)+ D(i,j-)+ K //substitute //insert //delete Final cost of aligning all of both strings.

DP String Edit Distance def stredit (s,s): "alculate Levenstein edit distance for strings s and s." len = len(s) # vertically len = len(s) # horizontally # llocate the table table = [None]*(len+) for i in range(len+): table[i] = []*(len+) # Initialize the table for i in range(, len+): table[i][] = i for i in range(, len+): table[][i] = i # Do dynamic programming for i in range(,len+): for j in range(,len+): if s[j-] == s[i-]: d = else: d = table[i][j] = min(table[i-][j-] + d, table[i-][j]+, table[i][j-]+)

Remebering the lignment (trace) D(i,j) = min D(i-,j-) + d(si,tj) //subst/copy D(i-,j)+ //insert D(i,j-)+ //delete trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (may be more than ) M O H N 5 O H E N 5 5 5 5

Three Enhanced Variants Needleman-Munch Variable costs Smith-Waterman Find longest soft matching subsequence ffine Gap Distance Make repeated deletions (insertions) cheaper (Implement one for homework?)

Needleman-Wunch distance D(i,j) = min D(i-,j-) + d(si,tj) //subst/copy D(i-,j) + G //insert D(i,j-) + G //delete d(c,d) is an arbitrary distance function on characters (e.g. related to typo frequencies, amino acid substitutibility, etc) G = gap cost William ohen Wukkuan igeb

Smith-Waterman distance Instead of looking at each sequence in its entirety, this compares segments of all possible lengths and chooses whichever maximize the similarity measure. For every cell the algorithm calculates all possible paths leading to it. These paths can be of any length and can contain insertions and deletions.

Smith-Waterman distance D(i,j) = min //start over D(i-,j-) + d(si,tj) //subst/copy D(i-,j) + G //insert D(i,j-) + G //delete O H E N G = d(c,c) = - d(c,d) = + M O - - - - - - - - - H - -6-5 - N - -5-5 -7

Example output from Python s ' a l l o n g e r 5 6 7 8 9 l * o *- - u *- - n *- - - g 5 - *-5 - - e 6 - - *-7-6 (My implementation of HW#, task choice #. -Mcallum)

ffine gap distances Smith-Waterman fails on some pairs that seem quite similar: William W. ohen William W. Don t call me Dubya ohen Intuitively, single a single long long insertions are is cheaper than a lot lot of of short insertions

ffine gap distances - Idea: urrent cost of a gap of n characters: ng Make this cost: + (n-)b, where is cost of opening a gap, and B is cost of continuing a gap.

ffine gap distances - D(i,j) = max D(i-,j-) D(i-,j-) + d(si,tj) d(si,tj) //subst/copy D(i-,j)- IS(I-,j-) + d(si,tj) //insert D(i,j-)- //delete IT(I-,j-) + d(si,tj) IS(i,j) = max D(i-,j) - IS(i-,j) - B Best score in which si is aligned with a gap IT(i,j) = max D(i,j-) - IT(i,j-) - B Best score in which tj is aligned with a gap

ffine gap distances as automata -d(si,tj) IS -B -d(si,tj) D - - -d(si,tj) IT -B

Generative version of affine gap automata (Bilenko&Mooney, TechReport ) HMM emits pairs: (c,d) in state M, pairs (c,-) in state D, and pairs (-,d) in state I. For each state there is a multinomial distribution on pairs. The HMM can trained with EM from a sample of pairs of matched strings (s,t) E-step is forward-backward; M-step uses some ad hoc smoothing

ffine gap edit-distance learning: experiments results (Bilenko & Mooney) Experimental method: parse records into fields; append a few key fields together; sort by similarity; pick a threshold T and call all pairs with distance(s,t) < T duplicates ; picking T to maximize F-measure.

ffine gap edit-distance learning: experiments results (Bilenko & Mooney)

ffine gap edit-distance learning: experiments results (Bilenko & Mooney) Precision/recall for MILING dataset duplicate detection

ffine gap distances experiments (from Mcallum, Nigam,Ungar KDD) Goal is to match data like this:

The assignment Homework # Start with my stredit.py code Make some modifications Write a little about your experiences Some possible modifications Implement Needleman-Wunch, Smith-Waterman, or ffine Gap Distance. reate a little spell-checker: if entered word isn t in the dictionary, return the dictionary word that is closest. hange implementation to operate on sequences of words rather than characters... get an online translation dictionary, and find alignments between English & French or English & Russian! Try to learn the parameters of the function from data. (Tough.)