String Edit Distance (and intro to dynamic programming) Lecture #4 Computational Linguistics CMPSCI 591N, Spring 2006


University of Massachusetts Amherst
Andrew McCallum

Dynamic Programming
(Not much to do with "programming" in the CS sense.)
Dynamic programming is efficient in finding optimal solutions for cases with lots of overlapping sub-problems. It solves problems by recombining solutions to sub-problems, when the sub-problems themselves may share sub-sub-problems.

Fibonacci Numbers
0, 1, 1, 2, 3, 5, 8, ...

Calculating Fibonacci Numbers
F(n) = F(n-1) + F(n-2), where F(0) = 0, F(1) = 1.
Non-Dynamic Programming implementation:
    def fib(n):
        if n == 0 or n == 1:
            return n
        else:
            return fib(n-1) + fib(n-2)
For fib(8), how many calls to function fib(n)?
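One way to answer the question is simply to count. The sketch below is an instrumented copy of the naive recursion; the names fib_counting_calls and count_fib_calls are made up for this illustration and are not part of the lecture code.

```python
def fib_counting_calls(n, counter):
    """Naive recursive Fibonacci that also counts its own invocations."""
    counter[0] += 1  # record one more call
    if n == 0 or n == 1:
        return n
    return fib_counting_calls(n - 1, counter) + fib_counting_calls(n - 2, counter)


def count_fib_calls(n):
    """Return (F(n), number of calls the naive recursion makes)."""
    counter = [0]  # one-element list so the recursion can mutate it
    value = fib_counting_calls(n, counter)
    return value, counter[0]


print(count_fib_calls(8))  # (21, 67)
```

The count obeys calls(n) = 1 + calls(n-1) + calls(n-2), so it grows as fast as the Fibonacci numbers themselves.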

DP Example: Calculating Fibonacci Numbers
Dynamic Programming: avoid repeated calls by remembering function values already calculated.
    table = {}
    def fib(n):
        global table
        if n in table:  # dict.has_key() no longer exists in Python 3
            return table[n]
        if n == 0 or n == 1:
            table[n] = n
            return n
        else:
            value = fib(n-1) + fib(n-2)
            table[n] = value
            return value
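The same remember-what-you-computed idea is available in the Python standard library as functools.lru_cache. This is an alternative sketch, not the lecture's code; the name fib_cached is made up here.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # unbounded cache: every computed F(n) is remembered
def fib_cached(n):
    if n == 0 or n == 1:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)


print(fib_cached(30))  # 832040
```

Each distinct argument is computed once, so the number of real computations grows linearly in n rather than exponentially.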

DP Example: Calculating Fibonacci Numbers
...or alternately, in a list instead of a dictionary...
    def fib(n):
        if n < 2:
            return n  # also avoids indexing table[1] when n == 0
        table = [0] * (n+1)
        table[0] = 0
        table[1] = 1
        for i in range(2, n+1):
            table[i] = table[i-1] + table[i-2]
        return table[n]
We will see this pattern many more times in this course:
1. Create a table (of the right dimensions to describe our problem).
2. Fill the table, re-using solutions to previous sub-problems.

String Edit Distance
Given two strings (sequences), return the "distance" between the two strings as measured by the minimum number of "character edit operations" needed to turn one sequence into the other.
    Andrew
    Amdrewz
1. substitute m to n
2. delete the z
Distance = 2

String distance metrics: Levenshtein
Given strings s and t, the distance is the cost of the shortest sequence of edit commands that transforms s to t (or equivalently t to s).
Simple set of operations:
    Copy character from s over to t (cost 0)
    Delete a character in s (cost 1)
    Insert a character in t (cost 1)
    Substitute one character for another (cost 1)
This is "Levenshtein distance".

Levenshtein distance - example
distance("William Cohen", "Willliam Cohon")
    s            W I L L - I A M _ C O H E N
    t            W I L L L I A M _ C O H O N
    edit op      C C C C I C C C C C C C S C
    cost so far  0 0 0 0 1 1 1 1 1 1 1 1 2 2
The "-" in s is an alignment gap.
Alignment is a little bit like a parse.

Finding the Minimum
What is the minimum number of operations for...?
    Another fine day in the park
    Anybody can see him pick the ball
Not so easy... not so clear. Not only are the strings longer, but it isn't immediately obvious where the alignments should happen. What if we consider all possible alignments by brute force? How many alignments are there?
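A rough way to see why brute force is hopeless is to count the alignments. The sketch below assumes an alignment is a sequence of columns, each one a copy/substitute, an insertion, or a deletion, and that different orderings of neighboring insertions and deletions count as different alignments. Under that assumption the count for strings of length m and n satisfies N(m,n) = N(m-1,n-1) + N(m-1,n) + N(m,n-1) with N(m,0) = N(0,n) = 1 (these are the Delannoy numbers). The function name count_alignments is made up for this illustration.

```python
def count_alignments(m, n):
    """Count alignments between a length-m and a length-n string.

    Assumes each alignment column is a copy/substitute, an insert, or a delete.
    """
    # Row 0 and column 0 are all 1: only one way to align against an empty prefix.
    table = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = table[i - 1][j - 1] + table[i - 1][j] + table[i][j - 1]
    return table[m][n]


print(count_alignments(4, 5))  # 681 alignments already for a 4-letter and a 5-letter word
s = "Another fine day in the park"
t = "Anybody can see him pick the ball"
print(count_alignments(len(s), len(t)))  # far too many to enumerate
```

Note that the counting itself is done with a table filled from smaller sub-problems, the same pattern used for Fibonacci.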

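Dynamic Program Table for String Edit
Measure distance between strings PARK and SPAKE.
[Table: columns labeled P, A, R, K; rows labeled S, P, A, K, E; one cell marked c_ij.]
c_ij = the number of edit operations needed to align a prefix of PARK with a prefix of SPAKE.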

Dynamic Programming to the Rescue!
How to take our big problem and chop it into building-block pieces: given some partial solution, it isn't hard to figure out what a good next immediate step is.
Partial solution = "This is the cost for aligning s up to position i with t up to position j."
Next step = "In order to align up to positions x in s and y in t, should the last operation be a substitute, insert, or delete?"

Dynamic Program Table for String Edit
Measure distance between strings PARK and SPAKE.
Edit operations for turning SPAKE into PARK:
[Figure: SPAKE aligned against PARK, with individual steps labeled delete, insert, and substitute.]

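Dynamic Program Table for String Edit
Measure distance between strings PARK and SPAKE.
[Table: columns labeled P, A, R, K; rows labeled S, P, A, K, E; cells already computed are marked c, and the next cell to compute is marked ???.]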

Dynamic Program Table for String Edit
[Table: same layout; the three already-computed neighbors of the ??? cell are labeled subst, delete, and insert.]
D(i,j) = score of best alignment from s1..si to t1..tj
       = min of:
           D(i-1,j-1),      if si = tj    // copy
           D(i-1,j-1) + 1,  if si != tj   // substitute
           D(i-1,j) + 1                   // insert
           D(i,j-1) + 1                   // delete

Computing Levenshtein distance
D(i,j) = score of best alignment from s1..si to t1..tj
       = min of:
           D(i-1,j-1) + d(si,tj)   // subst/copy
           D(i-1,j) + 1            // insert
           D(i,j-1) + 1            // delete
(simplify by letting d(c,d) = 0 if c = d, 1 else)
also let D(i,0) = i (for i inserts) and D(0,j) = j

Dynamic Program Table Initialized
            P  A  R  K
         0  1  2  3  4
      S  1
      P  2
      A  3
      K  4
      E  5
D(i,j) = score of best alignment from s1..si to t1..tj
       = min of:
           D(i-1,j-1) + d(si,tj)   // substitute
           D(i-1,j) + 1            // insert
           D(i,j-1) + 1            // delete

Dynamic Program Table... filling in
[Table: the same PARK / SPAKE table, filled in one cell at a time over several slides; each new cell is the min over its three already-computed neighbors, using the same recurrence for D(i,j).]

Dynamic Program Table... filling in
[Table: completed table for PARK and SPAKE.]
D(i,j) = score of best alignment from s1..si to t1..tj
       = min of:
           D(i-1,j-1) + d(si,tj)   // substitute
           D(i-1,j) + 1            // insert
           D(i,j-1) + 1            // delete
Final cost of aligning all of both strings: the last cell of the table.
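To see the filled-in table for PARK and SPAKE, the recurrence and the boundary conditions D(i,0) = i, D(0,j) = j can be run directly. This is a minimal sketch; the helper names edit_table and print_table are made up here, and putting SPAKE on the rows and PARK on the columns is a layout choice.

```python
def edit_table(s, t):
    """Return the full Levenshtein table D, with s on the rows and t on the columns."""
    rows, cols = len(s), len(t)
    D = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        D[i][0] = i
    for j in range(1, cols + 1):
        D[0][j] = j
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + d,  # substitute or copy
                          D[i - 1][j] + 1,      # gap in t
                          D[i][j - 1] + 1)      # gap in s
    return D


def print_table(s, t, D):
    """Print D with the characters of t as column labels and s as row labels."""
    print(" " * 5 + " ".join(f"{c:>2}" for c in t))
    for i, row in enumerate(D):
        label = s[i - 1] if i > 0 else " "
        print(label + " " + " ".join(f"{v:2d}" for v in row))


D = edit_table("SPAKE", "PARK")
print_table("SPAKE", "PARK", D)
print("distance:", D[-1][-1])  # distance: 3
```

The bottom-right cell holds 3, which matches an edit script such as: delete S, insert R, delete E.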

DP String Edit Distance
    def stredit(s1, s2):
        "Calculate Levenshtein edit distance for strings s1 and s2."
        len1 = len(s1)  # vertically
        len2 = len(s2)  # horizontally
        # Allocate the table
        table = [None] * (len1 + 1)
        for i in range(len1 + 1):
            table[i] = [0] * (len2 + 1)
        # Initialize the table
        for i in range(1, len1 + 1):
            table[i][0] = i
        for i in range(1, len2 + 1):
            table[0][i] = i
        # Do dynamic programming
        for i in range(1, len1 + 1):
            for j in range(1, len2 + 1):
                if s2[j-1] == s1[i-1]:
                    d = 0
                else:
                    d = 1
                table[i][j] = min(table[i-1][j-1] + d,
                                  table[i-1][j] + 1,
                                  table[i][j-1] + 1)
        # The last cell is the cost of aligning all of both strings
        return table[len1][len2]

Remembering the Alignment (trace)
D(i,j) = min of:
           D(i-1,j-1) + d(si,tj)   // subst/copy
           D(i-1,j) + 1            // insert
           D(i,j-1) + 1            // delete
A "trace" indicates where the min value came from, and can be used to find edit operations and/or a best alignment (there may be more than 1).
[Table: edit-distance table for MCCOHN against COHEN with the trace marked.]
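A trace does not have to be stored separately: it can be recovered by walking back from the last cell and checking which neighbor produced the minimum. The sketch below is one way to do that, not the lecture's implementation. The function name edit_script is made up, ties are broken in favor of the diagonal, and the operation names are chosen from the point of view of turning s into t, which may not match the insert/delete labels on the slide.

```python
def edit_script(s, t):
    """Return (distance, ops), where ops is one minimum-cost way to turn s into t."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i
    for j in range(1, n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + d, D[i - 1][j] + 1, D[i][j - 1] + 1)

    # Walk back from the last cell, asking where each min value came from.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            d = 0 if s[i - 1] == t[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + d:
                ops.append(("copy" if d == 0 else "substitute", s[i - 1], t[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append(("delete", s[i - 1], "-"))  # character of s with no partner in t
            i -= 1
        else:
            ops.append(("insert", "-", t[j - 1]))  # character of t with no partner in s
            j -= 1
    ops.reverse()
    return D[m][n], ops


distance, ops = edit_script("Amdrewz", "Andrew")
print(distance)  # 2
for op in ops:
    print(op)
```

For "Amdrewz" and "Andrew" the non-copy operations come out as a substitution of m by n and a deletion of z, matching the earlier example.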

Three Enhanced Variants
Needleman-Wunsch: variable costs
Smith-Waterman: find longest "soft matching" subsequence
Affine Gap Distance: make repeated deletions (insertions) cheaper
(Implement one for homework?)

Needleman-Wunsch distance
D(i,j) = min of:
           D(i-1,j-1) + d(si,tj)   // subst/copy
           D(i-1,j) + G            // insert
           D(i,j-1) + G            // delete
d(c,d) is an arbitrary distance function on characters (e.g. related to typo frequencies, amino acid substitutability, etc.)
G = "gap cost"
    William Cohen
    Wukkuan Cigeb
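The recurrence translates almost line for line into code once d and G are passed in as parameters. The sketch below is illustrative only: keyboard_distance is a made-up character distance (0 for equal characters, 0.5 for neighbors on the same QWERTY row, 1 otherwise) standing in for "related to typo frequencies", and the boundary costs D(i,0) = i*G, D(0,j) = j*G are an assumption that generalizes the Levenshtein initialization.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def keyboard_distance(c, d):
    """Hypothetical typo-based cost: neighbors on a QWERTY row are cheap to confuse."""
    c, d = c.lower(), d.lower()
    if c == d:
        return 0.0
    for row in QWERTY_ROWS:
        if c in row and d in row and abs(row.index(c) - row.index(d)) == 1:
            return 0.5
    return 1.0


def needleman_wunsch(s, t, d, G):
    """Edit distance with character cost function d and gap cost G."""
    m, n = len(s), len(t)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * G
    for j in range(1, n + 1):
        D[0][j] = j * G
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j - 1] + d(s[i - 1], t[j - 1]),  # subst/copy
                          D[i - 1][j] + G,                          # gap
                          D[i][j - 1] + G)                          # gap
    return D[m][n]


print(needleman_wunsch("William Cohen", "Wukkuan Cigeb", keyboard_distance, 1.0))  # 4.0
```

Every mismatch in this pair is a slip to a neighboring key, so the eight substitutions cost 0.5 each instead of 1.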

Smith-Waterman distance
Instead of looking at each sequence in its entirety, this compares segments of all possible lengths and chooses whichever maximizes the similarity measure. For every cell the algorithm calculates all possible paths leading to it. These paths can be of any length and can contain insertions and deletions.

Smith-Waterman distance
D(i,j) = min of:
           0                       // start over
           D(i-1,j-1) + d(si,tj)   // subst/copy
           D(i-1,j) + G            // insert
           D(i,j-1) + G            // delete
G = 1
d(c,c) = -2
d(c,d) = +1
           C   O   H   E   N
      M    0   0   0   0   0
      C   -2  -1   0   0   0
      C   -2  -1   0   0   0
      O   -1  -4  -3  -2  -1
      H    0  -3  -6  -5  -4
      N    0  -2  -5  -5  -7
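The recurrence with the "start over" option can be sketched as follows, using the slide's costs G = 1, d(c,c) = -2, d(c,d) = +1 as defaults. Initializing the boundary row and column to 0, and reporting the most negative cell as the best local match, are assumptions of this sketch; the function name smith_waterman is made up here.

```python
def smith_waterman(s, t, G=1, match=-2, mismatch=1):
    """Return (best, D): the most negative cell value and the full table.

    Lower is better here, and 0 means "start over": no useful partial match.
    """
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]  # boundary cells stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = min(0,                    # start over
                          D[i - 1][j - 1] + d,  # subst/copy
                          D[i - 1][j] + G,      # gap
                          D[i][j - 1] + G)      # gap
            best = min(best, D[i][j])
    return best, D


best, D = smith_waterman("MCCOHN", "COHEN")
print(best)  # -7
for row_label, row in zip("MCCOHN", D[1:]):
    print(row_label, row[1:])
```

The -7 comes from matching C, O, H (-2 each), skipping E with one gap (+1), and matching N (-2).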

Example output from Python
[Table: Smith-Waterman scores for "allonger" against "lounge", with * marking the cells on the trace.]
(My implementation of one of the homework task choices. -McCallum)

Affine gap distances
Smith-Waterman fails on some pairs that seem quite similar:
    William W. Cohen
    William W. "Don't call me Dubya" Cohen
Intuitively, a single long insertion is "cheaper" than a lot of short insertions.

Affine gap distances
Idea: the current cost of a "gap" of n characters is nG. Make this cost A + (n-1)B, where A is the cost of "opening" a gap, and B is the cost of "continuing" a gap.
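A little arithmetic shows what the change buys. In the sketch below the values A = 2, B = 0.5, and G = 1 are made up for illustration, as are the function names.

```python
def linear_gap_cost(n, G):
    """Cost of a gap of n characters when every gap character costs G."""
    return n * G


def affine_gap_cost(n, A, B):
    """Cost of a gap of n characters: A to open it, B for each further character."""
    return A + (n - 1) * B if n > 0 else 0


A, B, G = 2.0, 0.5, 1.0  # hypothetical costs

# Linear costs cannot tell one long gap from many short ones:
print(linear_gap_cost(5, G), 5 * linear_gap_cost(1, G))        # 5.0 5.0
# Affine costs make one gap of 5 much cheaper than five gaps of 1:
print(affine_gap_cost(5, A, B), 5 * affine_gap_cost(1, A, B))  # 4.0 10.0
```

With B smaller than A, inserting a whole phrase in one place is penalized far less than scattering the same number of characters around.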

Affine gap distances
D(i,j) = max of:
           D(i-1,j-1) + d(si,tj)    // subst/copy
           IS(i-1,j-1) + d(si,tj)
           IT(i-1,j-1) + d(si,tj)
IS(i,j) = max of:
           D(i-1,j) - A
           IS(i-1,j) - B
        = best score in which si is aligned with a "gap"
IT(i,j) = max of:
           D(i,j-1) - A
           IT(i,j-1) - B
        = best score in which tj is aligned with a "gap"
The IS and IT terms take the place of the plain insert and delete terms of the earlier recurrence. Note that this version maximizes a score rather than minimizing a cost.
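The three-table recurrence can be sketched as below for aligning all of s with all of t. Several things here are assumptions and not taken from the slides: the boundary initialization (a leading gap of length k costs A + (k-1)B, impossible states are minus infinity), taking the answer as the max over the three tables in the last cell, and the scores match = 2, mismatch = -1, A = 2, B = 0.5. The function name affine_gap_score is made up.

```python
NEG_INF = float("-inf")


def affine_gap_score(s, t, match=2.0, mismatch=-1.0, A=2.0, B=0.5):
    """Best global alignment score where a gap of n characters costs A + (n-1)*B."""
    m, n = len(s), len(t)
    # D: si aligned with tj; IS: si aligned with a gap; IT: tj aligned with a gap.
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    IS = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    IT = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        IS[i][0] = -A - (i - 1) * B  # s starts with a gap of length i
    for j in range(1, n + 1):
        IT[0][j] = -A - (j - 1) * B  # t starts with a gap of length j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1]) + d
            IS[i][j] = max(D[i - 1][j] - A,   # open a gap
                           IS[i - 1][j] - B)  # continue a gap
            IT[i][j] = max(D[i][j - 1] - A,
                           IT[i][j - 1] - B)
    return max(D[m][n], IS[m][n], IT[m][n])


print(affine_gap_score("abcde", "abde"))     # 6.0: four matches, one gap of length 1
print(affine_gap_score("abXXXXcd", "abcd"))  # 4.5: four matches, one gap of length 4
```

Growing the gap from one character to four lowers the score by only 3 * B = 1.5, which is the behavior the "Dubya" example asks for.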

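Affine gap distances as automata
[Figure: three-state automaton with states D, IS, and IT. Transitions into D are labeled -d(si,tj); transitions from D into IS and into IT are labeled -A; the self-loops on IS and IT are labeled -B.]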

Generative version of affine gap automata (Bilenko & Mooney, TechReport)
HMM emits pairs: (c,d) in state M, pairs (c,-) in state D, and pairs (-,d) in state I.
For each state there is a multinomial distribution on pairs.
The HMM can be trained with EM from a sample of pairs of matched strings (s,t).
E-step is forward-backward; M-step uses some ad hoc smoothing.

Affine gap edit-distance learning: experimental results (Bilenko & Mooney)
Experimental method: parse records into fields; append a few key fields together; sort by similarity; pick a threshold T and call all pairs with distance(s,t) < T "duplicates", picking T to maximize F-measure.

Affine gap edit-distance learning: experimental results (Bilenko & Mooney)
[Figure: precision/recall for MAILING dataset duplicate detection.]

Affine gap distances experiments (from McCallum, Nigam, Ungar KDD 2000)
Goal is to match data like this:
[Figure: example records to be matched.]

The assignment
Homework:
    Start with my stredit.py code.
    Make some modifications.
    Write a little about your experiences.
Some possible modifications:
    Implement Needleman-Wunsch, Smith-Waterman, or Affine Gap Distance.
    Create a little spell-checker: if the entered word isn't in the dictionary, return the dictionary word that is closest.
    Change the implementation to operate on sequences of words rather than characters... get an online translation dictionary, and find alignments between English & French or English & Russian!
    Try to learn the parameters of the function from data. (Tough.)