Fast string matching



Similar documents
A Partition-Based Efficient Algorithm for Large Scale. Multiple-Strings Matching

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching

A Fast Pattern Matching Algorithm with Two Sliding Windows (TSW)

New Techniques for Regular Expression Searching

An efficient matching algorithm for encoded DNA sequences and binary strings

Improved Single and Multiple Approximate String Matching

LZ77. Example 2.10: Let T = badadadabaab and assume d max and l max are large. phrase b a d adadab aa b

A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms

A The Exact Online String Matching Problem: a Review of the Most Recent Results

String Search. Brute force Rabin-Karp Knuth-Morris-Pratt Right-Left scan

Automata and Formal Languages

Lempel-Ziv Coding Adaptive Dictionary Compression Algorithm

Knuth-Morris-Pratt Algorithm

ECE 358: Computer Networks. Solutions to Homework #4. Chapter 4 - The Network Layer

Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Karagpur

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

Improved Approach for Exact Pattern Matching

Python Loops and String Manipulation

Searching BWT compressed text with the Boyer-Moore algorithm and binary search

Robust Quick String Matching Algorithm for Network Security

CS 2112 Spring Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

CS103B Handout 17 Winter 2007 February 26, 2007 Languages and Regular Expressions

6.4 Special Factoring Rules

4.2 Sorting and Searching

Chapter 6: Episode discovery process

A Load Balancing Technique for Some Coarse-Grained Multicomputer Algorithms

Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

Boolean Algebra Part 1

CPSC 121: Models of Computation Assignment #4, due Wednesday, July 22nd, 2009 at 14:00

A FAST STRING MATCHING ALGORITHM

Improved Single and Multiple Approximate String Matching

Information, Entropy, and Coding

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

CS 1133, LAB 2: FUNCTIONS AND TESTING

What mathematical optimization can, and cannot, do for biologists. Steven Kelk Department of Knowledge Engineering (DKE) Maastricht University, NL

The wide window string matching algorithm

Reconfigurable FPGA Inter-Connect For Optimized High Speed DNA Sequencing

Linear Codes. Chapter Basics

College of information technology Department of software

A GUIDE TO NETWORK ANALYSIS by MICHAEL C GLEN

2 Integrating Both Sides

Cryptography and Network Security Department of Computer Science and Engineering Indian Institute of Technology Kharagpur

The Student-Project Allocation Problem

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory

REWARD System For Even Money Bet in Roulette By Izak Matatya

Arithmetic Coding: Introduction

1 Review of Newton Polynomials

Online EFFECTIVE AS OF JANUARY 2013

14.1 Rent-or-buy problem

Lecture 10: Regression Trees

A binary heap is a complete binary tree, where each node has a higher priority than its children. This is called heap-order property

RN-Codings: New Insights and Some Applications

Windows Server Performance Monitoring

Data Warehousing. Jens Teubner, TU Dortmund Winter 2014/15. Jens Teubner Data Warehousing Winter 2014/15 1

Today s Agenda. Automata and Logic. Quiz 4 Temporal Logic. Introduction Buchi Automata Linear Time Logic Summary

Lempel-Ziv Factorization: LZ77 without Window

Discrete Math in Computer Science Homework 7 Solutions (Max Points: 80)

Boolean Algebra (cont d) UNIT 3 BOOLEAN ALGEBRA (CONT D) Guidelines for Multiplying Out and Factoring. Objectives. Iris Hui-Ru Jiang Spring 2010

How to Create Subnets To create subnetworks, you take bits from the host portion of the IP address and reserve them to define the subnet address.

15 Limit sets. Lyapunov functions

The LCA Problem Revisited

GCE Computing. COMP3 Problem Solving, Programming, Operating Systems, Databases and Networking Report on the Examination.

C H A P T E R Regular Expressions regular expression

MACM 101 Discrete Mathematics I

Binary Coded Web Access Pattern Tree in Education Domain

NP-completeness and the real world. NP completeness. NP-completeness and the real world (2) NP-completeness and the real world

IMPROVED ALGORITHMS FOR STRING SEARCHING PROBLEMS

(Refer Slide Time: 02:17)

Lecture Notes on the Mathematics of Finance

Hardware Pattern Matching for Network Traffic Analysis in Gigabit Environments

Storage Optimization in Cloud Environment using Compression Algorithm

Lecture Notes on Actuarial Mathematics

How To Compare A Markov Algorithm To A Turing Machine

NP-Completeness and Cook s Theorem

THEORY of COMPUTATION

Before we discuss a Call Option in detail we give some Option Terminology:

Why data encryption is not data masking. Grid Tools Ltd

Sequential Data Structures

Lecture 9. Semantic Analysis Scoping and Symbol Table

CSE 123: Computer Networks Fall Quarter, 2014 MIDTERM EXAM

Two Algorithms for the Student-Project Allocation Problem

Lecture 2: Regular Languages [Fa 14]

DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky

FIRST AND FOLLOW Example

1 Approximating Set Cover

Transcription:

Fast string matching This exposition is based on earlier versions of this lecture and the following sources, which are all recommended reading: Shift-And/Shift-Or 1. Flexible Pattern Matching in Strings, Navarro, Raffinot, 2002, pages 15ff. 2. A nice overview of the plethora of string matching algorithms with implementations can be found under http://www-igm.univ-mlv.fr/~lecroq/ string. Suffix trees, suffix arrays 1. Dan Gusfield: Algorithms on Strings, Trees, and Sequences. Computer Science and Computational Biology. Cambridge University Press, Cambridge, 1997, pages 94ff. ISBN 0-521-58519-8 1

Thoughts about string matching We will get to know one very practical string matching algorithms for a single pattern and learn the basics about suffix trees and suffix arrays, two central data structures in computational biology. Let s start with the classical string matching problem: The task at hand is to find all occurrences of a given pattern p = p 1,..., p m in a text T = t 1,..., t n usually with n m. The algorithmic ideas of exact string matching are useful to know, although in computational biology algorithms for approximate string matching, or indexed methods are of more use. However, in online scenarios it is often not possible to precompute an index for finding exact matches. String matching is known for being amenable to approaches that range from the extremely theoretical to the extremely practical. 2

Thoughts about string matching (2) One example is the famous Knuth-Morris-Pratt algorithm which in practice twice as slow as the brute force algorithm, and the well-known Boyer-Moore algorithm which, in its original version, has a worst case running time of O(mn) but is quite fast in practice. Some easy terminology: Given strings x, y, and z, we say that x is a prefix of xy, a suffix of yx, and a factor (:=substring) of yxz. x y y x y x z prefix suffix factor In general, string matching algorithms follow three basic approaches. In each a search window of the size of the pattern is slid from left to right along the text and the pattern is searched within the window. The algorithms differ in the way the window is shifted: 3

Thoughts about string matching (3) 1. Prefix searching. For each position of the window we search the longest prefix of the window that is also a prefix of the pattern. search window T p forward search 2. Suffix searching. The search is conducted backwards along the search window. On average this can avoid to read some characters of the text and leads to sublinear average case algorithms. search window T p suffix search 3. Factor searching. The search is done backwards in the search window, looking for the longest suffix of the window that is also a factor of the pattern. search window T p factor search 4

Prefix based approaches 5

Prefix based approaches Suppose we have read the text up to position i and that we know the length of the longest suffix p of the text read that corresponds to a prefix of p. If p = p we have found an occurrence, otherwise we have to shift the search window in a safe way. The main algorithmic problem is to efficiently compute this length when reading the next character. There are two classical ways: 1. compute the longest suffix of the text read that is also a prefix of p (used by the Knuth-Morris-Pratt algorithm). This insures that the implied shift is safe and allows the detection of a match. 2. maintain the set of all prefixes of p that are also suffixes of the text read and update the set when reading a character (e.g., the Shift-And algorithm). 6

Prefix based approaches (2) Knuth-Morris-Pratt: T 1 2 i l p 1 2 1 2 q Why safe? 1 2 Last q characters of T match with first q characters of p. Determine maximal prefix of p[1... q] that is also suffix of p[1... q] (let l be its length). Move the window q l positions to the right. Why is it safe? 7

Prefix based approaches (3) Since the KMP algorithm is inferior in practice we concentrate on the Shift-And and the Shift-Or algorithms, which maintain the set of all prefixes of p that match a suffix of the text read. The algorithms use bit-parallelism to update this set for each new text character. The set is represented by a bit mask D = d m,..., d 1. We start with the Shift-And algorithm, which is easier to explain. The Shift-Or algorithm then results as an implementation trick. 8

We keep the following invariant: Shift-And and Shift-Or There is a 1 at the j-th position of D if and only if p 1,..., p j is a suffix of t 1,..., t i. T i p 1 j 1 j 2 m D: m j 2 j 1 1 0 1 1 0 If the size of p is less than the word length, this array will fit into a computer register. When reading the next character t i+1, we have to compute the new set D. We use the following fact: Observation. A position j + 1 in the set D will be active if and only if 1. the position j was active in D, that is, p 1,..., p j was a suffix of t 1,..., t i, and 2. t i+1 matches p j+1. The algorithm builds a table B which stores a bit mask b m,..., b 1 for each character of Σ. The mask B[c] has the j-th bit set if p j = c. 9

Shift-And and Shift-Or (2) 1 Shift-And(p, T ); 2 // Preprocessing 3 for c Σ do B[c] = 0 m ; od 4 for j 1... m do B[p j ] = B[p j ] 0 m j 10 j 1 od 5 // Searching 6 D = 0 m ; 7 for pos 1... n do 8 D = ((D 1) 0 m 1 1) & B[t pos ]; 9 if D & 10 m 1 0 m 10 then report an occurrence at position pos m + 1; 11 fi 12 od 10

Shift-And and Shift-Or (3) Initially we set D = 0 m and for each new character t pos we update D using the formula D = ((D 1) 0 m 1 1) & B[t pos ]. This update maintains our invariant using the above observation. The shift operation marks all positions as potential prefix-suffix matches that were such matches in step i (notice that this includes the empty string ɛ). In addition, to stay a match, the character t i+1 has to match p at those positions. This is achieved by applying an &-operation with the corresponding bitmask B[t i+1 ]. 11

Shift-And and Shift-Or (4) So what is the Shift-Or algorithm about? It is just an implementation trick to avoid a bit operation, namely the 0 m 1 1 in line??. In the Shift-Or algorithm we complement all bit masks of B and use a complemented bit mask D. Now the operator will introduce a 0 to the right of D and the new suffix stemming from the empty string is already in D. Obviously we have to use a bit instead of an & and report a match whenever d m = 0. Lets look at an example of Shift-And and Shift-Or. 12

Example Find all occurrences of the pattern P=atat in the text T=atacgatatata. Shift-And Shift-Or B[a] = 0101 B[a] = 1010 B[t] = 1010 B[t] = 0101 B[*] = 0000 B[*] = 1111 D = 0000 D = 1111 1 Reading a 0001 1110 0101 1010 ---- ---- 0001 1110 13

2 Reading T 3 Reading A 0011 1100 0101 1010 1010 0101 0101 1010 ------------- ------------- 0010 1101 0101 1010 Example (2) 4 Reading C 5 Reading G 1011 0100 0001 1110 0000 1111 0000 1111 ------------- ------------- 0000 1111 0000 1111 14

6 Reading a 7 Reading t 0001 1110 0010 1100 0101 1010 1010 0101 ------------- ------------- 0001 1110 0010 1101 Example (3) 8 Reading a 9 Reading t 0101 1010 1011 0100 0101 1010 1010 0101 ------------- ------------- 0101 1010 1010 0101 Hence, in step 9, we found the first occurrence of P at position 9 4 + 1 = 6. Note: The bit order is reversed in the Java applet. The running time of the algorithms is O(n), assuming that the operations on D can be done in constant time. 15