Knuth-Morris-Pratt Algorithm




December 18, 2011

Outline: Definition, History, Components of KMP, Example, Run-Time Analysis, Advantages and Disadvantages, References

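Definition: KMP is best known as a linear-time algorithm for exact string matching. It compares the pattern against the text from left to right and, after a mismatch, can shift the pattern by more than one position. It preprocesses the pattern to avoid trivial comparisons, so matches that have already been found are never recomputed.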

History: This algorithm was conceived by Donald Knuth and Vaughan Pratt, and independently by James H. Morris. The three published it jointly in 1977.

History: Knuth, Morris and Pratt discovered the first linear-time string-matching algorithm by analyzing the naive algorithm. KMP keeps the information gathered during the scan of the text, which the naive approach throws away. By avoiding this waste, it achieves a running time of O(m + n), where m is the length of the pattern and n is the length of the text. The implementation is efficient because it keeps the total number of comparisons of the pattern against the input string linear in the input size.

Components of KMP: The prefix function π. It preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself. With 0-based indexing, its value at position j is defined as the size of the largest prefix of P[0..j-1] that is also a suffix of P[1..j]. It indicates how much of the last comparison can be reused after a mismatch, and it makes backtracking on the string S unnecessary.
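As a minimal sketch, the definition above can be evaluated directly by trying every candidate length. The function name is hypothetical and the indexing is 0-based, as in the definition. This is a brute-force check of the definition, not the linear-time computation that KMP itself uses.

```python
def prefix_function_by_definition(p: str) -> list[int]:
    """f[j] = size of the largest prefix of p[0..j-1] that is also a suffix of p[1..j]."""
    f = []
    for j in range(len(p)):
        best = 0
        # A candidate of length k is the prefix p[0..k-1]. As a suffix of
        # p[1..j] it occupies p[j-k+1..j], so k can be at most j.
        for k in range(1, j + 1):
            if p[:k] == p[j - k + 1 : j + 1]:
                best = k
        f.append(best)
    return f


print(prefix_function_by_definition("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

Each value is found independently here, which costs far more than O(m) in total. The point is only to make the definition concrete.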

m ← length[p]
a[1] ← 0
k ← 0
for q ← 2 to m do
    while k > 0 and p[k + 1] ≠ p[q] do
        k ← a[k]
    end while
    if p[k + 1] = p[q] then
        k ← k + 1
    end if
    a[q] ← k
end for
return a
Here a = π.
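The pseudocode above can be sketched in Python roughly as follows. The name compute_prefix is an assumption made for illustration, and the code uses 0-based indexing, so pi[q] here corresponds to a[q + 1] in the 1-based pseudocode.

```python
def compute_prefix(p: str) -> list[int]:
    """Prefix function of p; pi[q] is the length of the longest proper
    prefix of p[0..q] that is also a suffix of p[0..q]."""
    m = len(p)
    pi = [0] * m
    k = 0  # length of the prefix currently matched
    for q in range(1, m):
        # Fall back to shorter borders until p[q] can extend one of them.
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

The variable k only grows by one per iteration of the outer loop, and every pass through the inner loop shrinks it, which is why the whole computation takes O(m) time.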

Computation of the prefix function with an example: Let us consider how to compute π for the pattern p = ababaca. Initially: m = length[p] = 7, π[1] = 0, k = 0, where m is the length of the pattern, π[1] is the first prefix-function value, and k is the initial candidate match length.

In each step, k is the value carried over from the previous step.
Step 1: q = 2, k = 0, π[2] = 0. π so far: 0 0
Step 2: q = 3, k = 0, π[3] = 1. π so far: 0 0 1
Step 3: q = 4, k = 1, π[4] = 2. π so far: 0 0 1 2
Step 4: q = 5, k = 2, π[5] = 3. π so far: 0 0 1 2 3
Step 5: q = 6, k = 3, π[6] = 0. π so far: 0 0 1 2 3 0. Here p[6] = c matches neither p[4] nor p[2] nor p[1], so k falls back from 3 to π[3] = 1, then to π[1] = 0.
Step 6: q = 7, k = 0, π[7] = 1. π so far: 0 0 1 2 3 0 1

After iterating 6 times, the prefix function computation is complete:
q  1 2 3 4 5 6 7
p  a b a b a c a
π  0 0 1 2 3 0 1
The running time of the prefix function is O(m).

Components of KMP: The KMP matcher.
Step 1: Initialize the input variables: n = length of the text, m = length of the pattern, π = prefix function of the pattern p, q = number of characters matched.
Step 2: Set q = 0, the beginning of the match.
Step 3: Compare the first character of the pattern with the first character of the text. If there is no match, replace q with π[q]. If there is a match, increment q by 1.
Step 4: Check whether all the pattern elements have been matched with text elements. If not, repeat the search process. If yes, print the number of shifts taken by the pattern.
Step 5: Look for the next match.

n ← length[S]
m ← length[p]
a ← Compute-Prefix-Function(p)
q ← 0
for i ← 1 to n do
    while q > 0 and p[q + 1] ≠ S[i] do
        q ← a[q]
    end while
    if p[q + 1] = S[i] then
        q ← q + 1
    end if
    if q = m then
        print the number of shifts, i − m
        q ← a[q]
    end if
end for
Here a = π.
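A minimal Python sketch of the matcher, under the assumption that it returns a list of 0-based shifts rather than printing them. The names compute_prefix and kmp_search are hypothetical, and indexing is 0-based, so the reported shift i - m + 1 equals the 1-based pseudocode's i − m.

```python
def compute_prefix(p: str) -> list[int]:
    """Prefix function of p with 0-based indexing."""
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_search(s: str, p: str) -> list[int]:
    """Return every 0-based shift at which p occurs in s."""
    if not p:
        return []  # assumption: an empty pattern reports no matches
    pi = compute_prefix(p)
    shifts = []
    q = 0  # number of pattern characters matched so far
    for i, ch in enumerate(s):
        # On a mismatch, reuse the longest border instead of restarting.
        while q > 0 and p[q] != ch:
            q = pi[q - 1]
        if p[q] == ch:
            q += 1
        if q == len(p):
            shifts.append(i - len(p) + 1)
            q = pi[q - 1]  # keep going to find overlapping matches
    return shifts


print(kmp_search("abababa", "aba"))  # [0, 2, 4]
```

The index i into the text only ever moves forward. All of the fallback happens on q, the position in the pattern.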

Example of KMP algorithm: Now let us consider an example so that the algorithm can be clearly understood. Let us execute the KMP algorithm to find whether p occurs in S.

Initially: n = size of S = 15; m = size of p = 7.
Step 1: i = 1, q = 0. Comparing p[1] with S[1]: p[1] does not match S[1]. p is shifted one position to the right.
Step 2: i = 2, q = 0. Comparing p[1] with S[2]: p[1] matches S[2].
Step 3: i = 3, q = 1. Comparing p[2] with S[3]: p[2] does not match S[3]. Backtracking on p, comparing p[1] with S[3]: p[1] does not match S[3].
Step 4: i = 4, q = 0. Comparing p[1] with S[4]: p[1] does not match S[4].
Step 5: i = 5, q = 0. Comparing p[1] with S[5]: p[1] matches S[5].
Step 6: i = 6, q = 1. Comparing p[2] with S[6]: p[2] matches S[6].
Step 7: i = 7, q = 2. Comparing p[3] with S[7]: p[3] matches S[7].
Step 8: i = 8, q = 3. Comparing p[4] with S[8]: p[4] matches S[8].
Step 9: i = 9, q = 4. Comparing p[5] with S[9]: p[5] matches S[9].
Step 10: i = 10, q = 5. Comparing p[6] with S[10]: p[6] does not match S[10]. Backtracking on p: after the mismatch q = π[5] = 3, so p[4] is compared with S[10], and p[4] matches S[10].
Step 11: i = 11, q = 4. Comparing p[5] with S[11]: p[5] matches S[11].
Step 12: i = 12, q = 5. Comparing p[6] with S[12]: p[6] matches S[12].
Step 13: i = 13, q = 6. Comparing p[7] with S[13]: p[7] matches S[13].
Pattern p has been found to occur completely in string S. The total number of shifts that took place before the match was found is i − m = 13 − 7 = 6 shifts.
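The walkthrough can be replayed with a short Python sketch. The text S is not reproduced in this transcript, so the string below is an assumption: a 15-character text chosen to be consistent with the comparisons listed in the steps, with the last two characters picked arbitrarily. The pattern is taken to be ababaca, and the function names are hypothetical.

```python
def compute_prefix(p: str) -> list[int]:
    """Prefix function of p with 0-based indexing."""
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_trace(s: str, p: str):
    """Return ([(i, q at the start of step i)], [shifts]) with 1-based i."""
    pi = compute_prefix(p)
    steps, shifts, q = [], [], 0
    for i, ch in enumerate(s, start=1):  # i is 1-based, as in the walkthrough
        steps.append((i, q))
        while q > 0 and p[q] != ch:
            q = pi[q - 1]
        if p[q] == ch:
            q += 1
        if q == len(p):
            shifts.append(i - len(p))  # number of shifts, i - m
            q = pi[q - 1]
    return steps, shifts


S = "bacbabababacaab"  # assumed text, not taken from the source
steps, shifts = kmp_trace(S, "ababaca")
for i, q in steps[:13]:
    print(f"Step {i}: i = {i}, q = {q}")
print("shifts:", shifts)  # shifts: [6]
```

With this assumed text, the values of q at the start of steps 1 to 13 are 0, 0, 1, 0, 0, 1, 2, 3, 4, 5, 4, 5, 6.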

Run-Time analysis: O(m) - It is to compute the prefix function values. O(n) - It is to compare the pattern to the text. Total of O(n + m) run time.

Advantages and Disadvantages: Advantages: The running time of the KMP algorithm is O(m + n), which is optimal in the worst case. The algorithm never needs to move backwards in the input text, which makes it well suited to processing very large files.
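Because the text is never re-read, the matcher can consume it as a stream. The sketch below illustrates this under some assumptions: the text arrives as an iterable of string chunks, the pattern is non-empty, and the names chars_from_chunks and kmp_stream are hypothetical.

```python
from typing import Iterable, Iterator


def compute_prefix(p: str) -> list[int]:
    """Prefix function of p with 0-based indexing."""
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def chars_from_chunks(chunks: Iterable[str]) -> Iterator[str]:
    """Flatten chunks (e.g. blocks read from a file) into single characters."""
    for chunk in chunks:
        yield from chunk


def kmp_stream(chars: Iterable[str], p: str) -> Iterator[int]:
    """Yield 0-based shifts of p, reading each input character exactly once."""
    pi = compute_prefix(p)
    q = 0
    for i, ch in enumerate(chars):
        while q > 0 and p[q] != ch:
            q = pi[q - 1]
        if p[q] == ch:
            q += 1
        if q == len(p):
            yield i - len(p) + 1
            q = pi[q - 1]


chunks = ["bacbab", "ababac", "aab"]  # a match may span chunk boundaries
print(list(kmp_stream(chars_from_chunks(chunks), "ababaca")))  # [6]
```

Only the pattern, its prefix function, and the counter q are kept in memory, so the memory used does not depend on the length of the text.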

Advantages and Disadvantages: Disadvantages: KMP does not work as well as the size of the alphabet increases, because mismatches become more likely.

References:
Graham A. Stephen, String Searching Algorithms, 1994.
Donald Knuth, James H. Morris, Jr., Vaughan Pratt, Fast pattern matching in strings, 1977.
Thomas H. Cormen, Charles E. Leiserson, Introduction to Algorithms, second edition, The MIT Press, 2001.