String Searching. String Search. Spam Filtering. String Search



Similar documents
One Minute To Learn Programming: Finite Automata

Homework 3 Solutions

Bayesian Updating with Continuous Priors Class 13, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

Regular Sets and Expressions

How fast can we sort? Sorting. Decision-tree model. Decision-tree for insertion sort Sort a 1, a 2, a 3. CS Spring 2009

flex Regular Expressions and Lexical Scanning Regular Expressions and flex Examples on Alphabet A = {a,b} (Standard) Regular Expressions on Alphabet A

Reasoning to Solve Equations and Inequalities

Appendix D: Completing the Square and the Quadratic Formula. In Appendix A, two special cases of expanding brackets were considered:

Data Compression. Lossless And Lossy Compression

1.00/1.001 Introduction to Computers and Engineering Problem Solving Fall Final Exam

Solving the String Statistics Problem in Time O(n log n)

Small Businesses Decisions to Offer Health Insurance to Employees

Small Business Networking

5 a LAN 6 a gateway 7 a modem

Basic Research in Computer Science BRICS RS Brodal et al.: Solving the String Statistics Problem in Time O(n log n)

Lec 2: Gates and Logic

Experiment 6: Friction

Integration by Substitution

! What can a computer do? ! What can a computer do with limited resources? ! Don't talk about specific machines or problems.

1. Introduction Texts and their processing

Small Business Networking

DlNBVRGH + Sickness Absence Monitoring Report. Executive of the Council. Purpose of report

How To Network A Smll Business

Bypassing Space Explosion in Regular Expression Matching for Network Intrusion Detection and Prevention Systems

Section 5-4 Trigonometric Functions

Small Business Networking

Welch Allyn CardioPerfect Workstation Installation Guide

Operations with Polynomials

Two hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE. Date: Friday 16 th May Time: 14:00 16:00

A Visual and Interactive Input abb Automata. Theory Course with JFLAP 4.0

Treatment Spring Late Summer Fall Mean = 1.33 Mean = 4.88 Mean = 3.

APPLICATION NOTE Revision 3.0 MTD/PS-0534 August 13, 2008 KODAK IMAGE SENDORS COLOR CORRECTION FOR IMAGE SENSORS

How To Set Up A Network For Your Business

Outline of the Lecture. Software Testing. Unit & Integration Testing. Components. Lecture Notes 3 (of 4)

Regular Languages and Finite Automata

AntiSpyware Enterprise Module 8.5

Small Business Networking

Polynomial Functions. Polynomial functions in one variable can be written in expanded form as ( )

Distributions. (corresponding to the cumulative distribution function for the discrete case).

Example 27.1 Draw a Venn diagram to show the relationship between counting numbers, whole numbers, integers, and rational numbers.

Solutions for Selected Exercises from Introduction to Compiler Design

FORMAL LANGUAGES, AUTOMATA AND THEORY OF COMPUTATION EXERCISES ON REGULAR LANGUAGES

Logical or physical organisation and data independence

CS99S Laboratory 2 Preparation Copyright W. J. Dally 2001 October 1, 2001

Concept Formation Using Graph Grammars

Example A rectangular box without lid is to be made from a square cardboard of sides 18 cm by cutting equal squares from each corner and then folding

Application Bundles & Data Plans

Protocol Analysis / Analysis of Software Artifacts Kevin Bierhoff

Java CUP. Java CUP Specifications. User Code Additions You may define Java code to be included within the generated parser:

19. The Fermat-Euler Prime Number Theorem

Graphs on Logarithmic and Semilogarithmic Paper

Helicopter Theme and Variations

Binary Representation of Numbers Autar Kaw

Solution to Problem Set 1

Morgan Stanley Ad Hoc Reporting Guide

Factoring Polynomials

Learning Outcomes. Computer Systems - Architecture Lecture 4 - Boolean Logic. What is Logic? Boolean Logic 10/28/2010

Economics Letters 65 (1999) macroeconomists. a b, Ruth A. Judson, Ann L. Owen. Received 11 December 1998; accepted 12 May 1999

EasyMP Network Projection Operation Guide

Data replication in mobile computing

EQUATIONS OF LINES AND PLANES

How To Make A Network More Efficient

New Internet Radio Feature

LINEAR TRANSFORMATIONS AND THEIR REPRESENTING MATRICES

Enterprise Risk Management Software Buyer s Guide

P.3 Polynomials and Factoring. P.3 an 1. Polynomial STUDY TIP. Example 1 Writing Polynomials in Standard Form. What you should learn

Gene Expression Programming: A New Adaptive Algorithm for Solving Problems

Lectures 8 and 9 1 Rectangular waveguides

Quick Reference Guide: One-time Account Update

0.1 Basic Set Theory and Interval Notation

2 DIODE CLIPPING and CLAMPING CIRCUITS

Modular Generic Verification of LTL Properties for Aspects

Or more simply put, when adding or subtracting quantities, their uncertainties add.

Your duty, however, does not require disclosure of matter:

Advanced Baseline and Release Management. Ed Taekema

String Search. Brute force Rabin-Karp Knuth-Morris-Pratt Right-Left scan

A Network Management System for Power-Line Communications and its Verification by Simulation

Blackbaud The Raiser s Edge

Measuring Similarity between Graphs Based on the Levenshtein Distance

EFFICIENT VARIANTS OF THE BACKWARD-ORACLE-MATCHING ALGORITHM



Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management

EE247 Lecture 4. For simplicity, will start with all pole ladder type filters. Convert to integrator based form- example shown

4.11 Inner Product Spaces

Health insurance exchanges What to expect in 2014

Section 5.2, Commands for Configuring ISDN Protocols. Section 5.3, Configuring ISDN Signaling. Section 5.4, Configuring ISDN LAPD and Call Control

Virtual Machine. Part II: Program Control. Building a Modern Computer From First Principles.

Unit 6: Exponents and Radicals

Use Geometry Expressions to create a more complex locus of points. Find evidence for equivalence using Geometry Expressions.

JaERM Software-as-a-Solution Package

FDIC Study of Bank Overdraft Programs

6.2 Volumes of Revolution: The Disk Method

Engineer-to-Engineer Note

9 CONTINUOUS DISTRIBUTIONS

E-Commerce Comparison

Scalable Mining of Large Disk-based Graph Databases

Unit 29: Inference for Two-Way Tables

T H E S E C U R E T R A N S M I S S I O N P R O T O C O L O F S E N S O R A D H O C N E T W O R K

Transcription:

String Serch String Serching String serch: given pttern string p, find first mtch in text t. Model : cn't fford to preprocess the text. Krp-Rin Knuth-Morris-Prtt Boyer-Moore N = # chrcters in text M = # chrcters in pttern n n e e n l Serch Text typiclly N >> M Ex: N = 1 million, M = 1 e d e n e e n e e d l e n l d Serch Pttern n e e d l e M = 1, N = 6 Reference: Chpter 19, Algorithms in C, nd Edition, Roert Sedgewick. n n e e n l Successful Serch e d e n e e n e e d l e n l d Princeton University COS 6 Algorithms nd Dt Structures Spring 4 Kevin Wyne http://www.princeton.edu/~cos6 String Serch Spm Filtering String: Sequence of chrcters over some lphet. Ex lphets: inry, deciml, ASCII, UNICODE, DNA. Some pplictions. Prsers. Lexis/Nexis. Spm filters. Virus scnning. Digitl lirries. Screen scrpers. Word processors. We serch engines. Symol mnipultion. Biliogrphic retrievl. Nturl lnguge processing. Crnivore surveillnce system. Computtionl moleculr iology. Feture detection in digitized imges. 3 Spm filtering: ptterns indictive of spm. AMAZING GUARANTEE PROFITS herl Vigr This is one-time miling. There is no ctch. This messge is sent in complince with spm regultions. You're getting this messge ecuse you registered with one of our mrketing prtners. 4

Brute Force Anlysis of Brute Force Brute force: Check for pttern strting t every text position. pulic sttic int serch(string pttern, String text) { int M = pttern.length(); int N = text.length(); Anlysis of rute force. Running time depends on pttern nd text. Worst cse: M N comprisons. "Averge" cse: 1.1 N comprisons. (!) Slow if M nd N re lrge, nd hve lots of repetition. for (int i = ; i < N - M; i++) { int j; for (j = ; j < M; j++) { if (text.chrat(i+j)!= pttern.chrat(j)) rek; if (j == M) return i; return -1; return -1 if not found return offset i if found Serch Pttern Serch Text 5 6 Screen Scrping Find current stock price of Sun Microsystems. t.indexof(p): index of 1 st occurrence of pttern p in text t. Downlod html from http://finnce.yhoo.com/q?s=sunw Find 1 st string delimited y <> nd </> ppering fter Lst Trde pulic clss StockQuote { pulic sttic void min(string[] rgs) { String nme = "http://finnce.yhoo.com/q?s=" + rgs[]; In in = new In(nme); String input = in.redall(); int p = input.indexof("lst Trde:", ); int from = input.indexof("<>", p); int to = input.indexof("</>", from); String price = input.sustring(from + 3, to); System.out.println(price); % jv StockQuote sunw 5. % jv StockQuote im 96.84 Algorithmic Chllenges Theoreticl chllenge: liner-time gurntee. TST index costs ~ N lgn. Prcticl chllenge: void BACKUP. Often no room or time to sve text. Fundmentl lgorithmic prolem. Now is the time for ll people to come to the id of their prty.now is the time for ll good people to come to the id of theirprty.now is the time for mnygood people to come to the id of their prty.now is the time for ll good people to come to the id of their prty.now is the time for lot of good people to come to the id of their prty.now is the time for ll of the good people to come to the id of their prty.now is the time for ll good people to come to the id of their prty. Now is the time for ech good person to come to the id of their prty.now is the time for ll good people to come to the id of their prty. Now is the time for ll good Repulicns to come to the id of their prty.now is the time for ll good people to come to the id of their prty. Now is the time for mny or ll good people to come to the id of their prty. Now is the time for ll good people to come to the id of their prty.now is the time for ll good Democrts to come to the id of their prty. Now is the time for ll people to come to the id of their prty.now is the time for ll good people to come to the id of theirprty. Now is the time for mnygood people to come to the id of their prty.now is the time for ll good people to come to the id of their prty.now is the time for lot of good people to come to the id of their prty.now is the time for ll of the good people to come to the id of their prty.now is the time for ll good people to come to the id of their ttck t dwn prty. Now is the time for ech person to come to the id of their prty.now is the time for ll good people to come to the id of their prty. Now is the time for ll good Repulicns to come to the id of their prty.now is the time for ll good people to come to the id of their prty. Now is the time for mny or ll good people to come to the id of their prty. Now is the time for ll good people to come to the id of their prty.now is the time for ll good Democrts to come to the id of their prty. 7 8

Krp-Rin Fingerprint Algorithm Krp-Rin Fingerprint Algorithm Ide: use hshing. Compute hsh function for ech text position. No explicit hsh tle: just compre with pttern hsh! Exmple. Hsh "tle" size = 97. Serch Pttern 5 9 6 5 5965 % 97 = 95 Serch Text 3 1 4 1 5 9 6 5 3 5 8 9 7 9 3 3 8 4 6 3 1 4 1 5 1 4 1 5 9 31415 % 97 = 84 14159 % 97 = 94 4 1 5 9 4159 % 97 = 76 1 5 9 6 1596 % 97 = 18 5 9 6 5 5965 % 97 = 95 Ide: use hshing. Compute hsh function for ech text position. Gurnteeing correctness. Need full compre on hsh mtch to gurd ginst collisions. 5965 % 97 = 95 5936 % 97 = 95 Running time. Hsh function depends on M chrcters. Running time is (MN) for serch miss. how cn we fix this? 9 1 Krp-Rin Fingerprint Algorithm Krp-Rin Fingerprint Algorithm : Jv Implementtion Key ide: fst to compute hsh function of djcent sustrings. Use previous hsh to compute next hsh. O(1) time per hsh, except first one. Exmple. Pre-compute: 1 % 97 = 9 Previous hsh: 4159 % 97 = 76 Next hsh: 1596 % 97 =?? Oservtion. 1596 % 97 (4159 (4 * 1)) * 1 + 6 (76 (4 * 9 )) * 1 + 6 46 18 pulic sttic int serch(string p, String t) { int M = p.length(); int N = t.length(); int dm = 1, h1 =, h = ; int q = 3355439; // tle size int d = 56; // rdix for (int j = 1; j < M; j++) // precompute d^m % q dm = (d * dm) % q; for (int j = ; j < M; j++) { h1 = (h1*d + p.chrat(j)) % q; h = (h*d + t.chrat(j)) % q; if (h1 == h) return i - M; // hsh of pttern // hsh of text // mtch found for (int i = M; j < N; i++) { h = (h - t.chrat(i-m)) % q; // remove high order digit h = (h*d + t.chrat(i)) % q; // insert low order digit if (h1 == h) return i - M; // mtch found return -1; // not found 11 1

String Serch Implementtion Cost Summry Rndomized Algorithms Krp-Rin fingerprint lgorithm. Choose tle size t rndom to e huge prime. Expected running time is O(M + N). (MN) worst-cse, ut this is (unelievly) unlikely. Min dvntge. Extends to d ptterns nd other generliztions. A rndomized lgorithm uses rndom numers to gin efficiency. Ls Vegs lgorithms. Expected to e fst. Gurnteed to e correct. Ex: quicksort, rndomized BST, Rin-Krp with mtch check. Serch for n M-chrcter pttern in n N-chrcter text. Implementtion Krp-Rin Typicl (N) Worst Brute 1.1 N M N (N) ssumes pproprite model rndomized Monte Crlo lgorithms. Gurnteed to e fst. Expected to e correct. Ex: Rin-Krp without mtch check. Would either version of Rin-Krp mke good lirry function? chrcter comprisons 13 14 How To Sve Comprisons Knuth-Morris-Prtt (over inry lphet) How to void re-computtion? Pre-nlyze serch pttern. Ex: suppose tht first 5 chrcters of pttern re ll 's. if t[..4] mtches p[..4] then t[1..4] mtches p[..3] no need to check i = 1, j =, 1,, 3 sves 4 comprisons KMP lgorithm. Use knowledge of how serch pttern repets itself. Build DFA from pttern. Run DFA on text. Bsic strtegy: pre-compute something sed on pttern. Serch Pttern Serch Text Serch Pttern Serch Text 1 ccept stte 15 16

DFA Representtion KMP Algorithm DFA used in KMP hs specil property. Upon chrcter mtch, go forwrd one stte. Only need to keep trck of where to go upon chrcter mismtch: go to stte next[j] if chrcter mismtches in stte j Two key differences from rute force. Text pointer i never cks up. Need to precompute next[] tle. Serch Pttern 1 5 1 4 3 3 next 3 only store this row for (int i =, j = ; i < N; i++) { if (t.chrat(i) == p.chrat(j)) j++; // mtch else j = next[j]; // mismtch if (j == M) return i - M + 1; // found return -1; // not found Simultion of KMP DFA (ssumes inry lphet) 1 ccept stte 7 8 Knuth-Morris-Prtt DFA Construction for KMP KMP lgorithm. (over inry lphet, for simplicity) Use knowledge of how serch pttern repets itself. Build DFA from pttern. Run DFA on text. Rule for creting next[] tle for pttern. next[4]: longest prefix of tht is proper suffix of. next[5]: longest prefix of tht is proper suffix of. DFA construction for KMP. DFA uilds itself! Ex: compute next[6] for pttern p[..6] =. Assume you know DFA for pttern p[..5]=. Assume you know stte X for p[1..5] =. X = Updte next[6] to stte for. X + = Updte X to stte for p[1..6] = X + = 3 compute y simulting on DFA 1 1 9 3

DFA Construction for KMP DFA Construction for KMP DFA construction for KMP. DFA uilds itself! DFA construction for KMP. DFA uilds itself! Ex: compute next[6] for pttern p[..6] =. Assume you know DFA for pttern p[..5]=. Assume you know stte X for p[1..5] =. X = Updte next[6] to stte for. X + = Updte X to stte for p[1..6] = X + = 3 Ex: compute next[7] for pttern p[..7] =. Assume you know DFA for pttern p[..6]=. Assume you know stte X for p[1..6] =. X = 3 Updte next[7] to stte for. X + = 4 Updte X to stte for p[1..7] = X + = 1 7 1 7 31 3 DFA Construction for KMP DFA Construction for KMP DFA construction for KMP. DFA uilds itself! DFA construction for KMP. DFA uilds itself! Ex: compute next[7] for pttern p[..7] =. Assume you know DFA for pttern p[..6]=. Assume you know stte X for p[1..6] =. X = 3 Updte next[7] to stte for. X + = 4 Updte X to stte for p[1..7] = X + = 1 7 8 Crucil insight. To compute trnsitions for stte n of DFA, suffices to hve: DFA for sttes to n-1 stte X tht DFA ends up in with input p[1..n-1] To compute stte X' tht DFA ends up in with input p[1..n], it suffices to hve: DFA for sttes to n-1 stte X tht DFA ends up in with input p[1..n-1] 33 34

DFA Construction for KMP DFA Construction for KMP: Implementtion 1 Serch Pttern 1 7 3 1 4 5 6 3 7 4 8 j 1 3 4 5 6 7 1 1 3 pttern[1..j] X next 3 4 7 8 Build DFA for KMP. Tkes O(M) time. Requires O(M) extr spce to store next[] tle. int X = ; int[] next = new int[m]; for (int j = 1; j < M; j++) { if (p.chrat(x) == p.chrat(j)) { // chr mtch next[j] = next[x]; X = X + 1; else { // chr mismtch next[j] = X + 1; X = next[x]; DFA Construction for KMP (ssumes inry lphet) 43 44 Optimized KMP Implementtion KMP Over Aritrry Alphet Ultimte serch progrm for pttern. Specilized C progrm. Mchine lnguge version of C progrm. int kmperch(chr t[]) { int i = ; s: if (t[i++]!= '') goto s; s1: if (t[i++]!= '') goto s; s: if (t[i++]!= '') goto s; s3: if (t[i++]!= '') goto s; s4: if (t[i++]!= '') goto s; s5: if (t[i++]!= '') goto s3; s6: if (t[i++]!= '') goto s; s7: if (t[i++]!= '') goto s4; return i - 8; next[] DFA for ptterns over ritrry lphets. Red new chrcter only upon success (or filure t eginning). Reuse current chrcter upon filure nd follow ck. Fct: KMP follows t most 1 + log M ck links in row. Theorem: t most N chrcter comprisons in totl. Ex: DFA for pttern c. N Y c ssumes pttern is in text (o/w use sentinel) 45 46

String Serch Implementtion Cost Summry History of KMP KMP nlysis. DFA simultion tkes (N) time in worst-cse. DFA construction tkes (M) time nd spce in worst-cse. Extends to ASCII or UNICODE lphets. Good efficiency for ptterns nd texts with much repetition. "On-line lgorithm." virus scnning, internet spying Serch for n M-chrcter pttern in n N-chrcter text. Implementtion Krp-Rin KMP Typicl Worst Brute 1.1 N M N (N) (N) 1.1 N N ssumes pproprite model rndomized History of KMP. Inspired y esoteric theorem of Cook tht sys liner time lgorithm should e possile for -wy pushdown utomt. Discovered in 1976 independently y two theoreticins nd hcker. Knuth: discovered liner time lgorithm Prtt: mde running time independent of lphet Morris: trying to uild n editor nd void nnoying uffer for string serch Resolved theoreticl nd prcticl prolems. Surprise when it ws discovered. In hindsight, seems like right lgorithm. chrcter comprisons 47 48 Boyer-Moore Boyer-Moore Boyer-Moore lgorithm (1974). Right-to-left scnning. find offset i in text y moving left to right. compre pttern to text y moving right to left. Boyer-Moore lgorithm (1974). Right-to-left scnning. Heuristic 1: dvnce offset i using "d chrcter rule." upon mismtch of text chrcter c, look up index[c] increse offset i so tht j th chrcter of pttern lines up with text chrcter c Index g 5 i n 1 s 4 t 3 * 5 Text Text s t r i n g s s e r c h c o n s i o f Pttern s t r i n g s s e r c h c o n s i o f Pttern Mismtch Mtch No comprison 49 Mismtch Mtch No comprison 5

Boyer-Moore Boyer-Moore Boyer-Moore lgorithm (1974). Right-to-left scnning. Heuristic 1: dvnce offset i using "d chrcter rule." upon mismtch of text chrcter c, look up index[c] increse offset i so tht j th chrcter of pttern lines up with text chrcter c Index g 5 i n 1 s 4 t 3 * 5 Boyer-Moore lgorithm (1974). Right-to-left scnning. Heuristic 1: dvnce offset i using "d chrcter rule." Heuristic : use KMP-like suffix rule. effective with smll lphets different rules led to different worst-cse ehvior privte sttic void dchrskip(string pttern, int[] skip) { int M = pttern.length(); for (int j = ; j < 56; j++) skip[j] = M; for (int j = ; j < M-1; j++) skip[pttern.chrat(j)] = M j 1; construction of d chrcter skip tle Text x x x x x x x?????? x x x x x x x x x c d d x d d now ligned d chrcter rule 51 5 Boyer-Moore String Serch Implementtion Cost Summry Boyer-Moore lgorithm (1974). Right-to-left scnning. Heuristic 1: dvnce offset i using "d chrcter rule." Heuristic : use KMP-like suffix rule. effective with smll lphets different rules led to different worst-cse ehvior Boyer-Moore nlysis. O(N / M) verge cse if given letter usully doesn't occur in string. time decreses s pttern length increses suliner in input size! O(N) worst-cse with Glil vrint. Serch for n M-chrcter pttern in n N-chrcter text. Text x x x x x x x?????? x x x x x x x x x c d d x d d cn skip over this since we know d doesn't mtch strong good suffix Implementtion Brute Typicl 1.1 N Worst M N Krp-Rin (N) (N) KMP 1.1 N N Boyer-Moore N / M 4N chrcter comprisons ssumes pproprite model rndomized 53 54

Boyer-Moore nd Alphet Size Tip of the Iceerg Boyer-Moore spce requirement. (M + A) Big lphets. Direct implementtion my e imprcticl, e.g., UNICODE. My explin why Jv's indexof doesn't use it. Solution 1: serch one yte t time. Solution : hsh UNICODE chrcters to smller rnge. Smll lphets. Loses effectiveness when A is too smll, e.g., DNA. Solution: group chrcters together (, c,... ). Multiple string serch. Serch for ny of k different strings. Nïve: O(M + kn). Aho-Corsick: O(M + N). Screen out dirty words from text strem. Wildcrds / chrcter clsses. Ex: PROSITE ptterns for computtionl iology. O(M + N) time using O(M + A) extr spce. Multiple mtches Approximte string mtching: llow up to k mismtches. Recovering from typing or spelling errors in informtion retrievl. Fixing trnsmission errors in signl processing. Edit-distnce: llow up to k edits. Recover from mesurement errors in computtionl iology. 55 56 Jv String Lirry String Serch Summry Jv String lirry hs uilt-in string serching. t.indexof(p): index of 1 st occurrence of pttern p in text t. Cvet: it's rute force, nd cn tke (MN) time. pulic sttic void min(string[] rgs) { int n = Integer.prseInt(rgs[]); String s = ""; for (int i = ; i < n; i++) s = s + s; String pttern = s + ""; String text = s + s;... n......... Ingenious lgorithms for fundmentl prolem. Rin Krp. Esy to implement, ut usully worse thn rute-force. Extends to more generl settings (e.g., D serch). Knuth-Morris-Prtt. Quintessentil solution to theoreticl prolem. Independent of lphet size. Extends to multiple string serch, wild-crds, regulr expressions. n+1 System.out.println(text.indexOf(pttern)); Boyer-Moore. Simple ide leds to drmtic speedup for long ptterns. Need to twek for smll or lrge lphets. Why do you think lirry uses rute force? 57 58