String Serch String Serching String serch: given pttern string p, find first mtch in text t. Model : cn't fford to preprocess the text. Krp-Rin Knuth-Morris-Prtt Boyer-Moore N = # chrcters in text M = # chrcters in pttern n n e e n l Serch Text typiclly N >> M Ex: N = 1 million, M = 1 e d e n e e n e e d l e n l d Serch Pttern n e e d l e M = 1, N = 6 Reference: Chpter 19, Algorithms in C, nd Edition, Roert Sedgewick. n n e e n l Successful Serch e d e n e e n e e d l e n l d Princeton University COS 6 Algorithms nd Dt Structures Spring 4 Kevin Wyne http://www.princeton.edu/~cos6 String Serch Spm Filtering String: Sequence of chrcters over some lphet. Ex lphets: inry, deciml, ASCII, UNICODE, DNA. Some pplictions. Prsers. Lexis/Nexis. Spm filters. Virus scnning. Digitl lirries. Screen scrpers. Word processors. We serch engines. Symol mnipultion. Biliogrphic retrievl. Nturl lnguge processing. Crnivore surveillnce system. Computtionl moleculr iology. Feture detection in digitized imges. 3 Spm filtering: ptterns indictive of spm. AMAZING GUARANTEE PROFITS herl Vigr This is one-time miling. There is no ctch. This messge is sent in complince with spm regultions. You're getting this messge ecuse you registered with one of our mrketing prtners. 4
Brute Force Anlysis of Brute Force Brute force: Check for pttern strting t every text position. pulic sttic int serch(string pttern, String text) { int M = pttern.length(); int N = text.length(); Anlysis of rute force. Running time depends on pttern nd text. Worst cse: M N comprisons. "Averge" cse: 1.1 N comprisons. (!) Slow if M nd N re lrge, nd hve lots of repetition. for (int i = ; i < N - M; i++) { int j; for (j = ; j < M; j++) { if (text.chrat(i+j)!= pttern.chrat(j)) rek; if (j == M) return i; return -1; return -1 if not found return offset i if found Serch Pttern Serch Text 5 6 Screen Scrping Find current stock price of Sun Microsystems. t.indexof(p): index of 1 st occurrence of pttern p in text t. Downlod html from http://finnce.yhoo.com/q?s=sunw Find 1 st string delimited y <> nd </> ppering fter Lst Trde pulic clss StockQuote { pulic sttic void min(string[] rgs) { String nme = "http://finnce.yhoo.com/q?s=" + rgs[]; In in = new In(nme); String input = in.redall(); int p = input.indexof("lst Trde:", ); int from = input.indexof("<>", p); int to = input.indexof("</>", from); String price = input.sustring(from + 3, to); System.out.println(price); % jv StockQuote sunw 5. % jv StockQuote im 96.84 Algorithmic Chllenges Theoreticl chllenge: liner-time gurntee. TST index costs ~ N lgn. Prcticl chllenge: void BACKUP. Often no room or time to sve text. Fundmentl lgorithmic prolem. Now is the time for ll people to come to the id of their prty.now is the time for ll good people to come to the id of theirprty.now is the time for mnygood people to come to the id of their prty.now is the time for ll good people to come to the id of their prty.now is the time for lot of good people to come to the id of their prty.now is the time for ll of the good people to come to the id of their prty.now is the time for ll good people to come to the id of their prty. Now is the time for ech good person to come to the id of their prty.now is the time for ll good people to come to the id of their prty. Now is the time for ll good Repulicns to come to the id of their prty.now is the time for ll good people to come to the id of their prty. Now is the time for mny or ll good people to come to the id of their prty. Now is the time for ll good people to come to the id of their prty.now is the time for ll good Democrts to come to the id of their prty. Now is the time for ll people to come to the id of their prty.now is the time for ll good people to come to the id of theirprty. Now is the time for mnygood people to come to the id of their prty.now is the time for ll good people to come to the id of their prty.now is the time for lot of good people to come to the id of their prty.now is the time for ll of the good people to come to the id of their prty.now is the time for ll good people to come to the id of their ttck t dwn prty. Now is the time for ech person to come to the id of their prty.now is the time for ll good people to come to the id of their prty. Now is the time for ll good Repulicns to come to the id of their prty.now is the time for ll good people to come to the id of their prty. Now is the time for mny or ll good people to come to the id of their prty. Now is the time for ll good people to come to the id of their prty.now is the time for ll good Democrts to come to the id of their prty. 7 8
Krp-Rin Fingerprint Algorithm Krp-Rin Fingerprint Algorithm Ide: use hshing. Compute hsh function for ech text position. No explicit hsh tle: just compre with pttern hsh! Exmple. Hsh "tle" size = 97. Serch Pttern 5 9 6 5 5965 % 97 = 95 Serch Text 3 1 4 1 5 9 6 5 3 5 8 9 7 9 3 3 8 4 6 3 1 4 1 5 1 4 1 5 9 31415 % 97 = 84 14159 % 97 = 94 4 1 5 9 4159 % 97 = 76 1 5 9 6 1596 % 97 = 18 5 9 6 5 5965 % 97 = 95 Ide: use hshing. Compute hsh function for ech text position. Gurnteeing correctness. Need full compre on hsh mtch to gurd ginst collisions. 5965 % 97 = 95 5936 % 97 = 95 Running time. Hsh function depends on M chrcters. Running time is (MN) for serch miss. how cn we fix this? 9 1 Krp-Rin Fingerprint Algorithm Krp-Rin Fingerprint Algorithm : Jv Implementtion Key ide: fst to compute hsh function of djcent sustrings. Use previous hsh to compute next hsh. O(1) time per hsh, except first one. Exmple. Pre-compute: 1 % 97 = 9 Previous hsh: 4159 % 97 = 76 Next hsh: 1596 % 97 =?? Oservtion. 1596 % 97 (4159 (4 * 1)) * 1 + 6 (76 (4 * 9 )) * 1 + 6 46 18 pulic sttic int serch(string p, String t) { int M = p.length(); int N = t.length(); int dm = 1, h1 =, h = ; int q = 3355439; // tle size int d = 56; // rdix for (int j = 1; j < M; j++) // precompute d^m % q dm = (d * dm) % q; for (int j = ; j < M; j++) { h1 = (h1*d + p.chrat(j)) % q; h = (h*d + t.chrat(j)) % q; if (h1 == h) return i - M; // hsh of pttern // hsh of text // mtch found for (int i = M; j < N; i++) { h = (h - t.chrat(i-m)) % q; // remove high order digit h = (h*d + t.chrat(i)) % q; // insert low order digit if (h1 == h) return i - M; // mtch found return -1; // not found 11 1
String Serch Implementtion Cost Summry Rndomized Algorithms Krp-Rin fingerprint lgorithm. Choose tle size t rndom to e huge prime. Expected running time is O(M + N). (MN) worst-cse, ut this is (unelievly) unlikely. Min dvntge. Extends to d ptterns nd other generliztions. A rndomized lgorithm uses rndom numers to gin efficiency. Ls Vegs lgorithms. Expected to e fst. Gurnteed to e correct. Ex: quicksort, rndomized BST, Rin-Krp with mtch check. Serch for n M-chrcter pttern in n N-chrcter text. Implementtion Krp-Rin Typicl (N) Worst Brute 1.1 N M N (N) ssumes pproprite model rndomized Monte Crlo lgorithms. Gurnteed to e fst. Expected to e correct. Ex: Rin-Krp without mtch check. Would either version of Rin-Krp mke good lirry function? chrcter comprisons 13 14 How To Sve Comprisons Knuth-Morris-Prtt (over inry lphet) How to void re-computtion? Pre-nlyze serch pttern. Ex: suppose tht first 5 chrcters of pttern re ll 's. if t[..4] mtches p[..4] then t[1..4] mtches p[..3] no need to check i = 1, j =, 1,, 3 sves 4 comprisons KMP lgorithm. Use knowledge of how serch pttern repets itself. Build DFA from pttern. Run DFA on text. Bsic strtegy: pre-compute something sed on pttern. Serch Pttern Serch Text Serch Pttern Serch Text 1 ccept stte 15 16
DFA Representtion KMP Algorithm DFA used in KMP hs specil property. Upon chrcter mtch, go forwrd one stte. Only need to keep trck of where to go upon chrcter mismtch: go to stte next[j] if chrcter mismtches in stte j Two key differences from rute force. Text pointer i never cks up. Need to precompute next[] tle. Serch Pttern 1 5 1 4 3 3 next 3 only store this row for (int i =, j = ; i < N; i++) { if (t.chrat(i) == p.chrat(j)) j++; // mtch else j = next[j]; // mismtch if (j == M) return i - M + 1; // found return -1; // not found Simultion of KMP DFA (ssumes inry lphet) 1 ccept stte 7 8 Knuth-Morris-Prtt DFA Construction for KMP KMP lgorithm. (over inry lphet, for simplicity) Use knowledge of how serch pttern repets itself. Build DFA from pttern. Run DFA on text. Rule for creting next[] tle for pttern. next[4]: longest prefix of tht is proper suffix of. next[5]: longest prefix of tht is proper suffix of. DFA construction for KMP. DFA uilds itself! Ex: compute next[6] for pttern p[..6] =. Assume you know DFA for pttern p[..5]=. Assume you know stte X for p[1..5] =. X = Updte next[6] to stte for. X + = Updte X to stte for p[1..6] = X + = 3 compute y simulting on DFA 1 1 9 3
DFA Construction for KMP DFA Construction for KMP DFA construction for KMP. DFA uilds itself! DFA construction for KMP. DFA uilds itself! Ex: compute next[6] for pttern p[..6] =. Assume you know DFA for pttern p[..5]=. Assume you know stte X for p[1..5] =. X = Updte next[6] to stte for. X + = Updte X to stte for p[1..6] = X + = 3 Ex: compute next[7] for pttern p[..7] =. Assume you know DFA for pttern p[..6]=. Assume you know stte X for p[1..6] =. X = 3 Updte next[7] to stte for. X + = 4 Updte X to stte for p[1..7] = X + = 1 7 1 7 31 3 DFA Construction for KMP DFA Construction for KMP DFA construction for KMP. DFA uilds itself! DFA construction for KMP. DFA uilds itself! Ex: compute next[7] for pttern p[..7] =. Assume you know DFA for pttern p[..6]=. Assume you know stte X for p[1..6] =. X = 3 Updte next[7] to stte for. X + = 4 Updte X to stte for p[1..7] = X + = 1 7 8 Crucil insight. To compute trnsitions for stte n of DFA, suffices to hve: DFA for sttes to n-1 stte X tht DFA ends up in with input p[1..n-1] To compute stte X' tht DFA ends up in with input p[1..n], it suffices to hve: DFA for sttes to n-1 stte X tht DFA ends up in with input p[1..n-1] 33 34
DFA Construction for KMP DFA Construction for KMP: Implementtion 1 Serch Pttern 1 7 3 1 4 5 6 3 7 4 8 j 1 3 4 5 6 7 1 1 3 pttern[1..j] X next 3 4 7 8 Build DFA for KMP. Tkes O(M) time. Requires O(M) extr spce to store next[] tle. int X = ; int[] next = new int[m]; for (int j = 1; j < M; j++) { if (p.chrat(x) == p.chrat(j)) { // chr mtch next[j] = next[x]; X = X + 1; else { // chr mismtch next[j] = X + 1; X = next[x]; DFA Construction for KMP (ssumes inry lphet) 43 44 Optimized KMP Implementtion KMP Over Aritrry Alphet Ultimte serch progrm for pttern. Specilized C progrm. Mchine lnguge version of C progrm. int kmperch(chr t[]) { int i = ; s: if (t[i++]!= '') goto s; s1: if (t[i++]!= '') goto s; s: if (t[i++]!= '') goto s; s3: if (t[i++]!= '') goto s; s4: if (t[i++]!= '') goto s; s5: if (t[i++]!= '') goto s3; s6: if (t[i++]!= '') goto s; s7: if (t[i++]!= '') goto s4; return i - 8; next[] DFA for ptterns over ritrry lphets. Red new chrcter only upon success (or filure t eginning). Reuse current chrcter upon filure nd follow ck. Fct: KMP follows t most 1 + log M ck links in row. Theorem: t most N chrcter comprisons in totl. Ex: DFA for pttern c. N Y c ssumes pttern is in text (o/w use sentinel) 45 46
String Serch Implementtion Cost Summry History of KMP KMP nlysis. DFA simultion tkes (N) time in worst-cse. DFA construction tkes (M) time nd spce in worst-cse. Extends to ASCII or UNICODE lphets. Good efficiency for ptterns nd texts with much repetition. "On-line lgorithm." virus scnning, internet spying Serch for n M-chrcter pttern in n N-chrcter text. Implementtion Krp-Rin KMP Typicl Worst Brute 1.1 N M N (N) (N) 1.1 N N ssumes pproprite model rndomized History of KMP. Inspired y esoteric theorem of Cook tht sys liner time lgorithm should e possile for -wy pushdown utomt. Discovered in 1976 independently y two theoreticins nd hcker. Knuth: discovered liner time lgorithm Prtt: mde running time independent of lphet Morris: trying to uild n editor nd void nnoying uffer for string serch Resolved theoreticl nd prcticl prolems. Surprise when it ws discovered. In hindsight, seems like right lgorithm. chrcter comprisons 47 48 Boyer-Moore Boyer-Moore Boyer-Moore lgorithm (1974). Right-to-left scnning. find offset i in text y moving left to right. compre pttern to text y moving right to left. Boyer-Moore lgorithm (1974). Right-to-left scnning. Heuristic 1: dvnce offset i using "d chrcter rule." upon mismtch of text chrcter c, look up index[c] increse offset i so tht j th chrcter of pttern lines up with text chrcter c Index g 5 i n 1 s 4 t 3 * 5 Text Text s t r i n g s s e r c h c o n s i o f Pttern s t r i n g s s e r c h c o n s i o f Pttern Mismtch Mtch No comprison 49 Mismtch Mtch No comprison 5
Boyer-Moore Boyer-Moore Boyer-Moore lgorithm (1974). Right-to-left scnning. Heuristic 1: dvnce offset i using "d chrcter rule." upon mismtch of text chrcter c, look up index[c] increse offset i so tht j th chrcter of pttern lines up with text chrcter c Index g 5 i n 1 s 4 t 3 * 5 Boyer-Moore lgorithm (1974). Right-to-left scnning. Heuristic 1: dvnce offset i using "d chrcter rule." Heuristic : use KMP-like suffix rule. effective with smll lphets different rules led to different worst-cse ehvior privte sttic void dchrskip(string pttern, int[] skip) { int M = pttern.length(); for (int j = ; j < 56; j++) skip[j] = M; for (int j = ; j < M-1; j++) skip[pttern.chrat(j)] = M j 1; construction of d chrcter skip tle Text x x x x x x x?????? x x x x x x x x x c d d x d d now ligned d chrcter rule 51 5 Boyer-Moore String Serch Implementtion Cost Summry Boyer-Moore lgorithm (1974). Right-to-left scnning. Heuristic 1: dvnce offset i using "d chrcter rule." Heuristic : use KMP-like suffix rule. effective with smll lphets different rules led to different worst-cse ehvior Boyer-Moore nlysis. O(N / M) verge cse if given letter usully doesn't occur in string. time decreses s pttern length increses suliner in input size! O(N) worst-cse with Glil vrint. Serch for n M-chrcter pttern in n N-chrcter text. Text x x x x x x x?????? x x x x x x x x x c d d x d d cn skip over this since we know d doesn't mtch strong good suffix Implementtion Brute Typicl 1.1 N Worst M N Krp-Rin (N) (N) KMP 1.1 N N Boyer-Moore N / M 4N chrcter comprisons ssumes pproprite model rndomized 53 54
Boyer-Moore nd Alphet Size Tip of the Iceerg Boyer-Moore spce requirement. (M + A) Big lphets. Direct implementtion my e imprcticl, e.g., UNICODE. My explin why Jv's indexof doesn't use it. Solution 1: serch one yte t time. Solution : hsh UNICODE chrcters to smller rnge. Smll lphets. Loses effectiveness when A is too smll, e.g., DNA. Solution: group chrcters together (, c,... ). Multiple string serch. Serch for ny of k different strings. Nïve: O(M + kn). Aho-Corsick: O(M + N). Screen out dirty words from text strem. Wildcrds / chrcter clsses. Ex: PROSITE ptterns for computtionl iology. O(M + N) time using O(M + A) extr spce. Multiple mtches Approximte string mtching: llow up to k mismtches. Recovering from typing or spelling errors in informtion retrievl. Fixing trnsmission errors in signl processing. Edit-distnce: llow up to k edits. Recover from mesurement errors in computtionl iology. 55 56 Jv String Lirry String Serch Summry Jv String lirry hs uilt-in string serching. t.indexof(p): index of 1 st occurrence of pttern p in text t. Cvet: it's rute force, nd cn tke (MN) time. pulic sttic void min(string[] rgs) { int n = Integer.prseInt(rgs[]); String s = ""; for (int i = ; i < n; i++) s = s + s; String pttern = s + ""; String text = s + s;... n......... Ingenious lgorithms for fundmentl prolem. Rin Krp. Esy to implement, ut usully worse thn rute-force. Extends to more generl settings (e.g., D serch). Knuth-Morris-Prtt. Quintessentil solution to theoreticl prolem. Independent of lphet size. Extends to multiple string serch, wild-crds, regulr expressions. n+1 System.out.println(text.indexOf(pttern)); Boyer-Moore. Simple ide leds to drmtic speedup for long ptterns. Need to twek for smll or lrge lphets. Why do you think lirry uses rute force? 57 58