Fast Searching in Packed Strings
Philip Bille
String Matching. Problem: Given strings P and Q of lengths m and n, respectively, report all occurrences of P in Q. [Example figure: a text Q and a pattern P.] The KMP algorithm [KMP1977] uses O(n) time (assume w.l.o.g. m ≤ n). This is optimal if strings are stored with one character per memory word.
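For reference, the O(n) baseline can be sketched as follows. This is a minimal textbook-style version of KMP over unpacked Python strings, not code from the talk; the function names `failure_function` and `kmp_search` and the example strings are illustrative assumptions.

```python
def failure_function(p):
    """fail[i] = length of the longest proper border of p[:i + 1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        # Fall back along borders until p[i] extends one of them.
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(p, q):
    """Return the start positions of all occurrences of p in q."""
    if not p:
        return []
    fail = failure_function(p)
    occ = []
    k = 0  # length of the longest prefix of p matching a suffix of q[:i + 1]
    for i, ch in enumerate(q):
        while k > 0 and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            occ.append(i - len(p) + 1)
            k = fail[k - 1]
    return occ


print(kmp_search("abc", "aabcabcab"))
```

Each character of Q is read once, one character per step, which is exactly the cost the packed setting tries to beat.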
Packed Strings. Real strings are packed. [Figure: the string S, annotated with log σ and log n.] With word length log n, a memory word holds log n / log σ characters. S uses O(|S| log σ / log n) = O(|S| / log_σ n) words.
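The word count above can be illustrated with a small packing routine. This is only a sketch: the helper name `pack`, the 8-bit word, and the DNA-style alphabet are assumptions chosen for the example, and the last word is simply left partially filled.

```python
def pack(s, alphabet, word_bits):
    """Pack s into integers of word_bits bits each.

    Every character takes ceil(log2(sigma)) bits, so one word holds
    word_bits // char_bits characters. Assumes word_bits >= char_bits.
    """
    code = {c: i for i, c in enumerate(alphabet)}
    char_bits = max(1, (len(alphabet) - 1).bit_length())
    per_word = word_bits // char_bits
    words = []
    for start in range(0, len(s), per_word):
        w = 0
        for c in s[start:start + per_word]:
            w = (w << char_bits) | code[c]
        words.append(w)
    return words, per_word


words, per_word = pack("acgtacgt", "acgt", word_bits=8)
print(per_word, words)
```

With σ = 4 and 8-bit words, each word carries 4 characters, so the 8-character string occupies 2 words instead of 8.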
Packed String Matching. Problem: String matching with P and Q in packed representation. Lower bound: Ω((n + m)/log_σ n + occ). What is the best upper bound? Can we do better than O(n)?
A Simple Algorithm: Use Lots of Space. [Figure: example strings P and Q.] Idea: Traverse Q from left to right, reading r characters at a time, where r is at most the number of characters per word. At each step, compute the longest prefix of P matching the current suffix of Q (slightly more information is needed to also report occurrences). To do a step in constant time, store for each prefix of P and each combination of r characters a pointer to the next prefix (called the super-alphabet technique [Fre02]). Space: O(mσ^r). Time: O(n/r + mσ^r + occ). With r = ε log_σ n: space O(mn^ε), time O(n/log_σ n + mn^ε + occ).
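A minimal sketch of this tabulation follows, assuming ordinary Python strings stand in for packed words (in a packed setting the block would be an integer used as an array index). The names `make_step`, `build_super_table`, and `super_search` are hypothetical, and the "slightly more information" is stored here as the offsets inside each block at which an occurrence ends.

```python
from itertools import product


def make_step(p):
    """Return the one-character KMP transition for a non-empty pattern p."""
    m = len(p)
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k

    def step(k, ch):
        if k == m:  # after a full match, continue from the longest border
            k = fail[m - 1]
        while k > 0 and ch != p[k]:
            k = fail[k - 1]
        return k + 1 if ch == p[k] else k

    return step


def build_super_table(p, alphabet, r):
    """For each prefix length and each block of r characters, store the
    next prefix length and the in-block offsets where a match ends."""
    m, step = len(p), make_step(p)
    table = {}
    for k in range(m + 1):
        for block in product(alphabet, repeat=r):
            state, ends = k, []
            for j, ch in enumerate(block):
                state = step(state, ch)
                if state == m:
                    ends.append(j)
            table[k, "".join(block)] = (state, tuple(ends))
    return table


def super_search(p, q, alphabet, r):
    """Scan q in blocks of r characters, one table lookup per block."""
    m, step = len(p), make_step(p)
    table = build_super_table(p, alphabet, r)
    occ, state = [], 0
    full = len(q) - len(q) % r
    for start in range(0, full, r):
        state, ends = table[state, q[start:start + r]]
        occ.extend(start + j - m + 1 for j in ends)
    for i in range(full, len(q)):  # fewer than r characters remain
        state = step(state, q[i])
        if state == m:
            occ.append(i - m + 1)
    return occ


print(super_search("abc", "aabcabcab", "abc", 2))
```

The table has (m + 1)·σ^r entries, which is where the O(mσ^r) space term comes from.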
Complexities (with r = ε log_σ n)
Simple: time O(n/r + mσ^r + occ) = O(n/log_σ n + mn^ε + occ); space O(mσ^r) = O(mn^ε).
This paper: time O(n/r + m + σ^r + occ) = O(n/log_σ n + m + occ); space O(m + σ^r) = O(m + n^ε).
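The substitution r = ε log_σ n behind these bounds can be spelled out as follows. This is a worked derivation under the assumption that ε is a constant with 0 < ε < 1, so that n^ε is dominated by n / log_σ n.

```latex
\begin{align*}
\sigma^{r} &= \sigma^{\varepsilon \log_\sigma n}
            = \left(\sigma^{\log_\sigma n}\right)^{\varepsilon}
            = n^{\varepsilon}, \\
\frac{n}{r} &= \frac{n}{\varepsilon \log_\sigma n}
             = O\!\left(\frac{n}{\log_\sigma n}\right), \\
\frac{n}{r} + m + \sigma^{r} + occ
  &= O\!\left(\frac{n}{\log_\sigma n} + m + n^{\varepsilon} + occ\right)
   = O\!\left(\frac{n}{\log_\sigma n} + m + occ\right).
\end{align*}
```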
Algorithm Overview. Based on the Knuth-Morris-Pratt automaton. The Four-Russian technique (divide and tabulate) with new twists.
The Knuth-Morris-Pratt Automaton. [Figure: the automaton KMP(P) for an example pattern P.]
A First Attempt: The Four-Russian Technique. Tabulate information for each subautomaton to allow up to r internal transitions in constant time. Simulate by doing external transitions explicitly and internal transitions using the tabulated information. Issue 1: Too many external transitions. Issue 2: Representing subautomata compactly.
Fixing 1: Too Many External Transitions. At most O(n/r) external transitions in the simulation of Q.
Fixing 2: Representing Subautomata Compactly. We want to encode an arbitrary subautomaton of KMP(P) in O(r log σ) bits. Non-failure transitions are encoded by the sequence of labels in O(r log σ) bits. How about the failure transitions in S?
Fixing 2: Representing Subautomata Compactly. Storing r explicit pointers uses Ω(r log r) bits. Instead we exploit a basic property of KMP automata: in any subautomaton, failure transition endpoints increase by at most 1 between consecutive states. => Total increase is at most r. => Total decrease is at most O(r). => We can difference-encode all failure transitions with O(r) bits.
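The counting argument can be illustrated with a small unary difference code. This is a simplified sketch, not the encoding from the paper: it encodes the failure array of a whole pattern rather than an arbitrary subautomaton, and the names `border_table`, `encode_failures`, and `decode_failures` are hypothetical.

```python
def border_table(p):
    """fail[i] = length of the longest proper border of p[:i + 1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def encode_failures(fail):
    """Encode each endpoint as its drop below (previous endpoint + 1),
    written in unary: `drop` ones followed by a single zero."""
    bits = []
    prev = -1
    for f in fail:
        drop = prev + 1 - f  # never negative: endpoints rise by at most 1
        assert drop >= 0
        bits.extend([1] * drop)
        bits.append(0)
        prev = f
    return bits


def decode_failures(bits):
    """Invert encode_failures."""
    fail = []
    value = 0  # candidate endpoint: previous endpoint + 1
    for b in bits:
        if b:
            value -= 1
        else:
            fail.append(value)
            value += 1
    return fail


fail = border_table("ababc")
bits = encode_failures(fail)
print(fail, bits, decode_failures(bits) == fail)
```

Every state contributes one zero, and the ones count the total decrease, which cannot exceed the total increase. For k states this gives fewer than 2k bits, compared with k log k bits for explicit pointers.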
Putting the Pieces Together. Construct the segment automaton and tabulate transitions for subautomata using the compact encoding. Simulate the segment automaton: each external transition is done explicitly, and internal transitions are done using the tabulation. Complexity: space O(m + σ^r), time O(n/r + m + σ^r + occ). With r = ε log_σ n: space O(m + n^ε), time O(n/log_σ n + m + occ).
Directions. Packed string matching: Practical? Long word lengths? Multi-string matching? Packed problems appear everywhere. Longer word lengths => more packing. Most packed problems are not well-solved.