Fast Searching in Packed Strings. Philip Bille


String Matching

Problem: Given strings P and Q of lengths m and n, respectively, report all occurrences of P in Q.

[Example strings Q and P; their characters were lost in extraction.]

The KMP algorithm [KMP1977] uses $O(n)$ time (assume w.l.o.g. $m \le n$).
Optimal if strings are stored with one character per memory word.

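Packed Strings

Real strings are packed.

[Figure: a string S stored in consecutive memory words; each character takes $\log \sigma$ bits and each word holds $\log n$ bits.]

With word length $\log n$, a memory word holds $\log n / \log \sigma$ characters.
S uses $O(|S| \log \sigma / \log n) = O(|S| / \log_\sigma n)$ words.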

Packed String Matching

Problem: String matching with P and Q in packed representation.
Lower bound: $\Omega\left(\frac{n + m}{\log_\sigma n} + occ\right)$
What is the best upper bound? Can we do better than $O(n)$?

A Simple Algorithm: Use Lots of Space

[Figure: example strings P and Q with matched prefix lengths; the characters were lost in extraction.]

Idea: Traverse Q from left to right, reading r characters (the number of characters per word) at a time.
At each step compute the longest prefix of P matching the current suffix of Q (slightly more information is needed to also report occurrences).
To do a step in constant time, store for each prefix of P and each combination of r characters a pointer to the next prefix (called the super-alphabet technique [Fre02]).

Space: $O(m\sigma^r)$
Time: $O(n/r + m\sigma^r + occ)$

With $r = \varepsilon \log_\sigma n$:

Space: $O(mn^\varepsilon)$
Time: $O(n/\log_\sigma n + mn^\varepsilon + occ)$
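The tabulation idea can be sketched as follows. This is a minimal illustration, not the author's implementation: the names `kmp_dfa`, `build_block_table`, and `search` are hypothetical, a non-empty pattern is assumed, and blocks are keyed by Python strings rather than packed machine words, so only the lookup pattern is shown, not the constant-time word access.

```python
from itertools import product


def kmp_dfa(p, alphabet):
    """delta[q][c] = length of the longest prefix of p that is a suffix of p[:q] + c."""
    m = len(p)
    delta = [dict() for _ in range(m + 1)]
    for c in alphabet:
        delta[0][c] = 1 if c == p[0] else 0
    x = 0                                   # state reached by reading p[1:q]
    for q in range(1, m + 1):
        for c in alphabet:
            delta[q][c] = delta[x][c]       # mismatch: behave like the restart state
        if q < m:
            delta[q][p[q]] = q + 1          # match: advance
            x = delta[x][p[q]]
    return delta


def build_block_table(p, alphabet, r):
    """For every state and every block of r characters, store the next state and
    the offsets in the block where an occurrence ends: (m + 1) * sigma**r entries."""
    delta, m, table = kmp_dfa(p, alphabet), len(p), {}
    for q in range(m + 1):
        for block in product(alphabet, repeat=r):
            state, hits = q, []
            for i, c in enumerate(block):
                state = delta[state][c]
                if state == m:
                    hits.append(i)
            table[q, "".join(block)] = (state, hits)
    return delta, table


def search(p, text, alphabet, r):
    delta, table = build_block_table(p, alphabet, r)
    m, state, out = len(p), 0, []
    full = len(text) - len(text) % r
    for pos in range(0, full, r):           # one table lookup per block of r characters
        state, hits = table[state, text[pos:pos + r]]
        out.extend(pos + i - m + 1 for i in hits)
    for pos in range(full, len(text)):      # leftover characters, one at a time
        state = delta[state][text[pos]]
        if state == m:
            out.append(pos - m + 1)
    return out


print(search("aba", "abababa", "ab", 2))
```

The table has one entry per (prefix, block) pair, which is where the $m\sigma^r$ term in the space bound comes from.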

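Complexities

For each algorithm, the first line is for general r and the second sets $r = \varepsilon \log_\sigma n$.

Algorithm    Time                                           Space
Simple       $O(n/r + m\sigma^r + occ)$                     $O(m\sigma^r)$
             $O(n/\log_\sigma n + mn^\varepsilon + occ)$    $O(mn^\varepsilon)$
This paper   $O(n/r + m + \sigma^r + occ)$                  $O(m + \sigma^r)$
             $O(n/\log_\sigma n + m + occ)$                 $O(m + n^\varepsilon)$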
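The step from the bounds in terms of r to the bounds in terms of n can be spelled out as below. The restriction to a constant $0 < \varepsilon < 1$ is an assumption added here; the slides do not state the range of $\varepsilon$.

```latex
\begin{align*}
\sigma^{r} &= \sigma^{\varepsilon \log_\sigma n} = n^{\varepsilon}, \\
\frac{n}{r} &= \frac{n}{\varepsilon \log_\sigma n}
             = O\!\left(\frac{n}{\log_\sigma n}\right), \\
n^{\varepsilon} &= O\!\left(\frac{n}{\log_\sigma n}\right)
             \quad \text{since } n^{\varepsilon} \log_\sigma n \le n^{\varepsilon} \log_2 n = O(n).
\end{align*}
```

The last line is why the additive $\sigma^r = n^\varepsilon$ term can be absorbed into $n/\log_\sigma n$, while the multiplicative $mn^\varepsilon$ term of the simple algorithm cannot.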

Algorithm Overview

Based on the Knuth-Morris-Pratt automaton.
The Four-Russian technique (divide and tabulate) with new twists.

The Knuth-Morris-Pratt Automaton

[Figure: a pattern P and its automaton KMP(P); the pattern's characters were lost in extraction.]
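Since the figure is lost, a small sketch of the automaton may help. It uses the standard textbook formulation with forward and failure transitions; the function names `failure` and `kmp_search` and the example pattern are chosen for this illustration and do not come from the slides. A non-empty pattern is assumed.

```python
def failure(p):
    """fail[i] = length of the longest proper border of p[:i], for i = 0..len(p).
    State i of the automaton means 'the last i characters read match p[:i]';
    fail[i] is the endpoint of the failure transition leaving state i."""
    fail = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k]
        if p[i] == p[k]:
            k += 1
        fail[i + 1] = k
    return fail


def kmp_search(p, text):
    """Report the start positions of all occurrences of p in text."""
    fail, state, out = failure(p), 0, []
    for pos, c in enumerate(text):
        # Follow failure transitions until a forward transition on c exists.
        while state > 0 and (state == len(p) or p[state] != c):
            state = fail[state]
        if p[state] == c:
            state += 1
        if state == len(p):
            out.append(pos - len(p) + 1)
    return out


print(failure("abacab"))
print(kmp_search("aba", "abababa"))
```

Each text character causes one forward step and an amortized constant number of failure steps, which gives the $O(n)$ scan.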

A First Attempt: The Four-Russian Technique

[Figure: the automaton divided into subautomata of size r.]

Tabulate information for each subautomaton to allow up to r internal transitions in constant time.
Simulate by doing external transitions explicitly and internal transitions using the tabulated information.

Issue 1: Too many external transitions.
Issue 2: Representing subautomata compactly.

Fixing 1: Too Many External Transitions

At most $O(n/r)$ external transitions in the simulation of Q.

Fixing 2: Representing Subautomata Compactly

We want to encode an arbitrary subautomaton of KMP(P) in $O(r \log \sigma)$ bits.
Non-failure transitions are encoded by the sequence of labels in $O(r \log \sigma)$ bits.
How about the failure transitions in S?

Fixing 2: Representing Subautomata Compactly

Storing r explicit pointers uses $\Omega(r \log r)$ bits.
Instead we exploit a basic property of KMP automata: in any subautomaton, failure transition endpoints increase by at most 1 between consecutive states.

=> Total increase at most r.
=> Total decrease at most $O(r)$.
=> We can difference encode all failure transitions with $O(r)$ bits.
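One way to realize such a difference encoding is sketched below. It is an assumption-laden illustration rather than the paper's exact scheme: it treats the whole failure array of a short pattern as a single subautomaton, ignores failure transitions that leave a segment, and uses a simple unary code. The names `failure`, `encode`, and `decode` are hypothetical.

```python
def failure(p):
    """fail[i] = length of the longest proper border of p[:i], for i = 0..len(p)."""
    fail = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k]
        if p[i] == p[k]:
            k += 1
        fail[i + 1] = k
    return fail


def encode(fail):
    """Encode fail[1:] relative to fail[0]. For each state, write one '0' per unit
    that the endpoint falls short of (previous endpoint + 1), then a '1'."""
    bits, prev = [], fail[0]
    for f in fail[1:]:
        drop = prev + 1 - f        # >= 0 because endpoints rise by at most 1
        bits.append("0" * drop + "1")
        prev = f
    return "".join(bits)


def decode(bits, first=0):
    """Inverse of encode: rebuild the failure endpoints from the bit string."""
    out, prev, drop = [], first, 0
    for b in bits:
        if b == "0":
            drop += 1
        else:
            prev = prev + 1 - drop
            out.append(prev)
            drop = 0
    return out


f = failure("abacab")
code = encode(f)
print(code, len(code), decode(code) == f[1:])
```

In this simplified setting there is one `1` per state and the number of `0`s equals r plus the first endpoint minus the last, so with the first endpoint 0 the code is at most 2r bits long for r states, compared with roughly $r \log r$ bits for explicit pointers.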

Putting the Pieces Together

Construct the segment automaton and tabulate transitions for subautomata using the compact encoding.
Simulate the segment automaton. Each external transition is done explicitly. Internal transitions are done using the tabulation.

Complexity:

Space: $O(m + \sigma^r)$
Time: $O(n/r + m + \sigma^r + occ)$

With $r = \varepsilon \log_\sigma n$:

Space: $O(m + n^\varepsilon)$
Time: $O(n/\log_\sigma n + m + occ)$

Directions

Packed string matching:
Practical?
Long word lengths?
Multi-string matching?

Packed problems appear everywhere.
Longer word lengths => more packing.
Most packed problems are not well-solved.