Improved Single and Multiple Approximate String Matching

1 Improved Single and Multiple Approximate String Matching
Kimmo Fredriksson, Department of Computer Science, University of Joensuu, Finland
Gonzalo Navarro, Department of Computer Science, University of Chile
CPM 04 p.1/26

2 The Problem Setting & Complexity
Given text T of length n and pattern P of length m over some finite alphabet of size σ, find the approximate occurrences of P from T, allowing at most k differences (edit operations).
Exact matching (single pattern) lower bound: Ω(n log_σ(m) / m) character comparisons (Yao, 79).
Approximate matching lower bound: Ω(n (k + log_σ m) / m) (Chang & Marr, 94).
We will search simultaneously a set of r patterns: Ω(n (k + log_σ(rm)) / m) lower bound (Fredriksson & Navarro, 2003).

3 Previous work
Only a few algorithms exist for multipattern approximate searching under the k differences model.
Naïve approach: search the r patterns separately, using any of the single pattern search algorithms.
(Muth & Manber, 1996): [...] average time algorithm using [...] space. The algorithm is based on hashing, and works only for k = 1.
(Baeza-Yates & Navarro, 1997): Partitioning into exact search: [...] on average ([...] preprocessing), but can be improved to [...]. Works for [...].
Other less interesting ones.

4 Previous work
(Fredriksson & Navarro, 2003): The first average-optimal algorithm. Average-optimal up to error level [...]; linear on average up to error level [...].
(Hyyrö, Fredriksson & Navarro, 2004): [...] worst case for short patterns, where w is the number of bits in machine word.

5 This work
We have improved the (optimal) algorithm of (Fredriksson & Navarro, 2003): it is faster in practice, and allows error levels up to [...].
Our algorithm runs in O(n (k + log_σ(rm)) / m) average time, which is optimal.
Preprocessing time is [...], and the algorithm needs [...] space, where [...].
The fastest algorithm in practice for intermediate [...] and small [...].

6 The method in brief
The algorithm is based on the preprocessing/filtering/verification paradigm.
The preprocessing phase generates all strings of length ℓ, and computes their minimum distance over the set of patterns.
The filtering phase searches (approximately) text ℓ-grams from the patterns, using the precomputed distance table, accumulating the differences.
The verification phase uses a dynamic programming algorithm, and is applied to each pattern separately.

7 Preprocessing
Build a table D as follows:
1. Choose a number ℓ in the range [...].
2. For every string S of length ℓ (ℓ-gram), search for S in P.
3. Store in D[S] the smallest number of differences needed to match S inside P (a number between 0 and ℓ).
D requires space for σ^ℓ entries and can be computed in [...] time.
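
The table construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' C implementation: the names min_diff_inside and build_table are hypothetical, the table is a plain dictionary keyed by ℓ-gram, and each entry is computed with the textbook dynamic programming recurrence for approximate searching, taking the minimum over all patterns.

```python
from itertools import product


def min_diff_inside(gram, pattern):
    """Smallest number of differences needed to match `gram` against
    some substring of `pattern` (hypothetical helper name)."""
    # prev[i] = cost of matching gram[:i] against an empty pattern substring.
    prev = list(range(len(gram) + 1))
    best = prev[-1]
    for c in pattern:
        # cur[0] = 0: a match may start at any position of the pattern.
        cur = [0]
        for i, g in enumerate(gram, 1):
            cur.append(min(prev[i - 1] + (g != c),  # match or substitution
                           prev[i] + 1,             # pattern character skipped
                           cur[i - 1] + 1))         # gram character skipped
        best = min(best, cur[-1])
        prev = cur
    return best


def build_table(patterns, alphabet, ell):
    """Map every string of length `ell` over `alphabet` to its minimum
    number of differences over the whole set of patterns."""
    table = {}
    for chars in product(alphabet, repeat=ell):
        gram = "".join(chars)
        table[gram] = min(min_diff_inside(gram, p) for p in patterns)
    return table
```

The dictionary has one entry per ℓ-gram, which makes the exponential growth of the table in ℓ easy to see; a real implementation would more likely index a flat array by the ℓ-gram's integer code.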

8 Filtering
Any occurrence is at least m - k characters long, so use a sliding window of m - k characters over T.
Invariant: all occurrences starting before the window are already reported.
Read ℓ-grams S_1, S_2, S_3, ... from the text window, from right to left.
[Figure: text T with a text window of m - k characters; the ℓ-grams S_3, S_2, S_1 lie at the right end of the window.]
Any occurrence starting at the beginning of the window must contain all the ℓ-grams read.

9 Filtering
Accumulate a sum of necessary differences: D[S_1] + D[S_2] + ... + D[S_u].
If for some u (i.e. the smallest) the sum becomes larger than k, then no occurrence can contain the ℓ-grams S_u ... S_1, so slide the window past the first character of S_u.
[Figure: example with the ℓ-grams S_3, S_2, S_1 in a text window of m - k characters, and the new window position after the shift.]
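
The window scan of the two filtering slides can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, all patterns are assumed to have the same length m, a window is shifted once the accumulated sum exceeds k, and a window that survives the filter is recorded as a candidate and then advanced by one position (that last choice is an assumption of this sketch).

```python
from itertools import product


def min_diff_inside(gram, pattern):
    """Smallest number of differences to match gram inside pattern."""
    prev = list(range(len(gram) + 1))
    best = prev[-1]
    for c in pattern:
        cur = [0]  # a match may start anywhere in the pattern
        for i, g in enumerate(gram, 1):
            cur.append(min(prev[i - 1] + (g != c), prev[i] + 1, cur[i - 1] + 1))
        best = min(best, cur[-1])
        prev = cur
    return best


def build_table(patterns, alphabet, ell):
    """Table D: minimum differences of every ell-gram over all patterns."""
    grams = ("".join(t) for t in product(alphabet, repeat=ell))
    return {g: min(min_diff_inside(g, p) for p in patterns) for g in grams}


def candidate_windows(text, patterns, k, ell, alphabet):
    """Return start positions of windows that the filter cannot discard."""
    m = len(patterns[0])      # assumption: equal-length patterns
    w = m - k                 # window length
    D = build_table(patterns, alphabet, ell)
    candidates = []
    i = 0
    while i + w <= len(text):
        total, end, shift = 0, i + w, None
        # Read ell-grams right to left while they fit inside the window.
        while end - ell >= i:
            start = end - ell
            total += D[text[start:end]]
            if total > k:
                # No occurrence can contain the grams read so far:
                # restart just past the first character of the last gram.
                shift = start + 1
                break
            end = start
        if shift is None:
            candidates.append(i)   # window must be verified
            i += 1
        else:
            i = shift
    return candidates
```

For instance, with the single pattern "ACGTACGT", k = 1 and ℓ = 2, a text consisting of eight T's, the pattern, and eight more T's lets the first window be discarded after two ℓ-grams, while the window at the true occurrence is kept for verification.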

10 Verification
If the sum stays at most k for all the ℓ-grams read, then the window might contain an occurrence.
The occurrence can be m + k characters long, so verify the area of m + k characters starting at i, where i is the starting position of the window.
[Figure: text T with the ℓ-grams S_3, S_2, S_1 in a text window of m - k characters, inside a verification area of m + k characters.]
The verification is done for each of the r patterns, using a standard dynamic programming algorithm.
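
A minimal sketch of the verification step, assuming the standard column-wise dynamic programming for approximate searching and equal-length patterns. The names approx_end_positions and verify_window are hypothetical, and the slides' implementation instead uses the hierarchical / bit-parallel verification mentioned in the experimental section.

```python
def approx_end_positions(pattern, area, k):
    """End positions (exclusive, relative to `area`) where `pattern`
    matches some substring of `area` with at most k differences."""
    col = list(range(len(pattern) + 1))   # column for the empty text prefix
    ends = []
    for j, c in enumerate(area, 1):
        new = [0]                          # an occurrence may start anywhere
        for i, p in enumerate(pattern, 1):
            new.append(min(col[i - 1] + (p != c),  # match or substitution
                           col[i] + 1,             # extra text character
                           new[i - 1] + 1))        # missing pattern character
        col = new
        if col[-1] <= k:
            ends.append(j)
    return ends


def verify_window(text, patterns, k, i):
    """Check the m + k characters starting at window position i against
    each pattern separately; return only the patterns that match."""
    m = len(patterns[0])                   # assumption: equal-length patterns
    area = text[i:i + m + k]
    found = {}
    for p in patterns:
        ends = approx_end_positions(p, area, k)
        if ends:
            found[p] = ends
    return found
```

Each call costs time proportional to the pattern length times the area length, which is why the filter tries to keep the number of verified windows small.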

11 Stricter matching condition
Our basic algorithm: text ℓ-grams can match anywhere inside the patterns.
If the accumulated sum of differences exceeds k, then we know that no occurrence can contain the ℓ-grams in any position.
The matching area can be made smaller without losing this property.

12 Stricter matching condition
Consider an approximate occurrence of [...] inside the pattern. [...] cannot be closer than [...] positions from the end of the pattern.
For [...] precompute a table [...], which considers its best match in the area [...] rather than [...].
In general, for [...] preprocess a table [...], using the area [...].
Compute [...] as [...].

13 Stricter matching condition
[Figure: pattern P with nested matching areas, where table D[1] uses the area for S_1, D[2] the area for S_2, and D[3] the area for S_3; below it, text T with the ℓ-grams S_3, S_2, S_1 in the text window.]

14 Stricter matching condition
[...] for any [...] and [...]: the smallest [...] that permits shifting the window is never smaller than for the basic method.
This variant never examines more ℓ-grams, verifies more windows, nor shifts less.
Drawback: needs more space and preprocessing effort. Can be slower in practice.
The matching condition can be made even stricter: work less per window, but the shift can be smaller.

15 Analysis
It can be shown that the basic algorithm has optimal average case complexity [...]. This holds for [...].
The worst case complexity can be made [...] (filtering + verification).
The preprocessing cost is [...], and it requires [...] space.
Since the algorithm with the stricter matching condition is never worse than the basic version, it is also optimal.

16 Analysis
For a single pattern our complexity is the same as the algorithm of Chang & Marr, i.e. O(n (k + log_σ m) / m), but our filter works up to [...], whereas the filter of Chang & Marr works only up to [...].

17 Experimental results
Implementation in C, compiled using icc 7.1 with full optimizations, run on a 2 GHz Pentium 4 with 512 MB RAM, running Linux.
Experiments for alphabet sizes σ = 4 (DNA) and σ = 20 (proteins), both random and real texts.
Text lengths were 64 Mb, and patterns 64 characters.
In the implementation we used several practical improvements described in (Fredriksson & Navarro, 2003):
Bit-parallel counters
Hierarchical / bit-parallel verification

18 Experimental results
We used ℓ = [...] for DNA, and ℓ = [...] for proteins: the maximum values we can use in practice, otherwise the preprocessing cost becomes too high.
Analytical results: [...] for DNA, and [...] for proteins (depending on [...]).
Although our algorithms are fast, in practice they cannot cope with as high difference ratios as predicted by the analysis.

19 Experimental results
Comparison against:
CM: Our previous optimal filtering algorithm
LT: Our previous linear time filter
EXP: Partitioning into exact search
MM: Muth & Manber algorithm, works only for k = 1
ABNDM: Approximate BNDM algorithm, a single pattern approximate search algorithm extending classical BDM
BPM: Bit-parallel Myers, currently the best non-filtering algorithm for single patterns

20 Experimental results
Comparison against Muth and Manber (k = 1).
[Table: one row each for MM and Ours, for DNA and for proteins. Only the DNA row for Ours survives: 0.08, 0.12, 0.21, 0.54.]

21 Experimental results, random DNA
[Figure: time (s) against k. Curves: Ours, l=6; Ours, l=8; Ours, strict; Ours, strictest; CM; LT; EXP; BPM; ABNDM.]

22 Experimental results, random DNA
[Figure: time (s) against k. Curves: Ours, l=6; Ours, l=8; Ours, strict; Ours, strictest; CM; LT; EXP; BPM; ABNDM.]

23 Experimental results, random proteins
[Figure: time (s) against k. Curves: Ours; Ours, stricter; Ours, strictest; CM; LT; EXP; BPM; ABNDM.]

24 Experimental results, random proteins
[Figure: time (s) against k. Curves: Ours; Ours, stricter; Ours, strictest; CM; LT; EXP; BPM; ABNDM.]

25 Experimental results
Areas where each algorithm performs best. From left to right, DNA ([...]), and proteins ([...]). Top row: random data. Bottom row: real data.
[Figure: four panels with k on the horizontal axis and r (from 1 to 256) on the vertical axis, showing the regions where Ours, EXP and BPM are the fastest.]

26 Conclusions
Our new algorithm becomes the fastest for low [...]. The larger [...], the smaller [...] values are tolerated.
When applied to just one pattern, our algorithm becomes the fastest for low difference ratios.
Our basic algorithm usually beats the extensions. This is true only if we use the same parameter ℓ for both algorithms. For limited memory we can use the stricter matching condition with smaller ℓ, and beat the basic algorithm.
Our algorithm would be favored on even longer texts (relative preprocessing cost decreases).
