Improved Single and Multiple Approximate String Matching




Improved Single and Multiple Approximate String Matching
Kimmo Fredriksson, Department of Computer Science, University of Joensuu, Finland
Gonzalo Navarro, Department of Computer Science, University of Chile
CPM 04 p.1/26

The Problem Setting & Complexity
Given a text T of length n and a pattern P of length m, both over some finite alphabet of size σ, find the approximate occurrences of P in T, allowing at most k differences (edit operations).
Exact matching (single pattern) lower bound: Ω(n log_σ(m) / m) character comparisons (Yao, 79).
Approximate matching lower bound: Ω(n (k + log_σ m) / m) (Chang & Marr, 94).
We will search simultaneously for a set of r patterns P1, ..., Pr.
Lower bound for r patterns: Ω(n (k + log_σ(r m)) / m) (Fredriksson & Navarro, 2003)
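The k differences measure above is the classic edit (Levenshtein) distance. As a concrete reference point, here is a minimal Python sketch of the textbook dynamic-programming computation (the function name is ours; the authors' implementation is in C):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic DP edit distance: insertions, deletions and
    substitutions each cost one difference."""
    prev = list(range(len(b) + 1))          # distance from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute / match
        prev = cur
    return prev[-1]
```

For example, `edit_distance("survey", "surgery")` is 2 (substitute v by g, insert r).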

" Previous work Only a few algorithms exist for multipattern approximate searching under the differences model. Naïve approach: search the patterns separately, using any of the single pattern search algorithms. (Muth & Manber, 1996): average time algorithm using space. The algorithm is based on hashing, and works only for. (Baeza-Yates & Navarro, 1997): Partitioning into exact search: on average ( preprocessing), but can be improved to. Works for. Other less interesting ones. CPM 04 p.3/26

Previous work
(Fredriksson & Navarro, 2003): the first average-optimal algorithm; average-optimal up to a moderate error level, and linear on average up to a higher one.
(Hyyrö, Fredriksson & Navarro, 2004): good worst case for short patterns, where w is the number of bits in the machine word.

This work
We have improved the (optimal) algorithm of (Fredriksson & Navarro, 2003):
faster in practice, and...
...allows higher error levels.
Our algorithm runs in O(n (k + log_σ(r m)) / m) average time, which is optimal. The preprocessing time and the space the algorithm needs are moderate.
The fastest algorithm in practice for intermediate and small error levels.

The method in brief
The algorithm is based on the preprocessing / filtering / verification paradigm.
The preprocessing phase generates all strings of length ℓ and computes their minimum distance to a match inside the set of patterns.
The filtering phase searches the text ℓ-grams (approximately) against the patterns, using the precomputed distance table and accumulating the differences.
The verification phase uses a dynamic programming algorithm, and is applied to each pattern separately.

Preprocessing
Build a table D as follows:
1. Choose a number ℓ (the gram length).
2. For every string S of length ℓ (an ℓ-gram), search for S inside the patterns.
3. Store in D[S] the smallest number of differences needed to match S inside some pattern (a number between 0 and ℓ).
D requires O(σ^ℓ) entries and can be computed with dynamic programming.
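The table D can be sketched in Python as follows. The names `best_match_inside` and `build_table` are ours, and the semi-global DP (free start and free end inside the pattern) is the standard way to compute the fewest differences needed to match a gram somewhere inside a pattern:

```python
from itertools import product

def best_match_inside(s: str, p: str) -> int:
    """Fewest differences needed to match s somewhere inside p:
    semi-global DP, where the match may start and end anywhere in p."""
    prev = [0] * (len(p) + 1)               # free start: cost 0 at any position of p
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, cp in enumerate(p, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != cp)))
        prev = cur
    return min(prev)                        # free end: best over all end positions

def build_table(patterns, alphabet, ell):
    """D[S] = minimum, over the patterns, of the differences needed
    to match the l-gram S inside a pattern."""
    D = {}
    for tup in product(alphabet, repeat=ell):
        s = "".join(tup)
        D[s] = min(best_match_inside(s, p) for p in patterns)
    return D
```

The table has σ^ℓ entries, which is why ℓ cannot be chosen too large in practice.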

Filtering
Any occurrence is at least m - k characters long, so we use a sliding window of m - k characters over T.
Invariant: all occurrences starting before the window are already reported.
Read ℓ-grams from the window, from right to left: S1, S2, S3, ...
[Figure: text window of m - k characters, with the grams S3, S2, S1 read right to left.]
Any occurrence starting at the beginning of the window must contain all the ℓ-grams read.

" Filtering Accumulate a sum of necessary differences:. If for some (i.e. the smallest) then no occurrence can contain the -grams becomes, ### slide the window past the first character of E.g. T: T: " : S 3 S 2 S 1 text window m k characters new window position. CPM 04 p.9/26

Verification
If M ≤ k after all the ℓ-grams of the window are read, then the window might contain an occurrence. An occurrence is at most m + k characters long, so we verify an area of m + k characters, T[j ... j + m + k], where j is the starting position of the window.
[Figure: verification area of m + k characters covering the text window of m - k characters.]
The verification is done for each of the r patterns, using the standard dynamic programming algorithm.
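A minimal sketch of the standard dynamic-programming verification (the classic text-search DP: a match may start anywhere in the area, and every end position where the pattern matches with at most k differences is reported; the function name is ours):

```python
def verify_area(area: str, pattern: str, k: int):
    """Report 1-based end offsets in area where pattern matches
    with at most k differences (column 0 of the DP is always 0,
    so a match may start at any text position)."""
    m = len(pattern)
    col = list(range(m + 1))                 # DP column against empty text
    ends = []
    for j, ct in enumerate(area, 1):
        prev_diag, col[0] = col[0], 0        # free start of a match
        for i in range(1, m + 1):
            tmp = col[i]
            col[i] = min(col[i] + 1,         # insert ct
                         col[i - 1] + 1,     # delete pattern[i-1]
                         prev_diag + (pattern[i - 1] != ct))
            prev_diag = tmp
        if col[m] <= k:                      # pattern matched ending here
            ends.append(j)
    return ends
```

For instance, "survey" matches inside "xxsurgeryxx" with 2 differences, ending right after "surgery".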

Stricter matching condition
In our basic algorithm, text ℓ-grams can match anywhere inside the patterns.
If M > k, then we know that no occurrence can contain the ℓ-grams S1 ... Si in any position.
The matching area can be made smaller without losing this property.

" " " " Stricter matching condition Consider an approximate occurrence of inside the pattern. cannot be closer than end of the pattern. positions from the For precompute a table, which considers its best match in the area rather than. In general, for preprocess a table, using the area Compute as CPM 04 p.12/26

Stricter matching condition
[Figure: pattern P above the text window; each table Di considers only its own area of P: D[1] with the area for S1, D[2] with the area for S2, D[3] with the area for S3.]

Stricter matching condition
Di[Si] ≥ D[Si] for any Si, so the shift that results is never smaller than for the basic method: this variant never examines more ℓ-grams, never verifies more windows, and never shifts less.
Drawback: it needs more space and preprocessing effort, and can be slower in practice.
The matching condition can be made even stricter: less work per window... but the shift can be smaller.

Analysis
It can be shown that the basic algorithm has the optimal average-case complexity O(n (k + log_σ(r m)) / m). This holds up to a limited error level.
The worst-case complexity can be bounded by the filtering time plus the verification time.
The preprocessing cost is dominated by filling the table, which requires O(σ^ℓ) space.
Since the algorithm with the stricter matching condition is never worse than the basic version, it is also optimal.

Analysis
For a single pattern our complexity is the same as that of the algorithm of Chang & Marr, i.e. O(n (k + log_σ m) / m)...
...but our filter works up to error level 1/2 - O(1/√σ), whereas the filter of Chang & Marr works only up to 1/3 - O(1/√σ).

Experimental results
Implementation in C, compiled using icc 7.1 with full optimizations; run on a 2GHz Pentium 4 with 512MB RAM, running Linux 2.4.18.
Experiments for alphabet sizes σ = 4 (DNA) and σ = 20 (proteins), on both random and real texts. Texts were 64MB long, and patterns 64 characters.
The implementation uses several practical improvements described in (Fredriksson & Navarro, 2003):
bit-parallel counters
hierarchical / bit-parallel verification

Experimental results
We used the largest ℓ values that are practical for DNA and for proteins; beyond those, the preprocessing cost becomes too high.
The analysis predicts higher tolerated error levels for DNA and for proteins (depending on r).
Although our algorithms are fast, in practice they cannot cope with difference ratios as high as the analysis predicts.

Experimental results
Comparison against:
CM: our previous optimal filtering algorithm
LT: our previous linear time filter
EXP: partitioning into exact search
MM: the Muth & Manber algorithm, works only for k = 1
ABNDM: approximate BNDM, a single-pattern approximate search algorithm extending the classical BDM
BPM: bit-parallel Myers, currently the best non-filtering algorithm for single patterns

Experimental results
Comparison against Muth and Manber (k = 1), times in seconds for increasing numbers of patterns:

DNA:       MM    1.30   3.97   12.86   42.52
           Ours  0.08   0.12    0.21    0.54
Proteins:  MM    1.17   1.19    1.26    2.33
           Ours  0.08   0.11    0.18    0.59

Experimental results, random DNA
[Plot: search time in seconds (log scale) vs. k = 2..16, comparing Ours (ℓ=6), Ours (ℓ=8), Ours (strict), Ours (strictest), CM, LT, EXP, BPM and ABNDM.]

Experimental results, random DNA
[Plot: search time in seconds (log scale, up to 100 s) vs. k = 2..14, same algorithms.]

Experimental results, random proteins
[Plot: search time in seconds (log scale) vs. k = 2..16, comparing Ours, Ours (stricter), Ours (strictest), CM, LT, EXP, BPM and ABNDM.]

Experimental results, random proteins
[Plot: search time in seconds (log scale, up to 100 s) vs. k = 2..16, same algorithms.]

Experimental results
Areas where each algorithm performs best, on the (k, r) plane with k = 1..16 and r up to 256. From left to right: DNA and proteins. Top row: random data; bottom row: real data.
[Plots: regions labeled Ours, EXP and BPM; Ours dominates for low k.]

Conclusions
Our new algorithm becomes the fastest for low error levels; the larger r is, the smaller the error levels tolerated.
When applied to just one pattern, our algorithm becomes the fastest for low difference ratios.
Our basic algorithm usually beats the extensions. This is true only if we use the same parameter ℓ for both algorithms: with limited memory we can use the stricter matching condition with a smaller ℓ, and beat the basic algorithm.
Our algorithm would be favored on even longer texts (the relative preprocessing cost decreases).