An efficient matching algorithm for encoded DNA sequences and binary strings




Simone Faro and Thierry Lecroq
faro@dmi.unict.it, thierry.lecroq@univ-rouen.fr
Dipartimento di Matematica e Informatica, Università di Catania, Italy
University of Rouen, LITIS EA 4108, 76821 Mont-Saint-Aignan Cedex, France
Combinatorial Pattern Matching, 22-24 June 2009, Lille, France

Outline
1 Introduction
2 A new algorithm
3 Experimental Results

Faro and Lecroq (Catania and Rouen), Matching encoded sequences, CPM 09


Problem
Searching for all exact occurrences of a pattern p (|p| = m) in a text t (|t| = n), where both p and t are bitstreams.

Example
p = 110010110010010010110010001
t = 0101010001010101010100100111001011001001001011001000101001010011001001

Requirement
Avoid access to individual bits; access only blocks of k bits.

Special cases
Each character of p and t consists of
a single bit: binary sequences
a couple of bits: encoded DNA sequences


Existing solutions
S. T. Klein and M. K. Ben-Nissan. Accelerating Boyer-Moore searches on binary texts. CIAA, LNCS 4783, pp. 130-143, 2007.
J. W. Kim, E. Kim, and K. Park. Fast matching method for DNA sequences. Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, LNCS 4614, pp. 271-281, 2007.
S. Faro and T. Lecroq. Efficient pattern matching on binary strings. SOFSEM, poster, 2009.


Preprocessing
The algorithm computes:
1 a table of k copies of p, in order to process text and pattern block by block (as in [Klein & Ben-Nissan 2007])
2 bit-mask vectors to implement a multi-pattern version of the BNDM algorithm
3 an index-list table to identify candidate alignments during the searching phase
4 a shift table based on the bad-character heuristic to increase the length of the shifts

Byte
We suppose that the block size k is fixed.
All references to both text and pattern will only be to entire blocks of k bits.
We refer to a k-bit block as a byte, though values larger than k = 8 could be supported.
T[i] and P[i] denote, respectively, the (i + 1)-th byte of the text and of the pattern.
The last byte may be only partially defined. We suppose that the undefined bits of the last byte are set to 0.
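The block convention above can be sketched as follows. This is only an illustration: the helper name pack_blocks and the choice of a '0'/'1' character string as input are assumptions, not part of the slides.

```python
def pack_blocks(bits, k=8):
    """Pack a '0'/'1' string into k-bit integers (hypothetical helper).

    The undefined trailing bits of the last block are set to 0,
    as the slide assumes.
    """
    padding = (k - len(bits) % k) % k
    padded = bits + "0" * padding
    return [int(padded[j:j + k], 2) for j in range(0, len(padded), k)]


# The pattern of the running example, packed with k = 8.
P = pack_blocks("110010110010010010110010001")
print([format(b, "08b") for b in P])
# prints ['11001011', '00100100', '10110010', '00100000']
```

With this layout, P[i] is the (i + 1)-th byte of the pattern and only the last byte carries padding.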

k copies of p
We define k copies of the pattern p, denoted by Patt[i], each shifted by i positions to the right, for 0 ≤ i < k, that is, i ∈ P = {0, 1, ..., k - 1}.
In each pattern Patt[i], the i leftmost bits of the first byte remain undefined and are set to 0.
Similarly, the rightmost ((k - ((m + i) mod k)) mod k) bits of the last byte are set to 0.

Example
p = 110010110010010010110010001, of length 27

Patt[i]  0         1         2         3         4
i = 0    11001011  00100100  10110010  00100000
i = 1    01100101  10010010  01011001  00010000
i = 2    00110010  11001001  00101100  10001000
i = 3    00011001  01100100  10010110  01000100
i = 4    00001100  10110010  01001011  00100010
i = 5    00000110  01011001  00100101  10010001
i = 6    00000011  00101100  10010010  11001000  10000000
i = 7    00000001  10010110  01001001  01100100  01000000
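The table of shifted copies can be reproduced with a short sketch. The function name shifted_copies and the string-based input are assumptions made for illustration; the padding rule is the one stated for Patt[i].

```python
def shifted_copies(p_bits, k=8):
    """Return the k copies Patt[0..k-1] of a '0'/'1' pattern (hypothetical helper).

    Copy i is the pattern shifted i bits to the right: i zero bits in front,
    and (k - ((m + i) mod k)) mod k zero bits at the end.
    """
    m = len(p_bits)
    patt = []
    for i in range(k):
        padded = "0" * i + p_bits + "0" * ((k - (m + i) % k) % k)
        patt.append([int(padded[j:j + k], 2) for j in range(0, len(padded), k)])
    return patt


patt = shifted_copies("110010110010010010110010001")
for i, copy in enumerate(patt):
    print(i, " ".join(format(b, "08b") for b in copy))
```

Each printed row should correspond to one row of the example table, with copies 6 and 7 spilling into a fifth byte.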

Additional information to the k copies
b_i: the index of the first byte in Patt[i] containing a k-substring of p
e_i: the index of the last byte of the pattern Patt[i]
m_i: the number of bytes in Patt[i] containing k-substrings of p
F1[i]: bit mask for the first byte of Patt[i]
F2[i]: bit mask for the last byte of Patt[i]

Example
p = 110010110010010010110010001, of length 27

i  b_i  e_i  m_i  F1[i]     F2[i]
0  0    3    3    11111111  11100000
1  1    3    2    01111111  11110000
2  1    3    2    00111111  11111000
3  1    3    2    00011111  11111100
4  1    3    2    00001111  11111110
5  1    3    3    00000111  11111111
6  1    4    3    00000011  10000000
7  1    4    3    00000001  11000000
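The values b_i, e_i, m_i, F1[i] and F2[i] depend only on the pattern length and on k. The sketch below derives them arithmetically; the function name copy_metadata and the closed-form expressions are assumptions chosen to match the example table, and the code assumes a pattern long enough that every copy spans at least two bytes.

```python
def copy_metadata(m, k=8):
    """Return a list of (b_i, e_i, m_i, F1[i], F2[i]) for 0 <= i < k.

    Hypothetical helper: m is the pattern length in bits.
    Assumes m + i > k for every i, so each copy has at least two bytes.
    """
    full = (1 << k) - 1
    rows = []
    for i in range(k):
        b = 0 if i == 0 else 1            # byte 0 is fully defined only for i = 0
        e = (m + i - 1) // k              # index of the last byte
        last_bits = (m + i - 1) % k + 1   # defined bits in the last byte
        last_full = e if last_bits == k else e - 1
        m_i = last_full - b + 1           # number of fully defined bytes
        f1 = full >> i                    # clears the i undefined leading bits
        f2 = (full << (k - last_bits)) & full  # keeps the defined leading bits
        rows.append((b, e, m_i, f1, f2))
    return rows


for i, (b, e, m_i, f1, f2) in enumerate(copy_metadata(27)):
    print(i, b, e, m_i, format(f1, "08b"), format(f2, "08b"))
```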

Example
p = 110010110010010010110010001, of length 27
(only the bytes b_i to b_i + m_i - 1, which contain k-substrings of p, are kept)

Patt[i]  0         1         2         3
i = 0    11001011  00100100  10110010
i = 1              10010010  01011001
i = 2              11001001  00101100
i = 3              01100100  10010110
i = 4              10110010  01001011
i = 5              01011001  00100101  10010001
i = 6              00101100  10010010  11001000
i = 7              10010110  01001001  01100100

Example
p = 110010110010010010110010001, of length 27
(each row is truncated to the first min{m_i} = 2 of its fully defined bytes)

Patt[i]  0         1         2
i = 0    11001011  00100100
i = 1              10010010  01011001
i = 2              11001001  00101100
i = 3              01100100  10010110
i = 4              10110010  01001011
i = 5              01011001  00100101
i = 6              00101100  10010010
i = 7              10010110  01001001

Bit-parallelism
The algorithm uses bit-parallelism to simulate the behavior of an NFA constructed over the set of patterns Patt[i].
However, in order to let the automaton fit in a single machine word of size ω, only the substrings Patt[i][b_i .. b_i + m - 1] are handled by the automaton, where m = min({m_i} ∪ {ω}). Here m is counted in bytes (m = 2 in the running example), unlike the bit length |p| = m used in the problem statement.
P = set of the remaining k patterns of length m.

Bit-parallelism
m + 1 different states: Q = {0, 1, 2, 3, ..., m}
m different transitions: state q, with 0 < q ≤ m, has a transition towards state q - 1 labeled with the class of characters {Patt[i][s_i + q]}
m is the initial state

p = 110010110010010010110010001, of length 27

Patt[i]  0           1           2
i = 0    11001011=L  00100100=A
i = 1                10010010=H  01011001=F
i = 2                11001001=K  00101100=C
i = 3                01100100=G  10010110=I
i = 4                10110010=J  01001011=E
i = 5                01011001=F  00100101=B
i = 6                00101100=C  10010010=H
i = 7                10010110=I  01001001=D

ω = 32

block             M
00100100=A        00000000000000000000000000000001
00100101=B        00000000000000000000000000000001
00101100=C        00000000000000000000000000000011
01001001=D        00000000000000000000000000000001
01001011=E        00000000000000000000000000000001
01011001=F        00000000000000000000000000000011
01100100=G        00000000000000000000000000000010
10010010=H        00000000000000000000000000000011
10010110=I        00000000000000000000000000000011
10110010=J        00000000000000000000000000000010
11001001=K        00000000000000000000000000000010
11001011=L        00000000000000000000000000000010
c ∉ {A, ..., L}   00000000000000000000000000000000
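The mask table M above is consistent with the usual BNDM convention in which the last pattern position maps to the lowest bit. The following sketch rebuilds it from the eight truncated patterns; the names PATTERNS and build_masks are assumptions for illustration, and the bit order is inferred from the example rather than stated on the slide.

```python
# Truncated patterns of the running example (two bytes each), rows i = 0..7.
PATTERNS = [
    (0b11001011, 0b00100100),  # L A
    (0b10010010, 0b01011001),  # H F
    (0b11001001, 0b00101100),  # K C
    (0b01100100, 0b10010110),  # G I
    (0b10110010, 0b01001011),  # J E
    (0b01011001, 0b00100101),  # F B
    (0b00101100, 0b10010010),  # C H
    (0b10010110, 0b01001001),  # I D
]


def build_masks(patterns):
    """Map each block to a bit mask; bit (m - 1 - q) is set when the block
    occurs at position q of at least one pattern (hypothetical helper)."""
    m = len(patterns[0])
    masks = {}
    for pat in patterns:
        for q, block in enumerate(pat):
            masks[block] = masks.get(block, 0) | (1 << (m - 1 - q))
    return masks


M = build_masks(PATTERNS)
for block in sorted(M):
    print(format(block, "08b"), format(M[block], "032b"))
```

Blocks that occur in no pattern are simply absent from the dictionary, which plays the role of the all-zero mask in the last table row.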

Index list
The NFA also recognizes words that are not substrings of the pattern.
To filter them out, the algorithm maintains, for each block B ∈ {0, ..., 2^k - 1}, a linked list λ[B] which is used to find candidate patterns.
In particular, for each block B ∈ {0, ..., 2^k - 1}: λ[B] = {i | Patt[i][b_i + m - 1] = B}
When a block sequence is recognized by the automaton, ending at block position j of the text, the algorithm naively checks for the occurrence of any pattern Patt[g], with g ∈ λ[T[j]].

Example
p = 110010110010010010110010001, of length 27

block                                λ
00100100=A                           {0}
00100101=B                           {5}
00101100=C                           {2}
01001001=D                           {7}
01001011=E                           {4}
01011001=F                           {1}
10010010=H                           {6}
10010110=I                           {3}
c ∉ {A, B, C, D, E, F, H, I}         ∅
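The index list groups the pattern indices by the last block of each truncated pattern. A minimal sketch, assuming the truncated patterns are available as two-byte tuples (the names PATTERNS and build_index_list are illustrative, and a Python dictionary of lists stands in for the linked lists):

```python
# Truncated patterns of the running example (two bytes each), rows i = 0..7.
PATTERNS = [
    (0b11001011, 0b00100100),  # L A
    (0b10010010, 0b01011001),  # H F
    (0b11001001, 0b00101100),  # K C
    (0b01100100, 0b10010110),  # G I
    (0b10110010, 0b01001011),  # J E
    (0b01011001, 0b00100101),  # F B
    (0b00101100, 0b10010010),  # C H
    (0b10010110, 0b01001001),  # I D
]


def build_index_list(patterns):
    """Map each block B to the indices i whose truncated pattern ends with B
    (hypothetical helper)."""
    lam = {}
    for i, pat in enumerate(patterns):
        lam.setdefault(pat[-1], []).append(i)
    return lam


LAMBDA = build_index_list(PATTERNS)
print(LAMBDA[0b00101100])  # candidates when the window ends with block C
```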

Shift table
[Figure: the text aligned against the set of patterns]


Shift table
The algorithm uses a long shift rule, a multi-pattern version of the original bad-character shift heuristic, improved with an efficient look-ahead.
This shift rule is used when no substring is recognized while scanning the text from right to left.
In such a case the current window of the text can be safely advanced by m positions to the right.

Shift table
Observe that, if j is the right position of the current window of the text, the block at position j + m is always involved in the next attempt, thus we can use it to compute the next window position.
ls : {0, ..., 2^k - 1} → {m, ..., 2m - 1}
ls[B] = m - 1 + min{distance of B from the right end of a pattern in P}
The next right position of the window is j + ls[T[j + m - 1]].

p = 110010110010010010110010001, of length 27

block             ls
00100100=A        1 + 0
00100101=B        1 + 0
00101100=C        1 + 0
01001001=D        1 + 0
01001011=E        1 + 0
01011001=F        1 + 0
01100100=G        1 + 1
10010010=H        1 + 0
10010110=I        1 + 0
10110010=J        1 + 1
11001001=K        1 + 1
11001011=L        1 + 1
c ∉ {A, ..., L}   1 + 2
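The ls values in the table can be recomputed with the sketch below. It follows the table literally (m - 1 plus the smallest distance from the right end, with distance m for blocks that occur in no pattern); the names PATTERNS and build_long_shift are assumptions for illustration.

```python
# Truncated patterns of the running example (two bytes each), rows i = 0..7.
PATTERNS = [
    (0b11001011, 0b00100100),  # L A
    (0b10010010, 0b01011001),  # H F
    (0b11001001, 0b00101100),  # K C
    (0b01100100, 0b10010110),  # G I
    (0b10110010, 0b01001011),  # J E
    (0b01011001, 0b00100101),  # F B
    (0b00101100, 0b10010010),  # C H
    (0b10010110, 0b01001001),  # I D
]


def build_long_shift(patterns, k=8):
    """Return a table ls of size 2^k (hypothetical helper).

    ls[B] = (m - 1) + smallest distance of B from the right end of a pattern;
    blocks occurring in no pattern get distance m.
    """
    m = len(patterns[0])
    ls = [(m - 1) + m] * (1 << k)
    for pat in patterns:
        for q, block in enumerate(pat):
            ls[block] = min(ls[block], (m - 1) + (m - 1 - q))
    return ls


LS = build_long_shift(PATTERNS)
print(LS[0b00101100], LS[0b01100100], LS[0b01010010])
# prints 1 2 3
```

In the running example the window ending at j = 1 looks at block T[2] = 01010010, which occurs in no pattern, so the window moves to j = 1 + 3 = 4.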

The searching phase
2 parts: a match phase and a shift phase

The shift phase
The NFA is represented by a state vector D of size m (the first state of the automaton is not represented).
As in the SBndm algorithm [Holub & Durian 2005], each iteration starts with a test of 2 consecutive text characters and implements a fast loop to obtain better results on average.
This fast loop makes use of the long shift table to compute the next window alignment.

The match phase
Right-to-left scan in a window of size m, ending at position j in the text, as in the BNDM algorithm.
The state vector is updated in a similar fashion as in the Shift-And algorithm [BYG92].
If the state vector D is equal to 0 after l + 1 updates of D, then a word of length l has been recognized by the automaton.
If l = m, a candidate alignment has been found and the algorithm naively checks the occurrence of any pattern contained in the index list λ[T[j]].
In all cases, after the match phase, the index j is advanced by m - l + 1 to the right and the algorithm restarts its computation with the shift phase.
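A simplified sketch of the right-to-left scan is given below. It is a plain BNDM-style loop over blocks and omits the fast loop of the shift phase; the names PATTERNS, TEXT, build_masks and recognized_length are assumptions for illustration, and the bit order of the masks is inferred from the example table rather than stated on the slides.

```python
# Truncated patterns of the running example (two bytes each), rows i = 0..7.
PATTERNS = [
    (0b11001011, 0b00100100), (0b10010010, 0b01011001),
    (0b11001001, 0b00101100), (0b01100100, 0b10010110),
    (0b10110010, 0b01001011), (0b01011001, 0b00100101),
    (0b00101100, 0b10010010), (0b10010110, 0b01001001),
]

# Text of the running example, packed in bytes (last byte zero-padded).
TEXT = [0b01010100, 0b01010101, 0b01010010, 0b01110010, 0b11001001,
        0b00101100, 0b10001010, 0b01010011, 0b00100100]


def build_masks(patterns):
    """Bit (m - 1 - q) of masks[B] is set when B occurs at position q."""
    m = len(patterns[0])
    masks = {}
    for pat in patterns:
        for q, block in enumerate(pat):
            masks[block] = masks.get(block, 0) | (1 << (m - 1 - q))
    return masks


def recognized_length(text, j, masks, m):
    """Scan the window ending at block j from right to left and return the
    number l of blocks read before the state vector D becomes 0."""
    full = (1 << m) - 1
    d = full                      # all states active
    length = 0
    while length < m and j - length >= 0:
        d &= masks.get(text[j - length], 0)
        if d == 0:
            break
        length += 1
        d = (d << 1) & full       # move to the next position on the left
    return length


M = build_masks(PATTERNS)
for j in (1, 4, 5):
    print(j, recognized_length(TEXT, j, M, 2))
```

For j = 5 the returned length equals m = 2, which is the case where the candidates in λ[T[j]] would then be checked.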

The match phase
If a candidate alignment is found for a text position j, then for each index i ∈ λ[T[j]] the algorithm uses the precomputed table Patt[i] to check whether s = j - m - b_i + 1 is a valid shift.

Example
p = 110010110010010010110010001
A = 00100100, B = 00100101, C = 00101100, D = 01001001, E = 01001011, F = 01011001, G = 01100100, H = 10010010, I = 10010110, J = 10110010, K = 11001001, L = 11001011
t = 0101010001010101010100100111001011001001001011001000101001010011001001

j     0         1         2         3         4         5         6         7         8
T[j]  01010100  01010101  01010010  01110010  11001001  00101100  10001010  01010011  00100100

Window on blocks 0 and 1 (j = 1): D = 0.
Look-ahead at block j + 1 = 2: 01010010 ∉ {A, B, C, D, E, F, G, H, I, J, K, L}, so ls = 3.
Window on blocks 3 and 4 (j = 4): D = 0.
Look-ahead at block j + 1 = 5: the block is C and ls[C] = 1.
Window on blocks 4 and 5 (j = 5), that is K C: D ≠ 0.
Candidate check: λ[C] = {2}. The pattern is aligned as 110010 11001001 00101100 10001 over blocks 3 to 6, with F1[2] = 00111111 applied to the first block and F2[2] = 11111000 applied to the last block.
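The final check of the example can be sketched as a masked comparison of Patt[i] against the text blocks. The function name occurs_at and the constants TEXT, PATT_2, F1_2 and F2_2 are assumptions for illustration (the values are copied from the example), and the code assumes a copy spanning at least two bytes.

```python
# Text of the running example, packed in bytes (last byte zero-padded).
TEXT = [0b01010100, 0b01010101, 0b01010010, 0b01110010, 0b11001001,
        0b00101100, 0b10001010, 0b01010011, 0b00100100]

# Patt[2] with its first-byte and last-byte masks F1[2] and F2[2].
PATT_2 = [0b00110010, 0b11001001, 0b00101100, 0b10001000]
F1_2 = 0b00111111
F2_2 = 0b11111000


def occurs_at(text, copy, s, f1, f2):
    """Check whether the shifted copy matches the text starting at block s.

    The first and last blocks are compared under the masks f1 and f2,
    the inner blocks are compared exactly. Assumes len(copy) >= 2.
    """
    e = len(copy) - 1
    if s < 0 or s + e >= len(text):
        return False
    if (text[s] & f1) != copy[0]:
        return False
    if (text[s + e] & f2) != copy[e]:
        return False
    return all(text[s + h] == copy[h] for h in range(1, e))


# j = 5, m = 2 and b_2 = 1 give s = j - m - b_2 + 1 = 3.
j, m, b = 5, 2, 1
s = j - m - b + 1
print(s, occurs_at(TEXT, PATT_2, s, F1_2, F2_2))
# prints 3 True
```

Since copy 2 is shifted by 2 bits, a match at block s = 3 corresponds to an occurrence of p starting at bit position 3 * 8 + 2 = 26 of the text.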

Complexity

                               time                          space
Patt, b_i, e_i, m_i, F1, F2    O(k⌈m/k⌉) = O(m)              O(m)
Bit masks                      O(2^k + km)                   O(2^k)
Index list                     O(2^k + k)                    O(2^k)
Shift table                    O(2^k + m)                    O(2^k)
Searching phase                O(⌈n/k⌉ ⌈m/k⌉ k) = O(nm)
Overall                        O(2^k + (n + k)m)             O(2^k + m)

Handling encoded DNA sequences
Each base is represented by a couple of bits.
Thus a DNA sequence γ can be represented with a bitstream of 2|γ| bits.
Any occurrence of a given encoded pattern p starts at an even position of the text.
This suggests that only even alignments of the pattern have to be processed.
The only change to be applied, when handling encoded DNA sequences, is in the preprocessing of the set of patterns.
Specifically, the set P is defined by P = {i | 0 ≤ i < k and (i mod 2) = 0}.
For instance, if each block consists of k = 8 bits, we have P = {0, 2, 4, 6}.
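As a small illustration of the DNA case, the sketch below encodes bases on 2 bits and lists the even alignments. The particular mapping A=00, C=01, G=10, T=11 is an assumption (the slides only say that each base takes a couple of bits), as are the names CODE, encode_dna and even_alignments.

```python
# Assumed 2-bit code; the slides do not specify which base maps to which pair.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode_dna(seq):
    """Encode a DNA string as a '0'/'1' bitstream of 2 * len(seq) bits."""
    return "".join(CODE[base] for base in seq)


def even_alignments(k=8):
    """Return P = {i | 0 <= i < k and i mod 2 = 0} as a sorted list."""
    return [i for i in range(k) if i % 2 == 0]


print(encode_dna("ACGT"))   # prints 00011011
print(even_alignments(8))   # prints [0, 2, 4, 6]
```

Only the copies Patt[i] with i in this set need to be built, which halves the number of patterns handled in the preprocessing.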


Experimental Results
BBM: Klein and Ben-Nissan, 2007
FED: Kim, Kim and Park, 2007
BHM: Faro and Lecroq, SOFSEM 2009
BSKS: Faro and Lecroq, SOFSEM 2009
BFL: Faro and Lecroq, CPM 2009
All algorithms were implemented in the C programming language and were used to search for the same binary strings in large fixed text buffers on a PC with a 1.66 GHz Intel Core2 processor.

Experimental results for a Rand(0/1) 50 problem
[Plot: time versus pattern length (0 to 500) for BHM, BSKS, FED and BFL]

Experimental results for a Rand(0/1) 70 problem
[Plot: time versus pattern length (0 to 500) for BHM, BSKS, FED and BFL]

Experimental results for an encoded DNA sequence
[Plot: time versus pattern length (0 to 500) for BHM, BSKS, FED and BFL]

Conclusion and perspectives
An algorithm for exact matching on binary strings and encoded DNA sequences.
It combines a multi-pattern version of the BNDM algorithm with a simplified shift strategy of the CW algorithm.
It is efficient in practical cases.
It adapts easily to the exact multiple string matching case.