Lecture 4: Exact string searching algorithms




COSC 348: Computing for Bioinformatics
Lecture 4: Exact string searching algorithms
Lubica Benuskova
http://www.cs.otago.ac.nz/cosc348/

Definitions

A pattern (keyword) is an ordered sequence of symbols. Symbols of the pattern and the searched text are chosen from a predetermined finite set, called an alphabet (Σ). In general, an alphabet can be any finite set of symbols/letters. In bioinformatics: DNA alphabet Σ = {A,C,G,T}; RNA alphabet Σ = {A,C,G,U}; protein alphabet Σ = {A,R,N,...,V} (20 amino acids).

Exact string searching or matching

Much of data processing in bioinformatics involves, in one way or another, recognising certain patterns within DNA, RNA (1st assignment) or protein sequences. String matching consists of finding one, several or, more generally, all the occurrences of a string of length m (called a pattern or keyword) within a text of total length n characters.

An example of an exact string search (match):
Pat:
Txt: HERE IS A SIMPLE

Exact string search algorithms

Algorithm                            Preprocessing time     Matching time
Naïve string search (brute force)    0 (no preprocessing)   average O(n+m), worst O(nm)
Knuth-Morris-Pratt algorithm         O(m)                   O(n)
Boyer-Moore algorithm                O(m + |Σ|)             O(n/m), O(n)
Rabin-Karp algorithm                 O(m)                   average O(n+m), worst O(nm)
Suffix trees                         O(n)                   O(m+z), z = number of matches

35 algorithms with codes at http://www-igm.univ-mlv.fr/~lecroq/string/

Naïve string search (brute force)

The most intuitive way is to slide a window of length m (the pattern) over the text (of length n) from left to right, one letter at a time. Within the window, compare successive characters:

ABCABCDABABCDABCDABDE
BCD

If there is not a copy of the whole pattern in the first m characters of the text, we look whether there is a copy of the pattern starting at the second character of the text:

ABCABCDABABCDABCDABDE
 BCD
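The sliding-window idea can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the function name naive_search is an assumption, and positions are counted from 0.

```python
def naive_search(txt: str, pat: str) -> list[int]:
    """Return the start index of every occurrence of pat in txt (brute force)."""
    n, m = len(txt), len(pat)
    matches = []
    for i in range(n - m + 1):      # i: left edge of the window in txt
        j = 0                       # j: position inside pat
        while j < m and txt[i + j] == pat[j]:
            j += 1                  # compare successive characters
        if j == m:                  # the whole window matched
            matches.append(i)
    return matches


print(naive_search("ABCABCDABABCDABCDABDE", "BCD"))  # [4, 10, 14]
```

The inner loop usually stops after one or two comparisons, which is why the typical cost stays close to O(n+m) even though the worst case is O(nm).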

Naïve string search (brute force)

If there is not a copy of the pattern starting at the second character of the text, we look whether there is a copy of the pattern starting at the third character of the text, and so forth:

ABCABCDABABCDABCDABDE
  BCD

until we hit a match; then we continue in the same way along the text and count the number of matches.

ABCABCDABABCDABCDABDE
    BCD                 Match!

Properties of the naïve search

Can be used on-line (advantage).
Usually takes O(n+m) steps, which is not so bad: the inner loop finds a mismatch quickly and moves on to the next position without going through all m steps.
The worst case is O(nm), for example when searching for aaab in aaaaaaaaaaaaaaaaaaaaaaaab.

Knuth-Morris-Pratt algorithm

Integer i denotes the position within the searched txt that is the beginning of the prospective match for pat. Integer j denotes the character currently under consideration in pat. The symbol - denotes a gap in the sequence.

ABC-ABCDAB-ABCDABCDABDE
ABCDABD

Slide a window of length m (the pattern) over the text (of length n) from left to right. Within the window, compare successive characters from left to right until a mismatch is hit.

ABC-ABCDAB-ABCDABCDABDE
ABCDABD

When a mismatch occurs, the pattern itself is used to determine the next meaningful position from which to continue, in this case i = 4, j = 0:

ABC-ABCDAB-ABCDABCDABDE
    ABCDABD

From the next meaningful position, i.e. i = 4, we proceed in the same way. There is a nearly complete match "ABCDAB" when we hit a mismatch again at pat[6] and txt[10].

ABC-ABCDAB-ABCDABCDABDE
    ABCDABD

We passed an "AB" which could be the beginning of a new match, so we simply reset i = 8, j = 2 and continue matching the current character from left to right within the window.

ABC-ABCDAB-ABCDABCDABDE
        ABCDABD

This search fails immediately, as pat does not contain a gap, so we return to the beginning of pat by resetting j = 0, and begin searching at i = 11 in the text.

ABC-ABCDAB-ABCDABCDABDE
           ABCDABD

Once again we immediately hit upon a match "ABCDAB", but the next character, 'C', does not match the final character 'D' of pat.

ABC-ABCDAB-ABCDABCDABDE
           ABCDABD

Thus we set i = 15, to start at the two-character string "AB", set j = 2 in pat, and continue matching from the current position. This time we are able to complete the match, whose first character is at txt[15].

ABC-ABCDAB-ABCDABCDABDE
               ABCDABD    Match!

Properties of Knuth-Morris-Pratt algorithm

Can be used on-line (advantage), like the naïve search, but it is substantially improved.
Time to find a match is only O(n), with O(m) preprocessing time.
The partial match table makes it possible to avoid matching any letter of txt more than once.
Can be modified to search for multiple patterns in a single search.
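The walk-through above can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the lecture's own code: the names partial_match_table and kmp_search are hypothetical, and here i is the index of the text character currently being read (not the start of the prospective match as in the slides), so the start of a match is reported as i - j + 1.

```python
def partial_match_table(pat: str) -> list[int]:
    """fail[j] = length of the longest proper prefix of pat[:j+1] that is also its suffix."""
    fail = [0] * len(pat)
    k = 0
    for j in range(1, len(pat)):
        while k > 0 and pat[j] != pat[k]:
            k = fail[k - 1]         # fall back to a shorter border
        if pat[j] == pat[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_search(txt: str, pat: str) -> list[int]:
    """Return the start index of every occurrence of pat in txt."""
    if not pat:
        return []
    fail = partial_match_table(pat)
    matches = []
    j = 0                           # number of pattern characters matched so far
    for i, ch in enumerate(txt):    # each text character is read exactly once
        while j > 0 and ch != pat[j]:
            j = fail[j - 1]         # reuse the part of the pattern already matched
        if ch == pat[j]:
            j += 1
        if j == len(pat):
            matches.append(i - j + 1)
            j = fail[j - 1]         # keep going to find later matches
    return matches


print(partial_match_table("ABCDABD"))                      # [0, 0, 0, 0, 1, 2, 0]
print(kmp_search("ABC-ABCDAB-ABCDABCDABDE", "ABCDABD"))    # [15]
```

The table entry 2 at position 5 is what lets the search resume with j = 2 after passing an "AB", exactly as in the steps above.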

Boyer-Moore algorithm

The Boyer-Moore (BM) algorithm is a particularly efficient string searching algorithm, and it has been the standard benchmark for practical string searching. The BM algorithm holds a window containing pat over txt, much as the naïve search does. This window moves from left to right; its improved performance, however, rests on two clever ideas:

1. Inspect the window from right to left.
2. Recognize the possibility of large shifts in the window without missing a match.

By fetching the S underlying the last character of the pat we learn:
We are not standing on a match (because S isn't E).
We wouldn't find a match even if we slid the pattern right by 1 (because S isn't L), by 2 (because S isn't P), etc.

Since S doesn't occur in the pattern at all, we can slide the pattern to the right by its own length without missing a match. This shift can be pre-calculated for every letter and stored in a table. This table is called a bad character shift table.

Focus your attention on the right end of the pattern. E is not P, L is not P, but P = P, so let us shift the pat to the right to align it with the P in the text.

We have discovered that MPLE occurs in the txt; let us put it in front of the pat. Now we can shift the pattern all the way down to align this discovered occurrence in the txt with its last occurrence in the pattern (which is partly imaginary).
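The bad character shift table can be illustrated with a small Python sketch. Note that this is the Horspool simplification of Boyer-Moore (bad character rule only, keyed on the text character under the last position of the window), not the full algorithm with both shift tables. The function names and the sample strings "HERE IS A SIMPLE EXAMPLE" and "EXAMPLE" are assumptions chosen for illustration.

```python
def bad_character_table(pat: str) -> dict[str, int]:
    """Shift per character: distance from its last occurrence (excluding the
    final position) to the end of the pattern. Absent characters shift by len(pat)."""
    m = len(pat)
    # Later occurrences overwrite earlier ones, so the rightmost one wins.
    return {ch: m - 1 - k for k, ch in enumerate(pat[:-1])}


def horspool_search(txt: str, pat: str) -> list[int]:
    """Return the start index of every occurrence of pat in txt."""
    n, m = len(txt), len(pat)
    if m == 0 or m > n:
        return []
    shift = bad_character_table(pat)
    matches = []
    i = 0                                   # left edge of the window
    while i <= n - m:
        j = m - 1
        while j >= 0 and txt[i + j] == pat[j]:
            j -= 1                          # inspect the window right to left
        if j < 0:
            matches.append(i)
        # Shift by the table entry of the text character under the window's end.
        i += shift.get(txt[i + m - 1], m)
    return matches


print(bad_character_table("EXAMPLE"))
print(horspool_search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))   # [17]
```

With these sample strings the first window ends on S, which is not in the table, so the window jumps by the full pattern length of 7.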

There are only seven terminal substrings of the pattern, so we can pre-compute all these shifts too and store them in a table. This is sometimes called the good suffix shift table. In general, if the algorithm has a choice of more than one shift, it takes the largest one.

We've aligned the MPLE, but focus on the end of the pattern. E is not P, L is not P, but P = P, so let us shift the pat to the right to align it with the P in the text.

Match!

Boyer-Moore algorithm: properties

Observe that we have found the pattern without looking at all of the characters. The speed of the algorithm derives from the fact that it can determine all occurrences of pat within txt without examining too many characters in txt. In fact, its average performance is O(n/m), that is, it gets faster as the pattern gets longer. We say the algorithm is sublinear in the sense that it generally looks at fewer characters than it passes.

Rabin-Karp algorithm: hashing

The Rabin-Karp algorithm uses the naïve search method (i.e. a sliding window) and substantially speeds up the testing of equality of the pattern to the substrings in the text by using hashing. It is used for multiple pattern matching (in addition to single pattern matching), because it has the unique advantage of being able to find any one of k strings in O(n) time on average, regardless of the magnitude of k. The key to performance is the efficient computation of hash values of the successive substrings of the text.

A hash function converts every string into a numerical value, called its hash value (code, sum), using for instance the ASCII values of the characters. For example, hash("hello") = 5. The algorithm exploits the fact that if two strings are equal, their hash values are also equal. The converse does not hold: different strings can share a hash value (a so-called hash collision), so every hash match must be checked letter by letter.

All we have to do is compute the hash value of the pattern we're searching for, and then look for substrings with the same hash value within the text (and then check letter by letter). Different variants of the algorithm compute hash values in different ways (adding, multiplying, etc.).

Rabin-Karp algorithm: properties

One popular and effective hash function treats every substring as a number in some base, the base usually being a large prime. For example, if the substring is "hi" and the base b = 101, then hash("hi") = 'h'*b^1 + 'i'*b^0 = 104*101 + 105*1 = 10,609.

Rabin-Karp is inferior to the Boyer-Moore algorithm for single pattern searching because of its slow worst-case behaviour. However, Rabin-Karp is an algorithm of choice for multiple pattern search. That is, if we want to find many fixed-length patterns in a text, say of length k, we can create a simple variant of Rabin-Karp that checks whether the hash of a given substring of the text belongs to the set of hash values of the patterns we are looking for.
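A hedged sketch of the multiple-pattern variant follows, using the base-101 polynomial hash from the "hi" example. The modulus MOD, the function names and the sample patterns are assumptions added for illustration; the slides do not specify them. The hash of each window is derived from the previous one in constant time (a rolling hash), and every hash hit is confirmed by comparing the actual strings.

```python
BASE = 101                  # base used in the "hi" example
MOD = 1_000_000_007         # assumed modulus that keeps hash values bounded


def poly_hash(s: str) -> int:
    """Treat s as a number in base BASE, reduced modulo MOD."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rabin_karp_multi(txt: str, patterns: set[str]) -> list[tuple[int, str]]:
    """Find every occurrence of any pattern; all patterns must share one length k."""
    if not patterns:
        return []
    k = len(next(iter(patterns)))
    if any(len(p) != k for p in patterns):
        raise ValueError("all patterns must have the same length")
    if k == 0 or k > len(txt):
        return []
    wanted = {poly_hash(p) for p in patterns}
    high = pow(BASE, k - 1, MOD)            # weight of the window's leftmost character
    h = poly_hash(txt[:k])
    hits = []
    for i in range(len(txt) - k + 1):
        if h in wanted:                     # cheap test first
            window = txt[i:i + k]
            if window in patterns:          # rule out hash collisions
                hits.append((i, window))
        if i + k < len(txt):                # roll the hash one character to the right
            h = ((h - ord(txt[i]) * high) * BASE + ord(txt[i + k])) % MOD
    return hits


print(poly_hash("hi"))      # 10609
print(rabin_karp_multi("ABCABCDABABCDABCDABDE", {"BCD", "DAB"}))
```

Because membership in a set of hashes is a constant-time test on average, the cost per window does not grow with the number of patterns.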

Aho-Corasick algorithm

Used for multiple pattern matching tasks. Description from the article and code by Tomas Petricek at http://www.codeproject.com/kb/recipes/ahocorasick.aspx

The algorithm consists of two parts. The first part is the building of the tree from the keywords/patterns you want to search for, and the second part is searching the text for the keywords using the previously built tree (finite state machine, FSM). An FSM is a deterministic model of behaviour composed of a finite number of states and transitions between those states.

In the first phase of the tree building, keywords are added to the tree. (The root node is used only as a placeholder and contains links to other letters.) Links created in this first step represent the goto function, which returns the next state when a character is matching.

Example of the tree for keywords: his, hers, she

The fail function is used when a character is not matching. For example, in the text "shis", the fail function is used to exit from the "she" branch to the "his" branch after the first two characters (because the third character is not matching).

During the second phase, the BFS (breadth-first search) algorithm is used for traversing all the nodes. At each stage, the node to be expanded is indicated by a marker. In general, all the nodes at a given depth are expanded before any nodes at the next level are expanded.

Suffix trees

Help: find the tutorial on efficient string search with suffix trees written by Mark Nelson at http://marknelson.us/1996/08/01/suffix-trees/

Assume that a generalised suffix tree has been built for the set of patterns D = {S_1, S_2, ..., S_K} of total length n = n_1 + n_2 + ... + n_K. All patterns have the same alphabet. You can then search for patterns in such a way that you can:

Check if a pattern P of length m is a substring in O(m) time.
Find the first occurrence of the patterns P_1, ..., P_q of total length m as substrings in O(m) time.
Find all z occurrences of the patterns P_1, ..., P_q of total length m as substrings in O(m + z) time.

Conclusions

Although data are memorized in various ways, text remains the main form for exchanging information. String matching is a very important subject in the wider domain of text processing (e.g. keyword search), not just in bioinformatics.

In bioinformatics, the patterns in strands of DNA, RNA and proteins have important biological meaning, e.g. they are promoters, enhancers, operators, genes, introns, exons, etc. Often these meaningful patterns undergo mutations at some points; therefore we include in the patterns so-called wildcards, which replace some of the characters (as in the assignment).
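As a closing illustration of the wildcard idea, the naïve sliding-window search can be relaxed so that a designated symbol in the pattern matches any text character. This is only a sketch: the function name wildcard_search, the choice of "?" as the wildcard symbol and the sample DNA string are assumptions, not part of the assignment.

```python
def wildcard_search(txt: str, pat: str, wildcard: str = "?") -> list[int]:
    """Brute-force search in which the wildcard symbol in pat matches any character."""
    m = len(pat)
    matches = []
    for i in range(len(txt) - m + 1):
        window = txt[i:i + m]
        # A position agrees if the pattern holds a wildcard or the same letter.
        if all(p == wildcard or p == t for p, t in zip(pat, window)):
            matches.append(i)
    return matches


print(wildcard_search("ACGTACCTAGGT", "AC?T"))   # [0, 4]
```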