Maxime Crochemore, Thierry Lecroq. HAL Id: hal https://hal.archives-ouvertes.fr/hal

Similar documents
ibalance-abf: a Smartphone-Based Audio-Biofeedback Balance System

Mobility management and vertical handover decision making in heterogeneous wireless networks

On line construction of suffix trees 1

A graph based framework for the definition of tools dealing with sparse and irregular distributed data-structures

A usage coverage based approach for assessing product family design

Online vehicle routing and scheduling with continuous vehicle tracking

The truck scheduling problem at cross-docking terminals

Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski.

Minkowski Sum of Polytopes Defined by Their Vertices

Faut-il des cyberarchivistes, et quel doit être leur profil professionnel?

Equilibria and stability analysis of a branched metabolic network with feedback inhibition

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

QASM: a Q&A Social Media System Based on Social Semantics

Additional mechanisms for rewriting on-the-fly SPARQL queries proxy

Managing Risks at Runtime in VoIP Networks and Services

Full and Complete Binary Trees

GDS Resource Record: Generalization of the Delegation Signer Model

An update on acoustics designs for HVAC (Engineering)

VR4D: An Immersive and Collaborative Experience to Improve the Interior Design Process

Study on Cloud Service Mode of Agricultural Information Institutions

Indexing Techniques for File Sharing in Scalable Peer-to-Peer Networks

Aligning subjective tests using a low cost common set

Global Identity Management of Virtual Machines Based on Remote Secure Elements

Use of tabletop exercise in industrial training disaster.

Overview of model-building strategies in population PK/PD analyses: literature survey.

Approximate String Matching in DNA Sequences

ANIMATED PHASE PORTRAITS OF NONLINEAR AND CHAOTIC DYNAMICAL SYSTEMS

Information Technology Education in the Sri Lankan School System: Challenges and Perspectives

IntroClassJava: A Benchmark of 297 Small and Buggy Java Programs

Optimization results for a generalized coupon collector problem

Expanding Renewable Energy by Implementing Demand Response

Application-Aware Protection in DWDM Optical Networks

Data Structures and Algorithms Written Examination

Lempel-Ziv Factorization: LZ77 without Window

Extended Application of Suffix Trees to Data Compression

New implementions of predictive alternate analog/rf test with augmented model redundancy

International Journal of Advanced Research in Computer Science and Software Engineering

Fast Edge Splitting and Edmonds Arborescence Construction for Unweighted Graphs

Training Ircam s Score Follower

Territorial Intelligence and Innovation for the Socio-Ecological Transition

Novel Client Booking System in KLCC Twin Tower Bridge

Towards Unified Tag Data Translation for the Internet of Things

Analysis of Compression Algorithms for Program Data

Rigorous Measurement of IP-Level Neighborhood of Internet Core Routers

Efficient Multi-Feature Index Structures for Music Data Retrieval

Partial and Dynamic reconfiguration of FPGAs: a top down design methodology for an automatic implementation

Suffix Tree Construction and Storage with Limited Main Memory

Distributed network topology reconstruction in presence of anonymous nodes

Regular Expressions with Nested Levels of Back Referencing Form a Hierarchy

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

Performance Evaluation of Encryption Algorithms Key Length Size on Web Browsers

Contribution of Multiresolution Description for Archive Document Structure Recognition

Section IV.1: Recursive Algorithms and Recursion Trees

Cobi: Communitysourcing Large-Scale Conference Scheduling

Testing Web Services for Robustness: A Tool Demo

SELECTIVELY ABSORBING COATINGS

Virtual plants in high school informatics L-systems

Flauncher and DVMS Deploying and Scheduling Thousands of Virtual Machines on Hundreds of Nodes Distributed Geographically

A Partition-Based Efficient Algorithm for Large Scale. Multiple-Strings Matching

Ontology-based Tailoring of Software Process Models

Testing LTL Formula Translation into Büchi Automata

A Non-Linear Schema Theorem for Genetic Algorithms

Regular Expressions and Automata using Haskell

Automatic Generation of Correlation Rules to Detect Complex Attack Scenarios

Improved Method for Parallel AES-GCM Cores Using FPGAs

Adaptive Fault Tolerance in Real Time Cloud Computing

Blind Source Separation for Robot Audition using Fixed Beamforming with HRTFs

Efficient Data Structures for Decision Diagrams

Towards Collaborative Learning via Shared Artefacts over the Grid

Linear Work Suffix Array Construction

The Effectiveness of non-focal exposure to web banner ads

Diversity against accidental and deliberate faults

Undulators and wigglers for the new generation of synchrotron sources

Telepresence systems for Large Interactive Spaces

Symbol Tables. Introduction

Automata on Infinite Words and Trees

Introduction to the papers of TWG18: Mathematics teacher education and professional development.

Algorithms and Data Structures

ANALYSIS OF SNOEK-KOSTER (H) RELAXATION IN IRON

A First Investigation of Sturmian Trees

Fast Sequential Summation Algorithms Using Augmented Data Structures

1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++

DEM modeling of penetration test in static and dynamic conditions

2 SYSTEM DESCRIPTION TECHNIQUES

Identifying Objective True/False from Subjective Yes/No Semantic based on OWA and CWA

An Automatic Reversible Transformation from Composite to Visitor in Java

An integrated planning-simulation-architecture approach for logistics sharing management: A case study in Northern Thailand and Southern China

Data Structures Fibonacci Heaps, Amortized Analysis

Reliability Guarantees in Automata Based Scheduling for Embedded Control Software

Multilateral Privacy in Clouds: Requirements for Use in Industry

Cracks detection by a moving photothermal probe

A parallel algorithm for the extraction of structured motifs

What Development for Bioenergy in Asia: A Long-term Analysis of the Effects of Policy Instruments using TIAM-FR model

From Dance to Touch: Movement Qualities for Interaction Design

Euclidean Minimum Spanning Trees Based on Well Separated Pair Decompositions Chaojun Li. Advised by: Dave Mount. May 22, 2014

Understanding Big Data Spectral Clustering

ISO9001 Certification in UK Organisations A comparative study of motivations and impacts.

Proposal for the configuration of multi-domain network monitoring architecture

DiskTrie: An Efficient Data Structure Using Flash Memory for Mobile Devices

Analysis of Algorithms I: Binary Search Trees

Transcription:

Suffix tree Maxime Crochemore, Thierry Lecroq To cite this version: Maxime Crochemore, Thierry Lecroq. Suffix tree. Ling Liu and M. Tamer Özsu. Encyclopedia of Dataase Systems, Springer Verlag, pp.2876-2880, 2009. <hal-00470102> HAL Id: hal-00470102 https://hal.archives-ouvertes.fr/hal-00470102 Sumitted on 19 Mar 2013 HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are pulished or not. The documents may come from teaching and research institutions in France or aroad, or from pulic or private research centers. L archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, puliés ou non, émanant des étalissements d enseignement et de recherche français ou étrangers, des laoratoires pulics ou privés.

SUFFIX TREE Maxime Crochemore King s College London and Université Paris-Est, http://www.dcs.kcl.ac.uk/staff/mac/ Thierry Lecroq Université de Rouen, http://monge.univ-mlv.fr/~lecroq SYNONYMS Compact suffix trie DEFINITION The suffix tree S(y) of a non-empty string y of length n is a compact trie representing all the suffixes of the string. The suffix tree of y is defined y the following properties: All ranches of S(y) are laeled y all suffixes of y. Edges of S(y) are laeled y strings. Internal nodes of S(y) have at least two children. Edges outgoing an internal node are laeled y segments starting with different letters. The segments are represented y their starting position on y and their lengths. Moreover, it is assumed that y ends with a symol occurring nowhere else in it (the space sign is used in the examples of the present entry). This avoids marking nodes, and implies that S(y) has exactly n leaves (numer of non-empty suffixes). All the properties then imply that the total size of S(y) is O(n), which makes it possile to design a linear-time construction of the suffix tree. HISTORICAL BACKGROUND The first linear time algorithm for uilding a suffix tree in from Weiner [15] ut it requires quadratic space: O(n σ) where σ is the size of the alphaet. The first linear time and space algorithm for uilding a suffix tree is from McCreight [13]. It works off-line ie it inserts the suffixes from the longest one to the shortest one. A strictly sequential version of the construction of the suffix tree was descried y Ukkonen [14]. When the alphaet is potentially infinite, the construction algorithms of the suffix tree can e implemented to run in time O(n log σ) and then are optimal since they imply an ordering on the letters of the alphaet. On particular integer alphaets, Farach [5] showed that the construction can e done in linear time. The minimization of the suffix trie gives the suffix automaton. The suffix automaton of a string is also known under the name of DAWG, for Directed Acyclic Word Graph. Its linearity was discovered y Blumer et al. (see [1]), who gave a linear construction (on a fixed alphaet). The minimality of the structure as an automaton is from Crochemore [2] who showed how to uild with the same complexity the factor automaton of a text The compaction of the suffix automaton gives the compact suffix automaton (see [1]), A direct construction algorithm of the compact suffix automaton was presented y Crochemore and Vérin [4]. The same structure arises when minimizing the suffix tree. The suffix array of the string y consists of oth the permutation of positions on the text that gives the sorted list of suffixes and the corresponding array of lengths of their longest common prefixes (LCP). The suffix array of a

a 0 6 a 1 2 a 8 3 4 5 6 14 15 9 10 11 12 17 18 22 6 21 5 20 4 19 3 16 2 13 1 7 0 i 0 1 2 3 4 5 6 y[i] a a Figure 1: Suffix trie of aa. Nodes are numered in the order of creation. The small numers closed to each leaf correspond to the position of the suffix associated to the leaf. string, with the associated search algorithm ased on the knowledge of the common prefixes is from Maner and Myers [12]. It can e uilt in linear time on integer alphaets (see [8, 9, 10]). For the implementation of index structures in external memory, the reader can refer to Ferragina and Grossi [6]. SCIENTIFIC FUNDAMENTALS Suffix trees The suffix tree S(y) of the string y = aa is presented Figure 2. It can e seen as a compaction of the suffix trie T (y) of y given Figure 1. Nodes of S(y) and T (y) are identified with segments of y. Leaves of S(y) and T (y) are identified with suffixes of y. An output is defined, for each leaf, which is the starting position of the suffix in y. The two structures can e uilt y successively inserting the suffixes of y from the longest to the shortest. In the suffix trie T (y) of a string y of length n there exist n paths from the root to the n leaves: each path spells a different non-empty suffix of y. Edges are laeled y exactly one symol. The suffix trie can have a quadratic numer of nodes since the sum of the lengths of all the suffixes of y is quadratic. To get the suffix tree from the suffix trie, internal nodes with exactly one successor are removed. Then laels of edges etween remaining nodes are concatenated. Edges are now laeled with strings. This gives a linear numer of nodes since there are exactly n leaves and since every internal node (called a fork) has at least two successors, there can e at most n 1 forks. This also gives a linear numer of edges. Now, in order that the space requirement ecomes linear, since the laels of the edges are all segments of y, a segment y[i..i + l 1] is represented y the pair (i, l). Thus each edge can e represented in constant space. This technique requires to have y residing in main memory. Overall, there is a linear numer of nodes and a linear numer of edges, each node and each edge can e represented in constant space thus the suffix tree requires a linear space. There exist several direct linear time construction algorithms of the suffix tree that avoid the construction of the suffix trie followed y its compaction. The McCreight algorithm [13] directly constructs the suffix tree of the string y y successively inserting the suffixes of y from the longest one to the shortest one. The insertion of the suffix of y eginning at position i (ie y[i..n 1]) consist first in locating (creating it if necessary) the fork associated with the longest prefix of y[i..n 1] common with a longest suffix of y: y[0..n 1], y[1..n 1],..., or y[i 1..n 1]. Let us call the head u such a longest prefix and the tail v the remaining suffix such that y[i.. n 1] = uv. Once the fork p associated with u has een 2

(2,5) 1 0 3 (0, 2) (4,3) 4 2 0 (1,1) (2,5) 2 1 (6,1) 10 6 5 (6,1) (4,1) 9 5 7 (6,1) 8 4 (5, 2) 6 3 i 0 1 2 3 4 5 6 y[i] a a Figure 2: Suffix tree of aa. Nodes are numered in the order of creation. The small numers closed to each leaf correspond to the position of the suffix associated to the leaf. located it is enough to add a new leaf q laeled y i and a new edge laeled y (i + u, v ) etween p and q to complete the insertion of y[i..n 1] into the structure. The reader can refer to [3] for further details. The linear time of the construction is achieved y using a function called suffix link defined on the forks as follows: if fork p is identified with segment av, a A and v V, then s y (p) = q where fork q is identified with v. Suffix links are represented y dotted arrows on Figure 2. The suffix links create shortcuts that are used to accelerate heads computations. If the head of y[i 1..n 1] is of the form au (a A, u V ) then u is a prefix of the head of y[i..n 1]. Therefore, using suffix links, the insertion of the suffix y[i..n 1] consists first in finding the fork corresponding to the head of y[i..n 1] (starting from suffix link of the fork associated with au) and then in inserting the tail of y[i..n 1] from this fork. Ukkonen algorithm [14] works on-line ie it uilds the suffix tree of y processing the symols of y from the first to the last. It also uses suffix links to achieve a linear time computation. The Weiner, McCreight and Ukkonen works in O(n) time whenever O(n σ) space is used. If only O(n) space is used then the O(n) time ound should e replaced y O(n log min{n, σ}). This is due to the access time for a specific edge stored in each nodes. The reader can refer to [11] for specific optimized implementations of suffix trees. For particular integer alphaets, if the alphaet of y is in the interval [1..n c ] for some constant c, Farach [5] showed that the construction can e done in linear time. Generalized suffix trees are used to represent all the suffixes of all the strings of a set of strings. Word suffix trees can e used when processing natural languages in order to represent only suffixes starting after separators such as space or line feed. Indexes The suffix tree serves as a full index on the string: it provides a direct access to all segments of the string, and gives the positions of all their occurrences in the string. An index on y can e considered as an astract data type whose asic set is the set of all the segments of y, and that possesses operations giving access to information relative to these segments. The utility of considering the suffixes of a string for this kind of application comes from the ovious remark that every segment of a string is the prefix of a suffix of the string Once the suffix tree of a text y is uilt, searching for x in y remains to spell x along a ranch of the tree. If this walk is successful the positions of the pattern can e output. Otherwise, x does not occur in y. Any kind of trie that represents the suffixes of a string can e used to search it. But the suffix tree has additional features which imply that its size is linear. We consider four operations relative to the segments of a string y: the memership, the first position, the numer of occurrences and the list of positions. 3

The first operation on an index is the memership of a string x to the index, that is to say the question to know whether x is a segment of y. This question can e specified in two complementary ways whether we expect to find an occurrence of x in y or not. If x does not occur in y, it is often interesting in practice to know the longest eginning of x that is a segment of y. This is the type of usual answer necessary for the sequential search tools in a text editor. The methods produce without large modification the position of an occurrence of x, and even the position of the first or last occurrence of x in y. Knowing that x is in the index, another relevant information is constituted y its numer of occurrences in y. This information can differently direct the ulterior searches. Finally, with the same assumption than previously, a complete information on the localization of x in y is supplied y the list of positions of its occurrences. Suffix trees can easily answer these questions. It is enough to spell x from the root of S(y). If it is not possile, then x does not occur in y. Whenever x occurs in y, let w e the shortest segment of y that is such x is a prefix of w and w is associated with a node p of S(y). Then the numer of leaves of the sutree rooted in the node p gives the numer of occurrences of x in y. The smallest (respectively largest) of these leaves gives the position of the first (resp. last) position of x in y. The list of the numer of these leaves gives the list of position of x in y. KEY APPLICATIONS* Suffix trees are used to solve string searching prolems mainly when the text into which a pattern has to e found is fixed. It is also used in other string related prolems such as Longest Repeated Sustring, Longest Common Sustring. It can e used to perform text compression. Gusfield [7] gives many applications of suffix trees in computational iology. CROSS REFERENCE* Trie RECOMMENDED READING Between 5 and 15 citations to important literature, e.g., in journals, conference proceedings, and wesites. [1] A. Blumer, J. Blumer, A. Ehrenfeucht, D. Haussler, M. T. Chen and J. Seiferas, The smallest automaton recognizing the suwords of a text Theor. Comput. Sci. 40(1):31 55, 1985. [2] M. Crochemore, Transducers and repetitions, Theor. Comput. Sci. 45(1):63 86, 1986. [3] M. Crochemore, C. Hancart and T. Lecroq, Algorithms on strings, Camridge University Press, 2007. [4] M. Crochemore and R. Vérin, On compact directed acyclic word graphs, in Structures in Logic et Computer Science, LNCS 1261, 192 211, 1997. [5] M. Farach, Optimal suffix tree construction with large alphaets, in Proceedings of the 38th IEEE Annual Symposium on Foundations of Computer Science, Miami Beach, Florida, 137 143, 1997. [6] P. Ferragina and R. Grossi, The string B-tree: A new data structure for string search in external memory et its applications, J. Assoc. Comput. Mach. 46:236-280, 1999. [7] D. Gusfield, Algorithms on strings, trees and sequences, Camridge University Press, 1997. [8] J. Kärkkäinen and P. Sanders, Simple linear work suffix array construction, in ICALP, LNCS 2719, 943 955, 2003. [9] D. K. Kim, J. S. Sim, H. Park and K. Park, Linear-Time Construction of Suffix Arrays, in CPM03, LNCS 2676, 186 199, 2003. [10] P. Ko and S. Aluru, Space Efficient Linear Time Construction of Suffix Arrays, in CPM03, LNCS 2676, 200 210, 2003. [11] S. Kurtz, Reducing the space requirement of suffix trees, Softw., Pract. Exper. 29(13):1149 1171 (1999) [12] U. Maner et G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput. 22(5):935 948, 1993. 4

[13] E. M. McCreight, A space-economical suffix tree construction algorithm, J. Algorithms 23(2):262 272, 1976. [14] E. Ukkonen, On-line construction of suffix trees, Algorithmica 14(3):249 260, 1995. [15] P. Weiner, Linear pattern matching algorithm, in Proceedings of the 14th Annual IEEE Symposium on Switching et Automata Theory, Washington, DC, 1 11, 1973. 5