Basic Research in Computer Science BRICS RS Brodal et al.: Solving the String Statistics Problem in Time O(n log n)


BRICS
Basic Research in Computer Science

Solving the String Statistics Problem in Time O(n log n)

Gerth Stølting Brodal
Rune B. Lyngsø
Anna Östlin
Christian N. S. Pedersen

BRICS Report Series, March 2002

Copyright © 2002, Gerth Stølting Brodal & Rune B. Lyngsø & Anna Östlin & Christian N. S. Pedersen. BRICS, Department of Computer Science, University of Aarhus. All rights reserved.

Reproduction of all or part of this work is permitted for educational or research use on condition that this copyright notice is included in any copy.

See back inner page for a list of recent BRICS Report Series publications. Copies may be obtained by contacting:

BRICS
Department of Computer Science
University of Aarhus
Ny Munkegade, building 540
DK-8000 Aarhus C
Denmark
Telephone: Telefax: Internet: [email protected]

BRICS publications are in general accessible through the World Wide Web and anonymous FTP through these URLs: ftp://ftp.brics.dk

This document in subdirectory RS/02/13/

Solving the String Statistics Problem in Time O(n log n)

Gerth Stølting Brodal, Rune B. Lyngsø, Anna Östlin, Christian N. S. Pedersen

March, 2002

Abstract

The string statistics problem consists of preprocessing a string of length n such that given a query pattern of length m, the maximum number of non-overlapping occurrences of the query pattern in the string can be reported efficiently. Apostolico and Preparata introduced the minimal augmented suffix tree (MAST) as a data structure for the string statistics problem, and showed how to construct the MAST in time $O(n \log^2 n)$ and how it supports queries in time O(m) for constant sized alphabets. A subsequent theorem by Fraenkel and Simpson stating that a string has at most a linear number of distinct squares implies that the MAST requires space O(n). In this paper we improve the construction time for the MAST to O(n log n) by extending the algorithm of Apostolico and Preparata to exploit properties of efficient joining and splitting of search trees together with a refined analysis.

Keywords: Strings, suffix trees, string statistics, periods, search trees

A short version of this paper has been published as [5].

BRICS (Basic Research in Computer Science, funded by the Danish National Research Foundation), Department of Computer Science, University of Aarhus, Ny Munkegade, DK-8000 Århus C, Denmark. E-mail: {gerth,cstorm}@brics.dk. Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST (ALCOM-FT).
Supported by the Carlsberg Foundation (contract number ANS-0257/20).
Department of Statistics, Oxford University, Oxford OX1 3TG, UK. E-mail: [email protected].
IT University of Copenhagen, Glentevej 67, DK-2400 Copenhagen NV. E-mail: [email protected]. Work done while at BRICS.
Bioinformatics Research Center (BiRC), funded by Aarhus University Research Foundation.

1 Introduction

The string statistics problem consists of preprocessing a string S of length n such that given a query pattern α of length m, the maximum number of non-overlapping occurrences of α in S can be reported efficiently. Without preprocessing, the maximum number of non-overlapping occurrences of α in S can be found in time O(n), by using a linear time string matching algorithm to find all occurrences of α in S, e.g. the algorithm by Knuth, Morris, and Pratt [14], and then in a greedy fashion from left to right compute the maximal number of non-overlapping occurrences.

Apostolico and Preparata in [3] described a data structure for the string statistics problem, the minimal augmented suffix tree MAST(S), with preprocessing time $O(n \log^2 n)$ and query time O(m) for constant sized alphabets. In this paper we present an improved algorithm for constructing MAST(S) with preprocessing time O(n log n), and prove that MAST(S) requires space O(n), which follows from a recent theorem of Fraenkel and Simpson [9].

The basic idea of the algorithm of Apostolico and Preparata and of our algorithm for constructing MAST(S) is to perform a traversal of the suffix tree of S while maintaining the leaf-lists of the nodes visited in appropriate data structures (see Section 1.1 for definition details). Traversing the suffix tree of a string to construct and examine the leaf-lists at each node is a general technique for finding regularities in a string, e.g. for finding squares in a string (or tandem repeats) [2, 18], for finding maximal quasi-periodic substrings, i.e. substrings that can be covered by a shorter substring [1, 6], and for finding maximal pairs with bounded gap [4]. All these problems can be solved using this technique in time O(n log n). Other applications are listed by Gusfield in [10, Chapter 7].

A crucial component of our algorithm is the representation of a leaf-list by a collection of search trees, such that the leaf-list of a node in the suffix tree of S can be constructed from the leaf-lists of the children by efficient merging.
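As a rough illustration of the linear-time baseline described in the first paragraph of this section, the sketch below counts non-overlapping occurrences with a Knuth-Morris-Pratt scan that restarts after every reported match. Restarting has the same effect as greedily choosing the leftmost occurrence that does not overlap the previous choice. The function names are hypothetical, and this is only the O(n) per-query method without preprocessing, not the MAST query.

```python
def failure_function(pattern):
    """KMP failure table: fail[i] is the length of the longest proper border of pattern[:i+1]."""
    fail, k = [0] * len(pattern), 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def max_nonoverlapping(text, pattern):
    """Greedy left-to-right count of non-overlapping occurrences of pattern in text."""
    if not pattern:
        raise ValueError("pattern must be nonempty")
    fail, k, count = failure_function(pattern), 0, 0
    for ch in text:
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            count += 1
            k = 0  # restart: the next reported match cannot overlap this one
    return count


print(max_nonoverlapping("abababa", "aba"))  # occurrences at 0, 2, 4; greedy keeps 0 and 4
```

Each text character is inspected an amortized constant number of times, so one query costs time linear in the text length, which is the cost the preprocessing in this paper is designed to avoid.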
Hwang and Lin [13] described how to optimally merge two sorted lists of length $n_1$ and $n_2$, where $n_1 \le n_2$, with $O(n_1 \log \frac{n_1+n_2}{n_1})$ comparisons. Brown and Tarjan [7] described how to achieve the same number of comparisons for merging two AVL-trees in time $O(n_1 \log \frac{n_1+n_2}{n_1})$, and Huddleston and Mehlhorn [12] showed a similar result for level-linked (2,4)-trees. In our algorithm we will use a slightly extended version of level-linked (2,4)-trees where each element has an associated weight.

The rest of this section contains basic definitions and lemmas that we will use in the later sections. In Section 2 we give a precise definition of the string statistics problem and MAST(S). In Section 3 we give all the string properties and definitions enabling us to construct MAST(S) in time O(n log n), and a self-contained proof of the theorem of Fraenkel and Simpson that any string

has at most a linear number of distinct squares as substrings. In Section 4 we describe the variant of level-linked (2,4)-trees used by our algorithm. In Section 5 we present our algorithm, and in Section 6 we prove that the running time of the algorithm is O(n log n).

Figure 1: The string has periods 3, 6, 9, and 11, i.e. the period of the string is 3.

1.1 Preliminaries

Some of the terminology and notation used in the following originates from [3], but with minor modifications. We let Σ denote a finite alphabet, and for a string $S \in \Sigma^*$ we let $|S|$ denote the length of S, $S[i]$ the ith character in S, for $1 \le i \le |S|$, and $S[i..j] = S[i]S[i+1] \cdots S[j]$ the substring of S from the ith to the jth character, for $1 \le i \le j \le |S|$. The suffix $S[i..|S|]$ of S starting at position i will be denoted $S[i..]$. An integer p, for $1 \le p \le |S|$, is denoted a period of S if and only if the suffix $S[p+1..]$ of S is also a prefix of S, i.e. $S[p+1..] = S[1..|S|-p]$. The shortest period p of S is denoted the period of S, and the string S is said to be periodic if and only if $p \le |S|/2$. Figure 1 shows the periods of a string of length 11. A nonempty string S is a square if $S = \alpha\alpha$ for some string α.

In the rest of this paper S denotes the input string with length n and α a substring of S. A non-empty string α is said to occur in S at position i if $\alpha = S[i..i+|\alpha|-1]$ and $1 \le i \le n-|\alpha|+1$. E.g. in the string the substring occurs at positions 1 and 8. The maximum number of non-overlapping occurrences of a string α in a string S is the maximum number of occurrences of α where no two occurrences overlap. E.g. the maximum number of non-overlapping occurrences of in is three, since the occurrences at positions 1, 5 and 9 do not overlap.

The suffix tree ST(S) of the string S is the compressed trie storing all suffixes of the string $S\$$ where $\$ \notin \Sigma$. Each leaf in ST(S) represents a suffix $S[i..]\$$ of $S\$$ and is annotated with the index i. Each edge in ST(S) is labeled with a nonempty substring of $S\$$, represented by the start and end

positions in S, such that the path from the root to the leaf annotated with index i spells the suffix $S[i..]\$$. We refer to the substring of S spelled by the path from the root to a node v as the path-label of v and denote it L(v). We refer to the set of indices stored at the leaves of the subtree rooted at v as the leaf-list of v and denote it LL(v). Since LL(v) is exactly the set of start positions i where L(v) is a prefix of the suffix $S[i..]\$$, we have Fact 1 below.

Fact 1 If v is an internal node of ST(S), then $LL(v) = \bigcup_{c \text{ child of } v} LL(c)$, and $i \in LL(v)$ if and only if L(v) occurs at position i in S.

Figure 2 shows the suffix tree of a string of length 13. The problem of constructing ST(S) has been studied intensively and several algorithms have been developed which for constant sized alphabets can construct ST(S) in time and space $O(|S|)$ [8, 16, 19, 20]. For non-constant alphabet sizes the running time of the algorithms becomes $O(|S| \log |\Sigma|)$.

In the following we let the height of a tree T be denoted h(T) and be defined as the maximum number of edges in a root-to-leaf path in T, and let the size of T be denoted $|T|$ and be defined as the number of leaves of T. For a node v in T we let $T_v$ denote the subtree of T rooted at node v, and let $|v| = |T_v|$ and $h(v) = h(T_v)$. Finally, for a node v in a binary tree we let small(v) denote the child of v with smaller size (ties are broken arbitrarily).

The basic idea of our algorithm in Section 5 is to process the suffix tree of the input string bottom-up, such that we at each node v spend amortized time $O(|small(v)| \log(|v|/|small(v)|))$. Lemma 1 then states that the total time becomes O(n log n) [17, Exercise 35].

Lemma 1 Let T be a binary tree with n leaves. If for every internal node v, $c_v = |small(v)| \log(|v|/|small(v)|)$, and for every leaf v, $c_v = 0$, then $\sum_{v \in T} c_v \le n \log n$.

Proof. The proof is by induction in the size of T. If $|T| = 1$, then the lemma holds vacuously. Now assume inductively that the upper bound holds for all trees with at most $n-1$ leaves. Consider a tree with n leaves where the number of leaves in the subtrees rooted at the two children of the root are k and $n-k$ where $0 < k \le n/2$.
According to the induction hypothesis the sum over all nodes in the two subtrees is bounded by respectively $k \log k$ and $(n-k)\log(n-k)$. The entire sum is thus bounded by $k\log(n/k) + k\log k + (n-k)\log(n-k) = k\log n + (n-k)\log(n-k) < n\log n$, which proves the lemma.
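The bound of Lemma 1 can be checked numerically on arbitrary tree shapes. The sketch below is only an illustration under stated assumptions: it never builds a tree, it just draws the size k of the smaller child at random for every internal node (the helper name and the random shapes are not from the paper) and sums the charges $c_v$ with base-2 logarithms.

```python
import math
import random


def total_charge(n, rng):
    """Sum of |small(v)| * log2(|v| / |small(v)|) over a random binary tree with n leaves."""
    total, pending = 0.0, [n]          # pending holds subtree sizes still to expand
    while pending:
        size = pending.pop()
        if size == 1:
            continue                   # a leaf is charged 0
        k = rng.randint(1, size // 2)  # size of the smaller child, 0 < k <= size/2
        total += k * math.log2(size / k)
        pending.extend((k, size - k))
    return total


rng = random.Random(2002)
for n in (2, 10, 1000, 100000):
    charge = total_charge(n, rng)
    assert charge <= n * math.log2(n) + 1e-6
    print(n, round(charge, 1), round(n * math.log2(n), 1))
```

For a perfectly balanced tree with n a power of two every level contributes n/2, so the total is (n/2) log n, within a factor two of the bound.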

Figure 2: To the left is the suffix tree ST(S) of the string S. The node v has path-label L(v) and leaf-list LL(v) = {1, 3, 6, 9}. To the right is the minimal augmented suffix tree MAST(S) for the string S. Internal nodes are labelled with the c-values.

2 The String Statistics Problem

Given a string S of length n and a pattern α of length m the following greedy algorithm will compute the maximum number of non-overlapping occurrences of α in S. Find all occurrences of α in S by using an exact string matching algorithm. Choose the leftmost occurrence. Continue to choose greedily the leftmost occurrence not overlapping with any so far chosen occurrence. This greedy algorithm will compute the maximum number of non-overlapping occurrences of α in S in time O(n), since all matchings can be found in time O(n), e.g. by the algorithm by Knuth, Morris, and Pratt [14].

In the string statistics problem we want to preprocess a string S such that queries of the following form are supported efficiently: Given a query string α, what is the maximum number of non-overlapping occurrences of α in S? The maximum number of non-overlapping occurrences of α is called the c-value of α, denoted c(α). The preprocessing will be to compute the minimal augmented suffix tree described below. Given the minimal augmented suffix tree, string statistics queries can be answered in time O(m).

For any substring α of S there is exactly one path from the root of ST(S) ending in a node or on an edge of ST(S) spelling out the string α. This node or edge is called the locus of α. In a suffix tree ST(S) the number of leaves in the subtree below the locus of α in ST(S) tells us the number of occurrences of α in S. These occurrences may overlap, hence the suffix tree is not immediately suitable for the string statistics problem. The minimal augmented suffix tree for S, denoted MAST(S), can be constructed from the

suffix tree ST(S) as follows. A minimum number of new auxiliary nodes are inserted into ST(S) in such a way that all substrings with locus on an edge (u, v), where u is the parent of v, have c-value equal to c(L(v)), i.e. the c-value only changes at internal nodes along a path from a leaf to the root. Each internal node v in the augmented tree is then labeled by c(L(v)) to get the minimal augmented suffix tree. Figure 2 shows the suffix tree and the minimal augmented suffix tree for a string of length 13. The space needed to store MAST(S) is O(n), since by Lemma 6 the minimal augmented suffix tree has at most 3n internal nodes.

3 String Properties

Lyndon and Schutzenberger [15] proved the following periodicity lemma for periodic strings.

Lemma 2 If a string S has two periods $p, q \le |S|/2$, then gcd(p, q) is also a period of S.

If S is periodic, then by Lemma 2 the period p of S divides all periods of S less than or equal to $|S|/2$. Any prefix $S'$ of S with length at least p also has period p. If $S'$ has length at least 2p, then the period of $S'$ by Lemma 2 divides p, implying that the period of $S'$ also is a period of S.

Corollary 1 If S has period $p \le |S|/2$, then p is also the period of the prefixes $S[1..k]$ for $2p \le k \le |S|$.

The lemma below gives a characterization of how the occurrences of a string α can appear in S.

Lemma 3 Let S be a string and α a substring of S. If the occurrences of α in S are at positions $i_1 < \cdots < i_k$, then for all $1 \le j < k$ either $i_{j+1} - i_j = p$ or $i_{j+1} - i_j > \max\{|\alpha| - p, p\}$, where p denotes the period of α.

Proof. Consider two consecutive and overlapping occurrences of α at positions $i_j$ and $i_{j+1}$ in S, i.e. there are no occurrences of α at position k for $i_j < k < i_{j+1}$. Let $d = i_{j+1} - i_j \ge 1$. We will show that neither of the two cases $d < p$ or $p < d \le |\alpha| - p$ is possible. The two cases are illustrated in Figure 3. If $d < p$, then $d < |\alpha|$ since $p \le |\alpha|$. By definition α occurs at positions $i_j$ and $i_j + d$, implying $\beta = \alpha[1..|\alpha| - d]$ is both a prefix and a suffix of α. See Figure 3(a). By definition, d is then a period of α, contradicting that p is the shortest period of α.

If $p < d \le |\alpha| - p$, then we have the inequalities $i_j < i_{j+1} - p < i_{j+1} < i_{j+1} + p \le i_j + |\alpha|$. If α has period p, then α is a prefix of the infinite string $\beta\beta\beta\cdots$, where $\beta = \alpha[1..p]$. It follows that $\beta = S[i_{j+1}..i_{j+1}+p-1] = S[i_{j+1}-p..i_{j+1}-1]$, implying that α occurs at position $i_{j+1} - p$, which contradicts the assumption that there is no occurrence of α between positions $i_j$ and $i_{j+1}$.

Figure 3: The two cases considered in the proof of Lemma 3: (a) the case $d < p$; (b) the case $p < d \le |\alpha| - p$.

A consequence of Lemma 3 is that if $p \ge |\alpha|/2$, then an occurrence of α in S at position $i_j$ can only overlap with the occurrences at positions $i_{j-1}$ and $i_{j+1}$. If $p < |\alpha|/2$, then two consecutive occurrences $i_j$ and $i_{j+1}$ either satisfy $i_{j+1} - i_j = p$ or $i_{j+1} - i_j > |\alpha| - p$.

Corollary 2 If $i_{j+1} - i_j \le |\alpha|/2$, then $i_{j+1} - i_j = p$ where p is the period of α.

Motivated by the above observations we group the occurrences of α in S into chunks and necklaces. Let p denote the period of α. Chunks can only appear if $p < |\alpha|/2$. A chunk is a maximal sequence of occurrences containing at least two occurrences and where all consecutive occurrences have distance p. The remaining occurrences are grouped into necklaces. A necklace is a maximal sequence of overlapping occurrences, i.e. only two consecutive occurrences overlap at a given position and the overlap of two occurrences is between one and $p - 1$ positions long. Figure 4 shows the occurrences of the string in a string of length 55 grouped into chunks and necklaces. By definition two necklaces cannot overlap, but a chunk can overlap with another chunk or necklace at both ends. By Lemma 3 the overlap is at most $p - 1$ positions.
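To make the gap condition of Lemma 3 and the grouping into chunks and necklaces concrete, here is a small sketch. Everything in it is an assumption for illustration: the function names, the 0-based positions, the brute-force occurrence search, and the example string (the strings of Figures 1 and 4 are not recoverable here). It does not use the search-tree representation of Section 5.

```python
from itertools import product


def period(a):
    """Shortest period of a nonempty string, via the KMP failure function."""
    fail, k = [0] * len(a), 0
    for i in range(1, len(a)):
        while k and a[i] != a[k]:
            k = fail[k - 1]
        if a[i] == a[k]:
            k += 1
        fail[i] = k
    return len(a) - fail[-1]


def occurrences(s, a):
    """All (possibly overlapping) start positions of a in s, 0-based."""
    out, i = [], s.find(a)
    while i != -1:
        out.append(i)
        i = s.find(a, i + 1)
    return out


def gaps_satisfy_lemma3(s, a):
    """Every gap between consecutive occurrences is p or exceeds max(|a| - p, p)."""
    p, m, occ = period(a), len(a), occurrences(s, a)
    return all(d == p or d > max(m - p, p) for d in (y - x for x, y in zip(occ, occ[1:])))


def lemma3_holds_for_all(n, alphabet="ab"):
    """Exhaustive check over all strings of length n and all of their substrings."""
    for chars in product(alphabet, repeat=n):
        s = "".join(chars)
        if not all(gaps_satisfy_lemma3(s, s[i:j]) for i in range(n) for j in range(i + 1, n + 1)):
            return False
    return True


def group(s, a):
    """Split the occurrences of a in s into chunks and necklaces, left to right."""
    p, m, occ = period(a), len(a), occurrences(s, a)
    groups, i = [], 0
    while i < len(occ):
        j = i
        if 2 * p < m:  # chunks can only appear if p < |a|/2
            while j + 1 < len(occ) and occ[j + 1] - occ[j] == p:
                j += 1
        if j > i:
            groups.append(("chunk", occ[i:j + 1]))
        elif groups and groups[-1][0] == "necklace" and occ[i] - groups[-1][1][-1] < m:
            groups[-1][1].append(occ[i])  # overlaps the last occurrence of the necklace
        else:
            groups.append(("necklace", [occ[i]]))
        i = j + 1
    return groups


a = "aabaabaa"                                # period 3, length 8
s = a[:3] * 3 + a + "bbb" + a + a[1:]         # four occurrences at distance 3, then two overlapping by one
print(group(s, a))  # [('chunk', [0, 3, 6, 9]), ('necklace', [20, 27])]
```

In the example the last two occurrences are 7 positions apart, which is larger than max(8 - 3, 3) = 5 as Lemma 3 requires, and they overlap in a single position.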

Figure 4: The grouping of occurrences in a string into chunks and necklaces. Occurrences are shown below the string. Thick lines are occurrences in chunks. The grouping into chunks and necklaces is shown above the string. Necklaces are shown using dashed lines. Note that a necklace can consist of a single occurrence.

Figure 5: Examples of the contribution to the c-values by an isolated necklace (left; the contribution is $5 = \lceil 9/2 \rceil$) and an isolated chunk (right; $p = 2$, and the contribution is $3 = \lceil 8/\lceil 5/2 \rceil \rceil$).

In Figure 4 the chunk covering positions 9-23 overlaps with the necklace covering positions 1-9, and the chunk covering positions

We now turn to the contribution of chunks and necklaces to the c-values. We first consider the case where chunks and necklaces do not overlap. An isolated necklace or chunk is a necklace or chunk that does not overlap with other necklaces and chunks. Figure 5 gives an example of the contribution to the c-values by an isolated necklace and chunk.

Lemma 4 An isolated necklace of k occurrences of α contributes to the c-value of α with $\lceil k/2 \rceil$. An isolated chunk of k occurrences of α contributes to the c-value of α with $\lceil k/\lceil |\alpha|/p \rceil \rceil$, where p is the period of α.

Proof. Since only two consecutive occurrences in a necklace overlap, exactly every second occurrence in the necklace can contribute to the c-values (see Figure 5 (left)). In a chunk of k occurrences all occurrences have distance p, implying that only every $\lceil |\alpha|/p \rceil$th occurrence contributes to the c-value of α and the stated contribution follows (see Figure 5 (right)).

Motivated by Lemma 4, we define the nominal contribution of a necklace of k occurrences of α to be $\lceil k/2 \rceil$ and the nominal contribution of a chunk of k occurrences of α to be $\lceil k/\lceil |\alpha|/p \rceil \rceil$. The nominal contribution of a necklace or chunk of α's is the contribution to the c-value of α if the necklace or chunk appears isolated. The actual contribution to the c-value of α as a result of applying the greedy algorithm can at most be one less for each necklace and chunk, as argued in the proof of Lemma 5 below.
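The formulas of Lemma 4 can be compared against the greedy algorithm on hand-built isolated structures. The sketch below is an illustration under assumptions: the helper names are hypothetical, and the two test strings are chosen here to reproduce the numbers quoted in the caption of Figure 5 (9 occurrences in a necklace; 8 occurrences in a chunk with pattern length 5 and period 2), not taken from the figure itself.

```python
def greedy_count(s, a):
    """Greedy left-to-right count of non-overlapping occurrences of a in s."""
    count, i = 0, s.find(a)
    while i != -1:
        count += 1
        i = s.find(a, i + len(a))  # leftmost occurrence not overlapping the chosen one
    return count


def ceil_div(x, y):
    return -(-x // y)


def step(kind, m, p):
    """z = 2 for a necklace, ceil(|alpha| / p) for a chunk (m = |alpha|, p = period)."""
    return 2 if kind == "necklace" else ceil_div(m, p)


def nominal_contribution(kind, k, m, p):
    return ceil_div(k, step(kind, m, p))


def excess(kind, k, m, p):
    return (k - 1) % step(kind, m, p)


# Isolated necklace: 9 occurrences of "aba" (period 2), consecutive ones overlap in one position.
necklace = "ab" * 9 + "a"
print(greedy_count(necklace, "aba"), nominal_contribution("necklace", 9, 3, 2))  # 5 5

# Isolated chunk: 8 occurrences of "ababa" (period 2 < 5/2) at distance 2.
chunk = "ab" * 7 + "ababa"
print(greedy_count(chunk, "ababa"), nominal_contribution("chunk", 8, 5, 2))      # 3 3
```

In the chunk example the excess is (8 - 1) mod 3 = 1, and in the necklace example it is (9 - 1) mod 2 = 0.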

We define the excess of a necklace of k occurrences to be $(k-1) \bmod 2$, and the excess of a chunk of k occurrences to be $(k-1) \bmod \lceil |\alpha|/p \rceil$. The excess describes the number of occurrences of $\alpha[1..p]$ which are covered by the necklace or chunk, but not covered by the maximal sequence of non-overlapping occurrences.

We group the chunks and necklaces into a collection of chains C by the following two rules:

1. A chunk with excess at least two is a chain by itself.
2. A maximal sequence of overlapping necklaces and chunks with excess zero or one is a chain.

For a chain $c \in C$ we define $\#_0(c)$ to be the number of chunks and necklaces with excess zero in the chain. We are now ready to state our main lemma enabling the efficient computation of the c-values. The lemma gives an alternative to the characterization in [3, Proposition 2].

Lemma 5 The maximum number of non-overlapping occurrences of α in S equals the sum of the nominal contributions of all necklaces and chunks minus $\sum_{c \in C} \lfloor \#_0(c)/2 \rfloor$.

Proof. We consider the maximum number of non-overlapping occurrences generated by the greedy algorithm described in Section 2. Let $a$ be a chunk or necklace of k occurrences, and let $z = \lceil |\alpha|/p \rceil$ if $a$ is a chunk, and $z = 2$ if $a$ is a necklace. From Lemma 3 only the first occurrence in $a$ can overlap with the occurrence immediately preceding $a$, and the overlap is at most $p - 1$ positions. If there is no immediately preceding occurrence that overlaps with $a$, or the immediately preceding occurrence is not reported by the greedy algorithm, then the contribution to the c-value of $a$ equals the nominal contribution. If the immediately preceding occurrence of $a$ was reported and overlaps with the first occurrence in $a$, then the $(2 + i \cdot z)$th occurrences in $a$ are reported for $i = 0, \ldots, \lceil (k-1)/z \rceil - 1$, i.e. the number of non-overlapping occurrences reported in $a$ equals the nominal contribution if and only if $\lceil k/z \rceil = \lceil (k-1)/z \rceil$, i.e. $a$ has nonzero excess. If the excess is zero, then the contribution of the chunk is one less than the nominal contribution. It remains to count how often this last case happens.

Consider a chain consisting of a single chunk with excess at least two. Then the contribution of the chain equals the nominal contribution and $\#_0(c) = 0$. None of the two possible ways to report occurrences in the chunk by the greedy algorithm includes the last occurrence of the chunk. This implies that the reporting within a chunk or necklace $a$ of excess zero is only influenced by the necklaces and chunks to the left of $a$ contained within the same chain as $a$.

Consider a chain c where all necklaces and chunks have excess zero or one. For all chunks and necklaces in the chain with excess one, the last occurrence in the chunk or necklace is reported if and only if the last occurrence in the immediately preceding chunk or necklace in the chain is reported. It follows that the last occurrence is reported for every second chunk or necklace in the chain with excess zero, including the first with excess zero. We conclude that the number of chunks and necklaces with excess zero in the chain c where the contribution to the c-value of α is one less than the nominal contribution is $\lfloor \#_0(c)/2 \rfloor$.

To bound the size of a minimal augmented suffix tree we need the following theorem of Fraenkel and Simpson, who proved that a string can at most contain a linear number of distinct squares [9]. The proof given below is a slight simplification of [9, Theorem 1].

Theorem 1 (Fraenkel and Simpson) The number of distinct squares occurring in a nonempty string S is less than $2|S|$.

Proof. We prove that any position i in S can at most be the rightmost occurrence of two distinct squares in S. Assume for the sake of contradiction that i is the rightmost occurrence of three squares αα, ββ and γγ with $a = |\alpha|$, $b = |\beta|$, $c = |\gamma|$ and $0 < a < b < c$. Without loss of generality we assume $S = \gamma\gamma$ and $i = 1$. Figure 6 shows the two cases to be considered. If $a \le c/2$, then αα is a prefix of γ. Therefore αα also occurs at position $c + 1$, which contradicts the assumption that position one is the rightmost occurrence of αα. If $a > c/2$, then we will show that αα occurs at position $p+1$, where p is the period of α.
By Lemma 3 the occurrences of α at positions $a+1$, $b+1$, and $c+1$ will have distance at least p (α is a prefix of β and γ, which occur respectively at positions $b+1$ and $c+1$), i.e. $a + p \le b \le c - p$. Since α has period p and occurs at positions one and $a + 1$, it follows that α occurs at positions $p+1$ and $a+p+1$ if $S[a-p+1..a+p]$ and $S[2a-p+1..2a+p]$ are substrings of α. We have $S[a-p+1..a+p] = \beta[a-p+1..a+p] = S[a+b-p+1..a+b+p] = \alpha[a+b-c-p+1..a+b-c+p]$ since $a+b-c-p+1 \ge a+a+p-c-p+1 > 1$ and $a+b-c+p \le a$, and $S[2a-p+1..2a+p] = \beta[2a-b-p+1..2a-b+p] = \alpha[2a-b-p+1..2a-b+p]$ since $2a-b-p+1 > c-b-p+1 \ge p-p+1 = 1$ and $2a-b+p \le 2a-(a+p)+p = a$. We conclude that αα occurs at position $p+1$,

which contradicts the assumption that the rightmost occurrence of αα is at position one. Since no square starts at position $|S|$ the theorem holds.

Figure 6: The two cases in the proof of Theorem 1: (a) $2|\alpha| \le |\gamma|$; (b) $2|\alpha| > |\gamma|$. Long dashed lines are occurrences of α and short dashed lines are occurrences of $\alpha[1..p]$.

Lemma 6 The minimal augmented suffix tree for a string S has at most $3|S|$ internal nodes.

Proof. Since the suffix tree for a string S has at most $|S|$ internal nodes we only need to show that there are at most $2|S|$ extra nodes in the minimal augmented suffix tree. If the number of extra nodes is limited by the number of squares in the string, the thesis follows from Theorem 1. Extra nodes are only located on edges when there is a change in the c-value along that edge. The c-value changes due to two reasons. First, when two substrings become equal as they get shorter and second, when two occurrences of a substring no longer overlap as they get shorter. The first reason only gives rise to changed c-values in nodes already in the suffix tree. If α occurs at position i and $i + |\alpha| - 1$, both occurrences cannot be counted in the c-value of α. However, the occurrences of $\alpha' = \alpha[1..|\alpha| - 1]$, at positions i and $i + |\alpha| - 1$, do not overlap and the c-value may change. This only happens when there is a square $\alpha'\alpha'$ in S.

4 Level-Linked (2,4)-Trees

In this section we consider how to maintain a set of sorted lists of elements as a collection of level-linked (2,4)-trees where the elements are stored at the leaves

in sorted order from left to right, and each element can have an associated real valued weight. For a detailed treatment of level-linked (2,4)-trees see [12] and [17, Section III.5]. The operations we consider supported are:

NewTree(e, w): Creates a new tree T containing the element e with associated weight w.

Search(p, e): Searches for the element e starting the search at the leaf of a tree T that p points to. Returns a reference to the leaf in T containing e or the immediate predecessor or successor of e.

Insert(p, e, w): Creates a new leaf containing the element e with associated weight w and inserts the new leaf immediately next to the leaf pointed to by p in a tree T, provided that the sorted order is maintained.

Delete(p): Deletes the leaf and element that p is a pointer to in a tree T.

Join($T_1$, $T_2$): Concatenates two trees $T_1$ and $T_2$ and returns a reference to the resulting tree. It is required that all elements in $T_1$ are smaller than the elements in $T_2$ w.r.t. the total order.

Split(T, e): Splits the tree T into two trees $T_1$ and $T_2$, such that e is larger than all elements in $T_1$ and smaller than or equal to all elements in $T_2$. Returns references to the two trees $T_1$ and $T_2$.

Weight(T): Returns the sum of the weights of the elements in the tree T.

Hoffmann et al. [11, Section 3] considered the case where elements are unweighted, and showed how level-linked (2,4)-trees support all the above operations, except Weight, within the time bounds stated in Theorem 2 below.

Theorem 2 (Hoffmann et al.) Level-linked (2,4)-trees support the operations NewTree, Insert and Delete in amortized constant time, Search in time O(log d) where d is the number of elements in T between e and p, and Join and Split in amortized time $O(\log \min\{|T_1|, |T_2|\})$.

To allow each element to have an associated weight we extend the construction from [11, Section 3] such that we for all nodes v in a tree store the sum of the weights of the leaves in the subtree $T_v$, except for the nodes on the paths to the leftmost and rightmost leaves, in the following denoted extreme nodes.
These sums are straightforward to maintain while rebalancing a (2,4)-tree under node splittings and fusions, since the sum at a node is the sum of the weights at the children of the node. For each tree we also store the total weight of the tree.
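The O(log d) bound for Search in Theorem 2 comes from starting at a known leaf (a finger) instead of at the root. The sketch below imitates that behaviour on a plain sorted Python list by galloping outward from the finger and then bisecting. It is an analogy under assumptions, not a level-linked (2,4)-tree: the function name and the list representation are hypothetical, and only the number of comparisons, not the update cost, matches the tree.

```python
from bisect import bisect_left


def finger_search(keys, finger, e):
    """Index of the first key >= e in the sorted list keys, searching outward from index finger.

    Uses O(log d) comparisons, where d is the distance between finger and the answer.
    """
    n = len(keys)
    if n == 0:
        return 0
    if keys[finger] < e:
        # Gallop to the right; invariant: every key left of lo is < e.
        lo, width = finger + 1, 1
        while lo + width <= n and keys[lo + width - 1] < e:
            lo += width
            width *= 2
        return bisect_left(keys, e, lo, min(lo + width, n))
    # Gallop to the left; invariant: keys[hi] >= e.
    hi, width = finger, 1
    while hi - width >= 0 and keys[hi - width] >= e:
        hi -= width
        width *= 2
    return bisect_left(keys, e, max(hi - width, 0), hi)


keys = [2, 3, 5, 7, 11, 13, 17, 19]
print(finger_search(keys, 3, 8))   # 4, the position of 11, found one step right of the finger
print(finger_search(keys, 7, 0))   # 0, found by doubling steps to the left
```

Merging a short sorted list into a long one by repeated finger searches, each starting where the previous one ended, is what keeps the number of comparisons close to the Hwang and Lin bound quoted in the introduction.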

Theorem 3 Weighted level-linked (2,4)-trees support the operations NewTree and Weight in amortized constant time, Insert and Delete in amortized time $O(\log |T|)$, Search in time O(log d) where d is the number of elements in T between e and p, and Join and Split in amortized time $O(\log \min\{|T_1|, |T_2|\})$.

Proof. The operations NewTree and Weight take amortized constant time, since the total weight of a tree is explicitly stored, and Search takes time O(log d) since it is unaffected by the presence of weights. The presence of weights increases the time for Insert and Delete to $O(\log |T|)$, since in the worst case all weights stored on the path from the root to the new/deleted leaf must be recomputed for every insertion and deletion.

For the operations Join and Split we need to argue that the stored weights can be updated within the claimed time bounds. The Join operation proceeds by first linking the root of the tree of minimal height $h = O(\log \min\{|T_1|, |T_2|\})$ as a child of an extreme node with height $h + 1$, followed by a sequence of node splittings. The linking causes $2h - 1$ extreme nodes not to be extreme nodes any longer, forcing the weights to be computed for these nodes. This can be done in time O(h). Since for each node splitting the weights can be updated in constant time and the total weight of T is the sum of the weights of $T_1$ and $T_2$, the amortized time bound for Join follows.

A Split consists of a sequence of node splittings, unlinking the subtree rooted at a node of height $h = O(\log \min\{|T_1|, |T_2|\})$, and a sequence of node fusions. The weights at the involved nodes can similarly to Join be updated during the sequence of node splittings and fusions. The unlinking only creates new extreme nodes. Finally the weight of the resulting tree of minimal height can be computed by traversing the extreme nodes of this tree in time O(h). The weight of the larger tree can be computed as the difference between the weight of T and the weight of the tree of minimal height.
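The bookkeeping in the proof above can be mimicked with a deliberately simple stand-in. The sketch below keeps the keys of a tree in a sorted Python list and stores only the total weight explicitly, so Weight is a constant-time lookup, Join adds two stored totals, and Split computes the total of the smaller part directly and obtains the larger part by subtraction. The class and function names are hypothetical, and list slicing makes Join and Split linear time here, so this shows the semantics of the operations and not the time bounds of Theorem 3.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class WTree:
    keys: list      # elements in sorted order (the leaves, left to right)
    weights: list   # weights[i] is the weight of keys[i]
    total: float    # explicitly stored total weight


def new_tree(e, w):
    return WTree([e], [w], w)


def weight(t):
    return t.total  # no traversal needed


def join(t1, t2):
    if t1.keys and t2.keys and not t1.keys[-1] < t2.keys[0]:
        raise ValueError("all elements of t1 must be smaller than those of t2")
    return WTree(t1.keys + t2.keys, t1.weights + t2.weights, t1.total + t2.total)


def split(t, e):
    """Return (t1, t2): e is larger than all keys in t1 and <= all keys in t2."""
    i = bisect_left(t.keys, e)
    if i <= len(t.keys) - i:
        left_total = sum(t.weights[:i])       # sum over the smaller part only
        right_total = t.total - left_total    # larger part by subtraction
    else:
        right_total = sum(t.weights[i:])
        left_total = t.total - right_total
    return (WTree(t.keys[:i], t.weights[:i], left_total),
            WTree(t.keys[i:], t.weights[i:], right_total))


t = join(join(new_tree(1, 1), new_tree(4, 0)), new_tree(9, 1))
left, right = split(t, 4)
print(weight(t), left.keys, weight(left), right.keys, weight(right))  # 2 [1] 1 [4, 9] 1
```

In Section 5 the weights are 0 or 1, so the stored total of a middle-level tree directly counts the chunks and necklaces with excess zero in a chain.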
5 The Algorithm

In this section we describe the algorithm for constructing the minimal augmented suffix tree for a string S of length n. The analysis is presented in Section 6 and shows that the algorithm runs in time O(n log n).

5.1 Algorithm idea

The algorithm starts by constructing the suffix tree, ST(S), for S. The suffix tree is then augmented with extra nodes and c-values for all nodes to get the minimal augmented suffix tree, MAST(S), for S. The augmentation of ST(S) to MAST(S) starts at the leaves and the tree is processed in a bottom-up fashion. At each node v encountered on the way up the tree the c-value for

the path-label L(v) is added to the tree, and at each edge new nodes and their c-values are added if there is a change in the c-value along the edge. To be able to compute the c-values efficiently and decide if new nodes should be added along edges, the indices in the leaf-list of v, LL(v), are stored in a data structure that keeps track of necklaces, chunks, and chains, as defined in Section 3.

5.2 Data structure

Let α be a substring of S. The data structure D(α) is a search tree for the indices of the occurrences of α in S. The leaves in D(α) are the leaves in LL(v), where v is the node in ST(S) such that the locus of α is the edge directly above v or the node v. The search tree, D(α), will be organized into three levels to keep track of chains, chunks, and necklaces. The top level in the search tree stores chains, the middle level chunks and necklaces, and the bottom level occurrences. See Figure 7.

Figure 7: The data structure is a 3-level search tree. Top level: chains (no weight). Middle level: chunks and necklaces (weight = $\#_0(c)$). Bottom level: occurrences (weight = number of leaves).

Top level: Unweighted (2,4)-tree (cf. Theorem 2) with the chains as leaves. The key of a chain is the leftmost index in the chain.

Middle level: One weighted (2,4)-tree (cf. Theorem 3) for each chain, with the chunks and necklaces as leaves. The leftmost indices in the chunks and necklaces are the keys. The weight of a leaf is 1 if the excess of the chunk or necklace is zero, otherwise the weight is 0. The total weight of a tree on the middle level is $\#_0(c)$, where c denotes the chain represented by the tree.

Bottom level: One weighted (2,4)-tree (cf. Theorem 3) for each chunk and necklace, with the occurrences in the chunk or necklace as the leaves.

The weight of a leaf is 1. The total weight of a tree is the number of occurrences in the chunk or the necklace.

Together with each of the 3-level search trees, D(α), some variables are stored. NCS(α) stores the sum of the nominal contribution as defined in Section 3 for all chunks and necklaces, ZS(α) stores the sum $\sum_{c \in C} \lfloor \#_0(c)/2 \rfloor$, where C is the set of chains. By Lemma 5 the maximum number of non-overlapping occurrences of α is NCS(α) − ZS(α). We also store the total number of indices in D(α) and a list of all chains containing at least one chunk, denoted CHAINLIST(α). The list will keep pointers to the roots of the trees for the chains, and pointers in the opposite direction are kept in the tree. CHAINLIST(α) will be useful when all chunks are processed at the same time and we do not want to spend time on searching for chunks in chains with no chunks. Finally we store p(α), which is the smallest difference between the indices of two consecutive occurrences in D(α). Note that, by Corollary 2, p(α) is the period of α if there is at least one chunk.

For convenience, we will sometimes refer to the tree for a chain, chunk, or necklace just as the chain, chunk, or necklace. For the top level tree in D(α) we will use level-linked (2,4)-trees, according to Theorem 2, and for the middle and bottom level trees in D(α) we will use weighted level-linked (2,4)-trees, according to Theorem 3. In these trees predecessor and successor queries are supported in constant time. We denote by l(e) and r(e) the indices to the left and right of index e. To be able to check fast if there are overlaps between two consecutive trees on the middle and bottom levels we store the first and last index in each tree in the root of the tree. This can easily be kept updated when the trees are joined and split.

We will now describe how the suffix tree is processed and how the data structures are maintained during this process.

5.3 Processing events

We want to process edges in the tree bottom-up, i.e.
for decreasing length of α, so that new nodes are inserted if the c-value changes along the edge, the c-values for nodes are added to the tree, and the data structure is kept updated. The following events can cause changes in the c-value and the chain, chunk, and necklace structure.

1. Excess change: When $|\alpha|$ becomes $i \cdot p(\alpha)$, for i = 2, 3, 4, ..., the excess and nominal contribution of chunks change and we have to update the data structure and possibly add a node to the suffix tree.

2. Chunks become necklaces: When $|\alpha|$ decreases and becomes $2p(\alpha)$ a chunk degenerates into a necklace. At this point we join all overlapping

chunks and necklaces into one necklace and possibly add a node to the suffix tree.

3. Necklaces and chains break-up: When $|\alpha|$ decreases two consecutive occurrences at some point no longer overlap. The result is that a necklace or a chain may split, and we have to update the necklace and chain structure and possibly add a node to the suffix tree.

4. Merging at internal nodes: At internal nodes in the tree the data structures for the subtrees below the node are merged into one data structure and the c-value for the node is added to the tree.

To keep track of the events we use an event queue, denoted EQ, that is a common priority queue of events for the whole suffix tree. The priority of an event in EQ is equal to the length of the string α when the event has to be processed. Events of type 1 and 2 store a pointer to any leaf in D(α). Events of type 3, i.e. that two consecutive overlapping occurrences with index $e_1$ and $e_2$, where $e_1 < e_2$, cease to overlap, store a pointer to the leaf $e_1$ in the suffix tree. For the leaf $e_1$ in the suffix tree a pointer to the event in EQ is also stored. Events of type 4 store a pointer to the internal node in the suffix tree involved in the event. When the suffix tree is constructed all events of type 4 are inserted into EQ. For a node v in ST(S) the event has priority $|L(v)|$ and stores a pointer to v. The pointers are used to be able to decide which data structure to update.

The priority queue EQ is implemented as a table with entries EQ[1] ... EQ[$|S|$]. All events with priority x are stored in a linked list in entry EQ[x]. Since the priorities of the events considered are monotonically decreasing, it is sufficient to consider the entries of EQ in a single scan starting at EQ[$|S|$]. The events are processed in decreasing order of the priority and for events with the same priority they are processed in the order as above. Events of the same type and with the same priority are processed in arbitrary order.
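The table-based event queue described above can be sketched as a bucket queue scanned once from the highest priority downwards. The class below is a minimal illustration, assuming one Python list per event type in every table entry (the text only says that each entry holds a linked list) and using hypothetical method names and placeholder payloads. Events at the same priority come out in type order 1 to 4, and events of the same type and priority come out in arbitrary order.

```python
class EventQueue:
    """Monotone priority queue: priorities are pattern lengths, processed in decreasing order."""

    def __init__(self, max_priority):
        # buckets[x][t - 1] holds the payloads of type-t events with priority x
        self.buckets = [[[] for _ in range(4)] for _ in range(max_priority + 1)]
        self.current = max_priority  # scan position, only ever moves downwards

    def insert(self, priority, kind, payload):
        if not 1 <= priority <= self.current:
            raise ValueError("priority must lie between 1 and the current scan position")
        if kind not in (1, 2, 3, 4):
            raise ValueError("event type must be 1, 2, 3 or 4")
        self.buckets[priority][kind - 1].append(payload)

    def pop(self):
        """Return (priority, kind, payload) of the next event, or None when empty."""
        while self.current >= 1:
            for kind, events in enumerate(self.buckets[self.current], start=1):
                if events:
                    return self.current, kind, events.pop()
            self.current -= 1
        return None


eq = EventQueue(5)
eq.insert(3, 4, "merge at node v")
eq.insert(5, 2, "chunks become necklaces")
eq.insert(3, 1, "excess change")
print(eq.pop())  # (5, 2, 'chunks become necklaces')
print(eq.pop())  # (3, 1, 'excess change')
print(eq.pop())  # (3, 4, 'merge at node v')
```

Because the scan position never increases, the total time spent skipping empty entries is proportional to the table size, and every insert and pop costs constant time beyond that.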
In the following we only look at one edge at a time when events of type 1, 2, and 3 are taken care of.

Excess change

The excess changes for all chunks at the same time, namely when $|\alpha| = i \cdot p(\alpha)$ for $i = 2, 3, 4, \ldots$ To update the data structure when the excess changes, all chunks are first removed from D(α). Then the new excess and nominal contribution are computed, and finally the tree is reconstructed as the removed chunks are reinserted into D(α). This is all done for each chain in CHAINLIST(α) separately. In detail the following is done.

1. For each chain in CHAINLIST(α) find the sorted list of chunks in the chain. Note that at least every other element in the chain is a chunk, since two necklaces do not overlap.
2. For each chunk remove it from its chain by splitting the tree for the chain. At most two Split-operations per chunk are needed to temporarily make each chunk a chain by itself. The chain which the chunk is part of is replaced by the at most three new (overlapping) chains in the top-level tree, unless the chunk already was a chain by itself. Keep CHAINLIST(α) updated by deleting the split chain from the list and inserting the resulting chains, if they contain at least one chunk. Keep ZS(α) updated by first subtracting the old contribution of the chain to ZS(α) and then adding the contributions of the resulting chains.
3. Recompute the excess and nominal contribution for all chunks and update NCS(α) accordingly by adding the change in nominal contribution to NCS(α) for each chunk. Set the weight of chunks with excess 0 to 1 and set the weight of chunks with excess at least 1 to 0. In the new tree the chain structure may have changed. Chunks for which the excess increases to two or more will become separate chains, while chunks where the excess becomes less than two may join two or three chains into a single chain.
4. Now we want to reconstruct the tree by reinserting the chunks into the tree. This is done separately for each chain in the original CHAINLIST(α). Let $c_1, \ldots, c_m$ denote the chunks from left to right in a chain, let $T_i$ denote the tree for the chain containing chunk $c_i$, and do the following. If $c_i$ has excess 2 or more then nothing is done since $c_i$ is a chain by itself. If the excess is 0 or 1 we check if the chunk overlaps the chains to the left and right, denoted $T_l$ and $T_r$, and in that case if the chains shall be joined.
   (a) If $c_i$ to the left overlaps with a necklace or a chunk with excess 0 or 1, then remove $T_l$ from the top-level tree and join $T_i$ and $T_l$. Denote the resulting tree $T_i$. Delete from CHAINLIST(α) the chain containing $c_i$ and also the chain $T_l$ if it is in the list.
   Insert the new chain into CHAINLIST(α). Update ZS(α) accordingly.
   (b) If $T_r$ to the right overlaps with a necklace then join $T_i$ with $T_r$ after removing $T_r$ from the top-level tree. Denote the resulting tree $T_i$. Update ZS(α) accordingly. If $T_r = T_{i+1}$, i.e. it is also a chunk, then the overlap between $T_i$ and $T_{i+1}$ will be checked as described above when continuing with $T_{i+1}$.

Continue until all chunks are checked and repeat the same procedure for each chain in the original CHAINLIST(α). If $|\alpha| = 2p(\alpha)$ then insert an event of type 2 with priority $2p(\alpha)$ into EQ, with a pointer to any leaf in D(α). If $|\alpha| = i \cdot p(\alpha) > 2p(\alpha)$, then insert an event of type 1 with priority $(i-1) \cdot p(\alpha)$ into EQ, with a pointer to any leaf in D(α).

Chunks become necklaces

When $|\alpha|$ decreases to $2p(\alpha)$ all chunks become necklaces at the same time. At this point all chunks and necklaces that overlap must be joined into one necklace. Note that all chunks have excess 0 or 1 when $|\alpha| = 2p(\alpha)$, and since we first recompute the excess, all overlapping chunks and necklaces are in the same chain. Hence, what we have to do is to join all chunks and necklaces from left to right in each chain. The following is done.

1. For each chain in CHAINLIST(α) join all chunks and necklaces from left to right. Note that there are never two consecutive necklaces in the same chain, i.e. there are at most two Join-operations per chunk. Update NCS(α) and ZS(α) by adding the difference in nominal contribution to NCS(α), and by subtracting $\#_0(c)$ from ZS(α) for each changed chain and adding it for the new chains.
2. Set CHAINLIST(α) to be empty, since there are no chunks left.

Necklaces and chains break-up

When two consecutive occurrences of α with indices $e_1$ and $e_2$ cease to overlap, this may cause a necklace or a chain to break up into two necklaces or chains. If $e_1$ and $e_2$ belong to the same chain then the chain breaks up into two chains. If $e_1$ and $e_2$ belong to the same necklace then both the necklace and the chain split between $e_1$ and $e_2$. The following is done to update the data structure.

1. Using the pointers stored at the leaves of ST(S), check if the indices $e_1$ and $e_2$ are already in different chains. If they are then nothing is done.
2. If $e_1$ and $e_2$ are in the same chain c, represented in the tree by $T_c$, then c shall be split.
   (a) If both indices belong to the same necklace then the necklace shall be split. Remove the tree for the necklace from $T_c$ by performing two Split-operations. The result is three trees, $T_{c1}$, $T_n$, and $T_{c2}$.
   Split the tree for the necklace into two parts by performing
   $\mathrm{Split}(T_n, e_2) = T_{n1}, T_{n2}$. Insert $T_{n1}$ in $T_{c1}$ and $T_{n2}$ in $T_{c2}$ using two Join-operations. Finally, insert the new chain into the top-level tree and update NCS(α) and ZS(α).
   (b) If the two indices belong to two different subtrees at the bottom level then only the chain has to be split. Insert the new chain into the top-level tree and update ZS(α).
   Update CHAINLIST(α) if necessary, by first removing c from the list (if it is in it). Then for the two new chains, check if they contain at least one chunk and in that case insert the chains into CHAINLIST(α).

Merging at internal nodes

Let α be a substring such that the locus of α is a node v in the suffix tree. The leaf-list for v, LL(v), is the union of the leaf-lists for the subtrees below v. Hence, at the nodes in the suffix tree the data structures for the subtrees should be merged into one. We assume that the edges below v have been processed for α as described above for events 1, 2, and 3. Let $T_1, \ldots, T_t$ be the subtrees below v in the suffix tree. We never merge more than two data structures at a time. If there are more than two subtrees, the merging is done in the following order: $T = \mathrm{Merge}(T, T_i)$, for $i = 2, \ldots, t$, where $T = T_1$ to start with. This can also be viewed as if the suffix tree is made binary by replacing all nodes of degree larger than 2 by a binary tree with edges with empty labels.

We will now describe how to merge the data structures for two subtrees. The merging is done by inserting all indices from the smaller of the two leaf-lists into the data structure for the larger one. Let T denote the 3-level search tree to insert new indices in and denote by $e_1, \ldots, e_m$ the indices to insert, where $e_i < e_{i+1}$. The insertion is done by first splitting the tree T at all positions $e_i$ for increasing $i = 1, \ldots, m$. The tree is then reconstructed, from left to right, at the same time as the new indices are inserted. More exactly the following is done.

1. For all indices $e_i$, $i = 1, \ldots, m$, split T (at all levels necessary) by performing $\mathrm{Split}(e_i)$.
   Denote the resulting 3-level trees $T_0, \ldots, T_m$, where $e_i$ is larger than all indices in $T_j$ for $j < i$, and smaller than all other indices. Note that $T_i$ is an empty tree if there are no indices in T between $e_i$ and $e_{i+1}$. When a middle-level tree, for a chain c, is split, update ZS(α) by subtracting $\#_0(c)$ and adding the new $\#_0$-values for the two new trees. Update CHAINLIST(α) (from the tree T) if necessary, by first removing c from the list (if it is in it). Then for the two new chains, check if they contain at least one chunk and in that case insert the chains into
   CHAINLIST(α). When a bottom-level tree is split, update NCS(α) by subtracting the nominal contribution of the split chunk or necklace and adding the two new nominal contributions.
2. The tree is reconstructed from left to right by inserting the new indices in increasing order. Let $T'$ be the tree reconstructed for all indices smaller than $e_i$ in the two trees to be merged, i.e. $T'$ is the union of $e_1, \ldots, e_{i-1}$ and $T_0, \ldots, T_{i-1}$. Initially $T'$ equals $T_0$. To proceed, we want to insert $e_i$ and all indices between $e_i$ and $e_{i+1}$ into $T'$. (If $i = m$ we insert all remaining indices.) This is done as follows. Check if the occurrence with index $e_i$ overlaps the occurrence for the rightmost index in the tree $T'$ reconstructed so far, and in that case check if this index is part of a necklace or a chunk.
   (a) If the occurrence with index $e_i$ does not overlap any occurrence to the left then create a new chain with one necklace with $e_i$ as the only index. Insert the chain into $T'$.
   (b) If the occurrence with index $e_i$ overlaps an occurrence with index $e_l$ to the left and the overlap is less than $|\alpha|/2$, then: if $e_l$ is part of a necklace, insert $e_i$ into this necklace using one Join-operation. If $e_l$ is part of a chunk, $c_l$, then create a new chain with one necklace with $e_i$ as the only index. If the excess of $c_l$ is 2 or more then insert the chain in $T'$, otherwise join the two chains.
   (c) If the occurrence with index $e_i$ overlaps an occurrence with index $e_l$ to the left and the overlap is more than $|\alpha|/2$, then: if $e_l$ is part of a necklace, remove $e_l$ from the necklace and create a new chunk with $e_l$ and $e_i$ as the two indices. Create a new chain for the chunk with the two indices $e_l$ and $e_i$. Since the excess of a chunk with 2 indices is 1, join the two chains. If $e_l$ is part of a chunk then insert $e_i$ into the chunk. If the excess increases from 1 to 2 then let the chunk be a chain by itself. If the excess changes to 0 after being 2 or more then join the chunk with the chain to the left if possible.
   The tree $T'$ is at this point the tree reconstructed for all indices up to index $e_i$.
   The next step is to include also the tree $T_i$. Unless $T_i$ is empty the following is done. Check if the occurrence with index $e_i$ overlaps the leftmost occurrence in $T_i$, with index $e_r$. If it does not we just join the trees $T'$ and $T_i$. If it does overlap the following is done.
   (a) The overlap between $e_i$ and $e_r$ is less than $|\alpha|/2$. If both $e_i$ and $e_r$ are part of necklaces then join the necklaces and the chains they are part of. If at least one of $e_i$ and $e_r$ is part of a chunk, then join
   the chains they are part of unless the excess of the chunk is 2 or more.
   (b) The overlap between $e_i$ and $e_r$ is more than $|\alpha|/2$. If both $e_i$ and $e_r$ are part of necklaces then remove $e_i$ and $e_r$ from their trees and create a new chunk with the two indices. Since the excess of the new chunk is 1, insert the chunk in one of the chains and join the chains. If one of $e_i$ and $e_r$ is part of a chunk and the other is part of a necklace, then remove the index in the necklace and insert it into the chunk. Let $c_1$ and $c_2$ be the two chains where the two indices were located originally. If the excess of the chunk becomes 0 or 1 then join the chains $c_1$ and $c_2$, otherwise let the chunk be a chain by itself by splitting the chain for the chunk. If both $e_i$ and $e_r$ are part of chunks then join them. Check the excess of the joined chunk. If the excess is 2 or more, then make the chunk a chain of its own.

Keep CHAINLIST(α) updated all the time during the above procedure. Each time a chain is split, joined, or removed, delete the chain from CHAINLIST(α) (if it is in the list). Each time a new chain is created, e.g. as a result of a split or join, insert the chain into CHAINLIST(α) if it contains at least one chunk. Update the global variables ZS(α) and NCS(α), respectively, whenever two trees, at the middle or bottom level, respectively, are joined or split.

Every time, during the above described procedure, two overlapping occurrences with indices $e_i$ and $e_j$, where $e_i < e_j$, from different subtrees are encountered, the event $(e_i, e_j)$ with priority $e_j - e_i$ is inserted into the event queue EQ and the previous event, if any, with a pointer to $e_i$ is removed from EQ. Update $p(\alpha)$ to $e_j - e_i$ if this is smaller than the current $p(\alpha)$ value. If $|\alpha| > 2p(\alpha)$ then insert an event of type 1 with priority $\lfloor |\alpha| / p(\alpha) \rfloor \cdot p(\alpha)$ into EQ, with a pointer to any leaf in D(α).

6 Analysis

Theorem 4 The minimal augmented suffix tree, MAST(S), for a string S of length n can be constructed in time $O(n \log n)$ and space $O(n)$.

Proof.
The algorithm in Section 5 starts by constructing the suffix tree, ST(S), for the input string S in time $O(n \log n)$. The leaf-lists of all n leaves in ST(S) are created in constant time for each leaf, i.e. in total time $O(n)$. The proof uses an amortization argument, allowing each edge to be processed in amortized constant time, and each binary merge at a node (in the binary version) of ST(S) of two leaf-lists of sizes $n_1$ and $n_2$, with $n_1 \ge n_2$, in amortized
time $O(n_2 \log \frac{n_1+n_2}{n_2})$. From Lemma 1 it follows that the total time for processing the internal nodes and edges of ST(S) is $O(n \log n)$.

In the following we will first give the time for processing the different events. We will then define a potential for each data structure maintained along an edge or node of ST(S). We will argue that the processing of events of type 1, 2, and 3 releases sufficient potential to pay for processing the event, and that a binary merge in an event of type 4 takes time $O(n_2 \log \frac{n_1+n_2}{n_2})$ and increases the potential by at most the same amount.

When events of type 1 are processed the first step is to find the sorted list of chunks in each chain. For a leaf-list containing m chunks, this is done in time $O(m)$ since only chains containing at least one chunk are examined and since at least every other element in a chain is a chunk. The next step is to make each chunk a chain by itself by performing at most two split operations per chunk, hence at most 2m split operations. The splitting is done from left to right in each chain. Let $|\mathrm{CHAINLIST}(\alpha)| = a$ and let $c_i$, for $i = 1, \ldots, a$, be the size of chain i in the list. Let $s_{ij}$ be the size of the left tree resulting from the jth splitting of chain i in CHAINLIST(α). Then $\sum_{i=1}^{a} c_i \le |LL(v)|$ and $\sum_{j=1}^{2b_i} s_{ij} \le c_i$, where $b_i$ is the number of chunks in chain i. Hence, $\sum_{i=1}^{a} \sum_{j=1}^{2b_i} s_{ij} \le |LL(v)|$. According to Theorems 2 and 3 a split operation takes time $O(\log \min\{|T'|, |T''|\})$, where $T'$ and $T''$ are the resulting trees. The total time of all splittings is at most

$$O\Big(\sum_{i=1}^{a} \sum_{j=1}^{2b_i} \log s_{ij}\Big) \subseteq O\Big(m \log \frac{|LL(v)|}{m}\Big).$$

Finally the chains for the chunks are joined with other chains. Similarly to the split operations, the joining takes time $O(m \log \frac{|LL(v)|}{m})$. All other operations, e.g. to recompute the excess and update the global variables, take time $O(m)$. Hence, the total time to process an event of type 1 is $O(m \log \frac{|LL(v)|}{m})$.

When an event of type 2 is processed all chunks and necklaces in the same chain are joined into one chain by joining the trees in each chain containing at least one chunk.
Since the joining is done from left to right in each chain, the total time for all join operations in an event of type 2, as in an event of type 1, is $O(m \log \frac{|LL(v)|}{m})$, which is also the total time for processing an event of type 2.

When an event of type 3 is processed, it is first checked if the two occurrences that cease to overlap belong to the same chain. Let $e_1$ and $e_2$ be the indices of the two occurrences and let c be the chain which $e_1$ belongs to. Denote by $|c|$ the number of occurrences of α in c. Since we have a pointer to $e_1$, walking up the tree for c to decide if $e_2$ also belongs to c takes time $O(\log |c|)$. At the same time it can also be checked if the two indices belong to the same necklace. If the two indices belong to the same chain and/or necklace, the chain and/or necklace is split by using at most two split operations on each level, i.e. at most four split operations. According to Theorem 3, a split operation takes time $O(\log |c|)$. According to Theorem 2, inserting the new chain in the top-level tree takes constant time. Hence, the total time to process an event of type 3 is $O(\log |c|)$.
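The step from a sum of logarithms to $m \log \frac{|LL(v)|}{m}$ in the analysis of events of type 1 and 2 rests on the concavity of the logarithm: if `count` positive sizes sum to at most `total`, their log-sum is at most `count * log(total / count)`. The following small check illustrates this numerically. The function names and the sample sizes are hypothetical and only serve the illustration.

```python
import math


def split_cost(sizes):
    """Sum of log2(s) over the sizes s of the left tree produced by each split."""
    return sum(math.log2(s) for s in sizes)


def concavity_bound(total, count):
    """Upper bound count * log2(total / count) for `count` positive sizes
    whose sum is at most `total` (concavity of the logarithm)."""
    return count * math.log2(total / count)


sizes = [1, 1, 2, 5, 7, 16]  # hypothetical left-tree sizes, sum = 32
cost = split_cost(sizes)
bound = concavity_bound(sum(sizes), len(sizes))
print(f"cost = {cost:.3f}, bound = {bound:.3f}")
```

The bound is tight when all sizes are equal, which is the worst case for the splitting cost.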

For each node v in ST(S) an event of type 4 is processed, which consists of a sequence of binary mergings as described in Section 5, by viewing ST(S) as a binary tree. A binary merging is performed by inserting the indices from the smaller of the two subtrees into the data structure for the larger subtree. Denote by $e_1, \ldots, e_{n_1}$ the $n_1$ indices in the smaller subtree. Denote by D(α) the data structure for the larger subtree and let $n_2$ be the number of indices in D(α). When $e_1, \ldots, e_{n_1}$ are inserted into D(α) the tree T for D(α) is split at $e_1, \ldots, e_{n_1}$ into $n_1 + 1$ smaller trees $T_0, \ldots, T_{n_1}$. Let $s_0$ be the number of indices in D(α) smaller than $e_1$, let $s_i$, for $i = 1, \ldots, n_1 - 1$, be the number of indices larger than $e_i$ and smaller than $e_{i+1}$ in D(α), and let $s_{n_1}$ be the number of indices in D(α) larger than $e_{n_1}$. According to Theorems 2 and 3 a split takes time $O(\log \min\{|T'|, |T''|\})$, where $T'$ and $T''$ are the two resulting trees. It follows that the total time for the splitting at $e_1, \ldots, e_{n_1}$ is

$$\sum_{i=0}^{n_1} O(\log s_i) = \sum_{i=0}^{n_1} O\Big(\log \frac{n_1+n_2}{n_1}\Big) = O\Big(n_1 \log \frac{n_1+n_2}{n_1}\Big).$$

After the splitting the trees and the indices $e_1, \ldots, e_{n_1}$ are joined. The time bound for joining is proportional to the time bound for splitting. All other operations, like updating NCS(α) and ZS(α), take time $O(n_1)$. Since both splitting and joining take time $O(n_1 \log \frac{n_1+n_2}{n_1})$, it follows from Lemma 1 that the total time for processing all events of type 4 is $O(n \log n)$.

We now combine the above time bounds for the different events into an amortization argument, showing that the running time of the algorithm is dominated by the time for all events of type 4, i.e. the running time of the algorithm is $O(n \log n)$.

Let v be a node in the suffix tree and let α be a string with locus v or locus on the edge immediately above v. The data structure D(α) has a potential $\Phi(D(\alpha))$. Let C be the set of chains stored in D(α). For a chain c, let $|c|$ denote the number of occurrences of α in c.
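As an aside on the type 4 bound just derived, the following sketch sums the per-merge cost $n_1 \log \frac{n_1+n_2}{n_1}$ (with $n_1 \le n_2$) over all internal nodes of a binary tree and compares it with $n \log n$. It assumes that Lemma 1 is the usual smaller-into-larger summation bound; the nested-tuple tree encoding and the helper names `balanced` and `caterpillar` are hypothetical and chosen only for the illustration.

```python
import math


def merge_cost(n1, n2):
    """Cost model small * log2((small + large) / small) for one binary merge."""
    small, large = min(n1, n2), max(n1, n2)
    return small * math.log2((small + large) / small)


def total_merge_cost(tree):
    """Return (number of leaves, summed merge cost) for a nested-tuple binary tree.

    A leaf is any non-tuple value; an internal node is a pair (left, right).
    """
    if not isinstance(tree, tuple):
        return 1, 0.0
    left_size, left_cost = total_merge_cost(tree[0])
    right_size, right_cost = total_merge_cost(tree[1])
    cost = left_cost + right_cost + merge_cost(left_size, right_size)
    return left_size + right_size, cost


def balanced(n):
    """Perfectly balanced shape: every merge joins two (almost) equal halves."""
    if n == 1:
        return "leaf"
    return (balanced(n // 2), balanced(n - n // 2))


def caterpillar(n):
    """Maximally unbalanced shape: every merge adds a single leaf."""
    tree = "leaf"
    for _ in range(n - 1):
        tree = (tree, "leaf")
    return tree


for shape in (balanced, caterpillar):
    size, cost = total_merge_cost(shape(64))
    print(shape.__name__, size, round(cost, 1), "<=", 64 * math.log2(64))
```

For both extreme shapes the summed cost stays below $n \log_2 n$, in line with the inductive argument that $f(n) = n \log n$ satisfies $f(n_1) + f(n_2) + n_1 \log \frac{n_1+n_2}{n_1} \le f(n_1 + n_2)$.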
We define the potential of D(α) by the sum

$$\Phi(D(\alpha)) = \Phi_1(\alpha) + \Phi_2(\alpha) + \sum_{c \in C} \Phi_3(c),$$

where the rôle of $\Phi_1$, $\Phi_2$, and $\Phi_3$ is to account for the potential required to be able to process events of type 1, 2, and 3 respectively. Let k denote the number of chunks in D(α) and let g denote the number of green chunks (defined below) in D(α). The three potentials $\Phi_1$, $\Phi_2$, and $\Phi_3$ are defined by

$$\Phi_1(\alpha) = 7g \log \frac{|v| e}{g}, \qquad \Phi_2(\alpha) = k \log \frac{|v| e}{k}, \qquad \Phi_3(c) = 2|c| - \log |c| - 2,$$

with the exceptions that $\Phi_1(\alpha) = 0$ if $g = 0$, and $\Phi_2(\alpha) = 0$ if $k = 0$. Note that the potential of a leaf of ST(S) is zero and that $\Phi_1$ and $\Phi_2$ are
monotonically increasing for g and k in the range $[0..|v|]$.

[Figure 8: Coloring of chunks while shortening α: (a) red chunk, (b) green chunk, (c) recomputation of excess and green-to-red color change.]

For a chunk, with leftmost occurrence of α at position i, consider the substring $S[i..j]$ with maximal j and $S[i..j]$ having period p, where $p = p(\alpha)$ is the period of α. Note that the rightmost occurrence at position r in the chunk always covers position $j - p + 1$, since otherwise there is an occurrence of α at position $r + p$. We denote the chunk green if and only if $|\alpha| \bmod p \le (j - i + 1) \bmod p$. Otherwise the chunk is red.

The important property of the color of chunks is that when a chunk changes color from red to green while shortening α, then there exists a node in ST(S) where the merging of the leaf-lists increases the number of occurrences in the chunk by exactly one occurrence, i.e. it is sufficient to charge each insertion in a binary merge by the creation of one green chunk. The increase in the potential $\Phi_1$ can then be charged to the processing of events of type 4. When a chunk becomes green in ST(S), then $|\alpha| \bmod p = (j - i + 1) \bmod p$, implying that the last occurrence of α in the chunk is at position $j - |\alpha| + 1$ and covers position j but not $j + 1$. Since $S[j+1] \ne S[j+1-p]$, it follows that the indices of the occurrences of α at positions $j - |\alpha| + 1$ and $j - |\alpha| + 1 - p$ cannot be contained in the same leaf-list of a child, i.e. the locus of α in ST(S) is a node v of degree at least two, and the chunk consists of indices from the leaf-lists of two distinct children of v. Figure 8 illustrates the color changes for a chunk while shortening α.

Consider an event of type 1 where the excess of all k chunks is recomputed. This event only occurs when $|\alpha| \bmod p = 0$, i.e. all chunks are green and $g = k$. After the recomputation of the excess all chunks become red. The change of excess of the k chunks in the worst case implies that the number of chains is reduced by 2k, because chunks previously having excess 2 now have excess
1 and are merged with overlapping necklaces. It follows that $\sum_{c \in C} \Phi_3(c)$ is increased by at most $2k \log \frac{|v|}{2k} + 4k \le 6k \log \frac{|v|}{k}$. The change in potential is

$$\Delta\Phi(D(\alpha)) \le -7k \log \frac{|v| e}{k} + 6k \log \frac{|v|}{k} \le -k \log \frac{|v|}{k},$$

i.e. sufficient potential is released to pay for the $O(k \log \frac{|v|}{k})$ time for processing the event.

Consider an event of type 2 where all k chunks become necklaces, and where g of the k chunks are green. Converting the k chunks to necklaces does not change the number of chains, since all chunks have excess zero or one. The change in potential is

$$\Delta\Phi(D(\alpha)) \le -k \log \frac{|v|}{k},$$

i.e. sufficient potential is released to pay for the $O(k \log \frac{|v|}{k})$ time for processing the event.

Consider an event of type 3 where a chain is split into two chains of size $n_1$ and $n_2$, where $n_1 \le n_2$. The change in potential is

$$\begin{aligned}
\Delta\Phi(D(\alpha)) = \Delta\Phi_3(D(\alpha)) &= (2n_1 - \log n_1 - 2) + (2n_2 - \log n_2 - 2) - (2(n_1 + n_2) - \log(n_1 + n_2) - 2) \\
&= \log(n_1 + n_2) - \log n_1 - \log n_2 - 2 \\
&\le \log(2n_2) - \log n_1 - \log n_2 - 2 = -\log n_1 - 1,
\end{aligned}$$

i.e. sufficient potential is released to pay for the $O(\log n_1)$ time for processing the event.

Above we argued that the total time for processing all events of type 4 is $O(n \log n)$, by showing that inserting the indices from a leaf-list of size $n_1$ into the data structure for a leaf-list of size $n_2$ takes time $O(n_1 \log \frac{n_1+n_2}{n_1})$, where $n_1 \le n_2$. The merging can create at most $n_1$ new chunks in the data structure, and increase the number of green chunks by at most $n_1$. If there are g green chunks before the merging, then there are at most $g + n_1$ green chunks after the merging and the change in the $\Phi_1$ potential for the resulting data structure is

$$\Delta\Phi_1 \le 7(g + n_1) \log \frac{(n_1 + n_2) e}{g + n_1} - 7g \log \frac{n_2 e}{g} \le 7 n_1 \log \frac{(n_1 + n_2) e}{g + n_1},$$

which is $O(n_1 \log \frac{n_1+n_2}{n_1})$. Similarly, the merging creates at most $n_1$ new chunks and we get $\Delta\Phi_2 = O(n_1 \log \frac{n_1+n_2}{n_1})$. For $\Phi_3$ we observe that the $n_1$ insertions first create $n_1$ singleton chains with $\Phi_3$ potential zero. Each new singleton chain can then be joined with the chains to the left and right of it.
If $c_1, \ldots, c_k$ are the chains which are joined into other chains, $k \le 2n_1$, then the change in the potential $\Phi_3$ for the data structure is

$$\Delta\Phi_3 \le \sum_{i=1}^{k} (\log |c_i| + 2) \le 2k + k \log \frac{n_1 + n_2}{k} = O\Big(n_1 \log \frac{n_1 + n_2}{n_1}\Big).$$

We conclude that the total amortized time for the binary merging of two leaf-lists of size $n_1$ and $n_2$ is $O(n_1 \log \frac{n_1+n_2}{n_1})$. By Lemma 1 the total amortized time for handling all events of type 4 is $O(n \log n)$.

The event queue EQ is implemented as an array with n entries, each containing a linked list of the events with a specific priority. Insertions and
deletions can be done in constant time, given that we maintain a pointer to the event to delete. Hence, the total time for the operations in EQ is limited by the total number of events processed by the algorithm and the space used is $O(n + m)$, where m is the maximum number of events in EQ at the same time.

To show that the number of events of type 1 is $O(n)$ we have to show that the excess changes at most once on an edge (u, v) in the suffix tree. Let u be above v in the suffix tree. Assume that the excess changes twice along (u, v). Let $L(u) = \alpha_u$ and let $L(v) = \alpha_v$. Say that the excess changes at length $i \cdot p(\alpha_v)$ and $(i-1) \cdot p(\alpha_v)$, where $|\alpha_v| > i \cdot p(\alpha_v) > (i-1) \cdot p(\alpha_v) > |\alpha_u|$. It follows that $\alpha_v[1..(i-1) \cdot p(\alpha_v)]$ occurs at both positions 1 and $1 + p(\alpha_v)$ in $\alpha_v$. By considering the rightmost occurrence of $\alpha_v$ in S, this means that there has to be an insertion into the leaf-list between v and u, i.e. there is a node between v and u. This is a contradiction and we conclude that there cannot be two excess changes along any single edge.

The number of events of type 2 is not more than the number of events of type 1, i.e. at most $O(n)$. The total number of events of type 3 is limited by the number of insertions of indices into the data structure in processing events of type 4. This is shown above to be at most $O(n \log n)$. At any time during the execution of the algorithm each leaf is involved in at most one event of type 3 as the first of the indices for the overlapping occurrences, hence at most $O(n)$ events of type 3 are in EQ at the same time. The total number of events of type 4 is equal to the number of internal nodes in the suffix tree, which is at most $O(n)$. It follows that the event queue EQ will use $O(n)$ space and insertions and deletions will take time $O(n \log n)$ in total.

References

[1] A. Apostolico and A. Ehrenfeucht. Efficient detection of quasiperiodicities in strings. Theoretical Computer Science, 119.
[2] A. Apostolico and F. P. Preparata. Optimal off-line detection of repetitions in a string. Theoretical Computer Science, 22.
[3] A. Apostolico and F. P. Preparata.
Data structures and algorithms for the string statistics problem. Algorithmica, 15.
[4] G. S. Brodal, R. Lyngsø, C. N. S. Pedersen, and J. Stoye. Finding maximal pairs with bounded gap. Journal of Discrete Algorithms, Special Issue of Matching Patterns, 1(1):77-104.
[5] G. S. Brodal, R. B. Lyngsø, A. Östlin, and C. N. S. Pedersen. Solving the string statistics problem in time O(n log n). In Proc. 29th International
Colloquium on Automata, Languages, and Programming, volume 2380 of Lecture Notes in Computer Science.
[6] G. S. Brodal and C. N. S. Pedersen. Finding maximal quasiperiodicities in strings. In Proc. 11th Combinatorial Pattern Matching (CPM), volume 1848 of Lecture Notes in Computer Science. Springer Verlag, Berlin.
[7] M. R. Brown and R. E. Tarjan. A fast merging algorithm. Journal of the ACM, 26(2).
[8] M. Farach. Optimal suffix tree construction with large alphabets. In Proc. 38th Ann. Symp. on Foundations of Computer Science (FOCS).
[9] A. S. Fraenkel and J. Simpson. How many squares can a string contain? Journal of Combinatorial Theory, Series A, 82(1).
[10] D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press.
[11] K. Hoffmann, K. Mehlhorn, P. Rosenstiehl, and R. E. Tarjan. Sorting Jordan sequences in linear time using level-linked search trees. Information and Control, 86(1-3).
[12] S. Huddleston and K. Mehlhorn. A new data structure for representing sorted lists. Acta Informatica, 17.
[13] F. K. Hwang and S. Lin. A simple algorithm for merging two disjoint linearly ordered sets. SIAM Journal of Computing, 1(1):31-39.
[14] D. E. Knuth, J. H. Morris, and V. R. Pratt. Fast pattern matching in strings. SIAM Journal of Computing, 6.
[15] R. C. Lyndon and M. P. Schutzenberger. The equation $a^m = b^n c^p$ in a free group. Michigan Mathematical Journal, 9.
[16] E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2).
[17] K. Mehlhorn. Sorting and Searching, volume 1 of Data Structures and Algorithms. Springer Verlag, Berlin.
[18] J. Stoye and D. Gusfield. Simple and flexible detection of contiguous repeats using a suffix tree. Theoretical Computer Science, 270.
[19] E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14.
[20] P. Weiner. Linear pattern matching algorithms. In Proc. 14th Symposium on Switching and Automata Theory, pages 1-11.