Analysis of Algorithms and Data Structures for Text Indexing Moritz G. Maaß

Transcription

1 FAKULTÄT FÜR INFORMATIK TECHNISCHE UNIVERSITÄT MÜNCHEN Lehrstuhl für Effiziente Algorithmen Anlysis of Algorithms nd Dt Strutures for Text Indexing Moritz G. Mß

2 FAKULTÄT FÜR INFORMATIK TECHNISCHE UNIVERSITÄT MÜNCHEN Lehrstuhl für Effiziente Algorithmen Anlysis of Algorithms nd Dt Strutures for Text Indexing Moritz G. Mß Vollständiger Adruk der von der Fkultät für Informtik der Tehnishen Universität Münhen zur Erlngung des kdemishen Grdes eines Doktors der Nturwissenshften (Dr. rer. nt.) genehmigten Disserttion. Vorsitzender: Prüfer der Disserttion: Univ.-Prof. Dr. Dr. h.. mult. Wilfried Bruer 1. Univ.-Prof. Dr. Ernst W. Myr 2. Prof. Roert Sedgewik, Ph.D. (Prineton University, New Jersey, USA) Die Disserttion wurde m 12. April 2005 ei der Tehnishen Universität Münhen eingereiht und durh die Fkultät für Informtik m 26. Juni 2006 ngenommen.

3 Astrt Lrge mounts of textul dt like doument olletions, DNA sequene dt, or the Internet ll for fst look-up methods tht void serhing the whole orpus. This is often omplished using tree-sed dt strutures for text indexing suh s tries, PATRICIA trees, or suffix trees. We present nd nlyze improved lgorithms nd index dt strutures for ext nd error-tolernt serh. Affix trees re dt struture for ext indexing. They re generliztion of suffix trees, llowing idiretionl serh y extending pttern to the left nd to the right during retrievl. We present n lgorithm tht onstruts ffix trees on-line in oth diretions, i.e., y ugmenting the underlying string in oth diretions. An mortized nlysis yields tht the lgorithm hs liner-time worst-se omplexity. A spe effiient method for error-tolernt serhing in ditionry for pttern llowing some mismthes n e implemented with trie or PATRICIA tree. For given mismth proility q nd given mximum of llowed mismthes d, we study the verge-se omplexity of the numer of omprisons for serhing in trie with n strings over n lphet of size σ. Using methods from omplex nlysis, we derive suliner ehvior for d < log σ n. For onstnt d, we n distinguish three ses depending upon q. For exmple, the serh omplexity for the Hmming distne is σ(σ 1) d /((d + 1)!)log d+1 σ n + O(log d n). To enle n even more effiient serh, we utilize n index of limited d-neighorhood of the text orpus. We show how the index n e used for vrious serh prolems requiring error-tolernt look-up. An verge-se nlysis proves tht the index size is O(n log d n) while the look-up time is optiml in the worst-se with respet to the pttern size nd the numer of reported ourrenes. It is possile to modify the dt struture so tht its size is ounded in the worst-se while the ound on the look-up time eomes verge-se. iii

4

5 Aknowledgments First, I thnk my dvisor Ernst W. Myr for his helpful guidne nd generous support throughout the time of reserhing nd writing this thesis. Furthermore, I m lso thnkful to the urrent nd former memers of the Lehrstuhl für Effiziente Algorithmen for interesting disussions nd enourgement, espeilly to Thoms Erleh, Sven Kosu, Hnjo Täuig, nd Johnnes Nowk. Lstly, I m grteful to Anj Heilmnn nd my mother for proofreding the finl work. v

6

7 Contents 1 Introdution From Suffix Trees to Affix Trees Approximte Text Indexing Tree-Bsed Aelertors for Approximte Text Indexing Registers for Approximte Text Indexing Thesis Orgniztion Pulitions Preliminries Elementry Conepts Strings Trees over Strings String Distnes Bsi Priniples of Algorithm Anlysis Complexity Mesures Amortized Anlysis Averge-Cse Anlysis Bsi Dt Strutures Tree-Bsed Dt Strutures for Text Indexing Rnge Queries Text Indexing Prolems Rie s Integrls Liner Constrution of Affix Trees Definitions nd Dt Strutures for Affix Trees Bsi Properties of Suffix Links Affix Trees Additionl Dt for the Constrution of Affix Trees Implementtion Issues Constrution of Compt Suffix Trees On-Line Constrution of Suffix Trees Anti-On-Line Suffix Tree Constrution with Additionl Informtion Construting Compt Affix Trees On-Line Overview Detiled Desription vii

8 viii CONTENTS An Exmple Itertion Complexity Bidiretionl Constrution of Compt Affix Trees Additionl Steps Improving the Running Time Anlysis of the Bidiretionl Constrution Approximte Trie Serh Prolem Sttement Averge-Cse Anlysis of the LS Algorithm Averge-Cse Anlysis of the TS Algorithm An Ext Formul Approximtion of Integrls with the Bet Funtion The Averge Comptifition Numer Allowing Constnt Numer of Errors Allowing Logrithmi Numer of Errors Remrks on the Complexity of the TS Algorithm Applitions Text Indexing with Errors Definitions nd Dt Strutures A Closer Look t the Edit Distne Wek Tries Min Indexing Dt Struture Intuition Definition of the Bsi Dt Struture Constrution nd Size Min Properties Serh Algorithms Worst-Cse Optiml Serh-Time Bounded Preproessing Time nd Spe Conlusion 115 Biliogrphy 118 Index 129

9 Figures nd Algorithms 2.1 Exmples of Σ + -trees Illustrtion of the liner-time Crtesin tree lgorithm Suffix trees, suffix link tree, nd ffix trees for The ffix trees for nd for The proedure nonize() Construting ST() from ST() The proedure denonize() The proedure updte-new-suffix() Construting ST() from ST() The funtion gettrgetnodevirtuledge() Construting AT() from AT(), suffix view Construting AT() from AT(), prefix view The LS lgorithm The TS lgorithm Illustrtion for the proof of Lemm Lotion of the poles of g(z) in the omplex plne Illustrtion for Theorem Prmeters for seleted omprison-sed string distnes The relevnt edit grph forinterntionl ndintertion Exmples of ompt nd wek tries ix

10

11 Chpter 1 Introdution Computers, lthough initilly invented for numeri lultions, hve hnged the wy tht text is written, proessed, nd rhived. Even though mny more texts thn initilly expeted or hoped for re still printed to pper, writing is mostly done eletronilly tody. Hving text ville in digitl form hs mny dvntges: It n e rhived in very little spe, it n e reprodued with little effort nd without loss in qulity, nd it n e serhed y omputer. The ltter is gret improvement over previous methods espeilly euse it is not neessry to rete fixed set of terms tht re used to index the douments. When employing omputer, one ommonly llows every word to e used for serhing, often lled full-text serh. Effiient methods for serhing in texts re studied in the relm or pttern mthing. Pttern mthing is lredy rther mture field of reserh with numerous textooks ville [CR94, AG97, Gus97]; the si lgorithms re lso prt of stndrd ooks on lgorithms [CLR90, GBY91]. Serhing text for n ritrry pttern is usully lmost s fst s reding the text (espeilly, if reding is hmpered in speed euse the text is stored on slow medi suh s hrd disk). The ese with whih eletroni douments n e stored hs lso led to n enormous inrese in the mount of textul dt ville. The Internet (in prtiulr the World Wide We), for exmple, is tremendous nd growing olletion of text douments. The widely used serh engine Google reports to index more thn eight illion we pges. 1 Another type of textul dt is iologil sequene dt (e.g., DNA sequenes, protein sequenes). A populr dtse for DNA sequenes, GenBnk, ws reported to store over 33 illion ses for over speies in August 2003 nd to grow t rte of over 1700 new speies per month [BKML + 04]. The sheer size of these text olletions mkes on-line serhes unfesile. Therefore, lot of reserh hs een devoted to methods for effiient text retrievl [FBY92] nd text indexes. A text index is dt struture prepred for doument or olletion of douments tht filittes effiient queries for serh pttern. These queries n hve different types. For query with single pttern, one n serh, e.g., for ll ourrenes of serh string, for ll ourrenes of n element of set of strings, or for ll ourrenes of sustring tht hs length twenty nd ontins t lest ten equl hrters the possiilities for different riteri re endless nd usully the result of 1 These numers were tken from s of Mrh

12 FROM SUFFIX TREES TO AFFIX TREES speifi pplitions. For the lssil prolem of serhing for ll ourrenes of given string (the pttern ) in preproessed text, mny lgorithms nd dt strutures hve een proposed. The most prominent dt struture is the suffix tree, whih n e onstruted in liner time [Wei73, MC76, Ukk95, Fr97]. It llows finding ll ourrenes of pttern in time proportionl to the pttern s length nd the numer of outputs. Due to limittions of spe, the suffix rry ws introdued [MM93] s spe effiient replement for the suffix tree with slightly worse onstrution nd look-up omplexities. Reent developments, however, yield liner onstrution time [KSPP03, KA03, KS03] nd hieve n optiml query time [AKO02, AOK02, AKO04]. In prtie, onstrution methods with non-liner symptoti running times seem to e even fster [LS99, BK03, MF04]. For tight spe requirements, ompressed vrints of the suffix rry hve een suggested [GV00, FM00]. The prtil relevne of the suffix tree hs thus een redued. Nevertheless, euse of its ler struture, it often serves s prdigm for the design of new lgorithms (e.g., in [NBY00]). While the omputer llows reting nd rhiving lrge olletions of text effiiently, it sometimes lso leds to less reful hndling nd revising of douments. Errors my lso rise from mistkes or mesuring prolems during the experimentl genertion. Moreover, in iologil ontext, error-tolernt mthing is useful even for immulte dt, e.g., for similrity serhes. The field of pproximte pttern mthing dels with this prolem of serhing strings llowing errors. The most ommonly used models re Hmming distne [Hm50] nd edit distne, lso known s Levenshtein distne [Lev65]. For so lled on-line serhes, where no preproessing of the doument orpus is done, vriety of lgorithms is ville for mny different error models. A survey ontining more thn 120 referenes is given in [Nv01]. The reserh on text indexes llowing errors is y fr not s mture. We review the suffix tree s prdigm for ext text indexing nd s the sis for ffix trees in the next setion. We then turn to pproximte text indexing, very tive field of reserh with still quite few open questions. 1.1 From Suffix Trees to Affix Trees The suffix tree, lso sometimes referred to s suword tree, is rooted tree llowing ess to ll suffixes (nd thus to ll sustrings) of text from the root of the tree. When nnotting the suffix numers t the leves, the sutree elow pth orresponding to sustring of the text is ompt representtion of ll ourrenes of this sustring. This priniple llows n ext serh in optiml time. The first liner time onstrution lgorithm for suffix trees ws given y Weiner [Wei73]. MCreight desried simpler nd more effiient lgorithm [MC76]. A renewed interest in pttern mthing hs lso led to the oneptully most elegnt onstrution lgorithm y Ukkonen [Ukk95]. It ws lter shown y Giegerih nd Kurtz tht Ukkonen s nd MCreight s lgorithms re lmost the sme on n strt level of opertions. All three lgorithms onsider strings over n lphet of onstnt size. An lgorithm to onstrut suffix trees over integer lphets ws proposed y Frh [Fr97]. This lgorithm uses very different pproh, dividing the underlying string into odd nd even positions. (Note tht this pproh is lso tken y the liner time suffix rry onstrution lgorithms.) The

13 CHAPTER 1. INTRODUCTION 3 three previous lgorithms rely on uxiliry edges lled suffix links insted. Suffix links introdue dditionl struture themselves. The nlysis of this struture ws the strting point for the development of ffix trees. Before we go into more detil, let us give high level desription of suffix trees; the forml definitions re introdued in Chpter 2. A suffix tree ontins ll suffixes of string t in suh wy tht from the root of the tree we n strt reding ny sustring of t. This is done y first hoosing rnh upon the first hrter, then upon the seond nd so on. Thus, ll sustrings strting with the hrterre stored in rnh leled. Nodes tht do not present hoie, i.e., whih hve only one outgoing edge, re usully ompressed. Tht is, the nodes re eliminted y joining the inoming nd outgoing edge into one. Suh suffix trees re lled ompt. Eh node in suffix tree represents string nd is the root of the sutree ontining ll sustrings of t strting with this string. Pssing long n edge to nother node lengthens the represented string. Suffix links re used to move from one node to nother so tht the represented string is shortened t the front. This is extremely useful euse two suffixes of string lwys differ y the hrters t the eginning of the longer suffix. Thus, suessive trversl of suffixes n e sped up gretly y the use of suffix links. Giegerih nd Kurtz [GK97] hve shown tht there is strong reltionship etween the suffix tree uilt from the reverse string (often lled reverse prefix tree) nd the suffix links of the suffix tree uilt from the originl string. The suffix links of suffix tree form tree y themselves. In essene, this tree prtilly resemles the suffix tree of the reverse string. Atomi suffix trees ontin node for every sustring of t. Here, the suffix link struture is even identil (see Setion for detils). A similr reltionship ws lredy shown y Blumer et l. [BBH + 87] for ompt direted yli word grphs (-DAWG). A -DAWG is essentilly suffix tree where isomorphi sutrees re merged. As shown there, the direted yli word grphs (DAWG) ontins ll nodes of the ompt suffix tree nd uses the edges of the reverse prefix tree s suffix links. Beuse the nodes of the -DAWG re invrint under reversl of the string, one n gin symmetri version of the -DAWG y dding the suffix links s reverse edges. Although not stted expliitly, Blumer et l. [BBH + 87] thus lredy oserved the dul struture of suffix trees s stted ove. Beuse the suffix links of the ompt suffix tree lredy form prt of the suffix tree for the reverse string, it is nturl to sk whether the suffix tree n e ugmented to represent oth strutures. This would lso entute even more struture of the underlying string. Suh struture ws introdued nd hristened ffix trees y Stoye [Sto95] (see lso [Sto00]). He lso gve first lgorithm for onstruting ompt ffix trees on-line, ut the running time of the lgorithm presented is not liner. In Chpter 3, we desrie liner-time lgorithm for the onstrution of ffix trees. The struture of our lgorithm is very lose to tht of Stoye s lgorithm, ut we introdue some vitl new elements. Moreover, we put stronger emphsis on the derivtion from Ukkonen s nd Weiner s suffix tree onstrution lgorithms. The ore prolem for hieving liner omplexity nd the reson why Stoye s lgorithm did not hve liner running time is tht the lssil lgorithms y Ukkonen or MCreight expet suffix links to e of length one. Tht is, suffix links re expeted to point from node representing ertin string to node representing extly the string tht results from removing the first hrter. As n esily e seen in Figure 3.1, this is the se for suffix trees ut not for ffix trees (nodes nd in Figure 3.1(d)

14 APPROXIMATE TEXT INDEXING nd nodes nd in Figure 3.1(e)). It seems tht the prolem nnot e solved y trversing the tree to find the pproprite tomi suffix links. Stoye himself gve n exmple where this pproh might even led to qudrti ehvior. We will use dditionl dt olleted in pths to estlish view of the ffix tree s if it were the orresponding suffix tree (the pths orrespond to the oxed nodes in Figure 3.1). This will led to liner-time onstrution lgorithm. 1.2 Approximte Text Indexing In pproximte text indexing, we index single text or multiple texts to find pproximte mthes of query string, i.e., mthes llowing some errors. The text or the olletion of texts tht the index is uilt for is lled the doument orpus. We only del with the se of single query string lled the pttern. There re two mjor indexing tsks: text indexing nd ditionry indexing. 2 For the first prolem, text is indexed to query for ourrenes of sustrings of the text. For the seond, ditionry of strings is indexed s to find ll mthing entries for query. We disuss more vrints in Setion 2.4 (e.g., the se of multiple texts), ut they n usully e solved with the sme methods one the two si prolems re solved. In the following, let the size of doument orpus e n nd the size of the pttern e m. The si ide of text indexing is to spend some effort in prepring n dditionl dt struture, the index, from text in order to elerte susequent serhes for pttern. For ext text indexing, n optiml query time liner in the size of the pttern nd the numer of outputs n e rehed. From prtil point of view, query time involving n dditionl term logrithmi or polylogrithmi in the size of the text my still e resonle. Similr omplexities re frequently found in other serh dt strutures. For exmple, serh in red-lk tree (see, e.g., [CLR90]) for n elements tkes time of O(log n). Besides the query time, the size of n index is nother very importnt prmeter. We lredy mentioned tht the spe demnds of the suffix tree lthough liner in the size of the text re still too lrge for some pplitions, whih hs led to the introdution of lterntives like suffix rrys. Even for these, ompressed vrints re studied. This nd the ever-growing olletions of text douments underline the importne of the prmeter size. The third prmeter, side from look-up time nd index size, is the preproessing time (nd possily spe). This prmeter is usully the lest importnt euse text indexes re uilt one on reltively stti text orpus to filitte mny queries. For exmple, the first suffix rry lgorithm y Mner nd Myers [MM93] hd preproessing time of O(n log n) for texts of n hrters. Although this is worse thn the O(n) onstrution time for suffix trees, suffix rrys gined populrity quikly euse they onsumed muh less spe. In tht se, even the worse look-up time did not hinder the suess. There pper vrious definitions of pproximte text indexing in the literture. Besides the existene of mny different error models (see Setion 2.1.3), the definition 2 The ditionry indexing prolem is not to e onfused with the ditionry mthing prolem. In the ltter, ditionry of ptterns is indexed nd queried y text. For eh position in the text ll mthing ptterns from the ditionry re sought. The ext version n e solved with utomtons [AC75]. For the pproximte version see [FMdB99, AKL + 00, BGW00, CGL04].

15 CHAPTER 1. INTRODUCTION 5 of index is not ler. The roder definition just requires the index to speed up serhes [NBYST01], wheres striter pproh requires to nswer on-line queries in time proportionl to the pttern length nd the numer of ourrenes [AKL + 00]. In our opinion, slightly wekened definition tht requires n index to give effiient worst-se gurntees on the serh time is most pproprite. For the ske of mking ler distintion, we ll the methods giving suh gurntees registers nd the other methods elertors. 3 Clssifying our work from this perspetive, we present the verge-se nlysis of the effet of using trie s n elertor for ditionry indexing in Chpter 4. In Chpter 5, we first desrie register with ounded verge-se size for ritrry ptterns nd then omintion of register for short ptterns with n elertor for long ptterns. A survey on elertors is given in [NBYST01]. The ville methods re lssified y the serh pproh nd the dt strutures. The serh pprohes re lssified into neighorhood genertion nd prtitioning. The si dt strutures re suffix trees, suffix rrys, omplete q-grm tles, nd smpled q-grm tles. As lredy mentioned, suffix rrys re very similr to suffix trees. We hndle these nd other treesed strutures in the next setion. In Setion 1.2.2, we turn our ttention to registers. Further referenes onerning q-grm-sed pprohes n e found in [Kär02]. A very different pproh is tken y Chávez nd Nvrro [CN02]. Using metri index they hieve look-up time O(mlog 2 n + m 2 + o) with n index of size O(n log n) with O(n log 2 n) onstrution time, where ll omplexities re hieved on verge. Another somewht different pproh is tken y Griele et l. [GMRS03]. For restrited version of Hmming distne llowing d mismthes in every window of r hrters, they desrie n index tht hs verge size O(n log l n), for some onstnt l nd verge look-up time O(m + o). The ide is sed on dividing the text into smller piees so tht strings of length m hve unique ourrenes. At eh level, ll strings within speified distne re generted nd indexed to identify possile lotions of the pttern. These re then serhed using stndrd methods. Therefore, the pproh lssifies s n elertor or filter. A speil se of ditionry indexing is the nerest neighor prolem, whih sks for the losest string to query mong olletion of n strings of the sme length m. The speifi prolem hs its origin in omputtionl geometry. A very erly definition n e found in [MP69]. A survey on this prolem is given in [Ind04]. The work lso disusses some lower ounds (see lso [BOR99, BR00, BV02] nd [CCGL99]), whih t lest for [BV02] trnsfer to the pproximte ditionry indexing prolem showing tht, under ertin ssumptions, ny lgorithm using n index of size 2 (nm)o(1) nd (nm) o(1) dditionl working spe needs t lest Ω(mlog m) time for query Tree-Bsed Aelertors for Approximte Text Indexing Digitl trees hve wide rnge of pplitions nd elong to the most studied dt strutures in omputer siene. They hve een round for yers; for their usefulness 3 None of these terms pper in the literture. We introdue them here in order to etter lssify different pprohes in this setion only. The term register seemed pproprite euse worst-se methods usully work y reding off the ourrenes of pttern fter omputing ertin strt points. On the other hnd, verge-se methods often just redue the numer of lotions where to hek for n ourrene, thus elerting n on-line serh lgorithm.

16 APPROXIMATE TEXT INDEXING nd euty of nlysis, they hve reeived lot of ttention. Using tree (espeilly trie or suffix tree) trversls for indexing prolems is ommon tehnique. For instne, in omputer linguistis one often needs to orret misspelled input. Shulz nd Mihov [SM02] pursue the ide of orreting misspelled words y finding orretly spelled ndidtes from ditionry implemented s trie or n utomton. They uild n utomton for the input word nd trverse the trie with it in depth first serh. The serh utomton is liner in the size of the pttern if only onstnt numer of errors is llowed. A similr pproh hs een investigted y Oflzer [Ofl96], exept tht he diretly lultes edit distne insted of using n utomton. Fljolet nd Pueh [FP86] nlyze the verge-se ehvior of prtil mthing in k-d-tries. A pttern in k domins with s speified vlues nd (k s) don t re symols is serhed. Eh entry in k-dimensionl dt set is represented y the inry string onstruted y ontenting the first its of the k domin vlues, the seond its, the third its, nd so forth. Using the Mellin trnsform, Fljolet nd Pueh prove tht the verge serh time in dtse of n entries under the ssumption of n independent uniform distriution of the its is O(n 1 s/k ). The nlysis is extended y Kirshenhofer et l. [KPS93] to the symmetri se, where the exponent depends upon k, s, nd the it proility p (n expliit formul nnot e given). In the sme pper, the uthors lso derive symptoti results for the vrine, thus proving onvergene in proility. In terms of ordinry strings, serhing in k-d-trie orresponds to mthing with fixed msk of width k ontining s don t res tht is iterted through the pttern. It is possile to rndomize this fixed msk whih yields relxed k-d-trees, whih re nlyzed with similr results y Mrtinéz et l. [MPP01]. Bez-Ytes nd Gonnet [BYG96] study the prolem of serhing regulr expressions in trie. The regulr expression is used to uild deterministi finite stte utomton, whose size depends only upon the size of the query (lthough it is possily exponentil in the query size). The utomton is simulted on the trie, nd hit is reported every time finl stte is rehed. Extending the verge-se nlysis of Fljolet nd Pueh [FP86], the uthors show tht the verge serh time depends upon the lrgest eigenvlue (nd its multipliity) of the inidene mtrix of the utomton. As result, they prove tht only suliner numer of nodes of the trie is visited. In nother rtile, Bez-Ytes nd Gonnet [BYG90, BYG99] study the verge ost of lulting n ll-ginst-ll sequene mthing. Here, ny sustrings tht mth eh other with ertin (fixed) numer of errors re sought. With the use of tries, the verge time is shown to e suqudrti. Apostolio nd Szpnkowski [AS92] note tht suffix trees nd tries for independent strings symptotilly do not differ too muh under the symmetri Bernoulli model. This onjeture is hrdened y Jquet nd Szpnkowski [JS94] nd y Jquet et l. [JMS04]. Therefore, it seems resonle to expet similr results on tries nd suffix trees (s is done y Bez-Ytes nd Gonnet [BYG96]). For pproximte indexing (with edit distne), Nvrro nd Bez-Ytes [NBY00] desrie method tht flexily prtitions the pttern in piees tht n e serhed in suliner time in the suffix tree for text. 4 For n error rte α = d m, where m 4 For n tul implementtion, Nvrro nd Bez-Ytes suggest to use suffix rrys insted of suffix trees. This is n exmple of how the suffix tree, lthough possily inferior to the suffix rry, plys n importnt role s prdigm for the design nd nlysis of lgorithms.

17 CHAPTER 1. INTRODUCTION 7 is the pttern length nd d the llowed numer of errors, they show tht sulinertime serh is possile if α < 1 e/ Σ, therey prtitioning the pttern into j = (m+d)/(log Σ n) piees. The threshold 1 e/ Σ plys two roles, it gives ound on the serh depth in suffix tree nd it gives ound on the numer of verifitions needed. In Nvrro [Nv98], the ound is investigted more losely. It is onjetured tht the rel threshold, where the numer of mthes of pttern in text dereses exponentilly fst in the pttern length, is α = 1 / Σ with Higher error rtes mke filtrtion lgorithm useless euse too mny positions hve to e verified. More reful tree trversl tehniques n lower the numer of nodes tht need to e visited. This ide is pursued y Jokinen nd Ukkonen [JU91] (on DAWG), lthough no etter ound thn O(nm) is given for the worst-se. A ktrking pproh on suffix trees ws proposed y Ukkonen [Ukk93], hving look-up time O(mq log q + o) nd spe requirement O(mq) with q = min{n, m d+1 } for d errors. This ws improved y Cos [Co95] to look-up time O(mq + o) nd spe O(q + n). No ext verge-se nlysis is ville for these lgorithms Registers for Approximte Text Indexing Reserh hs minly onentrted on methods with good prtil or verge-se performne. Until reently, the est known result for text indexing ws due to Amir et l. [AKL + 00], who desrie n index for one error under edit distne. The ide is sed on uilding two suffix trees for the text nd its reverse. A query for pttern is implemented y omputing for eh prefix nd eh orresponding suffix of the pttern the intersetion of the leves in the sutrees of oth suffix trees. For the ltter, the leves re leled with the suffix (respetively, prefix) numers nd stored in two rrys. To void reporting mthes multiple times, the hrter involved in the mismth is dded s third dimension. On these rrys, three-dimensionl rnge serhing lgorithm is used to ompute those leves orresponding to sustrings mthing with one error. The size nd preproessing time of the rnge serhing dt struture is O(n log 2 n). Eh rnge query tkes time O(log n log log n), leding to n overll query time of O(mlog n log log n + o). By designing n improved lgorithm for rnge serhing over suh tree ross produts, this result ws improved y Buhsum et l. [BGW00] to O(n log n) index preproessing time nd index size nd O(mlog log n + o) query time. Another pproh ws tken y Nowk [Now04]. It solves the text indexing prolem for one mismth under Hmming distne with verge preproessing time O(n log 2 n), verge index size O(n log n), nd worst-se query time O(m + o). The lgorithm is sed on onstruting tree derived from the suffix tree of the text tht ontins ll strings with one mismth ourring diretly fter node in the suffix tree. The error tree is onstruted y suessively merging sutrees of the suffix tree. Queries re implemented effiiently y mking se distintion upon the lotion of the mismth. We extend the ide to edit distne llowing onstnt numer of errors nd to other indexing prolems in Chpter 5. Reently, Cole et l. [CGL04] proposed dt struture llowing onstnt numer d of errors tht works for don t res (lled wild rds), the Hmming distne, nd the edit distne. For Hmming distne with d mismthes, worst-se query

18 THESIS ORGANIZATION time of O(m + ( log n) d /(d!) log log n + o) is hieved using dt struture tht hs size nd preproessing time O(n( log n) d /(d!)) in the worst-se. For the edit distne, the query time eomes O(m + ( log n) d /(d!) log log n + 3 d o), nd the index size nd preproessing time is O(n( log n) d /(d!)). The onstnt vries for eh given omplexity. The ide is sed on entroid pth deomposition of the suffix tree nd the onstrution of errt trees for the pths. The omplexities result from the ft tht ny string rosses t most log n entroid pths tht hve to e hndled seprtely. The pproh of Cole et l. [CGL04] n lso e used for the pproximte ditionry indexing prolem. For n strings of totl size N, they hieve query time of O(m + ( log n) d /(d!) log log N + o) with O(N + n( log n) d /(d!)) spe nd O(N + n( log n) d /(d!)+n log N) preproessing time for Hmming distne with d mismthes. For edit distne with d errors, their pproh supports queries in time O(m + ( log n) d /(d!) log log N + o) using nd index tht hs preproessing time O(N + n( log n) d /(d!) + N log N) nd size O(N + n( log n) d /(d!)). Previous results on ditionry indexing only llowed one error. For ditionries of n strings of length m nd Hmming distne, Yo nd Yo [YY97] (erlier version in [YY95]) present nd nlyze dt struture tht hieves query time of O(mlog log n + o) nd spe O(N log m). The result is hieved y dividing the words reursively into two hlves nd serhing one hlf extly. To improve the query time, ertin d queries re hndled seprtely. The nlysis is onduted under ell-proe model with itwise omplexity (i.e., ounting it proes), thus no preproessing time is given. Also for the Hmming distne nd ditionries of n strings of length m eh (i.e., with N = nm), Brodl nd G sienie [BG96] present n lgorithm sed on tries, whih uses similr ides thn those in [AKL + 00]. For the strings in the ditionry, two tries re generted, the seond one ontining the reversed strings. In the first trie, the strings re numered lexiogrphilly, while the other one is ppended with sorted lists tht re itertively trversed upon query nd ompred with the intervls from the first trie. The lists re internlly linked from prent to hild nodes, thus sving some work. Their dt struture hs size nd preproessing time O(N) nd supports queries with one mismth in time O(m). Brodl nd Venktesh [BV00] nlyze the prolem in ell-proe model with word size m. Trnslted to our uniform omplexity, the query time of their pproh is O(m) using O(N log m) spe. The ide is sed on using perfet hsh funtions to quikly determine the position of n error. The dt struture n e onstruted in rndomized expeted time O(Nm). 1.3 Thesis Orgniztion In Chpter 2, we introdue the relevnt nottion on strings, trees, nd string distnes. We desrie onepts for the nlysis of our index strutures nd lgorithms nd review some si methods nd dt strutures. We lso give more detils on the methods used for the verge-se nlysis. Chpter 3 presents liner time lgorithm for idiretionl on-line onstrution of ffix trees. We first desrie the ffix trees dt struture. After reviewing lgorithms

19 CHAPTER 1. INTRODUCTION 9 for the onstrution of suffix trees, we devise liner-time on-line onstrution lgorithm nd nlyze its omplexity. The desription of the neessry hnges for the idiretionl onstrution nd the mortized nlysis finish the hpter. The mthemtilly most hllenging nlysis is presented in Chpter 4. We ompute the verge omplexity of n lgorithm sed on tries for pproximte ditionry indexing nd ompre it to simple pproh tht ompres pttern sequentilly ginst the whole dtse. Therefore, we first give the nlysis of the verge omplexity of the simple lgorithm. We then derive n ext formul for the trie-sed pproh. After estlishing some si ounds, we ompute the mximl expeted speed-up. Then we nlyze the omplexity for onstnt numer of errors nd finlly ompute the threshold on the numer of errors where the performne of the trie-sed pproh eomes symptotilly the sme s the performne of the nive method. Chpter 5 presents text index for pproximte pttern mthing with d errors. It hs worst-se optiml query time nd resonle verge size O(n log d n) for d errors nd text size n. We first formlize some intuitive properties of edit distne nd introdue slightly ltered trie dt struture. We then desrie the min indexing struture sed on error trees, detil the serh lgorithm, nd prove size nd time ounds. Next, we nlyze the verge size for n implementtion with worst-se look-up time. The hpter finishes with the nlysis of modified pproh tht gives worst-se gurntees on the index size ut only verge-se ounds on the query time. Finlly, Chpter 6 summrizes our results nd gives some diretions for further reserh. 1.4 Pulitions The results of this thesis hve in prt een pulished or nnouned t onferenes. The results of Chpter 3 hve ppered in [M00] nd [M03]. The verge-se nlysis of the trie serh lgorithm in Chpter 4 hs een pulished s n extended strt [M04] nd s tehnil report [M04]. A journl version hs een sumitted to ALGORITHMICA. The results of Chpter 5 hve een otined in joint work with Johnnes Nowk. In prtiulr, the joint work inluded the ide of defining error trees reursively nd mking distintion upon the position of the error nd the height of the tree. The uthor of this work ws responsile for sing the error trees on lerly defined error sets, whih llowed for strightforwrd generliztion to ny onstnt numer of errors nd enled us to prove useful dihotomy for the serh lgorithm. Furthermore, the uthor introdued rnge queries to selet ourrenes in sutrees, whih solved the prolem of nswering queries in worst-se optiml time nd llowed the dption to wide rnge of text indexing prolems, nd hieve trde-off etween query time nd index size y defining wek tries. Additionlly, the verge-se nlysis ws provided y the uthor of this thesis. First results on n index for one error nd edit distne hve een sumitted nd epted for pulition [MN04]. An extended strt for the full set of results will e presented t the CPM 2005 nd is sheduled to pper in [MN05]. A tehnil report desriing the results hs lso een pulished [MN05].

20

21 Chpter 2 Preliminries This work is onerned with the design nd nlysis of text indexes. In this hpter, we introdue the si onepts nd nottion needed in order to desrie our lgorithms nd ondut the nlysis. Furthermore, we review some si tehniques relted to text indexing. 2.1 Elementry Conepts The mening of text in nrrower sense is tht of written text, e.g., in ook, whih is strutured into words. Text indexing refers to roder definition of text inluding unstrutured text suh s DNA sequenes. Therefore, we introdue the notion of string. A string is sequene of symols. When reting n index on string or set of strings, we lso refer to the ltter s text or text orpus. We query n index with nother string lled the pttern. We desrie the si nottion for strings in the next setion. We then introdue onepts of trees, whih re n elementry serh struture for strings. Finlly, euse rel dt is often fr from eing perfet, we onlude the setion with some definitions of errors in strings Strings A string is sequene of symols from n lphet Σ. Unless otherwise stted, we ssume finite lphet Σ. We let Σ l denote the set of ll strings of length l, i.e., the set of ll hrter sequenes t 1 t 2 t 3 t l where t i Σ. The speil se Σ 0 is the set ontining only the empty string denoted y ε. The set of ll finite strings is defined s Σ = l 0 Σl nd the set of ll non-empty, finite strings y Σ + = Σ\{ε}. The length of string is the numer of hrters of the sequene, i.e., the string t = t 1 t 2 t 3 t l hs length t = l. We n esily reverse string t = t 1 t 2 t 3 t l, whih we denote y t R = t l t l 1 t l 2 t 1. For strings u, v Σ, we denote their ontention y uv. For u, v, w Σ nd t = uvw, u is prefix, v sustring, nd w suffix of t. The sustring strting t position i nd ending t position j is denoted y t i,j, where t i,j = ε whenever j < i. We denote y t i, the suffix strting t position i nd y t,j the prefix ending t position j. We sy tht string u ours t position i of t if 11

22 ELEMENTARY CONCEPTS u = t i t i+ u 1. The set of ourrenes of u in t is denoted y ourrenes u (t) = {i u = t i t i+ u 1 }. (2.1) A sustring v is lled proper if there exist i 1 nd j l suh tht v = t i t j. For string t = t 1 t 2 t 3 t l, the set of sustrings is denoted y the set of proper sustrings y the set of suffixes y nd the set of prefixes y sustrings (t) = {t i t j 1 i, j l}, (2.2) psustrings (t) = {t i t j 1 < i j < l}, (2.3) suffixes (t) = {t i t l 1 i l}, (2.4) prefixes (t) = {t 1 t i 1 i l}. (2.5) A suffix u of t is lled proper if it is not the sme s t, i.e., u t. It is lled nested if it hs t lest two ourrenes, i.e., dditionlly to u suffixes(t) we hve u psustrings(t) prefixes(t). Alterntively defined, suffix u of t is nested if u suffixes(t) nd ourrenes u (t) > 1. Similrly, prefix u of t is lled proper if it is not the sme s t, nd it is lled nested if it is proper prefix with more thn one ourrene in t. We ll sustring u of t right-rnhing if it ours t two different positions i nd j of t followed y different hrters, Σ, i.e., t i,i+ u = u, t j,j+ u = u, nd. A sustring u of t is lled left-rnhing if it ours t two different positions i, j of t preeded y different hrters, Σ, i.e., t i 1,i+ u 1 = u, t j 1,j+ u 1 = u, nd. We illustrte the previous definitions with the string nn: The suffix nn is not proper, the suffixnn is proper ut not nested, nd the suffixn is proper nd nested. There is no right-rnhing sustring in nn, ut the sustring n is left-rnhing euse n nd nn re sustrings of nn. The following importnt ft n e esily seen. Oservtion 2.1 (The right-rnhing property is inherited y suffixes). If w is right-rnhing in t, then ll suffixes of w re lso right-rnhing in t. Let S Σ e finite set of strings. The size of S is defined s S = u S u. The rdinlity of S is denoted y S. For ny string u Σ, we let u prefixes(s) denote tht there is string v in S suh tht u prefixes(v). We define mxpref(s) to e the length u of the longest ommon prefix u of ny two different strings in S. The prefix-free redution of set of strings S is the set of strings in S tht hve no extension in S, i.e., those strings tht re not prefix of ny other string. We denote the prefix-free redution of S y pfree (S) = {u u S nd Σu S}. (2.6) A set S is lled prefix-free if no string is the prefix of ny other string, tht is, pfree(s) = S.

23 CHAPTER 2. PRELIMINARIES Trees over Strings Trees re used in the implementtion of mny well-known serh strutures nd lgorithms, e.g., AVL-trees, (,)-trees, red-lk trees [Knu98, CLR90]. When the universe of keys onsists of strings, we n do etter thn the forementioned omprisonsed methods. This is lled digitl serhing [Knu98]. The si ide is ptured in the definition of Σ + -trees [GK97]. A Σ + -tree T is rooted, direted tree with edge lels from Σ +. For eh Σ, every node in T hs t most one outgoing edge whose lel strts with (unique rnhing riterion). In essene, Σ + -tree llows one-to-one mpping etween ertin strings nd the nodes of the tree. An edge is lled tomi if it is leled with single hrter. A Σ + -tree is lled tomi if ll edges re tomi. A Σ + -tree is lled ompt if ll nodes other thn the root re either leves or rnhing nodes. Repling sequenes of nodes tht eh hve only one hild y single edge with the ontented lels is lled pth ompression. Let p e node of the Σ + -tree T, we denote this y p T. Let u e the string tht results from ontenting the edge lels on the pth from the root to p. We define the pth of p y pth(p) = u. By the unique rnhing riterion there is one-toone orrespondene etween u nd p. In suh se, it is onvenient to introdue the nottion u to identify the node p. We define the string depth of node p to e depth(p) = pth(p). The node depth is defined s the numer of nodes on the pth from the root to node. The ltter is less importnt to us euse sine edge lels re never empty it is ounded from ove y the string depth. In the following, we thus refer to the string depth simply y the depth. For Σ + -tree, we define its height to e the mximl depth of n inner node. Eh Σ + -tree T defines word set tht is represented y the tree. The word set words(t ) is the set of strings tht n e red long pths strting t the root of the tree. Formlly, we hve words (T ) = {u there exists p T with u prefixes (pth (p))}. (2.7) Note tht pfree(words(t )) is extly the set of strings represented y the leves of T. For n tomi or ompt Σ + -tree T there is one-to-one orrespondene etween the set of strings words(t ) nd the tree T. For n tomi Σ + -tree T there is even node p with pth(p) = u for every string u words(t ). For non-tomi, espeilly for ompt Σ + -trees, we distinguish whether string hs n expliit or n impliit lotion. If for string u words(t ) there exists node p T, then we sy tht u hs n expliit lotion, otherwise we sy tht u hs n impliit lotion. Wheres expliit lotions n onveniently e represented y node, we hve to introdue virtul nodes for impliit lotions. Let u words(t ) e string with n impliit lotion. By definition, there must e node p with u prefixes(pth(p)). Let q e the prent of p, then u = pth(q)v for some v Σ +. We define the pir (q, v) to e the virtul node for u. A virtul node is onept tht llows us to tret ompt Σ + -trees in the sme wy s tomi Σ + -trees. With proper representtion of virtul node (e.g., s (nonil) referene pir in Chpter 3), we n extend this to n lgorithmi level designing lgorithms for tomi Σ + -trees ut representing them more spe effiiently y ompt Σ + -trees. For the Σ + -tree T, we denote y T p the sutree rooted t node p. It mkes no differene whether p is rel or virtul node. It is onvenient to use the sme nottion

24 ELEMENTARY CONCEPTS for strings. For u words(t ), we let T u denote the sutree rooted t the (possily virtul) node p with pth(p) = u. The words represented y T u re speil suffixes of the word set represented y T, i.e., words(t u ) = {v uv words(t )} String Distnes Wherever rel dt is entered into omputing system either y hnd or y some utomti devie, there is the hne for errors. Furthermore, mny times the proess of generting the dt, e.g., y mesuring it, my e error prone. Therefore, it is neessry to desrie nd hndle errors in texts nd strings. There re mny possile wys of desriing the distne (or similrity) of two strings. A very generl model of distne n e desried y set of rules lled opertions of the type Σ Σ ssoited with ertin ost. The distne is the totl ost needed to trnsform one string into the other y sequentilly pplying the opertions to sustrings. Beuse this generl model llows enoding Turing mhine omputtion, the distne etween two strings my e undeidle. In this work, we only del with distnes sed on opertions of the type Σ {ε} Σ {ε}, whih re esily omputle y dynmi progrmming. It is possile to define distnes tht operte on lrger loks, ut there is no generlly epted, widespred model. Furthermore, if the model is not refully hosen, it might not e esily omputle, e.g., there re lok distnes tht re NPomplete [LT97]. The restrition to opertions over sustrings of t most one hrter results in three different edit opertions: insertion {ε} Σ, deletion Σ {ε}, nd sustitution Σ Σ. The ost of eh opertion n e hosen differently y the type of the opertion nd y the hrters tht re involved. It is lso possile to restrit the use only to ertin types of opertions, whih n lso e hieved y setting the ost of the undesired opertions to infinity. We ssume tht the so defined distnes re positive, symmetri, nd oey the tringle inequlity to deserve their nme. In the simplest version, we only llow sustitutions. This orresponds to ompring two strings hrter-wise. For these omprison-sed string distnes, we ssign vlue d(, ) {0, 1} to the omprison of eh pir of hrters, Σ. Thus, the distne etween two strings u, v Σ is simply defined y { if u v, d(u, v) = u i=1 d(u i, v i ) otherwise. (2.8) The most elementry omprison-sed distne is the Hmming distne where d(, ) is zero if nd only if =, nd it is one otherwise [Hm50]. The Hmming distne ounts the numer of mismthes etween two strings of the sme length. A ommon extension re don t-re symols. A don t-re symol x is speil hrter for whih d(x, ) is lwys zero. If we llow only insertions nd deletions, we get the longest ommon susequene distne. The vlue is often divided y the string length to yield vlue etween zero nd one, giving some mesure of string similrity. The generl se is usully lled weighted edit distne. It n e omputed vi dynmi progrmming using the following reursive eqution. For u Σ k nd v Σ l,

25 CHAPTER 2. PRELIMINARIES 15 we ompute d(u 1 u k, v 1 v l ) = min d(u 1 u k, v 1 v l 1 ) + d(ε, v l ) d(u 1 u k 1, v 1 v l ) + d(u k, ε) d(u 1 u k 1, v 1 v l 1 ) + d(u k, v l ), (2.9) where d(ε, ε) = 0. In the ove formul, the top line represents n insertion, the middle line deletion, nd the ottom sustitution or mth. In the most si vrint, we ssign ost of one for n insertion, deletion, or sustitution nd ost of zero for mth. The result is simply lled edit distne or Levenshtein distne [Lev65]. It ounts the miniml numer of insertions, deletions, nd sustitutions to trnsform one string into nother. We should mention tht there re mny other models for string distne ut none s thoroughly studied s the forementioned ones. Besides those lso onsidered in the pttern mthing or omputtionl iology ommunity (see [Nv01] for list nd referenes), there re numer of speilized distnes in informtion retrievl or reord linkge (see, e.g., [PW99]). 2.2 Bsi Priniples of Algorithm Anlysis Before emrking on the design nd nlysis of our lgorithms, we review some si priniples. In prtiulr, we hve to gree on how to mesure the ost or omplexity of n lgorithm nd dt struture, i.e., we desrie the mesure for our resoures time nd spe. Seondly, we review mortized nlysis, whih is used hevily in Chpter 3. Finlly we desrie the si onepts of verge-se nlysis, needed for Chpters 4 nd Complexity Mesures In order to ssess the qulity of n lgorithm y ompring it with other lgorithms, we need ommon mesure. The si resoures we ompre re time nd spe usge. We ompre these symptotilly y desriing the reltion etween the size of the input, nd the spe nd time needed. The omplexity of n lgorithm is thus desription of the growth in the mount of resoures used. There re silly two hoies for time omplexity nd two for spe omplexity. Most omputers do not work with its ut with hunks of its of ertin size, lled words. Bsi opertions re usully implemented in the min proessor to tke only onstnt numer of lok tiks (e.g., ddition, omprison, or even multiplition). The word size usully lso ffets the memory ess, i.e., the us width. Therefore, word is often ssumed s si unit in the omplexity nlysis. Regrding time omplexity, we n either ssume uniform ost model with onstnt time per opertion or logrithmi ost model where the ost of eh opertion is liner in the numer of its of the numers operted on. The uniform ost model is very populr euse it reflets the sitution where the size of numers nd pointers does not exeed the word size of the omputer. This is usully the se for text lgorithms when ll dt fits into the min memory, for whih pointer n normlly e stored in single word. Usully, even the size of the dt storle on rndom

26 BASIC PRINCIPLES OF ALGORITHM ANALYSIS ess hrd disk does not exeed one or two words. While opertions like multiplition n indeed depend on the size of the numer, this ft is often not visile due to dvned miroproessor rhiteture fetures like pipelining, out-of-order exeution, or hing strtegies (memory ess is often ottlenek). Espeilly regrding text indexes, there re two min models for mesuring spe. The word model ounts the numer of words of memory used. A more detiled nlysis, lled it model, ounts the numer of its used. When the size of the input is limited, e.g., y the min memory, the differene in oth models is only onstnt ftor (the word size). In prtie, the size of n index plys n importnt role, so the onstnt ftor hidden y the Lndu nottion is of interest. Unless expliitly stted otherwise, we ssume uniform ost model for time omplexity nd the word model for spe omplexity in this work Amortized Anlysis Amortized nlysis is tehnique to ound the worst-se time omplexity of n lgorithm. It is ommonly used to nlyze sequene of opertions nd prove worst-se ound on the verge time per opertion. There re three si tehniques lled ggregte, ounting, nd potentil method (see, e.g., Chpter 18 in [CLR90]). We mke hevy use of the ounting method in Chpter 3; therefore, we review the method on dynmi tles, dt struture tht we lso need in Chpter 3. A dynmi tle is growing rry with mortized onstnt ost for dding n element. Elements re dded to the end until previously reserved spe is filled. Eh time this hppens, new, lrger spe is limed, nd the old elements re opied to this spe. Referening nd hnging elements works in the sme wy s for norml rrys. Dynmi tles n lso e implemented to support deletions, ut we refer the reder to [CLR90] for the detils. The si ide of mortized nlysis is to distriute the ost rising for ertin opertions, e.g., when the rry rehes its urrent mximum nd the elements need to e opied to new lotion. Therefore, we onsider n infinite sequene of opertions nd show tht the first n opertions n e exeuted in n steps, i.e., every step tkes mortized onstnt time. By the ggregte tehnique, we would just ound the totl numer of opertions nd divide it y n. By the potentil method, we would ssign ertin potentil to the dt struture (muh like the potentil energy of ike on top of hill) nd use it to perform ertin ostly opertions tht redue the potentil. Here, for the ounting method, we ssign oins to ertin ples in the dt struture. In prtiulr, we put two oins on eh dded element to py for the susequent restruturing. In sense, we put ertin mount of time in nk ount to e used lter. The ounting method differs from the potentil method in tht one nnot neessrily see the oins when looking t the dt struture. Therefore, we hve to prove tht the oins re ville when we need them. For the dynmi tle, we only desrie the dt struture nd the opertion to dd n element. The dt struture onsists of n rry A of size m nd fill level l. It stores l elements. As long s l +1 m, new element is stored in A[l +1]. When the fill level rehes m, new rry A of size 2m is reted nd the elements from A re opied to A. The rry A is then relesed nd repled y A. We need to perform m dditionl opertions to opy the elements from A to A. Afterwrds, we n store the

27 CHAPTER 2. PRELIMINARIES 17 new element in l + 1 s usul. Note tht fter we hve opied elements from A to A we hve l = m 2. To prove tht dding n element tkes mortized onstnt time, we dd two oins on eh newly inserted element if l + 1 m. Thus, the mortized time to dd n element is 2+1 = 3 opertions for dding two oins nd storing the new element in A. At the time where the rry hs to e enlrged, we hve dded the elements A[ m 2 + 1] to A[m]. On eh of these elements, we hve stored two oins. Therefore, we hve extly m oins, whih we use to py for the time needed to opy the elements from A to the newly lloted A. We then dd the new element together with two oins. As result, dding new element hs n mortized onstnt ost of one rel nd two delyed opertions Averge-Cse Anlysis Averge-se nlysis nlyzes the expeted ehvior of lgorithms. This is espeilly interesting in the se tht n lgorithm is exeuted mny times with different inputs, thus giving Fortun hne to do her duty nd even out some inputs with high osts. The primry gol of verge-se nlysis is to find etter understnding of the ehvior of lgorithms s it is oserved in prtie. The usefulness of n verge-se nlysis lso depends upon the dopted proilisti model tht hs to e suffiiently relisti. The model is therefore the sis of the nlysis. In this setion, we desrie some ommon models for verge-se nlysis of string lgorithms. A very thorough tretment of the sujet is ville in [Szp00]. In the verge-se nlysis of string lgorithms, the rndom input is usully string or set of strings. It is ommon to ssume tht eh string w is the prefix of n infinite sequene of symols {X k } k 1 generted y rndom soure from the lphet Σ. The ssumptions out the soure determine the proilisti model. The simplest soure is the memoryless soure. The sequene {X k } k 1 is ssumed to e the result of independent Bernoulli trils with fixed proility Pr{X k = } for eh hrter Σ. We distinguish the symmetri se where the proility for eh hrter my e different nd the symmetri se where Pr{X k = } = 1 Σ. When the ssumption of hrter-wise independene is too strong, one n use Mrkov soure. Here, the proility of eh hrter depends upon the history, i.e., it depends on the l previous hrters for n l-th order Mrkov hin. Usully, only first order Mrkov hins re treted, where the proilities re given y Pr{X k = X k = }. We do not work with Mrkov hins ut with the following more generl model. Memoryless nd Mrkov soures re omprised in sttionry nd ergodi soures. An in-depth tretment n e found in [Kl02]. Let (Ω, F, µ) e proility spe for the rndom sequene {X k } k 1. The set Ω ontins ll possile reliztions of strings, tht is, Ω = {ω; ω = {ω k } k 1 nd ω k Σ for ll k}, (2.10) the σ-field F is generted y ll finite sets F m 1 = {ω 1 ω m ; ω k Σ for ll 1 k m}, (2.11)

28 BASIC PRINCIPLES OF ALGORITHM ANALYSIS nd µ is proility mesure on F. Let θ : Ω Ω e the shift opertor, i.e., θ(ω 1, ω 2, ω 3,...) = ω 2, ω 3,.... (2.12) A rndom sequene {X k } k 1 is lled sttionry if θ({x k } k 1 ) hs the sme distriution s the sequene itself, tht is, θ is mesure preserving: µ θ 1 = µ. A set I Ω is lled shift-invrint if θ 1 (I) = I. Let S e the set of shiftinvrint sets in Ω, then S is σ-field. A sttionry sequene is lled ergodi if the shift-invrint σ-field is trivil, i.e., for eh element I S either µ(i) = 0 or µ(i) = 1. Sttionry nd ergodi sequene re of interest euse the time nd spe mens re equl, i.e., we n find the expeted vlue of the X k y verging over the sequene: E[X k ] = n 1 n i=1 X i. One we hve fixed the proilisti model, we n nlyze the expeted time or spe omplexity using vrious methods from proility theory nd lso nlytil tools (n pproh lredy tken in [Knu98]). Before we emrk on this tsks, we introdue some dditionl properties needed nd relted results. Let {X k } k 1 e rndom sequene generted y sttionry nd ergodi soure. For n < m, let F m n e the σ-field generted y {X k } m k=n with 1 n m. The soure stisfies the mixing ondition if there exist positive onstnts 1, 2 R nd d N suh tht for ll 1 m m + d n nd for ll A F m 1, B Fn m+d, the inequlity 1 Pr {A}Pr {B} Pr {A B} 2 Pr {A}Pr {B} (2.13) holds. Note tht this model enompsses the memoryless model nd sttionry nd ergodi Mrkov hins. Under this model, the following limit (the Rény entropy of seond order) exists ( ln w Σn (Pr {w})2) r 2 = lim n 2n, (2.14) whih n e proven y using su-dditivity [Pit85]. The soure stisfies the strong α-mixing ondition if funtion α : N R exists with lim d α(d) = 0 suh tht, for ll 1 m m + d n nd for ll A F m 1, B F n m+d, the inequlity (1 α(d))pr {A}Pr {B} Pr {A B} (1 α(d))pr {A}Pr {B} (2.15) holds. We do not only onsider expeted omplexities in the verge-se nlysis of lgorithms or dt strutures, ut lso proilisti ounds, i.e., sttements out the proility tht ertin events our. In prtiulr, ehvior is witnessed with high proility (w.h.p.) when the proility grows s 1 o(1) with respet to the input size n, i.e., the proility pprohes one s n inreses. For rndom vriles depending on growing sequenes, we n lso define onvergene proilities. A rndom vrile X n onverges in proility to vlue x if for ny ǫ > 0 lim Pr { X n x < ǫ} = 1. (2.16) n

29 CHAPTER 2. PRELIMINARIES 19 A rndom vrile X n onverges lmost surely to vlue x if for ny ǫ > 0 { } lim Pr N sup X n x < ǫ n N = 1. (2.17) As n exmple nd for lter use, we reprodue some results regrding the mximl length of repeted sustring in string of length n. We repitulte here results from [Szp93] (see lso [Szp93]). Let H n denote the mximl length of repeted string with oth ourrenes strting in the first n hrters of sttionry nd ergodi sequene {X k } k 1. If the soure stisfies the mixing ondition (2.13) nd if r 2 > 0, then one n prove tht { Pr H n (1 + ǫ) lnn } n ǫ lnn, (2.18) r 2 for some onstnt nd ǫ > 0. Thus, H n is with high proility O(lnn). Indeed, even lmost sure onvergene n e proven: H n lim n lnn = 1 (.s.), (2.19) r 2 whih is vlid if the soure stisfies the strong α-mixing ondition nd dditionlly d=0 α(d) < holds. The ltter ondition is lwys fulfilled for sttionry nd ergodi Mrkov hins [Br86]. We hve just ounded the (string) height of suffix tree tht we will introdue in Setion Most Σ + -trees uilt from sets of rndom string generted y sttionry nd ergodi soure hve logrithmi height [Pit85]. 2.3 Bsi Dt Strutures We lredy desried dynmi tles in Setion In Chpter 4, we nlyze serh lgorithm on tries or PATRICIA trees. The index struture desried in Chpter 5 works for ll kinds of Σ + -trees, e.g., tries, PATRICIA trees, suffix trees, nd generlized suffix trees. We introdue ll these dt strutures in the following setion. Setion desries how to implement rnge queries in optiml time, whih we lso need in Chpter Tree-Bsed Dt Strutures for Text Indexing In this setion, we present some of the most ommon dt strutures for text indexing in terms of the Σ + -tree defined ove. These strutures re the sis of our lgorithms. We desrie here how the dt strutures re interpreted theoretilly nd lso give some si ide on how they n e implemented. There re mny implementtion spets for Σ + -trees depending on the pplition. Some pplitions require tht the prent of node is essile in onstnt time wheres others lwys trverse the tree from the root to the leves. Some si tehniques n e identified. In mny ses, the edges re not leled expliitly ut y using pointers into n underlying string or set of strings. This n e either strt nd n end pointer or strt

30 BASIC DATA STRUCTURES pointer nd length. In this wy, eh edge tkes onstnt spe while still llowing onstnt time ess to the lel. Furthermore, it my e possile to store the pointers impliitly (see, e.g., [GKS03]). Beuse we re deling with trees, it is lso ommon to store the edges (i.e., their prmeters) t the trget nodes. Another hoie onerning the implementtion is the orgniztion of rnhing. The hildren of node re ommonly essed y the first hrter of the orresponding edge lel. The most ommon tehniques re rrys, hshing, linked list, serh trees, or omintion of these. The fstest ess is possile using rrys indexed y the first hrter of the orresponding edge lel, whih tkes Σ pointers per node. The most spe effiient implementtion uses linked list, whih needs one pointer per hild. Here the ess time is proportionl to the numer of hildren or O( Σ ) in the worst se. Hshing nd serh trees provide solutions whih ompromise etween oth extremes. Asymptotilly, the solutions re ll equivlent euse we ssume onstnt size lphet. The simplest vrint of Σ + -tree is the trie, introdued y Brndis [Bri59] nd y Fredkin [Fre60]. A trie for prefix-free set of strings S is Σ + -tree T suh tht pfree(words(t )) = S, i.e., eh string in S is represented y lef, nd there re no other leves in T. The representtion of edges does not mtter symptotilly, ut the lssi trie hs tomi edges from the root to the inner nodes nd then omptifies the edges leding to the leves. An exmple is given in Figures 2.1() nd 2.1(). The ide of ompressing ll pths of trie leds to the PATRICIA tree introdued y Morrison [Mor68]. An exmple is shown in Figure 2.1(). A PATRICIA tree is essentilly simple representtion of the ompt Σ + -tree for the set of strings S. It is not unommon to tlk of ompt tries insted of PATRICIA trees. A suffix tree for string t is Σ + -tree tht ontins ll the sustrings of t. Formlly, suffix tree T for string t must stisfy words (T ) = sustrings (t). (2.20) Usully, one is only interested in the ompt suffix tree (CST) euse it hs liner size. In the following we will lwys refer to the ompt suffix tree simply y the suffix tree. The exmple of suffix tree in Figure 2.1(d) lso inludes dditionl edges (dshed rrows). These re lled suffix links. A suffix link points from node p to node q if pth(q) represents suffix of pth(p) in T. In prtiulr, pth(q) should e the longest proper suffix of pth(p). In suffix trees, usully only suffix links for inner nodes re dded. Historilly, suffix links were n invention to filitte liner-time onstrution of suffix trees [Wei73, MC76, Ukk95], ut there exist other pplitions for suffix links in suffix trees, e.g., omputing mthing sttistis [CL94] or finding tndem repets in liner time [GS04]. Another ommon feture of suffix trees visile in Figure 2.1(d) is the use of sentinel, in this se the hrter $. A sentinel is hrter tht does not our nywhere else in the string. By the ddition of suh unique hrter, there n e no nested suffixes, thus eh suffix is represented y lef in the suffix tree. If the leves re leled y the numer of the suffix of t tht they represent, then the suffix tree n e used to find ll ourrenes of pttern w. Strting from the root, we simply follow the edges whih spell out w. The leves of the sutree T w represent

31 e s e CHAPTER 2. PRELIMINARIES 21 r 3 4 r 4 2 l l ll 6 1 i d id 8 u l l u ll y 0 9 y s l l s 16 e ell 13 t o p k t o p k () A trie () A trie with ompressed lef edges r ll 4 $ 2 $ ppi$ 4 1 e id ssi 5 ppi$ 6 u 11 ssippi$ i ll p 0 9 y 12 9 i$ pi$ 10 mississippi$ 7 s 13 ell to k p i 13 si ppi$ ssippi$ ppi$ ssippi$ () A PATRICIA tree (d) A suffix tree Figure 2.1: Exmples of Σ + -trees representing the prefix-free set of stringser,ell, id,ull,uy,sell,stok,stop (-) nd suffix tree formississippi$ (d).

32 BASIC DATA STRUCTURES the ourrenes of w, i.e., if the lef p T w is leled i, then there is n ourrene of w strting t i. Hene, we n report l ourrenes of pttern w in text t in time O( w + l). The suffix tree thus solves the text indexing prolem in optiml time. The prolem is to preproess given text t to nswer the following queries: Given pttern w, report ll i suh tht t i,i+ w 1 is equl to w. As noted lredy, suffix trees n e onstruted in liner time [Wei73, MC76, Ukk95, Fr97] (Ukkonen s lgorithm is desried in Chpter 3). It is lso possile to uild suffix tree for set of strings, so-lled generlized suffix tree. A Σ + -tree T is generlized suffix tree for the set of strings S if words (T ) = t S sustrings (t). (2.21) The esiest wy to onstrut generlized suffix tree for set S is to rete new string u = t 1 $ 1 t 2 $ 2 t S $ S s the ontention of ll strings t i in S seprted y different sentinels $ i. The suffix tree for u is equivlent to the generlized suffix tree for S if leves re ut off fter eh sentinel. Note tht it is not neessry to use S different sentinels. We n use omprison funtion tht onsiders two sentinels to e different. The drwk is tht there might e two outgoing edges leled y the sentinel t single node whih must then e hndled in mthing nd retrievl. To onlude this setion, we mention tht there re lterntive dt strutures suh s the direted yli word grph (DAWG) [BBH + 87], the suffix rry [MM93], the suffix tus [Kär95], or the suffix vetor [MZS02]. The most importnt of these is the suffix rry, whih hs eome vile replement for the suffix tree [AKO04] Rnge Queries In the previous setion, we desried method to find ll l ourrenes of pttern w in string t y mthing the pttern in the suffix tree uilt from t nd reporting the indies stored t the leves s output. This lgorithm runs in time O( w + l) whih is liner in the size of the query pttern nd the numer of reported indies. We ll suh ehvior output sensitive. An lgorithm is output sensitive if it is optiml with respet to the input nd output, i.e., it hs running time tht is liner in the size of the input nd of the outputs. Another importnt prolem is the doument listing prolem. The prolem is to preproess given set of strings S = {d 1,...,d m } (lled douments) to nswer the following queries: Given pttern w, report ll i suh tht d i ontins w. Let T e generlized suffix tree for S where the leves re leled with the doument numers, i.e., lef p is leled with k if pth(p) is suffix of d k. The doument listing prolem n e solved y mthing the pttern w nd reporting ll different doument numers ourring t leves in T w. Unfortuntely, the numer of leves my e muh lrger thn the numer of reported douments. The simple lgorithm is not output sensitive. This nd other similr prolems n e solved optimlly y using onstnt time rnge queries [Mut02], whih we present next. The following rnge queries ll operte on rrys ontining integers. The gol is to preproess these rrys in liner time so tht the queries n e nswered in onstnt time.

33 CHAPTER 2. PRELIMINARIES 23 p p r q q RMC RMC () Nodes p nd q re found strting from the right-most hild (leled RMC ). () The node p eomes left hild of the new node r whih is inserted s right hild of the node q. Figure 2.2: One step in the liner-time lgorithm to uild the Crtesin tree. After inserting the node r, it eomes the new right-most hild. Oserve how no node on the pth from the old right-most hild to the node p n ever e on the right-most pth gin. A rnge minimum query (RM) for the rnge (i, j) on n rry A sks for the smllest index ontining the minimum vlue in the given rnge. Formlly, we seek the smllest index k with i k j nd A[k] = min i l j A[l]. A ounded vlue rnge query (BVR) for the rnge (i, j) with ound on rry A sks for the set of indies L in the given rnge where A ontins vlues less thn or equl to. The set is given y L = {l i l j nd A[l] }. A olored rnge query (CR) for the rnge (i, j) on rry A sks for the set of distint elements in the given rnge of A. These re given y C = {A[k] i k j}. RM queries n e nswered in onstnt time nd with liner proessing [GBT84]. The ide is sed on lowest ommon nestor (LCA) queries, whih n e nswered in onstnt time [HT84, SV88, BFC00, BFCP + 01]. We riefly desrie n lgorithm for RM queries long the lines of [BFC00]. The first step is to uild the Crtesin tree B for the rry A. Let A e indexed 1 through n. The Crtesin tree [Vui80] is inry tree. The root represents the smllest index i with the minimum element in A, the left sutree is Crtesin tree for the surry A[1...i 1], nd the right sutree is Crtesin tree for the surry A[n...i 1]. The Crtesin tree B n e uilt in liner time y itertively uilding the tree for the surry A[1...i]. At eh step, we keep pointer to the right-most node (the node we reh when tking only right edges). When the new vlue A[i] is dded, we move towrds the root until we find node p ontining smller vlue. If no suh node is found, the new vlue is the smllest so fr nd eomes the new root with the old root eing its left hild. Otherwise, let q e the right hild of p (if it exists). We dd new node r with the vlue A[i] in ple of q s the right hild of p nd mke q the left hild of r. Figure 2.2 shows sketh of this step. Beuse every trversed node ends up in the left sutree of r, no node is trversed twie in this wy. As result, the totl time to uild B is O(n).

34 BASIC DATA STRUCTURES In the seond step, we onstrut three new rrys from n Euler tour of the Crtesin tree. The first rry H of size 2n ontins the node depths enountered during the trversl, the seond rry E of size 2n ontins the indies of A stored t the nodes, nd the third rry M of size n stores n ourrene in E for eh index in A. The Crtesin tree n e disrded fter this step. Note tht suessive vlues in H differ y plus or minus one, thus we ll H ±1-rry. Note further tht, if k is the index of the minimum element in the rnge (i, j) of H, then A[E[k]] is the minimum element in the rnge (E[i], E[j]) of A. This llows nswering RM queries on A y RM queries on H. As lst step, we need to prepre RM queries on the ±1-rry H. We divide H into loks of length = log n 2. A douly indexed rry L ontins in L[i, j] the index of the minimum vlue in H[i...(i + 2 j 1)]. It hs size O(n) nd we n onstrut it y dynmi progrmming using the eqution { L[i, j 1] if H[L[i, j 1]] < H[L[i + 2 j 1 1, j 1]] L[i, j] = L[i + 2 j 1 1, j 1] otherwise. (2.22) The vlues L[i, 1] re omputed y simple sn of H. Beuse H is ±1-rry, we represent eh lok of size y n integer x etween zero nd 2 2 n. For eh pir of reltive offsets (k, l) we store in n rry B[x, k, l] the reltive offset of the smllest vlue. We hve 0 k, l <, thus B hs size O( n log 2 n). An dditionl rry P of size n is used to store the integer representtion of the loks. To find the index of the minimum element in H in the rnge (i, j) we first ompute the lok oundries i = i nd j = j. If we hve j i <, then we n find the index of the minimum element in H y looking up B[P[i ], i i, j i ]. Otherwise we use rry L to find the index of the minimum element in the rnge (i, j ) y ompring H[L[i, k]] nd H[L[j + 2 k 1 1, k]] where k is hosen suh tht 2 k i j 2 k+1. For the rnges (i, i ) nd (j, j), we look up the indies of the minim in the rry B y B[P[i ], i i, 1] nd B[P[j + 1], 0, j j ]. As result, we n nswer minimum rnge queries on A in onstnt time with the rrys H, E, M, L, B, P using O(n) spe. BVR queries on rry A n e nswered sed on RM queries [Mut02] in time O( L ) nd with liner preproessing. The ide is to suessively perform RM queries: Let k e the nswer to the RM query (i, j), then either A[k] >, in whih se we re finished with this surnge, or A[k], in whih se we ontinue y querying (i, k 1) nd (k + 1, j). Eh query either stops or yields new element of L. Thus, we get inry tree like query struture, the inner nodes orresponding to suessful queries nd the leves to unsuessful queries or empty intervls. The tree is inry, so we perform O( L ) queries in totl. Finlly, CR queries n lso e nswered in time O( C ) nd with liner preproessing [Mut02]. The solution is sed on BVR queries. A new rry R is onstruted with R[i] = j if j < i is the lrgest index with A[j] = A[i], nd with R[i] = 1 if no suh index exists. Thus, olored rnge query (i, j) on A n e nswered y ounded vlue rnge query (i, j) with ound i on R: The resulting indies in L re trnslted y simple look-up to C = {A[l] l L}. No vlue A[l] n our twie euse we would hve R[l ] = l > i for the seond ourrene l. We now turn k to our originl seletion prolem on the generlized suffix

35 CHAPTER 2. PRELIMINARIES 25 tree T. To hieve n optiml query time, we rete n rry A ontining the doument numers in the order they re enountered t the leves y depth first trversl. At the sme time, we store the index used for the left-most nd for the right-most lef in the rry t eh inner node. Assume we hve mthed pttern w in the tree nd let p e the next node in the tree with w suffixes(pth(p)). Let i e the index of the left-most nd j e the index of the right-most lef stored in T p. Thus, the doument numers of the leves in the sutree T w re those in the surry A[i j], nd we n retrieve the different doument numers y CR query on A in liner time in the numer of outputs. Chpter 5 ontins more pplitions of rnge queries. 2.4 Text Indexing Prolems An indexing prolem is defined y dtse nd query type. We del here with text indexing prolems, so the dtse is either single string t Σ or olletion of strings C Σ. A query type defines wht the eligile query ptterns re nd wht the output (in dependene on the input pttern) should e. It is possile to define prolems where the query is set of strings, ut here we re only interested in the nonil version where the query is single string lled the pttern possily ompnied y some dditionl prmeters. In ext pttern mthing, the output of query involves property on the set of ourrenes of pttern w in the dtse. We lredy got to know the text indexing prolem in Setion 2.3.1, where there is no dditionl property, i.e., the omplete set is to e reported. The solution given there n esily e extended for indexing set of strings. The prolem is then sometimes lso lled the ourrene listing prolem. The prolem is to preproess given olletion of texts S = {t 1,...,t n } to nswer the following queries: Given pttern w, report ll (i, j) suh tht t j i,i+ w 1 is equl to w. In Setion 2.3.2, we desried the doument listing prolem where the property is the set of doument identifiers i of those strings t i S tht ontin n ourrene of the pttern w. Besides ourrene nd doument listing, the third nonil type of prolem defined for olletion of strings S = {t 1,...,t n } is the ditionry indexing prolem. For query with the pttern w, the desired output is the set of identifiers i suh tht t i = w. For ext text indexing, optiml lgorithms for indexing single text hve een known for while. The extensions for indexing set of douments re often strightforwrd. The use of rnge queries rodens the possiilities even further (see [Mut02]). Chpters 4 nd 5 del with the prolem of pproximte text indexing. Approximte text indexing ses its output on the set of pproximte ourrenes. 1 An pproximte ourrene of pttern w in string t with respet to mximl distne k in ertin distne model d : Σ Σ is sustring t i,j with d(t i,j, w) k. The prmeter k 1 The use of pproximte here is not to e onfused with the use of pproximte in lot of other rnhes of omputer siene, suh s omplexity theory. In our ontext, pproximte hs welldefined mening nd is desired property of the solution. On the ontrry, for instne, n NP-hrd prolem is pproximle if we n find solution tht is within ounded rnge of the optimum vlue. Thus, the term pproximte hs negtive touh euse it is otherwise prtilly impossile to ompute the rel, optiml solution (unless P=NP).

36 RICE S INTEGRALS n either e prt of the input or prt of the prolem instne. The following sheme llows lssifition of the text indexing prolems in this work. Text Corpus This is either string t Σ ( Σ ) or olletion of strings S Σ ( P(Σ ) ). String Distne The distne is one of the stndrd distnes: edit distne ( edit ), Hmming distne ( hmm ), equlity ( ext, for ext indexing), or nother distne (e.g., omprison-sed distne). Distne Bound The ound n e given with the query ( in ), fixed numer ( k ), proportion with respet to the pttern size ( α ), or nother funtion of the input prmeters. Output Type: We n report ourrenes (doument numer, strt nd end point; o ), positions (doument numer nd strt point orresponding to suffix of the doument strting t the given position; pos ), or douments (doument numer only; do ). Identifiers of higher hierrhil strutures (suh s setions, hpters, volumes) re lso possile. Applition Rule: We define the suffix ( suff ), the prefix ( pref ), the sustring ( sustr ), nd the entire string rule ( ll ). A possile output ndidte is reported if the distne etween suffix (prefix, sustrings, the entire string) of the ndidte nd the pttern is elow the distne ound. These five tegories llow us to lssify eh text indexing prolem. In the nottion of our lssifition sheme, we denote the doument listing prolem y the quintuple P(Σ ) ext 0 do sustr. The stndrd edit distne text indexing prolem tht seeks ll positions where n pproximte ourrene of the pttern with no more thn k errors strts is desried y Σ edit k pos suff. In Chpter 4, we tret mong others the Hmming distne ditionry indexing prolem denoted y P(Σ ) hmm f(n) do ll where we seek ll strings in S mthing pttern with Hmming distne t most f(n) with n = S. 2.5 Rie s Integrls In Chpter 4, we nlyze the expeted numer of nodes visited y reursive serh proess. The losed formul tht we derive for the expeted running time is of the form n ( ) n ( 1) k f n,k. (2.23) k k=m The sum turns out to e of polynomil growth, so we witness n exponentil nelltion effet. The terms of the sum re exponentilly lrge nd lternte in sign. Sums of this type re intimtely onneted to tries nd other digitl serh trees. Similr prolems pper very often in their nlysis (see, e.g., [Szp00]). Suh sums hve lso een onsidered in generl setting y Szpnkowski [Szp88] nd Kirshenhofer [Kir96]. It is ommon tehnique to use Rie s integrls. The si ide is ptured in the following theorem.

37 CHAPTER 2. PRELIMINARIES 27 Theorem 2.2 (Rie s integrls). Let f(z) e n nlyti ontinution of f(k) = f k tht ontins the hlf line [m, ). Then n ( ( 1) k n )f k = ( 1)n n! f(z) dz, (2.24) k 2πı z(z 1) (z n) k=m C where C is positively oriented urve tht enirles [m, n] nd does not inlude ny of the integers 0, 1,...,m 1 or other singulrities of f(z). The formul lredy ppered in [Nör24] (Chpter 8, 1), see lso [Knu98, FS95, Szp00]. The proof is simple pplition of the Cuhy Residue Theorem. We riefly go over some definitions to repitulte the si ide. All of this is widely known (see, e.g., [Rud87]). A funtion is sid to e meromorphi in n open set Ω if it is nlyti in Ω \ A nd hs pole t every point in A, where A Ω is set with no limit points in Ω. In essene, meromorphi funtion hs only ountly mny poles nd is otherwise nlyti. For eh point z 0 A we n sutrt the pole from the meromorphi funtion f nd get funtion f(z) m k=1 whih is nlyti in the viinity of z 0. The term m k=1 k (z z 0 ) k, (2.25) k (z z 0 ) k is lled the priniple prt of f nd m is the order of the pole. The omplex vlue k is lled the residue of f t z 1 denoted y res[f(z), z = z 0 ] = 1. (2.26) Let C e simple losed urve in Ω enirling the poles 1,..., n A. The Cuhy Residue Theorem then sttes tht 1 n f(z)dz = res[f(z), z = k ], (2.27) 2πı C k=1 where C is trversed in mthemtilly positive sense (i.e., ounterlokwise). A rillint introdution into the tehnique of Rie s integrls is given in [FS95]. The kernel in (2.24) n lso e expressed in terms of Euler s Gmm nd Bet funtions (see, e.g., [Tem96]) y n! Γ(n + 1)Γ(z n) = = ( 1) n+1 Γ(n + 1)Γ( z) z(z 1) (z n) Γ(z + 1) Γ(n z + 1) = B(n + 1, z n) = ( 1) n+1 B(n + 1, z) (2.28) If the funtion f(z) in (2.24) hs moderte (polynomil) growth, the ontourintegrl n e repled y simple integrl long line R(z) = (usully = m ) n ( ) ( 1) k n f k = 1 k 2πı k=m +ı ı f( z)b(n + 1, z)dz, (2.29)

38 RICE S INTEGRALS euse, for fixed n, the kernel B(n + 1, z n) grows symptotilly s O(z n ) for z. Detils re provided in Setion To proeed further we either ound the integrl or sweep over some residues, very similr to the usge of the Mellin trnsform. The following symptoti expnsion (in the sense of Poinrè) of the rtio of Gmm funtions is very helpful nd lredy suggested in [Szp88]. Γ(z + α) Γ(z + β) k Γ(α β + 1) k!γ(α β + 1 k) B(α β+1) k (α)z k, (2.30) for ll onstnts α, β with β α N s z with rg(z) < π [TE51]. The B n () (x) re the generlized Bernoulli polynomils (see [Nör24], Chpter 6, 5, or [Tem96]). They re defined y the generting funtion ( ) t e xt e t = 1 k=0 B () k (x) tk k!, t < 2π. (2.31) They re multivrite polynomils in nd x of degree n. The first polynomils re B () 0 (x) = 1, B() 1 (x) = 2 + x, nd B() 2 (x) = x 2 (1+12x) 12. For the pproximtion of Γ(z) Γ(n+1) Γ(n+1+z) we hve to e reful euse z is not onstnt. Indeed, n hs to e regrded s onstnt when integrting over z rnging from ı to + ı. Fortuntely, the expnsion turns out to e uniform to ertin degree [Fie70]. In this lesser known form, the uniform symptoti expnsion is Γ(z + α) m Γ(z + β) = 1 Γ(1 + α β) zα β k! Γ(1 + α β k) B(1+α β) k (α)z k k=0 ( + O z α β m (1 + α β m ) (1 + α + α β ) m), (2.32) whih is uniformly vlid for rg(z + α) < π, (1 + α β )(1 + α + α β ) = o(z), nd β α N s z. Besides the lredy defined generlized Bernoulli polynomils, we lso enounter generlized Bernoulli numers defined y ( ) t e t = 1 k=0 B () t k k k!, t < 2π, (2.33) Bernoulli numers defined y t e t 1 = t k B k k! k=0, t < 2π, (2.34) nd Eulerin polynomils defined y 1 x 1 xe t(1 x) = k=0 A k (x) tk k! = k=0 t k k! k A k,l x l, l=0 t < lnx x 1. (2.35)

39 CHAPTER 2. PRELIMINARIES 29 See [Tem96] for the Bernoulli numers B () k nd B k. See [Com74] for the Eulerin polynomils A k (x), nd [GKP94] for the Eulerin numers A k,l. Rie s integrls re one of mny methods tht trnsfer disrete prolem into the relm of omplex nlysis. We do not tret (nor use) other methods, ut we mention here tht nother very powerful method is the Mellin trnsform. Indeed, there is n intimte onnetion (the Poisson-Mellin-Newton Cyle). We refer to [FR85, FGK + 94, FGD95, FS96] for detils. For our purposes, Rie s integrls re well suited. In prtiulr, the use of the Mellin trnsform requires n dditionl depoissoniztion step to regin the oeffiients [Szp00]. In Chpter 4, we trnsform our sum to line integrl, whih is very similr to the use of the Mellin trnsform.

40

41 Chpter 3 Liner Bidiretionl On-Line Constrution of Affix Trees In this hpter, we present liner-time lgorithm to onstrut ffix trees. An ffix tree for string t is omintion of the suffix tree for t nd the suffix tree for the reversed string t R. Besides exhiiting this inherently eutiful dulity struture, ffix trees llow serhing for pttern lternting the diretion of reding, i.e., y strting in the middle nd extending to oth sides. In ontrst, the suffix tree only llows to serh pttern reding from left to right. The lternting serhes llow to serh for ptterns of the kind uvu R more effiiently using n ffix tree. An pplition is, e.g., the serh for ertin RNA-relted ptterns [MP03], e.g., hirpin loops. A hirpin loop is the result of two onding omplementry piees of n RNA strnd. These n e found y serhing for ptterns of the form uvw, where w is the reversed omplementry to u. We egin this hpter y desriing the ffix tree dt struture nd giving the neessry definitions. We ontinue with liner-time lgorithm to uild the ffix tree on-line y reding string only in one diretion from left to right (or vie vers). Finlly, we desrie nd nlyze the most generl se of uilding the ffix tree online y ppending hrters on either side of the string. This se requires very detiled mortized nlysis, whih finishes the hpter. 3.1 Definitions nd Dt Strutures for Affix Trees The ffix tree is sed on ompleting the struture exhiited y the suffix links of the suffix tree. Before defining ffix trees, we investigte some properties of suffix links tht led nturlly to the question of whether nd how ffix trees n e uilt Bsi Properties of Suffix Links Rell our definition of suffix links from Setion A suffix link points from node p to node q if pth(q) is the longest proper suffix of pth(p). For every node p, there n e t most one node representing longest proper suffix, wheres node q n represent the shortest proper suffix of multiple other nodes. Beuse ε is suffix of every string nd is represented y the root, we n dd suffix link for every node. 31

42 DEFINITIONS AND DATA STRUCTURES FOR AFFIX TREES () () () (d) (e) Figure 3.1: Exmple trees sed on the string. Figure () shows the suffix tree for t =, Figure () shows the suffix link tree of the suffix tree for t =, Figure () shows the suffix tree for the reversed string t =, Figure (d) shows view of the ffix tree for t = foused on the suffix struture, nd Figure (e) shows the ffix tree foused on the prefix struture. Thus, the suffix links form tree struture themselves. For suffix link from p to q, let pth(p) = upth(q). We lel the suffix link with u R. This wy, if we hve two suffix links from p to q to r leled u nd v, we get pth(p) = u R pth(q), pth(q) = v R pth(r), nd pth(p) = (vu) R pth(r). Hene, we n red nd ontente suffix link lels from the root downwrd. This leds to the definition of the suffix link tree [GK97]. The suffix link tree T R of Σ + -tree T tht is ugmented with suffix links is defined s the tree tht is formed y the suffix links of T. Figure 3.1() shows the suffix tree for the string with the suffix links from toto the root s it would hve een onstruted, e.g., y Ukkonen s lgorithm. These links re leled with nd. We dded suffix links for the leves, too, from vi,,, nd to the root. These links re leled with,,,, nd. Figure 3.1() shows the resulting suffix link tree. There is dulity etween reversing the string nd tking the suffix link tree, proven in [GK97]. For ny string t, there is strong dulity for the tomi suffix tree AST(t) nd wek dulity for the (ompt) suffix tree CST(t). In prtiulr, we hve AST(t) R = AST(t R ) nd words(cst(t) R ) words(cst(t R )). For the tomi suffix tree, this n esily e seen s follows. Let t = t 1 t n nd let w e sustring of t. The string w is expliitly represented y node p in AST(t) with w = pth(p) = t i t i+1 t i+2 t j. Beuse AST(t) is tomi, ll edges hve length one nd ll sustrings of t hve n expliit lotion. Therefore, there is lso node q with pth(q) = t i+1 t i+2 t j nd suffix link leled t i links p to q. By indution, we find hin of suffix links with lels t i, t i+1, t i+2,...,t j. Thus, the string t j t j 1 t j 2 t i = w R is represented in AST(t) R. As result, ll sustrings of t R re represented in AST(t) R. Beuse AST(t R ) is defined y words(ast(t R )) = sustrings(t R ), this proves the dulity. While this ft out tomi suffix trees is interesting in its own, it only serves s n illustrtion here. The remining hpter is only onerned with ompt suffix trees. Heneforth, suffix tree

43 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES 33 refers only to the ompt version gin. For suffix trees, the ove onstrution does not work euse not ll sustrings of t re represented expliitly. The suffix link tree CST(t) R represents only suset of the words of CST(t R ). In ft, it represents ll sustrings tht hve n expliit lotion in CST(t). The following lemms give n intuition of the prt of sustrings(t) tht re represented in the suffix link tree CST(t) R. These results re ommon knowledge nd vritions pper throughout the literture. We formlize the results in suitle wy. Lemm 3.1 (Expliit lotions in suffix trees). In the suffix tree T for the string t, sustring u is represented y n inner node (its lotion is expliit) if nd only if it is right-rnhing sustring. It is represented y lef if nd only if it is non-nested suffix of t. Finlly, the sustring is represented y the root if nd only if it is the empty string. There re no other nodes in T. Proof. Clerly, the empty string is lwys represented y the root. If u words(t ) is non-nested suffix of t, no v Σ + nd w Σ exist suh tht t = wuv. If u were not represented y lef, there would hve to e node p suh tht pth(p) = uv with v Σ +. This would e ontrdition to u eing non-nested. Therefore, u is represented y lef. If w words(t ) is right-rnhing in t, there exist, Σ with nd w, w sustrings(t). The string w nd w re in words(t ) y the definition of suffix trees. Hene, there re nodes p nd q suh tht w is prefix of pth(p) nd w is prefix of pth(q). From the uniqueness of string lotions in Σ + -trees nd from it follows tht p q. Therefore, there must e rnhing node r suh tht r is n nestor of p nd q, nd pth(r) is prefix of pth(p) nd pth(q). Therefore, pth(r) must e prefix of w. If pth(r) w, there would e two edges leding out of r with lels strting with the sme hrter, whih would ontrdit T eing Σ + -tree. As result, pth(r) = w nd w is represented y node. Let p e node in T. If p is not the root (representing the empty string), y the omptness of T, it must e either lef or rnhing node. If p is lef, it represents sustring u = pth(p) of t. Suppose u is not suffix, then the string uw with w Σ + is in sustrings(t) nd onsequently in words(t ). There must e node q with pth pth(q) = uw, nd hene p nnot e lef euse it must e the prent of q. Suppose u is nested suffix. By definition, we hve u psustrings(t) prefixes(t). Agin there must e w Σ + with uw sustrings(t) leding to the sme ontrdition. As result, pth(p) must represent non-nested suffix. If p is rnhing inner node, there must e two nodes q nd r whih re hildren of p. Let w = pth(p). Beuse ll outgoing edges of p hve lels with different strt hrters, we hve pth(q) = wu nd pth(r) = wv for u, v Σ. Therefore, w, w re sustrings of t nd w is right-rnhing in t. A very importnt onsequene of the previous lemm nd Oservtion 2.1 is the following. Lemm 3.2 (Chin property of suffix links). Let p e node in the suffix tree T for the string t. The node p is either the root, the lef representing the shortest non-nested suffix, or p hs n tomi suffix link.

44 DEFINITIONS AND DATA STRUCTURES FOR AFFIX TREES Proof. Let p e node tht is not the root nd not the node representing the shortest non-nested suffix of t. Let u = pth(p). If p is lef, it represents non-nested suffix. Beuse p does not represent the shortest non-nested suffix, there is shorter non-nested suffix v suh tht v = u for some Σ. Sine v is non-nested suffix, there is node q representing v, nd there is n tomi suffix link from p to q. If p is not lef (nd not the root), it must e rnhing node. Therefore, u is right-rnhing in t. Let v Σ e proper suffix of u suh tht v = u for some Σ. By Oservtion 2.1, v is lso right-rnhing nd, y Lemm 3.1, represented y node q. Therefore, there is n tomi suffix link from p to q. From the ove lemms, we know tht the suffix link tree T R of suffix tree T for the string t represents suset of ll sustrings of t R, nmely the left-rnhing sustrings nd t R itself (see Figures 3.1() nd 3.1()). The ide of n ffix tree is to ugment suffix tree with nodes suh tht it represents ll suffixes of t nd tht its reverse represents ll suffixes of t R (respetively the prefixes of t) Affix Trees Affix trees n e defined s dul suffix trees. Formlly, n ffix tree T for string t is Σ + -tree with suffix links suh tht words (T ) = sustrings (t) words ( T R) = sustrings ( t R). nd Figures 3.1(d) nd 3.1(e) show the ffix tree for. Figure 3.1(d) fouses on the suffix-tree-like struture nd Figure 3.1(e) on the prefix-tree-like struture. The differenes to the suffix tree nd to the prefix tree in eh struture re the dditionl nodes tht re oxed (ompre (), (), (d), nd (e) in Figure 3.1). Note tht sustring w of t might hve two different lotions in the ffix tree T for t: the lotion of w in the suffix tree prt nd the lotion of w R in the prefix tree prt. If these lotions re impliit, then they re not represented y the sme node ut lie in two different edges. For exmple, the lotion of ppers in the suffix edge etween the nodes nd (Figure 3.1(d)) nd in the prefix edge etween the nodes nd (Figure 3.1(e)). The lotion of ppers in the suffix edge etween the nodesnd (Figure 3.1(d)) nd in the prefix edge etween the root nd (Figure 3.1(e)). Hving two different impliit lotions omplites the proess of mking lotion expliit (i.e., inserting node for the lotion). Both impliit lotions must e found for tht. To insert node (respetively in the prefix view) into the ffix tree shown in Figures 3.1(d) nd 3.1(e), the two edges (,) nd (root,) hve to e split. The ffix tree for string t represents two trees in one. It represents the suffix tree for t nd the suffix tree for t R. The ltter is lled the prefix tree for the string t. From Lemm 3.1, we lerned tht the ffix tree ontins ll nodes tht would e prt of the suffix tree nd ll nodes tht would e prt of the prefix tree. As n pplition of this lemm, one n lssify eh node whether it elongs to the suffix prt, the prefix prt, or oth prts. Thus, we define the following node type. A node is suffix node if it is suffix lef (it hs no outgoing suffix edges), if it is the root, or if it is

45 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES 35 rnhing node in the suffix tree (it hs t lest two outgoing suffix edges). A node is prefix node if it is prefix lef (it hs no outgoing prefix edges), if it is the root, or if it is rnhing node in the prefix tree (it hs t lest two outgoing prefix edges). The ttriute needs not e stored expliitly ut n e reonstruted on-line. Edges re either suffix or prefix edges. The suffix links re the (reversed) prefix edges nd vie vers. Therefore, ll edges need to e trversle in oth diretions. In the following, we distinguish etween the suffix struture nd the prefix struture of the ffix tree. Throughout the remining hpter, let ST(t) denote the suffix tree nd let AT(t) denote the ffix tree uilt from the string t, whih we ll the underlying string. The prefix tree n e denoted y ST(t R ) Additionl Dt for the Constrution of Affix Trees An essentil property of ffix trees is its inherent dulity. In the finl setions of this hpter, we show how to uild the ffix trees y expnding string in oth diretions. To support suh ehvior, we n implement string s pir of dynmi tles (see Setion 2.2.2). The first tle ontins the positive indies (strting t zero) nd the seond tle the negtive indies (strting t 1). When swithing the view (from suffix to prefix), simple trnsltion f : x 1 x lso swithes the indies nd llows to red the string in ompletely reversed mnner. Thus, on the implementtion level we numer strings differently, strting t zero. To distinguish this se, we denote the hrters of string t y t[i], where i rnges from zero to n 1. For the lgorithmi prt, we need to e le to keep trk of lotions in the suffix or ffix tree nd e le to reh these quikly. In Setion 2.1.2, we introdued virtul nodes s mens for referening impliit lotions. A referene pir is representtion of virtul node of suffix tree introdued in [Ukk95] for tht reson. For suffix tree T for the string t, referene pir is pir (p, (o, l)) where p is node (lled the se of the referene pir), nd (o, l) represents sustring u = t o,o+l 1. It orresponds to the virtul node (p, u). For ese of nottion, we lso denote the referene pir (p, (o, l)) y (p, u) knowing tht u is sustring of t nd n e represented y (o, l). To tully reh the lotion, we strt t the se nd move long edges whose lels represent the sme string s u. As mentioned ove, there my e two lotions of sustring of t in the ffix tree, one in the suffix struture nd one in the prefix struture. We therefore need to distinguish etween suffix referene pir, where the represented lotion is rehed y moving long suffix edges, nd prefix referene pir, where the represented lotion is rehed y moving long prefix edges. Coneptully, one n lso think of using omined referene pir for string in the ffix tree mde up of prefix nd suffix referene pir. For the onstrution of suffix trees, referene pir (p, u) is lled nonil if there exists no node q nd suffix v suffixes(u) suh tht pth(p) is proper prefix of pth(q) nd pth(p)u = pth(q)v. For the onstrution of ffix trees, we use suffix nd prefix referene pirs. For suffix referene pir (p, u), we require q to e suffix node, nd, for prefix referene pir (p, u), we require p to e prefix node. A suffix referene pir is nonil if there is no suffix node q nd string v suh tht pth(p) is proper prefix of pth(q) nd pth(p)u = pth(q)v. In other words, nonil referene pir is referene pir where the se node hs mximl depth.

46 DEFINITIONS AND DATA STRUCTURES FOR AFFIX TREES For exmple, in Figure 3.1 with the underlying string t = t[0] t[4] = (with t R = t[ 1] t[ 5] = ) the nonil prefix referene pir for is (root, ( 3, 2)), the nonil suffix referene pir for = () 1 is (, (3, 0)). A non-nonil suffix referene pir for is (root, (1, 2)). Similr to suffix trees, edges re implemented with indies into the underlying string t. For the effiient hndling of leves, we lso use open edges s introdued in [Ukk95]. The edges leding to leves re leled with the strting index nd n infinite end index (e.g., (i, )) tht represents the end of the underlying string. Suh n edge lwys extends to the end of the string. This n esily e hieved using single it (or y impliitly ssuming the end index for leves). For the on-line onstrution of ffix trees, we need to keep trk of some importnt lotions in the tree. Let α(t) e the longest nested suffix of the string t nd ˆα(t) the longest nested prefix of t. Oviously, we hve α(t) = (ˆα(t R )) R. The longest nested suffix is lled the tive suffix nd the longest nested prefix is lled the tive prefix. By Lemm 3.1, ll longer suffixes (respetively prefixes) re represented y leves. As with suffix trees, ll updting is done round the positions of the tive suffix nd the tive prefix. We therefore lwys keep the following dt for the onstrution of ompt ffix trees: A nonil suffix referene pir sp suffix (t) for the tive suffix point, the suffix lotion of the tive suffix. A nonil prefix referene pir sp prefix (t) for the tive suffix point, the prefix lotion of the tive suffix, whih is the tive prefix of the reverse string. A nonil prefix referene pir pp prefix (t) for the tive prefix point, the prefix lotion of the tive prefix. The tive prefix is the tive suffix of the reverse string, so it is the ounterprt to sp suffix (t) in the prefix view. A nonil suffix referene pir pp suffix (t) for the tive prefix point, the suffix lotion of the tive prefix. The tive suffix lef sl suffix (t), whih represents the shortest non-nested suffix of t. The tive prefix lef pl prefix (t), whih represents the shortest non-nested prefix of t. For the four referene pirs nd the two leves we lso keep the depth (the length of the represented string). This is needed in order to lulte the orret indies of edge lels in the implementtion of the lgorithm. Note tht sp prefix (t) should e red s the lotion of the tive suffix in the prefix view. In this wy, we fix view nd the definition of tive suffix nd tive prefix refers to the string t nd not to its reverse. When reversing the view, suffix nd prefix lso hve to e reversed, e.g., pp suffix (t) eomes sp prefix (t R ). As lredy mentioned efore, we n oneptully see sp suffix (t) nd sp prefix (t) s one omined referene pir for the tive suffix nd pp prefix (t) nd pp suffix (t) s one omined referene pir for the tive prefix.

47 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES 37 A non-nested suffix, represented y suffix lef in the suffix tree, is not represented y node in the prefix tree euse it ours only one s the prefix of t R. Hene, ll suffix leves must e hidden in prefix tree-like view of the ffix tree. By Lemm 3.2, the sl suffix is the only suffix node other thn the root tht my hve non-tomi suffix link. Therefore, ll suffix leves form hin with tomi suffix links euse, if lef is not the sl suffix, there is lef representing suffix tht is shorter y one hrter. This hin of suffix leves must e hidden in the prefix view. Fortuntely, the hin does not extend over the leves to other suffix nodes. The next lemm proves tht the prefix prent of sl suffix is lwys prefix node, too, nd is hene represented y node in the prefix tree. Theorem 3.3 (Prent of the tive suffix lef). The prefix (or suffix) prent of the tive suffix (respetively prefix) lef is lwys prefix (respetively suffix) node. Proof. The sl suffix (t) is the lef representing the shortest non-nested suffix. Let p e the prefix prent of sl suffix (t). Suppose p is not prefix node, then p must e suffix node nd w = pth(p) is right-rnhing in t. By the definition of suffix links, w is suffix of pth(sl suffix (t)) nd hene suffix of t. For w to e right-rnhing there must e t lest two ourrenes of w in t, one followed y Σ, the other y Σ (with ). Tking look t the preeding hrters, we therefore find tht xw suffixes (t) nd yw, zw sustrings (t) (3.1) with x, y, z Σ {ǫ} nd not oth y nd z re empty strings (otherwise = ). Note tht there n e no node q with u = pth(q) suh tht u is proper suffix of pth(sl suffix (t)) nd w is proper suffix of u. We sy tht suh node q lies etween p nd sl suffix (t). Beuse sl suffix (t) is represented y lef nd p is its diret suffix prent, this node would e etween these two, ontrditing the originl ssumptions. If y nd z re different hrters, then w is left-rnhing nd y Lemm 3.1 prefix node. Suppose the ontrry is true, then we n distinguish two ses: Either y nd z re equl, or one of the ourrenes of w is s prefix of t, i.e., y = ε or z = ε. Let us first ssume tht oth re equl nd let y = z = Σ. Either x is equl to or it is different. If we hd x =, then w would e right-rnhing nd represented y suffix node q lying etween p nd sl suffix (t) ontrdition to our ssumptions. If we hve x then w would e left-rnhing nd p is prefix node. Thus, we find ourselves in the seond se: One of y, z must e the empty string. Hene, w ours s prefix of t. Without loss of generlity, ssume tht z = ε. We then hve the sitution xw suffixes (t), yw sustrings (t), nd w prefixes (t). (3.2) If x y, then w is left-rnhing nd p is prefix node. Therefore, we must hve x = y = for Σ +. Tking look t the next position to the left of we find tht xw suffixes (t), yw sustrings (t), nd w prefixes (t) (3.3)

48 DEFINITIONS AND DATA STRUCTURES FOR AFFIX TREES for yet different x, y Σ {ǫ}. Agin we find tht, if x y, we hve left-rnhing string u tht is represented y prefix node lying etween p nd sl suffix (t). Thus, either x = y or y is the empty string (sine yw is loser to the left string order thn xw). Iterting this rgument leds to uw suffixes (t) nd uw, w prefixes(t), (3.4) for some string u Σ + (t some point the left order is rehed, y is the empty string, nd x is the hrter ). Therefore, w is prefix of uw. (Clerly, we hve u > 0 euse otherwise =.) It follows tht w is sustring of t. Now, let u = vd for some hrter d Σ, thus vdw is prefix of t, nd the sitution is w, dw sustrings (t) nd dw suffixes (t). (3.5) If = d, then dw is right-rnhing string represented y node q tht lies etween p nd sl suffix (t) ontrdition. Therefore, d, w is left-rnhing, nd p is prefix node. From Lemm 3.2 nd the ove theorem, we onlude tht ll nodes tht our in the prefix tree for t ut not in the suffix tree for t (so lled prefix only nodes) pper in hins with tomi edges limited y suffix nodes. As onsequene, we n luster the prefix nodes in the ffix tree tht would not e prt of the suffix tree. Let p e prefix node tht is not suffix node. The node p hs extly one suffix hild nd one suffix prent. If the suffix prent or the suffix hild of p is lso prefix only node, the edge length must e one. For the suffix struture, the prefix nodes n thus e lustered to pths, where the nodes re onneted y edges of length one. It suffies to know the numer of nodes in the pth to know the string length of the pth. Wht is n edge in the suffix tree eomes n edge leding to non-suffix node, pth of prefix nodes, nd finl edge leding gin to suffix node in the ffix tree. See Figures 3.1(d) nd 3.1(e) where the nodes re lredy in oxes orresponding to their pths. The pth struture hs the following properties: All non-suffix (or non-prefix) nodes in the suffix (or prefix) struture re lustered in pths. Aessing the i-th (in prtiulr the first nd the lst) node tkes onstnt time. The first nd the lst node of eh pth hold referene to the pth nd know the pth length. A pth n e split into two in onstnt time similr to n edge. (This is neessry, e.g., when prefix only node in pth eomes suffix node.) Nodes n e dynmilly ppended nd prepended to pth. With the first three properties, the ffix tree n e trversed in the sme symptoti time s the orresponding suffix tree nd, with the ltter two properties, the updting lso works in the sme symptoti time. The ffix tree n e trversed ignoring nodes of the reverse kind: If non-suffix node is hit while trversing the suffix prt, it must hve pointer to pth. The lst node of tht pth is gin prefix node with single suffix node s suffix hild. This is the key to hieving liner time ound in the onstrution of ffix trees.

49 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES Implementtion Issues We riefly tke loser look t the detils of the dt strutures. The edges need to e trversle in oth diretions. Therefore, we store strting nd ending point nd the indies (inluding flg to mrk open edges) in the underlying string t eh edge. To trverse the edges in oth diretions, we need to store the outgoing suffix nd prefix edges indexed y their strting hrter nd the suffix nd prefix prent edges t the nodes. The first nd the lst node of pth need to hve referene to the pth. A node nnot e prt of two pths, so eh node rries single pth pointer. The type of pth n e derived from the node type, whih in turn n e derived from the numer of hildren of eh type. The djeny list n e stored in vrious wys (trees, lists, rrys, hsh tles). The optiml storge depends on the pplition. For the nlysis, the lphet is ssumed to e fixed nd the only thing needed is proedure getedge(node, type, strting-hrter) tht returns the edge of the given type nd with the given hrter. The pths re stored s one or two dynmi rrys tht n grow with onstnt mortized ost per opertion (see Setion 2.2.2). The size of the pth is liner in the numer of nodes it ontins. Pths never need to e merged: Exept for the tive prefix lef, no nodes re ever deleted. Between the tive prefix lef nd the next prefix node loser to the root, there is lwys one suffix node (see Theorem 3.3), so the two lusters n never merge. An inner node tht is of suffix type represents right-rnhing sustring w of t. Regrdless of ppending or prepending to t, the sustring w lwys stys rnhing. To e le to split pths, we use pth pointers. A pth pointer hs referene to the pth, n index to the first node pointer, nd length. When pth is split, two pth pointers shre the sme dynmi rry struture. For exmple, the nodes,,, re in ommon pth in the ffix tree for t = (Figure 3.2()). After the ffix tree is extended to t = (Figure 3.2()), the nodes nd re suffix nodes nd no longer pth memers. The nodes nd now shre the sme dynmi rry, ut the rry does not need to grow euse the nodes,, nd will never e deleted, nd no other nodes n e inserted etween them. The size of n ffix tree depends on the wy the hildren of node re stored. Assuming onstnt size lphet Σ, it mkes no differene to the symptoti running time whether we use liner lists, hsh tles, or rrys. For smll lphet Σ (e.g., Σ = 5) n rry is proly the smllest nd fstest solution. In tht se, the size of n ffix tree for string of size n n e estimted s follows: Add the size of the two prent edges to the node to tht they led, so for eh node we ount the two prent edges (eh with two indies, one open edge flg, one type flg, nd two pointers), the two hildren rrys (eh Σ pointers), the two prent pointers (one pointer eh), pth pointer (two indies, one pointer), nd pth rry struture (two pointers to dynmi rrys, five indies for size, length nd referene ount, nd one pointer to node). The numer of nodes is t most 2n. The size of the underlying string (s two wy extensile dynmi rry struture) is t most 2n hrters. The size of the referene pirs does not ontriute symptotilly. An upper ound for the totl size of the ffix tree T is then size(t ) n(22 indies + 2 hrters + 8 flgs + ( Σ ) pointers). Assuming four yte mhine rhiteture (storing flgs in it of

50 CONSTRUCTION OF COMPACT SUFFIX TREES () () Figure 3.2: The ffix trees for nd for. some index or pointer, hrters in one yte, nd pointers nd indies in four ytes) nd four element lphet, the size is then t most 250n ytes. 3.2 Constrution of Compt Suffix Trees In order to prepre the onstrution of ffix trees, we riefly desrie how to onstrut suffix trees y expnding the underlying string one to the right nd one to the left. These prolems re solved y two lredy existing lgorithms, nmely Ukkonen s lgorithm [Ukk95] nd Weiner s lgorithm [Wei73]. We ssume some dditionl informtion in the ltter lgorithm redily ville when onstruting ffix trees; therefore, the desried lgorithm is not identil to Weiner s lgorithm (ut simpler). The results of this setion re due to Ukkonen nd Weiner. We replite them in detil here to prepre the presenttion of the ffix tree onstrution lgorithm On-Line Constrution of Suffix Trees Ukkonen s lgorithm onstruts suffix tree on-line from left to right, uilding ST(t) from ST(t). The lgorithm n intuitively e derived from n lgorithm tht onstruts tomi suffix trees. It essentilly works y lengthening ll suffixes represented in ST(t) y the hrter nd inserting the empty suffix. Oviously, the resulting tree is ST(t). To nlyze the lgorithm the suffixes re divided into three different types: Type 1. Non-nested suffixes of t. Type 2. Nested suffixes s of t, where s is sustring of t.

51 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES 41 Type 3. Nested suffixes s of t, where s is not sustring of t. We ll s relevnt suffix. To extend t y the hrter, we del with eh of these sets of suffixes: Cse 1. Cse 2. Cse 3. For ll suffixes s of t of Type 1, s is non-nested in t (otherwise s would our elsewhere in t nd s were nested). The suffix s ws represented y lef (Lemm 3.1) in ST(t) nd s will e represented y lef in ST(t). With the use of open edges for leves, ll leves will extend utomtilly to the end of the string. Nothing needs to e done for these suffixes. For ll suffixes s of t of Type 2, s is sustring of t. Hene, s is nested suffix of t. Nothing needs to e done for these suffixes. For ll suffixes s of t of Type 3, s is non-nested suffix nd must thus e represented y lef in ST(t). The suffix s is not represented y lef in ST(t). If s is relevnt suffix (Type 3), then s is nested suffix of t. Therefore, s must e proper suffix of t nd s ours s sustring elsewhere in t. The extended suffix s is non-nested in t, so s nnot our in t. Therefore, there is sustring s in t with. Thus, s, s sustrings(t) nd s is right-rnhing in t. To updte ST(t) with the relevnt suffix s, node s must e inserted (unless s ws right-rnhing in t nd s lredy exists) nd new lef s must e dded t the (possily new) node s. Lemm 3.4. Every relevnt suffix s is suffix of α(t), nd α(t) is suffix of every relevnt suffix. Proof. The tive suffix α(t) is the longest nested suffix of t. If s is relevnt suffix, s is nested suffix of t. Hene, s is suffix of α(t) nd s suffix of α(t). The new tive suffix α(t) nnot e longer thn α(t). Suppose otherwise tht it would e longer, then α(t) = vα(t) for some v Σ +. Beuse the new tive suffix α(t) is nested in the new string t, we hve vα(t) psustrings(t) prefixes(t). It follows tht vα(t) psustrings(t) prefixes(t) nd vα(t) suffixes(t). The suffix vα(t) would e nested nd longer thn α(t). This ontrdits the definition of the tive suffix. Both, α(t) nd α(t) re suffixes of t, therefore α(t) is suffix of α(t). Beuse relevnt suffix s is non-nested in t, it is longer thn α(t). Hene, α(t) is suffix of s. Corollry 3.5. The new tive suffix α(t) is suffix of α(t). Let α(t) = t in,n nd let α(t) = t in+1,n. Let us denote the relevnt suffixes y s k = t in+k,n with k {0,...,i n+1 i n 1} (if i n = i n+1, there is no relevnt suffix). The (nonil) referene pir sp suffix (t) = (p, (o, l)) represents the tive suffix α(t) suh tht α(t) = pth(p)t o,o+l 1 (here, o + l 1 = n). Therefore, p represents the prefix t in,o 1 of α(t). For ll relevnt suffixes, we n reh the lotion of s k with sp suffix (t) s follows. The first relevnt suffix is immeditely ville euse we

52 CONSTRUCTION OF COMPACT SUFFIX TREES Proedure 3.3 nonize(referene pir (p, (o, l))) Prmeters: (p, (o, l)) is referene pir tht is to e nonized. 1: if l 0 then {Otherwise the referene pir is lredy nonil.} 2: e := getedge(p, SUFFIX, t o ); 3: k := getlength(e); 4: while k l do 5: o := o + k; 6: l := l k; 7: p := gettrgetnode(e); 8: e := getedge(p, SUFFIX, t o ); 9: k := getlength(e); hve s 0 = pth(p)t o,o+l 1 (with n = o + l 1). We reh the other lotions y modifying the referene pir. Let (p k, (o k, l k )) e the nonil referene pir 1 for the lotion of the relevnt suffix s k. To reh the lotion of s k+1 from the referene pir (p k, (o k, l k )), we hve two ses. If p k is not the root, it hs suffix link to node q suh tht pth(p k ) = t in+kpth(q) nd s k+1 = pth(q)t ok,o k +l k 1 = t in+k+1,n. The new se is nonized (see Proedure 3.3), whih hnges the se to p k+1 nd the offset nd length to o k+1 nd l k+1. Otherwise, if p k is the root, then s k = t ok,o k +l k 1, nd we n get to the lotion of s k+1 y inresing the offset o nd deresing the length l euse we hve s k+1 = t ok +1,(o k +1)+(l k 1) 1. Agin we invoke the proedure nonize(). The relevnt suffixes re thus rehed y strting t sp suffix (t) nd itertively moving over suffix links nd nonizing. It n esily e determined whether we hve rehed non-relevnt suffix (Type 2) y heking whether the urrent referene pir sp suffix n e extended y. If this is the se, the urrent represented suffix is nested, nd the urrent referene pir is sp suffix (t). The suffix links for the new nodes n e set during the proess or fterwrds if we store eh node p k inserted (or found) during the hndling of s k on stk. Node p k from the k-th itertion of the while loop is linked to the node p k+1 of the following itertion. The lst node p in+1 i n 1 gets suffix link to n lredy existing node (lled the end point), whih represents string one hrter shorter thn α(t) nd is either the se of sl suffix (t) or one edge elow the se. By Lemm 3.2, the end point lwys exists. Proedure 3.3 tkes referene pir (p, (o, l)) nd hnges the se, the offset, nd the length so tht the resulting referene pir represents the sme string nd is nonil. The proess is lled nonizing referene pir. This is done y following edges tht represent suffixes of the string prt of the referene pir down the tree until the node is losest to the represented lotion. Lemm 3.6. The mount of work performed during ll lls to nonize() with the referene pir sp suffix (t) is O( t ). 1 The referene pirs (p k, (o k, l k )) do not exist on their own ut re stges rehed y the referene pir sp suffix (t) efore it eomes sp suffix (t).

53 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES 43 Proof. Let n = t. The mount of work performed during nonizing the referene pir sp suffix (t) n e ounded s follows. Eh time hrter is ppended to t, oin is put on the hrter. The referene pir sp suffix (t) onsists of se p, strt index o into t, nd length l. The string prt of the referene pir is t o,n leding to the impliit lotion. Eh time the se is moved long n edge e down the tree, the length l is deresed nd the index o is inresed y the length k of the edge. We tke the oin from hrter t o to py for the step. Beuse o is never deresed, there is lwys oin on t o. As result, nonizing tkes mortized onstnt time per ll. Theorem 3.7. Ukkonen s lgorithm onstruts ST(t) on-line in time O( t ). Proof. Let n = t. The only prt with non-onstnt running time per itertion is the insertion of the relevnt suffixes. The numer of nodes of ST(t) is t most 2n, the numer of leves is t most n (there re t most n leves, so there nnot e more thn n rnhing nodes). Eh time (ut one per itertion) tht suffix link is tken, t lest one lef is inserted. The numer of links tken is therefore less thn n. The mount of work performed during nonizing the referene pir sp suffix (t) is O(n) y Lemm 3.6. See Figure 3.4 whih shows the intermedite steps of the on-line onstrution of ST() from ST(). The referene pir strts out with the se nd the remining string, e.g., s (, (6, 2)) (the indies re implementtion indies strting t zero in this exmple). After the insertion of the new lef etween () nd (), the suffix link fromto the root is tken. The edge strting withhs length six, nd thus no nonizing tkes ple. After the insertion of the seond lef etween () nd (d), the offset is inresed, the length deresed (insted of tking suffix link). The resulting referene pir is (root, (5, 1)) with remining string. The referene pir is then nonized to (, (6, 0)), where it n e extended to (, (6, 1)) with the remining string Anti-On-Line Suffix Tree Constrution with Additionl Informtion Anti-on-line onstrution uilds ST(t) from ST(t). The tree ST(t) lredy represents ll suffixes of ST(t) exept t. Only the suffix t needs to e inserted. The lef for t will rnh t the lotion representing ˆα(t) in ST(t). We ssume tht we hve the dditionl informtion of knowing the length of ˆα(t) ville in eh itertion of the following lgorithm. With this, we do not need the inditor vetor used y Weiner [Wei73] to find the length of the new tive prefix. There re three possiilities. The node lotion my e represented y n inner node, the lotion is impliit in some edge, or the lotion is represented y lef. In the lst se, the new lef does not dd new rnh to the tree ut extends n old lef tht now represents nested suffix. The node representing the old lef needs to e deleted fter tthing the new lef to it (it represents nested suffix now, see Lemm 3.1). Lemm 3.8. The new tive prefix ˆα(t) is prefix of ˆα(t).

54 CONSTRUCTION OF COMPACT SUFFIX TREES tive suffix suffix se () tive suffix suffix se () tive suffix suffix se () tive suffix suffix se (d) tive suffix suffix se (e) Figure 3.4: Steps of onstruting ST() from ST(). The tive suffix (s represented y sp suffix ) is shown ove eh tree. The sustring mrked s suffix se is the prt of the tive suffix tht is represented y the se of the (nonil) referene pir. The offset index of the referene pir points to the hrter one right of the sustring represented y the se. Figure () is the strting point. In Figure () the open edges hve grown with the text. Figure () shows the tree fter the first relevnt suffix hs een inserted nd the se of the referene pir hs een moved over suffix link. Figure (d) shows the tree fter the seond relevnt suffix hs een inserted, the referene pir hs een shortened t the front, nonized to, nd ws enlrged y the new hrter. Figure (e) shows the finl tree with the updted suffix links.

55 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES 45 Proedure 3.5 denonize(referene pir (p, (o, l)), hrter ) Prmeters: (p, (o, l)) is referene pir tht is to e denonized, is the hrter for tht prefix edge is to e found. 1: e := getedge(p, PREFIX, ); {If there is no edge,then e is nil.} 2: while e = nil or getlength(e) 1 do 3: f := getprent(p, SUFFIX); 4: if f = nil then {If p hs no prent, it is the root. The index o should then point to the eginning of the old string. Mke it point to the eginning of the new string.} 5: o := o 1; 6: l := l + 1; 7: return; 8: else 9: k := getlength(f); 10: o := o k; 11: l := l + k; 12: p := getsourenode(f); 13: e := getedge(p, PREFIX, ); 14: p := gettrgetnode(e); Proof. Oviously, oth strings re prefixes of t. Either ˆα(t) is prefix of ˆα(t), or ˆα(t) is proper prefix of ˆα(t). If ˆα(t) were proper prefix of ˆα(t), then we would hve ˆα(t) = ˆα(t)v for some v Σ +. Hene, ˆα(t)v prefixes(t) nd ˆα(t)v psustrings(t) suffixes(t), whih ontrdits the definition of the tive prefix euse ˆα(t)v would e longer ndidte. Let t = t n t 1 nd t = t n 1 t n t 1 (i.e., = t n 1 ). For the purpose of onstruting the suffix tree ST(t) from the suffix tree ST(t), we keep nonil referene pir pp suffix (t) for the tive prefix ˆα(t). Let r n+1 = (p n+1, v n+1 ) e nonil referene pir for ˆα(t) = t 1 n,in+1 in ST(t) (the lotion of the new tive prefix in the urrent suffix tree). The se p n+1 of r n+1 hs suffix link to node p n+1 (Lemm 3.2). Hene, r = (p n+1, v n+1) is referene pir for t n,in+1, whih is prefix of ˆα(t) (Lemm 3.8). Let r n = (p n, v n ) e the referene pir pp suffix (t). There exists pth π = (p n+1,...,p n) from p n+1 to p n in ST(t) (π might hve length 0). We lim tht for ny node s on the pth π, if s p n+1 nd s p n, then there is no node q suh tht there is suffix link leled from q to s. Assume otherwise tht suh node q existed. Beuse ˆα(t) is prefix of t, pth(s) is lso prefix of t nd pth(q) = pth(s) is prefix of t. On the other hnd, q eing lef proves the existene of pth(q) suffixes(t) nd q eing n inner node proves the existene of pth(q) prefixes(t) psustrings(t). Hene, we hve pth(q) psustrings(t) suffixes(t). Therefore, pth(q) is nested-prefix of t nd thus prefix of ˆα(t). Beuse depth(q) = depth(s) + 1 > depth(p n+1 ), this is ontrdition to r n+1 = (p n+1, v n+1 ) eing nonil referene pir for ˆα(t). As result, we hve unique method to find pp suffix (t) from pp suffix (t) y

56 CONSTRUCTION OF COMPACT SUFFIX TREES Proedure 3.6 updte-new-suffix(referene pir (p, (o, l)), hrter, int k) Prmeters: k is the new length of the referene pir fter it hs een denonized. 1: denonize((p, (o, l)), ); {We rememer node r t depth k 1 during nonizing.} 2: l := k depth(q) 3: if l = 0 then 4: q := p; 5: else 6: insert node q t the lotion of (p, (o, l)) y splitting the edge with strting hrter t o 7: insert new lef r linked to q y n edge with string lel (o, n + 1) 8: if q ws lef then 9: eliminte q 10: else 11: dd suffix link from q to r denonizing pp suffix (t) = (p, (i, l)). We strt t the se p nd wlk up towrds the root until prefix edge (i.e., reverse suffix link) e leled with is found. For eh edge on the wy up, we sutrt its length from i. Finlly, we reple p y the trget q of the prefix edge e (i.e., the soure of the suffix link). With the knowledge of the length of ˆα(t), we n set the length l to ˆα(t) depth(q) nd get nonil referene pir for pp suffix (t). If new node q is dded for pp suffix (t) to insert the lef representing the suffix t, we hve to set its suffix link to node r. By Lemm 3.8, we hve depth(r) ˆα(t). At the eginning of the itertion, we hve the nonil referene pir r n = (p n, v n ) for ˆα(t). Beuse r n is nonil, we know tht depth(r) depth(p n ). On the other hnd, we inserted new node q ove p n+1, thus we hve depth(r) = depth(q) 1 > depth(p n+1 ) 1 = depth(p n+1 ). Hene, we enounter r on the pth π while serhing the suffix link leled. We know eforehnd tht depth(r) = ˆα(t) 1, so r is lso esily identified. Proedure 3.6 gives the lgorithm in pseudo ode nd Proedure 3.5 the detils of denonizing. Lemm 3.9. The mount of work performed during ll lls to denonize() with the referene pir pp suffix (t) is O( t ). Proof. We pply mortized nlysis s follows: Eh time hrter is prepended to t oin is put on the hrter. As desried ove, for eh edge the se is moved upwrds, the index o of the string prt of the referene pir is deresed. We tke the oin on t o to py for the step. The offset o stys unhnged when the se is repled or when the length is djusted. If denonize() moves the se of pp suffix (t) = (p, (o, l)) to lef, l must e zero (otherwise there wouldn t e lotion for pp suffix (t)). The lef is deleted when tthing the new suffix to it nd the se of pp suffix (t) is moved up towrds the next node, deresing the offset o. Therefore, o never inresed nd there is lwys oin on t o. As result, denonizing hs

57 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES 47 mortized onstnt omplexity. Theorem With the dditionl informtion of knowing the length of ˆα(s) for ny suffix s of t efore inserting it, it tkes time O( t ) to onstrut ST(t) in n nti-on-line mnner, i.e., reding t in reverse diretion from right to left. Proof. Inserting the lef, inserting n inner rnhing node, or deleting n old lef tkes onstnt time per step. The non-onstnt prt of eh itertion is the denonizing of pp suffix (t). By Lemm 3.9, the work n e done in mortized onstnt time. See Figure 3.7, where the onstrution of ST() from ST() is shown. If we do not hve prefix links for leves, the referene pir for pp suffix (t) is denonized y moving the se from over to the root, where the offset is deresed y one. The length is then djusted to 2. The lef will e deleted nd, therefore, there is no need to nonize to. If we hve prefix links for leves (not drwn in Figure 3.7), there is prefix edge leledfromto, whih is found fter moving the se up to. Before deleting the node, the se of the referene pir needs to e moved k up to the root. 3.3 Construting Compt Affix Trees On-Line This setion desries the liner-time unidiretionl onstrution of ffix trees. We emphsize the derivtion from merging the two lgorithms of the previous setion. The lgorithm is est understood s interleving the steps of these lgorithms together with the use of pths so tht the two lgorithms never notie eh other Overview The lgorithm is dul y design. The neessry steps to onstrut AT(t) from AT(t) re the sme s for onstruting AT(t) from AT(t), we only hve to exhnge suffix nd prefix in the desription nd swp the referene pirs. For the ske of lrity, we ssume the view of onstruting AT(t) from AT(t). We ll this suffix itertion, wheres onstruting AT(t) from AT(t) is lled prefix itertion. The ffix tree for t onsists of the suffix struture tht is equivlent to the suffix tree ST(t) nd the prefix struture tht is equivlent to the prefix tree ST(t R ). To rete the ffix tree AT(t) from the ffix tree AT(t), we pply the lgorithm desried in Setion (Ukkonen) to the suffix struture nd pply the lgorithm desried in Setion (Weiner ) to the prefix struture. We need to tke preution so tht one lgorithm does not distur the other when they re merged together (the lgorithms need to see the orret view). Additionlly, the pth strutures must e updted orretly. To hieve liner running time, we tret pths s edges. Funtion 3.8 shows how prefix nodes lustered in pth n e treted s single edge (in onstnt time). All other funtions on edges n e implemented for pths in similr wy with onstnt running times.

58 CONSTRUCTING COMPACT AFFIX TREES ON-LINE tive prefix suffix se () tive prefix suffix se () tive prefix suffix se () tive prefix suffix se (d) Figure 3.7: Steps of onstruting ST() from ST(). The tive prefix (s represented y pp suffix ) is shown ove eh tree. The sustring mrked s prefix se is the prt of the tive prefix tht is represented y the se of the (nonil) referene pir. The offset index of the referene pir points to the hrter one right of the sustring represented y the se. Figure () is the strting point. In Figure (), the pp suffix hs een denonized nd the lotion of ˆα() hs een found. Figure () shows the tree with the new suffix inserted. It ws dded t lef, whih is then deleted to give the finl tree in Figure (d).

59 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES 49 Funtion 3.8 gettrgetnodevirtuledge(edge e) returns node Prmeters: e is suffix edge tht possily ends in prefix node. 1: p := gettrgetnode(e); 2: if p hs only one suffix hild then {Node p is prefix node.} 3: π = getpth(p); {Node p is ontined in pth π.} 4: q = getlst(π); {Node q is the lst node of pth π.} 5: f = getonlyedge(q, SUFFIX); {Node q is prefix node (in π) nd hs only one suffix hild.} 6: r = gettrgetnode(f); 7: return(r); 8: else 9: return(p); {There is no pth for suffix nodes.} Eh itertion (whih onsists of ll neessry updting steps to ppend hrter to t) of the lgorithms onsists of the following steps: Step 1. Let p e the prefix prent of sl suffix (t). Remove the prefix edge tht leds from p to sl suffix (t). Rememer the node p. Step 2. Lengthen the text, therey utomtilly extending ll open edges. Step 3. Insert prefix node for t s suffix prent of t. Step 4. Insert the relevnt suffixes nd updte the suffix links using sp suffix (t). Restore the prefix struture to inlude the prefix t. Step 5. Denonize sp prefix (t) to find sp prefix (t), mke its lotion expliit, nd dd prefix edge to the urrent sl suffix. If the lotion of sp prefix (t) is prefix lef, then the lef must e deleted euse it hs eome nonrnhing inner node in this step Detiled Desription Step 1 is neessry euse the suffix edge is n open edge nd utomtilly grows when the string is enlrged, while the prefix edge is not n open edge. After this step, the prefix struture is still omplete exept for the prefix for t, whih is no longer represented. The step n esily e implemented in onstnt time. The underlying string n e implemented s dynmilly growing rry (see Setion 2.2.2). When hrter is ppended in Step 2, ll open edges nd thus ll suffixes of Type 1 (see Setion 3.2.1) grow utomtilly. The prefix link of the tive suffix lef ws deleted in the previous step euse otherwise depth(sl suffix ) would e different in the prefix nd in the suffix prt of the ffix tree. Step 3 reinserts the prefix node for t (ut without linking it in the prefix struture yet). The prefix edge will e inserted in the next step. We n find the orret lotion to insert the new node esily if we keep referene to the node tht represents the omplete string (the end node). The node will never hnge nd utomtilly grows

60 CONSTRUCTING COMPACT AFFIX TREES ON-LINE y its open edge. It n never e deleted euse in Step 5 leves re only deleted if they eome nested. The new prefix node for t is not suffix node (it nnot e rnhing nd it is not lef) nd is therefore dded to pth. If t is not the only prefix lef representing proper non-nested prefix of the omplete string, there is prefix lef r representing t 0,n 1 nd the new lef n e dded to r s pth. Step 4 is silly the sme s the lgorithm desried in Setion For eh new suffix lef, prefix edge to the urrent sl suffix is inserted nd the new lef eomes the sl suffix. Although the lotion referened y sp suffix my e impliit in the suffix struture, eh time new node would e inserted for suffix tree, there n e n existing node p tht represents left-rnhing sustring of t nd elongs to the prefix struture. In this se, the pth tht ontins p must e split nd p must thus e mde visile in the suffix struture. This orresponds to splitting n edge nd n e implemented in onstnt time nlogously to Funtion 3.8. The newly inserted inner nodes must e threded in the prefix struture whih orresponds to setting their suffix links. By Lemm 3.2, there is lwys n end point (the suffix prent to the lst node where lef ws dded), whih n e used s strting point. If new nodes were inserted, these elong to the suffix struture only nd the orresponding pths must e dded or hnged. The proedure updte-relevnt-suffixes(sp suffix, ) is modified to put the trversed nodes on stk, nd their suffix links re set in reverse order while the pths re ugmented t the sme time. By trversing the nodes in reverse order, we n ssure tht we re le to grow existing pths. When setting the suffix links nd dding nodes to pths, we must hek whether old prefix edges etween these nodes existed nd if so remove them. This n e done while the nodes re put on the stk. The next two lemms give some hints on how to reonstrut the prefix struture. Lemm Let p e (new) rnhing node to tht relevnt suffix is tthed. Then p is prefix prent or prefix nestor of the prefix lef representing t. Proof. Let p e s in the prerequisite. Then p represents string w = pth(p) tht ws nested suffix of t, ut w is not nested in t. Beuse w is suffix of t, there is prefix pth from p to t. Thus, ll inserted nodes represent suffixes of t. Lemm The string pth(p) represented y the node p rememered in Step 1 is suffix of α(t). Proof. The node p is the prefix prent of sl suffix (t). Therefore, pth(p) is suffix of pth(sl suffix (t)). Beuse pth(sl suffix (t)) is suffix of t, so is pth(p). The node p is either n inner prefix node or n inner suffix node. In oth ses, it represents nested sustring of t (y Lemm 3.1). We hve pth(p) α(t) euse α(t) is the lrgest nested suffix of t. Hene, pth(p) is suffix of α(t). Let q e the first node tht ws either inserted or tht resulted from split pth (in whih se q ws prefix ut no suffix node in AT(t)). Let p e the node rememered in Step 1. Note tht ll nodes inserted fter node q represent suffixes of pth(q),

61 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES 51 nd pth(q) is suffix of t. Furthermore, pth(p) is nested in t nd thus shorter thn pth(q), whih is lso suffix of t. Beuse p ws the diret prefix prent of sl suffix (t), there is no other suffix of t represented y node in AT(t). It follows tht q (representing α(t)) is either desendnt of p or equls p nd tht we hve thus lredy inserted hin of suffix links from q to p. Hene, we insert prefix edge from q to the node inserted for t in Step 3. If no relevnt suffixes were inserted, we dd prefix edge from node p to the prefix node for t. While the prefix t ws removed from the prefix struture in Step 1, it is now inserted gin. The newly inserted inner nodes re in the suffix struture only nd re hidden in pths. After this step, the prefix struture orresponds to ST(t R ) gin. Step 5 inserts the prefix t nd silly orresponds to the lgorithm desried in Setion The suffix lotion is esily found with sp suffix (t) nd so the node n e inserted in the suffix struture, too (possily in pth). The length of ˆα(t) is the sme s the length of α(t) nd the informtion is tken from the previous step. If new prefix node is inserted, it my either e dded to pth on its own or to the pth of its suffix prent. If the lotion is prefix lef, it must e removed from the pth tht ontins ll the leves efore it is deleted. All suffix leves re suffixes of t. They re onneted y prefix edges leled with the first hrters of t. For exmple, the lrgest suffix is t 0 t n, the seond lrgest suffix is t 1 t n. They re onneted y prefix edge leled with t 0. Hene, the prefix edges onneting the k + 1 lrgest suffix leves desrie the prefix t 0 t k. All leves ut the lef for t re in pth nd not visile in the prefix tree. They re the end of the lrgest prefix t. Linking sl suffix, the shortest suffix lef, to the lotion sp prefix (t) is equivlent to inserting new prefix lef for t. Hene, Step 5 orretly updtes the prefix struture of the ffix tree nd the resulting tree is AT(t) An Exmple Itertion Figures 3.9 nd 3.10 show n exmple itertion of uilding AT() from AT(). Figure 3.9 shows the suffix view, nd Figure 3.10 shows the prefix view. The trees orrespond to eh other. Figure () shows the ffix tree for the string. After Step 1 the prefix link of sl suffix (in the suffix view) to p (in the suffix view ) is removed. The prefix for t = is no longer represented nymore, whih is shown in Figure (). Figure () shows the tree fter the string hs grown in Step 2. As the hrter ws ppended to the right, this hs no effet on the prefix struture while ll suffix leves grow y the open edges. Step 3 reinserts the node for the prefix lotion of t. The node is dded to the pth tht ontins the prefix leves s shown in Figure (d). The first relevnt suffix is inserted resulting in Figure (e). The rnhing node q = would hve een inserted in suffix tree onstrution. In the ffix tree, the node lredy exists nd is only mde suffix node y removing it from the pth. This hs no effet on the prefix struture. Figure (f) shows the tree fter the seond relevnt suffix hs een inserted. A new rnhing node q = ws reted in the suffix struture. The new node is not prefix node nd will thus e dded to pth. It ppers hidden in pth in the prefix struture fter the new nodes hve een threded in from the stk. This n e seen in Figure (g), whih shows the tree fter Step 4. The referene pir sp prefix is denonized, its se moves from to,

62 CONSTRUCTING COMPACT AFFIX TREES ON-LINE tive suffix suffix se prefix se () tive suffix suffix se prefix se () tive suffix suffix se prefix se () tive suffix suffix se prefix se (d) tive suffix suffix se prefix se (e) tive suffix suffix se prefix se (f) tive suffix suffix se prefix se (g) tive suffix suffix se prefix se (h) tive suffix suffix se prefix se (i) Figure 3.9: Steps of n itertion from AT() to AT() in the suffix view. The figures show the ffix tree () AT(), () fter Step 1, () fter Step 2, (d) fter Step 3, (e) fter inserting the first relevnt suffix, (f) fter inserting the seond relevnt suffix, (g) fter Step 4, (h) fter denonizing sp prefix nd inserting the prefix lotion of t, nd (i) fter the prefix lef hs een deleted. The resulting tree is AT().

63 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES 53 tive suffix suffix se prefix se () tive suffix suffix se prefix se () tive suffix suffix se prefix se () tive suffix suffix se prefix se (d) tive suffix suffix se prefix se (e) tive suffix suffix se prefix se (f) tive suffix suffix se prefix se (g) tive suffix suffix se prefix se (h) tive suffix suffix se prefix se (i) Figure 3.10: Steps of n itertion from AT() to AT() in the prefix view. The figures show the ffix tree () AT(), () fter Step 1, () fter Step 2, (d) fter Step 3, (e) fter inserting the first relevnt suffix, (f) fter inserting the seond relevnt suffix, (g) fter Step 4, (h) fter denonizing sp prefix nd inserting the prefix lotion of t, nd (i) fter the prefix lef hs een deleted. The resulting tree is AT().

64 BIDIRECTIONAL CONSTRUCTION OF COMPACT AFFIX TREES where suffix edge leds to, whih eomes the new se. The new length is set to two, nd prefix edge is inserted to sl suffix =. The resulting tree is shown in Figure (h). Beuse ws prefix lef, it needs to e deleted. Finlly, the resulting tree AT() is shown in Figure (i) Complexity Theorem 3.13 (Complexity of the unidiretionl onstrution). The ffix tree AT(t) for the string t n e onstruted in n on-line mnner from left to right or from right to left in time O( t ). Proof. The lgorithm is dul, so it does not mtter whether we lwys ppend to the right or lwys ppend to the left. Exhnging the words suffix nd prefix in the desription of the lgorithm does not hnge it. With the use of pths (see Funtion 3.8 for n exmple), Step 4 only sees nodes tht would e in the orresponding suffix tree. The other nodes re hidden in pths nd re skipped like edges. Similrly, Step 5 only works with nodes tht would e in the orresponding prefix tree. New nodes re dded to pths nd so this invrint is kept. Therefore, Lemm 3.6 nd Lemm 3.9 n e pplied nd oth steps tke mortized onstnt time. If we use the index trnsformtion s desried in Setion 3.1.3, we do not even need to hnge the mortized nlysis of Lemm 3.9 the oins re simply tken from t 1 o (rell tht o is the offset of the referene pir sp prefix ). Beuse o only dereses, 1 o only inreses. We lwys put oin on newly dded hrter. Hene, there is lwys oin on t 1 o. The other three Steps (1, 2, nd 3) tke onstnt time eh. As result, the lgorithm runs in liner time. Remrk As lredy stted, the lgorithm is on-line: The k-th step results in the ffix tree for the prefix of length k of the string. Assuming the input string hs length n, the omplexity of n individul step n well e Θ(n). Consider the lst step of uilding the ffix tree for. The string hs no right- or left-rnhing sustrings, ut the finl string hs n 2 right-rnhing sustrings, thus Θ(n) new inner nodes hve to e reted. 3.4 Bidiretionl Constrution of Compt Affix Trees So fr, we hve only onsidered the question whether it is t ll possile to onstrut n ffix tree in liner time. We nswered the question in the positive. We even hve n online lgorithm with mortized onstnt ost per step tht onstruts ompletely dul dt struture. This nturlly rises the question wht hppens if the on-line onstrution is done in oth diretions. This idiretionl onstrution is on-line y dding hrters t oth sides of the string in ritrry order. Besides the need to updte the importnt lotions for oth sides (e.g., the referene pir for the tive prefix point), dditionl mesures must e tken to hieve liner running time. Lemms 3.15 nd 3.16 n e found in similr form in [Sto95] nd re replited for etter understnding, lrity, nd ompleteness. The importne of Lemm 3.16 revels itself only in the ontext of the mortized nlysis for the liner idiretionl onstrution of ompt ffix trees.

65 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES Additionl Steps The ffix tree onstrution lgorithm is dul, so the tree n e onstruted y ppending hrters t the end or t the front of the underlying string. To e le to do this interhngely, we need to rememer the referene pirs for the tive prefix point (pp prefix nd pp suffix ), too, nd we need to keep them nonil nd up-to-dte. For this, we dd new step to the steps desried in Setion 3.3: Step 6. Updte the tive prefix. By ppending hrter to the end of t the tive prefix ˆα( ) nnot eome smller. The tive prefix ˆα(t) stys prefix in t nd, if it is nested in t, it is lso nested in t. Therefore, the tive prefix n grow t most y one hrter if it is prefix nd suffix of t nd the prefix is followed y, i.e., ˆα(t) prefixes(t) nd ˆα(t) suffixes(t). The following lemm n e used to determine if the tive prefix grows. Lemm The tive prefix grows y one hrter if nd only if the prefix lotion of the new tive suffix is represented y prefix lef in Step 5 of the lgorithm. Proof. If the prefix lotion of the new tive suffix is prefix lef, then the new tive suffix is lso prefix of t. We hve α(t) prefixes(t) nd α(t) suffixes(t). Oviously, α(t) is nested prefix. Beuse v = α(t) is represented y prefix lef in AT(t), the string v ws non-nested prefix of t efore. Therefore, the tive prefix ˆα(t) ws smller thn v. Due to the ft tht v is nested prefix of t, the new tive prefix ˆα(t) must e t lest s lrge s v. We only dd one hrter to t, thus ll new sustrings of t not in t re sustrings of t ppended y. Therefore, the tive prefix nnot grow y more thn single hrter. For the other diretion of this proof, ssume tht the tive prefix grows y one hrter. Then its new lrger seond ourrene in t must e suffix. Hene, we hve ˆα(t) prefixes(t) nd ˆα(t) suffixes(t). The new tive prefix ˆα(t) is nested suffix of t nd therefore suffix of α(t). We lim tht ˆα(t) = α(t). For the ske of ontrdition, suppose tht the tive suffix is lrger, then we hve vˆα(t) = α(t) for some v Σ +. There must e seond ourrene of the tive suffix vˆα(t) in t tht is not suffix. If vˆα(t) is lso prefix, then it is ndidte for the tive suffix ontrditing our ssumptions out ˆα(t). Therefore, v ˆα(t) is proper sustring of t nd onsequently lso non-prefix sustring of t. The sitution is desried y ˆα (t) prefixes (t) nd ˆα (t) psustrings (t) suffixes (t). (3.6) Hene, ˆα(t) is nested prefix of t nd, s suh, ndidte for the tive prefix of t lrger thn the tive prefix itself ontrdition. Therefore, we must hve ˆα(t) = α(t). Beuse the new tive prefix is longer thn the old one, ˆα(t) is non-nested prefix in t, nd it is represented y prefix lef in AT(t). We find the prefix lotion of the new tive suffix in Step 5. We then hek expliitly whether it is prefix lef (nd whether we must delete it) or not. Step 6 is

66 BIDIRECTIONAL CONSTRUCTION OF COMPACT AFFIX TREES esily implemented y nonizing oth referene pirs if no node is deleted in Step 5 (we nonize euse nodes might hve een inserted), nd y lengthening the tive prefix when lef is deleted. We now detil the lengthening step. If the tive prefix grows, we must keep its referene pirs up-to-dte. In tht se, we lso know tht ˆα(t) = ˆα(t). To find the prefix lotion pp prefix (t) of ˆα(t) from pp prefix (t), we must find suffix edge leled to move its se so tht the new hrter n e prepended. This n e done with the lredy known proedure denonize(sp prefix (t), ). The lotion of pp suffix (t) n e found y lengthening pp suffix (t) y the hrter. After tht, we use nonize(pp suffix (t)) to keep the referene pir nonil. If the tive prefix does not grow, it is still neessry to nonize pp prefix (t) nd pp suffix (t) euse prefix nd suffix nodes might hve een inserted. Clling nonize t the end of n itertion is equivlent to simply testing whether newly inserted suffix (prefix) node is the new nonil se of pp suffix (sp prefix ) eh time new node is dded. This dds only onstnt overhed to the insertion of nodes. Step 6 hene dds the dditionl overhed of nonizing pp suffix (t) nd denonizing sp prefix (t). Thus, we n onstrut ffix trees idiretionl nd on-line. To hieve linertime ehvior, we need to pply some more hnges s explined next Improving the Running Time If the lgorithm is exeuted s desried ove, liner time ound nnot e gurnteed if hrters re ppended nd prepended in ritrry order. As n illustrtion, suppose tht fter some itertions where hrters were ppended to the string, hrters re now prepended. In suh reverse itertion, the tive suffix might grow. In Step 6 of the reverse itertion, the referene pir sp suffix = (p, (o, l)) is denonized. Thus, its index o into the string is deresed nd Lemm 3.6 nnot e pplied nymore. Similr onditions my inrese the index o of sp prefix nd Lemm 3.9 nnot e pplied. To prevent ostly osilltion of the indies nd to redue work to mortized liner time, we must onserve some work. An mortized nlysis will then result in the desired omplexity. We introdue two improvements to Step 5 nd to Step 6 tht ensure liner time ound. Extension 1. If the new prefix lotion of the tive suffix is lredy expliit s prefix node q, this node must our in pth strting elow the se p of the sp suffix = (p, (o, l)). We n get to the node in onstnt time y retrieving the (l 1)-th element in the pth strting t the hild t o of p. In Step 5, we denonize only if the node nnot e found in onstnt time y the method desried ove. In the exmple shown in Figures 3.9 nd 3.10, the nonil suffix referene pir sp suffix (t) fter Step 4 (see Figure (g)) hs the send the length 1. This lotion is represented y the prefix node in the prefix struture nd n e retrieved in onstnt time from node. The following lemm will mke Step 6 run in onstnt time ltogether.

67 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES 57 Lemm If the tive prefix ˆα(t) grows, the new tive prefix ˆα(t) is equl to the new tive suffix α(t). Proof. If the tive prefix grows from t to t, then ˆα(t) must e prefix nd ˆα(t) suffix of the originl string, nd there my not e ny other ourrene of ˆα(t) nywhere in t (otherwise ˆα(t) is lrger nested prefix ontrdition to ˆα(t) eing the tive prefix of t). The new string t hs the suffix ˆα(t) nd the prefix ˆα(t). Hene, ˆα(t) is nested suffix. Suppose there is lrger nested suffix w = vˆα(t) of t. Then w is suffix of t, nd there must e n dditionl ourrene of w in t, i.e., w prefixes(t) psustrings(t). Hene, the suffix ˆα(t) of w must e either proper sustring or suffix of t. But then we hve ˆα(t) prefixes(t) nd ˆα(t) psustrings(t) suffixes(t), so ˆα(t) is ndidte for the tive prefix of t ontrdition. As result, ˆα(t) = ˆα(t) = α(t). This yields our seond improvement. Extension 2. If the tive prefix grows, we ssign sp suffix (t) to pp suffix (t) nd sp prefix (t) to pp prefix (t). By Lemm 3.16, this is orret, nd Step 6 thus runs in onstnt time Anlysis of the Bidiretionl Constrution By Extension 2, Step 6 runs in onstnt time. The ost of nonizing sp prefix nd pp suffix when the tive prefix does not grow n simply e ssigned to the insertion ost of node t no further symptoti hrge (only one or two referene pirs need to e heked for eh newly inserted node). Hene, ll tht is left to tke re of re the side effets tht reverse itertion hs on lter itertions. Beuse the referene pirs re ll kept up-to-dte nd the upkeeping of sp prefix nd pp suffix omes for free, the min onern lies on hnges of the referene pirs for the tive suffix in itertions where hrters re ppended to the front of the underlying string Overview The idiretionl version of the lgorithm hs four steps tht do not run in onstnt time. These re Steps 4 nd 5 in the suffix nd in the prefix itertion (the proedures nonize() nd denonize()). We distinguish etween suffix itertion ( norml itertion ) where hrter is ppended to t resulting in t, nd prefix itertion ( reverse itertion ) where hrter is prepended to t resulting in t. We need to show tht ll exeutions of nonize() nd of denonize() in the prefix nd in the suffix itertion hve onstnt mortized ost. To show this, we use the ounting method of mortized nlysis nd ssign to eh hrter of t four purses or ounts: A suffix/nonize purse, suffix/denonize purse, prefix/nonize purse, nd

68 BIDIRECTIONAL CONSTRUCTION OF COMPACT AFFIX TREES prefix/denonize purse. In the suffix itertions, the ost of repetition of the while loop in nonize() is pyed from suffix/nonize purse nd the ost of repetition of the while loop in denonize() is pyed from suffix/denonize purse. Anlogously, in prefix itertions, the prefix/nonize nd the prefix/denonize purses re used. The purses re filled when hrters re ppended or prepended to the string or when new nodes re inserted into the tree (the detils of the ltter re given further elow). The following invrints re kept throughout the lgorithm. They gurntee liner ehvior. Invrint I Whenever the ody of the while loop of nonize() in suffix itertion is exeuted, there is oin in the suffix/nonize purse of hrter t o, where o is the offset index of sp suffix. Invrint II Whenever the ody of the while loop of denonize() in suffix itertion is exeuted, there is oin in the suffix/denonize purse of hrter t o, where o is the offset index of sp prefix. Invrint III Whenever the ody of the while loop of nonize() in prefix itertion is exeuted, there is oin in the prefix/nonize purse of hrter t o, where o is the offset index of pp prefix. Invrint IV Whenever the ody of the while loop of denonize() in prefix itertion is exeuted, there is oin in the prefix/denonize purse of hrter t o, where o is the offset index of pp suffix. Theorem 3.17 (Complexity of idiretionl onstrution of ffix trees). If the Invrints I, II, III, nd IV re true, then the idiretionl onstrution of the ffix tree for the string t hs liner running time O( t ). Proof. Let n = t. If the ove four onditions re met, then eh ll to one of the proedures hs onstnt mortized ost. In eh itertion, t most one ll to denonize() is mde nd for eh suffix lef inserted in suffix itertion nd eh prefix lef inserted in prefix itertion t most one ll is mde to nonize(). Beuse there re t most n itertions, no more thn n leves n e deleted. The finl ffix tree hs t most n suffix leves nd t most n prefix leves. Hene, the numer of leves ever inserted is O(n) nd there re no more thn O(n) lls to nonize(). As result of the ove theorem, eh itertion hs onstnt mortized ost. We show tht, even though n ritrry numer of reverse itertions is exeuted etween norml itertions, there is lwys oin in the suffix/nonize purse of hrter t o when the ody of the while loop of nonize() in suffix itertion is exeuted nd there is lwys oin in the suffix/denonize purse of hrter t o, when the ody of while loop of denonize() in suffix itertion is exeuted (Invrints I nd II). By the dulity of the lgorithm, this will show tht ll four invrints re true throughout the lgorithm nd tht the lgorithm is liner in the length t of the finl string t. We initilize eh purse s follows:

69 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES 59 If hrter is ppended to t, then the new string is t 0 t n t n+1 (with = t n+1 ), nd we put oin in the suffix/nonize purse of t n+1 nd in the suffix/denonize purse of t n+1. If hrter is prepended to t, then the new string is t 1 t 0 t n (with = t 1 ), nd we put oin in the prefix/nonize purse of t 1 nd in the prefix/denonize purse of t 1. If we solely perform suffix itertions, the offset index o of sp suffix is only inresed nd the offset index o of sp prefix is only deresed (so tht the trnsformed index 1 o is only inresed). If no reverse itertions re performed, only the Invrints I nd II (or III nd IV) re relevnt, nd Lemm 3.18 tells us, tht they re lwys fulfilled. Therefore, the onstrution is liner in time. Lemm If the offset index o of sp suffix nd of pp prefix inreses only, Invrints I nd III re lwys fulfilled. If the offset index o of sp prefix nd of pp suffix dereses only, Invrints II nd IV re lwys fulfilled. Proof. Chrters ppended to the underlying string pper t higher indies thn the index offset o of sp suffix nd t smller indies (in the reverse view of the string) thn the index o of pp prefix. Similrly, hrters prepended to the underlying string pper t higher indies (in the reverse view of the string) thn the index offset o of sp prefix nd t smller indies thn the index o of pp suffix. If intermedite prefix itertions re performed, the offset indies of sp suffix nd pp prefix n derese (respetively inrese). We hve to prove tht the dditionl oins needed n e put into the purses with mortized onstnt extr ost. The remining two setions will show how this is done nd omplete the proof Chnges to sp suffix in Reverse Itertion Invrint I Chnges to sp suffix onern Invrint I only. The following results n e pplied nlogously to pp prefix (y exhnging suffix nd prefix ) nd Invrint III. Lemm Assigning pp suffix (t) to sp suffix (t) in Step 6 dds mortized onstnt overhed to suessive lls to nonize(sp suffix ) nd keeps Invrint I. Proof. In prefix itertion t = t m t n to t = t m 1 t n (where the hrter t m 1 = ws prepended), the tive suffix my grow y one. Although sp suffix (t) is not denonized expliitly (Extension 2), it is hnged s if it hd een denonized. Let p suff new e the new se of sp suffix (t) nd let p suff old e the old se, then there is suffix node q suh tht q is the prefix prent of p suff new (the prefix edge is leled y the prepended hrter ), nd p suff old is suffix desendnt of q. (See Figure 3.10(g), where the tive prefix grows from to. We hve p suff old =, q = root, nd p suff new = root euse of the speil se in denonize.) Let V suff = {q 1,...,q k } e the nodes tht re on the pth from q to p suff old (V suff = {} in our exmple). In the next ll to nonize() in suffix itertion, the nodes in V suff hve to e trversed

70 BIDIRECTIONAL CONSTRUCTION OF COMPACT AFFIX TREES dditionlly. The purses tht re expeted to hve oin re ssigned to the hrters with the indies I suff = {i j i j = n α(t) + depth(q j ) + 1, q j V suff }. We now show tht the oins ome from the reverse itertion. In the reverse itertion, sp suffix ws hnged y n ssignment from pp suffix. Usully, pp suffix is denonized in Step 5 of the reverse itertion. In the se, tht the tive suffix grows in reverse itertion, the suffix lotion of pp suffix is represented y suffix lef (y Lemm 3.15). The node is found in onstnt time y Extension 1. Hene, there re oins tht re not used in prefix/denonize purses euse denonize() is not exeuted. (In the exmple, these re the oins tht would e needed to denonize over ). Let p pref old e the se of pp suffix(t) nd p pref new e the se of pp suffix (t). We know tht p pref new = p suff new nd, y Corollry 3.5, we hve α(t) + 1 = α(t) = ˆα(t) ˆα(t) + 1, so α(t) ˆα(t), nd p pref old is desendnt or equls to p suff old. Therefore, if we hd denonized pp suffix (insted of ssigning it diretly to the suffix lef), the nodes V pref (V pref = {} in the exmple) tht we would hve met in denonize() would e superset of the nodes in V suff (i.e., V suff V pref ). The string indies re I pref = {i j i j = m + ˆα(t) depth(q j ) + 1, q j V pref }. Oviously, I pref I suff nd, euse the oins were not needed for denonizing pp suffix, we n tke the money from the prefix/denonize purses of the hrters with indies I pref nd put one oin in the suffix/nonize purse of eh hrter with n index in I suff. (In the nlysis of denonize(), we ssumed tht the oin from the hrters with indies I pref were used euse we moved the offset index over them, ut denonize() ws not exeuted.) As result, Invrint I is kept nd the dditionl work n e pyed for. Further prefix itertions, whih our efore nonize() is lled in suffix itertion, n lengthen the tive suffix gin, leding to the sme money trnsfer. In itertions where the tive suffix is unhnged, only new suffix nodes n e inserted. Lemm Only onstnt work is needed to keep Invrint I when dding dditionl suffix nodes in prefix itertions. Proof. Let p suff new, p suff old, q e the nodes of previous prefix itertion t = t m...t n to t = t m 1...t n s desried ove in the proof of Lemm 3.19 (where the tive suffix hs grown nd the tive suffix hs not een nonized yet). For eh node r tht is inserted etween q nd p suff old, n dditionl repetition of the while loop is needed in suessive ll to nonize() (V suff grows). The index of the hrter with the suffix/nonize purse to tht oin hs to e dded is i r = n α(t) +depth(r)+1. The new suffix node nnot lie etween two different p suff old, q pirs. Suppose p suff old nd q re the nodes of the next se where the tive suffix ws lengthened with p suff new = p suff old. A node r etween p suff old,q nd p suff old, q must hve depth d with d > depth(q) = depth(p suff new) 1 = depth(p suff old ) 1 nd d < depth(p suff old ). This is not possile. There is t most one suffix node inserted per reverse itertion, so pying the dditionl oin only dds onstnt extr ost. Note tht suffix itertions nnot insert ny suh nodes euse ll nodes re inserted ove or t the se of the sp suffix.

71 CHAPTER 3. LINEAR CONSTRUCTION OF AFFIX TREES 61 Lemm Cnonizing sp suffix (t) during prefix itertions keeps Invrint I. Proof. Cnonizing inreses the offset index o towrds the end of the string, so no positions without oins re introdued. As result, Invrint I is lwys kept vlid Chnges to sp prefix in Reverse Itertion Invrint II Chnges to sp prefix onern Invrint II only. The following results n e pplied nlogously to pp suffix (y exhnging suffix nd prefix ) nd Invrint IV. By Lemm 3.18, the invrint might only e destroyed if the offset index o of sp prefix is inresed (in the prefix view of the string). Lemm Assigning pp prefix (t) to sp prefix (t) in Step 6 dds mortized onstnt overhed to suessive lls to denonize(sp prefix ) nd keeps Invrint II. Proof. Let t = t m t n. All prefix nodes tht might hve een dded during Step 4 in this prefix itertion (if ny) re dded t depth lrger thn the length of α(t). Assigning pp prefix (t) to sp prefix (t) is equivlent to inresing the length of sp prefix (t) y one. If sp prefix (t) ws nonil with se p old, then the se p new of the nonil referene pir sp prefix (t) is either the sme s p old or hild of p old. In the ltter se, depth(p new ) = α(t) euse there nnot e node q with depth(q) α(t) if sp prefix (t) ws nonil. We n put oin in the suffix/denonize purse of t n to keep Invrint II. Beuse we need to put t most one dditionl oin in suffix/denonize per itertion, this dds t most onstnt overhed. We look t nonizing sp prefix when the tive suffix does not grow. We only need to nonize if new prefix nodes re inserted. The noniztion n e seen s tking ple eh time new node is inserted. For eh new node, we hek whether the node is the new nonil se for sp prefix. The following lemm dels with ll inserted nodes. Lemm Only mortized onstnt work is needed to keep Invrint II when dding dditionl prefix nodes in prefix itertions. Proof. Let o min e the miniml vlue tht the offset index ever hd in ll previous itertions. For node q, let i q e the index with whih it would pper in ll to denonize() (in suffix itertion). We hve i q = n depth(q) 1 (the trnsformed index for the prefix view is 1 i q = depth(q) n, wheres n depth(q) 1 is the rel index). Let r e the node with the lrgest string depth suh tht 1 i r < o min, nd let p e the urrent se of sp prefix. A node q inserted elow r does not destroy Invrint II euse there is still the originlly inserted oin on hrter t iq. We need to provide extr money for the nodes inserted ove r to keep the invrint. Let q e inserted ove r nd elow p with 1 i q o min. We dd oin to the suffix/denonize purse of hrter t iq

72 BIDIRECTIONAL CONSTRUCTION OF COMPACT AFFIX TREES (prefix view of the string). If q is inserted ove p so tht sp prefix is nonized to the new se p = q, we dd oin to the suffix/denonize purse of hrter t ip, nd q eomes the new se. By the ove ounting, we hve now the sitution tht elow o min there is oin in every hrter s purse, nd there is oin in eh purse for hrter orresponding to node etween o min nd o. When denonize() is lled on sp prefix nd we move over node q in the whileloop whilst o is still lrger thn o min, the ove pyment gurntees tht there is oin in the suffix/denonize purse of hrter t o. We prove y indution. Bse se. We hve not yet moved over prefix edge. In this se, we meet extly the sme nodes for tht we hve dded the oins for the orresponding hrters. Indutive se. We hve moved over prefix edge efore. We hve i q = n depth(q) 1 nd q is prefix node. Let q e the suffix prent of q. By Lemm 3.2, there is suffix prent nd depth(q) = depth(q ) + 1. By indution, there is oin in the suffix/denonize purse of hrter i q. The index is i q = n depth(q ) 1, where n ws the rightmost string index in the step when the se ws moved long the prefix edge the lst time. The se of sp prefix is moved over prefix edge only if hrter is ppended. Hene, we hve n = n 1 nd i q = n 1 depth(q ) 1 = n 1 (depth(q) 1) 1 = n depth(q) 1. Thus, there is oin in the suffix/denonize purse of hrter t iq. For string t of length n, we insert t most 2n prefix nodes in reverse itertions, thus we only py n mortized onstnt mount of dditionl oins. Note gin tht suffix itertions nnot insert ny suh prefix nodes euse ll prefix nodes re inserted ove or t the se of the sp prefix. As result, Invrint II is lwys kept vlid.

73 Chpter 4 Averge-Cse Anlysis of Approximte Trie Serh In this hpter, we study the verge-se ehvior of serhing in ditionry ( set S) of strings llowing ertin numer d of errors under omprison-sed string distne (see Setion 2.1.3). In the nottion of the lssifition sheme from Setion 2.4, we look t indexing prolems of the type P(Σ ) d f(n) do ll nd P(Σ ) d f(n) do pref, where d is ny omprison-sed string distne, n = S is the numer of indexed strings, nd d = f(n) is the numer of mismthes llowed, whih we study s funtion of n. Ditionry indexing is well-known tsk, e.g., for looking up misspelled words. There exist very fst on-line serh lgorithms. For exmple, for omprison-sed string distnes, serh pttern u of length m n e interpreted s deterministi finite-stte utomton with (m + 1)(d + 1) sttes. The utomton hs very regulr struture, nd it is possile to implement it y fst register opertions. Strting the utomton on eh string in S n depending on the string distne nd the numer of llowed mismthes thus e very fst. On the other hnd, using trie hs the dvntge tht the prefixes of strings in S re omined. When we ompre prefix v of length k of the serh pttern with pth spelling out the word w in the trie, we hve effetively ompred the prefix v with ll strings in S hving prefix w. In terms of the numer of omprisons, this is lwys more effiient. In prtie, trversing trie hs some extr overhed, e.g., in rndom-esses to the memory, while ompring the pttern with string is very simple. Thus, the ltter pproh n e implemented with smller onstnt in the symptoti running time (nd lso with less spe). We expet to see some trde-off or threshold where one method outperforms the other. To nlyze the reltive performne of the lgorithms, we look t the vergese numer of omprisons performed. We present very ext nlysis for tht the onstnts in the symptotis n e omputed extly for ny onrete instne of the prolem. A worst-se omprison nnot e used for meningful omprtive nlysis euse it only dels with smll suset of possile inputs. An individul nlysis of the lgorithm implementtions n ontriute the onstnt ftors involved in single omprison. Thus, our nlysis helps to deide whih lgorithm should e used for ny onrete instne. We disuss lterntive lgorithms in the introdution nd onrete lterntive (for 63

74 PROBLEM STATEMENT Algorithm 4.1 LS(set S, string w, integer d) Prmeters: S = {t 1,...,t n } is the olletion of strings to e serhed, w is the serh pttern, nd d the error ound. 1: for j from 1 to n do 2: i := 1 3: := 0 4: l := min{ w, t j } 5: while d do 6: while i l nd d (w i, t j i ) = 0 do 7: i := i + 1 8: := + 1 9: i := i : if i = w then 11: report mth for t j onstnt numer of errors) in Chpter Prolem Sttement Let S = {t 1,...,t n } Σ e olletion of strings to e indexed with n = S. In the ontext of tries or diret omprison, heking whether prefix t j,i of string t j S is the omplete string is n esy tsk. The lgorithms onsidered in this hpter n provide solutions to P(Σ ) d f(n) do ll nd P(Σ ) d f(n) do pref with only minor modifition of the reporting funtion. We therefore onsider only the ltter prolem nd report doument if prefix is mthed. We ssume tht ll strings re generted independently t rndom y memoryless soure with uniform proilities, i.e., the proility tht is the i-th hrter of the j-th string is given y Pr{t j i = } = 1 Σ. We will e using the size of the lphet Σ lot, nd therefore revite it y σ = Σ. For the verge-se nlysis, it is onvenient to ssume tht we del with strings of infinite length. Hving strings of finite length in prtie hs only negligile impt on the vlidity of our nlysis: If the end of string is rehed, no more omprisons tke ple, thus the theoretil model gives n upper ound. Furthermore, if the strings re suffiiently lrge, the proility of ever rehing the end is very smll. We ssume tht the serh pttern w is lso generted independently t rndom y memoryless soure with uniform proilities. Comprison-sed string distnes re ompletely defined t the hrter level. For given distne d, the prmeters relevnt to our nlysis re thus the proilities of mth or mismth given y q = 1 σ 2 d (, ) nd p = 1 σ 2 (1 d(, )) = 1 q. (4.1) Σ Σ Σ Σ For exmple, for Hmming distne, we hve mismth proility of q = 1 1 σ nd mth proility of p = 1 σ. Pseudo ode for oth lgorithms is given in Figures 4.1 nd 4.2. The LS lgorithm (for Liner Serh ) just ompres the serh pttern with eh string. The TS lgo-

75 CHAPTER 4. APPROXIMATE TRIE SEARCH 65 Algorithm 4.2 TS(trie T, node v, string w, integer i, integer ) Prmeters: 1: if 0 then 2: if v is lef then 3: report mth for t vlue(v) T is trie for the olletion of strings S = {t 1,...,t n }, v is node of T (in the first ll v is the root), w is serh pttern, i is the urrent position in w (zero in the first ll), nd is the mximl numer of mismthes llowed (in the first ll we hve = d). 4: else if i > w then 5: for ll leves u in the sutree T v rooted t v do 6: report mth for t vlue(u) 7: else 8: for eh hild u of v do 9: let e the edge lel of the edge (u, v) 10: if d(w i, ) = 0 then 11: TS(T, u, w, i + 1, ) 12: else 13: TS(T, u, w, i + 1, 1) rithm (for Trie Serh ) uses previously onstruted trie to redue the numer of omprisons. We store vlue(v) = i t lef v if the pth to v spells out the string t i. On rndom set S, let L d n e the numer of omprisons mde y the LS lgorithm nd T d n e the numer of omprisons mde y the TS lgorithm using trie s n index. We re interested in the verge-se; therefore, we ompute ounds on the expeted vlues E[L d n] nd E[T d n]. In the nlysis, we ssume tht q nd p re given prmeters nd look t the ehvior for different d. Wht is the threshold for d (seen s funtion d = f(n)) up to where the verge E[T d n] is symptotilly smller thn the verge E[L d n]? Wht is the effet of different mismth proilities? The nswers to these questions give us the lues needed to hoose the more effiient of oth methods in vrious situtions. 4.2 Averge-Cse Anlysis of the LS Algorithm Computing the expeted ehvior of L d n involves only some si lger with generting funtions. Beuse of our ssumptions, we n lso derive onvergene proilities for the ehvior of L d n. It indeed omes s no surprise tht L d n is so well ehved euse we re in essene deling with huge mount of independent Bernoulli trils so tht the Lw of Lrge Numers results in very preditle (in the sense of likely) ehvior. If we only hve one string, the proility tht Algorithm 4.1 mkes k omprisons is given y { } Pr L d 1 = k = ( k 1 d ) q d+1 p k d 1. (4.2) For n strings, we hve to sum over ll possiilities to distriute the k omprisons over

76 AVERAGE-CASE ANALYSIS OF THE LS ALGORITHM the strings. Beuse ( k l) = 0 for k < l, we get { } n ( Pr L d ij 1 n = k = d i 1 + +i n=k j=1 ) q d+1 p i j d 1. (4.3) We use this eqution to derive the proility generting funtion g L d n (z): [ ] g L d n (z) = E z Ld n = = k=0 { } Pr L d n = k z k k=0 i i n=k j=1 ( ( k 1 = d k=0 n ( ) ij 1 q d+1 p ij d 1 z k d ) q d+1 p k d 1 z k ) n = Thus, we find the expettion nd vrine of L d n to e ( ) zq n(d+1). (4.4) 1 zp [ ] E L d n = d + 1 [ ] n nd V L d (d + 1)p n = q q 2 n. (4.5) As lredy mentioned, the stohsti proess is very stle. We n use Cheyshev s inequlity to derive onvergene in proility of L d n to its men. { L d n Pr n(d + 1) 1 } q > ǫ { } = Pr L d n(d + 1) n q > ǫn(d + 1) p < q 2 ǫ 2 n(d + 1). (4.6) Hene, we hve proven onvergene in proility for lim n L d n n(d + 1) = 1 q (pr). Note tht, if d = f(n) is funtion of n, then we lredy hve lmost sure onvergene if f(n) = ω(log 1+δ n) for some δ > 0 y simply pplition of the Borel- Cntelli Lemm (see, e.g., [Kl02]) nd the ft tht { L d n Pr n(d + 1) 1 } q > ǫ < n=0 n=0 p q 2 ǫ 2 n log 1+δ n < (4.7) euse n 0 n 1 log (1+ǫ) n is onvergent. Using method of Kesten nd Kingmn (see [Kin73] or [Szp00]), this n e extended to lmost sure onvergene: First, note tht L d n < L d n+1, so Ld n is non-deresing. For ny positive onstnt s let r = r(n) e the lrgest integer with r s n, then L d r s Ld n L d (r+1) s, nd thus lim sup n L d n lim sup n(d + 1) r = lim sup r L d r s (r + 1) s (d + 1) L d r s r s r s lim sup (d + 1) (r + 1) s r L d r s r s (d + 1) (4.8)

77 CHAPTER 4. APPROXIMATE TRIE SEARCH 67 nd eqully lim inf n L d n lim inf n(d + 1) r = lim inf r L d (r+1) s r s (d + 1) L d (r+1) s (r + 1) s (r + 1) s (d + 1) r s lim inf r L d (r+1) s (r + 1) s (d + 1). (4.9) By the Borel-Cntelli Lemm, s 1 + ǫ we hve { L d r Pr s r s (d + 1) 1 } q ǫ r=0 L d r s r s (d+1) onverges to 1 q r=0 p q 2 ǫ 2 (d + 1)r s = (where ζ(s) is the Riemnn zet funtion), nd thus lmost surely, euse for ll p q 2 ǫ 2 (d + 1) ζ(s) < (4.10) lim r L d r s r s (d + 1) = 1 q (.s.). (4.11) As result, we find tht lim n L d n n(d + 1) = 1 q (.s.). Note tht, interpreting d = f(n) s funtion of n, this lso holds for f(n) = Ω(1). 4.3 Averge-Cse Anlysis of the TS Algorithm The ehvior of T d n is somewht hrder to nlyze. We pursue the following pln: We strt y setting up reursive eqution for the expeted numer of omprisons. Using some lger on generting funtions, we rrive t n ext formul. Unfortuntely, the formul requires summtion over n terms of exponentil size nd with lternting signs. Therefore, we pply some more dvned methods from omplex funtion nlysis to derive n symptoti ound. The numer of terms in the ext formul for the ound is polynomil in d (where one term is itself n infinite sum), ut it gives us preise ounds using the highest order terms nd will enle us to determine prmeter rnges for tht the TS lgorithm is the etter hoie. Insted of ounting the numer of omprisons, it is esier to exmine the numer of nodes visited in the trie. Oserve tht the numer of nodes visited is y one lrger thn the numer of hrter omprisons. We denote the numer of nodes visited y the proess in trie for n strings llowing d mismthes y T d n An Ext Formul We derive reursive eqution for the expeted numer of nodes visited. Let gn d = E[T n d ] = E[Tn] d + 1. At ny node u, we hve t most σ hildren v 1,...,v σ. Eh hild is the root of sutree hving i j hildren. Let d e the numer of llowed mismthes t node u. When the lgorithm exmines node v j, the numer of llowed

78 AVERAGE-CASE ANALYSIS OF THE TS ALGORITHM mismthes is either d with proility p, or it is d 1 with proility q. Thus, we get the desired reursive eqution y summing over ll possiilities to prtition the n strings in the sutree of u into the hildren s sutrees. The proility for prtition into sutrees of sizes i 1, i 2,...,i σ is ( ) n! 1 Pr {i 1 + i i σ = n} = i 1! i σ! σ n = n σ n. (4.12) i 1,...,i σ The following reursive eqution is vlid for n > 0 nd d 0: ( ) gn d n σ σ = 1 + σ n q gi d 1 i 1,...,i j + p σ i 1 + +i σ=n j=1 j=1 g d i j. (4.13) If there is no string represented in sutree, we mke no omprisons. Hene, we hve g0 d = 0 for ll d. If we reh node with d = 1, we hd mismth in the lst hrter omprison nd do not ontinue (line one of Algorithm 4.2); therefore, we hve gn 1 = 1. We n esily hek tht, if there is only one string left, y eqution (4.13), we mke g1 d = 1 + d+1 q omprisons. This grees with 1 + E[L d 1 ]. We derive the exponentil generting funtion of gn d y summing oth sides of eqution (4.13) for n 0. We ount the se n = 0, whih in eqution (4.13) wrongly gives g0 d = 1, y sutrting one on the right hnd side. Thus, the eqution for G d (z) = n 0 gd n zn n! is G d (z) = e z + q σ ( z ) G d 1 e (1 σ)z 1 + p σ j=1 σ ( z G d σ j=1 In the ove, we hve lredy simplified the σ-fold ( n )σ n z n G dij i 1,...,i σ n! n 0 i 1 + +i σ=n = n 0 i 1 + +i σ=n z i 1 σ i 1 i1! Gd i j z ij σ i ji j! ) e (1 1 σ)z 1. (4.14) z iσ σ iσ i σ! = G d ( z σ ) e (1 1 σ)z. (4.15) We multiply oth sides y e z (whih orresponds to pplying some kind of inomil inversion) nd define G d (z) = G d (z)e z. We now hve G d (z) = 1 e z + σq G ( z ) d 1 + σp σ G ( z ) d. (4.16) σ Let g n d e the oeffiients of G d (z). Then we hve n ( ) g n d = ( 1) n n ( 1) k gk d nd gn d = k k=0 n k=0 ( ) n g k d k. (4.17) For g n, d we get the oundry onditions g 1 d = 1 + (d+1) q, g 0 d = 0, g 1 n = ( 1) n 1 for n > 0, nd g 0 1 = 0. Compring oeffiients in eqution (4.16), we find for n > 1 tht

79 CHAPTER 4. APPROXIMATE TRIE SEARCH 69 g n d = ( 1)n 1 + g n d 1 σ 1 n q 1 σ 1 n p whih y itertion (on d) leds to g d n = ( 1)n 1 σ 1 n ( σ 1 n ( σ 1 n q 1 σ 1 n p, (4.18) ) d+1 1). (4.19) Finlly, omining equtions (4.17) nd (4.18) proves the following lemm. Lemm 4.1 (Ext expeted numer of nodes visited y Algorithm 4.2). The expeted numer of nodes visited y Algorithm 4.2 on rndom trie of n strings with rndom pttern is ( gn d = n 1 + d + 1 ) q n k=2 ( ) n ( 1) k k 1 σ 1 k n ( ) n ( 1) k ( ) qσ 1 k d+1 + k σ k pσ 1 k. (4.20) k=2 We n divide (4.20) into three prts. Eh prt hs its own interprettion nd will e hndled seprtely: gn d = 1 + d + 1 n ( ) n ( 1) k n + (n 1) q k 1 σ } {{ } 1 k k=2 } {{ } 1+E[L d n] A n n ( ) n ( 1) k ( ) qσ 1 k d+1 + k σ k pσ 1 k k=2 } {{ } Sn d. (4.21) This finishes the first prt of our derivtion. We hndle the two sums in eqution (4.20) seprtely strting with n exursion on A n. We use Rie s integrls to tkle oth sums. Before emrking on these tsks, we desrie generl sheme how to trnslte these sums into integrls involving only the Gmm funtion (insted of the Bet funtion). This mkes the prolem esier. Along the wy, we will lso see how the sum n e desried y simple line integrl Approximtion of Integrls with the Bet Funtion Using Theorem 2.2 on eqution (4.21) yields pth integrl involving the Bet funtion. The evlution of the residues involving the Bet funtion is triky. Therefore, we pproximte the Bet funtion using the symptoti expnsion given y eqution (2.32) introdued in Setion 2.5. Using this pproximtion ws lredy suggested y Szpnkowski [Szp88] lthough without ommenting on vlidity onditions. In the following, we will ly rigorous sis for it. The min ide is to reple the Bet funtion in n integrl of the form 1 2πı C f(n, z)b(n + 1, z)dz, (4.22)

80 AVERAGE-CASE ANALYSIS OF THE TS ALGORITHM where C is pth enirling some poles of B(n+1, z) t {,...,}, y polynomil in z nd the Gmm funtion. This n e done using the symptoti expnsion given in eqution (2.32). We ssume in the following tht f(n, z) is funtion of polynomil growth in n nd onstnt growth in z, i.e., f(n, z) = O(n r ) for onstnt r, then f(n, z) = O(1) for z. The growth of B(n + 1, z) = ( 1) n n! z(z 1) (z n) is O(z n 1 ) s z (in reltion to z, n n e onsidered s onstnt). The first step is to ound the vlue of the integrl long irle with lrge rdius. Lemm 4.2 (The Bet funtion integrl long rs with lrge rdius). Let f(n, z) e funtion suh tht f(n, z) = O(n r ) for onstnt r. Let C e pth orresponding to prt of irle in the omplex plne with rdius m, i.e., C = {p p = m e ıt for t } where, [0, π), nd ssume tht C is suh tht it voids ny singulrities. Then 1 f(n, z)b(n + 1, z)dz 2πı 0. (4.23) m Proof. The proof follows esily from 1 2πı C C f(n, z)b(n + 1, z)dz = 1 ( 1) n n! f(n, z) 2π me ıt (me ıt 1) (me ıt n) dt 1 f(n, z) ( 1) n n! 2π me ıt (me ıt 1) (me ıt n) dt ( ) n r n! 2π (m n) n+1 0. (4.24) m A simple onsequene of the ove lemm for eqution (4.22) is the following: Assume tht the pth C enirles the poles t {,...,} nd tht there re no other singulrities in the hlf-plne R(z) ǫ for some ǫ > 0, then we n extend C to vertil line t R(z) = ǫ nd hlf-irle with rdius m suh tht 1 f(n, z)b(n + 1, z)dz = 1 2πı C 2πı ǫ ı ǫ+ı f(n, z)b(n + 1, z)dz. (4.25) The simple line integrl llows us to pply the pproximtion. We n then return to pth integrl lter. Lemm 4.3 (Asymptoti pproximtion of Bet funtion integrl). For rel onstnt x {0, 1, 2,...}, we hve B(n, x + ıy) dy = O ( n x). (4.26) Proof. The pln of the proof is s follows. We rewrite the Bet funtion s rtio of Gmm funtions. Then we use Stirling s formul to pproximte these. The resulting

81 CHAPTER 4. APPROXIMATE TRIE SEARCH 71 integrl hs four different ftors in the integrnd, for whih we mke se distintion, depending on the rnge of y with respet to n. We strt simplifying y pplying Stirling s formul: B(n, x + ıy) dy = 2 = B(n, x + ıy) dy Γ(n) Γ(x + ıy) Γ(n + x + ıy) dy = 2 ( x + ıy n + x + ıy 0 2π ( n n + x + ıy ) n 1 2 ) x 1 x + ıy e yφ(n,x,y) dy ( 1 + O ( n 1)), (4.27) where 0 sgn(y)φ(n, x, y) = sgn(y)(rg(x+ıy) rg(x+n+ıy)) < π. The ltter follows from the ft tht, for y 0, we hve nx + x 2 + y 2 os(φ(n, x, y)) = x 2 + y 2 < 1. (4.28) n 2 + 2nx + x 2 + y 2 Beuse rg(z) < 0 for I(z) = y < 0, we nlyze φ(n, x, y) s funtion φ(y) of y for y 0. Note tht os(φ(y)) grows inverse to φ(y) on [0, π]. The derivtive of os(φ(y)) is d dy os(φ(n, x, y)) = yn 2 ( nx x 2 + y 2 ) (x 2 + y 2 ) 3 2 (n 2 + 2nx + x 2 + y 2 ) 3 2, (4.29) whih is zero for y = 0 nd for y = nx + x 2. The ltter is impossile for x < 0 nd x < n. For x < 0 nd x < n, we hve os(φ(y)) = 1 nd φ(y) = π 2 for y = 0. Otherwise we hve os(φ(y)) = 1 nd φ(y) = 0 t y = 0. As y we hve os(φ(y)) 1 nd similrly φ(y) 0. We onsider x onstnt with respet to n. Thus, for x < 0, φ(y) dereses from π to zero nd, for x > 0, φ(y) grows from zero to mximum t y = nx + x 2 nd then dereses k to zero. As result, the term e yφ(n,x,y) is monotonilly deresing in y. The term x+ıy n+x+ıy n n+x+ıy is monotonilly deresing in y nd tends to zero. The term is monotonilly inresing in y nd tends to one. We mke se distintion for the integrtion intervl. For x > 0, we find tht x 0 ( n n + x + ıy ) n 1 ( 2 x + ıy n + x + ıy ( n n + x ) n 1 2 ( x 2 + x 2 ) x 1 x + ıy e yφ(n,x,y) dy ) x 1 x x n 2 + 2nx + 2x 2 e 0 dy 0 ( ) x 2x n x x = O ( n x). (4.30)

82 AVERAGE-CASE ANALYSIS OF THE TS ALGORITHM Seondly, for n lrge enough, we hve n x ( n n + x + ıy ( ) n 1 ( 2 x + ıy n + x + ıy n 2 (n + x) 2 + x 2 ) n 1 2 n x 1 n 2x ) x 1 x + ıy e yφ(n,x,y) dy x ( ) x x 2 + y 2 e yφ(n,x,y) dy n 2 n ( ) x 2y e y π 5 dy 2x x n x 2x 5 π ( ) 2 5 x Γ(x + 1) = O ( n x), (4.31) π ( ) ( ) euse φ(n, x, x) ros 2 1 = π n 4 nd φ(n, x, n) ros 2 1 = n π 4, so we hve φ(n, x, y) > π 5 for some n lrge enough. The third rnge is n 2 n ( n n + x + ıy ( ) n 1 ( 2 x + ıy n + x + ıy n 2 (n + x) 2 + n 2 Finlly, for the lst prt we hve n 2 ) n 1 2 ( ) x 1 x + ıy e yφ(n,x,y) dy x 2 + n 4 (n + x) 2 + n 4 ) x (x 2 + n 2 ) 1 4 n 2 ( ) n n 1 ( ) 2 x + ıy x 1 e yφ(n,x,y) dy n + x + ıy n + x + ıy x + ıy n 2 ( n 2 (n + x) 2 + yn 2 n 2 y ) n 1 2 ( ) = O 2 n 2 n 2. (4.32) 1 y dy ) n dy = O (n n 1 2. (4.33) Thus, the whole integrl is O(n x ). For x < 0 ( x N), the derivtion is lmost the sme. We strt with slightly lrger rnge: n 0 ( n n + x + ıy n n + x ) n 1 ( 2 x + ıy n + x + ıy ( n n + x ) n n 0 ) x 1 x + ıy e yφ(n,x,y) dy ( x + ıy n + x + ıy 4e x x x (2n) x 1 n x e y π 5 dy = n x4(2e) x x x x 0 ) x 1 x + ıy e yφ(n,x,y) dy 5 π (e 1 5 nπ 1) = O ( n x), (4.34)

83 CHAPTER 4. APPROXIMATE TRIE SEARCH 73 ( n euse n n+x) < 2e x, n n+x < 2, nd φ(n, x, y) > π 5 for some n lrge enough (with φ(n, x, n) = π n 4 nd φ(n, x,0) = π). The seond prt is n 2 n ( n n + x + ıy ) n 1 2 n + x + ıy x + ıy ( ) n 1 n 2 (n + x) 2 + n 2 (n + x) 2 + n 2 x 2 + n 2 The finl prt is n 2 n 2 ) n 1 ( n n2 ) x 1 x + ıy e yφ(n,x,y) dy ) n x 2 n 2 = O ( n n + x + ıy ( 2 n + x + ıy x + ıy ( ) n 1 2 n 2y (n + x) 2 + y 2 This finishes the proof. 2 x 2 ) x (x 2 + n 2 ) 1 4 n 2 ( (3 2 ) n ) 2 n 2 ) x 1 x + ıy e yφ(n,x,y) dy y 2 y n dy n 2 ) x 1 y e yφ(n,x,y) dy ( ) = O n n 1 2. (4.35). (4.36) The ove proof relies on the ft tht the Bet funtion eomes very smll very quikly long the imginry xis. The next lemm detils this. Lemm 4.4 (Til of Bet funtion integrl). For x < 0 nd ny stritly positive funtion f(n) ω(1), we hve ( B(n, x + ıy) dy = O n f(n)( π ǫ) x) 4, (4.37) f(n)ln n for some rel onstnt ǫ > 0. Proof. We use the sme derivtions s in Lemm 4.3 together with the symptoti ehvior φ(n, x, n) = π n 4. Thus, we find tht n f(n) ln n ( n n + x + ıy ) n 1 e x ( f(n)ln n 2n ( 2 n + x + ıy x + ıy ) x (lnn) 1+ǫ 2 ) x 1 x + ıy e yφ(n,x,y) dy n f(n)ln n e y( π 4 ǫ) dy ( ) f(n)lnn x e x 1 (f(n)lnn) 1 2 ( ln n( π ǫ) 2n π 4 4 ǫ)e f(n) ( = O n f(n)( π ǫ) x) 4. (4.38)

84 AVERAGE-CASE ANALYSIS OF THE TS ALGORITHM Lemm 4.5 (Approximtion of Bet funtion integrls). Let f(n, z) e funtion suh tht f(n, z) = O(n r ) for onstnt r (i.e., f(n, z) is of onstnt growth with respet to z). Let z = x + ıy. For x < 0, we n pproximte for some onstnt nd fixed n x+ı x ı f(n, z)b(n, z)dz = N 1 k=0 x+ı x ı f(n, z) ( 1)k B (1 z) k (1) Γ(z + k)n k z dz k! + x+ı x ı f(n, z)n N z Γ(z)dz + o(1). (4.39) Proof. The proof relies on the stndrd expnsion of the rtio of two Gmm funtions ([TE51, AS65], see eqution (2.32) in Setion 2.5). We use the ft tht the expnsion is uniform [Fie70] nd tht the tils of the integrnds re exponentilly smll (Lemm 4.4). By eqution (2.32), we n pproximte B(n, x + ıy) g N (n, x + ıy) on the intervl 0 y n 1 3 y g N (n, x + ıy) = N 1 k=0 For y > n 1 3, we use Lemm 4.4 nd get n 1 3 ( 1) k B (1 x ıy) k (1) Γ(x + ıy + k)n k x ıy k! f(n, z)b(n, x + ıy) dy = O (n r ) n n N x ıy y 2N Γ(x + ıy). (4.40) B(n, x + ıy) dy ( = O e n1 3 ( π ǫ) ) 4 n x+r. (4.41) The terms of f(n, z)g N (n, x+ıy) re of the type n r k x ıy Γ(x+ıy+k)(x + ıy) l (for some l k). Rell tht l, k, x hve to e onsidered s onstnt nd tht x < 0. Integrting over term like this yields n 1 3 n r k x ıy Γ(x + ıy + k)(x + ıy) l dy 2πn r k x x l y l x + ıy + k x+k 1 y rg(x+k+ıy) x k+ 2 e n 1 3 x+k 12((x+k) 2 +y 2 )dy 2πe x k n r k x x l x + k x+k y x+k+l e y rg(x+k+ıy) dy = O (n r k x) n 1 3 n 1 3 y 2k e y π 3 dy = O ( n r k x) π 3 n1 3 y 2k e y dy ( = O n r k x+2 3 k e π 3 n1 3 ), (4.42)

85 CHAPTER 4. APPROXIMATE TRIE SEARCH 75 euse we hve rg(x + k + ıy) π 3sgn(y), nd π 3 n1 3 u 2k e u du 2( π 3 n1 3 ) 2ke π 3 n1 3 (4.43) y iterted integrtion (for n lrge enough). Hene, the error on the til y > n 1 3 is exponentilly smll in n. Thus, the expnsion is vlid throughout the whole rnge. Lemm 4.5 llows us to onvert line integrl over Bet funtion into one involving the Gmm funtion. The residues of the Gmm funtion re esier to evlute. Therefore, we need to go k to pth integrl. By Lemm 4.2, we re le to onvert the integrl in eqution (4.22), where C enirles the poles t {,...,} nd f(n, z) hs no singulrities in the hlf plne R(z) ǫ, into line integrl long the line R(z) = ǫ. By Lemm 4.5, we pproximte the integrl into sum of line integrls over the Gmm funtion: 1 f(n, z)b(n + 1, z)dz 2πı C k,l ǫ ı k,l f(n, z)n k+z Γ( z + k)z l dz. ǫ+ı (4.44) To evlute the integrls on the right hnd side, gin the residue theorem is used. The integrls on the right side re onverted into losed-pth integrls y showing tht the missing prts of the pth re smll. 1 Beuse we strted out with the poles t {,...,} nd ssumed tht there re no other singulrities in the hlf plne R(z) ǫ, the line is usully ompleted into ox to the left round different poles. By Stirling s formul, 2 for z = x + ıy with y nd ounded x, we hve 2π Γ(x + ıy) x + ıy e x x + ıy x e y rg(x+ıy) 2πe x y x 1 2 e π 2 y. (4.45) Thus, we n ound n integrl for pth prllel to the x-xis for some α β nd m > mx{ α, β } y f(n, z)n k+z Γ( z + k)z l dz n r k+β (β + m) l 2π x + ıy e x (β α) β + m β e m π 4 0, (4.46) m β±ım α±ım for suitle onstnt. To lose the ox, we need n integrl prllel to the y-xis. We n ound the 1 Derivtions similr to the following n e found in [Knu98], Setion Whih in one of its mny forms is Γ(z) = 2πz z 1 2 e z (1 + O( 1 z )).

86 AVERAGE-CASE ANALYSIS OF THE TS ALGORITHM () () () (d) Figure 4.3: Illustrtion of the trnsformtion of the integrls in the omplex plne for the proof of Lemm 4.6. The Rie s integrl () is trnsformed into huge hlf-irle (). The irle is extended to infinity, whih leds to line integrl (). The ltter is extended into ox round different poles (d). vlue of it for ny onstnt α y α+ı f(n, z)n k+z Γ( z + k)z l + dz nr k+α Γ(k α + ıy)(α + y) l dy α ı + n r k+α 2π k α + ıy e k+α k α + ıy k α e y rg(k α+ıy) α + ıy l dy = O ( n r k+α) (4.47) euse the integrl onverges to onstnt due to the exponentil derese in y. As result, hoosing α < r + k yields su-onstnt growth in n. Thus, we hve just proven the following pproximtion lemm s speilized nd extended version of Rie s integrls (Theorem 2.2). The generl ide is skethed in Figure 4.3. Lemm 4.6 (Extended Rie s integrls). Let f(n, z) e n nlyti ontinution of f(k, n) = f k,n tht ontins the hlf line [m, ). Assume further tht for some ǫ > 0 the funtion f(n, z) hs no singulrities nywhere in the hlf plne R(z) m ǫ nd tht f(z, n) = O(n r ) for some onstnt r (i.e., f(n, z) is of onstnt growth with respet to z). Then for some onstnt N, we hve n ( ( 1) k n k k=m ) f k,n = 1 2π N 1 k=0 f(n, z) ( 1)k B (1+z) k (1) Γ( z + k)n k z dz B k! + f(n, z)n N+z Γ( z)dz + o(1), (4.48) B where B is negtively oriented ox of infinite height, right order t m ǫ, nd left order < r + N nd is n pproprite onstnt. The generl method to use Rie s integrls for sums exhiiting the exponentil nelltion effet is not new. Almost the sme pproh is tken in [Szp88] (Theorem 2.2). No proof is given for the vlidity of the pproh to pproximte the Bet

87 CHAPTER 4. APPROXIMATE TRIE SEARCH 77 funtion with Gmm funtion in [Szp88]. Therefore, we derived the neessry ounds in this setion. The two ruil points re the uniform expnsion [Fie70] nd the negligile tils. A very erly use of the method n e found in the third volume of the fmous ooks y Knuth [Knu97, Knu97, Knu98]. The differene to our pproh is tht the Gmm funtion omes into ply vi speil identity for the exponentil funtion The Averge Comptifition Numer In this setion, we nlyze the prt leled A n of eqution (4.21). It turns out tht this hs n interesting interprettion of its own. We ll A n the verge omptifition numer. With omptifition we wnt to desrie the reltion etween the trie for the set S nd the set itself. When the strings in S re inserted in the trie, ommon prefixes re merged. An edge strting t the root leled represents ll strings in S tht strt with the hrter. If there re n suh strings, the edge hides n 1 of these. Assume tht we ount ll hrters tht we enounter y trversing the omplete trie for S. Then these would e A n less hrters thn S (the numer of hrters in S). Hene, n(d + 1)/q A n is n upper ound for the verge performne of the TS lgorithm. We give rief derivtion of the symptotis for A n. This is somewht redundnt euse the pproh is very muh the sme nd the derivtion is muh esier. We give the detils nevertheless; they lso serve s primer to the tehnique. We derive reursive eqution for A n in nlogy to Setion Agin it is simpler to tth the ounting to the nodes of the trie. At the node v with n leves in the sutree, we hve n 1 hidden hrters in the edge leding to v. When ounting this wy, we overshoot the tul result y n 1. Hene, let n = A n + n 1. We then hve to sum over ll possiilities to prtition n (see eqution (4.12)). For n > 1, this yields the formul ( ) n σ n = n 1 + σ n ij, (4.49) i 1,...,i σ i 1 + +i σ=n with 0 = 0 nd 1 = 0 (euse neither the empty trie nor the trie ontining only one string hides ny hrters). The exponentil generting funtion A(z) = n 0 n zn n! is then reursively defined y σ ( z A(z) = ze z e z + A e σ) (1 σ)z (4.50) i=1 Note tht we gin hd to ommodte for the se n = 0 y dding one. Multiplying oth sides y e z we get (with Ã(z) = A(z)e z ) Ã(z) = z 1 + σ i=1 j=1 ( z Ã + e σ) z. (4.51) From the inomil trnsform (see eqution (4.17)) we ompute ã 0 = 0 nd ã 1 = 0. For n > 1, we ompre oeffiients in eqution (4.51) nd find tht ã n = ( 1)n. (4.52) 1 σ1 n

88 AVERAGE-CASE ANALYSIS OF THE TS ALGORITHM Trnslting this k, we get n = n k=2 ( ) n ( 1) k. (4.53) k 1 σ1 k This grees with eqution (4.21) euse A n = n (n 1). The lrge terms in the sum n e s ig s ( ) n ( 1) k n/2 2n 1 σ 1 k n. By the sign hnges, these exponentil vlues nel out. This is often lled n exponentil nelltion proess. For these sums, Rie s integrls re the method of hoie, nd we use the results from the 1 previous setion. Beuse = O(1) for z, we n pply Lemm 4.6 to 1 σ 1 z find tht n = 1 2π N 1 k=0 B 1 ( 1) k B (1+z) k (1) 1 σ 1 z Γ( z + k)n k+z dz k! σ 1 z n N+z Γ( z)dz + o(1), (4.54) B where B is ox-pth with the right order R(z) = 3 2. Eh term of the expnsion is smller y ftor of n nd possily more due to dditionl terms z in the generlized Bernoulli polynomils B (1 z) k (1). To evlute eqution (4.54), we ompute the residues left of the line R(z) = 3 2. Beuse B is negtively oriented, we hve to tke the negtive of the sum of the residues. We first onsider the min term 1 2πı The residues of 1 1 σ 1 z Γ( z)n z re nd k Z\{0} B 1 1 σ 1 z Γ( z)nz dz. (4.55) [ ] 1 res 1 σ 1 z Γ( z)nz, z = 0 [ ] 1 res 1 σ 1 z Γ( z)nz, z = 1 [ 1 res 1 σ 1 z Γ( z)nz, z = 1 + 2πık ] lnσ = n lnσ = 1 σ 1, (4.56) = n 2 γ 1 lnσ n n log σ n, (4.57) k Z\{0} ( n 2πık ln σ Γ 1 + 2πık ) lnσ. (4.58) The sum in the lst eqution is smll osillting term tht is ounded y smll onstnt. These sums re frequently enountered throughout this type of prolems. 1 The seond term in the pproximtion (4.54) is Γ( z + 1)nz 1 1 z 1 σ 1 z 2. Here, we hve the residues [ ] 1 z res Γ( z + 1)nz 11 1 σ1 z 2, z = 1 = 1 2 lnσ, (4.59)

89 CHAPTER 4. APPROXIMATE TRIE SEARCH 79 nd k Z\{0} [ 1 z res Γ( z + 1)nz 11 1 σ1 z 2, z = 1 + 2πık ] = lnσ ( kπı 2n 2πık ln σ Γ 1 + 2πık ) (lnσ) lnσ k Z\{0}, (4.60) whih re O(1) together. Hene, we find (rell tht we hve to tke the negtive of the sum of ll residues) tht n = n log σ n + n γ k Z\{0} lnσ + n 2πık ln σ Γ ( 1 + 2πık ) ln σ + O (1), lnσ whih proves the following lemm. Lemm 4.7 (Asymptoti ehvior of the omptifition numer). The symptoti ehvior of A n is A n = n log σ n + n γ lnσ + k Z\{0} n 2πık (4.61) ln σ Γ ( 1 + 2πık ) ln σ + O (1). (4.62) lnσ Allowing Constnt Numer of Errors We now turn to the evlution of the sum n ( ) n ( 1) Sn d k ( ) q d+1 = k σ k 1 1 σ k 1, (4.63) p k=2 t first for onstnt d. For this se, we pply Lemm 4.6. Hene, S d n = 1 2π N 1 k=0 + B B ( ) 1 q d+1 ( 1) k B (1+z) k (1) σ z 1 1 σ z 1 Γ( z + k)n k+z dz p k! 1 ( q ) d+1 n N+z Γ( z)dz + o(1), σ z 1 1 σ z 1 p (4.64) where B is gin ox-pth with the right order R(z) = 3 2. We onentrte on the min term of the pproximtion euse the other terms re smller y ftor of n or more (see lso the previous setion). By Lemm 4.6, B is negtively oriented. We simply hnge the orienttion y sustituting z for z. The integrl onsidered for the reminder of this setion is thus In d = 1 1 2πı B σ z 1 1 ( ) q d+1 σ z 1 Γ(z)n z dz, (4.65) p

90 AVERAGE-CASE ANALYSIS OF THE TS ALGORITHM where B is now positively oriented ox round ll poles right of the line R(z) = 3 2. We evlute eqution (4.65) y its residues. Their lotions in the omplex plne re depited in Figure 4.4. The poles re loted t z = 1, z = 0, z = 1 ± 2πık ln σ, nd z = log σ p 1 ± 2πık ln σ, k Z. The rel prt of the ltter rnges from 1 + log σ (σ 2 1) > 1 to one under the ssumption tht σ 2 p 1 σ 2. The ltter is fulfilled if there n e t lest one mismth nd t lest one mth (otherwise, the whole nlysis eomes trivil nywy). Lemm 4.8 (Residues t z = 1 ± 2πık ( 1 q Let g(z) := σ z 1 1 σ z 1 p re nd k Z\{0} res[g(z), z = 1] = n ln σ ). [ res g(z), z = 1 + 2πık ] = 1 lnσ lnσ ) d+1γ(z)n z. The residues of g(z) t z = 1 ± 2πık ln σ ( 1 γ lnσ log σ n + 1 ) (d + 1) + 2 q k Z\{0} (4.66) ( Γ 1 + 2πkı ) n 1 2πkı ln σ. lnσ (4.67) Proof. The residue n e derived esiest from the series deompositions. These re t z = 1 + 2πık ln σ ( 1 σ 1 z 1 q σ z 1 p = l= 1 ( lnσ) l ( B l+1 z + 1 2πık ) l (4.68) (l + 1)! lnσ 1 = lnσ ( z + 1 2πık ) 1 2 ln σ (( + O z + 1 2πık )) lnσ ) d+1 ( (d + 1)lnσ = 1 + z + 1 2πık ) q lnσ ( ( + O z + 1 2πık ) ) 2 lnσ n z = n l=0 n 2πık ln σ ( lnn) l l! (4.69) (4.70) (4.71) (4.72) ( z + 1 2πık ) l. (4.73) lnσ For the Gmm funtion, we hve pole t 1 ut not t the other points 1 + 2πık ln σ. For k 0, we hve Γ(z) = nd, for k = 0, we hve l= 1 γ ( 1) l (z + 1) l = 1 + (γ 1) + O ((z + 1)) (4.74) z + 1 γ ( 1+2πık ln σ ) l l=0 ( z + 1 2πık ) l, (4.75) lnσ

91 CHAPTER 4. APPROXIMATE TRIE SEARCH 81 Poles of ( 1 p σ z 1 p) Poles of 1 σ z 1 1 Poles of Γ(z) d+1 Rel vlues depend on p ( 1 q d+1γ(z)n Figure 4.4: Lotion of poles of σ z 1 1 σ p) z in the omplex plne. z 1 where γ x 0 l is the l-th oeffiient of the series expnsion of the Gmm funtion t the point x 0. The B l re the Bernoulli numers (see eqution (2.34)). Note tht γ ( 1) 1 = γ nd γ ( 1+2πık ln σ ) 0 = Γ ( 1 + 2πkı ) ln σ. The next term in the pproximtion series in eqution (4.64) inludes the dditionl ftor n 1 B (1 z) 1 (1) = n 1 ( 1 z 2 + 1). The ftor z nels ny single pole, leving only the simple pole t z = 1. The residues re derived in similr mnner from the series representtions, ut the order my e deresed y neling ftor z, z + 1, or similr. Furthermore, the results from the series deomposition re divided y the ftor n, thus the residues re negligile. Oserve tht res [g(z), z = 1] + [ res g(z), z = 1 + 2πık ] = d + 1 n lnσ q k Z\{0} } {{ } E[L d n] k Z\{0} n 2πık ln σ Γ ( 1 + 2πık ln σ lnσ ) n log σ n 1 2 n 1 γ lnσ n n } {{ } A n+o(1), (4.76) thus the residues t R(z) = 1 nel the first terms in the totl omplexity of the TS lgorithm (see eqution (4.21) nd Lemm 4.7). When llowing onstnt numer of errors, the omplexity depends upon the residues t log σ p 1 ± 2πık ln σ nd t zero. If log σ p 1 0 then there is single residue due to the Gmm funtion t zero. Lemm 4.9 (Residue t z = 0). ( 1 q d+1γ(z)n Let g(z) := σ z 1 1 σ p) z. If log z 1 σ p 1 0 the residue of g(z) t z = 0 is res[g(z), z = 0] = σ ( ) qσ d+1. (4.77) σ 1 1 pσ

92 AVERAGE-CASE ANALYSIS OF THE TS ALGORITHM Proof. The residue t this single pole is gin evluted y the series deomposition. For the Gmm funtion, the oeffiient of the z 1 -term is one. The other terms re just the funtion vlues t z = 0. The residue is the produt of these nd results in eqution (4.77). The remining residues t log σ p 1 ± 2πık ln σ log σ p 1 = 0) re omputed in the next step. (inluding the se z = 0 for Lemm 4.10 (Residues t z = 0, z = log σ p 1 + 2πık ( ln σ ). 1 q d+1γ(z)n Let g(z) := σ 1 z 1 σ p) z. If we hve p = σ 1, g(z) hs the z 1 residue res [g(z), z = 0] = i=0 d+1 l=0 l i j=0 (σ 1) d+1 ( 1) l B ( l+d+1) d+1 ( l + d + 1)! ( A j σ 1 ) ( ) σ j+1 ( lnn ) i j j!(i j)! 1 σ lnσ t z = 0. For k Z nd k 0 or p σ 1, g(z) hs the residues [ res g(z), z = log σ p 1 + 2πık ] lnσ l i i=0 j=0 = l=0 A j (p)( lnσ) j (1 p) j+1 n log σ p+1 2πık ln σ j! γ(0) l i 1 (lnσ) l i (4.78) d ( ) 1 p d+1 ( lnσ) l B (d+1) l+d p ( l + d)! ( lnn) i j (i j)! γ ( log σ p 1+2πık ln σ ) l i. (4.79) The B () k re generlized Bernoulli numers, the A l (x) re Eulerin polynomils, nd the γ (x 0) l re the oeffiients of the series for Γ(z) t z = x 0. Proof. We ompute the residues t z = 0 nd z = log σ p 1 + 2πık ln σ. If p σ 1, then the residue t z = 0 is given y eqution (4.77). Otherwise we hve higher order pole t z = 0. Therefore, we need to use different series expnsion for the q term ( σ z 1 p )d+1. We find the following series representtions. The pole is of order d + 1 or d + 2 (depending on whether p = σ 1 ) so we only give the strt sums nd do not other with evluting the first terms (s in the proof of Lemm 4.8). Let k = log σ p + 1 2πık ln σ. 1 σ 1 z 1 = ( ) q d+1 σ z 1 = p n z = A l (p)( lnσ) l (1 p) l+1 (z + k ) l (4.80) l! ( ) 1 p d+1 ( lnσ) l B (d+1) l+d+1 (z + k ) l (4.81) p (l + d + 1)! l=0 l= d 1 l=0 n 1+log σ p n 2πık ln σ ( lnn) l (z + k ) l. (4.82) l!

93 CHAPTER 4. APPROXIMATE TRIE SEARCH 83 For the Gmm funtion, we hve pole t zero ut not t the other points z = k. For p = σ 1 nd k = 0, we hve Γ(z) = γ (0) l z l, (4.83) nd, for k 0 or p σ 1, we hve Γ(z) = l=0 l= 1 γ ( k) l (z + k ) l. (4.84) The γ ( k) l re just the result of simple Tylor series. The relevnt poles re of order d + 1 (or d + 2 for p = σ 1 ). The residue is the sum of ll omintions of oeffiients suh tht the power of (z + k ) is 1. The series led diretly to the residues. Remrk One n show tht γ (0) l ( 1) l < ( ǫ) l s follows: By sutrting the poles t zero nd 1 from Γ(z), we get funtion Γ(z) 1 z + 1 z+1 whih is nlytil for z < 2. Beuse 1 z + 1 z+1 = 1 z l=0 ( 1)l z l, we know tht the series l=0 (γ(0) l ( 1) l )z l is onvergent with rdius of onvergene of two, whih proves our lim. We n now determine the ext omplexity for ny onstnt vlue of d y evluting the sum (possile with the help of omputer lger system). The numer of terms in equtions (4.78) nd (4.79) is polynomil in d, so the sums re more effiient thn eqution (4.20). For most purposes, the term with the lrgest growth in n should suffie. If p = σ 1, this term is σ(σ 1)d (d + 1)! (log σ n) d+1. (4.85) Otherwise this term is (1 p)d d!p d+1 (log σ n) d n ( 1+log σ p n 2πık ln σ Γ log σ p 1 + 2πık ) lnσ k Z. (4.86) The infinite sum is somewht disstisftory, ut it is ounded y onstnt (due to the ehvior of the Gmm funtion long n imginry line), nd these sums do our in lot of prolems of this kind. Note tht the negtive sign nels with the sign in eqution (4.65). The result is summrized in the following theorem. Theorem 4.12 (Serhing with onstnt ound). Let d = O(1), then ( [ ] O (log n) d+1) for p = σ 1, ) E Tn d = O ((log σ n) d n 1+log σ p for p > σ 1, O (1) otherwise. (4.87) Note tht 1 + log σ p < 1 for p < 1. Thus, for onstnt numer of mismthes nd lmost ny model, the TS lgorithm is more effiient thn the LS lgorithm.

94 AVERAGE-CASE ANALYSIS OF THE TS ALGORITHM Allowing Logrithmi Numer of Errors Wheres for smll d it is still fesile to evlute the sum in Lemm 4.10, it is not stisftory for d = f(n) s funtion of n. In this setion, we lso nswer the question for whih hoie of d the LS lgorithm is symptotilly s effiient s the TS lgorithm (on verge). The sums in Lemm 4.10 re rther hrd to evlute for growing d. We look t the integrl (4.65) s line integrl. Insted of omputing the residues, we ound the integrl, possily first sweeping over some residues. This tehnique hs strong resemlne to the pplition of the Mellin trnsform lthough we used Rie s integrls (see lso the remrks in [Szp88] nd [FS95]). The integrl I d n defined in eqution (4.65) gives the highest order term of the sum S d n of eqution (4.63). By equtions (4.46) nd (4.47) we find tht In d = Iξ,n d = 1 ξ+ı 2πı ξ ı ( 1 σ z 1 1 q σ z 1 p ) d+1 Γ(z)n z dz, (4.88) where ξ = 3 2. In ft, we n move the line of integrtion freely etween 2 nd 1 euse the funtion under the integrl is nlyti in this strip. When tking the residues t R(z) = 1 nd possily t R(z) = 0 into ount we n move the line of integrtion even further. Thus, we hve (tking into ount eqution (4.76)) Iξ,n d, for 2 < ξ < 1 Iξ,n d + E [ L d n] An + O (1), for 1 < ξ < 1 log σ p 0 I d n = I d ξ,n + E [ L d n] An + σ σ 1 ( qσ 1 pσ) d+1 + O (1), for 0 < ξ < 1 logσ p. (4.89) In the following, we show tht Iξ,n d is suliner for ertin hoies of ξ depending on q, p, nd d. To this end, we ssume tht d+1 = log σ n nd ound the integrl Iξ,n d s n exponent of n. We find tht I d ξ,n 1 = 2πı 1 2π ξ+ı ξ ı ( ) 1 q logσ n σ z 1 1 σ z 1 Γ(z)n z dz p ( ) q logσ ξ+ı nn ξ σ ξ 1 Γ(z) dz p 1 σ ξ 1 1 ξ ı ( = O n ξ+log σ q σ ξ 1 p ), (4.90) 1 euse p nd one re rel vlues, the mxim of σ z 1 1 nd re tken σ z 1 p for rg z = 0, i.e., z = ξ. This n esily e seen s follows. The mximum is tken for the minimum of the denomintor. The denomintor desries irle with enter 1 or p nd rdius σ ξ 1 in the omplex plne. The losest distne to zero is then ttined t the intersetion with the rel line left of the enter. Let ( ) 1 p E,p,ξ = log σ σ ξ 1 ξ (4.91) p e the exponent. We n ound it s follows. q

95 CHAPTER 4. APPROXIMATE TRIE SEARCH 85 Lemm For 0 < p < 1, 0, nd 1 p, there exists ξ > 2 suh tht E,p,ξ < 1. If < 1, then E,p,ξ hs minimum t ξ = log σ (1 ) log σ p 1. If 1 or ξ 2, then some vlue ξ ( 2, 1) stisfies E,p,ξ < 1. Proof. We exmine the growth of E,p,ξ with respet to ξ. The derivtives re d dξ E,p,ξ = pσ ξ+1 nd d 2 dξ 2 E,p,ξ = pσξ+1 lnσ (1 pσ ξ+1 ) 2, (4.92) whih shows tht, for > 0, the exponent hs t most one lol extreme vlue for rel ξ t ξ = log σ (1 ) log σ p 1, whih is minimum with vlue ( ) ( ) q(1 ) p log σ + log p σ + 1. (4.93) 1 Assume < 1 nd let = 1 σ x, then the exponent hs minimum t ξ = x 1 log σ p, where it tkes the vlue ( 1 σ x ) ( ( ) ) q log σ log p σ (σ x 1) + x log σ p. (4.94) With respet to x, the exponent hs derivtive ( ( ) ) q > 0, for x < log σ p σ x lnσ log σ log p σ (σ x 1) = 0, for x = log σ p < 0, for x > log σ p. (4.95) Hene, there is single extreme vlue, mximum t x = log σ p, where the exponent tkes the vlue one. For ll other vlues of x, the exponent is smller thn one. Thus, for ξ > 2, the exponent is smller thn one if x log σ p, whih is equivlent to 1 p. This proves our lim for ξ > 2 nd < 1. For ξ 2, we find tht > 1 p σ. The minimum vlue of the exponent is tken for some ǫ > 0 t ξ = 2 + ǫ. Beuse, for smll ǫ, we hve Thus, we find tht E,p, 2+ǫ < log σ ( 1 p σ 1 ǫ p ) < log σ 1 = 0. (4.96) ( 1 p ) ( ) 1 p log σ σ σ 1 ǫ + 2 ǫ. (4.97) p The derivtive of the right hnd side of eqution (4.97) with respet to p is ( ) 1 1 p ( σ log σ σ 1 ǫ + 1 p p σ) ( σ 1 ǫ ) 1 (σ 1 ǫ. (4.98) p)(1 p)lnσ The seond derivtive is ( 2 σ 1 ǫ ) 1 ( σ (σ 1 ǫ 1 p ) ( (σ 1 ǫ 1) ( (1 p) + (σ 1 ǫ p) ) ) p)(1 p)lnσ σ (σ 1 ǫ p) 2 (1 p) 2, lnσ (4.99)

96 AVERAGE-CASE ANALYSIS OF THE TS ALGORITHM whih is smller thn zero for p < σ2 ǫ + σ 2σ 1 ǫ 1 + 2σ σ 1 ǫ ǫ 0 σ. (4.100) Therefore, the first derivtive is deresing for p < σ ǫ, nd it is mximl t p = 0. Here we hve vlue of (1 ǫ)lnσ σ + σ ǫ σ lnσ ǫ 0 lnσ σ + 1 σ lnσ, (4.101) whih is smller thn zero for σ 2. Therefore, the right hnd side of eqution (4.97) is deresing in p, nd the mximl vlue is ttined t p = 0, where we hve vlue of one. Beuse p < 1, we hve n exponent of E,p, 2+ǫ < 1 for some smll ǫ > 0. Finlly, for 1 there is no minimum vlue ξ. The derivtive of the exponent E,p,ξ with respet to ξ is σ ξ 1 σ ξ 1 p 1 1 1, (4.102) 1 pσξ+1 whih is stritly positive for ξ > 1 log σ p. Thus, the minimum is ttined t the smllest possile ( vlue ) for ξ in the intervl ( 2, ), sy t ξ = 2 + ǫ. Agin the term log 1 p σ is negtive so tht σ 1 ǫ p ( ) ( ) 1 p 1 p E,p,1 ǫ = log σ σ 1 ǫ + 2 ǫ log p σ σ 1 ǫ + 2 ǫ. (4.103) p With respet to p, the right hnd side hs the derivtive σ 1 ǫ 1 (σ 1 ǫ p)(1 p)lnσ < 0, (4.104) so the mximl vlue is gin otined t p = 0, where we lredy know tht the exponent hs vlue of one. Hene, we hve E,p, 2+ǫ < 1 for p > 0. This finishes the proof. The ounds immeditely yield the following theorem. Theorem 4.14 (Serhing with logrithmi ound). If d + 1 = log σ n, then we hve { [ ] E Tn d o(n), for < q = Θ(n log n), for > q. (4.105) Proof. By Lemm 4.13 nd eqution (4.90), we n ound Iξ,n d. The minimum of the exponent if it exists is t ξ = log σ (1 ) log σ p 1 < log σ p 1. We mke se distintion on ξ. Cse 1. The minimum ξ exists nd ξ > 1. Here, we move the line of integrtion to ξ = ξ. By Lemm 4.13, we hve Iξ,n d = o(n). By eqution (4.89), we find tht [ ] E Tn d = o(n), (4.106)

97 CHAPTER 4. APPROXIMATE TRIE SEARCH 87 if we n ound the term ( ) σ qσ d+1 (4.107) σ 1 1 pσ for ξ > 0 y o(n). Note tht ξ > 0 implies < 1 pσ nd lso p < σ 1 euse > 0. We ound eqution (4.107) y ( ) σ qσ d+1 = σ 1 p σ 1 1 pσ σ 1 nlog σ σ 1 p < σ 1 p σ 1 n(1 pσ)log σ σ 1 p. ( ) (4.108) The exponent (1 pσ)log 1 p σ n e ounded y one from ove: Its σ 1 p derivtive with respet to p is ( ) 1 p σ log σ σ 1 + σ 1 σ 1 p (1 p)lnσ, (4.109) whih is stritly less thn zero euse 1 p σ 1 p > σ nd 1 σ 1 1 p < 1 (note gin tht p < σ 1 ). At p = 0, the exponent hs vlue one. Hene, it must e smller for p > 0. Cse 2. The minimum ξ exists nd 2 < ξ < 1. We move the line of integrtion to ξ nd find tht [ ] [ ] E Tn d = E L d n A n + o(n) = Θ(n log n) (4.110) euse ξ < 1 implies > q, nd so [ ] ( ) E L d logσ n n A n = n log q σ n + O (1) for ǫ = q 1. = ǫn log σ n + O (n) = Θ(n log n) (4.111) Cse 3. No minimum ξ exists or the minimum exists with ξ < 2. In this se, we hve either > 1 or ξ < 2, nd we move the line of integrtion to vlue 2 < ξ < 1 where, y Lemm 4.13, we find tht Iξ,n d = o(n). Beuse ξ < 2 implies > 1 pσ 1 > q nd the non-existene of minimum implies > 1 > q, eqution (4.110) nd s onsequene eqution (4.111) lso hold. As result, we hve liner-logrithmi ehvior in Cses 2 nd 3 nd suliner ehvior in Cse 1. The ltter ours if nd only if ξ > 1, whih is equivlent to > q. Figure 4.5 illustrtes the proof ide Remrks on the Complexity of the TS Algorithm To guide the seletion etween the two lgorithms nd to improve the understnding of the TS lgorithm, we detil the exponent ǫ in the suliner growth term O(n ǫ ) = o(n) nd the influene of the error model.

98 AVERAGE-CASE ANALYSIS OF THE TS ALGORITHM Poles of Poles of 1 1 σ z 1 Poles of Γ(z) 1 p ( σ z 1 p ) d+1 Minimum depends on p nd * ξ Rel vlues depend on p Figure 4.5: Illustrtion for Theorem Expliit Computtion of the Exponent It is possile to determine onrete vlues for the exponent of n when E[Tn] d hs suliner growth. This is the se for 0 < < 1 p. Either y Lemm 4.13 or y Lemm 4.9, the omplexity is ounded y O(n E,p,min(ξ,0)) where { ( logσ (1 ) 1 σp 1 (1 p) ) for 1 pσ < < 1 p, E,p,min(ξ,0) = ( log σ (1 p) σ (1 pσ) ) for 0 1 pσ. (4.112) The first line is from Lemm 4.13, the seond, for the se ξ > 0, follows from Lemm 4.9. The seond term is lrger thn the first euse ( log σ (1 p) σ (1 pσ) ) ( log σ (1 ) 1 σp 1 (1 p) ) ( ) ( ) pσ 1 = log σ + log (1 pσ)(1 ) σ σp hs vlue zero t = 1 pσ nd derivtive (4.113) ( ) pσ > 0 for > 1 pσ, log σ = 0 for = 1 pσ, (1 pσ)(1 ) < 0 for < 1 pσ. (4.114) Thus, there is lol minimum t = 1 pσ. On the other hnd, the seond term is hidden in the first for > 1 pσ. The exponent is growing in p (whih should not e surprising). The first term hs derivtive 1 p 1 p with respet to p, whih is positive for < 1 p. The seond, whih is lwys positive. term s derivtive with respet to p is 1 p + σ 1 σp For p = σ 1 (Hmming distne), the exponent is H() ln σ + log σ (σ 1), where H(x) = xln(x) (1 x)ln(1 x) is the entropy funtion. The exponent is zero t

99 CHAPTER 4. APPROXIMATE TRIE SEARCH 89 = 0 nd one t = 1 p. Thus, for σ = 2, we hve sled entropy funtion. With growing σ, the liner prt domintes more nd more while the entropy prt diminishes, i.e., the ehvior of the exponent onverges to liner funtion. As result, we n estimte tht the running time n e ounded y funtion etween O(n H()/ln 2 ) nd O(n ) Reltion etween the Error Proility nd the Numer of Errors For the LS lgorithm, we n doule the numer of llowed errors nd the error proility nd get the sme symptoti running time. For the TS lgorithm, in the prmeter rnge with suliner ehvior, this is different. Here, higher vrine in the length tht string is mthed y the pttern inreses the expeted running time. This ehvior is explined y the differene in svings or expenses of n erly or lte end in mthing pttern ginst string. Beuse of the trie struture, hrters t the eginning of the strings re more ompted (see Setion 4.3.3) wheres hrters t the end re less ompted. Thus, not ompring some hrters erly in the string in exhnge for ompring some hrters more towrds the end of nother string is disdvntgeous. The svings in the erlier hrters orrespond to fewer nodes thn the extr ost in the hrters towrds the end. In our formul for the exponent given y eqution (4.112), this n e seen when repling y t nd q y qt: ( log σ(1 tq) 1 t (tq) t σ E t,1 tq,min(ξ,0) = ( ) tqσ tlog σ (1 t) 1 t (t) t ) 1 (1 tq)σ, for 1 (1 tq)σ < tq, for 0 < 1 (1 tq)σ. (4.115) The derivtive (with respet to t) of the ove is lwys negtive when t,, nd q result in vlid prmeters. For the seond line, we hve the derivtive ( ) tqσ log σ 1 σ + tqσ } {{ } 0 + t lnσ }{{} 0 1 σ + tqσ tqσ } {{ } 1 For the first line, we find tht the derivtive is ) + whih is negtive if log σ (q tq) The funtion log σ ( q tq tq <0 0 {}}{ { }} { qσ (1 σ) (1 σ + tqσ) 2 } {{ } 0 < 0. (4.116) q (1 tq)lnσ, (4.117) q (1 tq)lnσ < log σ ( tq) (1 tq)lnσ. (4.118) x log σ (x tq) x (1 tq)lnσ (4.119) dereses for x q euse the derivtive with respet to x is (x tq)lnσ ( tq)lnσ, (4.120) whih is negtive or zero euse x nd 1 tq > 0.

100 AVERAGE-CASE ANALYSIS OF THE TS ALGORITHM Interprettion In summry, the verge-se omplexity of the TS lgorithm is ( O (log n) d+1), for d = O (1) nd p = σ 1 ) [ ] O ((log n) d n 1+log σ p, for d = O (1) nd p > σ 1 E Tn d = O (1), for d = O (1) nd p < σ 1 o(n), for d + 1 < q log σ n Θ(dn) = Ω(n log σ n), for d + 1 > q log σ n. (4.121) It is well known tht the verge height of trie is symptotilly equl to log σ n (see, e.g., [Pit86, Szp88]). When no more rnhing tkes ple, the TS lgorithm nd the LS lgorithm ehve the sme; oth lgorithms perform onstnt numer of omprisons on verge. If we llow enough errors to go eyond the height of the trie, they should perform similr. With n error proility of q we expet to mke qm errors on m hrters. It is therefore not ig surprise tht there is threshold t d + 1 = q log σ n. With respet to the mthing proility p, we hve different ehvior for the three ses p < σ 1, p = σ 1, nd p > σ 1. To explin this phenomenon, we look t the onditionl proility of mth for n lredy hosen hrter. If we hve p < σ 1, then the onditionl proility must e smller thn one, i.e., with some proility independent of the pttern we hve mismth nd thus restrit the serh independently of the pttern. If we hve p > σ 1, the onditionl proility must e greter thn one. Hene, with some proility independent of the pttern hrter we hve mth nd therey extend our serh. This restrition or extension is independent of the numer of errors llowed; hene, there is the dditionl ftor of n ǫ in the omplexity. It is interesting to look t some onrete nd importnt exmples of omprisonsed string distnes. We lredy mentioned the Hmming distne nd its extension with don t-re symols in Setion Another model used, e.g., in [BT03, BTG03] is kind of rithmeti distne. Here, we ssume tht Σ = [0,...,σ 1] is ordered nd tht k < σ 1 2 is onstnt. For i, j Σ let (i, j) = mx(i, j) min(i, j). Then let the rithmeti distne e defined y d (i, j) = { 1 if min((i, j), σ (i, j)) > k, 0 otherwise. (4.122) For exmple, let Σ e the disretiztion of ll ngles etween zero nd 2π, then d(i, j) mesures whether two ngles re not too fr prt. The mth nd mismth prmeters re given in Tle 4.6. Espeilly for onstnt size d, we find tht the introdution of don t res mkes huge differene in the omplexity, i.e., the omplexity jumps from O(log d+1 n) to O(n log σ (3 2/σ) log d n). For logrithmilly growing d, we lredy omputed the exponent for the Hmming distne to e H()/ lnσ+log σ (σ 1), where H(x) = xln(x) (1 x)ln(1 x) is the entropy funtion. The introdution of don t-re symols hnges the exponent to H()/ lnσ + log σ (σ 2 3σ + 2) log σ (3σ 2) + log σ (3 2 σ ), i.e., slightly smller liner term in ut with the dditionl term log σ (3 2 σ ). Note lso tht the rnge for tht we hve suliner growth is lrger for the Hmming distne.

101 CHAPTER 4. APPROXIMATE TRIE SEARCH 91 Distne p q Hmming distne 1 σ Hmming distne with don t re symols 3σ 2 σ 2 Arithmeti distne 2i+1 σ σ 1 σ σ 2 3σ 2 σ 2 σ 2i 1 σ Figure 4.6: Mth nd mismth proilities for seleted omprison-sed string distnes. 4.4 Applitions Our reserh ws lso motivted y projet for the ontrol of niml reeding vi single nuleotide polymorphisms (SNPs) [WDH + 04]. An SNP is lotion on the DNA sequene of speies tht vries mong the popultion ut is stle through inheritne. In [WDH + 04], sequene of SNPs is enoded s string to identify individuls. In prtiulr, the SNPs hve one of the types heterozygous, homozygous 1, homozygous 2, nd ssy filure, whih is enoded in n lphet of size four. Beuse of errors, the serh in the dtset of ll individuls needs to e le to del with mismthes nd don t res (orresponding to ssy filure ). The nture of the dt llows very effiient inry enoding with two its per SNP. This yields resonle fst lgorithm tht just ompres pttern to eh string in the set (the LS lgorithm desried nd nlyzed ove). On the other hnd, trie n e used s n index for the dt. Although trie hs the sme worst-se look-up time, we n expet to sve some work euse fewer omprisons re neessry. As drwk, the onstnts in the lgorithm re it higher due to the tree struture involved. To hoose the optiml lgorithm, more detiled nlysis ws needed, whih we hve presented in this hpter. In the onrete se of the ove pplition, the serh is ontrolled y ounding the numer of don t res nd the numer of mismthes. We find tht the verge-se ehvior, ounding only the don t res, is pproximtely O((log n) d n ) when llowing d don t res. Bounding only the numer of mismthes would result in Hmming distne, ut in this pplition don t re hrter nnot indue mismth. Therefore, the verge-se omplexity is pproximtely O((log n) d n ) when llowing d mismthes. This is signifintly worse thn Hmming distne lone, whih yields serh time of O((log n) d+1 ). It lso domintes the ound on the numer of don t res. When deiding whether the LS or the TS lgorithm should e used in this prolem, we find tht for d > (3/8)log 4 n 1 the LS lgorithm will outperform the TS lgorithm. The results of this hpter n lso e used to estimte the running time of the serh methods desried y Buhner et l. [BT03, BTG03]. They use generlized suffix tree uilt on the strings representing the disretized ngles of the threedimensionl struture of proteins. Serhes re performed llowing some devitions. This is ptured using the rithmeti distne defined in the previous setion, distne funtion tht we n hndle with our pproh. In prtiulr, the full rnge of 360 de-

102 APPLICATIONS grees is disretized into n lphet Σ = {[0, 15),...,[345, 360)} of size twenty-four. The lgorithm then serhes protein sustruture y onsidering ll ngles within i intervls to the left nd right, i.e., for i = 2 intervls to oth sides, the intervl [0, 15) mthes the intervls [330, 345), [345, 360), [0, 15), [15, 30), nd [30, 45). The proility of mth is thus 2i+1 Σ. In their pplition, Buhner et l. [BT03, BTG03] llow no mismth, i.e., the serh is stopped if the ngle is not within the speified rnge. The symptoti running time n thus e estimted to e O(n log Σ (2i+1) ). Although suffix tree is used, nd we do not expet the underlying distriution of ngles to e uniform nd memoryless, this result n e used s (rough) estimte, espeilly to understnd the effet of different hoies of i. The results in [AS92, JS94, JMS04], whih point out reltion etween the two trees t lest for memoryless soure, support this lim.

103 Chpter 5 Text Indexing with Errors In the lst hpter, we sw how using trie for the pproximte text indexing prolems P(Σ ) d f(n) do ll nd P(Σ ) d f(n) do pref sped up serhes over the nive method of ompring eh string with the serh pttern. The serh time, however, depended on the size of the index, respetively the text orpus. In Chpter 2, we disussed the solution to some ext text indexing prolems using (generlized) suffix trees. For this se, we were le to hieve query time tht ws independent of the size of the text orpus. In ft, the lgorithms were liner in the numer of reported hits nd the size of the serh pttern, i.e., output sensitive. This is the est we n hope for in ny text indexing prolem. In this hpter, we present dt struture nd orresponding lgorithms for text indexing tht re lso output sensitive. This ehvior omes t the ost of lrger index size. In the worst-se, the index my eome very lrge, ut we re le to ound the size on verge nd with high proility. The method n lso e modified so tht the size is ounded in the worst-se, lthough the lgorithm is only output sensitive on verge in this se. We illustrte the si ide for the pproximte ditionry indexing prolem llowing d mismthes. For set S of rdinlity n, the possile outputs for ny query re the different strings mthing pttern w. For the empty string, we report ll of S. When reding pttern string from left to right, we n (possily) exlude some strings from S for eh hrter we hve red euse they nnot mth the pttern with d or fewer mismthes. In prtiulr, the prefix we hve urrently red from the pttern must mth prefix of string with t most d mismthes. To demonstrte the priniple, ssume tht we perform the trie serh of the previous hpter in redth first insted of depth first mnner. In the first round, we hve not yet red nything, nd the possile hits re ll strings represented y the root of the tree. In lter round, we hve red prefix of length i of the pttern. The possile hits re represented y ll sutrees of nodes t depth i tht were rehed with d or fewer mismthes. In the next round, we move to those hildren of the urrent nodes for tht we still hve t most d mismthes. Unless this pplies to ll hildren, they represent suset of the strings of the previous step. We hve eliminted from our urrent set those strings elonging to hild nodes where we rehed the limit of d mismthes. The ide to hieve n optiml query time is to redue the numer of nodes tully visited to onstnt numer per exmined pttern hrter. To this end, we need more 93

104 94 nodes to represent the different serh sttes. Typilly, we expet tht fter exmining O(log n) hrters of the pttern, there should not e more thn onstnt numer of possile output ndidtes, whih we n hek in output optiml time. For pttern prefixes of length log n, there n e t most Σ log n = n log Σ different sttes. We show tht the numer of sttes is even smller on verge, thus yielding n expeted index size of O(n log d n). We elorte the ide y distinguishing ses for the numer of errors i from zero to d. For i errors, there is threshold h i for tht no two strings in the set n e mthed y the sme string of length h i with i errors. If the pttern is lrger, there is t most one ourrene tht we n hek mnully. For shorter strings, we ompute nd store ll possile string sets in trie disounting the ses we lredy solved for smller i. These tries re lled error trees. The threshold h i is the height of the i-th error tree. At first glne, the ide my e reminisent of the pproh tken in [GMRS03] with the repetition index plying the role of our threshold. Coneptully, the min differene is tht our pproh is tht of n index (or register ), i.e., if the pttern is smll, we n diretly red the ourrenes from the index struture, while the method desried in [GMRS03] works more like filter (or elertor ) tht reports res of the strings with one or more mthes. As result, in ontrst to [GMRS03], we n gurntee worst-se look-up time. The min text indexing dt struture in [CGL04] is lso sed on error trees (lled k-errt trees). In ontrst to our pproh, where ll strings re merged into single new tree, the trees in [CGL04] re reursively deomposed into entroid pths, nd merging only tkes ple long these pths. A entroid pth is pth tht ontins t eh node the edge tht leds to the sutree with the most leves. Any pth in tree with n leves n pss t most log n entroid pths. For eh pth, seprte proessing is done. On the one hnd, this ounds the index size in the worst-se. On the other hnd, this lso requires to hndle the pths seprtely during the serh. Furthermore, it omplites the proess of eliminting duplite reports of ourrenes. The nlysis of our error trees shows tht they hve logrithmi height. This is used to ound the size of the trees on verge. The intuition for this ide drws from the very detiled nlysis of digitl serh trees. In prtiulr, tries nd suffix trees hve n expeted height of O(log n) (see, e.g., [Knu98, Pit85, KP86, JS91, Szp93, Szp93]). This height reflets the length of longest repeted sustring in rndom sequenes of length n. Even when llowing some errors, the length of suh repets is still O(log n) [ŁS97]. It is thus resonle to suspet tht error trees reted y llowing onstnt numer of errors in prefixes up to the height of the suffix tree or trie ehve similrly well nd hve height of O(log n). This hpter is orgnized s follows: After the forml prolem definition, we tke loser look t the edit distne euse it is inherently miguous. We then define relxed version of Σ + -trees, whih we need to relize trde-off etween time nd spe. Our si indexing dt struture is desried next, followed y the serh lgorithm. Then the verge size nd the preproessing for the dt struture re nlyzed. Finlly, we present trde-off etween time nd spe, whih llows to ound the size in the worst-se in exhnge for wekening of the query time ound from worst-se to n verge-se ound.

105 CHAPTER 5. TEXT INDEXING WITH ERRORS Definitions nd Dt Strutures Our indexing sheme n e pplied to vriety of prolems. The two si prmeters re set S of strings, from whih we uild the index, nd riterion tht we use to selet leves from sutrees. Our si dt strutures re representtions of modified Σ + -trees sed on sets of strings. For ny set of strings S, in the modified Σ + -tree T orresponding to S, the numer of leves orresponds to the rdinlity of S. The size of the dt struture thus depends only on the rdinlity of the set. By using different sets, we re le to solve different indexing prolems. In ll ses relevnt here, the rdinlity of the set is upper ounded y the size of ny representtion of the set. Therefore, s unifying view in the desription of the lgorithm, we ssume tht we re given set S of strings, lled the se set. In prtiulr, our dt struture n e dpted to solve the following prolems: For Σ edit k o ll (pproximte text indexing with ourrene reporting), the se set S is the set of suffixes of string. For Σ edit k pos pref (pproximte text indexing with position reporting), the se set S is the set of suffixes of string. For P(Σ ) edit k do ll (pproximte ditionry querying), the se set S is set of independent strings. For P(Σ ) edit k do pref (pproximte ditionry prefix querying), the se set S is set of independent strings. For P(Σ ) edit k do sustr (pproximte doument olletion indexing with doument reporting), the se set S is set of suffixes of independent strings. For P(Σ ) edit k pos pref (pproximte doument olletion indexing with position reporting), the se set S is set of suffixes of independent strings. For P(Σ ) edit k o ll (pproximte doument olletion indexing with ourrene reporting), the se set S is set of suffixes of independent strings. Before we n detil the seletion proess, we need some more definitions regrding the edit distne, the modified Σ + -trees, nd their representtion. In the following, we ssume tht we re working with set S of strings s speified ove A Closer Look t the Edit Distne The lgorithms presented here work for edit distne with unit ost up to onstnt numer d of errors. Any sumodel of edit distne n e used s well, ut we only desrie the edit distne version. To understnd the si ides first, it might e helpful to red this hpter repling edit distne y Hmming distne. Our indexing struture is sed on the ide of omputing ertin strings tht re within speified distne to elements from our se set S or sets derived from S

106 i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i DEFINITIONS AND DATA STRUCTURES i n t e r n t i o n l i n t e r t i o n d d d d d d d d d d d d d m d d d d d d d d d d d d m m d d d d d d d d d d d m d d d d d d d d d d m d d d d d d d d d m d d d d d d d d s m d d d d d d d d d d d d d m s d d d d d d d d d d m m m s s s s s s s s d d d m s m m s s m s s m m s s m m s s s s Figure 5.1: Exmple of relevnt edit grph G rel interntionl,intertion for trnsforming the string interntionl into the string intertion. Only relevnt rs re drwn, i.e., those rs elonging to some shortest pth. The edit pths etween the two strings re mrked in old. Ansindites sustitution, ddeletion, nin insertion, nd nmmth. where the errors our in prefixes of ounded length. Therefore, we hve to estlish lose ties etween the length of miniml prefixes nd the numer of errors therein. The following definition ptures the miniml prefix length of string u tht ontins ll errors with respet to string v. Definition 5.1 (k-miniml prefix length). For two strings u, v Σ with d(u, v) = k, we define minpref k,u (v) = min { l d ( u,l, v,l+ v u ) = k nd ul+1, = v l+ v u +1, }. (5.1) Note tht, for the Hmming distne, minpref k,u (v) is the position of the lst mismth. For more detiled understnding of edit distne, it is helpful to onsider the edit grph. The edit grph for the trnsformtion of the string u into the string v ontins vertex 1 for eh pir of prefixes (s, t) with s prefixes(u) nd t prefixes(v). An 1 To void onfusion, we ll elements of the edit grph verties nd rs nd elements of the trees of our index dt struture nodes nd edges.

107 CHAPTER 5. TEXT INDEXING WITH ERRORS 97 r onnets two verties if the prefixes represented y the verties differ y t most one hrter. Eh r is leled y weight orresponding to eqution (2.9), i.e., the weight is zero if nd only if the prefixes of the soure vertex re oth extended y the sme hrter to form the prefixes of the trget vertex ( mth). Otherwise the weight of the r is one, orresponding to sustitution (digonl rs), deletion (horizontl rs), or n insertion (vertil rs). The vertex representing the empty prefixes (ε, ε) is the strt vertex nd it is leled with weight zero. Eh vertex in the edit grph is leled with the weight of lightest (or shortest if we onsider weights s distnes) pth onneting it to the strt vertex. The vertex representing oth strings (u, v) ompletely is the end vertex. Its lel is the edit distne etween oth strings. We ll n r relevnt if it lies on shortest pth from the strt vertex to nother vertex. The relevnt edit grph ontins only the relevnt rs. We denote the relevnt edit grph for trnsforming u into v y G rel u,v. An edit pth is ny pth in the relevnt edit grph onneting the strt with the end vertex. Figure 5.1 shows n exmple. Eh pth onneting the strt with the end vertex orresponds to miniml set of edit opertions to trnsform one string into the other. Rell tht in Setion the edit distne is defined s the miniml numer of opertions neessry to trnsform string u into nother string v. It is onvenient to define the opertors del, ins, nd su of type Σ {ε} Σ {ε}. If we hve distne d(u, v) = k for two strings u, v Σ, then there exist one or more sequenes of opertions (op 1,op 2,...,op k ) with op i {del, ins, su} suh tht v = op k (op k 1 ( op 1 (u) )). We ll u(i) = op i ( op 1 (u) ) the i-th edit stge. Eh opertion op i in the sequene hnges the urrent string t position pos (op i ). An insertion hnging u = u 1 u i 1 u i u i+1 u m into u 1 u i 1 u i u m hs position i, deletion hnging u into u 1 u i 1 u i+1 u m hs position i, nd sustitution hnging u into u 1 u i 1 u i+1 u m lso hs position i. We ll sequene ρ(u, v) = (op 1,op 2,...,op k ) of edit opertions n ordered edit sequene if the opertions re pplied from left to right, tht is, for op i nd op i+1, we hve pos (op i ) pos (op i+1 ) if op i is deletion, nd pos (op i ) < pos (op i+1 ) otherwise. There re two different pths to trnsform interntionl into intertion with four opertions in the exmple in Figure 5.1: either repleny, repleynd delete nd l; or delete n, insert, nd delete nd l. For eh set of opertions, there is lso numer of different possiilities to rete n edit sequene. There re two ordered edit sequenes: su 6,, su 7,, del 12, del 12 nd del 6, ins 7,, del 12, del 12. But the edit sequene del 12, del 12, ins 8,, del 6 lso trnsforms interntionl into intertion. Oserve how hnging the order of the opertions lso effets the positions in the string where the opertions pply. For fixed set of opertions, there is unique order (exept for swpping identil del i -opertions). The hrm of ordered edit sequenes lies in the ft tht they trnsform one string into nother in well-defined wy. Note tht there my e exponentilly (in the numer of errors) mny edit pths: Consider the strings m nd n, whih hve distne k = n m for n > m. There re ( ) n 2 k possiilities to remove the dditionl s nd thus eqully mny edit pths. Lemm 5.2 (One-to-one mpping etween pths nd edit sequenes). Let u, v Σ e suh tht d(u, v) = k. Eh ordered edit sequene ρ(u, v) =

108 DEFINITIONS AND DATA STRUCTURES (op 1,...,op k ) orresponds uniquely to n edit pth π = (p 1,...,p m ) in G rel u,v. Proof. Let π e n edit pth from the strt to the end vertex in the relevnt edit grph. We onstrut n ordered edit sequene ρ y dding one opertion for eh r with non-zero weight enountered on π. Beuse π hs weight k, there re k non-zero rs orresponding to k opertions. Let p ji e the soure nd p ji +1 the trget vertex of the r orresponding to the i-th opertion. Let the row nd olumn numers of p ji e r nd. The pth to p ji hs weight i 1, so we hve d(u,, v,r ) = i 1. The first i 1 opertions trnsform u, to v,r. The position of the i-th opertion is pos (op i ) = r + 1, euse it trnsforms u,+1 to v,r+1 ( sustitution), u,+1 to v,r ( deletion), or u, to v,r+1 (n insertion). The row numers on the pth re stritly inresing exept for the se of two deletions following immeditely one fter nother. But, for two deletions op i nd op i+1 ourring diretly in row, we hve the sme positions pos (op i ) = pos (op i+1 ). Therefore, the ordered edit sequene derived from the pth is unique. By indution on the numer of opertions, we show tht eh ordered edit sequene ρ orresponds uniquely to pth in the relevnt edit grph. The strt vertex orresponds surely to the edit sequene with no opertions. Assume tht we hve found unique pth for the first i opertions leding to vertex p with row nd olumn numers r nd in the relevnt edit grph suh tht the weight of its predeessor in the pth is smller thn i (or p is the strt vertex) nd the first i opertions trnsform u, to v,r. Thus, vertex p must e leled with weight i, whih is optiml, nd d (u,, v,r ) = i. The position of the i-th opertion (if i > 0) is either r for sustitution or n insertion, or it is r + 1 for deletion. Let the position of op i+1 e pos (op i+1 ) = r (i.e., we either reple or delete the r -th hrter, or we insert nother hrter in its ple). Note tht r r + 1 euse we hve r r + 1 if op i+1 is deletion, nd r > r (hene r r + 1) if op i+1 is sustitution or n insertion. Beuse we hve d(u,, v,r ) = i nd pos (op i+1 ) = r, the intermedite sustrings must e equl: u,+r 1 r = v r,r 1. Thus, we must hve zero-weight rs from the vertex p to vertex q with row nd olumn numers r 1 nd +r 1 r. The weight of q is optiml euse the weight of p is optiml y the indution hypothesis. The vertex q represents the prefixes v,r 1 nd u,+r 1 r, whih hve distne i. With the next opertion, the first i + 1 opertions trnsform u,d into v,s, where d = + r r nd s = r 1 for deletion, d = + r r nd s = r for sustitution, or d = + r 1 r nd s = r for n insertion. From q, there is n r orresponding to the next opertion to vertex p with row nd olumn numers s nd d: Eh r dds t most weight one. The pth to p is therefore optiml euse the existene of pth to p with less weight would prove the existene of n ordered edit sequene with fewer opertions for the two prefixes. Thus, we ould trnsform u,d into v,s with fewer thn i + 1, nd u d+1, into v s+1, with k i 1 opertions. This would ontrdit d(u, v) = k. Note tht the position of the i-th opertion op i derived from the edit pth ws pos (op i ) = r + 1, where r ws the row numer of the soure vertex of the r representing op i in the first onstrution. Thus, if the opertions hve the positions r 1 + 1,...,r k + 1, then the row numers of the soure verties of non-zero weight rs re r 1,...,r k. In the seond onstrution, we derived the existene of vertex t the strt of n r representing the (i + 1)-th opertion with row nd olumn numers

109 CHAPTER 5. TEXT INDEXING WITH ERRORS 99 r 1 nd + r 1 r, where r ws the position of the (i + 1)-th opertion. Thus, if the non-zero weight rs on the edit pth strt t the row numers r 1,...,r k, then the opertions hve the positions r 1 + 1,...,r k + 1. As result, we hve one-to-one mpping. Now, we n mke the desired onnetion from the sequene of opertions to the k-miniml prefix lengths. Lemm 5.3 (Edit stges of ordered edit sequenes). Let u, v Σ e suh tht d(u, v) = k. Let ρ(u, v) = (op 1,...,op k ) e n ordered edit sequene, nd let u(i) = op i ( op 1 (u) ) e the i-th edit stge. If we hve minpref i,u(i) (u) > h + 1, then there exists n j > h with v,j = u(i 1),j. Proof. By Lemm 5.2, there is unique pth π(u, v) in the relevnt edit grph orresponding to the sequene ρ(u, v). The sme holds for ny susequene ρ i (u, u(i)) = (op 1,...,op i ). Beuse there re no more opertions when trnsforming u into u(i), the remining pth tht is not identil with the pth for ρ(u, v) must e mde up of zero-weight rs. Let p e the soure vertex of the r for the i-th opertion on the edit pth in the relevnt edit grph G rel u,u(i). The vertex p must hve weight i 1. Let q e the trget vertex of the r for the (i 1)-th opertion. Up to vertex q, the edit pths in the relevnt edit grphs G rel u,u(i) nd Grel u,u(i 1) re equl. Thus, they re equl up to vertex p euse there re only zero-weight rs etween q nd p in the relevnt edit grphs trnsforming u into u(i). Furthermore, the sme pth is lso found in the relevnt edit grph G rel u,v. Let the row nd olumn numers of p e r nd. Thus, p represents the prefixes u, nd v,r = u(i),r = u(i 1),r. (5.2) Let p e the next vertex in the edit pth following p, nd let p hve row nd olumn numers r nd. Beuse p hs weight i, the distne etween the represented prefixes is d(u(i),r, u, ) = i. Hene, we hve minpref i,u(i) (u) r. Beuse r r + 1, we find tht h + 1 < minpref i,u(i) (u) r r + 1, (5.3) nd thus we hve r > h. Equtions (5.2) nd (5.3) prove our lim. When looking t the edit grph, one n lso see how it is possile to ompute the edit distne in time O(mk) for pttern of length m nd k errors. Beuse eh vertil or horizontl edge inreses the numer of errors y one, there n e t most k suh edges in ny edit pth from the strt to the end vertex. Therefore, we only hve to onsider 2k + 1 digonls of length m. We ll this k-ounded omputtion of the edit distne (see lso [Ukk85]) Wek Tries We silly ssume tht ll Σ + -trees re represented y ompt tries or y the similr wek tries. Therefore, we onsider the strings in underlying sets to e independent nd mke no use of ny intrinsi reltions mong them. A suffix tree for the string t

110 DEFINITIONS AND DATA STRUCTURES n lso e regrded s trie for the set S = suffixes(t), whih n e represented in size O( t ). The mjor dvntge of the suffix tree is tht (y using the properties of S) it n e uilt in time O( t ), wheres uilding trie for suffixes(t) my tke time O(l t ), where l is the length of the longest ommon repeted sustring in t. For our index dt struture, this dditionl ost will only e mrginl. Therefore, for the onstrution, we just regrd ll Σ + -trees s ompt tries nd mke no use of internl reltions mong the strings. In the following, let T (S) denote the ompt trie for the string set S Σ +. To redue the size of our index in the worst-se, we lter restrit the serh to ptterns with mximl length l. We then only need to serh the tries up to depth l. The struture from this threshold to the leves is not importnt. This onept is ptured in the following definition of wek Σ + -trees. Definition 5.4 (l-wek Σ + -tree). For l > 0, n l-wek Σ + -tree T is rooted, direted tree with edge lels from Σ +. For every node p with depth(p) < l, there is t most one outgoing edge for eh Σ whose lel strts with the hrter (relxed unique rnhing riterion). Although there is no one-to-one orrespondene etween strings nd nodes in the l-wek Σ + -tree, the definitions for the pth of node, the depth of node, nd the word set of the tree s given in Setion pply. A wek trie is the representtion of ompt wek Σ + -tree for set of strings S. 2 Indeed, s rgued in Setion 2.1.2, if we use virtul nodes, it mkes no oneptul differene whether we omptify our trees or not. Thus, we will not e too rigid on this sujet where it is not needed. Definition 5.5 (l-wek trie). For set of strings S, the l-wek trie W l (S) is ompt implementtion of n l-wek Σ + -tree suh tht it ontins extly one lef for eh string in S, nd there is either single rnhing node t depth l, under whih ll leves re tthed, or there is only single lef nd no rnhing fter depth l. In ny se, the lrgest depth of rnhing node in n l-wek trie is l. Figure 5.2 shows exmples of wek tries derived from the exmples in Setion The height of the trie in Figure 5.2() is three, thus the 3-wek trie is the sme s the ompt trie. Beuse mxpref(s) is the height of the trie for the string set S, the wek trie W mxpref(s) (S) is just the ompt trie T (S). The l-wek trie for set S n esily e implemented in size O( S ) y representing edge lels with pointers into the strings of S: Eh string in S genertes lef, nd so there re t most O( S ) leves, O( S ) inner nodes, nd O( S ) edges. Definition 5.5 gurntees tht wek tries re unique with respet to string set S: Beuse the trie is ompt (y the unique rnhing riterion), the nodes up to depth l re uniquely defined y the prefixes of length l of the strings in S. Eh element in S hs prefix of length l tht uniquely orresponds to pth of length l. If there is more thn one string with ertin prefix u, then these re ll represented y leves under node p with pth(p) = u. 2 The term wek ompt trie would e more ext, ut we prefer to use the shorter term.

111 CHAPTER 5. TEXT INDEXING WITH ERRORS e id 2 8 r ll e id 2 8 r ll er ell id u 9 ll y u 9 ll y ull uy s 13 ell to k p s 13 t ell ok op s 13 ell tok top () Compt trie () 2-Wek trie () 1-Wek trie Figure 5.2: Exmples of ompt trie, 1-wek trie, nd 2-wek trie for the prefix-free set of stringser,ell,id,ull,uy,sell,stok,stop. 5.2 Min Indexing Dt Struture In order to nswer query for ny of our indexing prolems, we hve to find prefixes of strings in S tht mth the serh pttern w with t most k errors. Assume tht S Σ nd w Σ re given. We ll s S k-error length-l ourrene of w if d(s,l, w) = k. We then lso sy tht w mthes s up to length l with k errors. We will omit the length l if it is either ler from the ontext or irrelevnt. To solve the indexing prolems defined in Setion 5.1 in optiml time, we need to e le to find ll d-error ourrenes of the serh pttern w in time O( w ). For tht purpose, we will define error trees so tht leves in ertin sutrees ontin the possile mthes. From these, we selet the desired output using rnge queries. On n strt level, the si ide of the index for d errors is the following: A single string s S n e ompred to the pttern w to hek whether w mthes prefix of s with t most d errors in time O(d w ) s desried in Setion On the other hnd, preomputed index of ll strings within edit distne d of S (e.g., trie ontining ll strings r for whih r mthes prefix of string s S with t most d errors) llows liner look-up time. Both methods re reltively useless y themselves: The index would e prohiitively lrge, wheres ompring the pttern to eh string would tke too long. Therefore, we tke lned pproh y preomputing some of the neighoring strings nd diretly omputing the distne for others. In prtiulr, we indutively define sets of strings W 0,...,W d, where, for 0 i d, the set W i onsists of strings with edit distne i to some string in S. We egin with S = W 0. Prmeters h 0,...,h d re used to ontrol the preomputing. As n exmple, W 1 onsists of ll strings r = u v suh tht d(r, s) = 1 for some s = uv W 0 nd u h 0 +1, i.e., we pply ll possile edit opertions to strings s S tht result

112 MAIN INDEXING DATA STRUCTURE in modified prefix of length h The prtitioning into d + 1 sets llows serhing with different error ounds up to d Intuition The intuitive ide is s follows: If the serh pttern w mthes prefix s of some string in S with no errors, we find it in the trie T for S. If w mthes s with one error, then there re two possiilities depending on the position of the error: Either the error lies ove the height h 0 of the trie T or elow. If it lies ove, we find prefix of w in the trie rehing lef edge. Thus, we n simply hek the single string orresponding to the lef y k-ounded omputtion of the edit distne (with k = 1) in time O( w ). Otherwise the error lies elow the height of the trie nd is overed y preomputing. For this se, we ompute ll strings r tht re within distne one to string in S suh tht the position of the error is in prefix of length h of r. At eh position, we n hve t most 2 Σ errors ( Σ insertions, Σ 1 sustitutions, one deletion). Thus, for eh string S, we generte O(h 0 ) new strings (rememer tht we onsider Σ to e onstnt). The new strings re inserted in trie T with height h 1. For the seond se, we find ll mthes of w in the sutree T w. 3 We extend this sheme to two errors s follows. Assume tht w mthes prefix s prefixes(s) of string in the se set with two errors. There re three ses depending on the positions of the errors (of n ordered edit sript). If the first error ours ove the height h 0 of the trie T, then we find prefix of w in T leding to lef edge. Agin, we n hek the single string orresponding to this edge in time O( w ) s ove. If the first error ours elow height h 0 + 1, then we hve inserted string r into the trie T tht reflets this first error. We re left with two ses: Either the seond error ours ove the height h 1 of T or elow. In the first se, there is lef representing single string 4, whih we n hek in time O( w ). Finlly, for the se tht the seond error ours elow h 1, we generte O(h 1 ) new strings for eh string in T in the sme mnner s efore. The new strings re inserted in trie T with height h 2. Agin, we find ll mthes of w in the sutree T w. The ide n e rried on to ny (onstnt) numer of errors d, eh time mking se distintion on the position of the next error Definition of the Bsi Dt Struture We now rigidly ly out the ide. There re some more ostles tht we hve to overome when generting new strings from set W i : We hve to void generting string y doing reverse error, i.e., undoing n error from previous step. In ddition, to ese representtion, we llow ll errors on prefixes of strings tht led to new prefix of mximl length h i In prtiulr, this lso ptures ll opertions with positions up to h i (See the proof of Lemm 5.2: The position of n opertion orresponds to one plus the row numer of the soure vertex in the edit grph. The resulting string hs length t most one plus this row numer.) The key property of the following indutively defined sets W 0,...,W d is tht we hve t lest one error efore 3 Note, though, tht some leves in T w my not e one-error mthes of w euse the error my e fter the position w. We lter employ rnge queries to selet the orret suset of leves. 4 We hve to relx this lter so tht lef in the trie for i errors might represent 2i + 1 strings.

113 CHAPTER 5. TEXT INDEXING WITH ERRORS 103 h 0 + 1, two errors efore h 1 + 1, nd so on until we hve d errors efore h d in W d. Definition 5.6 (Error sets). For 0 i d, the error set W i is defined indutively. The first error set W 0 is defined y W 0 = S. The other sets re defined y the reursive eqution W i = Γ hi 1 (W i 1 ) S i, (5.4) where Γ l is n opertor on sets of strings defined for A Σ y Γ l = {op(u)v uv A, op(u) l + 1,op {del, ins, su}}, (5.5) the set S i is defined s S i = {r there exists s S suh tht d (s, r) = i} (5.6) (i.e., the strings tht re within distne i of the string in S), nd h i re integer prmeters (tht my depend on W i ). We sy tht r W i stems from s S if r = op i ( op 1 (s) ). If the se set S ontins suffixes of string u, then the sme string r W i my stem from multiple strings s, t S. For exmple, if we hve t = v nd s = v, then s hs distne one to oth t nd s. When generting W i from W i 1, it is therefore neessry to eliminte duplites. On the other hnd, we do not wnt to eliminte duplites generted from independent strings. Therefore, we introdue sentinels nd extend the distne funtion d. Eh string is ppended with different sentinel, nd the distne funtion is extended in suh wy tht the ost of pplying n opertion to sentinel is t lest 2d + 1. To void the introdution of sentinel for eh string, we n use logil sentinels: We define the distne etween the sentinel nd itself to e either zero if the sentinels ome from the sme string, or to e t lest 2d + 1 if the sentinels ome from two different strings. With the sheme we just desried, we n eliminte duplites y finding ll those newly generted strings t nd s tht stem from the sme string or from different suffixes of the sme string u W i 1, hve the sme length, nd shre ommon prefix. The following property will e useful to devise n effiient lgorithm in the next setion. Lemm 5.7 (Error positions in error sets). Assume tht the string r W i stems from the string u S y sequene of opertions op 1,...,op i. If the prmeters h i re stritly inresing for 0 i d, then r hi 1 +2, = u hi 1 +2, nd r hi +1, = u hi +1,. Proof. We lim tht ny hnged, inserted, or deleted hrter in r ours in prefix of length h i h i of r. By ssumption, r = op i ( op 1 (u) ). Let r(j) = op j ( op 1 (u) ) e the edit stges. In the j-th step, the position of the pplied opertion is pos (op j ), thus op j hs hnged or inserted the pos (op j )-th hrter in string r(j), or it hs deleted hrter from position pos (op j ) of the string r(j 1). We denote this position y p j (for r(j 1) we hve p j = pos (op j ), ut the position hnges with respet to lter stges). Conseutive opertions my influene p j, i.e., deletion n derese nd n insertion inrese p j from r(j) to r(j + 1) y one.

114 MAIN INDEXING DATA STRUCTURE Thus, fter i j opertions, p j pos (op j ) + i j. By Definition 5.6, we hve pos (op j ) h j Therefore, in step i, we hve p j h j i j. Beuse the h i re stritly inresing, we hve h j h j h j+1 1 h i i + j h i i + j. As result, for ny position in r, where hrter ws hnged, we hve p j h i h i. To serh the error sets effiiently nd to filitte the elimintion of duplites, we use wek tries, whih we ll error trees. Definition 5.8 (Error trees). The i-th error tree et i (S) is defined s the wek trie W hi (W i ). Eh lef p of et i (S) is leled with (id s, l), for eh string s S, where id s is n identifier for s nd l = minpref i,pth(p) (s). To pture the intuition first, it is esier to ssume tht we set h i = mxpref(w i ), i.e., the mximl length of ommon prefix of ny two strings in W i. For this se, the error trees eome ompt tries. A lef my e leled multiple times if it represents different strings in S (ut only one per string). For i errors, the numer of suffixes of string t tht my e represented y lef p is ounded y 2i+1: The lef p represents pth pth(p) = u nd ny string mthing u with i errors hs to hve length etween u i nd u +i. We ssumed tht i d is onstnt, thus lef hs t most onstntly mny lels. The lels n esily e omputed while eliminting duplites during the genertion of the set W i Constrution nd Size To llow d errors, we uild d + 1 error trees et 0 (S),...,et d (S). We strt onstruting the trees from the given se set S. Eh element r W i is implemented y referene to string s S nd n nnottion of n ordered edit sequene tht trnsforms u = s,l+ s r into v = r,l for l = minpref i,r (s). The i-th error set is impliitly represented y the i-th error tree. We uild the i-th error tree y generting new strings with one dditionl error from strings in W i 1. Beuse we tthed n ordered edit sequene to eh string, we n esily void undoing n erlier opertion with new one. The detils re given in the next lemm. Lemm 5.9 (Constrution nd size of W i nd et i (S)). For 0 i d, ssume tht the prmeters h i re stritly inresing, i.e., h i > h i 1, nd let n i = W i 1. The set W i n e onstruted from the set W i 1 in time O(n i 1 h i 1 h i ) nd spe O(n i 1 h i 1 ) yielding the error tree et i (S) s yprodut. The i-th error tree et i (S) hs size O( S h 0 h i 1 ). Proof. We first prove the spe ound. For eh string r in W i, there exists t lest one string r in W i 1 suh tht r = op(r ) for some edit opertion. For string r in W i 1, there n e t most 2 Σ (h i 1 + 2) strings in W i : Set u = r,h i 1 +1, then we n pply t most Σ (h i 1 + 1) insertions, ( Σ 1)(h i 1 + 1) sustitutions, nd (h i 1 + 2) deletions. Hene, we hve W i 2 Σ (h i 1 + 2) W i 1. Let W i e the multi-set of ll strings onstruted from W i 1 y pplying ll possile edit opertions tht result in modified prefixes of length h i We hve to void

115 CHAPTER 5. TEXT INDEXING WITH ERRORS 105 undoing ny erlier opertion. This n e omplished y ompring the new opertion with the nnotted edit sequene. Note tht we my onstrut the sme string from multiple edit sequenes (lso inluding the ordered edit sequene). To onstrut W i from W i, we hve to eliminte the duplites. By Lemm 5.7, if the string r W i stems from the string u S, then we hve r hi +1, = u hi +1,. If r, s W i re equl, then they must either oth stem from the sme string u S, or they re oth suffixes of the sme string u. Beuse there re no errors fter h i, we hve r hi +1, = s hi +1, = u hi +1,. Note tht h i + 1 i u hi +1, h i +1+i, so there n e t most 2i+1 different suffixes for ny string u. To eliminte duplites, we uild n h i -wek trie y nively inserting ll strings from W i. Let n e the numer of independent strings tht were used to uild the se set S, i.e., ll suffixes of one string ount for one. Oviously, we hve n S. We rete n(2d + 1) ukets nd sort ll leves hnging from node p into these in liner time. All leves in one uket represent the sme string. For suffixes, we selet one lef nd ll different lels, therey lso determining the i-miniml prefix length. For ukets of other strings, we just selet the one lef with the lel representing the i-miniml prefix. After eliminting the surplus leves, the wek trie eomes et i (S). Building the h i -wek trie for W i tkes time O(h i W i ) = O(n i 1h i 1 h i ), nd eliminting the duplites n e done in time liner in the numer of strings in W i. The size of the i-th error tree is liner in the numer of lef lels nd thus ounded y the size of W i. Iterting W i = O(h i 1 W i 1 ) leds to O( S h 0 h i 1 ). We hoose the prmeters h i either s h i = mxpref(w i ) or s h i = h + i for some fixed vlue h tht we speify lter. By oth hoies, we stisfy the ssumptions of Lemm 5.9 s shown y the following lemm. Lemm 5.10 (Inresing ommon prefix of error sets). For 0 i d, let h i = mxpref(w i ) e the mximl prefix of ny two strings in the i-th error set, then we hve h i > h i 1. Proof. We prove y indution. Let r nd s e two strings in W i 1 with mximl prefix r,hi 1 = s,hi 1 = u for some u Σ h i 1. Beuse r nd s re not identil, we hve r = uv nd s = uv for some strings v, v Σ nd, Σ. We hve Σ 2; therefore, Γ(W i 1 ) ontins t lest the strings uv, uv, uv, uv, uv, uv, uv, nd uv. By Lemm 5.7, for ny string r W i 1 tht stems from t S, we hve r hi 1 +1, = t hi 1 +1,, thus no hrter t position greter or equl to h i n hve een hnged y previous opertion. Hene, pplying ny opertion t h i retes new string tht hs distne i to some string in S. Both uv nd uv were reted y inserting hrter t h i 1 + 1, so they do not undo n opertion nd elong to W i. They oth hve ommon prefix of length t lest h i > h i 1. Therefore, we find tht h i > h i 1. For h i = mxpref(w i ), we do not need to use the ukets s desried in the proof of Lemm 5.9, ut we n rete the error trees more esily. By ssumption, two strings re either ompletely identil, or they hve ommon prefix of size t most h i. On the other hnd, no two strings hve ommon prefix of more thn h i, thus there n e t most one uket elow h i. If two strings r nd s re identil, they must stem from suffixes of the sme string u S nd the suffixes r hi+1, nd s hi+1, must

116 MAIN INDEXING DATA STRUCTURE e equl. Beuse s nd r re implemented y pointers to strings in S, we n esily hek whether they referene the sme suffix of the sme string during the proess of insertion nd efore ompring t hrter level Min Properties The sequene of sets W i simultes the suessive pplition of d edit opertions on strings of S where the position of the i-th opertion is limited to e smller thn h i Before desriing the serh lgorithms, we look t some key properties of the error sets tht show the orretness of our pproh. Lemm 5.11 (Existene of mthes). Let w Σ, s S, nd t prefixes(s) e suh tht d (w, t) = i. Let ρ(w, t) e n ordered edit sequene for w = op i ( op 1 (t) ). For 0 j i, let t(j) = op j ( op 1 (t) ) e the j-th edit stge. If for ll 1 j i we hve then there exists string r W i with minpref j,t(j) (t) h j 1 + 1, (5.7) w prefixes (r), r = op i ( op 1 (s) ), nd l = minpref i,r (s). (5.8) Proof. We prove y indution on i. For i = 0, we hve W 0 = S nd the lim is oviously true euse w = t prefixes(s) in tht se. Assume the lim is true for ll j i 1. Let r = op i ( op 1 (s) ). Beuse d(r, s) = i, we hve to show tht there exists string r W i 1 nd strings u, u, v Σ, with r = uv, r = u v, d (u, u ) = 1, nd u h i Then we hve r W i y Definition 5.6. Set l = minpref i,r (s). Beuse w nd t re prefixes of r nd s, whih lredy hve distne i, we hve l = minpref i,w (t) nd l h i y eqution (5.7). Set u = s,l+ s r, u = r,l, nd v = r l+1,. By Definition 5.1, we hve r = u v, s = uv, d (u, u ) = i, nd u = op i ( op 1 (u) ). Set u = op i 1 ( op 1 (u) ) nd r = u v, then we get r = op i 1 ( op 1 (s) ). We re finished if we n show tht r W i 1 euse we hve u = l h i nd d (u, u ) = 1. Let w = op i 1 ( op 1 (t) ), then d(w, t) = i 1. Beuse, for ll 1 j i 1, we hve minpref j,t(j) (t) h j 1 +1, we n pply the indution hypothesis nd find tht there exists string ˆr = op i 1 ( op 1 (s) ) in W i 1 with w prefixes(ˆr) nd l = minpref i 1,ˆr (s). This proves our lim euse ˆr = r. When we trnslte this to error trees, we find tht, given pttern w, the i-error length-l ourrene s of w orresponding to lef leled y (id s, l) n e found in et i (S) w. Unfortuntely, not ll leves in sutree represent suh n ourrene. The following lemm gives riterion for seleting the leves (the errors must pper efore w, i.e., if l w ). Lemm 5.12 (Ourrenes leding to mthes). For r W i, let w prefixes(r) e some prefix of r, nd let s S e string orresponding to r suh tht l = minpref i,r (s). There exists prefix t prefixes(s) suh tht d (t, w) = i nd r w +1, = s t +1, if nd only if w l.

117 CHAPTER 5. TEXT INDEXING WITH ERRORS 107 Proof. If w l, then there re strings u, v Σ with w = uv nd u = w,l. Beuse w is prefix of r, we hve r = wx = uvx. By Definition 5.1, there lso exists prefix u prefixes(s) with s = u vx nd d (u, u) = i. Hene, we hve t = u v prefixes(s), d(w, t) = d(uv, u v) = i, nd x = r w +1, = s t +1,. Conversely, ssume tht w < l nd tht there exists prefix t prefixes(s) with d(t, w) = i nd r w +1, = s t +1,. Set m = w nd rell tht w = r,m. Then s = tr w +1, nd, thus we hve t = s, s r +m. Therefore, we find tht d(r,m, s,m+ s r ) = i nd r m+1, = s m+1+ s r,, i.e., m is ndidte for minpref i,r (s), whih is ontrdition to m < l. In the error tree, this mens tht we hve n i-error ourrene of w for lef p in et i (S) w with pth(p) = r if nd only if p is leled (id s, l) with w l. Finlly, there is dihotomy tht diretly implies n effiient serh lgorithm. Lemm 5.13 (Ourrene properties). Assume w mthes t prefixes(s) for s S with i errors, i.e., d(w, t) = i. Let ρ = (op 1,...,op i ) e n ordered edit sequene suh tht w = op i (op i 1 ( op 1 (t) )). There re two mutully exlusive ses, (1) either w prefixes(w i ), or (2) there exists t lest one 0 j i for tht we find string r W j suh tht r = op j ( op 1 (s) ), nd we hve w = w,l for some w prefixes(r) nd some l > h j. Proof. Let t(j) = op j ( op 1 (t) ) e the j-th edit stge. Assume tht there exists no j suh tht minpref j,t(j) (t) > h j Then w prefixes(w i ) y Lemm Otherwise, let j e the smllest index suh tht minpref j+1,t(j+1) (t) > h j + 1. By Lemm 5.11, there exists string r W j with t(j) prefixes(r). By Lemm 5.3, there exists prefix w = w,l = t(j),l with l > h j, thus w prefixes(r). When serhing for pttern w, ssume tht either h i > mxpref(w i ) or w h i for ll i. The lst lemm pplied to error trees shows tht, if w mthes string t prefixes(s) for some s S with extly i errors, the following dihotomy holds. Cse A Either w n e mthed ompletely in the i-th error tree et i (S) nd lef p leled (id s, l) n e found in et i (S) w. Cse B Or prefix w prefixes(w) of length w > h j is found in et j (S) nd et j (S) w ontins lef p with lel (id s, l) Serh Algorithms Lemms 5.12 nd 5.13 diretly imply serh lgorithm long the se distintion mde ove. Rell tht we n uild the index effiiently if we hoose the prmeters h i either s h i = mxpref(w i ) or s h i = h + i for some fixed vlue h. The index supports serhes if either h i = mxpref(w i ) or the length of the serh pttern w is ounded y w h. For these prmeters, we n hek ll prefixes t prefixes(s) for whih Cse B pplies in time O( w ): If w h, then we n never hve w > h i for prefix w prefixes(w) nd so the se never pplies.

118 MAIN INDEXING DATA STRUCTURE Otherwise we hve h i = mxpref(w i ). In this se, the error trees eome tries nd so there is t most one lef in et i (S) w if the length of the prefix w prefixes(w) is greter thn h i. Eh lef n hve t most 2d + 1 lels orresponding to t most 2d+1 strings from S. We ompute the edit distne of w to every prefix of eh string in time O( w ) with d-ounded omputtion of the edit distne (see Setion 5.1.1). There n e t most 2d + 1 prefixes tht mth w, thus from ll d + 1 error trees, we hve t most d(2d + 1) 2 strings in totl for whih we must hek the prolem speifi onditions for reporting them. As result, we get the following lemm. Lemm 5.14 (Serh time for Cse B). If d is onstnt nd either h i = mxpref(w i ) or the length of the serh pttern w is ounded y w h min i h i, then the totl serh time spent for Cse B is O( w ). Cse A is more diffiult euse we hve to void reporting ourrenes multiple times. A string with errors ove w n our multiple times in et i (S) w. Beuse ny string t mthing the pttern w with d or fewer errors hs length w d t w +d, we n eliminte duplite reports using S (2d + 1) ukets if neessry. The min issue is to restrit the numer of reported elements for eh error tree i. If we n ensure tht no output ndidte is reported twie in eh error tree, then the totl work for reporting the outputs is liner in the numer of outputs. The strings in the se set S n e either independent or they re suffixes of string u (lso lled doument). Relling the definition for the se set mde in Setion 5.1, we hve to support four different types of seletions: 1. For the following prolems, we wnt to report eh prefix t of ny string s S tht mthes the pttern w with t most d errors: Σ edit k o ll P(Σ ) edit k o ll 2. For the following prolems, we wnt to report eh string s S for whih prefix t prefixes(s) mthes the pttern w with t most d errors: Σ edit k pos pref P(Σ ) edit k do pref P(Σ ) edit k pos pref 3. For the following prolems, we wnt to report eh string s S whih mthes the pttern w with t most d errors: P(Σ ) edit k do ll 4. For the following prolems, we wnt to report eh doument u if prefix t prefixes(s) of suffix s S of u mthes the pttern w with t most d errors: P(Σ ) edit k do sustr The si tsk is to mth the pttern in eh of the d + 1 error trees. If the omplete pttern ould e mthed, we re in Cse A. To support the seletion of the different types, we rete dditionl rrys for eh error tree. Let n i e the numer

119 CHAPTER 5. TEXT INDEXING WITH ERRORS 109 of lef lels in the i-th error tree et i (S). First, we rete n rry A i of size n i tht ontins eh lef lel nd pointer to its lef in the order enountered y depth first trversl of the error tree. For exmple, if the wek tree in Figure 5.2() were n error tree, we would first store the lels of the node 4, then ll lels of the node 6, nd so on until we would hve stored ll lels of the node 21 t the end of A i. The order of the depth first trversl n e ritrry ut must e fixed. Eh node p of the error tree is nnotted y the leftmost index left(p) nd the rightmost index right(p) in A i ontining lef lel from the sutree rooted under p. For virtul node q, left(q) nd right(q) re tken from the next non-virtul node elow q. To support the seletion of results, we rete n dditionl rry B i of the sme size. Depending on the type, B i ontins ertin integer vlues used to selet the orresponding lef lels using rnge queries. By Lemm 5.12, for reports of Type 1, we hve to selet the strings orresponding to those lels for tht the miniml prefix vlue l stored in the lel is smller thn the length of the pttern. This is hieved y setting B i [j] to l if A i [j] ontins the lef lel (id s, l). Let p e the (virtul) node orresponding to the lotion of the pttern w in the error tree et i (S). A ounded vlue rnge (BVR) query (left(p), right(p)) with ound w on B i yields the indies of lels in A i for tht l w in liner time in the numer of lels. For the reports of Type 2, we just hve to selet the strings orresponding to mthes. Oserve tht for ny lel (id s, l) found in the sutree et i (S) w there is prefix of s tht mthes w with t most i errors. Hene, we simply store in B i [j] n identifier for the string s if (id s, l) is stored in A i [j]. Let gin p e the (virtul) node orresponding to the lotion of the pttern w in the error tree et i (S). A olored rnge (CR) query (left(p), right(p)) on B i yields the different string identifiers found in et i (S) w. The reports of Type 3 require the omplete string to e mthed y the pttern w. We hve to tke re of sentinels tht we hve dded to the strings in S. Let p e the (virtul) node orresponding to the lotion of the pttern w in the error tree et i (S). Beuse we dded different sentinel for eh string in S, there is mth only if there is n outgoing edge leled y sentinel t p, nd there is one suh outgoing edge for eh different string in S. Thus, we n report the mthes diretly from the error tree. Finlly, for reports of Type 4, we wnt to report the douments of whih the strings in S re suffixes. For this se the string identifiers must ontin doument numer, nd we use the sme pproh s for Type 2, just storing doument numers in B i. Lemm 5.15 (Serh time for Cse A). If d is onstnt nd either h i = mxpref(w i ), or the length of the serh pttern w is ounded y w h min i h i, then the totl serh time spent for Cse A is O( w + o), where o is the numer of reported outputs. Proof. Mthing the pttern w in eh of the d error trees tkes time O( w ) euse we never reh the prt of the wek tries where the unique rnhing riterion does not hold. Thus, we find single (virtul) node representing w. The rnge queries (or the tree trversl for Type 3) re performed in liner time in the numer of outputs o. Eh output is generted t most d+1 times. Therefore, the totl time is O( w + o).

120 WORST-CASE OPTIMAL SEARCH-TIME For the spe usge, we hve the following ovious lemm. Lemm 5.16 (Additionl preproessing time nd spe for Cse A). The dditionl spe nd time needed for the rnge queries nd the rrys for solving Cse A is liner in the numer of leves of the errors trees. Proof. There re t most 2d + 1 lels per lef, nd so the size of the generted rrys is liner in the numer of leves. The time needed for the depth first trversls is lso liner in the rry sizes. Finlly, the rnge queries re lso prepred in time nd spe liner in the size of the rrys (see Setion 2.3.2). 5.3 Worst-Cse Optiml Serh-Time When setting h i to mxpref(w i ), our min indexing dt struture lredy yields worst-se optiml serh-time y Lemms 5.14 nd Wht is left is to determine the size of the dt struture nd the time needed for preproessing. Note tht, if S is the set of suffixes of string of length n, then lredy h 0 = mxpref(w 0 ) = mxpref(s) n e of size Ω(n). For independent strings, the worst-se size of h 0 = mxpref(s) nnot e ounded t ll in terms of n = S. Fortuntely, the verge size is muh etter nd it ours with high proility. In this setion, we derive the orresponding verge-se ounds for h i = mxpref(w i ). Together with Lemms 5.9 nd 5.16, this gives ound on the totl size nd preproessing time euse they re ll dominted y the size nd preproessing time needed for the d-th error tree: Corollry 5.17 (Dt struture size nd preproessing time). Let n e the numer of strings in the se set S. For onstnt d, the totl size of the min dt strutures is O(nh 0 h 1 h d 1 ), nd the time needed for preproessing is O(nh 0 h 1 h d 1 h d ). We show tht, under the mixing model for sttionry ergodi soures, the proility of h i deviting signifintly from log n is exponentilly smll. Using this ound, we n lso show tht the expeted vlue of h i is O(log n). If h i = mxpref(w i ) is greter thn l, there must e two different string s nd r in W i suh tht s,l = r,l. We first prove tht this implies the existene of n ext mth of length Ω( l i ) etween two sustrings of strings in S, then we ound the proility for this event. Note tht, if S ontins ll suffixes of string v, then S lso ontins v itself. Lemm 5.18 (Length of ommon or repeted sustrings). Let W 0,...,W d e the error sets y Definition 5.6. If there exists n i with h i (2i + 1)l for l > 1, then there exists string v of length v > l suh tht either v is sustring of two independent strings in S, or v = u j,j+l 1 = u j,j +l 1 with j j for some u S. Furthermore, for l 2, oth ourrenes of v strt in prefix of length 2il of strings in S. Proof. We prove the lim y indution over i. For i = 0, the lim is nturlly true euse h 0 is the length of the longest prefix etween two strings in S. This is either

121 CHAPTER 5. TEXT INDEXING WITH ERRORS 111 the longest repeted sustring in string u (if ll suffixes of u were inserted into S), or the longest ommon prefix of two independent strings. For the indution step, ssume tht we hve h j < (2j + 1)l for ll j < i nd tht h i (2i + 1)l. Let r, s W i e two strings with ommon prefix v of length v = h i (2i + 1)l, thus v = r, v = s, v. Let t (r), t (s) S e the elements from the se set orresponding to r nd s, i.e., d (t (r), r) = i nd d(t (s), s) = i. Rell tht, y Lemms 5.7 nd 5.10, r hi 1 +2, = t hi 1 +2, if r W i stems from t S. It follows tht t (r) nd t (s) shre the sme sustring w = t (r) h i 1 +2,h i = t (s) h i 1 +2,h i of length w = h i h i > (2i + 1)l (2(i 1) + 1)l 1 = 2l 1. Even if t (r) nd t (s) re the sme string u or suffixes of the sme string u, then w nnot strt t the sme position in u: Assume for ontrdition tht w = t (r) h i 1 +2,h i = t (s) h i 1 +2,h i = u k,k+ w, then we hve t (r) h i 1 +2, = u k, = t (s) h i 1 +2,, so r nd s would not rnh t h i, whih would e ontrdition. The lst lim follows from w = t (r) h i 1 +2,h i nd h i 1 (2i 1)l. Thus, w strts efore (2i 1)l + 2 2il for l 2 in t (r) nd likewise for t (s). For the nlysis, we ssume the mixing model introdued in Setion The intuition for the next theorem is s follows. The height of (ompt) tries nd suffix trees is ounded y O(log n), where n is the rdinlity of the input set for tries or the size of the string for suffix trees (see, e.g., [AS92] or [Szp00]). When llowing n error on prefixes ounded y the height, we essentilly rejoin some strings tht were lredy rnhing. The sme proess n tke ple gin with the rejoined strings. Thus, the height of the i-th error tree should ehve no worse thn i times the height of the trie or the suffix tree. Although this ound my not e very tight, we prove extly long this intuition. We onjeture tht the heights tully ehve muh etter. Theorem 5.19 (Averge dt struture size nd preproessing time). Let n e the numer of strings in the se set S, whih ontins strings nd suffixes of strings generted independently nd identilly distriuted t rndom y sttionry ergodi soure stisfying the mixing ondition. For ny onstnt d, the verge totl size of the dt strutures is O(n log d n), nd the verge time for preproessing is O(n log d+1 n). Furthermore, these omplexities re hieved with high proility 1 o(n ǫ ) (for some ǫ > 0). Proof. By Lemm 5.18, if there exists n i suh tht h i (2i + 1)l, then we find string v of length v l tht is repeted sustring of n independent string or ommon sustring of two independent strings. Let h rep e the mximl length of ny repeted sustring in single independent string in S, let h suf e the mximl length of ny repeted sustring of string u for tht we hve inserted ll suffixes into S, nd let h om e the mximl length of ny ommon sustring of two independent strings. If we ound h rep, h suf, nd h om y l, we lso ound h i y (2i + 1)l. We first turn to long sustrings ommon to independent strings. For two independent strings r nd s, let C i,j = mx{k r i,i+k 1 = s j,j+k 1 } e the length of mximl ommon sustring t positions i nd j of the two different strings. By the sttionrity of the soure, we hve Pr{C i,j l} = Pr{C 1,1 l}. The ltter is the proility tht r nd s strt with the sme string of length l, thus Pr{C 1,1 l} = w Σ l (Pr{w})2 = E[w l ]. For sttionry nd ergodi soures stisfying the mixing ondition, we hve E[w l ] = w Σ l (Pr{w})2 e 2r 2l for l

122 WORST-CASE OPTIMAL SEARCH-TIME y eqution (2.14). By Lemm 5.18, the ommon sustrings must e found in prefixes of length 2il of string in S. As result, we find Pr {h om l} Pr r,s S,1 i,j 2il r,s S,1 i,j 2il {C i,j l} Pr {C i,j l} = r,s S,1 i,j 2il 4n 2 l 2 i 2 E E [w l] [ w l] n 2 l 2 i 2 e 2r 2l (5.9) for some onstnt nd growing l. For the mximl lengths h rep nd h suf of repeted sustrings, we use known results from [Szp93], where the length h (m) of repeted sustring in string of length m is ounded y { } ( [ Pr h (m) l m l E [w l ] + me w l]) mle r 2l (5.10) for some onstnt nd growing l > ln m r 2. Beuse m 2il for h rep nd m S = n for h suf, we n ound Pr {h rep l} il 2 e r 2l, (5.11) nd { } Pr h suf l nle r 2l. (5.12) Beuse we hve h i (2i + 1)mx{h rep, h suf, h om }, we find tht { Pr {h i l} Pr mx{h rep, h suf, h om } Pr { h rep l 2i + 1 } + Pr { h suf l } 2i + 1 l 2i + 1 } + Pr { h om il 2 l l l 2 1 (2i + 1) 2e r 2i n 2i + 1 e r 2 2i n 2 l 2 i 2 e 2r 2 l } 2i + 1 l 2i+1 n 2 l 2 i 2 e r 2 2i+1, (5.13) l for some onstnt nd l > ln n r 2. We ondition on l = (1 + ǫ)2(2i + 1) ln n r 2 nd get { Pr h i (1 + ǫ)2(2i + 1) lnn } (1 + ǫ) 2 i 4 ln 2 n 2ǫ n, (5.14) r 2 for some onstnt. Thus, with high proility 1 o(n ǫ ) we hve h i = O(log n).

123 CHAPTER 5. TEXT INDEXING WITH ERRORS 113 The expeted size n e ounded y E [h i ] (1 o(n ǫ )) log n + log n + l 0 n l (1+ǫ)2(2i+1) ln n r 2 2 l 3 i 2 e r 2 n 2ǫ ( l + (1 + ǫ)2(2i + 1) lnn r 2 log n + i 5 n 2ǫ ln 3 n l 0 l l 3 e r 2 l 2i+1 ) 3 i 2 e r 2 l 2i+1 l 2i+1 = O (log n) (5.15) euse l 0 l3 e r 2 2i+1 is onvergent. Thus, the expeted size of hi is O(log n). This proves the theorem: We hve h i < h j for ll i j; therefore, it suffies to ound the preproessing time O(nh d+1 d ) y O(n log d+1 n) nd the index size O(nh d d 1 ) y O(n logd n). 5.4 Bounded Preproessing Time nd Spe In the previous setion, we hieved worst-se gurntee for the serh time. In this setion, we desrie how to ound the index size in the worst-se in trde-off to hving n verge-se look-up time. Therefore, we fix h i in Definitions 5.6 nd 5.8 to h i = h + i for some h to e hosen lter (ut the sme for ll error trees). By Lemms 5.9, 5.16, nd 5.18, the size nd preproessing time is O(nh d ) nd O(nh d+1 ), nd the index struture llows to serh for ptterns w of length w h in optiml time O( w + o). For lrger ptterns we need n uxiliry struture, whih is generlized suffix tree (see Setion 2.3.1) for the omplete input, i.e., ll strings in the se set S. The generlized suffix tree G(S) is liner in the input size nd n e uilt in liner time. We keep the suffix links tht re used in the onstrution proess. For pttern w, we ll sustring v right-mximl if v = w i,j is sustring of some string in S, ut w i,j+1 is not sustring of some string in S. The generlized suffix tree llows us to find ll right-mximl sustrings of w in time O( w ). This n e done in the sme wy s the omputtion of mthing sttistis in [CL94]. The pproh reminds of the onstrution of suffix trees: First we ompute nonil referene pir (see Setion 3.1.3) for the lrgest prefix of w tht n e mthed in the generlized suffix tree. The prefix is right-mximl. Then we tke suffix link from the se of the referene pir repling the se y the new node. After nonizing the referene pir gin, we ontinue to mth hrters of w until we find the right-mximl sustring strting t the seond position in w. This proess is ontinued until the end of w is rehed. The totl omputtion tkes time O( w ) euse we essentilly move two pointers from left to right through w, one for the order of the right-mximl sustring nd one for the se node of the referene pir. If we uild the generlized suffix tree y the lgorithm of Ukkonen [Ukk95], this proess n lso e seen s ontinuing the lgorithm y ppending w to the underlying string nd storing the lengths of the relevnt suffixes.

124 BOUNDED PREPROCESSING TIME AND SPACE Assume tht t is n i-error length-l mth of w. Then, in the relevnt edit grph, ny pth from the strt to the end vertex ontins i non-zero rs. However, we need t lest w rs to get to the end vertex. Thus, there re t lest w i zero-weight rs nd there re t lest ( w i)/(i + 1) onseutive ones. Therefore, there must e sustring of miniml length ( w i)/(i + 1) of w tht mthes sustring of t extly. Beuse we n hndle ptterns of length t most h effiiently with our min indexing dt struture, we only need to serh for ptterns of length w > h using the generlized suffix tree. For eh right-mximl sustring v of w, we serh t ll positions where v ours in ny string in S in prefix of length t most w, whih we n esily find in the generlized suffix tree using ounded vlue rnge queries (see Setions nd 2.3.2). For every ourrene, we need to serh t most w positions, whih eh tkes time O(d w ). This yields good lgorithm on verge if we set h = (d + 1)log n, where n is the rdinlity of S, euse the proility to find ny right-mximl sustring of length log n is very smll. Lemm 5.20 (Proility of mthing sustrings). Let w e pttern generted independently nd identilly distriuted t rndom y sttionry ergodi soure stisfying the mixing ondition. Let w = (d + 1)l. The proility tht there is sustring u of w of length l tht ours in prefix of length w + d of ny string in S is ounded y n( w + d) w e rmxl for some onstnt. Proof. For sttionry ergodi soure stisfying the mixing ondition, the following limit exists [Pit85]: r mx = lim mx t Σn{ln (Pr {t}) Pr {t} > 0} n n. (5.16) (Oserve tht r mx is positive) Suppose S = {s(1),..., s(n)}. Set C i,j,k = mx{r s(i) j,j+r 1 = w k,k+r 1 }. By sttionrity of the soure, Pr{C i,j,k > l} = Pr{C i,j,1 > l} = Pr{s(i) j,j+l 1 }. Beuse Pr{s(i) j,j+l 1 } mx t Σ l Pr{t}, we n pply eqution (5.16) nd find tht Pr{C i,j,k > l} e rmxl for some onstnt nd growing l. Hene, we get Pr { u > l} = Pr 1 k w,0 i n,1 j w +d C i,j,k > l n( w + d) w mx t Σ l Pr {t} n( w + d) w e rmxl. (5.17) As result, for n ritrry pttern w of length w (1 + ǫ)(d + 1)lnn, we find tht the expeted work is ounded y d( w ) 3 ( w + d)e rmxδ w ne rmx ln n = o(1), (5.18) for δ = ǫ 1+ǫ nd > 1 r mx, wheres we n find ll shorter ptterns in optiml time. The size of our dt struture is O(n log d n + N) nd the preproessing time is O(n log d+1 n + N), where N is the size of S.

125 Chpter 6 Conlusion In this work, we hve studied lgorithms nd dt strutures for text indexing. We presented the first liner-time lgorithm for onstruting ffix trees. Affix trees hve ll the pilities of suffix trees, ut they expose more struture of the string for whih they re uilt. They re ompt representtion of ll repeted sustrings of string, where the tree struture represents the type of repet, i.e., whether it is mximl, rightmximl, or left-mximl, nd the multipliity it hs. Affix trees re inherently dul. This is refleted in our lgorithm, whih n onstrut ffix trees on-line, even in oth diretions. This idiretionl onstrution my lso hve pplitions in studying the lol struture of lrge texts y strting to uild n ffix tree t point of interest, expnding to oth sides until desired property is found in the viinity of the strt point. Finlly, ffix trees llow idiretionl serh, whih is used, e.g., in [MP03]. The pths used in our lgorithm my e dditionlly helpful y hiding ompressile nodes. The downside of the ffix tree dt struture is its enormous size (see Setion 3.1.4). Wheres lot of reserh hs lredy een onduted to redue the size of the suffix tree (see, e.g., [Kur99, GKS03]), similr work on ffix trees remins to e done. For the suffix tree, lterntive dt strutures hve lso een developed. The suffix rry is one suh dt struture, ut n nlogous trnsltion of ffix trees to ffix rrys is not ovious. On the one hnd, node in suffix tree orresponds to n intervl in the suffix rry, whih is hrder to enode. On the other hnd, we re working with suffix numers in suffix rrys; therefore, it is esily possile to derive numers of shorter or longer suffixes y simple ddition or sutrtion. Still, we know of no idiretionl serh lgorithm on suffix rrys tht llows serhing pttern y expnding it in oth diretions. Our nlysis of the verge-se running time of the trie-sed pproh to ditionry indexing is vlid for wide rnge of string distnes. In omprison with the work of Cole et l. [CGL04], our nlysis shows tht the simple lgorithm is lredy very ompetitive on verge, hving similr symptoti query time. The nlysis lso sheds some light on the effiieny of different error models to onstrin serh, e.g., Hmming distne versus Hmming distne with don t re hrters. We disussed the mening of our nlysis for some pplitions in Setion 4.4. As lso mentioned there, one usully onsiders multiple riteri to guide the serh in prtie. Although our model ptures the influene of don t re hrters on Hmming distne, it 115

126 116 nnot stisftorily nswer the question of omining multiple riteri. It would e interesting to see how these intert, e.g., if the serh is stopped when ny one of set of multiple distnes etween serh pttern nd the words in the ditionry hs eome too lrge. Besides the issue of determining more prmeters suh s the vrine or onvergene properties, nother interesting question is the extent to whih the results re trnsferle to more generl proilisti models or to serh in the suffix tree. The ltter is of generl interest, euse lot of lgorithms hve een nlyzed for the trie model. Some progress hs een mde on onneting tries to suffix trees [AS92, JS94, JMS04], ut the generl prolem remins open. Frequently ppering terms in the verge-se symptoti ehvior of lgorithms on strings re infinite sums involving oeffiients of the Gmm funtion on imginry points. These led to smll osilltions of ounded mplitude, lthough it is diffiult to give tight ound. An explntion for this osilltions might follow from the disrepny etween rel nd integer lengths: The expeted length of ommon prefix of n independently nd uniformly distriuted strings over the lphet Σ is log Σ n [Pit85], rel vlue. On the other hnd, strings n only hve integer lengths. The mplitude of oth osilltions, of log Σ n log Σ n nd of the sum in eqution (4.62), is exponentilly inresing. It would e interesting to derive the deeper reltion, therey simplifying mny symptotis. For mthing with onstnt numer of errors or mismthes, we desried nd nlyzed text indexing dt struture. It is the first dt struture for pproximte text indexing tht solves this prolem with n optiml worst-se query time tht is liner in the size of the pttern nd the numer of outputs. Our pproh is very flexile nd n e dpted to numer of different text indexing prolems. This flexiility is due to its simple tree struture tht llows to use rnge queries for the seletion of outputs, tool tht hs lredy een helpful for ext text indexing [Mut02]. Furthermore, we were le to nlyze the symptoti running time for proilisti model tht is generl enough to even inlude sttionry nd ergodi Mrkov hins. For Hmming distne with onstnt numer of mismthes d, the symptoti running time of the trie serh lgorithm is O(log d+1 n) on verge. Beuse we ssumed long (tully infinite) pttern, this n lso e interpreted s O(mlog d n) for lrge ptterns of length m. On the other hnd, the text indexing dt struture with (worst-se) query time hs (expeted) size O(n log d n). If one is willing to ept only verge-se ounds on the running time, this llows flexile hoie etween spending n dditionl term O(log d n) in the running time or in the index size. The est lgorithm giving worst-se ounds requires this dditionl term in the query time nd in the index size [CGL04]. For smll ptterns tht hve t most logrithmi length m = O(log n), our lgorithm is very effiient in the worst-se nd in ontrst to other work does not require ll query ptterns to hve the sme length. For lrge ptterns of size m = Ω(log d n), the method of Cole et l. [CGL04] hs n optiml liner query time for onstnt d. Both dt strutures require spe O(n log d n). It remins to e exmined whether we n find dt struture with this size nd optiml query time for ptterns of length ω(log n) o(log d n). The omintion of our methods with the entroid pth method of Cole et l. [CGL04] my lso yield some new results. The possiility to move the term O(log d n) from the omplexity of the query

127 CHAPTER 6. CONCLUSION 117 time to the index size lso hints towrds the existene of more generl time-spe trde-off. It would, for instne, e interesting to see if we ould design dt struture tht is dptle in the sense tht it hs size O(n log d 1 n) nd supports queries in time O(mlog d 2 n) for d 1 + d 2 = d errors. Furthermore, it would e very interesting to prove lower ound on the pproximte text indexing prolem. For the ext se, liner lower ound hs een proven on the index size [DLO01]. However, lower ounds for the pproximte indexing prolem do not seem esy to hieve. The informtion theoreti method of [DLO01] seems to fll short euse pproximte look-up does not improve ompression nd there is no restrition on the look-up time. Using symmetri ommunition omplexity, some ounds for nerest neighor serh in the Hmming ue hve een shown [BOR99, BR00], ut these do not pply to the se where liner numer (in the size of the pttern) of proes to the index is llowed. The lower ound in [BV02] is derived from ordered inry deision digrms (OBDDs) nd ssumes tht pttern is lwys red in one diretion. Finlly, the question of prtil relevne of the indexing shemes needs further reserh. Although we elieve tht our pproh is firly esy to implement, we expet the onstnt ftors to e rther lrge. We expet even lrger onstnt for the pproh used in [CGL04]. Therefore, methods for reduing the spe requirements re needed. To this end, it seems interesting to study whether the error sets n e thinned out y inluding fewer strings. For exmple, it is not neessry to inlude strings in the i-th error set tht hve errors ourring on lef edges of the (i 1)-th error tree. Furthermore, n effiient implementtion without trees, sed solely on suffix-rrylike strutures, seems to e possile: Arrys with ll leves re needed for the rnge minimum queries nywy, nd effiient rry pking nd serhing is possile for suffix rrys [AOK02]. For prtil purposes, we elieve tht even solution for only d 3 errors is desirle.

128

129 Biliogrphy [AC75] [AG97] [AKL + 00] [AKO02] [AKO04] [AOK02] [AS65] Alfred V. Aho nd Mrgret J. Corsik. Effiient string mthing: n id to iliogrphi serh. Communitions of the ACM, 18(6): , Alerto Apostolio nd Zvi Glil, editors. Pttern Mthing Algorithms. Oxford University Press, Oxford, Amihood Amir, Dmitry Keselmn, Gd M. Lndu, Moshe Lewenstein, No Lewenstein, nd Mihel Rodeh. Indexing nd ditionry mthing with one error. J. Algorithms, 37: , Mohmed Irhim Aouelhod, Stefn Kurtz, nd Enno Ohleush. The enhned suffix rry nd its pplitions to genome nlysis. In Pro. 2nd Workshop on Algorithms in Bioinformtis (WABI), volume 2452 of Leture Notes in Computer Siene, pges Springer, Mohmed Irhim Aouelhod, Stefn Kurtz, nd Enno Ohleush. Repling suffix trees with enhned suffix rrys. J. Disrete Algorithms, 2:53 86, Mohmed Irhim Aouelhod, Enno Ohleush, nd Stefn Kurtz. Optiml ext string mthing sed on suffix rrys. In A. H. F. Lender nd A. L. Oliveir, editors, Pro. 9th Int. Symp. on String Proessing nd Informtion Retrievl (SPIRE), volume 2476 of Leture Notes in Computer Siene, pges Springer, Milton Armowitz nd Irene A. Stegun. Hndook of Mthemtil Funtions, with Formuls, Grphs nd Mthemtil Tles. Dover, New York, [AS92] Alerto Apostolio nd Wojieh Szpnkowski. Self-lignments in words nd their pplitions. J. Algorithms, 13: , [BBH + 87] [BFC00] Anselm Blumer, Jnet A. Blumer, Dvid Hussler, Ross M. MConnell, nd Andrzej Ehrenfeuht. Complete inverted files for effiient text retrievl nd nlysis. J. ACM, 34(3): , July Mihel A. Bender nd Mrtin Frh-Colton. The LCA prolem revisited. In Pro. Ltin Amerin Theoretil Informtis (LATIN), pges 88 94, My

130 120 BIBLIOGRAPHY [BFCP + 01] [BG96] [BGW00] [BK03] Mihel A. Bender, Mrtin Frh-Colton, Giridhr Pemmsni, Steven Skien, nd Pvel Sumzin. Lest ommon nestors in trees nd direted yli grphs. Preliminry version sumitted to the Journl of Algorithms, Gerth Stølting Brodl nd Leszek G sienie. Approximte ditionry queries. In Pro. 7th Symp. on Comintoril Pttern Mthing (CPM), volume 1075 of Leture Notes in Computer Siene, pges 65 74, Adm L. Buhsum, Mihel T. Goodrih, nd Jeffery Westrook. Rnge serhing over tree ross produts. In Pro. 8th Europen Symp. on Algorithms (ESA), volume 1879, pges , Stefn Burkhrdt nd Juh Kärkkäinen. Fst lightweight suffix rry onstrution nd heking. In Pro. 14th Symp. on Comintoril Pttern Mthing (CPM), volume 2676 of Leture Notes in Computer Siene, pges Springer, [BKML + 04] Dennis A. Benson, Ilene Krsh-Mizrhi, Dvid J. Lipmn, Jmes Ostell, nd Dvid L. Wheeler. Gennk: updte. Nulei Aids Reserh,, 32(Dtse issue):d23 D26, [BOR99] [BR00] [Br86] [Bri59] [BT03] Alln Borodin, Rfil Ostrovsky, nd Yuvl Rni. Lower ounds for high dimensionl nerest neighor serh nd relted prolems. In Pro. 31st ACM Symp. on Theory of Computing (STOC), pges ACM Press, Omer Brkol nd Yuvl Rni. Tighter ounds for nerest neighor serh nd relted prolems in the ell proe model. In Pro. 32nd ACM Symp. on Theory of Computing (STOC), pges ACM Press, Rihrd C. Brdley. Bsi properties of strong mixing onditions. In Ernst Eerlein nd Murd S. Tqqu, editors, Dependene in Proility nd Sttistis. Birkhäuser, Rene De L Brindis. File serhing using vrile length keys. In Pro. Western Joint Computer Conferene, pges , Mrh Arno Buhner nd Hnjo Täuig. A fst method for motif detetion nd serhing in protein struture dtse. Tehnil Report TUM-I0314, Fkultät für Informtik, TU Münhen, Septemer [BTG03] Arno Buhner, Hnjo Täuig, nd Jn Griesh. A fst method for motif detetion nd serhing in protein struture dtse. In Pro. Germn Conferene on Bioinformtis (GCB), volume 2, pges , Otoer [BV00] Gerth Stølting Brodl nd Srinivsn Venktesh. Improved ounds for ditionry look-up with one error. Informtion Proessing Letters (IPL), 75(1 2):57 59, 2000.

131 BIBLIOGRAPHY 121 [BV02] [BYG90] Pul Beme nd Erik Vee. Time-spe trdeoffs, multiprty ommunition omplexity, nd nerest-neighor prolems. In Pro. 34th ACM Symp. on Theory of Computing (STOC), pges ACM Press, Rirdo A. Bez-Ytes nd Gston H. Gonnet. All-ginst-ll sequene mthing. Tehnil report, Universidd de Chile, [BYG96] Rirdo A. Bez-Ytes nd Gston H. Gonnet. Fst text serhing for regulr expressions or utomton serhing on tries. J. ACM, 43(6): , [BYG99] [CCGL99] Rirdo A. Bez-Ytes nd Gston H. Gonnet. A fst lgorithm on verge for ll-ginst-ll sequene mthing. In Pro. 6th Int. Symp. on String Proessing nd Informtion Retrievl (SPIRE), pges IEEE, Amit Chkrrti, Bernrd Chzelle, Benjmin Gum, nd Alexey Lvov. A lower ound on the omplexity of pproximte nerest-neighor serhing on the Hmming ue. In Pro. 31st ACM Symp. on Theory of Computing (STOC), pges , [CGL04] Rihrd Cole, Lee-Ad Gottlie, nd Moshe Lewenstein. Ditionry mthing nd indexing with errors nd don t res. In Pro. 36th ACM Symp. on Theory of Computing (STOC), pges , [CL94] [CLR90] [CN02] [Co95] [Com74] [CR94] Willim I. Chng nd Eugene L. Lwler. Suliner pproximte string mthing nd iologil pplitions. Algorithmi, 12: , Thoms H. Cormen, Chrles E. Leiserson, nd Ronld L. Rivest. Introdution to Algorithms. The MIT Press, 1st edition, Edgr Chávez nd Gonzlo Nvrro. A metri index for pproximte string mthing. In Pro. 5th Ltin Amerin Theoretil Informtis (LATIN), volume 2286 of Leture Notes in Computer Siene, pges , Arhie L. Cos. Fst pproximte mthing using suffix trees. In Pro. 6th Symp. on Comintoril Pttern Mthing (CPM), volume 937 of Leture Notes in Computer Siene, pges Springer, Luis Comtet. Advned Comintoris. D. Reidel Pulishing Compny, Dortreht-Hollnd, Mxime Crohemore nd Wojieh Rytter. Text Algorithms. Oxford University Press, New York, [DLO01] Erik D. Demine nd Alejndro López-Ortiz. A liner lower ound on index size for text retrievl. In Pro. 12th ACM-SIAM Symp. on Disrete Algorithms (SODA), pges ACM, Jnury 2001.

132 122 BIBLIOGRAPHY [Fr97] [FBY92] [FGD95] [FGK + 94] [Fie70] [FM00] [FMdB99] [FP86] [FR85] Mrtin Frh. Optiml suffix tree onstrution with lrge lphets. In Pro. 38th IEEE Symp. on Foundtions of Computer Siene (FOCS), pges , Mimi, Florid, USA, Otoer IEEE. Willim B. Frkes nd Rirdo Bez-Ytes, editors. Informtion Retrievl Dt Strutures & Algorithms, volume 1. Prentie Hll, Philippe Fljolet, Xvier Gourdon, nd Philippe Dums. Mellin trnsforms nd symptotis: Hrmoni sums. Theoretil Computer Siene, 144(1 2):3 58, Philippe Fljolet, Peter J. Grner, Peter Kirshenhofer, Helmut Prodinger, nd Roert F. Tihy. Mellin trnsforms nd symptotis: Digitl sums. Theoretil Computer Siene, 123: , Jerry L. Fields. The uniform symptoti expnsion of rtio of Gmm funtions. In Pro. Int. Conf. on Construtive Funtion Theory, pges , Vrn, My Polo Ferrgin nd Giovnni Mnzini. Opportunisti dt strutures with pplitions. In Pro. 41st IEEE Symp. on Foundtions of Computer Siene (FOCS), pges , Polo Ferrgin, S. Muthukrishnn, nd Mrk de Berg. Multi-method dispthing: geometri pproh with pplitions to string mthing prolems. In Pro. 31st ACM Symp. on Theory of Computing (STOC), pges ACM Press, Philippe Fljolet nd Clude Pueh. Prtil mth retrievl of multidimensionl dt. J. ACM, 33(2): , Philippe Fljolet nd Mireille Regnier. Some uses of the Mellin trnsform in the nlysis of lgorithms. In A. Apostolio nd Z. Glil, editors, Comintoril Algorithms on Words, volume F12 of NATO ASI, pges Springer, [Fre60] Edwrd Fredkin. Trie memory. Communitions of the ACM, 3(9): , [FS95] [FS96] [GBT84] Philippe Fljolet nd Roert Sedgewik. Mellin trnsforms nd symptotis: Finite differenes nd Rie s integrl. Theoretil Computer Siene, 144(1-2): , Philippe Fljolet nd Roert Sedgewik. The verge se nlysis of lgorithms: Mellin trnsform symptotis. Tehnil Report 2956, Institut Ntionl de Reherhe en Informtique et Automtique (INRIA), Hrold N. Gow, Jon Louis Bently, nd Roert E. Trjn. Sling nd relted tehniques for geometry prolems. In Pro. 16th ACM Symp. on Theory of Computing (STOC), pges ACM, April 1984.

133 BIBLIOGRAPHY 123 [GBY91] Gston H. Gonnet nd Rirdo A. Bez-Ytes. Hndook of Algorithms nd Dt Strutures: In Psl nd C. Addison-Wesley Pulishing Compny, 2nd edition, [GK97] [GKP94] [GKS03] [GMRS03] [GS04] [Gus97] [GV00] [Hm50] [HT84] Roert Giegerih nd Stefn Kurtz. From Ukkonen to MCreight nd Weiner: A unifying view of liner-time suffix tree onstrution. Algorithmi, 19: , Ronld L. Grhm, Donld E. Knuth, nd Oren Ptshnik. Conrete Mthemtis. Addison-Wesley, 2nd edition, Roert Giegerih, Stefn Kurtz, nd Jens Stoye. Effiient implementtion of lzy suffix trees. Softwre Prtie nd Experiene, 33: , Alessndr Griele, Filippo Mignosi, Antonio Restivo, nd Mrinell Siortino. Indexing strutures for pproximte string mthing. In Pro. 5th Itlin Conferene on Algorithms nd Complexity (CIAC), volume 2653 of Leture Notes in Computer Siene, pges , Dn Gusfield nd Jens Stoye. Liner time lgorithms for finding nd representing ll the tndem repets in string. J. Computer nd System Sienes, 69: , Dn Gusfield. Algorithms on Strings, Trees, nd Sequenes: Comp. Siene nd Computtionl Biology. Cmridge University Press, Roerto Grossi nd Jeffry Sott Vitter. Compressed suffix rrys nd suffix trees with pplitions to text indexing nd string mthing. In Pro. 32nd ACM Symp. on Theory of Computing (STOC), pges , Rihrd W. Hmming. Error deteting nd error orreting odes. The Bell System Tehnil Journl, pges , Dov Hrel nd Roert Endre Trjn. Fst lgorithms for finding nerest ommon nestors. SIAM J. Comp., 13(2): , My [Ind04] Piotr Indyk. Nerest neighors in high-dimensionl spes. In Jo E. Goodmn nd Joseph O Rourke, editors, Hndook of Disrete nd Computtionl Geometry, hpter 39. CRC Press LLC, 2nd edition, [JMS04] [JS91] Philippe Jquet, Bonit MVey, nd Wojieh Szpnkowski. Compt suffix trees resemle PATRICIA tries: Limiting distriution of depth. J. the Irnin Sttistil Soiety, Philippe Jquet nd Wojieh Szpnkowski. Anlysis of digitl tries with Mrkovin dependeny. IEEE Trnstions on Informtion Theory, 37(5): , 1991.

134 124 BIBLIOGRAPHY [JS94] Philippe Jquet nd Wojieh Szpnkowski. Autoorreltion on words nd its pplitions: Anlysis of suffix trees y string-ruler pproh. J. Comintoril Theory, Series A, 66: , [JU91] Petteri Jokinen nd Esko Ukkonen. Two lgorithms for pproximte string mthing in stti texts. In Pro. 16th Int. Symp. on Mthemtil Foundtions of Computer Siene (MFCS), volume 520 of Leture Notes in Computer Siene, pges Springer, [KA03] Png Ko nd Srinivs Aluru. Spe effiient liner time onstrution of suffix rrys. In Rirdo A. Bez-Ytes, Edgr Chávez, nd Mxime Crohemore, editors, Pro. 14th Symp. on Comintoril Pttern Mthing (CPM), volume 2676 of Leture Notes in Computer Siene, pges Springer, [Kl02] [Kär95] Olv Kllenerg. Sttionry proesses nd ergodi theory. In Foundtions of Modern Proility, hpter 9. Springer, Juh Kärkkäinen. Suffix tus: A ross etween suffix tree nd suffix rry. In Pro. 6th Symp. on Comintoril Pttern Mthing (CPM), volume 937 of Leture Notes in Computer Siene, pges , [Kär02] Juh Kärkkäinen. Computing the threshold for q-grm filters. In Pro. 8th Sndinvin Workshop on Algorithm Theory, pges , [Kin73] [Kir96] [Knu97] [Knu97] [Knu98] [KP86] [KPS93] John Frnk Chrles Kingmn. Sudditive ergodi theory. Annls of Proility, 1(6): , Peter Kirshenhofer. A note on lternting sums. Eletroni Journl of Comintoris, 3(2), Donld E. Knuth. The Art of Computer Progrmming Fundmentl Algorithms, volume 1. Addison Wesley, 3rd edition, Septemer Donld E. Knuth. The Art of Computer Progrmming Seminumeril Algorithms, volume 2. Addison Wesley, 3rd edition, July Donld E. Knuth. The Art of Computer Progrmming Sorting nd Serhing, volume 3. Addison Wesley, 2nd edition, Ferury Peter Kirshenhofer nd Helmut Prodinger. Some further results on digitl serh trees. In Lurent Kott, editor, 13th Int. Colloq. on Automte, Lnguges nd Progrmming, volume 226 of Leture Notes in Computer Siene, Rennes, Frne, Springer. Peter Kirshenhofer, Helmut Prodinger, nd Wojieh Szpnkowski. Multidimensionl digitl serhing nd some new prmeters in tries. Int. J. Foundtions of Computer Siene, 4:69 84, 1993.

135 BIBLIOGRAPHY 125 [KS03] Juh Kärkkäinen nd Peter Snders. Simple liner work suffix rry onstrution. In Pro. 30th Int. Colloq. on Automt, Lnguges nd Progrmming (ICALP), volume 2719 of Leture Notes in Computer Siene, pges Springer, [KSPP03] [Kur99] [Lev65] [ŁS97] [LS99] [LT97] [M00] [M03] [M04] Dong Kyue Kim, Jeong Seop Sim, Heejin Prk, nd Kunsoo Prk. Liner-time onstrution of suffix rrys (extended strt). In Rirdo A. Bez-Ytes, Edgr Chávez, nd Mxime Crohemore, editors, Pro. 14th Symp. on Comintoril Pttern Mthing (CPM), volume 2676 of Leture Notes in Computer Siene, pges Springer, Stefn Kurtz. Reduing the spe requirement of suffix trees. Softwre Prtie nd Experiene, 29(13): , Vldimir Iosifovih Levenshtein. Binry odes ple of orreting deletions, insertions nd reversls. Dokldy Akdemii Nuk SSSR, 163(4): , August Tomsz Łuzk nd Wojieh Szpnkowski. A suoptiml lossy dt ompression sed on pproximte pttern mthing. IEEE Trnstions on Informtion Theory, 43(5): , N. Jesper Lrsson nd Kunihiko Sdkne. Fster suffix sorting. Tehnil Report LU-CS-TR:99-214, Dept. of Comp. Siene, Lund University, Sweden, Dniel Lopresti nd Andrew Tomkins. Blok edit models for pproximte string mthing. Theoretil Computer Siene, 181: , Moritz G. Mß. Liner idiretionl on-line onstrution of ffix trees. In Pro. 11th Symp. on Comintoril Pttern Mthing (CPM), volume 1848 of Leture Notes in Computer Siene, pges Springer, June Moritz G. Mß. Liner idiretionl on-line onstrution of ffix trees. Algorithmi, 37(1):43 74, June Moritz G. Mß. Averge-se nlysis of pproximte trie serh. In Pro. 15th Symp. on Comintoril Pttern Mthing (CPM), volume 3109 of Leture Notes in Computer Siene, pges Springer, July [M04] Moritz G. Mß. Averge-se nlysis of pproximte trie serh. Tehnil Report TUM-I0405, Fkultät für Informtik, TU Münhen, Mrh [MC76] Edwrd M. MCreight. A spe-eonomil suffix tree onstrution lgorithm. J. ACM, 23(2): , April 1976.

136 126 BIBLIOGRAPHY [MF04] [MM93] [MN04] [MN05] [MN05] [Mor68] Giovnni Mnzini nd Polo Ferrgin. Engineering lightweight suffix rry onstrution lgorithm. Algorithmi, 40(1):33 50, June Udi Mner nd Gene Myers. Suffix rrys: A new method for on-line string serhes. SIAM J. Comp., 22(5): , Otoer Moritz G. Mß nd Johnnes Nowk. A new method for pproximte indexing nd ditionry lookup with one error. To e pulished in ipl Informtion Proessing Letters (IPL), Moritz G. Mß nd Johnnes Nowk. Text indexing with errors. Tehnil Report TUM-I0503, Fkultät für Informtik, TU Münhen, Mrh Moritz G. Mß nd Johnnes Nowk. Text indexing with erros. In Pro. 16th Symp. on Comintoril Pttern Mthing (CPM), To e pulished in Springer LNCS, Donld R. Morrison. PATRICIA prtil lgorithm to retrieve informtion oded in lphnumeri. J. ACM, 15(4): , Otoer [MP69] Mrvin Minsky nd Seymour Ppert. Pereptrons. MIT Press, [MP03] Ginrlo Muri nd Giulio Pvesi. Pttern disovery in RNA seondry struture using ffix trees. In Pro. 14th Symp. on Comintoril Pttern Mthing (CPM), volume 2676 of Leture Notes in Computer Siene, pges , [MPP01] Conrdo Mrtinez, Alois Pnholzer, nd Helmut Prodinger. Prtil mth queries in relxed multidimensionl serh trees. Algorithmi, 29: , [Mut02] [MZS02] [Nv98] [Nv01] [NBY00] S. Muthukrishnn. Effiient lgorithms for doument retrievl prolems. In Pro. 13th ACM-SIAM Symp. on Disrete Algorithms (SODA). ACM/SIAM, Krisztián Monostori, Arkdy Zslvsky, nd Heinz Shmidt. Suffix vetor: Spe- nd time-effiient lterntive to suffix trees. In Mihel J. Oudshoorn, editor, Pro. Twenty-Fifth Austrlsin Computer Siene Conferene (ACSC2002), Melourne, Austrli, Gonzlo Nvrro. Approximte Text Serhing. PhD thesis, Dept. of Computer Siene, University of Chile, Sntigo, Chile, Gonzlo Nvrro. A guided tour to pproximte string mthing. ACM Computing Surveys, 33(1):31 88, Mrh Gonzlo Nvrro nd Rirdo Bez-Ytes. A hyrid indexing method for pproximte string mthing. J. Disrete Algorithms, 1(1): , Speil issue on Mthing Ptterns.

137 BIBLIOGRAPHY 127 [NBYST01] [Nör24] Gonzlo Nvrro, Rirdo A. Bez-Ytes, Erkki Sutinen, nd Jorm Trhio. Indexing methods for pproximte string mthing. IEEE Dt Engineering Bulletin, 24(4):19 27, Deemer Niels Erik Nörlund. Vorlesungen üer Differenzenrehnung. Springer, Berlin, [Now04] Johnnes Nowk. A new indexing method for pproximte pttern mthing with one mismth. Mster s thesis, Fkultät für Informtik, Tehnishe Universität Münhen, Grhing. Münhen, Germny, Ferury [Ofl96] [Pit85] [Pit86] [PW99] [Rud87] [SM02] Keml Oflzer. Error-tolernt finite-stte reognition with pplitions to morphologil nlysis nd spelling orretion. Computtionl Linguistis, 22(1):73 89, Boris Pittel. Asymptotil growth of lss of rndom trees. Annls of Proility, 13(2): , Boris Pittel. Pths in rndom digitl tree: Limiting distriutions. Advnes in Applied Proility, 18: , Edwrd H. Porter nd Willim E. Winkler. Approximte string omprison nd its effet on n dvned reord linkge system. In Reord Linkge Tehniques 1997: Proeedings of n Int. Workshop nd Exposition, pges , Wshington, D.C., Ntionl Reserh Counil, Ntionl Ademy Press. Wlter Rudin. Rel nd Complex Anlysis. MGrw-Hill, 3rd edition, Klus U. Shulz nd Stoyn Mihov. Fst string orretion with Levenshtein utomt. Int. J. on Doument Anlysis nd Reognition (IJ- DAR), 5:67 85, [Sto95] Jens Stoye. Affixäume. Diplomreit, Universität Bielefeld, My [Sto00] [SV88] [Szp88] Jens Stoye. Affix trees. Tehnil Report , Universität Bielefeld, Tehnishe Fkultät, Bruh Shieer nd Uzi Vishkin. On finding lowest ommon nestors: Simplifition nd prlleliztion. SIAM J. Comp., 17(6): , Deemer Wojieh Szpnkowski. The evlution of n lterntive sum with pplitions to the nlysis of some dt strutures. Informtion Proessing Letters (IPL), 28:13 19, [Szp88] Wojieh Szpnkowski. Some results on v-ry symmetri tries. J. Algorithms, 9: , 1988.

138 128 BIBLIOGRAPHY [Szp93] [Szp93] [Szp00] [TE51] [Tem96] [Ukk85] [Ukk93] Wojieh Szpnkowski. Asymptoti properties of dt ompression nd suffix trees. IEEE Trnstions on Informtion Theory, 39(5): , Septemer Wojieh Szpnkowski. A generlized suffix tree nd its (un)expeted symptoti ehviors. SIAM J. Comp., 22(6): , Deemer Wojieh Szpnkowski. Averge Cse Anlysis of Algorithms on Sequenes. Wiley-Intersiene, 1st edition, Frneso Giomo Triomi nd Arthur Erdélyi. The symptoti expnsion of rtio of Gmm funtions. Pifi J. Mthemtis, 1: , Nio M. Temme. An Introdution to Clssil Funtions of Mthemtil Physis. Wiley, New York, Esko Ukkonen. Algorithms for pproximte string mthing. Informtion nd Control, 64: , Esko Ukkonen. Approximte string-mthing over suffix trees. In Pro. 4th Symp. on Comintoril Pttern Mthing (CPM), volume 684 of Leture Notes in Computer Siene, pges Springer, [Ukk95] Esko Ukkonen. On-line onstrution of suffix trees. Algorithmi, 14: , [Vui80] [WDH + 04] [Wei73] [YY95] [YY97] Jen Vuillemin. A unifying look t dt strutures. Communitions of the ACM, 23(4): , F.A.O. Werner, G. Durstewitz, F.A. Hermnn, G. Thller, W. Krämer, S. Kollers, J. Buitkmp, M. Georges, G. Brem, J. Mosner, nd R. Fries. Detetion nd hrteriztion of SNPs useful for identity ontrol nd prentge testing in mjor Europen diry reeds. Animl Genetis, 35(1):44 49, Ferury Peter Weiner. Liner pttern mthing lgorithms. In Pro. 14th IEEE Symp. on Swithing nd Automt Theory, pges IEEE, Andrew C. Yo nd Frnes F. Yo. Ditionry look-up with smll errors. In Pro. 6th Symp. on Comintoril Pttern Mthing (CPM), volume 937 of Leture Notes in Computer Siene, pges , Andrew C. Yo nd Frnes F. Yo. Ditionry look-up with one error. J. Algorithms, 25: , 1997.

139 Index elertor, 5 tive prefix, 36 tive suffix, 36 ffix tree, 2 4, 8, 34, 31 62, 115 ompt, 3, 36 exmple of, 32 mortized, nlysis, 16, 31, 46, 54, 56, 57 onstnt time, 16 verge omptifition numer, 77, 79 verge-se nlysis, 6 9, 15, 63 65, 67 se set, 95, 108, 113 Bernoulli numers, 28, 81 generlized, 28, 82 Bernoulli polynomils, generlized, 28, 78 Bernoulli trils, independent, 17, 65 see lso memoryless soure Bet funtion, 27, 69, 70 pproximtion of, 74 idiretionl, 54 onstrution of ffix trees, serh, 115 Borel-Cntelli Lemm, 66, 67 nonize, 42, 41 43, 113 Crtesin tree, 23 Cuhy Residue Theorem, 27 entroid pth, 8, 94, 116 omplexity, 15, 16 expeted, 18 mesures, onvergene, lmost sure, 19, 66, 67 method of Kesten nd Kingmn, 66 in proility, 6, 18, 66 ost model, it, 16 logrithmi, 15 uniform, 15 word, 16 see lso omplexity CST, 20 DAWG, 3, 7 see lso direted yli word grph denonize, 46, depth, 13 depth(), 13, 100 see lso string depth depth first trversl, 6, 25, 93, 109, 110 ditionry indexing, 4, 5 pproximte, 5, 8, 9, 26, 63, 93, 115 prolem, 25 direted yli word grph, 22 ompt, 3 see lso DAWG distne, rithmeti, 90, 91 edit, 2, 6 9, 15, 26, 94 97, 99, 101, 108 k-ounded omputtion of, 99, 102, 108 weighted, 14 Hmming, 2, 5, 7, 8, 14, 26, 64, 88, 90, 91, 95, 96, 115, 116 Levenshtein, 15 see lso string distne doument listing prolem, 22, 25, 26 don t-re symol, 6, 14, 90 dynmi progrmming, 14, 24 dynmi tle, 16, 35, 39,

140 130 INDEX edge, 13 tomi, 13, 38 ompt, 20 see lso open edge edit distne, see distne edit grph, 96, 97, 99 relevnt, 96, 97, 97 99, 102, 114 edit opertion, 14 edit pth, 97 edit sequene, see ordered edit sequene edit stge, 97 end node, 49 end point, 42, 50 ergodi, 18 error tree, 104 height of, 94, 102, 111 Euler Tour, 24 Eulerin numers, 29 Eulerin polynomils, 28, 82 expliit lotion, 13, 32, 33 exponentil nelltion effet, 26, 76 Gmm funtion, 27, 70 Hmming distne, see distne height, 13 impliit lotion, 13, 34, 35, 43 k-d-trie, 6 k-error ourrene, 101 Lw of Lrge Numers, 65 left-rnhing, 12 Levenshtein distne, see edit distne liner serh lgorithm, 64 longest ommon susequene, 14 Mrkov soure, 17 mthing sttistis, 113 Mellin trnsform, 6, 28, 29, 84 memoryless soure, 17, 64, 92 mixing ondition, 18, 19, 111, 114 nested, 12 longest prefix, see tive prefix longest suffix, see tive suffix node depth, 13, 24 non-nested, see nested norml itertion, see suffix itertion ourrene, 11 pproximte, 25 set of, 12 ourrenes(), 12 see lso k-error ourrene ourrene listing prolem, 25 open edge, 36, 39, 41, 44, 49, 51 ordered edit sequene, 97, output sensitive, 22, 93 pth, 13, 38 of Σ + -tree node, 13 pth(), 13, 100 of prefix only nodes, 4, 38, 39, 47, 49 51, 54, 56, 115 pth ompression, 13, 20 PATRICIA tree, 20 exmple of, 21 pttern, 11 prefix, 11 tive, see tive prefix set of, 12 prefixes(), 12 prefix free, 12 redution, 12 set, 12, 20, 21, 101 pfree(), 12 prefix itertion, 47 prefix node, 35 prefix only node, 38 prefix tree, 3, 34 purse, 57 rnge minimum query, 23 rnge query, 9, 22 25, 101, 102, 109, 110, 116, 117 ounded vlue, 23, 24, 109, 114 olored, 23, 24, 109 geometri, 7 referene pir, 35, 113 se of, 35 nonil, 35, 36, 113 omined, 35, 36 prefix, 35, 36 representtion of virtul nodes, 13

141 INDEX 131 size of, 39 suffix, 35, 36 register, 5 relevnt suffix, 41, 113 residue, 27 reverse itertion, see prefix itertion Rie s integrls, 26, 26 29, 69, 76, 78, 84 right-rnhing, 12 sentinel, 20, 22, 103, 109 σ-field, 17, 18 σ, 64 Σ, 11 Σ + -tree, 13, 19 22, 32 34, 95, 99, 100 tomi, 13 see lso trie ompt, 13 see lso PATRICIA tree wek, 100 see lso wek trie sttionry, 18 sttionry nd ergodi, 17 Mrkov hin, 18, 19, 116 sequene, 18, 19 soure, 17, 18, 19, 110, 111, 114 stem from, 103, 105, 111 Stirling s formul, 70, 71, 75 string, 11, depth, 13 distne, 14 15, 115 omprison-sed, 14, 63, 64, 90, 91 length, 11 prefix of, 11 suffix of, 11 strong α-mixing ondition, 18, 19 sustring, 11 proper, 12 set of, 12 psustrings(), 12 set of, 12 sustrings(), 12 sutree, 13 suffix, 11 tive, see tive suffix nested, see nested proper, 12 set of, 12, 95, 110 suffixes(), 12 suffix rry, 2, 4 6, 22, 115, 117 suffix tus, 22 suffix itertion, 47 suffix link, 3, 4, 20, 31 34, 113 tomi, 4 hin of, 32, 33, 37 non-tomi, 37 suffix link tree, 32 exmple of, 32 of tomi suffix tree, 32 of ompt suffix tree, 33 suffix node, 34 suffix tree, 2 9, 20, 92, 94, 99, 100, 115 tomi, 3, 32, 40 dulity of, 32 ompt, 3, 20 wek dulity of, 32 onstrution of, Ukkonen, 40 Weiner (modified), 43 dul, 34 exmple of, 21, 32 generlized, 22, 91, 93, 113, 114 for doument listing, height of, 19, 94, 111 suffix vetor, 22 text orpus, 11, 26, 93 text indexing, 2, 11, 19, 115 pproximte, 2, 4 8, 25, 93 95, 116, 117 prolem, 7, 9, 22, 25 26, 93, 116 lssifition of, 26, 63, 64, 93, 95, 108 trie, 5, 6, 8, 9, 20, 26 ompt, 20, , 104 omptifition of strings, 77 ompressed, 20 see lso PATRICIA tree exmple of, 21 height of, 90, 100, 102, 111 hidden hrters, 77

142 132 INDEX rndom, 69 serh lgorithm, 65 numer of visited nodes, 67 see lso k-d-trie underlying string, 35 unique rnhing riterion, 13, 33, 100, 109 relxed, 100 virtul node, 13, 35, 100, 109 wek trie, 9, 100, , 104, 105, 109 with high proility, 18, 19, 93, word set of Σ + -tree, 13 words(), 13, 100