WDS'05 Proceedings of Contributed Papers, Part I, 41–46, 2005. ISBN 80-86732-59-2 MATFYZPRESS

Suffix Tree for a Sliding Window: An Overview

M. Senft
Charles University, Faculty of Mathematics and Physics, Malostranské náměstí 25, 118 00 Prague, Czech Republic.

Abstract. The suffix tree is a very powerful data structure developed originally for string matching and string searching. It has found many applications over time, and some of them belong to the data compression field. Many of these applications need a suffix tree built for a sliding window, and there exist two clever algorithms, by Fiala and Greene and by Larsson, that make this possible. However, as we show, both approaches have flawed proofs. We remedy this situation both by explaining a simple alternative algorithm and by giving a correct proof.

Introduction

In 1973 Weiner introduced a new powerful data structure for string matching and searching [Weiner, 1973]. This data structure is called the suffix tree and has found many applications over time. Despite losing some ground to the CDAWG [Crochemore and Vérin, 1997] and the suffix array [Manber and Myers, 1993] lately, the suffix tree is still a very interesting data structure. Particularly interesting is the application of the suffix tree to data compression [Fiala and Greene, 1989; Larsson, 1999; Senft, 2005]. Many of these applications require the suffix tree to be maintained over a so-called sliding window [Ziv and Lempel, 1977].

Fiala and Greene developed a clever method to adapt the suffix tree for a sliding window [Fiala and Greene, 1989] that was later modified by Larsson [Larsson, 1999]. These methods are very similar and have a common weakness: their correctness proofs are flawed, as we will show later. We remedy this situation both by giving a correct proof and by describing a simpler working method.

This paper is organised as follows: The next section reviews some necessary notation and terminology, leading to the definition of two main concepts: the suffix tree and the sliding window. The third section describes suffix tree adaptation for a sliding window and also contains our original results.
First the suffix tree construction and symbol deletion algorithms are reviewed, then suffix tree implementation is described and edge label maintenance addressed. Two well-known algorithms for edge label maintenance [Fiala and Greene, 1989; Larsson, 1999] are described and analysed and our own simple replacement given. Weaknesses in correctness proofs are shown for Fiala's and Greene's as well as Larsson's algorithm and a new sound proof is given. We conclude this paper with final remarks in the last section.

Concepts and Notation

We omit basic string and graph-related definitions like the definition of an alphabet, symbol and prefix or root, edge and parent. We also give only informal definitions for most non-basic concepts used in this paper and refer the reader to e.g. [Senft, 2005] for details.

Strings

The following string-related definitions will simplify the suffix tree definition as well as the description of algorithms in further sections. String α is said to occur in string δ if there exists a position i such that the sequence of symbols beginning at position i and ending at position i + |α| − 1 equals string α. This sequence of characters is called an occurrence of α in δ at position i. A string is unique in δ if it occurs in δ exactly once. A right branching substring of string δ occurs at least twice in δ and at least two of these occurrences are followed by two different characters. A proper substring of string δ is either the empty string or a right branching substring of δ or a unique suffix of δ.

Suffix Tree

To give a suffix tree definition, standard graph terminology (cf. [Harary, 1969]) will be used. However, to simplify things a bit, vertices of a tree that are not leaves will be called nodes. The suffix tree for string δ is a rooted tree with edges labelled by nonempty substrings of δ (see Fig. 1). The strings represented in the suffix tree are exactly all substrings of δ. The root represents
the empty string. Other vertices represent strings that result from the concatenation of labels of edges on the path from the root to this node. Due to this duality we can speak about vertices in string terms and vice versa. Note that many substrings are not represented by a vertex, but every substring α can be represented as a location pair (β, γ), where β is a node of the suffix tree representing a prefix of string α and γ is a suffix of string α containing the rest of the string (i.e. α = βγ).

The shape of the suffix tree is given by two more rules. First, no parent has two edges to its children with labels beginning with the same character and second, there exists no non-root node in the tree with only one child.

Figure 1. An example suffix tree for string […]: (a) plain suffix tree, (b) augmented suffix tree. A plain suffix tree is on the left and an augmented suffix tree (discussed later) on the right. Nodes (explicit) and locations on edges (implicit nodes) are denoted by big circles and small circles, respectively. Lines that start in a node, go through zero or more edge locations and end with an arrow in another node are edges. Solid objects are regular suffix tree nodes and edges, and dashed objects are auxiliary. Single symbols along edges form edge labels.

Sliding Window

The last concept of this section is the sliding window concept. A sliding window on string δ is a substring of δ starting at the b-th symbol of δ and ending at the f-th symbol of δ. Both b (back) and f (front) can be incremented or decremented to make the window slide over the string δ. For most applications the ability to increment b and f is enough and the maximum substring length is bounded by some constant M.

Such sliding windows are used for example in data compression. Methods like those described in [Ziv and Lempel, 1977; Senft, 2005] use the sliding window for the most recently compressed part of the input string and use the information contained here to compress the following symbols. After this the window is moved into a new position and the whole process is repeated.

Suffix Tree for Sliding Window

Compression methods employing a sliding window generally have to do searches in this window.
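As a rough illustration of the sliding window defined in the previous section, the following Python sketch tracks b and f as 1-based indices into the input string and refuses to grow the window beyond M symbols. The class name SlidingWindow, the helper slide, the example string and the particular order in which b and f are advanced are assumptions made for this illustration only; they are not taken from any of the cited implementations.

```python
class SlidingWindow:
    """Window over `text` covering symbols b..f (1-based, inclusive)."""

    def __init__(self, text, max_len):
        self.text = text
        self.max_len = max_len  # the constant M, assumed to be at least 1
        self.b = 1              # back: index of the first symbol in the window
        self.f = 0              # front: index of the last symbol; f = b - 1 means empty

    def __len__(self):
        return self.f - self.b + 1

    def contents(self):
        # Convert the 1-based inclusive range b..f to a Python slice.
        return self.text[self.b - 1:self.f]

    def increment_front(self):
        if self.f == len(self.text):
            raise IndexError("front is already at the end of the string")
        if len(self) == self.max_len:
            raise ValueError("window already holds M symbols")
        self.f += 1

    def increment_back(self):
        if len(self) == 0:
            raise ValueError("window is empty")
        self.b += 1


def slide(text, max_len):
    """Return the window contents after every single move of b or f."""
    window = SlidingWindow(text, max_len)
    seen = []
    # Grow the window; once it is full, drop the oldest symbol first.
    while window.f < len(text):
        if len(window) == max_len:
            window.increment_back()
            seen.append(window.contents())
        window.increment_front()
        seen.append(window.contents())
    # The input is exhausted: shrink the window until it is empty.
    while len(window) > 0:
        window.increment_back()
        seen.append(window.contents())
    return seen
```

With the hypothetical input "cacao" and M = 3, slide("cacao", 3) returns ["c", "ca", "cac", "ac", "aca", "ca", "cao", "ao", "o", ""], so the window never holds more than three symbols. In a suffix tree setting each move of f would correspond to one construction step and each move of b to one deletion step.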
Fiala and Greene were the first to adapt the suffix tree for this purpose [Fiala and Greene, 1989]. To make a suffix tree slide, three things are needed: suffix tree construction, an algorithm to delete the longest prefix from the tree and a method of keeping the edge labels valid. This section covers all of them.

Suffix Tree Construction

There exist many different algorithms for suffix tree construction [Weiner, 1973; McCreight, 1976; Ukkonen, 1995; Farach, 1997]. For the sliding window implementation we can use only those algorithms that construct the suffix tree directly from the string and read the string from left to right like
those described in [McCreight, 1976; Ukkonen, 1995]. These algorithms can then be used for extending the sliding window to the right (i.e. incrementing f). Here, the very elegant and easy to understand algorithm of Ukkonen is briefly reviewed; for more details see [Ukkonen, 1995].

Ukkonen's algorithm works on-line as it constructs the suffix tree for string δa (δ extended by one symbol a) by making a few small changes to the suffix tree for string δ. To understand which changes need to be made, vertices of these trees must be analysed. Note that the set of strings represented by vertices of the suffix tree for string δ is the same as the set of all proper substrings of string δ. Due to this duality we can speak about vertices in string terms and vice versa. Clearly, all right branching substrings of string δ are also right branching substrings of δa, but a unique suffix of δ may not be a suffix of δa at all. However, all changes to the tree come from the suffixes of δ. They can be divided into the following groups:

  suffix group 1: All suffixes α unique in δ (for those suffixes αa is a unique suffix of δa),
  suffix group 2: All suffixes α not unique in δ, such that αa is a unique suffix of δa,
    suffix group 2a: α is not right branching in δ,
    suffix group 2b: α is right branching in δ,
  suffix group 3: All suffixes α not unique in δ, such that αa is not a unique suffix of δa.

Note that suffixes of δ are split into these groups according to their length. The transformation between the two suffix trees has to make the following changes for each suffix in a suffix group:

  suffix group 1: Every suffix α in this group is a leaf in the old tree and αa is a leaf in the new tree. So every leaf α and its incoming edge are replaced by a leaf αa and an edge with a appended to its label.
  suffix group 2: Suffixes in this group become unique after a is appended to them. So a new leaf αa is created and connected by a new edge with label a to the node α.
    suffix group 2a: Node α does not exist yet and string α is located on an edge. This edge is split into two edges at α's location and a new node α is created.
    suffix group 2b: Node α already exists.
  suffix group 3: This group contains suffixes α that are not leaves in the old tree, such that αa is not a leaf in the new tree. That means that αa is already represented by the old tree and nothing has to be done here.

To eliminate the work needed for group 1, Ukkonen devised an implementation trick of so-called open edges (see [Ukkonen, 1995] for details). The last group to deal with is the suffix group 2. The only complication here is that we have to find all these suffixes somehow. To do it fast, the tree is augmented with auxiliary edges called suffix links (introduced by Weiner in [Weiner, 1973]) and one auxiliary node called nil. Figure 1 contains both a plain and an augmented suffix tree for comparison. Using a suffix link, we can simply jump from a longer suffix to its longest proper suffix and examine them one by one. We start with the longest repeated (i.e. non-unique) suffix of string δ and stop when the current suffix α can be extended by appending the new symbol a without the need to create a new leaf. The string αa is then the longest repeated suffix of δa.

Longest Prefix Deletion

While there exist many algorithms for suffix tree construction, there is only one algorithm for deletion of the longest prefix from the suffix tree. The deletion of the longest prefix was first published in [Fiala and Greene, 1989] and it makes the incrementation of b possible for the sliding window for which the suffix tree is built. Again, we give only a brief review and refer the reader to the original paper for more details.

Like in the case of Ukkonen's construction algorithm we pay attention to the changes in vertex sets between the suffix tree for string δ and the suffix tree for δ with its first symbol removed. In the case of construction all changes come from the suffixes, but here all changes come from the prefixes of δ. Also, while the longest repeated suffix was important for construction, the longest repeated prefix is important for deletion. The longest repeated prefix will remain in the tree after the deletion. This prefix ensures that all shorter
prefixes will be preserved as well and that they will also keep their right branching property and will not become a unique suffix. On the other hand, all longer prefixes are unique and must be deleted. Fortunately, they are all located on the edge to the leaf representing the longest of these prefixes, i.e. the whole string δ. It follows that all unique prefixes can be removed by just deleting the leaf δ and its incoming edge. However, the longest repeated prefix requires attention as it is the only prefix that can lose the right branching property or become a unique suffix:

  The longest repeated prefix is right branching. Then it must be the parent of the leaf δ. So, after deleting the leaf δ, its parent must be checked. If the parent has only one remaining child, then it must be deleted and the two incident edges joined.

  The longest repeated prefix is not right branching. In this case the longest repeated prefix α is represented by a location on the edge leading to the leaf δ and is equal to the longest repeated suffix. Instead of deleting the leaf δ, the leaf is renamed to α and the incoming edge relabelled.

Armed with both the construction and the deletion algorithm, we can make the suffix tree for a sliding window. At first, the suffix tree for the empty string is created. This tree is a suffix tree for a sliding window with b = 1 and f = 0. Next, the construction algorithm is used to increment f until the sliding window size limit is reached. After that, the deletion is used to advance b for the first time and from now on the deletion follows every construction step until f reaches the end of the string. From this point on, only deletion is used to move b until the suffix tree for the empty string is obtained.

Maintaining Edge Labels

Up to now, no implementation details were necessary. However, they are needed to explain the edge label maintenance problem. For constant size alphabets, the suffix tree can be built in linear time and space in the size of the input string [Weiner, 1973]. To fulfil the linear space requirement, the edge labels must be represented by pairs of offsets of the first and last symbol of the label in the input string.
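The offset representation can be illustrated with a minimal Python sketch. It stores each edge label as a pair of 0-based offsets into the input string, so an edge takes constant space however long its label is, and splitting an edge (as needed when a new node is created on it) only requires computing new offsets. The names Edge, label and split, the 0-based inclusive convention and the example string are assumptions of this sketch, not part of any cited implementation.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    first: int  # 0-based offset of the label's first symbol in the text
    last: int   # 0-based offset of the label's last symbol (inclusive)


def label(text, edge):
    """Recover the edge label from the text; no substring is stored on the edge."""
    return text[edge.first:edge.last + 1]


def split(edge, k):
    """Split an edge after the first k label symbols; return (upper, lower)."""
    length = edge.last - edge.first + 1
    if not 0 < k < length:
        raise ValueError("split point must lie strictly inside the label")
    upper = Edge(edge.first, edge.first + k - 1)
    lower = Edge(edge.first + k, edge.last)
    return upper, lower
```

For the hypothetical text "cacao", Edge(1, 4) carries the label "acao", and split(Edge(1, 4), 1) yields two edges with labels "a" and "cao". Both results still point into the same text, which is why the offsets must be kept valid once parts of the text are overwritten.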
Generally one offset is stored in each node and the parent and child offsets together describe the edge label of the edge connecting these two nodes (for details see [Larsson, 1999]). An in-memory copy of the input string is required for fast access to labels. However, if sliding window behaviour is required, then only part of the input string is kept in memory in a circular buffer to save space. In this case offsets describing edge labels must be updated to keep labels valid as older parts of the buffer are overwritten.

To solve the edge label validity problem described above, three methods of keeping labels valid are described [Fiala and Greene, 1989; Larsson, 1999; Senft, 2003]. The first two algorithms use clever partial updates, while the third uses a simple batch update.

Batch Update. Probably the simplest method to keep the offsets valid is batch update [Senft, 2003]. Instead of doing some clever tricks and only partial updates, a simple full update is done. The size of the string symbol array is increased to at least 2M symbols and every time the sliding window has moved by a whole M symbols, the offsets are updated throughout the tree. Note that this approach is correct, because after M symbols have been processed, all leaves have been replaced and must have a valid offset. These offsets can be simply propagated up in the tree to make all offsets in the suffix tree valid at least for the next M sliding window moves.

Fiala's and Greene's Approach. Fiala and Greene are the authors of the first adaptation of the suffix tree for a sliding window [Fiala and Greene, 1989]. They reason as follows: Due to the use of suffix links in the suffix tree construction not all nodes are entered regularly and offsets stored therein can become outdated. The simple solution of updating all ancestors after a new leaf insertion is unnecessarily slow. A better approach is to update only part of the tree using the so-called percolating update. Each node must have a credit counter with only two values, 0 and 1, that is initially set to 0. If a node receives a credit from one of its children, the credit counter is checked and:

  If the credit counter is 0, only the counter is incremented.
  Otherwise one credit is used to update the offset in the node's parent and the second is sent to this parent. The credit counter is decremented.

When a new leaf is created, two credits are spent: one to update its parent and the second is sent to this parent. In the opposite situation when a node is deleted, the node's credit counter is ignored and
two new credits are spent to update its parent and to send one credit to this parent. In this case it is necessary to check whether the child's offset is more recent than the parent's offset. As there can be at most as many new leaves created and nodes deleted as there are symbols in the input string, the scheme works in linear amortised time. The correctness of this approach will be addressed later.

Figure 2. A suffix tree construction example: transition from the suffix tree for string […] to the suffix tree for the string […]. Dotted objects are new in the tree.

Larsson's Approach. Larsson has modified Fiala's and Greene's percolating update technique and gave a different correctness proof [Larsson, 1999]. Like in Fiala's and Greene's scheme, every node has a credit counter and the behaviour after receiving a credit from a child and after a new leaf is created is the same. However, the scheme differs on deletion. When a node is deleted it only updates its parent if it has a credit in its credit counter. The time complexity is the same as for the original algorithm.

Correctness. Fiala and Greene as well as Larsson gave a correctness proof for their respective algorithms [Fiala and Greene, 1989; Larsson, 1999]. Despite trying two different approaches, both proofs are flawed (this was first mentioned in [Senft, 2003]). The theorem formulated by Fiala and Greene is too weak to prove correctness, while the theorem Larsson tried to prove is invalid.

Fiala and Greene attempted to prove the following theorem: "Using percolating update, every internal node will be updated at least once every M sliding window moves." However, this update could change the offset value to a new, but still relatively old value that becomes invalid before the next update. Also, the proof of this theorem is invalid for the same reason Larsson's proof is. The flaw is described below.

Larsson probably noticed the weakness of Fiala's and Greene's correctness proof and decided to give a new correctness proof. However, his proof is flawed like that of Fiala and Greene. He defined a fresh credit as a credit that came to the current node from a leaf that was not yet deleted from the tree.
His theorem looks as follows: "Each node has received a fresh credit from each of its children." However, this is invalid, because when a new node is created in the tree by edge splitting, it does not update its parent and consequently this parent does not receive a fresh credit from this child, a contradiction. Note that the new node did not receive a fresh credit from its older child either. An example of this situation is shown in Fig. 2. Note that the same situation also breaks Fiala's and Greene's proof.

After pointing out the deficiencies in the previous work, we try to remedy this situation. One possible solution is to use batch update instead of percolating update. Another possibility is to modify the updating algorithm to fix the contradicting cases. However, a far better solution is to give a sound proof. We prove Larsson's updating method here.

A node is called ancient if the leaf with which it was created is no longer in the tree. Note that every ancient node is older than any non-ancient node or any leaf. As all non-ancient nodes and all leaves have their offsets up-to-date, we only have to prove that the same holds for all ancient nodes.

Theorem. Every ancient node received a fresh credit from all subtrees rooted in its children.

Proof. We will prove this by contradiction. Suppose that there exists a suffix tree containing an ancient node α that did not receive a fresh credit from the subtree S. We analyse the situation of such a node that is the deepest in the tree:

  1. If the subtree S consists of a leaf, then the ancient node α received a fresh credit from it, either directly or through nodes deleted later.
  2. If the root of the subtree S is another ancient node, then the situation is similar to the previous case as any ancient node deeper in the tree received enough fresh credits to send one to its parent.
  3. The last possibility is that the root of the subtree is a non-ancient node β. There are two possibilities:
     (a) The non-ancient node received a fresh credit from a child other than the leaf it was created with and thus sent a fresh credit to its parent. This case is similar to those above.
     (b) Otherwise, β must have only two children, where one is the leaf the node was created with. The reason for this is that the third child could only appear after an addition of another leaf to this node and a subsequent fresh credit being sent. The second child can be:
         i. An ancient node or a leaf older than its parent that sent a fresh credit up before the parent was created. So this case is again similar to case 1.
         ii. A non-ancient node that sent a fresh credit up before the current parent was created. This is also similar to case 1.
         iii. A non-ancient node that has not yet sent a fresh credit. So, there can be an arbitrarily long sequence of such nodes in parent-child relation, but it must be finite. The last node in this sequence must have a child that falls into one of the previous two categories. That means that even in this last case the ancient node α receives a fresh credit.

By every accounting the ancient node must have received a fresh credit and that is a contradiction.

Conclusion

This paper gives an overview of the state of the art approaches to maintaining a suffix tree for a sliding window. Great attention was paid to the solutions of the edge label maintenance problem and the deficiencies of old approaches. To remedy the unsatisfactory situation, a simple and correct edge label maintenance algorithm was given as well as a new sound proof for the old edge label maintenance methods. So, the old methods are safe and there exists a simple alternative method. These results can be applied to both old and recent work [Fiala and Greene, 1989; Larsson, 1999; Senft, 2005].

References

Crochemore, M. and Vérin, R., Direct construction of compact directed acyclic word graph, in Structures in Logic and Computer Science, edited by A. Apostolico and J. Hein, vol. 1261, pp. 192–211, 1997.
Farach, M., Optimal suffix tree construction with large alphabets, in Proceedings of the 38th Annual IEEE Symposium on Foundations of Computer Science, pp. 137–143, 1997.
Fiala, E. R.
and Greene, D. H., Data compression with finite windows, Communications of the Association for Computing Machinery, 32, 490–505, 1989.
Harary, F., Graph Theory, Addison-Wesley, New York, 1969.
Larsson, N. J., Structures of String Matching and Data Compression, Ph.D. thesis, Department of Computer Science, Lund University, Sweden, 1999.
Manber, U. and Myers, G., Suffix arrays: A new method for on-line string searches, SIAM Journal on Computing, 22, 935–948, 1993.
McCreight, E. M., A space-economical suffix tree construction algorithm, Journal of the Association for Computing Machinery, 23, 262–272, 1976.
Senft, M., Lossless Data Compression using Suffix Trees, Master's thesis, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic, (in Czech), 2003.
Senft, M., Suffix tree based data compression, in SOFSEM 2005: Theory and Practice of Computer Science, 31st Conference on Current Trends in Theory and Practice of Computer Science, edited by P. Vojtáš et al., vol. 3381 of LNCS, pp. 350–359, Springer, 2005.
Ukkonen, E., On-line construction of suffix trees, Algorithmica, 14, 249–260, 1995.
Weiner, P., Linear pattern matching algorithms, in Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11, 1973.
Ziv, J. and Lempel, A., A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, IT-23, 337–343, 1977.