Efficient Algorithms for Computing the Triplet and Quartet Distance Between Trees of Arbitrary Degree

Gerth Stølting Brodal, Rolf Fagerberg, Thomas Mailund, Christian N. S. Pedersen, Andreas Sand

Department of Computer Science, Aarhus University, Denmark. gerth@cs.au.dk.
MADALGO, Center for Massive Data Algorithms, a Center of the Danish National Research Foundation.
Department of Mathematics and Computer Science, University of Southern Denmark, Denmark. rolf@imada.sdu.dk.
Bioinformatics Research Center, Aarhus University, Denmark. {mailund,cstorm,asand}@birc.au.dk.

Abstract

The triplet and quartet distances are distance measures to compare two rooted and two unrooted trees, respectively. The leaves of the two trees should have the same set of $n$ labels. The distances are defined by enumerating all subsets of three labels (triplets) and four labels (quartets), respectively, and counting how often the induced topologies in the two input trees are different. In this paper we present efficient algorithms for computing these distances. We show how to compute the triplet distance in time $O(n \log n)$ and the quartet distance in time $O(d\, n \log n)$, where $d$ is the maximal degree of any node in the two trees. Within the same time bounds, our framework also allows us to compute the parameterized triplet and quartet distances, where a parameter is introduced to weight resolved (binary) topologies against unresolved (non-binary) topologies. The previous best algorithms for computing the triplet and parameterized triplet distances have $O(n^2)$ running time, while the previous best algorithms for computing the quartet distance include an $O(d^9 n \log n)$ time algorithm and an $O(n^{2.688})$ time algorithm, where the latter can also compute the parameterized quartet distance. Since $d \le n$, our algorithms improve on all these algorithms.

1 Introduction

Trees are widely used in many scientific fields to represent relationships, in particular in biology where trees are used e.g. to represent species relationships, so called phylogenies, the relationship between genes in gene families, or for hierarchical clustering of high-throughput experimental data. When using trees to investigate relationships, different data or different computational reconstruction methods can lead to slightly different trees on the same set of leaves. Several distance measures exist to compare trees constructed from different data, or trees constructed with the same data but using different reconstruction methods. These include the Robinson-Foulds distance [8, the triplet distance [, and the quartet distance [6, which all enumerate certain features of the trees they compare and count how often the features differ between the two trees. The Robinson-Foulds distance enumerates all edges in the trees and counts how many of the induced bipartitions differ between the two input trees. The triplet distance (for rooted trees) and quartet distance (for unrooted trees) enumerate all subsets of leaves of size three and four, respectively, and count how many of the induced topologies differ between the two input trees.

For trees with $n$ leaves, the Robinson-Foulds distance can be computed in time $O(n)$ [, which is optimal. The quartet distance can be computed in time $O(n \log n)$ for binary trees [, in time $O(d^9 n \log n)$ for trees where all nodes have degree less than $d$ [9, and in time $O(n^{2.688})$ for trees [7 of arbitrary degree. See also Christiansen et al. [ for a number of algorithms for general trees with different tradeoffs depending on the degree of inner nodes. For the triplet distance, $O(n^2)$ time algorithms exist for both binary and general trees [,.

In this paper we present efficient algorithms for computing the triplet and quartet distance between two trees of arbitrary degree. Our algorithms are inspired by the ideas introduced in [, extended to handle the additional complexity of trees of arbitrary degree that imply non-binary triplet and quartet topologies (so-called unresolved topologies). We show how to compute the triplet distance in time $O(n \log n)$ and the quartet distance in time $O(d\, n \log n)$, where $d$ is the maximal degree of any node in the two trees. Since $d \le n$, our algorithms improve on all previous algorithms for computing the triplet and quartet distance.

Bansal et al. [ introduced the concept of parameterized triplet and quartet distances, where a scaling
parameter is used to weight resolved (binary) topologies against unresolved (non-binary) topologies. Their $O(n^2)$ time algorithm for computing the triplet distance between general trees also computes the parameterized triplet distance. The previous best algorithm for computing the parameterized quartet distance is the $O(n^{2.688})$ time algorithm in [7 for the quartet distance between general trees that can be extended to compute the parameterized quartet distance without loss of time. The algorithms presented in this paper can also compute the parameterized triplet and quartet distance and thus also improve on all previous algorithms for computing these measures.

2 Overview

The triplet and quartet distance measures are defined for two trees, $T_1$ and $T_2$, having the same set of leaves, i.e., they both have $n$ leaves and these are labeled with the same set of leaf labels. The triplet distance is defined for rooted trees, while the quartet distance is defined for unrooted trees.

A triplet is a set $\{i, j, k\}$ of three leaf labels. This is the smallest number of leaves for which the subtree induced by these leaves can have different topologies in two rooted trees $T_1$ and $T_2$. The possible topologies, given by the lowest common ancestor (LCA) relationships between the three leaves, are shown in Fig. 1. The last case is not possible for binary trees, but it is for trees of arbitrary degrees, which is the subject of this paper. The triplet distance is defined as the number of triplets whose topology differs in the two trees. It can naïvely be computed by enumerating all $\binom{n}{3}$ sets of three leaves and for each comparing the induced topologies in the two trees.

A quartet is a set $\{i, j, k, l\}$ of four leaf labels. This is the smallest number of leaves for which the subtree induced by these leaves can have different topologies in two unrooted trees $T_1$ and $T_2$. The possible topologies, given by the relationships of the paths between pairs of leaves in the quartet, are shown in Fig. 2. The last case is not possible for binary trees, but it is for trees of arbitrary degrees. For the remaining three cases, the set $\{i, j, k, l\}$ is split into two sets of two leaves.
We sometimes use the notation $ij|kl$ for splits, with this particular instance referring to the leftmost topology in Fig. 2. Similar to the triplet distance, the quartet distance is defined as the number of quartets whose topology differs in the two trees. It can naïvely be computed by enumerating all $\binom{n}{4}$ sets of four leaves and for each comparing the induced topologies in the two trees. Based on terminology from phylogenetic trees, the rightmost case in each of Fig. 1 and Fig. 2 is called unresolved, while the remaining are called resolved.

Figure 1: Triplet topologies.

Figure 2: Quartet topologies.

Figure 3: Cases for topologies in the two trees (rows: topology in one tree; columns: topology in the other).

              Resolved               Unresolved
Resolved      A: Agree, B: Differ    C
Unresolved    D                      E

Both for triplets and quartets, the possible topologies in the two trees $T_1$ and $T_2$ can be partitioned into the five cases shown in Fig. 3. Note that in the resolved-resolved case, the triplet/quartet topologies may either agree or differ in the two trees. In the resolved-unresolved and unresolved-resolved cases they always differ, and in the unresolved-unresolved case, they always agree.

At the outermost level, our algorithms for calculating the triplet and quartet distances work by first finding the row and column sums in Fig. 3, i.e., finding the sums $A + B + C$, $D + E$, $A + B + D$, and $C + E$, where each capital letter designates the number of triplets/quartets falling in that category. As we show in Sec. 3, finding these sums can be done in $O(n)$ time by simple dynamic programming. Once all these are found, $C$, $D$ and $E$ can be found if $A$ and $B$ are known, or $D$, $C$ and $B$ can be found if $A$ and $E$ are known. The triplet or quartet distance can then be calculated as $B + C + D = (A + B + C) + (A + B + D) - 2A - B$. We will therefore concentrate on calculating $A$ and $B$ (or $E$) in the remainder of this paper.

The parameterized triplet and quartet distance is defined by Bansal et al. [ as $B + p(C + D)$, for $0 \le p \le 1$, which makes it possible to weight the contribution of unresolved triplets/quartets to the total triplet/quartet distance. For $p = 0$, unresolved triplets/quartets do not contribute to the distance, i.e.
unresolved triplets/quartets are ignored, while for $p = 1$, they contribute fully to the distance, as is the case in the unparameterized triplet/quartet distance. The parameterized triplet/quartet distance can be calculated by $B + p\big((A + B + C) + (A + B + D) - 2A - 2B\big)$.
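To make the five categories and the two distance formulas concrete, the following is a minimal Python sketch of the naïve triplet enumeration mentioned in the overview. It is a cubic-time baseline for illustration, not the algorithm of this paper. The nested-tuple tree encoding, the choice that rows of Fig. 3 refer to the first tree, and all function names are assumptions made for the example.

```python
from itertools import combinations

def leaf_paths(tree, prefix=()):
    """Map each leaf label to its root-to-leaf path of child indices.
    Assumption: internal nodes are tuples/lists, leaves are plain labels."""
    if not isinstance(tree, (list, tuple)):
        return {tree: prefix}
    paths = {}
    for index, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (index,)))
    return paths

def lca_depth(p, q):
    """Depth of the lowest common ancestor of two leaves given their paths."""
    depth = 0
    for a, b in zip(p, q):
        if a != b:
            break
        depth += 1
    return depth

def triplet_topology(paths, i, j, k):
    """Return the pair that is separated from the third leaf, or None if unresolved."""
    dij = lca_depth(paths[i], paths[j])
    dik = lca_depth(paths[i], paths[k])
    djk = lca_depth(paths[j], paths[k])
    if dij == dik == djk:
        return None                      # all three LCAs coincide: unresolved
    if dij > dik:
        return frozenset((i, j))
    if dik > dij:
        return frozenset((i, k))
    return frozenset((j, k))

def classify_triplets(t1, t2):
    """Count the categories A..E of Fig. 3 by brute force (rows assumed to be t1)."""
    p1, p2 = leaf_paths(t1), leaf_paths(t2)
    counts = dict.fromkeys("ABCDE", 0)
    for i, j, k in combinations(sorted(p1), 3):
        x = triplet_topology(p1, i, j, k)
        y = triplet_topology(p2, i, j, k)
        if x is None and y is None:
            counts["E"] += 1
        elif x is None:
            counts["D"] += 1
        elif y is None:
            counts["C"] += 1
        else:
            counts["A" if x == y else "B"] += 1
    return counts

def parameterized_distance(counts, p=1):
    """B + p((A+B+C) + (A+B+D) - 2A - 2B); p = 1 gives the plain distance."""
    a, b = counts["A"], counts["B"]
    row_resolved = a + b + counts["C"]
    col_resolved = a + b + counts["D"]
    return b + p * (row_resolved + col_resolved - 2 * a - 2 * b)

t1 = ((1, 2), 3, 4)          # root with an unresolved split below it
t2 = ((1, 2), (3, 4))        # fully resolved
counts = classify_triplets(t1, t2)
print(counts, parameterized_distance(counts), parameterized_distance(counts, 0.5))
```

With $p = 1$ the helper reduces to $B + C + D$, so the same function covers both the plain and the parameterized distance.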
Figure 4: The anchor points (white nodes) of resolved and unresolved triplets and quartets; resolved quartets appear as type α, type β, or type γ. Note that edges in the figures correspond to disjoint paths in the underlying tree.

For rooted trees and triplets, we will show that $A$ and $E$ can be computed in time $O(n \log n)$. For unrooted trees and quartets, we will show that $A$ can be computed in time $O(n \log n)$, and $B$ can be computed in $O(d\, n \log n)$ time, where $d$ is the maximal degree of any node in the two trees.

Our techniques for finding $A$ and $B$, for both triplets and quartets, are very similar in nature, but quite technical. We therefore give a high-level introduction to these techniques, mainly formulated for finding $A$ for triplets, in the rest of this section (some terminology introduced will be given for quartets too).

For a tree $T$, we assign each of the $\binom{n}{3}$ triplets, as well as each of the $\binom{n}{4}$ quartets, to a specific node in the tree. We say that the triplet/quartet is anchored at that node. For triplets, this anchor node is the lowest common ancestor in $T$ of the three leaves. See Fig. 4, where anchor nodes are white. For quartets, the definition is slightly more involved: The quartet distance itself is defined for unrooted trees, but we will later need to root $T$ in an arbitrary internal node. In the rooted tree, a quartet can appear in one of several forms. The possible forms, and our chosen anchor node for each, are shown in the bottom part of Fig. 4.

For a node $v \in T_1$ we denote by $\tau_v$ the set of triplets anchored at $v$. Then $\{\tau_v \mid v \in T_1\}$ is a partition of the set $\mathcal{T}$ of triplets. Thus, $\{\tau_v \cap \tau_u \mid v \in T_1, u \in T_2\}$ is also a partition of $\mathcal{T}$. Our algorithm will find $A(\mathcal{T}) = \sum_{v \in T_1} \sum_{u \in T_2} A(\tau_v \cap \tau_u)$, where $A(S)$ on a set $S$ of triplets is the number of triplets being resolved in both $T_1$ and $T_2$, and having the same topology in both trees.

In the algorithm, we capture the resolved triplets in $\tau_v$ by a coloring of the leaves. For a node $v$, we say that the tree is colored according to $v$ if leaves not in the subtree of $v$ are colored with the color 0, and all leaves in the $i$-th subtree are colored $i$. For two binary nodes $u \in T_2$ and $v \in T_1$ we can e.g.
compute $A(\tau_v \cap \tau_u)$ as follows. When colored according to $v$, then the resolved triplets in $\tau_v$ are exactly the triplets having two leaves colored 1 and one leaf colored 2, or two leaves colored 2 and one leaf colored 1. The number $A(\tau_v \cap \tau_u)$ can be found as follows: let $a$ and $b$ be the two subtrees of $u$, let $a_1$ and $b_1$ be the number of leaves colored 1 in the two subtrees $a$ and $b$, respectively, and $a_2$ and $b_2$ the number of leaves colored 2. Then
$$A(\tau_v \cap \tau_u) = \binom{a_1}{2} b_2 + \binom{b_1}{2} a_2 + \binom{b_2}{2} a_1 + \binom{a_2}{2} b_1 .$$
We call this the number of resolved triplets of $u$ compatible with the coloring. In later sections, we generalize such calculations to nodes of arbitrary degree.

Naïvely going through $T_1$ and for each node $v$ coloring all leaves would take time $O(n)$ per node, for a total time of $O(n^2)$. We reduce the number of recolorings by a recursive coloring algorithm responsible for successively generating all colorings according to nodes in $T_1$. Going through $T_2$ for each coloring and counting the number of resolved triplets compatible with the coloring would also take time $O(n^2)$. We reduce this work by using a hierarchical decomposition tree (HDT) for $T_2$. A HDT is a balanced binary tree built on the nodes of $T_2$, with the nodes of the HDT corresponding to parts of $T_2$. In the HDT, the part of an inner node corresponds to the union of the parts of its two children, with the root corresponding to all of $T_2$. We show that we are able to decorate the nodes of the HDT with information quantifying the contribution of the node's part to the total count of the number of resolved triplets compatible with the coloring. The value at the root then contains $\sum_{u \in T_2} A(\tau_v \cap \tau_u)$. In essence, the HDT performs the inner sum of $A(\mathcal{T}) = \sum_{v \in T_1} \sum_{u \in T_2} A(\tau_v \cap \tau_u)$, while the coloring algorithm performs the outer sum.

The crux of the information decorating scheme is that the information in a node in the HDT can be calculated in time $O(1)$ from the information in its two children, hence updates of colors of leaves of $T_1$ are cheap to propagate through the tree.
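The closed formula for two binary nodes given earlier in this overview can be checked against a direct enumeration. The sketch below is only an illustration under assumed conventions: the two subtrees of $u$ are given as lists of leaf colors, color 0 marks leaves outside the subtree of $v$, and both function names are hypothetical.

```python
from itertools import combinations
from math import comb

def compatible_resolved_binary(a1, a2, b1, b2):
    """Resolved triplets of a binary node u compatible with a two-color coloring.
    a1, a2: leaves colored 1 and 2 in subtree a; b1, b2: the same for subtree b."""
    return (comb(a1, 2) * b2 + comb(b1, 2) * a2
            + comb(b2, 2) * a1 + comb(a2, 2) * b1)

def brute_force(colors_a, colors_b):
    """Enumerate all leaf triples below u and count the compatible ones directly."""
    leaves = [("a", c) for c in colors_a] + [("b", c) for c in colors_b]
    count = 0
    for triple in combinations(range(len(leaves)), 3):
        for x, y in combinations(triple, 2):
            (z,) = set(triple) - {x, y}
            (sx, cx), (sy, cy), (sz, cz) = leaves[x], leaves[y], leaves[z]
            # the pair lies in one subtree of u with one color,
            # the third leaf lies in the other subtree with the other color
            if sx == sy != sz and cx == cy and {cx, cz} == {1, 2}:
                count += 1
    return count

colors_a = [1, 1, 2, 0]   # subtree a: two leaves colored 1, one colored 2, one colored 0
colors_b = [2, 1, 2]      # subtree b: one leaf colored 1, two colored 2
formula = compatible_resolved_binary(colors_a.count(1), colors_a.count(2),
                                     colors_b.count(1), colors_b.count(2))
print(formula, brute_force(colors_a, colors_b))
```

Each qualifying triple has exactly one pair inside a single subtree of $u$, so the enumeration counts it once, which is why the two numbers agree.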
The exception is for the case of calculating the value $B$ for quartets, where the propagation from children to parents in the HDT takes time $O(d)$, which accounts for the extra factor of $d$ in the running time of our algorithm for the quartet distance. The information decorating scheme is quite involved, with many counting variables defined to assist in the propagation of values and of the main count of the number of resolved triplets compatible with the coloring. The main technical contribution of the
paper is this information decorating scheme, and the construction of the HDT. Both are generalizations from binary trees to trees of arbitrary degree of techniques from [.

The rest of the paper is organized as follows: In Sec. 3 we show how to find the row and column sums of Fig. 3. In Sec. 4, we describe the construction of the HDT. In Sec. 5, we give the main algorithm and prove it has the claimed time complexity. In Sec. 6 and Sec. 7 we describe the information decoration scheme.

3 Counting triplets and quartets in a single tree

In this section we describe how to count the number of resolved and unresolved triplets and quartets in a single tree with $n$ leaves in $O(n)$ time using straightforward dynamic programming. Algorithms achieving this were also presented in [, but our algorithm is significantly simpler and very intuitive.

The total numbers of triplets and quartets in a tree are $\binom{n}{3}$ and $\binom{n}{4}$. In the following we describe how to compute the number of unresolved triplets and quartets in a tree. The number of resolved triplets and quartets can be obtained by subtracting the number of unresolved from the total number of triplets and quartets, respectively.

For each node $v$ of the tree we compute the number of leaves $n_v$ in the subtree rooted at $v$ in $O(n)$ time by a bottom up traversal of the tree. During the traversal we count for each node $v$ how many resolved and unresolved triplets and quartets are anchored at $v$ as described below. The sum over the answers from the individual nodes is the final answer.

To count the number of unresolved triplets and quartets rooted at a node $v$ of degree $d$, we compute the values $s_i^v$, $p_i^v$, $t_i^v$ and $q_i^v$. Here $s_i^v$ is the total number of leaves in the first $i$ subtrees below $v$ (where $n_v = s_d^v$), and the counters $p_i^v$, $t_i^v$ and $q_i^v$ are the number of sets of two, three and four leaves, respectively, where the leaves in a set belong to different subtrees among the first $i$ subtrees below $v$.
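The counters just introduced can be maintained in one pass over the children of each node, using the recurrences and the two closing formulas that the text states next. The following Python sketch is one possible rendering, assuming a nested-tuple tree encoding and hypothetical function names; for quartets the tree is taken to be rooted at an internal node.

```python
from math import comb

def num_leaves(node):
    """Number of leaves below a node (leaves are any non-tuple, non-list labels)."""
    if not isinstance(node, (list, tuple)):
        return 1
    return sum(num_leaves(child) for child in node)

def count_unresolved(tree):
    """Return (unresolved triplets, unresolved quartets) of a nested-tuple tree."""
    n = num_leaves(tree)
    totals = [0, 0]

    def visit(node):
        if not isinstance(node, (list, tuple)):
            return 1
        s = p = t = q = 0
        for child in node:
            n_i = visit(child)
            # update the highest-order counter first so that every
            # update still sees the values for the previous i - 1 subtrees
            q += n_i * t
            t += n_i * p
            p += n_i * s
            s += n_i
        totals[0] += t                   # three leaves in distinct subtrees of v
        totals[1] += q + t * (n - s)     # four below v, or three below v and one outside
        return s

    visit(tree)
    return tuple(totals)

def count_resolved(tree):
    """Resolved counts follow by subtracting from the total number of subsets."""
    n = num_leaves(tree)
    unresolved_triplets, unresolved_quartets = count_unresolved(tree)
    return comb(n, 3) - unresolved_triplets, comb(n, 4) - unresolved_quartets

print(count_unresolved((1, 2, 3, 4)), count_unresolved(((1, 2), 3, 4)))
```

The recursion is written for clarity; for very deep trees an explicit stack would avoid Python's recursion limit.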
Letting $n_i^v$ denote the size of the subtree rooted at the $i$-th child of $v$, these values can be computed by dynamic programming using
$$s_1^v = n_1^v, \qquad p_1^v = t_1^v = q_1^v = 0,$$
$$s_i^v = s_{i-1}^v + n_i^v,$$
$$p_i^v = p_{i-1}^v + n_i^v \, s_{i-1}^v,$$
$$t_i^v = t_{i-1}^v + n_i^v \, p_{i-1}^v,$$
$$q_i^v = q_{i-1}^v + n_i^v \, t_{i-1}^v.$$
The number of unresolved triplets rooted at $v$ is the value $t_d^v$. The number of unresolved quartets rooted at $v$ is $q_d^v + t_d^v (n - n_v)$, where the first term counts the number of unresolved quartets where all four leaves are in distinct subtrees below $v$, and the second term counts the number of unresolved quartets where one of the four leaves is not in the subtree of $v$. Since we use $O(1)$ time for each edge to a child, the total time for computing the triplet and quartet information for a single tree is $O(n)$.

4 Hierarchical Decomposition of Rooted Trees

In order to count the triplets/quartets in $T_1$ and $T_2$, we build a data structure called the hierarchical decomposition tree (HDT) on top of $T_2$. The HDT is a balanced binary tree where each node, or component, corresponds to a set of nodes in $T_2$. In later sections, we describe how we decorate the nodes of the HDT with counting information. In this section, we first describe our construction of an HDT on any rooted tree $T$ with $n$ nodes of arbitrary degree. We then prove that the construction works in time $O(n)$, where $n$ is the number of nodes in $T$, and that the resulting HDT is locally balanced. A rooted tree is locally balanced if for any node $v$ in the tree, the height of the subtree rooted in $v$ is $O(\log |v|)$, where $|v|$ is the number of leaves in the subtree. Lemma in [ shows that the union of $k$ root-to-leaf paths in a locally balanced tree contains $O(k(1 + \log \frac{n}{k}))$ nodes, which is a property of the HDT we need later.

The nodes of the HDT are one of the following four different types of components, each corresponding to a specific type of subsets of nodes in $T$. See also Fig. 5.

L: a leaf in $T$,
I: an internal node in $T$,
C: a connected subset of the nodes of $T$,
G: a set of subtrees with roots being siblings in $T$.

For a type G component we require that it is downward closed, meaning that in $T$ no edge from a node inside the component to a child crosses the boundary of the component.
For a type C component we require that at most two edges in $T$ cross the boundary of the component (more precisely, at most one going upwards in $T$, and at most one going downwards in $T$).

We define five actions on components: two transformations, which change the type of a component, and three compositions, which merge components. The five actions are depicted in Fig. 6. Each composition merges exactly two components into one new. We will view the new component as the parent of the two old components. In this way, a series of compositions will generate a binary tree. This is the HDT. Intuitively, the HDT is a balanced binary tree, where the leaves correspond to the nodes of an unbalanced tree with nodes of arbitrary degree.

Given a tree $T$ with $n$ nodes, we construct the HDT bottom-up in a number of rounds. Before the first round, we first convert each leaf of $T$ to an L component,
and each internal node to an I component. This gives a tree with I and L components as the nodes, and the edges of $T$ as the edges. A round will transform one such tree into another smaller tree by performing a number of transformations and compositions of the types depicted in Fig. 6. Each round is working on the tree created by the previous round, and the process stops when the tree has size one. All trees are viewed as rooted in the component containing the root of $T$.

Figure 5: The four different types of components. L components contain a leaf from $T$, I components contain a single internal node from $T$, C components contain a connected set of nodes, and G components contain entire subtrees of siblings in $T$.

Figure 6: The two types of transformations (top) and the three types of compositions (bottom).

From the definitions of the transformations and compositions, the following are seen to be invariants of the HDT construction process:

- each component corresponds to (i.e., contains) a subset of the nodes of $T$;
- these subsets form a partition of the nodes of $T$;
- an I component corresponds to a single internal node of $T$;
- a C component corresponds to a connected subset of nodes in $T$, with at most two edges connecting it to the rest of $T$, one from the root of the subset to its parent, and one (if the component is downwards open) from a leaf of the subset to a child;
- a G component is downwards closed, is a child of an I component, and corresponds to the subtrees in $T$ rooted by a subset of the children of the node of this I component;
- each edge of the tree of components corresponds to some edge of the original $T$, except for edges between a G component and its parent I component, which may correspond to a set of edges in $T$ (between the internal node corresponding to the I component and a subset of its children).

The compositions performed in a round will be non-overlapping, i.e., each composition merges two components existing at the start of the round.
The new components formed by the compositions performed in a round will be internal nodes of the HDT, and the original components formed from the nodes of $T$ before the first round will be the leaves of the HDT. Only compositions create internal nodes (components) of the HDT. The transformations only change the types of components, and do not merge components. They have no influence on the structure of the HDT, and are only included because they will ease the formulation of our later decoration of the nodes of the HDT with triplet/quartet counting information.

All work during the construction of the HDT is done via DFS traversals of the current tree of components, starting at the current tree's root. Before the first round, we in a traversal first transform all L components into C components via the L → C transformation. Since no composition produces L components, such components will not appear again.

In each round, we in a traversal first transform all downwards closed C components being children of I components into G components via the C → G transformation. Now all children of I components are either G components or downwards open components of type C or I. Call an I component forking if it has at least two downwards open children. In a second traversal, we perform the compositions of the round as follows. G components will be considered visited when visiting their parent I component. Thus, the traversal will traverse paths (possibly of length zero) consisting of non-forking I components and C components, as well as traverse forking I components. During the traversal, the following non-overlapping set of compositions are performed:

1. For an I component (forking or not) with $g \ge 2$ children being G components, pair siblings arbitrarily into $\lfloor g/2 \rfloor$ pairs and perform a composition of type GG → G on each pair.
2. For a non-forking I component with only one child being a G component, perform one IG → C composition.
3. For a maximal subpath of $c \ge 2$ consecutive C components, form $\lfloor c/2 \rfloor$ consecutive pairs (pairing topmost with second, third with fourth, etc.) and perform a CC → C composition on each pair.

Note that any I component having less than two
children must have exactly one G component as a child (when created before the first round, all I components have at least two children, a number which can only decrease once some children have become G components, after which there will always be at least one G component child). Hence, all non-forking I components are covered by the cases above.

Lemma 4.1. Let $T$ be a rooted tree with $n$ nodes of arbitrary degree. The above construction of an HDT for $T$ works in time $O(n)$, where $n$ is the number of nodes in $T$. The resulting HDT is locally balanced: the subtree rooted at a node $v$ has height $h \le \log_{10/9} |v|$, where $|v|$ is the number of leaves in the subtree of $v$.

Proof. We first consider the effect of a round on the entire current tree of components (which we may assume has size at least two, otherwise no round would take place). We will assign each component appearing at the start of the round to some composition performed, in such a way that no composition is assigned more than 9 components. Since each composition reduces the number of components by one, this shows that each round reduces the number of components by at least a factor of 9/8.

The two components taking part in a composition are assigned to that composition. In situation 1, the I component and the possible non-paired G component are assigned to one of the compositions, giving at most four components assigned to any composition. In situation 3, there is possibly one non-paired C component to assign. If there is a non-forking I component at the upper end of the subpath of C components, we assign the non-paired C component to a composition taking place there (at least one is). So far, no composition has been assigned more than five components. Left to assign is at most one C component per path (from the topmost subpath of C components), as well as the forking I components not covered by 1, plus their possible single G component. Call a path with no forking I components at its lowest end a leaf path, and let $m$ be the number of forking I components. Since the paths and the forking I components form a tree with all internal nodes at least binary, there are at least $m + 1$ leaf paths.
There are $m$ non-leaf paths. A leaf path has length at least two, since leaf paths start at forking I components by a downwards open component (except for the special case $m = 0$, where the tree is a single leaf path; this path has length at least two since the size of the tree is at least two). Hence, at least one composition takes place on each of the leaf paths. Distributing the components left to assign evenly among these compositions gives on average $1 + 3m/(m + 1) < 4$ per composition. In total, no composition is assigned more than nine components and all components have been assigned. Hence, each round reduces the number of components by a factor of 9/8.

It follows that the height of the HDT is at most $\log_{9/8} n$, where $n$ is the number of nodes in $T$. Since each round can be performed in time linear in the current number of components, it also follows that the total construction time is $O\big(\sum_{i=0}^{\infty} n (8/9)^i\big) = O(n)$.

To prove that the HDT is locally balanced, we must, however, show that the subtree of any node $v$ has logarithmic height. A node $v$ in the HDT is a component of type I, C, or G. Any component corresponds to a subset $S$ of the nodes of $T$. In the initialization phase, these nodes are converted to a set of components, after which point each round will reduce this set of components to a smaller set via the compositions performed. The height $h$ of the subtree of $v$ is the number of rounds performed until the set reaches size one. In any round until round $h$, the components being descendants of $v$ form a partition of $S$.

If $v$ is an I component, it corresponds to a single node of $T$, so $h$ is zero. If $v$ is a C component, it corresponds to a connected subset of nodes in $T$, with at most two edges connecting it to the rest of $T$, one from the root of the subset to its parent, and one (if the component is downwards open) from a leaf of the subset to a child. Similarly, in any round until round $h$, the components partitioning $S$ form in the current tree of components a connected subset $T'$ of nodes with at most one edge leaving the subset upwards and at most one edge leaving it downwards.
During a round, the traversal algorithm performing compositions enters $T'$ via the upward edge and leaves via the downward edge. This can only differ from running the algorithm on $T'$ seen as a tree by itself (starting at the root of $T'$) if the two runs would differ on the pairing of consecutive C components on paths containing the entering and leaving edge. However, this would merge components inside and outside of $T'$, which cannot happen until after round $h$ (as $v$ would correspond to a different subset than $S$ of nodes in $T$). Hence, the analysis above for $T$ applies to $T'$: each round reduces its number of components by a factor of at least 9/8.

Finally, if $v$ is a G component, it corresponds to the subtrees in $T$ rooted by a subset of the children of the node of its parent I component. In any round until round $h$, the components partitioning $S$ form a subset $T'$ in the current tree of components of nodes consisting of subtrees below the same I component. For a given round, let $s$ be the number of subtrees in $T'$, and let $g \le s$ be the number of those consisting of a single G component (measured after the C → G transformations
at the start of the round). The remaining $s - g$ subtrees have size at least two, since their root is a downwards open component. Let $m$ be the total number of components in these subtrees. We have $s \ge 2$, otherwise $v$ would already have been formed. In this round, the algorithm performing compositions operates on the $s - g$ subtrees not being G components as if the algorithm was run on each of these individually (starting from their roots). Hence, $m$ is reduced by a factor of 9/8 in the round (as the subtrees have sizes at least two). The algorithm reduces the $g$ G components to $\lceil g/2 \rceil$, which is at least a reduction by a factor of 3/2, if $g \ge 2$. So for $g \ge 2$, the reduction in number of components is at least a factor of 9/8. If $g = 1$, we have $s \ge 2$, so at least one other tree exists, and contains at least two components. The total number of components is $m + 1$, of which $m$ is reduced at least by a factor of 9/8. For $g = 1$ the total fraction of components removed is therefore at least $(m/9)/(m + 1)$, which is at least 1/10. So for any value of $g$, the number of components is reduced by a factor of at least 10/9.

Summing up, no matter what type of component $v$ is, we have $h \le \log_{10/9} |v|$.

5 The Main Algorithm

In this section, we give the main algorithm, which recursively traverses the internal nodes of $T_1$. For each internal node $v$, it colors the leaves of $T_1$ according to $v$, and then queries the HDT of $T_2$ for the number of triplets/quartets in $T_2$ compatible with that coloring. The results of the queries are simply summed during the traversal of $T_1$. Correctness follows from the discussion in Sect. 2.

The construction and balance of the HDT was discussed in the previous section. In the next sections, we show how to decorate its nodes with counting information which allows the queries to be answered in $O(1)$ time from the information in the root of the HDT. This requires the HDT counting information to be updated after each coloring during the traversal of $T_1$. Recall that colors are integers between zero and $d$, where $d$ is the maximal degree of any node in the two trees.

For the triplet distance, $T_1$ is already rooted. For the quartet distance, we first choose an arbitrary internal node as the root.
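The outer structure summarized above (color according to $v$, query, sum) can be tried out with a deliberately naïve stand-in for the HDT. The Python sketch below computes the count $A$ for triplets by answering each query through brute-force enumeration over $T_2$, so it only illustrates the coloring and summation, not the running time of this paper. The nested-tuple encoding and every function name are assumptions for the example.

```python
from itertools import combinations

def is_leaf(node):
    return not isinstance(node, (list, tuple))

def leaves(node):
    """All leaf labels below a node."""
    return [node] if is_leaf(node) else [x for c in node for x in leaves(c)]

def internal_nodes(node):
    """Yield every internal node of the tree, root first."""
    if not is_leaf(node):
        yield node
        for child in node:
            yield from internal_nodes(child)

def leaf_paths(tree, prefix=()):
    """Map each leaf label to its root-to-leaf path of child indices."""
    if is_leaf(tree):
        return {tree: prefix}
    paths = {}
    for index, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (index,)))
    return paths

def lca_depth(p, q):
    depth = 0
    for a, b in zip(p, q):
        if a != b:
            break
        depth += 1
    return depth

def close_pair(paths, i, j, k):
    """The two leaves with the deepest LCA, or None for an unresolved triplet."""
    dij = lca_depth(paths[i], paths[j])
    dik = lca_depth(paths[i], paths[k])
    djk = lca_depth(paths[j], paths[k])
    if dij == dik == djk:
        return None
    if dij > dik:
        return frozenset((i, j))
    return frozenset((i, k)) if dik > dij else frozenset((j, k))

def color_according_to(v, labels):
    """Color 0 outside the subtree of v, color i in the i-th subtree of v."""
    colors = dict.fromkeys(labels, 0)
    for index, child in enumerate(v, start=1):
        for label in leaves(child):
            colors[label] = index
    return colors

def naive_query(colors, paths2):
    """Resolved triplets of the second tree compatible with the coloring."""
    count = 0
    for triple in combinations(colors, 3):
        pair = close_pair(paths2, *triple)
        if pair is None:
            continue
        (single,) = set(triple) - pair
        x, y = pair
        if colors[x] == colors[y] != 0 and colors[single] not in (0, colors[x]):
            count += 1
    return count

def shared_resolved_triplets(t1, t2):
    """Sum the query results over all internal nodes v of the first tree."""
    labels, paths2 = leaves(t1), leaf_paths(t2)
    return sum(naive_query(color_according_to(v, labels), paths2)
               for v in internal_nodes(t1))

print(shared_resolved_triplets(((1, 2), 3, 4), ((1, 2), (3, 4))))
```

The sketch recolors every leaf for every node, which is exactly the quadratic behavior that the recursive traversal described next is designed to avoid.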
The traversal of $T_1$ is a DFS-type traversal starting at the root. The following invariant is maintained during the traversal: When entering a node $v$, all leaves in the subtree of $v$ have the color 1, and all leaves not in the subtree of $v$ have the color 0. When exiting $v$, all leaves in $T_1$ have the color 0. Before the traversal, all leaves are colored 1 to initialize the invariant. The traversal is then performed by the recursive algorithm Count shown in Fig. 7, starting with an initial call Count($r$), where $r$ is the root of the tree.

In Count, the size of a subtree is defined as its number of leaves. The sizes of all subtrees are found in linear time during a preprocessing (cf. Sec. 3). The subroutine of coloring by a given color all leaves in a subtree of $T_1$ is performed by a traversal of the subtree, and takes time linear in the size of the subtree. The corresponding leaves of $T_2$ are simultaneously given the same colors (leaves in $T_1$ and $T_2$ with the same labels are kept paired together by bidirectional pointers).

In the time analysis of Count, we cover the work of coloring the leaves in a subtree in $T_1$ (and corresponding leaves in $T_2$) by charging $O(1)$ work to a leaf in $T_1$ each time it is colored. Note that the recoloring work in an instance of Count($v$) happens at leaves not in the largest child $c_1$. Hence, if a leaf is charged at instances Count($v$) at ancestor nodes $v = v_1, v_2, v_3, \ldots$ in $T_1$ (listed in order of increasing depth in the tree), then the size of the subtree of $v_{j+1}$ is at most half that of $v_j$. Hence, a leaf is only charged $O(\log n)$ times, for a total charge of $O(n \log n)$ during the entire run of Count($r$).

The remaining part of the time analysis is to consider the work done to update the HDT when the colors of leaves change. Our goal is to show that this work, when recoloring leaves in subtree $c_i$ in an instance Count($v$), can be covered by charging $O(1 + \log(|v|/|c_i|))$ to each leaf in the subtree of $c_i$. If this is possible, each leaf will be charged $\sum_j O(1 + \log(|v_j|/|v_{j+1}|))$, where $v_1, v_2, v_3, \ldots$ are the ancestor nodes incurring charge, listed in order of increasing depth.
Since this sum is telescoping in the second part of the terms, and by the analysis above has at most $\log n$ terms, the sum is $O(\log n)$, leading to $O(n \log n)$ total work.

To achieve the goal, we will extend the above version of Count($v$) by compressing $T_2$ and its HDT during the recursion such that both are always of size $O(|v|)$ when invoking Count($v$). As will be seen in the next sections, the time for updating a node in the HDT from its children when a leaf changes color inside the component of the node (which is a leaf in its subtree in the HDT) will be $O(1 + x)$, where $x$ is the number of colors different from 0 used inside the component. The total updating work in the first for-loop in Count can be performed by first marking the paths in the HDT from all leaves recolored, and then propagating the change of information in the nodes of the HDT in a bottom-up manner, passing each marked node only once. Since the only colors in use at the start of Count($v$) by the invariant are 0 and 1, a nonconstant value of $x$ for a node in the HDT means that
$\Theta(x)$ different colors affected leaves in its subtree, hence for $\Theta(x)$ different values of $i$ a subtree $c_i$ had leaves below the node. Let $k = |c_i|$. We know from Lemma 4.1 that the HDT is locally balanced, and from Lemma in [ that the union of $k$ root-to-leaf paths in a locally balanced tree of size $n$ contains $O(k(1 + \log \frac{n}{k}))$ nodes. Since here, $n = O(|v|)$ via the compression done, it follows that charging $O(1 + \log(|v|/|c_i|))$ to each leaf in the subtree of $c_i$ covers the cost, as $x$ different such unions of paths pass the node updated in the HDT. This was the goal. The same analysis applies to the second of the for-loops.

Count(v)
    if v is a leaf
        Color v by the color 0.
    else
        Let c_1 be the child of v with largest subtree, and let c_2, ..., c_d be its remaining children.
        for i = 2 to d
            Color the leaves in the subtree of c_i by the color i.
        // Leaves are now colored according to v
        Query the HDT for the number of triplets/quartets in T_2 compatible with the coloring.
        Add that number to the global count.
        for i = 2 to d
            Color the leaves in the subtree of c_i by the color 0.
        Count(c_1)
        for i = 2 to d
            Color the leaves in the subtree of c_i by the color 1.
            Count(c_i)

Figure 7: The main algorithm performing a recursive traversal of $T_1$.

We now outline how to perform the compression of $T_2$ and its HDT during the recursion in Count. This will be done by adding two extra subroutines, contract and extract, to Count. Similar contract and extract operations appeared in [.

The operation contract is performed just before the recursive call Count($c_1$), if the size of $c_1$ is less than some fixed fraction of the size of the current version of $T_2$, and it generates a new, contracted copy of $T_2$, to be used for the call Count($c_1$). The operation acts on $T_2$ after the leaves corresponding to $c_1$ have been designated as incontractible. Using the compositions from Sec. 4 greedily as long as one applies, a tree of size $O(|c_1|)$ can be constructed which contains all the designated leaves with the same induced topologies for all subsets of these leaves. Using a queue of possible composition points in the tree during the contraction, this can be done in $O(|v|)$ time. The HDT is built from scratch for the new tree in time $O(|c_1|)$, using Lemma 4.1.
Since only two colors will be in use when contracting, the decoration of the HDT also takes time $O(|c_1|)$. Contract operations can happen along a path in the recursion tree going from nodes towards their largest child $c_1$. Since the sizes of the contracted versions of $T_2$ generated along such a path are exponentially decreasing, the cost of the topmost contract operation on the path is proportional to the total cost of all contract operations on the path. New paths can start in the recursive calls Count($c_i$) for $i \geq 2$, but the cost of those can be charged to the extracted version of $c_i$, as will follow from the analysis of the extract operation below.

The second operation will extract a tree of size $O(|c_i|)$ containing the leaves of the subtree $c_i$ with the same induced topologies for all subsets of these leaves. This is done just after the query to the HDT, when these leaves all have color $i$, and is repeated for all $i \geq 2$. It is done by finding the union of root-to-leaf paths in the HDT from the leaves of $c_i$ to the root by traversing the paths one by one, stopping when a previously traversed path is met. During a top-down traversal of the union of the paths, all subtrees hanging off the sides of this union of paths are converted to single nodes, with the counting information reset to represent the situation where all leaves in these subtrees have color 0. This takes time $O(m + m \log(|v|/m))$, where $m = |c_i|$, and gives a tree of size $O(m + m \log(|v|/m))$. Running contract on this tree with the $m$ leaves marked non-contractible produces a tree of size $O(|c_i|)$, for which a HDT is then built with the designated leaves having the color 1, the rest the color 0, all of this in the same time bound. By the analysis above, charging $O(1 + \log(|v|/|c_i|))$ to each leaf in the subtree of $c_i$ covers the cost. As mentioned above, these new trees and associated HDTs are produced for each $c_i$ for $i \geq 2$ just after the query in Count($v$). They are used in the last for-loop when calling recursively, i.e., they take the place of $c_i$ in the call Count($c_i$). This ensures the desired linear size constraint for these invocations.

6 Counting shared triplets and quartets

In this section we consider the basic idea of how to count the number of shared triplets and quartets between two trees $T_1$ and $T_2$, where all triplets and quartets are anchored at a node $v$ of degree $d$ in $T_1$. We assume all leaves in the $i$th subtree below $v$ have been colored $i$, and all leaves not in $v$'s subtree are colored 0. The resolved triplets anchored at $v$ in $T_1$ are exactly all triplets of leaves where all leaves have a nonzero color and two of the leaves have the same color. The unresolved triplets anchored at $v$ are exactly all triplets with three nonzero and distinctly colored leaves. Given the coloring of the leaves in $T_1$ we must count the number of triplets in $T_2$ satisfying the same color constraints, since these are exactly the triplets anchored at $v$ in $T_1$ which also appear in $T_2$.

Unfortunately, a resolved quartet in $T_1$ can appear in three different ways, type α, β, and γ (see the figure introducing these types), since we consider a rooted version of $T_1$. Furthermore, in $T_2$ the same quartet may appear as a resolved quartet of any of the types α, β, and γ. This leaves us with nine cases to search for. Furthermore, due to the coloring of the leaves, we should also search for all relevant permutations of the colors in $T_2$. Figure 8 summarizes the different cases we need to search for in $T_2$ given a coloring of $T_1$. Note that searching for resolved quartets appearing as type α in $T_1$ and type γ in $T_2$ can be reduced to the symmetric problem of searching for quartets appearing as type γ in $T_1$ and type α in $T_2$ by swapping the two trees. Thus the cases α-β, α-γ, and β-γ, omitted in Fig. 8, can be identified by swapping the two trees and searching for β-α, γ-α, and γ-β, respectively. We will not discuss all cases, but only consider the case γ-β, where a quartet appears as type γ in $T_1$ anchored at $v$ and type β in $T_2$.
We know that one leaf must have color 0, and three leaves must have nonzero color, of which exactly two have the same color. In $T_2$ the quartet is rooted at a node $v'$, such that the four leaves belong to three subtrees below $v'$, and exactly two leaves are in the same subtree below $v'$. Since the split given by the quartet in $T_1$ is the split $ii|j0$, where $i \neq j$, this coloring can only be realized by the two colorings in the γ-β entry in Fig. 8.

We count the shared triplets and quartets that are anchored at $v$ in $T_1$ by considering a hierarchical decomposition of $T_2$. A triplet or quartet is anchored at a node $v$ in a hierarchical decomposition of $T_2$ if the component of $v$ contains all leaves of the triplet or quartet, but neither of the two children of $v$ has this property. Triplets and quartets can only be anchored at nodes corresponding to $CC \to C$ and $GG \to G$ component compositions. We will count the number of shared triplets and quartets in a bottom-up traversal of the hierarchical decomposition of $T_2$, where in each node $v$ we compute the number of triplets and quartets anchored at $v$ (Fig. 12 and Fig. 13) using a set of counters that we compute for each node of the hierarchical decomposition of $T_2$. The details are described in the following subsections. Since dealing with quartets requires a significant number of cases, we have tried to be as systematic as possible in the figures. The cases could likely be reduced by combining some of the cases/counters, but to ensure that all cases were covered, we decided on a more systematic approach in the presentation.

7 Counters for components

The counters we maintain for each component are shown in Fig. 10. The counters required for counting shared triplets are specially marked there. The majority of the counters are only used for counting shared resolved quartets.

We first explain some basic counters. For a component $C$ we let $n^C_i$ denote the number of leaves with color $i$ in component $C$. Here $i$ can be any of the colors $1, \ldots, d$. The number of leaves colored zero in $C$ is $n^C_0$. The number of leaves with a nonzero color is denoted $n^C_\circ$. In the illustrations in Fig. 10 we use $\circ$ to indicate a color that can take several values. Similarly we have counters for $G$ components.
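To make the color constraints of Section 6 and the basic counters concrete, the following Python sketch counts the triplets anchored at $v$ directly from the leaf colors. It is an illustration under stated assumptions rather than part of the algorithm: the function names are hypothetical, "two of the leaves have the same color" is read as exactly two, and the closed forms $\sum_i \binom{n_i}{2}(n_\circ - n_i)$ and the third elementary symmetric polynomial of the $n_i$ are standard counting identities, not formulas taken from the paper.

```python
from collections import Counter
from itertools import combinations
from math import comb

def anchored_triplets_bruteforce(colors):
    """Enumerate all leaf triples; colors[k] is the color of leaf k (0 = outside v)."""
    resolved = unresolved = 0
    for triple in combinations(colors, 3):
        if 0 in triple:
            continue                      # anchored triplets use nonzero colors only
        distinct = len(set(triple))
        if distinct == 2:                 # exactly two leaves share a color
            resolved += 1
        elif distinct == 3:               # three distinct nonzero colors
            unresolved += 1
    return resolved, unresolved

def anchored_triplets_from_counts(colors):
    """Same numbers, computed only from the per-color leaf counts n_i."""
    n = Counter(c for c in colors if c != 0)
    nonzero = sum(n.values())             # number of leaves with nonzero color
    resolved = sum(comb(k, 2) * (nonzero - k) for k in n.values())
    # Third elementary symmetric polynomial via power sums (Newton's identities).
    p1 = nonzero
    p2 = sum(k * k for k in n.values())
    p3 = sum(k ** 3 for k in n.values())
    unresolved = (p1 ** 3 - 3 * p1 * p2 + 2 * p3) // 6
    return resolved, unresolved

colors = [0, 0, 1, 1, 1, 2, 2, 3]
print(anchored_triplets_bruteforce(colors))    # (13, 6)
print(anchored_triplets_from_counts(colors))   # (13, 6)
```

The second function only needs one counter per color in use, which is the kind of per-component bookkeeping the counters of this section generalize.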
Some counters are defined for both $C$ and $G$ components, and some are only defined for $C$ components. Those defined for both $C$ and $G$ components have a superscript $X$ in Fig. 10 and Fig. 11.

We now consider more advanced counters. Each of the remaining counters counts the number of pairs or triplets in a component with some constraints on the coloring and relative placement in the component. This is captured by the subscripts. The color constraint is indicated by $0$, $i \in \{1, \ldots, d\}$, $\circ$, and $\square$. Here $\circ$ denotes any color in $\{1, \ldots, d\}$ if $i$ does not appear in the subscript, and otherwise any color in $\{1, \ldots, d\} \setminus \{i\}$, and $\square$ is any color in $\{1, \ldots, d\} \setminus \{i, \circ\}$. Two colors in a subscript, e.g. $n^G_{i\circ}$, count the number of pairs of leaves in $G$ with nonzero colors $i$ and $j$, for any $j \neq i$, where the leaves are in two distinct subtrees of the super root (i.e. the LCA of the two leaves in $T_2$ is the super root of $G$). For a $C$ component the corresponding counter $n^C_{i\circ}$ counts the same type of pairs of leaves, but where the LCA of the two leaves is a node on the external path (i.e. the path connecting the two external edges) and neither leaf is contained in the subtree rooted at the child on the external path.

Parentheses around the subscripts, e.g. $n^C_{(i\circ)}$ and $n^G_{(i\circ)}$, require the LCA of the leaves not to be a node on the external path or the super root, respectively. E.g. for the pairs counted by $n^C_{(\circ\square)}$ the colors are required to be two distinct colors in $\{1, \ldots, d\}$. Going one step further, we let e.g. $n^G_{(i(0\circ))}$ denote the number of resolved triplets in $G$ where all three leaves are in the same subtree below the super root (outer parentheses in the subscript), one of the leaves is colored $i$, one is colored zero, and the last leaf has any color in $\{1, \ldots, d\} \setminus \{i\}$ ($\circ$ in the subscript), and the LCA of the last two leaves is lower than the LCA with $i$ (the inner parentheses). Dropping the outer parentheses, i.e. $n^G_{i(0\circ)}$, requires the triplets to be rooted at the super root of $G$.

For $C$ components we use $\to$ to require that the leaves branch out from different nodes on the external path. E.g. for $n^C_{i \to \circ \to 0}$ we count triplets with leaves colored $i$, $\{1, \ldots, d\} \setminus \{i\}$, and $0$, where the leaves branch out from three different nodes on the external path, and where the left-to-right order in the subscript equals the bottom-up order on the external path. Parentheses and $\to$ can also be combined (see Fig. 10). Finally, we use square brackets to indicate that we do not require anything special with respect to the super root or nodes on the external path. E.g. a subscript of the form $[a(bc)]$ counts all triplets consisting of three leaves in $C$ colored $a$, $b$, and $c$, where the LCA of the last two leaves is lower than the LCA with the leaf colored $a$. We do not care whether the triplets are rooted on the external path or contain edges from the external path.

Figure 10 lists and illustrates all the up to 23 different types of counters for a $C$ component we need to maintain. Note that 8 of the types involve $i$ in the subscript, i.e. each of these covers $d$ different counters, one for each possible value of $i \in \{1, \ldots, d\}$. In total we need to maintain up to $15 + 8d$ counters for a $C$ component.

Figure 8: Shared resolved quartets, the different cases that should be counted.

Figure 9: Quartets resolved differently in the two trees, the different cases that should be counted.

7.1 Counters for shared resolved triplets

Figure 10 illustrates the counters for each node $v$ of the hierarchical decomposition of $T_2$, and Fig. 11 shows in a very condensed form how to compute these, given the counters for the children of $v$. For a $CC \to C$ composition it is trivial that $n^C_i = n^{C_1}_i + n^{C_2}_i$, i.e. the number of leaves colored $i$ in $C$ is the sum of the number of leaves colored $i$ in $C_1$ and $C_2$.
A nontrivial case is $n^X_{i\circ}$. For a $CC \to C$ composition we have $n^C_{i\circ} = n^{C_1}_{i\circ} + n^{C_2}_{i\circ}$, since we require that the two leaves colored $i$ and $j$, $j \neq i$, branch out from the same node on the external path. For a $GG \to G$ composition the equation becomes

$$n^G_{i\circ} = n^{G_1}_{i\circ} + n^{G_2}_{i\circ} + n^{G_1}_i \left(n^{G_2}_\circ - n^{G_2}_i\right) + n^{G_2}_i \left(n^{G_1}_\circ - n^{G_1}_i\right),$$

where the four terms come from: that the pair is in $G_1$, that the pair is in $G_2$, that the $i$ leaf is in $G_1$ and the non-$i$ leaf is in $G_2$, or the symmetric case with $G_1$ and $G_2$ swapped. Since terms of this shape are common to many $GG \to G$ compositions, we restate this as

$$n^G_{i\circ} = n^{G_1}_{i\circ} + n^{G_2}_{i\circ} + g(G_1, G_2) + g(G_2, G_1), \qquad g(G_1, G_2) = n^{G_1}_i \left(n^{G_2}_\circ - n^{G_2}_i\right),$$

such that we only have to define $g$ for the $GG \to G$ composition. In the table a placeholder symbol stands for the subscripts in a row when defining generic expressions applying to several rows.

Counters involving $\to$ in the subscript only apply to $C$ components. E.g. a counter for three leaves branching out from three different nodes on the external path can be computed as the sum of the same counter for $C_1$ and for $C_2$ plus two products of counters of $C_1$ and $C_2$, where the first two terms are common to many counters, and the remaining part is referred to as $f$ in Fig. 11. Here $f$ is the number of triplets where either only the first leaf branches out in $C_1$ and the other two leaves branch out in $C_2$, or where the first two leaves branch out in $C_1$ and only the non-$i$ leaf branches out in $C_2$. We leave it to the reader to do the tedious work of checking all the calculations in Fig. 11.

It should be noted that we only need to keep track of counters that either have no $i$ in the subscript, or have an $i$ in the subscript and at least one leaf in the component is colored $i$ (these counters are always zero if no leaf is colored $i$ in the component). If there are $x$ different colors from $\{1, \ldots, d\}$ used to color the leaves in a component, then up to $15 + 8x$ counters should be computed. Eight of the sums are sums over $x$ terms (contained in the box of Fig. 11), and the remaining $7 + 8x$ sums only contain a constant number of terms. It follows that all counters at a node in the hierarchical decomposition can be computed in $O(1 + x)$ time.

7.2 Counting shared triplets

Having defined the counters, we can now describe how to compute the number of shared resolved and unresolved triplets anchored at a node in the hierarchical decomposition of $T_2$. Figure 12 lists the required computations.
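The $GG \to G$ rule of Section 7.1 for the pair counter (pairs consisting of one leaf colored $i$ and one leaf of a different nonzero color, placed in two distinct subtrees of the super root) can be checked on a toy instance. The sketch below is only an illustration under stated assumptions: a $G$ component is modeled as a plain list of subtrees, each subtree as a list of leaf colors, and all function names are hypothetical.

```python
def leaves_colored(G, i):
    """Number of leaves colored i in the G component."""
    return sum(subtree.count(i) for subtree in G)

def leaves_nonzero(G):
    """Number of leaves with a nonzero color in the G component."""
    return sum(1 for subtree in G for c in subtree if c != 0)

def pairs_bruteforce(G, i):
    """Pairs (leaf colored i, leaf colored j with j not in {0, i}) whose two
    leaves lie in two distinct subtrees below the super root."""
    total = 0
    for a, sub_a in enumerate(G):
        for b, sub_b in enumerate(G):
            if a != b:
                total += sub_a.count(i) * sum(1 for c in sub_b if c not in (0, i))
    return total

def g(Ga, Gb, i):
    """Cross term: the i leaf is in Ga and the non-i leaf is in Gb."""
    return leaves_colored(Ga, i) * (leaves_nonzero(Gb) - leaves_colored(Gb, i))

def pairs_composed(G1, G2, i):
    """Counter for the composition of G1 and G2, from the counters of the parts."""
    return (pairs_bruteforce(G1, i) + pairs_bruteforce(G2, i)
            + g(G1, G2, i) + g(G2, G1, i))

G1 = [[1, 2], [0, 1]]
G2 = [[2, 2, 3], [1]]
print(pairs_composed(G1, G2, 1), pairs_bruteforce(G1 + G2, 1))   # 11 11
```

In the actual data structure the counters of $G_1$ and $G_2$ would already be stored at the children, so only the two cross terms are evaluated when composing; the brute-force routine is used here solely to validate the rule.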
The number of shared resolved triplets anchored in a $GG \to G$ composition is given by the equation

$$\sum_{i=1}^{d} n^{G_1}_{(ii)} \left(n^{G_2}_\circ - n^{G_2}_i\right) + \sum_{i=1}^{d} n^{G_2}_{(ii)} \left(n^{G_1}_\circ - n^{G_1}_i\right),$$

since a triplet anchored at $v$ in $T_1$ and at the super root of $G$ in $T_2$ must have two equally colored leaves (color $i$) in one subtree below the super root of $G$ and another colored leaf in another subtree, where one subtree is in $G_1$ and the other in $G_2$. Since most sums have two symmetric terms with $G_1$ and $G_2$ swapped, we have introduced the function $\sigma[f(G_1, G_2)] = f(G_1, G_2) + f(G_2, G_1)$, such that e.g. the above sum can be rewritten as $\sigma\left[\sum_{i=1}^{d} n^{G_1}_{(ii)} \left(n^{G_2}_\circ - n^{G_2}_i\right)\right]$.

The number of shared resolved triplets anchored in a $CC \to C$ composition involves different cases. A resolved triplet anchored at $v$ in $T_1$ has two leaves colored $i$ and one colored $j$, for some colors $i$ and $j$ with $i \neq j$. The edge between $C_1$ and $C_2$ in $T_2$ can split the triplet in four different ways (one for each edge of the triplet). To make sure we count all cases we have labeled the edges of the triplet with the numbers 1-4, and for each sum counting resolved shared triplets anchored in the node, we list the splits of the triplet which are covered. The three sums in Fig. 12 count the number of triplets where $C_1$ contains one $i$, both $i$'s, and $j$, respectively (note that $C_1$ cannot contain one $i$ and $j$ without containing both $i$'s). For a node in the hierarchical decomposition of $T_2$ we can compute the number of shared resolved triplets anchored at this node in $O(1 + x)$ time, where $x$ is the number of distinct colors from $\{1, \ldots, d\}$ used to color the leaves in the component.

For shared unresolved triplets the counting is simpler (see Fig. 12) and can also be done in $O(1 + x)$ time. Here the three leaves of a triplet have different colors. For $CC \to C$ compositions one leaf must be in $C_1$, colored $i$, and the two other colored leaves must branch out from a single node on the external path in $C_2$ to two different subtrees. For $GG \to G$ compositions the triplet must branch to three different subtrees of the super root of $G$, where one leaf colored $i$ is in $G_1$ or $G_2$, respectively, and the other leaves are in $G_2$ or $G_1$, respectively.

Figure 12: Contribution to the number of shared triplets for component compositions.

7.3 Counting shared resolved quartets

Counting shared resolved quartets (Fig. 13) follows the same approach as for counting shared triplets, except that the number of cases to consider increases significantly. The asymptotic time is also the same, but with a significantly bigger constant. First, each resolved quartet in $T_1$ can appear as one of the types α-γ, and the identical quartet in $T_2$ can also appear as one of the types α-γ. This implies nine different cases. Three cases can by symmetry be counted by swapping $T_1$ and $T_2$, see Fig. 8. For quartets of type α and γ in $T_2$ there are six possible splits to consider in the $CC \to C$ composition, whereas type β in $T_2$ only needs five possible splits to consider. For $GG \to G$ compositions there is one case if the root of the quartet (not the anchor) has degree two (types α and γ) and three cases if the root has degree three (type β), since for degree three we need to consider the possible ways that the three subtrees can be distributed between the two subcomponents (actually there are two and six cases, respectively, since there are also the symmetric cases with $G_1$ and $G_2$ swapped, but this is counted using $\sigma[\cdot]$). In Fig. 13 we write next to each sum the distribution of the edges from the two subcomponents covered by the sum: a label of the form $ab{:}c$ indicates that edges $a$ and $b$ lead to one component, and $c$ to the other.

7.4 Counting disagreeing resolved quartets

In this section we count the number of four-tuples of leaves that are resolved quartets in both trees, but where the four leaves define different quartet splits in the two trees. We take the same general approach as for counting shared resolved quartets. The cases that should be considered are summarized in Fig. 9. The computation of the contribution for each of the cases at a node in a hierarchical decomposition of $T_2$ is shown in Fig. 16. The additional counters required for the computation are listed in Fig. 14, and their computation in Fig. 15. The specially marked sums are the bottleneck of the computation, since they require $O(d + dx)$ time to compute, where $x$ is the number of different colors from $\{1, \ldots, d\}$ used to color the leaves in the component. The total time to handle a node in a hierarchical decomposition of $T_2$ is $O(d + dx) = O(d(1 + x))$, i.e. in the worst case a factor $d$ larger than required by the analysis in Sec. 5 for achieving $O(n \log n)$ running time. It follows that the total time for computing the number of disagreeing resolved quartets becomes $O(dn \log n)$.

The simplest case illustrating the bottleneck in our approach is the computation of the contribution of the case α-α in a $GG \to G$ composition. For the corresponding quartets in the two trees (the α-α case in Fig. 9), the contribution can be computed in $O(d + dx)$ time as

$$\sum_{i=1}^{d} \sum_{j=i+1}^{d} n^{G_1}_{(ij)} \, n^{G_2}_{(ij)}.$$

Interestingly, we arrive at the same kind of bottleneck when trying to directly compute the number of quartets that are unresolved in one tree and resolved in the other.

References

[1] M. S. Bansal, J. Dong, and D. Fernández-Baca. Comparing and aggregating partially resolved trees. Theoretical Computer Science, 412(48):6634-6652, 2011.
[2] G. S. Brodal, R. Fagerberg, and C. N. S. Pedersen. Computing the quartet distance between evolutionary trees in time O(n log n). Algorithmica, 38(2):377-395, 2003.
[3] C. Christiansen, T. Mailund, C. N. S. Pedersen, M. Randers, and M. S. Stissing. Fast calculation of the quartet distance between trees of arbitrary degree. Algorithms for Molecular Biology, 1, 2006.
[4] D. E. Critchlow, D. K. Pearl, and C. L. Qian. The triples distance for rooted bifurcating phylogenetic trees. Systematic Biology, 45(3):323-334, 1996.
[5] W. H. E. Day. Optimal algorithms for comparing trees with labeled leaves. Journal of Classification, 2(1):7-28, 1985.
[6] G. F. Estabrook, F. R. McMorris, and C. A. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34(2):193-200, 1985.
[7] J. Nielsen, A. Kristensen, T. Mailund, and C. N. S. Pedersen. A sub-cubic time algorithm for computing the quartet distance between two general trees. Algorithms for Molecular Biology, 6:15, 2011.
[8] D. F. Robinson and L. R. Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53(1-2):131-147, 1981.
[9] M. S. Stissing, C. N. S. Pedersen, T. Mailund, G. S. Brodal, and R. Fagerberg. Computing the quartet distance between evolutionary trees of bounded degree. In Proceedings of the 5th Asia-Pacific Bioinformatics Conference (APBC), pages 101-110. Imperial College Press, 2007.
Figure 10: Illustration of the different counters used to count shared triplets and quartets.
Figure 11: Computation of counters for the different component compositions.
Figure 13: Contribution to the number of shared resolved quartets for component compositions (cases αα, βα, ββ, γα, γβ, γγ).
Figure 14: The additional counters used to compute the number of disagreeing resolved quartets.
Figure 15: Computation of counters for the different component compositions.
Figure 16: Contribution to the number of disagreeing resolved quartets (cases αα, βα, ββ, γα, γβ, γγ).