Top-K Nearest Keyword Search on Large Graphs

Miao Qiao, Lu Qin, Hong Cheng, Jeffrey Xu Yu, Wentao Tian
The Chinese University of Hong Kong, Hong Kong, China
{mqiao,lqin,hcheng,yu,wttian}@se.cuhk.edu.hk

ABSTRACT
It is quite common for networks emerging nowadays to have labels or textual contents on the nodes. On such networks, we study the problem of top-k nearest keyword (k-NK) search. In a network G modeled as an undirected graph, each node is attached with zero or more keywords, and each edge is assigned a weight measuring its length. Given a query node q in G and a keyword $\lambda$, a k-NK query seeks k nodes which contain $\lambda$ and are nearest to q. k-NK is not only useful as a stand-alone query but also as a building block for tackling complex graph pattern matching problems. The key to an accurate k-NK result is a precise shortest distance estimation in a graph. Based on the latest distance oracle technique, we build a shortest path tree for a distance oracle and use the tree distance as a more accurate estimation. With such a representation, the original k-NK query on a graph can be reduced to answering the query on a set of trees and then assembling the results obtained from the trees. We propose two efficient algorithms to report the exact k-NK result on a tree. One is query time optimized for a scenario when a small number of result nodes are of interest to users. The other handles k-NK queries for an arbitrarily large k efficiently. In obtaining a k-NK result on a graph from that on trees, a global storage technique is proposed to further reduce the index size and the query time. Extensive experimental results conform with our theoretical findings, and demonstrate the effectiveness and efficiency of our k-NK algorithms on large real graphs.

1. INTRODUCTION
Many real-world networks emerging nowadays have labels or textual contents on the nodes. For example, in a road network, a location may have labels such as McDonald's, hospital, and kindergarten. In a social network, a person may have information including name, interests and skills.
In a bibliographic network, a paper may have keywords and an abstract, and an author may have a name, affiliation and email address.

In this study, we consider the problem of top-k nearest keyword (k-NK) search on large networks. In a network G modeled as an undirected graph, each node is attached with zero or more keywords, and each edge is assigned a weight measuring its length. Given a query node q in G and a keyword $\lambda$, a k-NK query in the form of $Q = (q, \lambda, k)$ looks for k nodes which contain $\lambda$ and are nearest to q. Different from a large body of research on k-nearest neighbor (k-NN) search on spatial networks [,, 6, 8, 9, 7], we define G as a general graph without coordinates. Thus our solution can apply to a wide range of networks.

Articles from this volume were invited to present their results at The 39th International Conference on Very Large Data Bases, August 26th - 30th 2013, Riva del Garda, Trento, Italy. Proceedings of the VLDB Endowment, Vol. 6, No. 10.

Motivation. k-NK is an important and useful query in graph search. As a stand-alone query, it has a wide range of applications. Furthermore, it can serve as a building block for tackling complex graph pattern matching problems which impose both structural and textual constraints. Here we list a few applications of k-NK queries.

Consider the social network Facebook as an example, in which personalized search based on graph structure and textual contents has become increasingly popular. A person looks for k friends or potential friends who like hiking to participate in a hiking activity.
Intuitively, if two persons share some common friends, i.e., they are two hops away, they are more likely to become friends. In contrast, if they are far away from each other in the network, they are less likely to establish a link. Thus the problem is to find the k persons who like hiking and are nearest to the person who serves as the organizer. It can be answered by a k-NK query. More generally, we also consider a query containing multiple keywords connected by AND or OR operators to express more complex semantics, e.g., a person looks for k friends or potential friends who like hiking AND (OR) photography and are nearest to him. (Facebook Graph Search: https://www.facebook.com/about/graphsearch)

Take a road network with locations associated with keywords as another example. For parents looking for the k kindergartens nearest to their home for their children, their requirements can be expressed by a k-NK query where the query node is the home location, and the keyword is kindergarten.

In the third example, we show how k-NK queries serve as a building block for solving the graph pattern matching problem. Consider a couple who wants to buy a house. They have some constraints, such as having a kindergarten and a hospital within one given distance of their home, and a supermarket within another. These constraints can be expressed as a star pattern, and the pattern matching problem can be decomposed into three k-NK queries with keywords kindergarten, hospital and supermarket respectively and k = 1 for each potential house location to be considered.

Recently, Bahmani and Goel [] have designed a Partitioned Multi-Indexing (PMI) scheme to answer k-NK queries approximately. PMI is an inverted index built based on distance oracle [0], which is a distance estimation technique. Given a k-NK query $Q = (q, \lambda, k)$, it returns k nodes containing keyword $\lambda$ in ascending order of their approximate distance from the query node q. PMI inherits the $O(\log |V|)$ approximation factor for distance estimation from distance oracle [0], where V is the set of nodes in the graph. The major drawback of PMI is that its distance estimation error can be quite large in practice. This can greatly distort the ranking of the candidate nodes carrying the query keywords, and thus lead to a low result quality.

In this work, we study how to answer k-NK queries accurately and efficiently using a compact index. The key to an accurate k-NK result is a precise shortest distance estimation in a graph. As we use a general graph model, existing k-NN solutions on spatial networks [,, 6, 8, 9, 7] cannot be applied, as they usually rely on specialized structures that leverage properties of spatial data to optimize their solutions. Instead we use distance oracle [0] as the fundamental distance estimation framework. For each component of a distance oracle, we build a shortest path tree, based on which we can estimate the shortest distance between two nodes by their tree distance. The tree distance is more accurate than the distance estimated by the distance oracle, which we call witness distance to distinguish the two. As we transform a distance oracle on a graph into a set of shortest path trees, the original k-NK query on the graph can be reduced to answering the k-NK query on a set of trees. Thus we first focus on processing k-NK queries to find exact top-k answers on a tree. Then we study how to assemble the results obtained from the trees to form the approximate top-k answers on the graph.

Contributions. Our main contributions in this work are summarized as follows.

(1) Given a tree, we first consider a common scenario when users are interested in a small number of answer nodes bounded by a small constant $\bar{k}$, i.e., $k \le \bar{k}$. We propose the first algorithm tree-boundk with query time $O(k + \log |V_\lambda|)$, where $|V_\lambda|$ is the number of nodes carrying the query keyword $\lambda$, and index size $O(\bar{k} \cdot |doc(V)|)$, where $|doc(V)|$ is the total number of keywords on all the nodes in the graph.

(2) Next we remove the $\bar{k}$ restriction and handle k-NK queries for an arbitrary k on a tree.
We propose the second algorithm tree-k with query time $O(k \cdot \log |V_\lambda|)$ and index size $O(|doc(V)| \cdot \log |V|)$, which is independent of k and thus more scalable.

(3) Based on our proposed tree algorithms, we present our algorithm for approximate k-NK queries on a graph. We propose a global storage technique to further reduce the index size and the query time. We also show how to extend our methods to handle a query with multiple keywords.

(4) Our experimental evaluation demonstrates the effectiveness and efficiency of our k-NK algorithms on large real-world networks. We show the superiority of our methods in ranking top-k answer nodes accurately, when compared with the state-of-the-art top-k keyword search method [].

Roadmap. The rest of the paper is organized as follows. Section 2 formally defines the problem. Section 3 discusses two existing related studies and their drawbacks. Section 4 presents our framework. Sections 5 and 6 introduce two proposed algorithms to answer k-NK queries on a tree for a small k and an arbitrary k respectively using compact index structures. Section 7 elaborates on the way to answer k-NK queries on a graph by approximating the graph with a bounded number of trees. Section 8 presents extensive experimental evaluation. Section 9 reviews the previous works related to ours. Finally, Section 10 concludes the paper.

2. PROBLEM DEFINITION

Figure 1: A Graph G with Keywords

We model a weighted undirected graph as G(V, E), where V(G) represents the set of nodes and E(G) represents the set of edges in G. We use V and E to denote V(G) and E(G) if the context is obvious. Each edge $(u, v) \in E$ has a positive weight, denoted as weight(u, v). A path $p = (v_1, v_2, \dots, v_l)$ is a sequence of l nodes in V such that for each $v_i$ ($1 \le i < l$), $(v_i, v_{i+1}) \in E$. The weight of a path is the total weight of all edges on the path. For any two nodes $u \in V$ and $v \in V$, the distance of u and v on G, dist(u, v), is the minimum weight of all paths from u to v in G. Each node $v \in V$ contains a set of zero or more keywords, which is denoted as doc(v).
The union of keywords for all nodes in G is denoted as doc(V). Note that doc(V) is a multiset and $|doc(V)| = \sum_{v \in V} |doc(v)|$. We use $V_\lambda \subseteq V$ to denote the set of nodes carrying keyword $\lambda$ in V.

DEFINITION 1. Given a graph G(V, E), a top-k nearest keyword (k-NK) query is a triple $Q = (q, \lambda, k)$, where $q \in V$ is a query node in G, $\lambda$ is a keyword, and k is a positive integer. Given a query Q, a node $v \in V$ is a keyword node w.r.t. Q if v contains keyword $\lambda$, i.e., $v \in V_\lambda$. The result is a set of k keyword nodes, denoted as $R = \{v_1, v_2, \dots, v_k\} \subseteq V_\lambda$, such that there does not exist a node $u \in V_\lambda \setminus R$ with $dist(q, u) < \max_{v \in R} dist(q, v)$. To further report the distances in the top-k result, we can use the form $R = \{v_1 : dist(q, v_1), v_2 : dist(q, v_2), \dots, v_k : dist(q, v_k)\}$.

In this paper, we aim at answering a k-NK query $Q = (q, \lambda, k)$ on a graph G. For simplicity, we assume that there is only one keyword in the query. We will later discuss how to answer a query containing multiple keywords with AND and OR semantics.

Example 1: Fig. 1 shows a graph G. Assume that the weight of each edge is 1. For a k-NK query $Q = (f, \lambda, 3)$, the keyword node set is $V_\lambda = \{b, c, k, n, t\}$. The result of Q is R = {b, n, k}, the three keyword nodes nearest to f.

3. EXISTING SOLUTIONS
A straightforward approach to answering a k-NK query $Q = (q, \lambda, k)$ on G is to use Dijkstra's algorithm to search from the query node q and output the k nearest keyword nodes in nondecreasing order of their distances to q. The time complexity is $O(|E| + |V| \log |V|)$. Dijkstra's algorithm is therefore inefficient when the graph is large or the keyword nodes are far away from q. In the literature, Bahmani and Goel [] and Tao et al. [] design different indexing schemes to process (top-k) nearest keyword queries on a graph or a tree. We introduce the two methods in the following two subsections.

3.1 Approximate k-NK on a Graph
Bahmani and Goel [] find an approximate answer to a k-NK query in a graph based on a distance oracle [0].

Distance Oracle: Distance oracle is a technique for estimating the distance of two nodes in a graph [0].
Given a graph G, a distance oracle is a Voronoi partition of V(G) determined by a set of randomly selected center nodes. More specifically, given a number $n_c$, we randomly select $n_c$ nodes from V(G) as the center nodes to construct a distance oracle O. Then the partition is constructed by assigning each node $v \in V(G)$ to its nearest center node, denoted as $wit_O(v)$, which is called the witness node of v w.r.t. O. If v is a center node, $wit_O(v) = v$. For each node $v \in V(G)$, the shortest distance from v to its witness node, i.e., $dist(v, wit_O(v))$, is precomputed. After constructing O, given two nodes u and v in G, if u and v are in the same partition in O, i.e., $wit_O(u) = wit_O(v)$, we compute the estimated distance, called witness distance, as $dist_O(u, v) = dist(u, wit_O(u)) + dist(v, wit_O(v))$. If u and v are not in the same partition in O, $dist_O(u, v) = +\infty$.

Figure 2: Two Distance Oracles $O_1$ and $O_2$

One distance oracle is usually not enough for distance estimation in a graph G. It cannot estimate the distance of two nodes in different partitions. Even for two nodes in the same partition, the estimation may have a large error. Therefore, a set of $r = p \cdot \log |V|$ distance oracles $\{O_1, O_2, \dots, O_r\}$ is constructed, where p can be considered as a constant. The construction proceeds in $\log |V|$ phases. In phase i ($0 \le i < \log |V|$), p distance oracles are constructed, each containing $2^i$ randomly selected center nodes. Given the r distance oracles, the distance of two nodes u and v in G can be estimated as an upper bound $\widetilde{dist}(u, v) = \min_{1 \le i \le r} dist_{O_i}(u, v)$. The time complexity to compute the estimated distance $\widetilde{dist}(u, v)$ for any two nodes u and v in a graph G is $O(\log |V|)$. The distance oracles consume $O(|V| \log |V|)$ space. Das Sarma et al. [0] prove that when $p = \Theta(|V|^{1/\log |V|})$, the estimated distance can be bounded by $dist(u, v) \le \widetilde{dist}(u, v) \le (2 \log |V| - 1) \cdot dist(u, v)$ with a high probability.

Example 2: Fig. 2 shows two distance oracles $O_1$ and $O_2$ for the graph shown in Fig. 1. There is one center node r in $O_1$, and four center nodes r, n, o and t in $O_2$. The distance of nodes j and s is estimated as $\widetilde{dist}(j, s) = \min\{dist_{O_1}(j, s), dist_{O_2}(j, s)\} = \min\{dist(j, r) + dist(s, r), dist(j, n) + dist(s, n)\}$.

Answering k-NK with Distance Oracle: Bahmani and Goel [] design a Partitioned Multi-Indexing (PMI) scheme which uses a set of distance oracles to answer a k-NK query in a graph. For each partition in a distance oracle $O_i$, an inverted list is constructed for each keyword in the partition. Specifically, for a partition with a center node c and a keyword $\lambda$, the inverted list contains all nodes in the partition that contain keyword $\lambda$, ranked in nondecreasing order of their distances to c.
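To make the construction concrete, the following Python sketch builds one distance oracle as a Voronoi partition using a multi-source Dijkstra search and then evaluates the witness distance. It is an illustrative sketch under stated assumptions, not the implementation of [0]: the adjacency-dict graph format, the function names `build_oracle`, `witness_distance` and `estimate_distance`, and the five-node path graph are all made up for this example.

```python
import heapq
import math


def build_oracle(graph, centers):
    """Assign every reachable node to its nearest center (its witness).

    graph: dict node -> dict neighbor -> positive edge weight (undirected).
    Returns (witness, dist) with dist[v] = dist(v, witness[v]).
    """
    witness, dist = {}, {}
    heap = [(0, c, c) for c in centers]  # (distance, node, center)
    heapq.heapify(heap)
    while heap:
        d, v, c = heapq.heappop(heap)
        if v in dist:  # already settled from a center that is at least as near
            continue
        dist[v], witness[v] = d, c
        for w, weight in graph[v].items():
            if w not in dist:
                heapq.heappush(heap, (d + weight, w, c))
    return witness, dist


def witness_distance(oracle, u, v):
    """dist_O(u, v): route through the shared witness, or +infinity."""
    witness, dist = oracle
    if u not in witness or v not in witness or witness[u] != witness[v]:
        return math.inf  # different partitions: this oracle gives no estimate
    return dist[u] + dist[v]


def estimate_distance(oracles, u, v):
    """Upper bound on dist(u, v): the smallest witness distance over all oracles."""
    return min(witness_distance(o, u, v) for o in oracles)


# Hypothetical path graph a - b - c - d - e with unit weights.
path = {
    "a": {"b": 1},
    "b": {"a": 1, "c": 1},
    "c": {"b": 1, "d": 1},
    "d": {"c": 1, "e": 1},
    "e": {"d": 1},
}
one_center = build_oracle(path, ["a"])
two_centers = build_oracle(path, ["a", "e"])
mid_center = build_oracle(path, ["c"])
```

On this toy graph the exact distance between b and d is 2. The oracle centered at a overestimates it as 1 + 3 = 4, the oracle with centers a and e places b and d in different partitions and returns no estimate, and only the oracle centered at c is tight, which is why several oracles are combined by taking the minimum.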
Given a k-NK query $Q = (q, \lambda, k)$ and a distance oracle $O_i$, the algorithm first finds the partition that q belongs to in $O_i$; let c be the center node of that partition. The result w.r.t. $O_i$ is the first k elements in the inverted list for $\lambda$ in the partition, denoted as $R_{O_i} = \{u_1 : dist(c, u_1) + dist(c, q), u_2 : dist(c, u_2) + dist(c, q), \dots, u_k : dist(c, u_k) + dist(c, q)\}$. The final result R is computed by merging the nodes in each $R_{O_i}$ and maintaining the k nodes with the shortest distances to q. The query time complexity is $O(k \cdot \log |V|)$. We illustrate the algorithm using the following example.

Example 3: Consider the graph in Fig. 1 and the two distance oracles in Fig. 2. For keyword $\lambda$, the inverted list for the partition centered at node r in $O_1$ has 5 elements {b : 1, n : 3, k : 4, c : 5, t : 6}. The inverted list for the partition centered at node o in $O_2$ has the single element k. Given a k-NK query $Q = (m, \lambda, 2)$, from $O_1$ we get the result $R_{O_1} = \{b : 1 + dist(r, m), n : 3 + dist(r, m)\} = \{b : 5, n : 7\}$, and from $O_2$ we get the result $R_{O_2} = \{k : dist(o, k) + dist(o, m)\}$. By merging $R_{O_1}$ and $R_{O_2}$, the final answer is R = {k, b}. The exact answer is R = {c, k} according to Fig. 1.

(Footnote: In [0], the set $\{O_1, O_2, \dots, O_r\}$ is defined as a distance oracle.)

Figure 3: A Tree T with Preorder and Interval on Each Node

Figure 4: $CT(\lambda)$, $ECT(\lambda)$ and $TVP(\lambda)$ for Keyword $\lambda$

Limitation: Although in theory the witness distance used by PMI [] can be bounded by a factor of $O(\log |V|)$ of the exact distance with a high probability, in practice we find that the distance estimation error can be quite large. For example, for the graph G in Fig. 1 and the two distance oracles $O_1$ and $O_2$ in Fig. 2, for two nodes s and v, the witness distance in $O_1$ is $dist_{O_1}(s, v) = dist(s, r) + dist(v, r) = 10$, and that in $O_2$ is $dist_{O_2}(s, v) = dist(s, n) + dist(v, n) = 6$.
However, the exact distance is dist(s, v) = 4 in G, which is smaller than both $dist_{O_1}(s, v)$ and $dist_{O_2}(s, v)$. The inaccurate distance estimation can greatly distort the ranking of the nodes carrying the query keyword, and thus lead to a low result quality, as illustrated in Example 3.

3.2 Exact 1-NK on a Tree
Tao et al. [] compute the exact answer to a 1-NK query on a tree T(V, E). Given a query $Q = (q, \lambda, 1)$, the result is the nearest node in T that contains keyword $\lambda$, denoted as $NN(q, \lambda)$. The basic idea is as follows. We label a node v with the sequence number of v in the preorder traversal of T. For a certain keyword $\lambda$, all nodes with the preorder label in the interval $[1, |V|]$ can be partitioned into several disjoint intervals, such that all nodes v in the same interval share an identical $NN(v, \lambda)$. The partition is called the tree Voronoi partition of $\lambda$, denoted as $TVP(\lambda)$. By precomputing $TVP(\lambda)$ for all keywords on the tree, a query $Q = (q, \lambda, 1)$ can be answered in $O(\log |V_\lambda|)$ time using a binary search in $TVP(\lambda)$. In order to compute $TVP(\lambda)$ for all keywords in T efficiently, two new data structures, namely, Compact Tree $CT(\lambda)$ and Extended Compact Tree $ECT(\lambda)$, are proposed in [].

DEFINITION 2. (Compact Tree and Extended Compact Tree) For a tree T and a keyword $\lambda$, a compact tree $CT(\lambda)$ is a tree that keeps only two types of nodes in T: a keyword node that contains keyword $\lambda$, and a node that has at least two direct subtrees containing nodes carrying keyword $\lambda$. In the preorder traversal of T, for two successive nodes u and v, if $NN(u, \lambda) \ne NN(v, \lambda)$, v is called a change node. An extended compact tree $ECT(\lambda)$ is a tree constructed by adding all change nodes into the compact tree $CT(\lambda)$.

Using $ECT(\lambda)$, $TVP(\lambda)$ can be constructed easily. In [], the authors prove that the total size of all compact trees and all extended compact trees for all keywords in the tree T(V, E) is bounded by $O(|doc(V)|)$. The time to compute all compact trees and all extended compact trees for all keywords in the tree T(V, E) is bounded by $O(|doc(V)| \cdot \log |V|)$.

Example 4: Fig. 3 shows a tree with the preorder labels from 1 to 20 on its nodes.
For keyword $\lambda$, there are 5 keyword nodes b, c, k, n, t. For node s, $NN(s, \lambda) = c$. The compact tree of $\lambda$, $CT(\lambda)$, is shown on the left part of Fig. 4. Node r is in $CT(\lambda)$ because r has three direct subtrees with nodes carrying keyword $\lambda$. Node e is not in $CT(\lambda)$ because e is not a keyword node and e has only one direct subtree, rooted at m, with nodes carrying keyword $\lambda$. The extended compact tree of $\lambda$, $ECT(\lambda)$, is shown in the middle part of Fig. 4 with the preorder label marked beside each node. Node e is in $ECT(\lambda)$, because for its parent node h, $NN(h, \lambda) = b \ne NN(e, \lambda) = c$. The tree Voronoi partition of $\lambda$, $TVP(\lambda)$, is shown on the right part of Fig. 4. Node s, with preorder label 15, is in the interval [11, 16], thus $NN(s, \lambda) = c$ as listed in $TVP(\lambda)$.

4. SOLUTION OVERVIEW
Answering k-NK on a Graph using Tree Distance: To address the drawback of witness distance, in this paper, we propose to use tree distance in processing a k-NK query. We observe that for a partition of a distance oracle, we can construct a shortest path tree rooted at the center node of the partition. Since a tree contains more structural information than a star, using tree distance is more accurate than using witness distance for estimating the distance of two nodes: the tree distance of two nodes in the same partition never exceeds their witness distance, which always routes through the center node. For a distance oracle $O_i$, let the set of trees constructed in $O_i$ be $T_i$. $T_i$ can be considered as a single tree by adding a virtual root and several virtual edges with weight $+\infty$ that connect the new virtual root to every root node in $T_i$ respectively. Let the k-NK result on tree T be $R_T$. Suppose we have an algorithm to compute $R_T$ on a tree T; we can then solve the k-NK problem in a graph by merging $R_{T_i}$ for each tree $T_i$, $1 \le i \le r$. Such a result is at least as accurate as the result by PMI []. The following example illustrates the k-NK query processing based on tree distance.

Example 5: For the distance oracles $O_1$ and $O_2$ shown in Fig. 2, the corresponding shortest path trees $T_1$ and $T_2$ are shown in Fig. 5. For $T_1$, there is only 1 tree rooted at r because there is only 1 partition in $O_1$. For $T_2$, there are 4 trees rooted at nodes n, o, r, t respectively, because there are 4 partitions in $O_2$. In each tree, the path from any node to the root node is a shortest path in the original graph. For two nodes s and v, their tree distance is 4 in both $T_1$ and $T_2$, the same as the exact distance dist(s, v) in G. For a k-NK query $Q = (m, \lambda, 2)$, $R_{T_1}$ contains c and t, and $R_{T_2}$ contains k. By merging $R_{T_1}$ and $R_{T_2}$, we get R = {c, k}, the same nodes as the exact answer. Such a result is much better than the result in Example 3 computed using witness distance for the same query.
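The gap between the two estimates can be illustrated with a few lines of Python. The sketch below assumes a shortest path tree stored as a parent map (a hypothetical representation chosen for brevity, not the index used in this paper), finds the lowest common ancestor by naive pointer climbing, and compares the resulting tree distance with the witness distance, which always routes through the root. The toy tree, its unit weights, and the function names are assumptions made for illustration.

```python
def depth(parent, v):
    """Weighted distance from v up to the root of its tree."""
    total = 0
    while parent[v] is not None:
        v, weight = parent[v]
        total += weight
    return total


def ancestors(parent, v):
    """Nodes on the path from v to the root, starting with v itself."""
    chain = [v]
    while parent[v] is not None:
        v = parent[v][0]
        chain.append(v)
    return chain


def lca(parent, u, v):
    """Lowest common ancestor by naive climbing (linear in the tree height)."""
    on_u_path = set(ancestors(parent, u))
    for x in ancestors(parent, v):
        if x in on_u_path:
            return x
    raise ValueError("nodes are in different trees")


def tree_distance(parent, u, v):
    """Length of the unique tree path between u and v."""
    a = lca(parent, u, v)
    return depth(parent, u) + depth(parent, v) - 2 * depth(parent, a)


def witness_distance(parent, u, v):
    """Estimate that always goes through the root (the partition center)."""
    return depth(parent, u) + depth(parent, v)


def knk_on_tree(parent, keyword_nodes, q, k):
    """Brute-force exact k-NK on one tree, for illustration only."""
    ranked = sorted((tree_distance(parent, q, v), v) for v in keyword_nodes)
    return [(v, d) for d, v in ranked[:k]]


# Hypothetical tree rooted at "r": node -> (parent, edge weight), root -> None.
toy_tree = {
    "r": None, "d": ("r", 1), "h": ("d", 1), "e": ("h", 1),
    "p": ("e", 1), "v": ("p", 1), "m": ("e", 1), "s": ("m", 1),
}
```

In this toy tree, s and v both hang below e, so their tree distance is 4 while their witness distance through the root is 10. The brute-force `knk_on_tree` is only a reference point; the algorithms in the following sections avoid scanning all keyword nodes.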
With the tree distance formulation, the key operation in answering a k-NK query on a graph is to answer the k-NK query on a tree. Therefore, we start with processing a k-NK query on a tree.

Answering k-NK on a Tree: We show that it is nontrivial to answer a k-NK query on a tree efficiently even if k is bounded. Our first attempt is to extend the existing 1-NK solution on a tree T(V, E) in []. Recall that in [], for a certain keyword $\lambda$, the range $[1, |V|]$ is partitioned into several disjoint intervals, and nodes with the preorder label in an identical interval share the same 1-NK result. When $k \ge 2$, each interval needs to be further partitioned to ensure that all nodes with the preorder label in the same interval share an identical k-NK result. The number of intervals increases exponentially w.r.t. the number of keyword nodes on the tree until it reaches $|V|$ for a keyword. Clearly, using such an approach, the index size is too large in practice even for a small k.

Our second attempt is that, for each node v on the tree T(V, E) and each keyword $\lambda$, we precompute its $\bar{k}$ nearest nodes that contain $\lambda$. When processing a query $Q = (q, \lambda, k)$ with $k \le \bar{k}$, we can simply retrieve the precomputed result on node q and output the first k nodes directly. Such an approach is impractical because for each keyword, we need $O(\bar{k} \cdot |V|)$ space to store the precomputed results.

In the following, we first introduce two algorithms for answering exact k-NK on a tree T(V, E). Our first algorithm tree-boundk can only handle bounded k values, with query processing time $O(k + \log |V_\lambda|)$ and index size $O(\bar{k} \cdot |doc(V)|)$ for all keywords, where $\bar{k}$ is an upper bound of k. Our second algorithm tree-k can handle an arbitrary k with query processing time $O(k \cdot \log |V_\lambda|)$ and index size $O(|doc(V)| \cdot \log |V|)$ for all keywords, which is independent of k. We then show our algorithm for approximate k-NK on a graph by merging results on a bounded number of trees. We propose a global storage technique to further reduce the index size and the query time on a graph. Finally we show how to extend our method to handle a query with multiple keywords.

Figure 5: Shortest Path Trees $T_1$ and $T_2$

Algorithm 1: tree-boundk(Q, T)
  Input: A k-NK query $Q = (q, \lambda, k)$, and a tree T.
  Output: Answer for Q on T.
  1  $R \leftarrow \emptyset$;
  2  $(u, u') \leftarrow$ the entry edge of q on $CT(\lambda)$;
  3  $R \leftarrow R \otimes_k (cand_\lambda(u) \oplus dist(q, u))$;
  4  $R \leftarrow R \otimes_k (cand_\lambda(u') \oplus dist(q, u'))$;
  5  return R;

5. K-NK ON A TREE FOR A SMALL K
In this section, we study how to answer a k-NK query $Q = (q, \lambda, k)$ on a tree T(V, E). We first consider a common scenario when users are interested in a small number of answer nodes bounded by a small constant $\bar{k}$, i.e., $k \le \bar{k}$. Recall that for a keyword $\lambda$, its compact tree $CT(\lambda)$ keeps all the structural information of $\lambda$ on the tree T. Our idea is to precompute the top-$\bar{k}$ results for every keyword $\lambda$ and every node on $CT(\lambda)$. Since the total size of all compact trees is bounded by $O(|doc(V)|)$, the total space to store the top-$\bar{k}$ results of nodes on all compact trees is bounded by $O(\bar{k} \cdot |doc(V)|)$. Given a query $Q = (q, \lambda, k)$, if q is on $CT(\lambda)$, we can simply report the precomputed answer on $CT(\lambda)$. If q is not on $CT(\lambda)$, we need to find a way to construct the answer using the precomputed results as well as the structure of $CT(\lambda)$ and T. In the following, we first introduce how to answer a k-NK query using $CT(\lambda)$, followed by discussions on the construction of the index.

5.1 Query Processing
For a keyword $\lambda$ and each node v in the compact tree $CT(\lambda)$, we use a candidate list $cand_\lambda(v)$ to denote the precomputed k-NK results for $k = \bar{k}$ on node v, ranked in nondecreasing order of their distances to v, in the form of $cand_\lambda(v) = \{v_1 : dist(v, v_1), v_2 : dist(v, v_2), \dots, v_{\bar{k}} : dist(v, v_{\bar{k}})\}$ where $dist(v, v_1) \le dist(v, v_2) \le \dots \le dist(v, v_{\bar{k}})$. Given a query $Q = (q, \lambda, k)$ on a tree T(V, E) where $k \le \bar{k}$, if q is in $CT(\lambda)$, we can simply report the first k elements in $cand_\lambda(q)$ as the answer. The difficult case is when q is not in $CT(\lambda)$.
In order to answer such a query, we define an entry edge to be the edge in $CT(\lambda)$ that is nearest to q. Intuitively, the entry edge plays the role of connecting the query node q to the compact tree $CT(\lambda)$. The formal definition of entry edge is as follows.

DEFINITION 3. (Entry Node and Entry Edge) Given a compact tree $CT(\lambda)$, for each edge $(u, u')$ on $CT(\lambda)$ with $u'$ being a child node of u, $(u, u')$ represents a unique path from u to $u'$ on the original tree T. For any node v on T, we say v sticks to $CT(\lambda)$, denoted as $v \in_s CT(\lambda)$, if and only if there exists an edge $(u, u')$ on $CT(\lambda)$ such that v is on the path from u to $u'$ on T; otherwise v does not stick to $CT(\lambda)$, denoted as $v \notin_s CT(\lambda)$. For a node q on T, let v be the first node on the path from q to the root node of T such that $v \in_s CT(\lambda)$. v is called the Entry Node of q w.r.t. $\lambda$, denoted as $EN_\lambda(q)$. The corresponding edge $(u, u')$ on $CT(\lambda)$ is called the Entry Edge of q w.r.t. $\lambda$, denoted as $EE_\lambda(q)$.

Note that for a node q and a keyword $\lambda$, $EE_\lambda(q)$ is an edge on the compact tree $CT(\lambda)$, and $EN_\lambda(q)$ is a node on the original tree T. We use an example to illustrate the entry node and entry edge.

Algorithm 2: operator $R \oplus \delta$
  Input: Candidate list $R = \{u_1 : d_{u_1}, u_2 : d_{u_2}, \dots\}$, distance $\delta$.
  Output: A candidate list obtained by adding $\delta$ to all distances in R.
  1  $R' \leftarrow \emptyset$;
  2  for i = 1 to $|R|$ do
  3      $R' \leftarrow R' \cup \{u_i : d_{u_i} + \delta\}$;
  4  return $R'$;

Example 6: For the tree T shown in Fig. 3 and keyword $\lambda$, the compact tree $CT(\lambda)$ is shown on the left part of Fig. 4. For ease of illustration, we also mark the nodes in $CT(\lambda)$ dark on the tree T in Fig. 3. For edge (r, c) in $CT(\lambda)$, $h \in_s CT(\lambda)$ because h is on the path from r to c in T. $p \notin_s CT(\lambda)$ since p is not on the tree path of any $CT(\lambda)$ edge. For node v, its entry node is $EN_\lambda(v) = e$, as e is the first node on the path (v, p, e, h, d, r) such that $e \in_s CT(\lambda)$. The entry edge for v is $EE_\lambda(v) = (r, c)$ since the entry node e for v is on the path from r to c in T. The entry nodes and entry edges for some other nodes in T are listed in the following table.

  Node             g        j        d        e        p        u
  $EN_\lambda$     g        j        d        e        e        b
  $EE_\lambda$     (r, a)   (a, k)   (r, c)   (r, c)   (r, c)   (r, b)

The Algorithm: Given a tree T(V, E), for keyword $\lambda$, all keyword nodes are contained in $CT(\lambda)$. For any node $q \in V$, the path from q to any keyword node will go through the entry node $EN_\lambda(q)$. Based on this property, the query $Q = (q, \lambda, k)$ returns the same nodes as the query $Q' = (EN_\lambda(q), \lambda, k)$. However, $EN_\lambda(q)$ may not be on $CT(\lambda)$, thus the result of $Q'$ is not necessarily precomputed. Let $(u, u') = EE_\lambda(q)$. Since $EN_\lambda(q)$ is on the path from u to $u'$ on the tree T, the path from $EN_\lambda(q)$ to any keyword node in T will go through either u or $u'$. Thus, the answer for Q can be constructed by merging the precomputed candidate lists $cand_\lambda(u)$ and $cand_\lambda(u')$ on $CT(\lambda)$.

Our algorithm for processing a query $Q = (q, \lambda, k)$ on a tree T is shown in Algorithm 1. We assume that the compact tree $CT(\lambda)$ for each keyword $\lambda$ and the list $cand_\lambda(u)$ for every node u on $CT(\lambda)$ have been computed.
After initializing the result R in line 1, we find the entry edge $(u, u')$ for q on $CT(\lambda)$ (line 2). We add the distance dist(q, u) to every node in $cand_\lambda(u)$ using the operator $\oplus$, to reflect the distance from q to a keyword node via u. We then merge the new result into R using the $\otimes_k$ operator (line 3). Similarly we apply the two operators to $cand_\lambda(u')$ with the distance $dist(q, u')$ (line 4). We will describe the operators $\oplus$ and $\otimes_k$ later. We use the following example to illustrate the algorithm.

Example 7: Given the tree T shown in Fig. 3 and $CT(\lambda)$ on the left part of Fig. 4, for a query $Q = (o, \lambda, 2)$, the entry edge is $EE_\lambda(o) = (r, c)$. Suppose the lists $cand_\lambda(r) = \{b : 1, n : 3\}$ and $cand_\lambda(c) = \{c : 0, t : 1\}$ are precomputed. By adding dist(o, r) = 5 to $cand_\lambda(r)$, and adding dist(o, c) = 2 to $cand_\lambda(c)$, we get the new lists {b : 6, n : 8} for r and {c : 2, t : 3} for c. We merge the two lists and get the final result R = {c : 2, t : 3}.

The efficiency of Algorithm 1 depends on three operations. The first operation is to find the entry edge for any node on T (line 2). The second operation is to calculate the distance of any two nodes on T, e.g., dist(q, u) and $dist(q, u')$ (lines 3-4). The third operation is to merge two sorted lists into a new one using the operators $\oplus$ and $\otimes_k$ (lines 3-4). Next, we discuss the three operations separately.

Algorithm 3: operator $R_1 \otimes_k R_2$
  Input: Two sorted candidate lists $R_1 = \{u_1 : d_{u_1}, u_2 : d_{u_2}, \dots\}$ and $R_2 = \{v_1 : d_{v_1}, v_2 : d_{v_2}, \dots\}$, and result size k.
  Output: The merged candidate list.
  1  $R \leftarrow \emptyset$; $i \leftarrow 1$; $j \leftarrow 1$;
  2  while ($i \le |R_1|$ or $j \le |R_2|$) and $|R| < k$ do
  3      if $i \le |R_1|$ and ($j > |R_2|$ or $d_{u_i} \le d_{v_j}$) then
  4          if $u_i \notin R$ then $R \leftarrow R \cup \{u_i : d_{u_i}\}$;
  5          $i \leftarrow i + 1$;
  6      else if $j \le |R_2|$ and ($i > |R_1|$ or $d_{v_j} \le d_{u_i}$) then
  7          if $v_j \notin R$ then $R \leftarrow R \cup \{v_j : d_{v_j}\}$;
  8          $j \leftarrow j + 1$;
  9  return R;

Finding the Entry Edge: Given a keyword $\lambda$, for any node v on a tree T(V, E), our idea of finding the entry edge $EE_\lambda(v)$ of v is similar to the idea of finding the 1-NK answer using the tree Voronoi partition $TVP(\lambda)$ in []. For the range $[1, |V|]$, we partition it into several disjoint intervals, such that nodes with the preorder label in the same interval share an identical entry edge.
We call such a partition an entry edge partition for $\lambda$, denoted as $EEP(\lambda)$. Given $EEP(\lambda)$, $EE_\lambda(v)$ can be computed easily using a binary search in $EEP(\lambda)$ in $O(\log |V_\lambda|)$ time. In the next subsection, we show how to build $EEP(\lambda)$ for all keywords efficiently and prove that the total size of $EEP(\lambda)$ for all keywords in T is bounded by $O(|doc(V)|)$.

Computing Tree Distance: Given a tree T(V, E) with root r, suppose the distance from r to every node in T has been precomputed. For any two nodes u and v on T, we denote LCA(u, v) as their lowest common ancestor. The distance of u and v can be computed as $dist(u, v) = dist(r, u) + dist(r, v) - 2 \cdot dist(r, LCA(u, v))$. Using the techniques in [], LCA(u, v) can be found in O(1) time using $O(|V|)$ index space. Thus dist(u, v) for any two nodes u and v on T can be computed in O(1) time using $O(|V|)$ index space.

Merging Results: The results are merged using two operators $\oplus$ and $\otimes_k$. Algorithm 2 shows the operator $\oplus$, which takes a candidate list R and a distance $\delta$ as input, and outputs a candidate list by adding $\delta$ to all distances in R. The time complexity for the $\oplus$ operator is $O(|R|)$. Algorithm 3 shows the operator $\otimes_k$, which takes two candidate lists $R_1$ and $R_2$ sorted in nondecreasing order of the distances, and a value k as input, and outputs the merged candidate list R. R contains at most k elements sorted in nondecreasing order of the distances. R can be constructed by visiting each element in $R_1$ and $R_2$ at most once. The time complexity for the $\otimes_k$ operator is $O(\min\{|R_1| + |R_2|, k\})$. The $\otimes_k$ and $\oplus$ operators satisfy the commutative, associative and distributive laws as follows.

(Commutative Law) $R_1 \otimes_k R_2 = R_2 \otimes_k R_1$.
(Associative Law) $(R_1 \otimes_k R_2) \otimes_k R_3 = R_1 \otimes_k (R_2 \otimes_k R_3)$.
(Distributive Law) $(R_1 \otimes_k R_2) \oplus d = (R_1 \oplus d) \otimes_k (R_2 \oplus d)$.

THEOREM 1. Algorithm 1 computes the exact k-NK answer for a query $Q = (q, \lambda, k)$ on a tree T(V, E) in $O(k + \log |V_\lambda|)$ time.

Algorithm 1 uses the novel idea of entry edge, and elegantly extends the 1-NK method [] to handle k-NK (k > 1) with the same query time complexity, except for an extra linear cost O(k) indispensable for reporting the results.
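The two operators and the query procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: candidate lists are plain lists of `(node, distance)` pairs, the names `shift`, `merge_k` and `tree_bound_k` are hypothetical, and the entry edge and the distances from q to its two endpoints are passed in directly instead of being obtained from the entry edge partition and the LCA index. The sample data are in the spirit of Example 7.

```python
def shift(cands, delta):
    """Operator ⊕: add delta to every distance in a candidate list."""
    return [(node, d + delta) for node, d in cands]


def merge_k(r1, r2, k):
    """Operator ⊗_k: merge two distance-sorted lists, keeping k distinct nodes."""
    merged, seen = [], set()
    i = j = 0
    while (i < len(r1) or j < len(r2)) and len(merged) < k:
        # Take from r1 when r2 is exhausted or r1's head is at least as near.
        if j >= len(r2) or (i < len(r1) and r1[i][1] <= r2[j][1]):
            node, d = r1[i]
            i += 1
        else:
            node, d = r2[j]
            j += 1
        if node not in seen:  # a node may occur in both lists
            seen.add(node)
            merged.append((node, d))
    return merged


def tree_bound_k(cand, entry_edge, dist_from_q, k):
    """Combine the candidate lists stored at both ends of the entry edge.

    cand: compact-tree node -> precomputed sorted candidate list.
    dist_from_q: compact-tree node -> tree distance from the query node q.
    A missing endpoint (e.g. a virtual root) contributes an empty list.
    """
    result = []
    for end in entry_edge:
        shifted = shift(cand.get(end, []), dist_from_q.get(end, 0))
        result = merge_k(result, shifted, k)
    return result


cand = {"r": [("b", 1), ("n", 3)], "c": [("c", 0), ("t", 1)]}
answer = tree_bound_k(cand, ("r", "c"), {"r": 5, "c": 2}, k=2)
```

Here `answer` holds c at distance 2 followed by t at distance 3. Skipping nodes that were already emitted keeps the first, and therefore smallest, distance for a node that is reachable through both endpoints of the entry edge.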
Given the tree T, for every keyword $\lambda$, besides the compact tree $CT(\lambda)$, two more indexes are needed. The first index, the entry edge partition $EEP(\lambda)$, is used to find the entry edge for any node on T. The second index is the candidate list $cand_\lambda(v)$ for every node on $CT(\lambda)$. Below we show how to construct the two indexes.

5.2 Construction of Entry Edge Partition
Given a tree T(V, E), for each keyword $\lambda$, sharing a similar idea with the tree Voronoi partition $TVP(\lambda)$, we construct an entry edge partition $EEP(\lambda)$, which divides $[1, |V|]$ into several disjoint intervals, such that nodes in V with preorder in the same interval share an identical entry edge on $CT(\lambda)$. In order to construct the entry edge partition, for each edge $(u, u')$ on $CT(\lambda)$, we label $(u, u')$ with an interval according to the following definition.

Algorithm 4: EEP-construct(T, $CT(\lambda)$)
  Input: A tree T(V, E) and a labeled compact tree $CT(\lambda)$.
  Output: Entry edge partition $EEP(\lambda)$.
  1  $r \leftarrow$ the original root of $CT(\lambda)$;
  2  $EEP(\lambda) \leftarrow \emptyset$;
  3  partition($EEP(\lambda)$, $[1, |V|]$, $(\phi, r)$, $CT(\lambda)$);
  4  return $EEP(\lambda)$;
  5  Procedure partition($EEP(\lambda)$, interval $[s, t]$, edge $(u, u')$, $CT(\lambda)$)
  6  foreach child node $u''$ of $u'$ on $CT(\lambda)$ in increasing preorder do
  7      $[s', t'] \leftarrow$ interval of $(u', u'')$;
  8      if $s < s'$ then add $([s, s' - 1], (u, u'))$ to $EEP(\lambda)$;
  9      partition($EEP(\lambda)$, $[s', t']$, $(u', u'')$, $CT(\lambda)$);
  10     $s \leftarrow t' + 1$;
  11 if $s \le t$ then add $([s, t], (u, u'))$ to $EEP(\lambda)$;

DEFINITION 4. (Labeled Compact Tree) Given a tree T, a node v on T has an interval $[s_v, t_v]$ where $s_v$ is the preorder label of v on T and $t_v$ is the maximum preorder label for all nodes in the subtree rooted at v. Given a compact tree $CT(\lambda)$, for any edge $(u, u')$ on $CT(\lambda)$, let the branching node of $(u, u')$ be the first node after u along the path from u to $u'$ on T, and denote it as $u_b$. We label edge $(u, u')$ with the interval of $u_b$.

The label of every edge on a compact tree $CT(\lambda)$ can be computed easily when constructing $CT(\lambda)$. Given any node v on a tree T and an edge $(u, u')$ on a compact tree $CT(\lambda)$, denote the branching node of $(u, u')$ as $u_b$; then v is in the subtree rooted at $u_b$ if and only if the preorder label of v on T is in the interval of $u_b$, which is identical with the label of edge $(u, u')$. For ease of presentation, for each labeled compact tree $CT(\lambda)$, we add a virtual root $\phi$ and an edge from $\phi$ to the original root of $CT(\lambda)$. We use the following example to illustrate the labeled compact tree.

Example 8: For the tree T shown in Fig. 3, we mark the preorder and the interval of each node on the tree. For the node h, its interval is [10, 18] because the preorder of h on T is 10 and the maximum preorder for all nodes on the subtree rooted at h is 18.
The labeled compact tree CT(λ) for keyword λ is shown on the left part of Fig. 6. For the edge (r, c) on CT(λ), its branching node is d because d is the first node after r along the path (r, d, h, e, m, c) on T. The label of edge (r, c) is the interval of node d, which is [9, 18].

For a compact tree CT(λ) of tree T and a keyword λ, suppose (u, u′) on CT(λ) is the entry edge of a node v on tree T, i.e., EE_λ(v) = (u, u′). The preorder of v is in the interval of (u, u′), because the interval of (u, u′) contains all nodes in the subtree rooted at the branching node of (u, u′). Based on this observation, by excluding the intervals of all edges under the subtree rooted at u′ in CT(λ) from the interval of (u, u′), nodes with preorder in the remaining intervals use (u, u′) as the entry edge. For example, in the compact tree CT(λ) shown in Fig. 6, the edge (φ, r) has an interval [1, 20]. r has three branches with intervals [2, 6], [9, 18] and [19, 20] respectively. By excluding the three intervals from [1, 20], two intervals [1, 1] and [7, 8] are left. Thus nodes with preorder in either of the two intervals [1, 1] and [7, 8] share the same entry edge (φ, r). For edge (r, c) with interval [9, 18], by excluding interval [17, 17] of the only branch of c, nodes with preorder in either of the two intervals [9, 16] and [18, 18] share the same entry edge (r, c).

[Figure 6: Labeled Compact Tree and Entry Edge Partition. Left: the labeled compact tree CT(λ). Right: the entry edge partition as a table of (Interval, EntryEdge) pairs.]

Algorithm 4 shows the construction of the entry edge partition EEP(λ) on CT(λ) for a keyword λ. After initializing EEP(λ) (line 2), the main operation is a recursive procedure partition (line 3), which partitions the interval [1, |V|] into several disjoint intervals. Each entry in EEP(λ) is in the form of ([s, t], (u, u′)), denoting that nodes with the preorder label in the interval [s, t] share the same entry edge (u, u′).
For an edge (u, u′) with interval [s, t], the procedure processes every child node u″ of u′ on CT(λ) in increasing preorder of u″ (line 6). For each edge (u′, u″) with interval [s′, t′], the interval [s, t] is partitioned into three parts: [s, s′ − 1], [s′, t′] and [t′ + 1, t]. The first part is added to EEP(λ) with the entry edge (u, u′) if it is not empty (line 8). The second part is processed recursively for edge (u′, u″) (line 9), and the third part is left to be further partitioned by the other child nodes of u′ by simply setting s to t′ + 1 (line 10). After processing all child nodes of u′, if [s, t] is still not empty, we add [s, t] to EEP(λ) with the entry edge (u, u′) (line 11).

The time complexity of Algorithm 4 is O(|V(CT(λ))|), since every node on CT(λ) is visited once. For each edge (u′, u″) on CT(λ), at most two intervals are added into EEP(λ): one is added before invoking partition for edge (u′, u″) (line 8) and the other is added at the end of partition for (u′, u″) (line 11). Thus the total number of intervals in EEP(λ) is no more than 2 · |V(CT(λ))|.

Example 9: For the labeled compact tree CT(λ) shown in Fig. 6, when invoking partition(EEP(λ), [1, 20], (φ, r), CT(λ)), we process the three child nodes a, c, b of r in order. We first process edge (r, a) with interval [2, 6], which divides the interval [1, 20] into three parts: [1, 1], [2, 6], and [7, 20]. [1, 1] is added into EEP(λ) with the entry edge (φ, r). [2, 6] is processed recursively by invoking partition(EEP(λ), [2, 6], (r, a), CT(λ)), and [7, 20] is processed by the other two child nodes c and b similarly. EEP(λ) is shown on the right part of Fig. 6.

THEOREM 2. For a tree T(V, E) with the compact trees for all keywords constructed, the entry edge partition EEP(λ) for all keywords can be constructed in O(|doc(V)|) time and stored in O(|doc(V)|) space.

Construction of Candidate List

Given a compact tree CT(λ) for a tree T and a keyword λ, we need to compute the candidate list cand_λ(v) for every node v on CT(λ). Since CT(λ) keeps the structural information of all keyword nodes in T, it is sufficient to search only on CT(λ) to calculate cand_λ(v).
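The recursive partition procedure for the entry edge partition described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the names `build_eep`, `entry_edge`, `children` and `label` are hypothetical, the compact tree is given as a child map plus edge interval labels, and the small tree used at the bottom is made up for illustration (it is not the tree of Fig. 6).

```python
from bisect import bisect_right

def build_eep(children, label, root, n):
    """Entry edge partition of [1, n] for one keyword (sketch).

    children: compact-tree node -> list of its child nodes (assumed input)
    label:    compact-tree edge (parent, child) -> (s, t), the preorder
              interval of the edge's branching node on the original tree
    """
    eep = []

    def partition(s, t, edge):
        _, lower = edge
        # Visit the child edges in increasing preorder of their intervals.
        kids = sorted(children.get(lower, []), key=lambda c: label[(lower, c)][0])
        for c in kids:
            s2, t2 = label[(lower, c)]
            if s < s2:                       # gap before the child interval
                eep.append(((s, s2 - 1), edge))
            partition(s2, t2, (lower, c))    # child interval handled recursively
            s = t2 + 1                       # remainder left for later children
        if s <= t:                           # tail after the last child
            eep.append(((s, t), edge))

    partition(1, n, ("phi", root))           # "phi" plays the virtual root
    eep.sort(key=lambda entry: entry[0])
    return eep

def entry_edge(eep, preorder):
    """Look up the entry edge of a node by binary search on its preorder."""
    starts = [interval[0] for interval, _ in eep]
    return eep[bisect_right(starts, preorder) - 1][1]

# Hypothetical compact tree over a 10-node tree: r -> a -> k and r -> b.
children = {"r": ["a", "b"], "a": ["k"]}
label = {("r", "a"): (2, 5), ("a", "k"): (3, 4), ("r", "b"): (8, 9)}
eep = build_eep(children, label, "r", 10)
```

Because the intervals are disjoint and sorted, a lookup such as `entry_edge(eep, 7)` costs one binary search over the stored intervals.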
A simple solution is to compute each cand_λ(v) separately on CT(λ). This approach may take O(|V(CT(λ))|) time to calculate cand_λ(v) for a node v, thus O(|V(CT(λ))|²) time to compute all candidate lists in CT(λ) for one keyword, which is too slow. In order to save the computational cost, we design a novel method to update the candidate list of a node using those of its nearby nodes on the tree CT(λ). Note that in CT(λ), the path between two nodes u, v is unique: from node u to the lowest common ancestor of u and v, LCA(u, v), and then from LCA(u, v) to v. Based on this observation, we can follow the path to propagate the candidate list on u to v. Using this idea, we just need to traverse the tree CT(λ) twice to build the candidate lists for all nodes on CT(λ). The first traversal on CT(λ) is a bottom-up one, such that the candidate list on each node is propagated to all its ancestors on CT(λ). The second traversal on CT(λ) is a top-down one, such that the candidate list on each node is further propagated to all its descendants.
Algorithm 5: cand-construct(T, CT(λ), k̄)
Input: A tree T, a compact tree CT(λ), and the upper bound of k, k̄.
Output: cand_λ(v) for each v on CT(λ).
1  cand_λ(v) ← ∅ for each node v on CT(λ);
2  cand_λ(v) ← {v : 0} for each node v on CT(λ) that contains λ;
3  foreach v on CT(λ) in a bottom-up fashion do
4      u ← the parent node of v on CT(λ);
5      cand_λ(u) ← cand_λ(u) ⊕_k̄ (cand_λ(v) ⊗ dist(u, v));
6  foreach v on CT(λ) in a top-down fashion do
7      u ← the parent node of v on CT(λ);
8      cand_λ(v) ← cand_λ(v) ⊕_k̄ (cand_λ(u) ⊗ dist(u, v));

[Figure 7: Constructing Candidate Lists. Left: candidate lists after the bottom-up phase. Right: candidate lists after the top-down phase.]

Algorithm 5 shows the construction of the candidate lists on CT(λ). We first initialize the candidate list of each keyword node to be the node itself and the candidate list of each non-keyword node to be ∅ (lines 1-2). We then traverse CT(λ) in a bottom-up fashion, e.g., using postorder traversal. For each node v traversed, we merge cand_λ(v) into the list of its parent node u after adding the distance dist(u, v) to the elements of cand_λ(v) (lines 3-5). At last, we traverse CT(λ) in a top-down fashion, e.g., using preorder traversal. For each node v traversed, we merge the list of v's parent node u, cand_λ(u), into that of v after adding the distance dist(u, v) to the elements of cand_λ(u) (lines 6-8). Since the ⊕_k̄ operator takes O(k̄) time, the time complexity of Algorithm 5 is O(k̄ · |V(CT(λ))|), using O(k̄ · |V(CT(λ))|) space.

Example 10: Fig. 7 shows the candidate lists after the bottom-up phase and the top-down phase for the compact tree CT(λ) shown on the left part of Fig. . Initially, the candidate list of t is {t : 0} and the candidate list of c is {c : 0}. Since c is the parent node of t, in the bottom-up phase the list of t is propagated and merged into that of c by adding the distance dist(c, t), thus cand_λ(c) = {c : 0, t : dist(c, t)} after the bottom-up phase. In the top-down phase, the list of c is propagated and merged into that of t, thus cand_λ(t) = {t : 0, c : dist(c, t)} after the top-down phase.
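The two traversals can be made concrete with a short Python sketch. It is an illustration under assumptions, not the authors' implementation: the helper names `merge_k`, `shift` and `cand_construct` are hypothetical, candidate lists are plain dictionaries from node to distance, and the toy compact tree and its edge weights are invented for the example.

```python
import heapq

def merge_k(a, b, k):
    """Merge two candidate lists, keep the smaller distance per node,
    and truncate to the k nearest entries."""
    out = dict(a)
    for node, d in b.items():
        if node not in out or d < out[node]:
            out[node] = d
    return dict(heapq.nsmallest(k, out.items(), key=lambda x: (x[1], x[0])))

def shift(cand, d):
    """Add the distance d to every entry of a candidate list."""
    return {node: dist + d for node, dist in cand.items()}

def cand_construct(parent, weight, keyword_nodes, preorder, k):
    """parent/weight: child -> parent and child -> edge length on the
    compact tree; preorder: nodes listed with parents before children."""
    cand = {v: ({v: 0} if v in keyword_nodes else {}) for v in preorder}
    for v in reversed(preorder):             # bottom-up: children first
        u = parent.get(v)
        if u is not None:
            cand[u] = merge_k(cand[u], shift(cand[v], weight[v]), k)
    for v in preorder:                       # top-down: parents first
        u = parent.get(v)
        if u is not None:
            cand[v] = merge_k(cand[v], shift(cand[u], weight[v]), k)
    return cand

# Hypothetical compact tree: r - a (length 1), r - c (length 2), c - t (length 1).
parent = {"a": "r", "c": "r", "t": "c"}
weight = {"a": 1, "c": 2, "t": 1}
cand = cand_construct(parent, weight, {"a", "c", "t"}, ["r", "a", "c", "t"], k=2)
```

Reversing a preorder listing is one simple way to obtain a bottom-up order; a postorder traversal, as mentioned in the text, would serve equally well.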
THEOREM 3. Given a tree T, an upper bound k̄ of k, and CT(λ) for all keywords, the candidate lists cand_λ(v) for all keywords λ and all nodes v on CT(λ) can be constructed in O(k̄ · |doc(V)|) time and stored in O(k̄ · |doc(V)|) space.

6. K-NK ON A TREE FOR A LARGE K

The algorithm above can only process a k-NK query Q = (q, λ, k) with a bounded k, i.e., k ≤ k̄, on a tree T. If k can be arbitrarily large, the index size cannot be bounded. In this section, we remove the restriction on k and introduce an algorithm that handles a k-NK query for an arbitrary k, with an index size independent of k.

6.1 A Basic Pivot Approach

Recall that for a node u that contains keyword λ and an arbitrary node v in a tree T, the path from v to u is unique on T, and can be divided into two segments: the first segment is from v to their lowest common ancestor LCA(u, v), and the second segment is from LCA(u, v) to u. Our basic idea is to compute the first segment online and precompute the results regarding the second segment offline. Thus, in the precomputing phase, instead of propagating a keyword node u to all nodes in T to update their candidate lists, we just need to propagate u to its ancestors in T. In the query processing phase, we do not search the whole tree to get the answer for a query; instead, we just need to merge the precomputed candidates along the path from the query node to the root node of the tree T. Using this method, the size of the index that keeps the candidate nodes can be largely reduced at the expense of a longer query time.

[Figure 8: Basic Pivot Approach. The tree T with the candidate list cand_λ(v) marked beside each node.]

We use depth(T) to denote the depth of tree T, and depth(u, T) to denote the depth of node u on tree T. For any two nodes u and v on T, u is a pivot of v if and only if u is an ancestor of v on T. For each node v, we denote the set of pivots of v on T as PV(v, T). We have |PV(v, T)| = depth(v, T).
Given a keyword λ, for each node u on tree T, we use the candidate list cand_λ(u) to denote the set of nodes that contain keyword λ in the subtree rooted at u on tree T, sorted in nondecreasing order of their distances to u. The candidate list is in the form of cand_λ(u) = {u_1 : dist_T(u, u_1), u_2 : dist_T(u, u_2), ...}, where dist_T(u, u_i) ≤ dist_T(u, u_{i+1}). In order to handle an arbitrary k, the size of cand_λ(u) is not bounded by any predefined k̄. Clearly, a node v ∈ cand_λ(u) if and only if v contains keyword λ and u ∈ PV(v, T). In other words, a keyword node v only appears in the candidate lists of its pivots. As a result, for any keyword λ, the total size of all candidate lists for λ is Σ_{v ∈ V_λ} |PV(v, T)| = Σ_{v ∈ V_λ} depth(v, T). We use the following example to illustrate the pivot based approach.

Example 11: Fig. 8 shows a tree T with depth(T) = 6. For keyword λ, the nodes that contain λ are marked with bold circles. For every node v, we create a candidate list cand_λ(v) that contains all keyword nodes in its subtree, sorted in nondecreasing order of their distances to v. For example, cand_λ(g) contains the two keyword nodes n and k in the subtree rooted at g, each with its distance to g. For node p, PV(p, T) = {r, d, h, e}. For a k-NK query Q = (d, λ, 3), the path from d to the root contains two nodes, d and r. We merge the lists cand_λ(d) and cand_λ(r) after adding the distance dist(r, d) to all elements in cand_λ(r). The final answer for Q consists of the nodes b, c and n.

6.2 Pivot Approach with Tree Balancing

The problem is not perfectly solved by the basic pivot approach above, for two reasons. First, in the precomputing phase, the index size for each keyword λ is Σ_{v ∈ V_λ} depth(v, T), which can be large if depth(v, T) is large. Second, when processing a query Q = (q, λ, k), we need to traverse all nodes from the query node q to the root of T, which is also costly if depth(q, T) is large. Thus the key to optimizing both index space and query time is to reduce the average depth of nodes on the tree. A simple solution is to rotate the tree T to find a proper root such that the average depth of nodes is minimized.
However, such an approach cannot essentially solve the problem, as illustrated by the following example. Let T(V, E) be a chain of 2n + 1 nodes where every node contains keyword λ. The best choice is to select the middle node of the chain as the root, which minimizes the average depth of nodes. The total index size is then Σ_{v ∈ V_λ} depth(v, T) = Σ_{v ∈ V(T)} depth(v, T) = n(n + 1), which is O(n²). Furthermore, we need to traverse n nodes to answer a query when the query node q is at one end of the chain, leading to O(n) query time. This example shows that both the index space and the query processing can still be very costly, even if we rotate the tree.
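As a quick sanity check of the chain example, the following Python snippet sums the node depths of a chain rooted at its middle node. It assumes the chain has 2n + 1 nodes, so that the middle node has exactly n nodes on each side, and that the root has depth 0; the function name `chain_index_size` is hypothetical.

```python
def chain_index_size(n):
    """Sum of node depths in a chain of 2n + 1 nodes rooted at its middle
    node; positions -n..n have depth |position| (the root has depth 0)."""
    return sum(abs(position) for position in range(-n, n + 1))

# The sum is 2 * (1 + 2 + ... + n), i.e. quadratic in n.
sizes = [chain_index_size(n) for n in (1, 2, 10, 100)]
```

Under these assumptions the total equals n(n + 1), which grows quadratically even though the root was chosen as well as possible.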
[Figure 9: Distance Preserving Balanced Tree. The original tree T, the balanced tree DT(T), and a table of the pivot sets PV(v, DT(T)) with the corresponding distances: PV(a) = PV(d) = PV(e) = {b, f}, PV(c) = PV(h) = {b, g}, PV(f) = PV(g) = {b}.]

In order to reduce the average depth of nodes and thereby optimize both index space and query processing time, we introduce a new structure called the distance preserving balanced tree for T(V, E), denoted as DT(T). Generally speaking, DT(T) preserves the distance information for every node pair on T, and the height of DT(T) is at most log |V|. The formal definition of DT(T) is as follows.

DEFINITION 5. (Distance Preserving Balanced Tree) Given a tree T(V, E) with a positive weight on each edge, a Distance Preserving Balanced Tree of T, denoted as DT(T), is an unweighted tree with the following three properties.
P1: V(DT(T)) = V(T).
P2: depth(DT(T)) ≤ log |V|.
P3: For any two nodes u and v, let the lowest common ancestor of u and v on DT(T) be o = LCA_{DT(T)}(u, v). The following equation always holds: dist_T(u, v) = dist_T(u, o) + dist_T(v, o).

Note that DT(T) is unweighted, and the distances dist_T(u, v), dist_T(u, o) and dist_T(v, o) in P3 are calculated on the original tree T, not on DT(T). The lowest common ancestor LCA_{DT(T)}(u, v) is not necessarily an ancestor of u or v on the original tree T.

Based on P3, we can again divide our algorithm into two phases using DT(T). In the preprocessing phase, for each keyword λ and each node v that contains keyword λ, we propagate v into the candidate lists of its pivots on DT(T). In the query processing phase, we traverse from the query node q to the root node on DT(T). Using the balanced tree DT(T), the total size of the candidate lists for a keyword λ is bounded by Σ_{v ∈ V_λ} depth(v, DT(T)) ≤ Σ_{v ∈ V_λ} log |V|, and the total size for all keywords is bounded by O(|doc(V)| · log |V|). For processing a query, we need to traverse at most log |V| + 1 nodes on the path from the query node to the root of DT(T).

Example 12: A tree T and a distance preserving balanced tree DT(T) of T are shown in Fig. 9. The weight of each edge is marked on T.
Edge (b, d) is on T but not on DT(T), and edge (b, f) is on DT(T) but not on T. For two nodes a and d, LCA_{DT(T)}(a, d) = f, thus dist_T(a, d) = dist_T(a, f) + dist_T(d, f). Note that f is not an ancestor of d on the original tree T. PV(v, DT(T)) for each node v in DT(T) is listed on the right part of Fig. 9.

Here we introduce our algorithm for processing a k-NK query on a tree T using DT(T); in the next subsection, we show that DT(T) always exists for any tree T. We also describe how to construct DT(T) for a tree T and how to compute the candidate lists cand_λ(v) for all keywords and all nodes v on the tree DT(T).

Query Processing: Given a tree T and DT(T), Algorithm 6 shows how to process a query Q = (q, λ, k). We traverse all nodes on the path from q to the root of DT(T), which is PV(q, DT(T)) ∪ {q} (line 2). For each traversed node v, we add dist_T(q, v) to all elements in cand_λ(v) and then merge the list into the current result R, since we need to first go from node q to node v (the first segment), and then go from v to the keyword nodes in cand_λ(v) (the second segment). Note that the time complexity of the ⊗ operator in line 3 is O(|cand_λ(v)|). However, by combining it with ⊕_k, it is easy to reduce the time complexity of line 3 to O(k).

[Figure 10: Pivot Approach with Tree Balancing. The tree DT(T) with the candidate list cand_λ(v) marked beside each node.]

Algorithm 6: tree-pivot(Q, T)
Input: A k-NK query Q = (q, λ, k), and a tree T.
Output: Answer for Q on T.
1  R ← ∅;
2  foreach v ∈ PV(q, DT(T)) ∪ {q} do
3      R ← R ⊕_k (cand_λ(v) ⊗ dist_T(q, v));
4  return R;

Example 13: Fig. 10 shows a distance preserving balanced tree DT(T) for the tree T shown in Fig. 8. For keyword λ, the nodes that contain λ are marked with bold circles in Fig. 10. For a query Q = (e, λ, 3), we just need to merge the candidate lists cand_λ(e) and cand_λ(r) after adding the distance dist_T(e, r) to all elements in cand_λ(r).
However, if we use the basic pivot approach on the original tree T without tree balancing, we need to merge the candidate lists of nodes e, h, d and r respectively. The answer for Q consists of the nodes c, t and b.

THEOREM 4. The time complexity for answering a k-NK query on a tree T(V, E) using Algorithm 6 is O(k · log |V|).

6.3 Index Construction

Given a tree T, in order to answer a query Q = (q, λ, k) using Algorithm 6, we need to build two indexes. The first index is the distance preserving balanced tree DT(T) for T, and the second index is the candidate list cand_λ(v) for each keyword λ and each node v on DT(T). We introduce them separately in the following.

Constructing DT(T): Before introducing how to construct a tree DT(T) that satisfies the three properties in Definition 5, we first present an approach to constructing a tree T′ from T which satisfies properties P1 and P3. In other words, T′ is distance preserving but not necessarily balanced. Let the initial T′ be T. We change T′ by performing the following steps. (1) Randomly select a node r on T′ as the new root and rotate T′ accordingly. (2) For each direct subtree T_c of r on T′, perform steps (1) and (2) on T_c recursively. Clearly, after steps (1) and (2), T′ may not be isomorphic to T. We have the following two observations on T′.

O1: After performing step (1) on T′, two nodes u and v are in different direct subtrees of r if and only if LCA_{T′}(u, v) = r. This property also holds after performing step (2) on T′, because step (2) only changes the structure within a subtree of r.
O2: Since the structure of T′ is not changed by step (1), for two nodes u and v in different direct subtrees of r we have dist_T(u, v) = dist_T(u, r) + dist_T(v, r) on the original tree T.

From O1 and O2, we have dist_T(u, v) = dist_T(u, LCA_{T′}(u, v)) + dist_T(v, LCA_{T′}(u, v)) after step (2) on T′. This property also holds for any subtree of T′, because it is processed using steps (1) and (2) recursively. As a result, T′ satisfies property P3.

Our DT(T) is constructed in a similar way as T′. In order to construct a balanced tree, the root node in step (1) should be selected more carefully, instead of by random selection.
In our method, we select a median node as the root node in step (1), which is defined as follows.

DEFINITION 6. (Median Node) Given a tree T, the Median Node of T is a node r on T such that, when using r as the root of T, for each direct subtree T_c of r on T, |V(T_c)| ≤ |V(T)| / 2 holds.
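The recursive construction with a median node as the root can be sketched as follows. This is a minimal Python sketch under assumptions, not the paper's algorithm verbatim: the names `dt_construct`, `component` and `median` are hypothetical, the tree is an adjacency dictionary, the median node is found directly from Definition 6 by subtree sizes, and recursion is replaced by an explicit stack. When two nodes satisfy the definition, the sketch simply takes the first one it meets.

```python
def dt_construct(adj):
    """Return the parent map of a distance preserving balanced tree.

    adj: node -> {neighbor: edge weight} describing the original tree T.
    The weights are not needed to build the shape of DT(T).
    """
    dt_parent, removed = {}, set()

    def component(start):
        # Nodes of the remaining subtree containing start, parents first.
        order, par, stack = [], {start: None}, [start]
        while stack:
            v = stack.pop()
            order.append(v)
            for w in adj[v]:
                if w not in removed and w != par[v]:
                    par[w] = v
                    stack.append(w)
        return order, par

    def median(start):
        order, par = component(start)
        size = {v: 1 for v in order}
        for v in reversed(order):            # accumulate subtree sizes
            if par[v] is not None:
                size[par[v]] += size[v]
        total = len(order)
        for v in order:
            largest = total - size[v]        # the part reached through par[v]
            for w in adj[v]:
                if w not in removed and par.get(w) == v:
                    largest = max(largest, size[w])
            if 2 * largest <= total:         # every direct subtree <= half
                return v

    stack = [(next(iter(adj)), None)]
    while stack:
        start, p = stack.pop()
        r = median(start)
        dt_parent[r] = p                     # r becomes the root of this part
        removed.add(r)
        for w in adj[r]:                     # one remaining subtree per neighbor
            if w not in removed:
                stack.append((w, r))
    return dt_parent

# Hypothetical chain a - b - c - d - e - f - g with unit edge weights.
names = "abcdefg"
adj = {v: {} for v in names}
for x, y in zip(names, names[1:]):
    adj[x][y] = adj[y][x] = 1
dt_parent = dt_construct(adj)
```

On the 7-node chain the middle node d becomes the root and each half is split again, so the resulting tree has depth 2 instead of the depth 3 obtained by only rerooting the chain at d.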
Algorithm 7: DT-construct(T)
Input: A tree T.
Output: A distance preserving balanced tree DT(T).
1  r ← the median node of T;
2  rotate T with r as the root;
3  DT(T) ← a tree with a single node r;
4  foreach direct subtree T_i of r in T do
5      DT(T_i) ← DT-construct(T_i);
6      add DT(T_i) as a subtree of r in DT(T);
7  return DT(T);

The median node is used to balance the sizes of the direct subtrees of T when using r as the root of T, as a direct subtree of r in T contains at most half of the nodes in T. Clearly, if a median node exists for any tree, we can select a median node of tree T as the root and recursively do this for each direct subtree of the root. In this way we can construct a tree T′ with depth(T′) ≤ log |V(T)|. The following lemma shows that the median node always exists on any tree T, and also gives a method to find the median node of T.

LEMMA. Given a tree T, the median node of T is the node r such that the subtree rooted at r contains more than |V(T)| / 2 nodes and depth(r, T) is the maximum.

According to the lemma, the median node is unique on T. Otherwise, if there were two such nodes with the same maximum depth, the size of the tree would be larger than |V(T)|. Given a tree T, we can easily find the median node of T in O(|V(T)|) time by traversing each node in T only once.

Algorithm 7 shows how to construct DT(T) for a tree T. Specifically, given a tree T, we first find the median node r of T as the new root and then rotate T accordingly (lines 1-2). The median node r is also the root of DT(T) (line 3). For each direct subtree T_i of r in T, we create DT(T_i) recursively and add DT(T_i) as a subtree of r in DT(T) (lines 4-6).

Example 14: For the tree T shown in Fig. 8, DT(T) is shown in Fig. 10. DT(T) is constructed as follows. Since r is the median node of T, the root of DT(T) is r. For the first subtree under r in T, its median node is a, thus the first subtree under r in DT(T) is rooted at a. All other nodes in DT(T) are constructed similarly. We have depth(DT(T)) = ⌊log |V(T)|⌋ = ⌊log 20⌋.

THEOREM 5. Given a tree T(V, E), Algorithm 7 constructs a distance preserving balanced tree DT(T) for T using O(|V| · log |V|) time and O(|V|) space.

Constructing cand_λ(v): For a tree T(V, E), given DT(T), the algorithm for constructing the candidate list cand_λ(v) for each node v and each keyword λ is quite simple. For each node v, we propagate its keyword information to all its pivots in DT(T). Our algorithm is shown in Algorithm 8. We first initialize every candidate list to be ∅ (line 1). Then we traverse each node v in DT(T) and each keyword λ that is contained in node v (lines 2-3). For each pivot p of v as well as v itself, we calculate dist_T(p, v) on the original tree T, and add the element v : dist_T(p, v) to the candidate list cand_λ(p) (lines 4-5). After all candidate lists are created, we sort the elements in every candidate list in nondecreasing order of the distances (lines 6-7). The time complexity of lines 2-5 is O(|doc(V)| · log |V|), since each keyword is propagated into at most log |V| candidate lists in DT(T). For lines 6-7, we need O(|doc(V)| · log |V|) time to sort all candidate lists in DT(T).

THEOREM 6. For a tree T, Algorithm 8 computes the candidate lists cand_λ(v) for all nodes v and all keywords λ on DT(T) using O(|doc(V)| · log |V|) time and O(|doc(V)| · log |V|) space.

Algorithm 8: cand-construct(T, DT(T))
Input: A tree T, a distance preserving balanced tree DT(T).
Output: cand_λ(v) for each v on DT(T) and each keyword λ.
1  cand_λ(v) ← ∅ for each node v on DT(T) and each keyword λ;
2  foreach v ∈ V(DT(T)) do
3      foreach λ ∈ doc(v) do
4          foreach p ∈ PV(v, DT(T)) ∪ {v} do
5              cand_λ(p) ← cand_λ(p) ∪ {v : dist_T(p, v)};
6  foreach v ∈ V(DT(T)) and keyword λ do
7      sort elements in cand_λ(v) in nondecreasing order of distances;

Algorithm 9: graph-knk(G, Q)
Input: A graph G(V, E) and a k-NK query Q = (q, λ, k).
Output: The answer for Q on G.
1  R ← ∅;
2  foreach distance oracle O_i do
3      T_i ← shortest path tree for O_i;
4      R ← R ⊕_k tree-knk(T_i, Q);
5  return R;

7. APPROXIMATE K-NK ON A GRAPH

In this section, we discuss how to answer a k-NK query Q = (q, λ, k) on a graph G. We introduce two algorithms, graph-boundk and graph-pivot, for a bounded k and an arbitrary k respectively.
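To make the pivot index on a tree concrete, the following Python sketch builds the candidate lists on DT(T) and answers a query by walking from the query node to the root of DT(T). It is an illustration under assumptions: the function names are hypothetical, all pairwise tree distances are precomputed by brute force (quadratic, for illustration only; a real index would compute only the pivot distances), and the chain, its edge weights, the keyword "x" and the DT parent map are invented for the example.

```python
import heapq

def tree_dist_from(adj, src):
    """Distances from src to every node of a weighted tree."""
    dist, stack = {src: 0}, [src]
    while stack:
        v = stack.pop()
        for w, weight in adj[v].items():
            if w not in dist:
                dist[w] = dist[v] + weight
                stack.append(w)
    return dist

def build_pivot_index(adj, dt_parent, doc):
    """cand[(p, kw)]: sorted (dist_T(p, v), v) over keyword nodes v that
    have p as a pivot on DT(T), or v == p."""
    dist = {v: tree_dist_from(adj, v) for v in adj}   # brute force
    cand = {}
    for v, keywords in doc.items():
        p = v
        while p is not None:                 # v itself, then all its pivots
            for kw in keywords:
                cand.setdefault((p, kw), []).append((dist[p][v], v))
            p = dt_parent[p]
    for entries in cand.values():
        entries.sort()
    return cand, dist

def tree_pivot_query(cand, dist, dt_parent, q, kw, k):
    """Merge the first k entries of each list on the path q -> DT root."""
    best, v = {}, q
    while v is not None:
        for d, u in cand.get((v, kw), [])[:k]:
            total = dist[q][v] + d           # first segment + second segment
            if u not in best or total < best[u]:
                best[u] = total
        v = dt_parent[v]
    return heapq.nsmallest(k, ((d, u) for u, d in best.items()))

# Hypothetical chain a - b - c - d - e - f - g; edge (a, b) has length 2.
names = "abcdefg"
adj = {v: {} for v in names}
for x, y in zip(names, names[1:]):
    adj[x][y] = adj[y][x] = 2 if (x, y) == ("a", "b") else 1
dt_parent = {"d": None, "b": "d", "f": "d", "a": "b", "c": "b", "e": "f", "g": "f"}
doc = {"a": {"x"}, "e": {"x"}, "g": {"x"}}
cand, dist = build_pivot_index(adj, dt_parent, doc)
answer = tree_pivot_query(cand, dist, dt_parent, "c", "x", 2)
```

For the query node c, only the lists stored at c, b and d are touched, which matches the path from c to the root of the balanced tree.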
We then propose a global storage technique to reduce the index size and query processing time. We also show how our approach can be extended to handle multiple keywords. Finally, we summarize the complexities of all algorithms introduced in this paper.

Query Processing: Our general idea for query processing on a graph was introduced earlier. Suppose we have computed O(log |V|) distance oracles O_1, O_2, ... using the algorithm in [0]. Let the shortest path trees for the oracles be T_1, T_2, ... respectively. Algorithm 9 shows our framework for answering Q on G. The algorithm simply enumerates all shortest path trees, answers the k-NK query on each shortest path tree T_i using a tree based approach, denoted as tree-knk, and merges all the results using the ⊕_k operator (line 4). Since we have two tree based solutions, namely tree-boundk and tree-pivot, we have two corresponding algorithms on graphs, denoted as graph-boundk and graph-pivot, obtained by instantiating tree-knk (line 4) to tree-boundk and tree-pivot respectively.

Global Storage: As discussed above, we have shortest path trees T_1, T_2, .... For a keyword λ and a node v, let cand^i_{v,λ} be the candidate list of v on tree T_i. To answer a k-NK query Q = (q, λ, k) on a graph, consider the case when the candidate lists of node v on two different trees T_i and T_j are both merged into the result, in the form of R ← R ⊕_k (cand^i_{v,λ} ⊗ dist_{T_i}(q, v)) ⊕_k (cand^j_{v,λ} ⊗ dist_{T_j}(q, v)). This expression can be generalized to the case of merging the candidate lists of node v on more than two trees. Instead of keeping a candidate list cand^i_{v,λ} for each tree T_i separately, we propose a technique called global storage, which keeps one global candidate list of node v and keyword λ for all trees T_1, T_2, .... Denote the global candidate list of node v and keyword λ as cand_{v,λ}. It is computed by cand_{v,λ} = cand^1_{v,λ} ⊕ cand^2_{v,λ} ⊕ ⋯. A node v′ ∈ cand_{v,λ} may appear in the candidate list cand^i_{v,λ} of multiple trees T_i, but is stored at most once in the global candidate list cand_{v,λ}.
Therefore, the global storage technique can effectively reduce the index size, but it adds difficulty to query processing for two reasons: (1) we need to add dist_{T_i}(q, v) to cand^i_{v,λ} using the ⊗ operator, i.e., cand^i_{v,λ} ⊗ dist_{T_i}(q, v), but dist_{T_i}(q, v) is query dependent and thus cannot be precomputed; (2) the global candidate list may provide a result list different from the one computed by Algorithm 9 without global storage. In the following, we show that the global candidate list can be used to answer k-NK queries without sacrificing the result quality.

[Figure 11: Global Storage Example for graph-pivot. Two balanced trees DT(T_1) and DT(T_2) with per-tree candidate lists beside each node, and the global candidate lists of nodes r, e and m on top.]

We first define the domination relationship between two candidate lists.

DEFINITION 7. For two candidate lists R = {u_1 : d_{u_1}, u_2 : d_{u_2}, ...} and R′ = {v_1 : d_{v_1}, v_2 : d_{v_2}, ...} sorted in nondecreasing order of distances, R is dominated by R′, denoted as R ⪯ R′, if and only if |R| ≤ |R′| and d_{u_i} ≥ d_{v_i} for all 1 ≤ i ≤ |R|.

Clearly, the domination relationship is transitive, i.e., if R_1 ⪯ R_2 and R_2 ⪯ R_3, then R_1 ⪯ R_3.

To solve the first problem, we need a merge method that is independent of dist_{T_i}(q, v) and, at the same time, generates an answer that is no worse than the answer computed without global storage. The solution is expressed in Equ. 1. For any two candidate lists cand^i_{v,λ} ⊗ dist_{T_i}(q, v) and cand^j_{v,λ} ⊗ dist_{T_j}(q, v), using Equ. 1 we can generate a better result by first merging cand^i_{v,λ} and cand^j_{v,λ} using ⊕_k, then taking the distances dist_{T_i}(q, v) and dist_{T_j}(q, v) out and applying the minimum of them. Clearly, (cand^i_{v,λ} ⊕_k cand^j_{v,λ}) ⊗ min{dist_{T_i}(q, v), dist_{T_j}(q, v)} is a valid candidate list for query Q, because cand^i_{v,λ} ⊕_k cand^j_{v,λ} is a candidate list for node v and min{dist_{T_i}(q, v), dist_{T_j}(q, v)} corresponds to a path from q to v in G.
\[
\bigl(cand^{i}_{v,\lambda} \otimes dist_{T_i}(q, v)\bigr) \oplus_k \bigl(cand^{j}_{v,\lambda} \otimes dist_{T_j}(q, v)\bigr) \;\preceq\; \bigl(cand^{i}_{v,\lambda} \oplus_k cand^{j}_{v,\lambda}\bigr) \otimes \min\{dist_{T_i}(q, v),\, dist_{T_j}(q, v)\} \tag{1}
\]

The second problem can be solved if we prove that by merging more candidate lists using the ⊕ operator, the answer will not get worse. Consider a node v′ ∈ cand_{v,λ}: the merging operation finds the minimum distance between v and v′ over multiple trees, which is a refined estimation of their distance on the graph. We formulate this situation using Equ. 2.

\[
cand^{i}_{v,\lambda} \;\preceq\; cand^{i}_{v,\lambda} \oplus cand^{j}_{v,\lambda} \tag{2}
\]

Equ. 1 and Equ. 2 also hold for multiple candidate lists. Therefore, using global storage does not sacrifice the result quality. More importantly, global storage can effectively reduce the index size and query processing time. It applies to both graph algorithms, graph-boundk and graph-pivot. We use the following example to illustrate global storage.

Example 15: We take the graph-pivot algorithm as an example. Fig. 11 shows two trees DT(T_1) and DT(T_2) for the shortest path trees T_1 and T_2 shown in Fig. , with the candidate list for keyword λ marked beside each node. Using global storage, for the same node on different trees, we merge all its candidate lists using ⊕ and only keep one global candidate list. The global candidate lists for nodes r, e and m are marked on the top of Fig. 11. For a query with query node p, without global storage we need to merge three candidate lists: cand^1_{e,λ} ⊗ dist_{T_1}(p, e), cand^1_{r,λ} ⊗ dist_{T_1}(p, r) and cand^2_{e,λ} ⊗ dist_{T_2}(p, e). Using global storage, only two candidate lists, cand_{e,λ} ⊗ min{dist_{T_1}(p, e), dist_{T_2}(p, e)} and cand_{r,λ} ⊗ dist_{T_1}(p, r), need to be merged.

Table 1: Algorithm Complexities on Trees (T) and Graphs (G)
                  boundk                          pivot
Query Time (T)    O(log |V_λ| + k)                O(k · log |V|)
Index Time (T)    O(k̄ · |doc(V)|)                 O(|doc(V)| · log |V|)
Index Size (T)    O(k̄ · |doc(V)|)                 O(|doc(V)| · log |V|)
Query Time (G)    O((log |V_λ| + k) · log |V|)    O(k · log² |V|)
Index Time (G)    O(k̄ · |doc(V)| · log |V|)       O(|doc(V)| · log² |V|)
Index Size (G)    O(k̄ · |doc(V)| · log |V|)       O(|doc(V)| · log² |V|)

Table 2: Dataset Statistics (columns: |V|, |E|, |doc(V)|, keywords; values as they survive in the source)
DBLP    , 69, 69, 76, 80, 8, 0, 0
FLARN   , 070, 76, 6, 99 6, 966, 66, 70
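The effect of global storage on a single node v can be sketched in Python as follows. This is an assumption-laden illustration rather than the paper's implementation: `global_list` and `query_contribution` are hypothetical names, candidate lists are dictionaries from node to stored distance, and the two per-tree lists and query distances below are made up.

```python
def global_list(per_tree_lists):
    """Merge the candidate lists of one node over all trees: every keyword
    node is stored once, with its minimum distance over the trees."""
    merged = {}
    for cand in per_tree_lists:
        for node, d in cand.items():
            merged[node] = min(d, merged.get(node, float("inf")))
    return dict(sorted(merged.items(), key=lambda x: (x[1], x[0])))

def query_contribution(global_cand, dists_q_to_v, k):
    """Contribution of node v to a query: the first k global entries,
    shifted by the minimum of dist_Ti(q, v) over the trees."""
    offset = min(dists_q_to_v)
    return [(node, d + offset) for node, d in list(global_cand.items())[:k]]

# Hypothetical candidate lists of the same node v on two trees.
cand_tree_1 = {"c": 1, "t": 2}
cand_tree_2 = {"n": 1, "c": 3}
merged = global_list([cand_tree_1, cand_tree_2])
contribution = query_contribution(merged, [2, 5], k=2)
```

Here node c appears on both trees but is stored only once, and the query-dependent offset is applied at query time, so nothing about the query has to be known when the index is built.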
For query Q = (h, λ, 2), without global storage we get a result R containing the nodes c and b. Using global storage, we get a result R′ containing the nodes n and c, with R ⪯ R′.

Handling Multiple Keywords: We discuss how to extend our approach to handle a k-NK query with multiple keywords under AND (denoted as ∧) and OR (denoted as ∨) semantics. Without loss of generality, we assume the format of a keyword expression is (λ_{1,1} ∧ λ_{1,2} ∧ ⋯) ∨ (λ_{2,1} ∧ λ_{2,2} ∧ ⋯) ∨ ⋯. It is easy to handle ∨, by answering each conjunction λ_{i,1} ∧ λ_{i,2} ∧ ⋯ separately and merging the results using the ⊕_k operator. For handling a conjunction λ_{i,1} ∧ λ_{i,2} ∧ ⋯, we select the keyword λ_{i,j} from {λ_{i,1}, λ_{i,2}, ...} with the least frequency |V_{λ_{i,j}}| as the primary keyword and consider the other keywords as filter keywords. We answer the query for the single keyword λ_{i,j}. Before merging each candidate list using the ⊕_k operator, we remove from the candidate list the candidate nodes that do not contain one or more of the filter keywords. In this way, each element in the final answer satisfies the predicate specified in the keyword expression.

Comparison: Table 1 summarizes and compares the query time, index time and index size of boundk and pivot on trees and graphs. Here, the listed complexities of index time and index size are for all keywords in the tree/graph. boundk is faster than pivot in query processing on both trees and graphs. When k̄ is small, the index time and index space of boundk are smaller than those of pivot on both trees and graphs. However, when k̄ is large, the index time and index space of boundk are larger, while the index time and index space of pivot are independent of k on both trees and graphs.

8. EXPERIMENTS

In this section, we report the performance of our methods boundk and pivot, and their global storage implementations boundk-gs and pivot-gs, together with two baseline solutions BFS and PMI. BFS is a brute-force search that uses Dijkstra's algorithm to identify the nearest k keyword nodes, and PMI (Partitioned Multi-Indexing) [] is the state-of-the-art approximate algorithm based on distance oracle [0]. For all the distance oracles involved, we set the number of oracles to log |V|. We implemented all methods in GNU C++, and conducted all experiments on a Windows machine with an Intel Xeon .7GHz CPU and 8GB memory.
All methods run in main memory. A memory limit is set for the index size.

Datasets and Queries. We use two real graphs, DBLP (http://www.informatik.uni-trier.de/~ley/db) and the Florida road network FLARN (http://www.dis.uniroma1.it/challenge9/download.shtml), with statistics listed in Table 2. DBLP includes , 060, 76 articles, 6, 89 authors and , 7 conferences/journals, all of which are treated as nodes. There is an edge between nodes u and v if u is an author of article v, or u is an article published in conference/journal v. The keywords of an author node include first name and last name, the keywords of an article node include title words, editor, year, publisher, isbn, etc., and the keywords of a conference/journal node include association and name. A weight (log deg(u) + log deg(v)) is assigned to edge (u, v), where deg(u) denotes the degree of node u. Compared with the unit edge weight setting, the numerical edge weights can effectively differentiate the edges in a graph. Thus for any k-NK query, this helps produce a ground-truth ranking of the top-k answer nodes with fewer ties in their distances, which is important for fair and unambiguous ranking quality evaluation.

[Figure 12: Hit rate, Spearman's rho and Error by Varying k. Panels: (a) Hit rate, DBLP; (b) Spearman's rho, DBLP; (c) Error, DBLP; (d) Hit rate, FLARN; (e) Spearman's rho, FLARN; (f) Error, FLARN.]

In FLARN, a node represents an intersection or endpoint, an edge denotes a road segment, and the edge weight is the distance of the road segment. We obtained the keywords of nodes from the OpenStreetMap project (http://wiki.openstreetmap.org/wiki/Main_Page) with a bounding box. However, only 7, 7 nodes out of , 070, 76 have keywords. To address the keyword sparseness issue and better discriminate different methods, we assign a random number (between 0 and ) of keywords to the nodes with no keyword. After this step, there are still , 08 nodes without any keyword in FLARN. We remove stop words in DBLP and FLARN.

For each dataset, we generate 00 k-NK queries in the form of Q = (q, λ, k), where q ∈ V is a randomly selected query node, and λ is a keyword randomly selected by following the keyword frequency distribution in the document collection. We test k = ,,..., 8.

Evaluation Metrics. We use six metrics for evaluation: hit rate, Spearman's rho [], error, query time, index time, and index size. Spearman's rho measures the rank correlation between an approximate rank result and the ground truth. Hit rate and error, defined as follows, measure the quality of an approximate result. For a query Q = (q, λ, k), denote the exact result as R = {u_1 : d_1, ..., u_k : d_k} in nondecreasing order of their distances, and d = d_k as the upper bound distance of the result R.
Denote an approximate result set as R′ = {u′_1 : d′_1, ..., u′_k : d′_k} in nondecreasing order of their distances. The hit rate is defined as:

\[
hit(R') = \bigl|\{\, i \in [1, k] : dist(u'_i, q) \le d \,\}\bigr| \,/\, k
\]

and the error is the average relative error of the estimated distances w.r.t. the ground truth:

\[
e(R') = \frac{1}{k} \sum_{1 \le i \le k} \frac{\lvert d'_i - d_i \rvert}{d_i}
\]

Hit rate, Spearman's rho and Error. Figures 12(a)-(c) show the hit rate, Spearman's rho, and error on DBLP respectively when we vary k. Our method improves the hit rate of PMI by 96%, and also improves Spearman's rho on average. The error of our method is within 0.066 for all k values, demonstrating that the estimated distance is very close to the exact distance. Notably, our method reduces the error of PMI by an order of magnitude, i.e., from 0.60 to 0.06 on average. Furthermore, the error of our method does not increase with k, while that of PMI increases with k. Note that when k = 1, Spearman's rho is constant.

Figures 12(d)-(f) show the hit rate, Spearman's rho, and error on FLARN respectively. Our method improves both the average hit rate and the average Spearman's rho of PMI. The error of our method is below 0.68 for all k values and is several times smaller than that of PMI on average. Note that the performance of BFS is omitted in Figure 12, as it returns the exact result. Furthermore, since the result quality with and without global storage does not differ substantially, the global storage methods boundk-gs and pivot-gs are also omitted in Figure 12 for the sake of clarity. We do observe, however, that the global storage technique slightly improves the hit rate of boundk/pivot (by 0.7% on FLARN), and reduces the error by 6.7% on DBLP and 6.9% on FLARN on average. Given the memory limit for index size, boundk can only support a bounded range of k on DBLP and on FLARN in Figure 12, as its index size increases linearly with k.

[Figure 13: Query Time in Microseconds by Varying k. Panels: (a) DBLP; (b) FLARN.]

[Figure 14: Query Time of Varying Keyword Frequency. Panels: (a) DBLP; (b) FLARN.]

Query Time.
Figure shows the query time of different methods in log scale when we vary k. The query time of BFS is 0 0 6 microseconds, which is two to three orders of magnitude slower than the other methods. Figure (a) shows the query time on DBLP. The query time of all methods increases with the increase of k. is the most efficient. The query time of, -gs, and -gs is less than times that of, which is quite close. Global storage reduces the query time of by % and that of by %. Remarkably, each of our proposed approaches can report a result within millisecond for all k values. Figure (b) shows the query time on FLARN. We can observe that is the fastest, closely followed by and -gs, whose query time is less than two times that of and one third that of for all k values. and -gs take a little longer as their query time depends on the tree depth, which is larger on FLARN. But their query time is within milliseconds for k = 8, which is still quite efficient. Global storage helps reduce the query time of by 0% and that of by %.

Figure further plots the query time of and -gs on the 00 k-nk queries in ascending order of the query keyword frequency in the graph. We set k = in this experiment. For illustration, we also label a few query keyword frequencies on the x axis. The query time shows a sharper increasing trend on DBLP than on FLARN, as the frequency difference between DBLP keywords is larger. These empirical results are consistent with the theoretical result, i.e., the query time complexity of depends on $\log |V_\lambda|$, where $|V_\lambda|$ is the frequency of keyword λ.

Index Time and Index Size. Figure shows the total index time (IT) and index size (IS) for indexing all keywords by different methods. We observe that the index time of is.6 times that of on DBLP, and 8. times on FLARN. The index construction time of is longer on FLARN than on DBLP. This is because the complexity of grows linearly with the tree depth,
and the large diameter of FLARN leads to a large tree depth. All methods can finish the index construction for all keywords in a graph within. hours. Given the memory limit of GB for index size, can only support k on DBLP and k 8 on FLARN, as its index size increases linearly with k. In contrast, /-gs have no such limitation. The index size of is. times that of on DBLP and 7.9 times on FLARN, due to the large diameter of FLARN. By keeping a global candidate list and removing duplicate index items, global storage reduces the index size of by 6% on DBLP and % on FLARN. It also reduces the index size of by % on DBLP and % on FLARN. Remarkably, the index size of -gs is 6.7GB on DBLP, which is even smaller than that of (6.8GB). This result demonstrates the superiority of global storage.

[Figure: Index time and index size. Bar groups: IT(DBLP), IT(FLARN), IS(DBLP), IS(FLARN). Axes: index time (s) and index size (GB).]

9. RELATED WORK
The work most closely related to our study includes nearest keyword search on XML documents [22] and top-k nearest keyword search on graphs [1], both of which have been introduced in detail in Section. In the sequel, we review the existing work on other topics related to our study.

Keyword search in a graph finds a substructure of the graph containing the query keywords. The answer substructure can be a tree [3, 12, 13, 8, 10, 9], a subgraph [16, 17] or an r-clique [14]. A survey on keyword search in databases and graphs can be found in [25]. Keyword search has substantial differences from the k-nk query studied in this paper. In terms of problem definition, keyword search looks for a network structure, the nodes in which jointly contain all the query keywords, whereas a k-nk query looks for k nearest answer nodes, each one of which contains all the query keywords. In terms of solution, keyword search performs BFS or Dijkstra's algorithm to find the answer networks, whereas our proposed solutions build an index structure based on distance oracles and compact trees for keywords.
Therefore, our query time efficiency is much higher than that of BFS and Dijkstra's algorithm, which has also been confirmed in our experiments.

[4] and [24] study keyword routing on a road network. Given a keyword set, a source and a target location, the goal is to find the shortest path that passes through at least one matching object for each keyword.

Distance oracle is an approximate distance estimation technique. [23] is a seminal work on distance oracle that estimates distance with k stretch using an O( V + k ) sized index. Hermelin et al. [11] adapt the distance oracle [23] to answer -NK queries with k stretch in O(k) time using an O(k V + k ) sized index. Our methods build on the distance oracle by Das Sarma et al. [20].

K nearest neighbor (k-nn) search has been extensively studied in spatial networks [15, 5, 6, 18, 19, 7]. [15] uses network Voronoi polygons to divide a graph into disjoint subsets for k-nn search. [5, 6] use R-tree to embed textual information on nodes, and augment a tree node with an inverted index for spatial documents within the MBR. [18] answers k-nn queries with a shortest path quadtree. [19] answers k-nn queries based on ε-approximated distance estimated by an index termed path-distance oracle. [7] performs Dijkstra-like expansion from the query node. However, the above approaches designed for spatial networks cannot apply to graphs without coordinates.

10. CONCLUSIONS
In this paper, we study top-k nearest keyword (k-nk) search on large graphs. We propose two exact k-nk algorithms on trees to handle a bounded k and an arbitrary k respectively. We extend tree based algorithms to graphs and propose a global storage technique to further reduce the index size and query time. We conducted extensive performance studies on real large graphs to demonstrate the effectiveness and efficiency of our algorithms.

Acknowledgments
This work is supported by the Hong Kong Research Grants Council (RGC) General Research Fund (GRF) Project No. CUHK, 0, 8, and the Chinese University of Hong Kong Direct Grant No. 00..

11. REFERENCES
[1] B. Bahmani and A. Goel.
Partitioned multi-indexing: Bringing order to social search. In WWW, pages 99 08, 0.
[2] M. A. Bender and M. Farach-Colton. The LCA problem revisited. In Latin American Theoretical Informatics, pages 88 9. Springer, 000.
[3] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, pages 0, 00.
[4] X. Cao, L. Chen, G. Cong, and X. Xiao. Keyword-aware optimal route search. PVLDB, ():6 7, 0.
[5] Y.-Y. Chen, T. Suel, and A. Markowetz. Efficient query processing in geographic web search engines. In SIGMOD, pages 77 88, 006.
[6] M. Christoforaki, J. He, C. Dimopoulos, A. Markowetz, and T. Suel. Text vs. space: Efficient geo-search query processing. In CIKM, pages, 0.
[7] K. Deng, X. Zhou, H. T. Shen, S. W. Sadiq, and X. Li. Instance optimal query processing in spatial networks. VLDB J., 8():67 69, 009.
[8] B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding top-k min-cost connected trees in databases. In ICDE, pages 86 8, 007.
[9] K. Golenberg, B. Kimelfeld, and Y. Sagiv. Keyword proximity search in complex data graphs. In SIGMOD, pages 97 90, 008.
[10] H. He, H. Wang, J. Yang, and P. S. Yu. BLINKS: Ranked keyword searches on graphs. In SIGMOD, pages 0 6, 007.
[11] D. Hermelin, A. Levy, O. Weimann, and R. Yuster. Distance oracles for vertex-labeled graphs. In ICALP (), pages 90 0, 0.
[12] V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases. In VLDB, pages 670 68, 00.
[13] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. In VLDB, pages 0 6, 00.
[14] M. Kargar and A. An. Keyword search in graphs: Finding r-cliques. PVLDB, (0):68 69, 0.
[15] M. R. Kolahdouzan and C. Shahabi. Voronoi-based k nearest neighbor search for spatial network databases. In VLDB, pages 80 8, 00.
[16] G. Li, B. C. Ooi, J. Feng, J. Wang, and L. Zhou. Ease: Efficient and adaptive keyword search on unstructured, semi-structured and structured data. In SIGMOD, pages 90 9, 008.
[17] L.
Qin, J. X. Yu, L. Chang, and Y. Tao. Querying communities in relational databases. In ICDE, pages 7 7, 009.
[18] H. Samet, J. Sankaranarayanan, and H. Alborzi. Scalable network distance browsing in spatial databases. In SIGMOD, pages, 008.
[19] J. Sankaranarayanan and H. Samet. Query processing using distance oracles for spatial networks. IEEE Trans. Knowl. Data Eng., (8):8 7, 00.
[20] A. D. Sarma, S. Gollapudi, M. Najork, and R. Panigrahy. A sketch-based distance oracle for web-scale graphs. In WSDM, pages 0 0, 00.
[21] C. Spearman. The proof and measurement of association between two things. Amer. J. Psychol., ():7 0, 90.
[22] Y. Tao, S. Papadopoulos, C. Sheng, and K. Stefanidis. Nearest keyword search in XML documents. In SIGMOD, pages 89 600, 0.
[23] M. Thorup and U. Zwick. Approximate distance oracles. In STOC, pages 8 9, 00.
[24] B. Yao, M. Tang, and F. Li. Multi-approximate-keyword routing in GIS data. In GIS, pages 0 0, 0.
[25] J. X. Yu, L. Qin, and L. Chang. Keyword search in relational databases: A survey. IEEE Data Eng. Bull., ():67 78, 00.