Supporting Efficient Top-k Queries in Type-Ahead Search


Guoliang Li, Jiannan Wang, Chen Li, Jianhua Feng

Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing 100084, China.
Department of Computer Science, UC Irvine, CA, USA

ABSTRACT

Type-ahead search can find answers on the fly as a user types in a keyword query. A main challenge in this search paradigm is the high-efficiency requirement that queries must be answered within milliseconds. In this paper we study how to answer top-k queries in this paradigm, i.e., as a user types in a query letter by letter, we want to efficiently find the k best answers. Instead of inventing completely new algorithms from scratch, we study challenges when adopting existing top-k algorithms in the literature that heavily rely on two basic list-access methods: random access and sorted access. We present two algorithms to support random access efficiently. We develop novel techniques to support efficient sorted access using list pruning and materialization. We extend our techniques to support fuzzy type-ahead search, which allows minor errors between query keywords and answers. We report our experimental results on several real large data sets to show that the proposed techniques can answer top-k queries efficiently in type-ahead search.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms
Algorithms, Experimentation, Performance

Keywords
Type-ahead search, top-k search, fuzzy search

1. INTRODUCTION

To give instant feedback when users formulate search queries, many information systems support autocomplete search, which shows results immediately after a user types in a partial keyword query. As an example, almost all the major search engines nowadays automatically suggest possible keyword queries as a user types in partial keywords.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'12, August 12-16, 2012, Portland, Oregon, USA. Copyright 2012 ACM ...$15.00.

Most autocomplete systems treat a query with multiple keywords as a single string, and find answers with text that matches the string exactly. To overcome this limitation, a new type-ahead search paradigm has emerged recently [2, 3]. Using this paradigm, a system treats a query as a set of keywords, and does a full-text search on the underlying data to find answers including the keywords. We treat the last keyword in the query as a partial keyword the user is completing. For instance, a query "graph sig" on a publication table can find publication records with the keyword "graph" and a keyword that has "sig" as a prefix, such as "sigir", "sigmod", and "signature". In this way, a user gets instant feedback after typing keywords, learns more about the underlying data, and can thus formulate a query more easily.

Ji et al. [3] extended type-ahead search by allowing minor errors between queries and answers. As a user types in query keywords, the system can find relevant records with keywords similar to the query keywords. This feature is especially important when the user has limited knowledge about the exact representation of the entities she is looking for. For instance, if a user types in a partial query "christos falut", the system can find records approximately matching the two keywords despite the typo in the query, such as a record with keywords "Christos Faloutsos". Clearly these features can further improve user search experiences.

In this paper we study how to answer ranking queries in type-ahead search on large amounts of data.
That is, as a user types in a keyword query letter by letter, we want to find the most relevant (or top-k) records on the fly. One approach first finds records matching those query keywords, and then computes their ranking scores to find the most relevant ones. This approach is not efficient when there are a large number of candidate answers to compute and store.

Existing type-ahead search approaches assume an index structure with a trie for the keywords in the underlying data, and each leaf node has an inverted list of records with this keyword, with the weight of this keyword in the record [3, 9]. As an example, Table 1 shows a sample collection of publication records. For simplicity, we only list some of the keywords for each record. Figure 1 shows the corresponding index structure. (More details about the index are in Section 3.) Suppose a user types in a query "graph icdm li". For exact search, we find records containing the first two keywords and a word with a prefix of "li", e.g., record r_5. For fuzzy search, we compute records with keywords similar to query keywords, and rank them to find the best answers. For each complete keyword, we find keywords similar to the query keyword. For instance, both keywords "icdm" and "icdl" are similar to the second query keyword. The last keyword "li" is treated as a prefix condition, since the user is still typing at the end of this keyword. We find keywords that have a prefix similar to "li", such as "lin", "liu", and "lui". We access the inverted lists of these similar keywords to find records and rank them to find the best answers for the user.

Table 1: Publication records with sample keywords.

Record ID   Record
r_0         graph icdm ...
r_1         graph group lui ...
r_2         gray icdl liu ...
r_3         graph icdl lin lui ...
r_4         graph group icdm lin liu ...
r_5         graph gray gross icdm lin liu ...
r_6         gray group icdm lin liu ...
r_7         gray gross group icdl lin ...
r_8         gross icdl liu ...
r_9         icdm liu ...

A key question is: how to access inverted lists on trie leaf nodes efficiently to answer top-k queries? Instead of inventing completely new algorithms from scratch, we study how to adopt a plethora of algorithms in the literature for answering top-k queries by accessing lists (e.g., [2, 2]). These algorithms share the same framework proposed by Fagin [6], in which we have lists of records sorted based on various conditions. An aggregation function takes the scores of a record from these lists and computes the final score of the record. There are two methods to access these lists: (1) Random Access: Given a record ID, we can retrieve the score of the record on each list; (2) Sorted Access: We retrieve the record IDs on each list following the list order.

In this paper we study technical challenges when adopting these algorithms, and focus on new optimization opportunities that arise in our problem. In particular, we study how to support the two types of access operations efficiently by utilizing characteristics specific to our index structures and access methods. We make the following contributions:
1) In Section 3, we present a forward-list-based method for supporting random access on the inverted lists, and develop a heap-based method and list-materialization techniques to support sorted access efficiently.
2) In Section 4 we study fuzzy type-ahead search.
We propose a list-pruning technique to improve the performance of sorted access, and study how to improve the techniques based on forward lists and list materialization for fuzzy search. Due to the challenging nature of the problem, our extensions are technically nontrivial.
3) In Section 5 we present our experimental results on real large data sets to show the efficiency of our techniques.

We have deployed several systems using this paradigm, which have been used regularly and well accepted by users due to their friendly interface and high efficiency.

2. FORMULATION AND PRELIMINARIES

Type-Ahead Search: Let R be a collection of records such as the tuples in a relational table. Let D be the set of words in R. Let Q be a query the user has typed in, which is a sequence of keywords w_1, w_2, ..., w_m. We treat the last keyword w_m as a partial keyword the user is completing, and the other keywords as complete keywords the user has completed. (Footnote 2: Our method can be easily extended to the case that every keyword is taken as a partial keyword.) As a user types in a keyword query letter by letter, type-ahead search finds on the fly the records that contain the first m − 1 keywords and a word with the last keyword as a prefix.

Figure 1: Trie index structure.

Without loss of generality, each string in the data set and in a query is assumed to use lower-case letters. For example, in Table 1, R = {r_0, r_1, ..., r_9} and D = {graph, icdm, group, lui, ...}. Suppose a user types in a query "icdm gra". We treat "icdm" as a complete keyword and "gra" as a partial keyword. Records r_0, r_4, r_5, and r_6 are potentially relevant answers. For example, r_0 contains the complete keyword "icdm" and the word "graph" with a prefix of "gra". When the user types in more letters and submits the query "icdm graph li", we treat "icdm" and "graph" as complete keywords and "li" as a partial keyword. Records r_4 and r_5 are potentially relevant answers.
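The matching semantics defined above can be illustrated with a minimal Python sketch. It scans the sample keywords of Table 1 directly instead of using the trie index of Section 3, so it only shows which records qualify, not how to find them efficiently. The names RECORDS and type_ahead_match are illustrative and do not come from the paper.

```python
# Sample keywords of Table 1 (the table lists only some keywords per record).
RECORDS = {
    0: ["graph", "icdm"],
    1: ["graph", "group", "lui"],
    2: ["gray", "icdl", "liu"],
    3: ["graph", "icdl", "lin", "lui"],
    4: ["graph", "group", "icdm", "lin", "liu"],
    5: ["graph", "gray", "gross", "icdm", "lin", "liu"],
    6: ["gray", "group", "icdm", "lin", "liu"],
    7: ["gray", "gross", "group", "icdl", "lin"],
    8: ["gross", "icdl", "liu"],
    9: ["icdm", "liu"],
}


def type_ahead_match(records, query):
    """Return IDs of records containing every complete keyword and at least
    one word that has the last (partial) keyword as a prefix."""
    tokens = query.lower().split()
    if not tokens:
        return []
    complete, partial = tokens[:-1], tokens[-1]
    hits = []
    for rid in sorted(records):
        words = records[rid]
        has_complete = all(w in words for w in complete)
        has_prefix = any(word.startswith(partial) for word in words)
        if has_complete and has_prefix:
            hits.append(rid)
    return hits


print(type_ahead_match(RECORDS, "icdm gra"))       # [0, 4, 5, 6]
print(type_ahead_match(RECORDS, "icdm graph li"))  # [4, 5]
```

Both results agree with the two examples in the text. Note that this sketch applies the conjunctive reading of the definition; the ranking discussion that follows relaxes it.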
Top-k Answers: We rank each record r in R based on its relevance to the query. Given a positive integer k, our goal is to compute the best k records in R ranked by their relevance to Q. Notice that our problem setting allows an important record to be in the answer, even if not all query keywords appear in the record (the OR semantics). Thus the algorithms in [3] cannot be used directly in our problem.

Ranking: In the literature there are many algorithms for answering top-k queries by accessing lists (e.g., [2, 2]). These algorithms share the same framework proposed by Fagin [6], in which we have lists of records sorted based on various conditions, such as term frequency and inverse document frequency ("tf*idf"). Each record has a score on a list, and we use an aggregation function to combine the scores of the record on different lists to compute its overall relevance to the query. The aggregation function needs to be monotonic, i.e., decreasing the score of a record on a list cannot increase the record's overall score. This approach has the advantage of allowing a general class of ranking functions.

In this paper, we focus on an important class of ranking functions with the following property: the score F(r, Q) of a record r to a query Q is a monotonic combination of scores of the query keywords with respect to the record r. Formally, we compute the score F(r, Q) in two steps. In the first step, for each keyword w, we compute a score of the keyword with respect to the record r, denoted by F(r, w). In the second step, we compute the score F(r, Q) by applying a monotonic function on the F(r, w)'s for all the keywords w. The intuition of this property is that the more relevant an individual query keyword is to a record, the more likely this record is a good answer to this query. For example, we compute the score of a record to the query "icdm graph li" by aggregating the scores of each of the keywords with respect to the record.

Each complete keyword w has a weight associated with a record r, denoted by W(r, w). This weight could depend on the keyword, such as the tf*idf value of the keyword in the record. As a specific case, it can also be independent of the keyword. For instance, if a record is a URL with tokenized keywords, its weight could be a rank score of the corresponding Web page. If a record is an author, we can use the number of publications of the author as a weight of this record.

Figure 2: Type-ahead search for Q = w_1, w_2, ..., w_m.

For the last partial keyword w_m, there could be multiple complete words. We compute the relevance score of w_m in the record r, i.e., F(r, w_m), based on the following property: F(r, w_m) is the maximal value of the W(r, d) weights for all the keywords d with respect to w_m in r, where d is a keyword in record r and has a prefix of w_m. This property states that we only look at the most relevant keyword in a record to the partial keyword when computing the relevance of the keyword to the record. It means that the ranking function is greedy to find the most relevant keyword in the record as an indicator of how important this record is to the partial keyword. As we can see in Section 3, this property allows us to do effective pruning when accessing the multiple lists of a query keyword. The following is an example function:

$$F(r, Q) = \sum_{i=1}^{m} F(r, w_i), \qquad (1)$$

where

$$F(r, w_i) = \begin{cases} W(r, w_i) & \text{if } i < m, \\ \max_{\text{complete word } d \text{ of } w_m} \{W(r, d)\} & \text{if } i = m. \end{cases} \qquad (2)$$

In Figure 1, consider query "icdm graph li" and record r_5. F(r_5, "icdm") = W(r_5, "icdm") = 8 and F(r_5, "graph") = W(r_5, "graph") = 9. The partial keyword "li" has two complete words "lin" and "liu". F(r_5, "li") = max{W(r_5, "lin"), W(r_5, "liu")} = 8. F(r_5, "icdm graph li") = 8 + 9 + 8 = 25.

3. EXACT TYPE-AHEAD SEARCH

In this section, we study efficient list-access methods to support exact type-ahead search, i.e., no mismatches between query keywords and answers.

Indexing: We construct a trie for the keywords in the data D. A trie node has a character label. Each keyword in D corresponds to a unique path from the root to a leaf node (see footnote 3) on the trie.
For simplicity, a trie node is mentioned interchangeably with the keyword corresponding to the path from the root to the node. A leaf node has an inverted list of pairs ⟨rid, weight⟩, where rid is the ID of a record containing the leaf-node string, and weight is the weight of the keyword in the record. Figure 1 shows the index structure in our running example. For instance, for the leaf node of keyword "graph", its inverted list has five elements. The first element ⟨5, 9⟩ indicates that the record r_5 has this keyword, and the weight of this keyword in this record is 9, i.e., W(r_5, "graph") = 9.

(Footnote 3: A common trick to make each leaf node correspond to a complete word and vice versa is to add a special mark to the end of each word. For simplicity we did not use this trick.)

Figure 3: Forward lists.

Searching: We compute the top-k answers to a query Q in two steps. As illustrated in Figure 2, in the first step, for each complete keyword w_i (1 ≤ i ≤ m − 1), we get its inverted list. For the last partial keyword, we locate the trie node of w_m and retrieve the inverted lists of the trie node's leaf descendants. For example, in Figure 1, consider a query "icdm li". The partial keyword "li" has two leaf-node keywords: "lin" and "liu". In the second step, we access the inverted lists to compute the k best answers. Many algorithms have been proposed for answering top-k queries by accessing sorted lists [2, 6]. When adopting these algorithms to solve our problem, we need to efficiently support two basic types of access used in these algorithms: random access and sorted access on the lists.

3.1 Efficient Random Access

To support random access, we construct a forward index in which each record has a forward list of IDs of its keywords.
We assume each keyword has a unique ID with respect to its leaf node on the trie, and the IDs of the keywords follow their alphabetical order. Figure 3 shows the forward lists. The element ⟨1, 9⟩ on the forward list of record r_5 shows that this record has a keyword with ID 1 and weight 9, which is keyword "graph" as shown on the trie. Given a record and a complete keyword, we can get the corresponding weight by doing a binary search on the forward list. For example, to get the weight of keyword "icdm" with ID 6 in r_5, we can do a binary search on r_5's forward list and get the corresponding weight 8. For the partial keyword, as it has multiple complete words, we need to first locate its trie node and then enumerate its leaf descendants to get the corresponding weights. This method could be expensive if the trie node has many leaf descendants.

To improve the performance, we can use an alternative method. For each trie node n, we can maintain a keyword range [l_n, u_n], where l_n and u_n are the minimal and maximal keyword IDs of its leaf nodes, respectively [3]. An interesting observation is that a complete word with n as a prefix must have an ID in this keyword range, and each complete word in the data set with an ID in this range must have a prefix of n. In Figure 3, the keyword range of node "g" is [1, 4], since 1 is the smallest ID of its leaf nodes and 4 is the largest one.

Based on this observation, this method verifies whether record r contains a keyword with a prefix of w_m as follows. We first locate the trie node w_m and then check if there is a keyword ID on the forward list of r in the keyword range [l_{w_m}, u_{w_m}]. Since we can keep the forward list of r sorted, this checking can be done efficiently. For instance, consider query "graph icdm l". For the first element on the inverted list of "graph", ⟨5, 9⟩, we can check whether r_5 contains the other two keywords as follows. For the complete keyword "icdm" with ID 6, we do a binary search on r_5's forward list and get weight 8. For the partial keyword "l" with the keyword range [7, 9], using a binary search on r_5's forward list (⟨1, 9⟩; ⟨2, 8⟩; ⟨3, 4⟩; ⟨6, 8⟩; ⟨7, 3⟩; ⟨8, 8⟩), we find keyword IDs 7 and 8 in this range. Thus we know that the record indeed contains keywords with prefix "l", and compute the corresponding score F(r_5, "l") = max{F(r_5, "lin"), F(r_5, "liu"), F(r_5, "lui")} = 8. Thus F(r_5, "graph icdm l") = 25.

3.2 Efficient Sorted Access

To support sorted access, we can keep the elements on the inverted lists sorted based on their weights in descending order. Thus, for a complete keyword, we can get an ordered list. The partial keyword w_m has multiple leaf descendants and corresponding inverted lists. We use U(w_m) to denote the union of those inverted lists, called the union list of w_m. We need to support sorted access on U(w_m) to retrieve the next most relevant record ID for w_m. Fully computing U(w_m) using the keyword lists could be expensive in terms of time and space. In this section, we propose two techniques to support sorted access efficiently.

Heap-Based Method

We can support sorted access on U(w_m) by building a max heap on the inverted lists of its leaf nodes. In particular, we maintain a cursor on each inverted list. The max heap initially consists of the record IDs pointed to by the cursors, sorted on the weights of the keywords in these records. Notice that each inverted list is already sorted based on the weights of its keyword in the records. To retrieve the next best record, we pop the top element from the heap, increment the cursor of the list of the popped element by 1, and push the new element of this list to the heap. After popping all elements from the heap, we get a sorted list for the partial keyword. For example, consider the partial keyword "l". It has three complete keywords "lin", "liu", and "lui". We can compute its union list as shown in Figure 4. Note that since our method does not need to compute the entire list of U(w_m), U(w_m) is a virtual sorted list of partial keyword w_m.

Figure 4: A heap-based method to compute the virtual sorted list of partial keyword "l". The virtual sorted list is ⟨3, 9⟩, ⟨5, 8⟩, ⟨7, 8⟩, ⟨4, 7⟩, ⟨1, 6⟩, ⟨6, 5⟩, ⟨9, 4⟩, ⟨2, 3⟩, ⟨8, 1⟩, merged from the inverted lists of "lin" (⟨3, 9⟩, ⟨7, 8⟩, ⟨6, 4⟩, ⟨5, 3⟩, ⟨4, 2⟩), "liu" (⟨5, 8⟩, ⟨4, 7⟩, ⟨6, 5⟩, ⟨9, 4⟩, ⟨2, 3⟩, ⟨8, 1⟩), and "lui" (⟨1, 6⟩, ⟨3, 4⟩).

On top of the inverted lists of complete keywords and the max heap of the partial keyword, we can adopt an existing top-k algorithm to find the k best records. As an example, suppose we want to compute the top-1 best answer for query "graph icdm l" using sorted access only. We get the first elements of "graph" and "icdm", ⟨5, 9⟩ and ⟨4, 9⟩, pop the top element of the max heap in Figure 4, ⟨3, 9⟩, and compute an upper bound on the overall score of an answer, i.e., 9 + 9 + 9 = 27. We get the next elements of "graph" and "icdm", ⟨4, 7⟩ and ⟨5, 8⟩. We increment the cursor of the list that produced the top element, push its next element into the heap, and retrieve the next top element: ⟨5, 8⟩. Based on the accessed elements, we have 1) the score of record r_5 is 9 + 8 + 8 = 25; 2) the maximal score of record r_3 is 7 + 8 + 9 = 24, and that of r_4 is 7 + 9 + 8 = 24, while those of other records are at most 7 + 8 + 8 = 23. Thus, record r_5 is the best answer.

Figure 5: Benefits of materializing the union list U(v) for node v with respect to partial keyword w_m. Legend: M(v): materialized descendants of v; T(v): subtrie of v; N(v): other leaf nodes (of v) without materialized ancestors.

List Materialization

We can further improve the performance of sorted access for the partial keyword w_m by precomputing and storing the unions of some of the inverted lists on the trie. Let v be a trie node, and U(v) be the union of the inverted lists of v's leaf nodes, sorted by their record weights. If a record appears more than once on these lists, we choose its maximal weight as its weight on list U(v). For example, U("li") = {⟨3, 9⟩, ⟨5, 8⟩, ⟨7, 8⟩, ⟨4, 7⟩, ⟨6, 5⟩, ⟨9, 4⟩, ⟨2, 3⟩, ⟨8, 1⟩}.
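The heap-based merge and the materialized union list described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the inverted lists are the ones reconstructed from the running example (Figure 4), the function names are hypothetical, duplicate records are skipped so that each record appears once with its maximal weight, and ties on weight are broken by record ID.

```python
import heapq

# Inverted lists of the leaf nodes under "l", sorted by descending weight.
# Each entry is (record_id, weight); values follow the running example.
INVERTED = {
    "lin": [(3, 9), (7, 8), (6, 4), (5, 3), (4, 2)],
    "liu": [(5, 8), (4, 7), (6, 5), (9, 4), (2, 3), (8, 1)],
    "lui": [(1, 6), (3, 4)],
}


def virtual_sorted_list(lists):
    """Lazily yield (record_id, weight) in descending weight order,
    keeping one cursor per inverted list and a heap over the cursors."""
    heap = []
    for idx, lst in enumerate(lists):
        if lst:
            rid, weight = lst[0]
            # heapq is a min-heap, so weights are negated.
            heapq.heappush(heap, (-weight, rid, idx, 0))
    seen = set()
    while heap:
        neg_weight, rid, idx, pos = heapq.heappop(heap)
        # Advance the cursor of the list that produced the popped element.
        if pos + 1 < len(lists[idx]):
            next_rid, next_weight = lists[idx][pos + 1]
            heapq.heappush(heap, (-next_weight, next_rid, idx, pos + 1))
        if rid not in seen:  # first occurrence carries the maximal weight
            seen.add(rid)
            yield rid, -neg_weight


def materialize_union(lists):
    """Eagerly compute a union list U(v), keeping each record's maximal weight."""
    best = {}
    for lst in lists:
        for rid, weight in lst:
            best[rid] = max(weight, best.get(rid, 0))
    return sorted(best.items(), key=lambda entry: (-entry[1], entry[0]))


virtual_l = list(virtual_sorted_list([INVERTED["lin"], INVERTED["liu"], INVERTED["lui"]]))
union_li = materialize_union([INVERTED["lin"], INVERTED["liu"]])
print(virtual_l)
print(union_li)
```

Because virtual_sorted_list is a generator, a top-k algorithm can stop after a few sorted accesses without touching the rest of the lists, whereas materialize_union trades storage for a heap with fewer entries at query time.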
When using a max heap to retrieve records sorted by their scores for the partial keyword, this materialized list could help us build a max heap with fewer lists and reduce the cost of push/pop operations on the heap. Therefore, this method allows us to utilize additional memory space to answer top-k queries more efficiently.

For instance, consider the index in Figure 1 and a query "icdm g". For the partial keyword "g", we access its data keywords "graph", "gray", "gross", and "group", and build a max heap on their inverted lists based on record scores with respect to this query keyword. If we materialize the union lists of "gra" and "gro", we can use their materialized lists, saving the time to traverse the four leaf nodes and some push/pop operations on the heap.

We next give a detailed cost-based analysis to quantify the benefit of materializing a node on the performance of operations on the max heap of w_m, for exact type-ahead search. Let B be a budget of storage space we are given to materialize union lists. Given a trie node v, let U(v) be the union of the inverted lists of the leaf nodes in the subtrie of v. Our goal is to select trie nodes whose union lists we materialize so as to maximize query performance. The following are naive algorithms for choosing trie nodes:

Random: We randomly select trie nodes.
TopDown: We select nodes top down from the trie root.
BottomUp: We select nodes bottom up from leaf nodes.

Each naive approach keeps choosing trie nodes to materialize their union lists until the sum of their list sizes reaches the space limit B. One main limitation of these approaches is that they do not quantitatively consider the benefits of materializing a union list. To overcome this limitation, we propose a cost-based method called CostBased to do list materialization. Its main idea is the following. For simplicity we say a node has been materialized if its union list has been materialized. For a query Q with a prefix keyword w_m, suppose some of the trie nodes have their union lists materialized. Let v be such a materialized node. If we can use U(v) to construct the heap of w_m, we need not visit v's descendants and access the inverted lists of v's leaf descendants, and thus achieve the benefit of reducing the time of traversing the subtrie rooted at v and of push/pop operations on the max heap of w_m. We say the materialized node v is usable for partial keyword w_m.

Next we discuss how to check whether a node v is usable for partial keyword w_m. If v is not a descendant of w_m, materializing v is unusable to w_m; otherwise, if no node on the path from v to w_m (including w_m) has been materialized, materializing v is usable to w_m. Notice that if v has a materialized ancestor v' on the path from v to w_m, then we can use the materialized list U(v') instead of U(v), and the list U(v) will no longer be usable to w_m. To summarize, a materialized node v is usable for partial keyword w_m if
1. v is a descendant of w_m; and
2. v has no materialized ancestor between v and w_m.

For example, consider a query "icdm g". Materializing node "l" is unusable for partial keyword "g" as "l" is not a descendant of "g". Materializing "gr" is usable for "g" if "g" is not materialized. If "g" is materialized, then materializing "gra" is unusable for "g" as we will use the materialized list of "g" to build the max heap of "g", instead of using "gra".

If v is usable for w_m, materializing U(v) has the following benefits for the heap of w_m. (1) We do not need to traverse the trie to access these leaf nodes and use them to construct the max heap; (2) Each push/pop operation on the heap is more efficient since it has fewer lists. Here we present an analysis of the benefits of materializing the usable node v.
In general, for a trie node v, let T(v) denote its subtrie and |T(v)| denote the number of nodes in T(v). The total time of traversing this subtrie is O(|T(v)|). Now we analyze the benefit of materializing node v. As illustrated in Figure 5, suppose v has materialized descendants. Let M(v) be the set of highest materialized descendants of v. These materialized nodes can help reduce the time of accessing the inverted lists of v's leaf nodes in two ways. First, we do not need to traverse the descendants of a materialized node d ∈ M(v). We can just traverse |T(v)| − Σ_{d ∈ M(v)} |T(d)| trie nodes. Second, when inserting lists to the max heap of w_m, we insert the union list of v into the heap and need not insert the union list of each d ∈ M(v) and the inverted lists of d ∈ N(v) into the heap, where N(v) denotes the set of v's leaf descendants having no ancestors in M(v). Let S(v) = M(v) ∪ N(v). We quantify the benefits of materializing node v:

1. Reducing traversal time: Since we do not traverse v's descendants, the time reduction is

$$B_1 = O\Big(|T(v)| - \sum_{d \in M(v)} |T(d)|\Big).$$

2. Reducing heap-construction time: When constructing the max heap for keyword w_m, we insert the union list U(v) into the heap, instead of the inverted lists of those nodes in S(v). The time reduction is

$$B_2 = |S(v)|.$$

3. Reducing sorted-access time: If we insert the union list U(v) into the max heap of w_m, the number of leaf nodes in the heap is |S(w_m)|. Otherwise, it is |S(w_m)| + |S(v)|. The time reduction of a sorted access is

$$B_3 = O\big(\log(|S(w_m)| + |S(v)|)\big) - O\big(\log(|S(w_m)|)\big).$$

The following is the overall benefit of materializing v for the partial keyword w_m:

$$B_v = B_1 + B_2 + A_v \cdot B_3, \qquad (3)$$

where A_v is the number of sorted accesses on U(v). A_v can be computed using the number of records in the union list U(v), and the number of keywords in the query. The analysis above is on a query workload. If there is no query workload, we can use the trie structure to count the probability of each node to be queried and use such information to compute the benefit of materializing a node.
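As a rough illustration of the usability rule and of Equation 3, the following Python sketch names trie nodes by their prefix strings and treats the O(·) terms as exact counts with base-2 logarithms. Both choices are assumptions made only for this example, as are the function names and the decision to treat v = w_m as usable; the paper does not prescribe constants or an implementation.

```python
import math


def is_usable(v, w_m, materialized):
    """A materialized node v is usable for partial keyword w_m if v lies in
    the subtrie of w_m and no node on the path from w_m (inclusive) down to
    v (exclusive) is materialized. Nodes are named by their prefix strings."""
    if not v.startswith(w_m):
        return False
    ancestors = (v[:i] for i in range(len(w_m), len(v)))
    return not any(a in materialized for a in ancestors)


def materialization_benefit(subtrie_size, materialized_subtrie_sizes,
                            s_v, s_wm, sorted_accesses):
    """Evaluate B_v = B_1 + B_2 + A_v * B_3 with all hidden constants set to 1.

    subtrie_size                |T(v)|
    materialized_subtrie_sizes  |T(d)| for each d in M(v)
    s_v                         |S(v)|
    s_wm                        |S(w_m)|, must be positive
    sorted_accesses             A_v
    """
    b1 = subtrie_size - sum(materialized_subtrie_sizes)
    b2 = s_v
    b3 = math.log2(s_wm + s_v) - math.log2(s_wm)
    return b1 + b2 + sorted_accesses * b3


print(is_usable("l", "g", set()))      # False: "l" is not below "g"
print(is_usable("gr", "g", set()))     # True
print(is_usable("gra", "g", {"g"}))    # False: "g" itself is materialized
print(materialization_benefit(7, [3], 3, 1, 10))
```

A cost-based selection could evaluate such a benefit for each candidate node and prefer nodes with a high benefit per unit of storage until the budget B is exhausted; that selection loop is not shown here.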
In this paper, we assume the setting with no query workload.

4. FUZZY TYPE-AHEAD SEARCH

In this section, we first define the problem of top-k queries in fuzzy type-ahead search [3]. We then develop new techniques to support efficient list access to answer such queries by extending techniques developed in exact search.

4.1 Ranking

As a user types in a query letter by letter, fuzzy type-ahead search finds on the fly records with words similar to the query keywords. For example, consider the data in Table 1. Suppose a user types in a query "graph grose". We return r_5 as a relevant answer since it has a keyword "gross" similar to query keyword "grose".

We use edit distance to measure the similarity between strings. Formally, the edit distance between two strings s_1 and s_2, denoted by ed(s_1, s_2), is the minimum number of single-character edit operations (i.e., insertion, deletion, and substitution) needed to transform s_1 to s_2. For example, ed(gross, grose) = 1.

Similarity Function: Let π be a function that computes the similarity between a data string s and a query keyword w in Q = w_1, w_2, ..., w_m. An example is:

$$\pi(s, w) = 1 - \frac{ed(s, w)}{|w|},$$

where |w| is the length of the query keyword w. We normalize the edit distance based on the query-keyword length in order to allow more errors for longer query keywords. Our results in the paper focus on this function, and they can be generalized to other functions using edit distance.

Let d be a keyword in the data set D. For each complete keyword w_i (i = 1, 2, ..., m − 1) in the query, we define the similarity of d to w_i as:

$$Sim(d, w_i) = \pi(d, w_i).$$

Since the last keyword w_m is treated as a prefix condition, we define the similarity of d to w_m as the maximal similarity of d's prefixes using function π, i.e.:

$$Sim(d, w_m) = \max_{\text{prefix } p \text{ of } d} \{\pi(p, w_m)\}.$$

Let τ be a similarity threshold. We say a keyword d in D is similar to a query keyword w if Sim(d, w) ≥ τ. We say a prefix p of a keyword in D is similar to the query keyword w_m if π(p, w_m) ≥ τ.
We want to find the keywords in the data set that are similar to query keywords, since records with such a keyword could be of interest to the user.
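The similarity definitions above can be checked with a short Python sketch. It uses a textbook dynamic-programming edit distance and enumerates prefixes by brute force, which is only meant to make the definitions concrete; it is not the incremental trie-based computation the paper relies on. The function names are illustrative, and restricting Sim(d, w_m) to non-empty prefixes is an assumption of this sketch.

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to transform s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def pi(s, w):
    """pi(s, w) = 1 - ed(s, w) / |w|, normalized by the query-keyword length."""
    return 1 - edit_distance(s, w) / len(w)


def sim_partial(d, w_m):
    """Similarity of data keyword d to the partial keyword w_m:
    the best pi over the non-empty prefixes of d."""
    return max(pi(d[:i], w_m) for i in range(1, len(d) + 1))


def similar_prefixes(keywords, w_m, tau):
    """All prefixes of data keywords whose similarity to w_m is at least tau."""
    prefixes = {d[:i] for d in keywords for i in range(1, len(d) + 1)}
    return {p for p in prefixes if pi(p, w_m) >= tau}


D = ["graph", "gray", "gross", "group", "icdl", "icdm", "lin", "liu", "lui"]
TAU = 0.45

print(edit_distance("gross", "grose"))           # 1
print(pi("lui", "li"))                           # 0.5
print(sorted(similar_prefixes(D, "li", TAU)))
print(sorted(d for d in D if sim_partial(d, "li") >= TAU))
```

With the threshold 0.45 and the query keyword "li", a prefix qualifies exactly when its edit distance to "li" is at most 1, since a distance of 2 already gives a similarity of 0.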

Figure 6: Keywords similar to those in query Q = w_1, w_2, ..., w_m. Each query keyword w_i has similar keywords on leaf nodes. The last prefix keyword w_m has similar prefixes.

Let Φ(w_i) (i = 1, ..., m) denote the set of keywords in D similar to w_i, and P(w_m) denote the set of prefixes (of keywords in D) similar to w_m. We compute the top-k answers to the query Q in two steps. In the first step, for each keyword w_i in the query, we first compute an edit-distance upper bound based on the similarity function, i.e., (1 − τ) · |w_i|, and then compute the similar keywords Φ(w_i) and similar prefixes P(w_m) on the trie (shown in Figure 6). Ji et al. [3] developed an efficient algorithm for incrementally computing these similar strings as the user modifies the current query. A similar algorithm is developed in [5]. In the second step, we access the inverted lists of these similar data keywords to compute the k best answers.

For example, assume a user types in a query "grose li" letter by letter on the data shown in Table 1. Suppose the similarity threshold τ is 0.45. The set of prefixes similar to the partial keyword "li" is P("li") = {l, li, lin, liu, lu, lui, i}, and the set of data keywords similar to the partial keyword "li" is Φ("li") = {lin, liu, lui, icdl, icdm}. In particular, "lui" is similar to "li" since Sim(lui, li) = 1 − ed(lui, li)/|li| = 0.5 ≥ τ. The set of similar words for the complete keyword "grose" is Φ("grose") = {gross}. Then we compute top-k answers using the inverted lists of those words in Φ("grose") and Φ("li").

Ranking: We still assume the ranking function has the first property described in Section 2, which computes the score F(r, Q) by applying a monotonic function on the F(r, w_i)'s for all the keywords w_i in the query. Given a complete keyword w_i and a record r, for exact search, we can use the weight of w_i in r, i.e., W(r, w_i), to denote their relevancy F(r, w_i).
But for fuzzy search, the keyword w_i can be similar to multiple keywords in the record r, and different similar words have different similarities to w_i and different weights in r. A question is how to compute the relevance value of keyword w_i in record r, F(r, w_i). Let d be a keyword in record r such that d is similar to the query keyword w_i, i.e., d ∈ Φ(w_i). We use F(r, w_i, d) to denote the relevance of this query keyword w_i in the record with respect to keyword d. The value should depend on both the weight of d in r, i.e., W(r, d), as well as the similarity between w_i and d, i.e., Sim(d, w_i). Intuitively, the more similar they are, the more relevant w_i is to r in terms of d. For instance, F(r, w_i, d) = Sim(d, w_i) * W(r, d) is an example ranking function to evaluate the relevancy of w_i in the record r with respect to keyword d. We use the following function with the second property in Section 2 to compute F(r, w_i):

$$F(r, w_i) = \max_{\text{keyword } d \text{ (in } r\text{) similar to } w_i} \{F(r, w_i, d)\}. \qquad (4)$$

4.2 Efficient Random Access

We first study how to support efficient random access for fuzzy type-ahead search. For simplicity, in the discussion we focus on how to verify whether the record has a keyword with a prefix similar to the partial keyword w_m. With minor modifications the discussion extends to the case where we want to verify whether r has a keyword similar to a complete keyword w_i (1 ≤ i ≤ m − 1).

In each random access, given an ID of a record r, we want to retrieve information related to a query keyword w_i, which allows us to retrieve W(r, d) for each of w_i's similar words d so as to compute the score F(r, w_i). In particular, for a keyword w_i in the query, does the record have a keyword similar to w_i? One naive way to get the information is to retrieve the original record and go through its keywords. This approach has two limitations. First, if the data is too large to fit into memory and has to reside on hard disks, accessing the original data from the disks may slow down the process significantly. This costly operation will prevent us from achieving an interactive-search speed.
The second limitation is that it may require a lot of computation of string similarities based on edit distance, which could be time consuming. In this section, we present two efficient approaches for solving this problem.

Method 1: Probing on Forward Lists: This method verifies whether record $r$ contains a keyword with a prefix similar to $w_m$ as follows. For each prefix $p$ on the trie similar to $w_m$ (computed in the first step of the algorithm as discussed above), we check if there is a keyword ID on the forward list of $r$ in the keyword range $[l_p, u_p]$ of the trie node of $p$, as discussed in Section 3.

Method 2: Probing on Trie Leaf Nodes: Using this method, for each prefix $p$ similar to $w_m$, we traverse the subtrie of $p$ and identify its leaf nodes. For each leaf node $d$, we store the fact that for the query $Q$, this keyword $d$ has a prefix similar to $w_m$ in the query. Specifically, we store the triple (query ID, partial keyword $w_m$, $Sim(p, w_m)$). We store the query ID in order to differentiate it from other queries in case multiple queries are answered concurrently. We store the similarity between $w_m$ and $p$ to compute the score of this keyword in a candidate record. In case the leaf node has several prefixes similar to $w_m$, we only keep their maximal similarity to $w_m$. For each complete keyword $w_i$, we also store the same information for those trie nodes similar to $w_i$. Therefore, a leaf node might have multiple entries corresponding to different keywords in the same query. We call these entries for the leaf node its collection of relevant query keywords. Notice that this structure needs very little storage space, since the entries of old queries can be quickly reused by new queries, and the number of keywords in a query tends to be small.

We use this additional information to efficiently check if a record $r$ contains a complete word with a prefix similar to the partial keyword $w_m$. We scan the forward list of $r$. For each of its keyword IDs, we locate the corresponding leaf node, and test whether its collection of relevant query keywords includes this query and the keyword $w_m$. If so, we use the stored string similarity to compute the score of this keyword in the query.

Figure 7: Probing on trie leaf nodes.

Figure 7 shows how we use this method in our running example, where the user types in a keyword query $q_1$ = "lin, gose". When computing the similar words of "gose", e.g., "goss", we insert the query ID (shown as $q_1$), the partial keyword "gose", and the corresponding prefix similarity into the leaf node's collection of relevant query keywords. To verify whether record $r_5$ has a word with a prefix similar to "gose", we scan its forward list. Its third keyword is "goss". We access its corresponding leaf node, and see that the node's collection of relevant query keywords includes "gose". Thus we know that $r_5$ indeed contains a keyword similar to "gose", and can retrieve the corresponding prefix similarity.

Comparison: The time complexity of the forward-list-based method (Method 1) is $O(G \log(|r|))$, where $G$ is the total number of similar prefixes of $w_m$ and similar complete words of the $w_i$'s for $1 \le i \le m-1$, and $|r|$ is the number of distinct keywords in record $r$. Since the similar prefixes of $w_m$ could have ancestor-descendant relationships, we can optimize the step of accessing them by considering only the highest ones. The time complexity of the second method is

$$O\Big(\sum_{\text{similar prefix } p \text{ of } w_m} |T(p)| + |Q|\Big).$$

The first term corresponds to the time of traversing the subtries of similar prefixes, where $T(p)$ is the subtrie rooted at a similar prefix $p$. The second term corresponds to the time of probing the leaf nodes, where $|Q|$ is the number of query keywords. Notice that to identify the answers, we need to access the inverted lists of complete words anyway, thus the first term can be removed from the complexity.
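The two probing methods can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names, the `weights` dictionary standing in for $W(r, d)$, and the `leaf_entries` dictionary standing in for each leaf node's collection of relevant query keywords are all hypothetical. Unlike a pure existence check, the first function also walks the matching ID range so that it can return the maximal $Sim \cdot W$ score in the spirit of Equation 4.

```python
from bisect import bisect_left


def probe_forward_list(forward_ids, weights, similar_prefixes):
    """Method 1: binary-search the record's sorted forward list.

    forward_ids      -- sorted keyword IDs of one record r
    weights          -- dict: keyword ID d -> W(r, d)   (assumed layout)
    similar_prefixes -- (l_p, u_p, Sim(p, w_m)) for each similar prefix p
    Returns the best Sim * W over matching keywords, or None if none match.
    """
    best = None
    for lo, hi, sim in similar_prefixes:
        pos = bisect_left(forward_ids, lo)  # first keyword ID >= l_p
        while pos < len(forward_ids) and forward_ids[pos] <= hi:
            score = sim * weights[forward_ids[pos]]
            best = score if best is None else max(best, score)
            pos += 1
    return best


def probe_leaf_entries(forward_ids, weights, leaf_entries, query_id, keyword):
    """Method 2: look up entries stored on leaf nodes for the current query.

    leaf_entries -- dict: keyword ID -> {(query ID, query keyword): similarity}
    """
    best = None
    for kid in forward_ids:  # scan the forward list of r
        sim = leaf_entries.get(kid, {}).get((query_id, keyword))
        if sim is not None:
            score = sim * weights[kid]
            best = score if best is None else max(best, score)
    return best


# Hypothetical record with three keywords and two similar prefixes.
forward_ids = [2, 5, 9]
weights = {2: 0.5, 5: 0.8, 9: 0.4}
prefixes = [(4, 6, 0.75), (9, 9, 1.0)]
entries = {5: {("q1", "gose"): 0.75}, 9: {("q1", "gose"): 1.0}}
by_forward = probe_forward_list(forward_ids, weights, prefixes)
by_leaf = probe_leaf_entries(forward_ids, weights, entries, "q1", "gose")
```

On this toy input both functions select keyword 5, since its weighted similarity exceeds that of keyword 9. The first function pays one binary search per similar prefix, while the second pays one dictionary lookup per keyword of the record, which mirrors the trade-off discussed in the comparison.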
Method 1 is preferred for data sets where records have a lot of keywords, such as long documents, while Method 2 is preferred for data sets where records have a small number of keywords, such as relational tables with relatively short attribute values.

4.3 Efficient Sorted Access

Heap-Based Method: For a query keyword $w$, we want to support sorted access that can access record IDs based on the relevance of $w$ to these records. As $w$ has multiple similar words, we can support sorted access efficiently by building a max heap on the inverted lists of such similar words, as described in Section 3. Notice that, in exact search, each leaf node has the same similarity to $w$; but for fuzzy search, different leaf nodes could have different similarities. Thus, when pushing a record $r$ from an inverted list of a similar word $d$ to the heap, we maintain the pair $\langle r, F(r, d) \rangle$ in the heap. We push/pop the record on the heap with the maximal $F(r, d)$.

Consider the query "icdm li". Figure 8 shows the two heaps for the two keywords. For illustration purposes, for each keyword we also show the virtual merged list of records with their scores, and this list is only partially computed during the traversal of the underlying lists. Each record on a heap has an associated score of this keyword with respect to the query keyword, computed using Equation 4.

Figure 8: Max heaps for the query keywords "icdm" and "li". Each shaded list is merged from the underlying lists. It is virtual since we do not need to compute the entire list.

List Pruning: As there may be a large number of similar words for a query keyword, especially for the partial keyword, it could be expensive to construct a heap on the fly.
We further improve the performance of sorted access on the virtual sorted list $U(w)$ by using the idea of on-demand heap construction, i.e., we want to avoid constructing a heap for all the inverted lists of keywords similar to a query keyword. Suppose $w$ has $t$ similar words. Each push/pop operation on the heap of these lists takes $O(\log(t))$ time. If we can reduce the number of lists on the heap, we can reduce the cost of its push/pop operations. We have two observations about this pruning method. (1) As a special case, if those keywords matching query keywords exactly have the highest relevance scores, this method allows us to consider these records prior to considering other records with mismatching keywords. (2) The pruning can be more powerful if $w$ is the last partial keyword $w_m$, since many of its similar keywords share the same prefix $p$ on the trie.

Consider the query "icdm li". Figure 8 illustrates how we can prune low-score lists and do on-demand heap constructions. The prefix "li" has several similar keywords. Among them, the two words "lin" and "liu" have the highest similarity value to the query keyword, mainly because they have a prefix matching the keyword exactly. We build a heap using these two lists. To compute the top-1 best answer, the lists of "lui", "icdm", and "icdl" are never included in the heap since their upper bounds are always smaller than the scores of popped records before the traversal terminates.

We next introduce how to do list pruning for the max-heap-based methods in fuzzy type-ahead search. Given a keyword $w$, let $d_1, \ldots, d_t$ be its similar words and $L_1, \ldots, L_t$ be the corresponding inverted lists, respectively. We need not use all the inverted lists to build the max heap of $w$. Instead, we build the max heap on demand, starting from the lists with higher similarities to $w$. We first sort these inverted lists based on the similarities of their keywords to $w$; without loss of generality, suppose $Sim(d_1, w) > \cdots > Sim(d_t, w)$. We first construct the max heap using the lists with the highest similarity values and then include other lists on demand.
Suppose $L_i$ is a list not included in the heap so far. We can derive an upper bound $u_i$ on the score of a record from $L_i$ (with respect to the query keyword $w$) using the largest weight on the list and the string similarity $Sim(d_i, w)$. Let $r$ be the top record on the heap, with a score $F(r, w)$. If $F(r, w) \ge u_i$, then this list does not need to be included in the heap, since it cannot have a record with a higher score. Otherwise, this list needs to be included in the heap. Based on this analysis, each time we pop a record from the heap and push a new record, we compare the score of the new record with the upper bounds of those lists not included in the heap so far. Those lists with an upper bound greater than this score need to be included in the heap from now on. Notice that this checking can be done very efficiently by storing the maximal value of these upper bounds, and ordering these lists based on their upper bounds.

The pruning power can be even more significant if the keyword $w$ is the partial keyword $w_m$, since many of its similar keywords share the same prefix $p$ on the trie similar to $w_m$. We can compute an upper bound of the record scores from these lists and store the bound on the trie node $p$. In this way, we can prune the lists more effectively by comparing the value $F(r, w)$ with this upper bound stored on the trie, without needing to compute the bound on the fly.

List Materialization: For fuzzy search, the partial keyword $w_m$ has multiple similar prefixes and each similar prefix has multiple similar words. The max heap of $w_m$ is built on top of the inverted lists of such similar words. Let $d$ be such a similar word. Recall that the value $F(r, w_m, d)$ of a record $r$ on the list of a similar word $d$ with respect to $w_m$ is based on both $W(r, d)$ and $Sim(d, w_m)$. Let $v$ be a materialized node. To use $U(v)$ to replace the lists of $v$'s leaf nodes in the max heap, the following two conditions need to be satisfied: (1) All the leaf nodes of $v$ have the same similarity to $w_m$. (2) All the leaf nodes of $v$ are similar to $w_m$, i.e., their similarity to $w_m$ is no less than the threshold $\tau$. When the conditions are satisfied, the sorting order of the union list $U(v)$ is also the order of the scores of the records on the leaf-node lists with respect to $w_m$.
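For the list-pruning step described above, the on-demand heap construction can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's code: the function name `top_k_sorted_access`, the layout of `similar_lists`, and the example numbers are assumptions, and the score of a posting is taken to be $Sim(d, w) \cdot W(r, d)$ as in the example ranking function. Each list is admitted to the heap only once its upper bound $u_i$ exceeds the score currently at the top of the heap.

```python
import heapq


def top_k_sorted_access(similar_lists, k):
    """Return (top-k (record, score) pairs, number of lists put on the heap).

    similar_lists -- list of (Sim(d, w), postings); postings is a list of
                     (record ID, W(r, d)) sorted by weight, largest first.
    """
    # Upper bound u_i = Sim(d_i, w) * largest weight on L_i, largest first.
    pending = sorted(
        ((sim * postings[0][1], sim, postings)
         for sim, postings in similar_lists if postings),
        key=lambda entry: entry[0], reverse=True)
    heap = []      # (-score, list number, position); heapq is a min-heap
    active = []    # lists already admitted to the heap
    admitted = 0   # how many pending lists have been admitted so far
    seen, results = set(), []
    while len(results) < k and (heap or admitted < len(pending)):
        # Admit a list only when its bound exceeds the score at the heap top.
        while admitted < len(pending) and (
                not heap or pending[admitted][0] > -heap[0][0]):
            bound, sim, postings = pending[admitted]
            active.append((sim, postings))
            heapq.heappush(heap, (-bound, len(active) - 1, 0))
            admitted += 1
        neg_score, list_no, pos = heapq.heappop(heap)
        sim, postings = active[list_no]
        record = postings[pos][0]
        if record not in seen:  # first occurrence carries the maximal score
            seen.add(record)
            results.append((record, -neg_score))
        if pos + 1 < len(postings):
            nxt = postings[pos + 1]
            heapq.heappush(heap, (-(sim * nxt[1]), list_no, pos + 1))
    return results, admitted


# Hypothetical lists for three words similar to a partial keyword.
example_lists = [
    (1.0, [(4, 9), (3, 8)]),   # exact-prefix match, e.g. "lin"
    (1.0, [(5, 8)]),           # exact-prefix match, e.g. "liu"
    (0.5, [(7, 6), (4, 5)]),   # weaker match, e.g. "lui"
]
top1, lists_used = top_k_sorted_access(example_lists, 1)
```

With these made-up weights, asking for a single answer touches only the first list, and the low-similarity list is opened only when more answers are requested than the first two lists can supply. Keeping the pending lists sorted by upper bound means the admission test inspects just one candidate at a time.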
A materialized node $v$ that satisfies the two conditions must be a descendant of a similar prefix of the partial keyword $w_m$. We can prove this by contradiction. Suppose node $v$ is not a descendant of any similar prefix of the partial keyword $w_m$. Then node $v$ and its ancestors are not similar prefixes of $w_m$, that is, the leaf nodes of $v$ are not similar keywords of $w_m$. This contradicts the second condition. Thus a materialized node $v$ that satisfies the two conditions must be a descendant of a similar prefix of the partial keyword $w_m$.

Suppose $p_1, p_2, \ldots, p_n$ are similar prefixes of $w_m$. We check whether their materialized descendants satisfy the two conditions as follows. Consider a materialized node $v$ which has ancestors among $p_1, p_2, \ldots, p_n$. If node $v$ has no descendants that are similar prefixes of $w_m$, $v$ must satisfy the two conditions; otherwise suppose $p_j$ is a descendant of $v$ that is a similar prefix of $w_m$ and has the largest similarity to $v$ among all such descendants. Without loss of generality, let $p_i$ be an ancestor of $v$ that has the largest similarity with $v$ among all similar prefixes. If $Sim(v, p_j) \le Sim(v, p_i)$, $v$ satisfies the two conditions; otherwise $v$ does not. Thus we can find usable materialized nodes to construct the max heap of $w_m$, and use the techniques proposed earlier to do a cost-based analysis to select high-quality nodes for materialization.

5. EXPERIMENTS

We implemented our proposed techniques and compared them with existing methods on three real data sets. (1) DBLP: It included computer science publication records. (2) URL: It included million URLs. (3) Enron: It was an email collection. Table 2 shows details of the data.

Table 2: Data sets and index costs.
Data Set                             URL       DBLP      Enron
# of records (millions)                                  .5
Data size                            . GB      5 MB      .4 GB
Avg. # of words/record
# of distinct keywords (millions)
Trie size                            42 MB     3 MB      28 MB
Size of inverted lists               379 MB    83 MB     342 MB

For the DBLP data set, we selected real queries from the logs of our deployed systems, and each query contained 1-6 keywords⁷.
For the other two data sets, we generated queries with keywords randomly selected from the set of words used in the collection. We assumed the letters of a query were typed in one by one. For each keystroke, we measured the time of computing the top-k answers to this query. For exact search, we measured the total running time. For fuzzy search, we measured the time in two steps: in step 1 we computed keywords on the trie similar to the query keywords (using the algorithm described in [13]); in step 2 we found the top-k answers using the inverted lists of these similar keywords. Unless otherwise specified, k =. We compared our method with the state-of-the-art method [13]. We used the NRA algorithm described in [6] when only sorted access was available, and the Threshold Algorithm ("TA") when both sorted access and random access were available. All the indexes were built offline, pre-loaded, and fully resident in memory during all querying operations. All experiments were run on an Ubuntu Linux machine with an Intel Core processor (X545 3.GHz and 4 GB RAM).

5.1 Exact Search

Sorted Access Only: We implemented the following methods. (1) BinaryProbe [13]: We considered the inverted lists of the complete query keywords, and the union of the inverted lists for the complete keywords of the partial keyword. We chose the shortest list, and for each of its record IDs, we did binary probings on the other lists. (2) NRA(Heap): We implemented the NRA algorithm using the heap-based technique. (3) NRA(Heap+Materialization⁸): We implemented the NRA algorithm using the heap-and-materialization-based techniques. Figure 9 shows the results on the Enron dataset, which showed that our method improved search efficiency. For instance, for queries with a partial keyword of length 2, NRA(Heap) reduced the query time of BinaryProbe from 28 ms to ms. NRA(Heap+Materialization) further reduced the time to 2 ms. This is because 1) BinaryProbe first computed all results and then ranked them; 2) BinaryProbe computed the union list of the partial keyword on the fly.
NRA(Heap) used the max heap to generate a sorted partial list, and NRA(Heap+Materialization) used materialized lists to save push/pop operations on the heap.

Figure 9: Exact search using sorted access (Enron). Panels: (a) varying data size (x-axis: # of records (*K)); (b) varying prefix length (x-axis: length of the prefix keyword). Y-axis: query time (ms). Methods: BinaryProbe, NRA(Heap), NRA(Heap+Materialization).

Sorted Access + Random Access: We implemented the following methods. (1) BinaryProbe (Forward List) [13]: We chose the shortest list, and for each of its record IDs, we verified whether the record contained the other keywords using the forward list. (2) TA(Forward List+Heap): We implemented the TA algorithm using the forward list for random access and the max heap for sorted access. (3) TA(Forward List+Heap+Materialization): We implemented the TA algorithm using the forward list, max heap, and list materialization. Figure 10 shows the results on the DBLP dataset. We can see that the random-access techniques indeed improved efficiency.

⁷ Details are omitted due to double-blind review.
⁸ We used an additional 5% of space with respect to the inverted index for materialization in the experiments.

5.2 Fuzzy Search

Sorted Access Only: We first evaluated the effect of the list-pruning technique. Figure 11 shows the experimental results (including two steps). We can observe that list pruning indeed improved search efficiency. For the Enron dataset with .5M records, the method with pruning can reduce the time from 3 ms to 7 ms. The pruning technique was more effective on the Enron dataset than on the other two datasets mainly due to two reasons. First, the Enron dataset had more trie nodes due to its large number of distinct keywords in the emails. Thus a query keyword can have more similar prefixes on the trie. Second, the Enron dataset had fewer records, and the inverted lists were relatively shorter. During the list traversal, the NRA algorithm visited fewer records, and the higher score of the top record from the max heap helped us prune more lists.

List Materialization: We evaluated the improvement on sorted access using list materialization for fuzzy type-ahead search. We measured the amount of storage space for storing materialized lists as a percentage of the total size of the inverted lists on the trie. We varied this amount, and measured the average time of finding the top-k answers using the NRA algorithm. Figure 12 shows the results.
We can see that list materialization improved the search performance. We implemented the different methods for list materialization, namely Random, TopDown, BottomUp, and CostBased, as discussed earlier. Figure 13 shows the results. Among the three naive methods, Random gave the best results. The CostBased algorithm outperformed all the naive methods. This is because CostBased selected high-quality nodes for materialization using a cost-based analysis.

Sorted Access + Random Access: We implemented the TA algorithm using the two methods for random access and list pruning for sorted access (described in Section 4). Figure 14 shows the scalability results on the three datasets. The two random-access methods scaled well. Method 2 (probing on trie leaf nodes) outperformed Method 1 (probing on forward lists). This is because for the three data sets, there were many prefixes similar to the partial keyword, and Method 1 needed to consider all similar prefixes for each record on forward lists.

Figure 10: Exact search using random access (DBLP). Panels: (a) varying data size (x-axis: # of records (*K)); (b) varying prefix length (x-axis: length of the prefix keyword). Y-axis: query time (ms). Methods: BinaryProbe(Forward List), TA(Forward List+Heap), TA(Forward List+Heap+Materialization).

6. RELATED WORK

There are many studies on autocomplete and phrase prediction for user queries [22, 5, 9, 23, 7]. Google instant search was launched to support type-ahead search. It first suggested relevant queries based on user profiles and query logs and then answered the top queries. Chaudhuri et al. [5] studied how to find similar strings interactively as users type in a query string, using an approach similar to that in [3, 2]. They did not study the case where a query has multiple keywords that need list-intersection operations. The search paradigm studied in this paper is different since we support fuzzy, full-text search as users type in queries. Bast et al. proposed techniques to support type-ahead search in their CompleteSearch systems [1, 2, 3]. Another study [19] is about type-ahead search on relational data graphs. Ji et al. [13] developed algorithms for fuzzy type-ahead search.
Our work extends these studies by developing efficient algorithms to support top-k search. Khoussainova et al. [14] proposed to suggest relevant SQL snippets as users type in SQL queries. Li et al. [18] studied how to use SQL to support type-ahead search in databases. Feng et al. [8] studied fuzzy search on XML data.

There have been many studies on supporting fuzzy search (e.g., [, 7, 4,, 24, 6]). However, these algorithms are inefficient for type-ahead search since they have low pruning power for short strings (partial keywords). The experiments in [3, 5] showed that these approaches are not as efficient as trie-based methods for fuzzy type-ahead search. Theobald et al. [25] proposed a heap-based method for query expansion. They used WordNet words and only utilized sorted access. We consider both sorted access and random access.

7. CONCLUSION

In this paper we studied how to efficiently answer top-k queries in type-ahead search. We focused on an index structure with a trie of keywords in a data set and inverted lists of records on the trie leaf nodes. We studied the technical challenges of adopting existing top-k algorithms in the literature, namely how to efficiently support random access and sorted access on inverted lists. We presented two algorithms for supporting random access, and proposed optimization techniques using list pruning and materialization to support sorted access. Our techniques can be easily extended to support large datasets through data partitioning. For example, we have built a system to search on 2 million MEDLINE publication records using two machines.

Acknowledgement. The authors have financial interest in Bimaple Technology Inc., a company currently commercializing some of the techniques described in this publication. Chen Li is partially supported by the NIH grant R2LM43-A and the National Natural Science Foundation of China (No. 6292). Guoliang Li, Jiannan Wang, and Jianhua Feng were partly supported by the National Natural Science Foundation of China (No. 634), the National Grand Fundamental Research 973 Program of China (No. 2CB3226), Tsinghua University (No. 2873), and the NExT Research Center funded by MDA, Singapore (No. WBS:R ).

Figure 11: Fuzzy search using list pruning (similarity threshold τ = 0.6). Panels: (a) URL, (b) DBLP, (c) Enron. X-axis: # of records; y-axis: query time (ms). Series: Without Pruning, Pruning, Computing Similar Keywords.

Figure 12: Fuzzy search using list materialization (sorted access only, with list pruning, threshold τ = 0.6). Panels: (a) URL, (b) DBLP, (c) Enron. X-axis: additional space / inverted-index size; y-axis: query time (ms). One series per number of query keywords.

Figure 13: Comparison of different materialization methods (similarity threshold τ = 0.6). Panels: (a) URL, (b) DBLP, (c) Enron. X-axis: additional space / inverted-index size; y-axis: query time (ms). Series: TopDown, BottomUp, Random, CostBased.

Figure 14: Fuzzy search with sorted access ("SA") and random access ("RA") (similarity threshold τ = 0.6). Panels: (a) URL, (b) DBLP, (c) Enron. X-axis: # of records; y-axis: query time (ms). Series: SA+RA(Probing on Forward Lists), SA+RA(Probing on Leaf Nodes), SA, Computing Similar Keywords.

8. REFERENCES
[1] H. Bast, A. Chitea, F. M. Suchanek, and I. Weber. ESTER: efficient search on text, entities, and relations. In SIGIR, 2007.
[2] H. Bast and I. Weber. Type less, find more: fast autocompletion search with a succinct index. In SIGIR, 2006.
[3] H. Bast and I. Weber. The CompleteSearch engine: Interactive, efficient, and towards IR & DB integration. In CIDR, pages 88-95, 2007.
[4] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
[5] S. Chaudhuri and R. Kaushik. Extending autocompletion to tolerate errors. In SIGMOD Conference, pages 707-718, 2009.
[6] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS, pages 102-113, 2001.
[7] J. Fan, G. Li, and L. Zhou. Interactive SQL query suggestion: Making databases user-friendly. In ICDE, 2011.
[8] J. Feng and G. Li. Efficient fuzzy type-ahead search in XML data. IEEE TKDE, 24(5), 2012.
[9] K. Grabski and T. Scheffer. Sentence completion. In SIGIR, 2004.
[10] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491-500, 2001.
[11] M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava. Fast indexes and algorithms for set similarity selection queries. In ICDE, 2008.
[12] I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv., 40(4), 2008.
[13] S. Ji, G. Li, C. Li, and J. Feng. Efficient interactive fuzzy keyword search. In WWW, pages 371-380, 2009.
[14] N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu. SnipSuggest: Context-aware autocompletion for SQL. PVLDB, 4(1):22-33, 2010.
[15] K. Kukich. Techniques for automatically correcting words in text. ACM Comput. Surv., 24(4), 1992.
[16] H. Lee, R. T. Ng, and K. Shim. Extending q-grams to estimate selectivity of string matching with low edit distance. In VLDB, pages 195-206, 2007.
[17] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008.
[18] G. Li, J. Feng, and C. Li. Supporting search-as-you-type using SQL in databases. IEEE TKDE, 2012.
[19] G. Li, S. Ji, C. Li, and J. Feng. Efficient type-ahead search on relational data: a TASTIER approach. In SIGMOD Conference, 2009.
[20] G. Li, S. Ji, C. Li, and J. Feng. Efficient fuzzy full-text type-ahead search. VLDB J., 20(4):617-640, 2011.
[21] N. Mamoulis, K. H. Cheng, M. L. Yiu, and D. W. Cheung. Efficient aggregation of ranked inputs. In ICDE, 2006.
[22] H. Motoda and K. Yoshida. Machine learning techniques to make computers easier to use. Artif. Intell., 103(1-2):295-312, 1998.
[23] A. Nandi and H. V. Jagadish. Effective phrase prediction. In VLDB, pages 219-230, 2007.
[24] J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD Conference, pages 1033-1044, 2011.
[25] M. Theobald, R. Schenkel, and G. Weikum. Efficient and self-tuning incremental query expansion for top-k query processing. In SIGIR.