Supporting Efficient Top-k Queries in Type-Ahead Search


Guoliang Li, Jiannan Wang, Chen Li, Jianhua Feng
Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing 100084, China.
Department of Computer Science, UC Irvine, CA, USA

ABSTRACT
Type-ahead search finds answers on the fly as a user types in a keyword query. A main challenge in this search paradigm is the high-efficiency requirement that queries must be answered within milliseconds. In this paper we study how to answer top-k queries in this paradigm, i.e., as a user types in a query letter by letter, we want to efficiently find the k best answers. Instead of inventing completely new algorithms from scratch, we study the challenges of adopting existing top-k algorithms in the literature that rely heavily on two basic list-access methods: random access and sorted access. We present two algorithms to support random access efficiently. We develop novel techniques to support efficient sorted access using list pruning and materialization. We extend our techniques to support fuzzy type-ahead search, which allows minor errors between query keywords and answers. We report our experimental results on several real large data sets to show that the proposed techniques can answer top-k queries efficiently in type-ahead search.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms
Algorithms, Experimentation, Performance

Keywords
Type-ahead search, top-k search, fuzzy search

1. INTRODUCTION
To give instant feedback when users formulate search queries, many information systems support autocomplete search, which shows results immediately after a user types in a partial keyword query. As an example, almost all the major search engines nowadays automatically suggest possible keyword queries as a user types in partial keywords.
Most autocomplete systems treat a query with multiple keywords as a single string, and find answers with text that matches the string exactly. To overcome this limitation, a new type-ahead search paradigm has emerged recently [2, 3]. Using this paradigm, a system treats a query as a set of keywords, and does a full-text search on the underlying data to find answers including the keywords. We treat the last keyword in the query as a partial keyword the user is completing. For instance, a query "graph sig" on a publication table can find publication records with the keyword "graph" and a keyword that has "sig" as a prefix, such as "sigir", "sigmod", and "signature". In this way, a user gets instant feedback after typing keywords, and thus learns more about the underlying data and can formulate a query more easily.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'12, August 12-16, 2012, Portland, Oregon, USA. Copyright 2012 ACM.

Ji et al. [3] extended type-ahead search by allowing minor errors between queries and answers. As a user types in query keywords, the system can find relevant records with keywords similar to the query keywords. This feature is especially important when the user has limited knowledge about the exact representation of the entities she is looking for. For instance, if a user types in a partial query "christos falut", the system can find records approximately matching the two keywords despite the typo in the query, such as a record with keywords "Christos Faloutsos". Clearly these features can further improve user search experiences. In this paper we study how to answer ranking queries in type-ahead search on large amounts of data.
That is, as a user types in a keyword query letter by letter, we want to find the most relevant (or "top-k") records on the fly. One approach first finds records matching those query keywords, and then computes their ranking scores to find the most relevant ones. This approach is not efficient when there are a large number of candidate answers to compute and store.

Existing type-ahead search approaches assume an index structure with a trie for the keywords in the underlying data, and each leaf node has an inverted list of records with this keyword, with the weight of this keyword in the record [3, 9]. As an example, Table 1 shows a sample collection of publication records. For simplicity, we only list some of the keywords for each record. Figure 1 shows the corresponding index structure. (More details about the index are in Section 3.) Suppose a user types in a query "graph icdm li". For exact search, we find records containing the first two keywords and a word with a prefix of "li", e.g., record r5. For fuzzy search, we compute records with keywords similar to query keywords, and rank them to find the best answers. For each complete keyword, we find keywords similar to the query keyword. For instance, both keywords "icdm" and "icdl" are similar to the second query keyword. The last keyword
"li" is treated as a prefix condition, since the user is still typing at the end of this keyword. We find keywords that have a prefix similar to "li", such as "lin", "liu", and "lui". We access the inverted lists of these similar keywords to find records and rank them to find the best answers for the user.

Table 1: Publication records with sample keywords.
Record ID   Record
r0          graph icdm ...
r1          graph group lui ...
r2          gray icdl liu ...
r3          graph icdl lin lui ...
r4          graph group icdm lin liu ...
r5          graph gray gross icdm lin liu ...
r6          gray group icdm lin liu ...
r7          gray gross group icdl lin ...
r8          gross icdl liu ...
r9          icdm liu ...

A key question is: how can we access the inverted lists on trie leaf nodes efficiently to answer top-k queries? Instead of inventing completely new algorithms from scratch, we study how to adopt a plethora of algorithms in the literature for answering top-k queries by accessing lists (e.g., [2, 2]). These algorithms share the same framework proposed by Fagin [6], in which we have lists of records sorted based on various conditions. An aggregation function takes the scores of a record from these lists and computes the final score of the record. There are two methods to access these lists:
(1) Random Access: Given a record ID, we can retrieve the score of the record on each list;
(2) Sorted Access: We retrieve the record IDs on each list following the list order.

In this paper we study technical challenges when adopting these algorithms, and focus on new optimization opportunities that arise in our problem. In particular, we study how to support the two types of access operations efficiently by utilizing characteristics specific to our index structures and access methods. We make the following contributions:
1) In Section 3, we present a forward-list-based method for supporting random access on the inverted lists, and develop a heap-based method and list-materialization techniques to support sorted access efficiently.
2) In Section 4 we study fuzzy type-ahead search.
We propose a list-pruning technique to improve the performance of sorted access, and study how to improve the techniques based on forward lists and list materialization for fuzzy search. Due to the challenging nature of the problem, our extensions are technically nontrivial.
3) In Section 5 we present our experimental results on real large data sets to show the efficiency of our techniques.

We have deployed several systems using this paradigm, which have been used regularly and well accepted by users due to their friendly interfaces and high efficiency.

2. FORMULATION AND PRELIMINARIES
Type-Ahead Search: Let R be a collection of records such as the tuples in a relational table. Let D be the set of words in R. Let Q be a query the user has typed in, which is a sequence of keywords $w_1, w_2, \ldots, w_m$. We treat the last keyword $w_m$ as a partial keyword the user is completing, and the other keywords as complete keywords the user has completed (footnote 2). As a user types in a keyword query letter by letter, type-ahead search finds, on the fly, records that contain the first $m-1$ keywords and a word with the last keyword as a prefix.

Footnote 2: Our method can be easily extended to the case that every keyword is taken as a partial keyword.

Figure 1: Trie index structure.

Without loss of generality, each string in the data set and in a query is assumed to use lowercase letters. For example, in Table 1, R = {r0, r1, ..., r9}, D = {graph, icdm, group, lui, ...}. Suppose a user types in a query "icdm gra". We treat "icdm" as a complete keyword and "gra" as a partial keyword. Records r0, r4, r5, and r6 are potentially relevant answers. For example, r0 contains the complete keyword "icdm" and the word "graph" with a prefix of "gra". When the user types in more letters and submits the query "icdm graph li", we treat "icdm" and "graph" as complete keywords and "li" as a partial keyword. Records r4 and r5 are potentially relevant answers.

Top-k Answers: We rank each record in R based on its relevance to the query.
Given a positive integer k, our goal is to compute the best k records in R ranked by their relevance to Q. Notice that our problem setting allows an important record to be in the answer, even if not all query keywords appear in the record (the "OR" semantics). Thus the algorithms in [3] cannot be used directly in our problem.

Ranking: In the literature there are many algorithms for answering top-k queries by accessing lists (e.g., [2, 2]). These algorithms share the same framework proposed by Fagin [6], in which we have lists of records sorted based on various conditions, such as term frequency and inverse document frequency ("tf*idf"). Each record has a score on a list, and we use an aggregation function to combine the scores of the record on different lists to compute its overall relevance to the query. The aggregation function needs to be monotonic, i.e., decreasing the score of a record on a list cannot increase the record's overall score. This approach has the advantage of allowing a general class of ranking functions.

In this paper, we focus on an important class of ranking functions with the following property: the score $F(r, Q)$ of a record r to a query Q is a monotonic combination of scores of the query keywords with respect to the record. Formally, we compute the score $F(r, Q)$ in two steps. In the first step, for each keyword w, we compute a score of the keyword with respect to the record, denoted by $F(r, w)$. In the second step, we compute the score $F(r, Q)$ by applying a monotonic function on the $F(r, w)$'s for all the keywords w. The intuition of this property is that the more relevant an individual query keyword is to a record, the more likely this record is a good answer to this query. For example, we compute the score of a record to the query "icdm graph li" by aggregating the scores of each of the keywords with respect to the record.

Each complete keyword w has a weight associated with a record r, denoted by $W(r, w)$. This weight could depend
on the keyword, such as the tf*idf value of the keyword in the record. As a specific case, it can also be independent of the keyword. For instance, if a record is a URL with tokenized keywords, its weight could be a rank score of the corresponding Web page. If a record is an author, we can use the number of publications of the author as a weight of this record.

Figure 2: Type-ahead search for $Q = w_1, w_2, \ldots, w_m$.

For the last partial keyword $w_m$, there could be multiple complete words. We compute the relevance score of $w_m$ in the record, i.e., $F(r, w_m)$, based on the following property: $F(r, w_m)$ is the maximal value of the $W(r, d)$ weights for all the keywords d with respect to $w_m$ in r, where d is a keyword in record r and has a prefix of $w_m$. This property states that we only look at the keyword in a record that is most relevant to the partial keyword when computing the relevance of the keyword to the record. It means that the ranking function is "greedy": it takes the most relevant keyword in the record as an indicator of how important this record is to the partial keyword. As we will see in Section 3, this property allows us to do effective pruning when accessing the multiple lists of a query keyword. The following is an example function.

$$F(r, Q) = \sum_{i=1}^{m} F(r, w_i), \quad (1)$$

where

$$F(r, w_i) = \begin{cases} W(r, w_i) & \text{if } i < m, \\ \max_{\text{complete word } d \text{ of } w_m} \{W(r, d)\} & \text{if } i = m. \end{cases} \quad (2)$$

In Figure 1, consider the query "icdm graph li" and record r5. $F(r_5, \text{icdm}) = W(r_5, \text{icdm}) = 8$ and $F(r_5, \text{graph}) = W(r_5, \text{graph}) = 9$. The partial keyword "li" has two complete words "lin" and "liu". $F(r_5, \text{li}) = \max\{W(r_5, \text{lin}), W(r_5, \text{liu})\} = 8$. $F(r_5, \text{icdm graph li}) = 8 + 9 + 8 = 25$.

3. EXACT TYPE-AHEAD SEARCH
In this section, we study efficient list-access methods to support exact type-ahead search, i.e., no mismatches between query keywords and answers.

Indexing: We construct a trie for the keywords in the data D. A trie node has a character label. Each keyword in D corresponds to a unique path from the root to a leaf node (footnote 3) on the trie.
For simplicity, a trie node is mentioned interchangeably with the keyword corresponding to the path from the root to the node. A leaf node has an inverted list of ⟨rid, weight⟩ pairs, where rid is the ID of a record containing the leaf-node string, and weight is the weight of the keyword in the record. Figure 1 shows the index structure in our running example. For instance, for the leaf node of keyword "graph", its inverted list has five elements. The first element ⟨r5, 9⟩ indicates that the record r5 has this keyword, and the weight of this keyword in this record is 9, i.e., $W(r_5, \text{graph}) = 9$.

Footnote 3: A common trick to make each leaf node correspond to a complete word and vice versa is to add a special mark to the end of each word. For simplicity we did not use this trick.

Figure 3: Forward lists.

Searching: We compute the top-k answers to a query Q in two steps. As illustrated in Figure 2, in the first step, for each complete keyword $w_i$ ($1 \le i \le m-1$), we get its inverted list. For the last partial keyword, we locate the trie node of $w_m$ and retrieve the inverted lists of the trie node's leaf descendants. For example, in Figure 1, consider a query "icdm li". The partial keyword "li" has two leaf-node keywords: "lin" and "liu". In the second step, we access the inverted lists to compute the k best answers. Many algorithms have been proposed for answering top-k queries by accessing sorted lists [2, 6]. When adopting these algorithms to solve our problem, we need to efficiently support two basic types of access used in these algorithms: random access and sorted access on the lists.

3.1 Efficient Random Access
To support random access, we construct a forward index in which each record has a forward list of IDs of its keywords.
We assume each keyword has a unique ID with respect to its leaf node on the trie, and the IDs of the keywords follow their alphabetical order. Figure 3 shows the forward lists. The element ⟨1, 9⟩ on the forward list of record r5 shows that this record has a keyword with ID 1 and weight 9, which is the keyword "graph" as shown on the trie. Given a record and a complete keyword, we can get the corresponding weight by doing a binary search on the forward list. For example, to get the weight of keyword "icdm" with ID 6 in r5, we can do a binary search on r5's forward list and get the corresponding weight 8.

For the partial keyword, as it has multiple complete words, we first need to locate its trie node and then enumerate its leaf descendants to get the corresponding weights. This method could be expensive if the trie node has many leaf descendants. To improve the performance, we can use an alternative method. For each trie node n, we can maintain a keyword range $[l_n, u_n]$, where $l_n$ and $u_n$ are the minimal and maximal keyword IDs of its leaf nodes, respectively [3]. An interesting observation is that a complete word with n as a prefix must have an ID in this keyword range, and each complete word in the data set with an ID in this range must have a prefix of n. In Figure 3, the keyword range of node "g" is [1, 4], since 1 is the smallest ID of its leaf nodes and 4 is the largest one.

Based on this observation, this method verifies whether a record r contains a keyword with a prefix of $w_m$ as follows. We first locate the trie node $w_m$ and then check if there is a keyword ID on the forward list of r in the keyword range $[l_{w_m}, u_{w_m}]$. Since we can keep the forward list of r sorted, this check can be done efficiently. For instance, consider the query "graph icdm l". For the first element on the inverted list of "graph", ⟨r5, 9⟩, we can check whether
r5 contains the other two keywords as follows. For the complete keyword "icdm" with ID 6, we do a binary search on r5's forward list and get weight 8. For the partial keyword "l" with the keyword range [7, 9], using a binary search on r5's forward list (⟨1, 9⟩; ⟨2, 8⟩; ⟨3, 4⟩; ⟨6, 8⟩; ⟨7, 3⟩; ⟨8, 8⟩), we find keyword IDs 7 and 8 in this range. Thus we know that the record indeed contains keywords with prefix "l", and compute the corresponding score $F(r_5, \text{l}) = \max\{F(r_5, \text{lin}), F(r_5, \text{liu}), F(r_5, \text{lui})\} = 8$. Thus $F(r_5, \text{graph icdm l}) = 25$.

3.2 Efficient Sorted Access
To support sorted access, we can keep the elements on the inverted lists sorted by their weights in descending order. Thus, for a complete keyword, we can get an ordered list. The partial keyword $w_m$ has multiple leaf descendants and corresponding inverted lists. We use $U(w_m)$ to denote the union of those inverted lists, called the union list of $w_m$. We need to support sorted access on $U(w_m)$ to retrieve the next most relevant record ID for $w_m$. Fully computing $U(w_m)$ using the keyword lists could be expensive in terms of time and space. In this section, we propose two techniques to support sorted access efficiently.

Heap-Based Method
We can support sorted access on $U(w_m)$ by building a max heap on the inverted lists of its leaf nodes. In particular, we maintain a cursor on each inverted list. The max heap initially consists of the record IDs pointed to by the cursors, sorted on the weights of the keywords in these records. Notice that each inverted list is already sorted based on the weights of its keyword in the records. To retrieve the next best record, we pop the top element from the heap, increment the cursor of the list of the popped element by 1, and push the new element of this list to the heap. When popping all elements from the heap, we get a sorted list for the partial keyword. For example, consider the partial keyword "l". It has three complete keywords "lin", "liu", and "lui". We can compute its union list as shown in Figure 4. Note that since our method does not need to compute the entire list of $U(w_m)$, $U(w_m)$ is a "virtual" sorted list of the partial keyword $w_m$.

Figure 4: A heap-based method to compute the virtual sorted list of partial keyword "l".

On top of the inverted lists of complete keywords and the max heap of the partial keyword, we can adopt an existing top-k algorithm to find the k best records. As an example, suppose we want to compute the top-1 best answer for the query "graph icdm l" using sorted access only. We get the first elements of "graph" and "icdm", ⟨r5, 9⟩ and ⟨r4, 9⟩, pop the top element of the max heap in Figure 4, ⟨r3, 9⟩, and compute an upper bound on the overall score of an answer, i.e., 27. We get the next elements of "graph" and "icdm", ⟨r4, 7⟩ and ⟨r5, 8⟩. We increment the cursor of the list that produced the top element, push its next element into the heap, and retrieve the next top element: ⟨r5, 8⟩. Based on the accessed elements, we have: 1) the score of record r5 is 25; 2) the maximal score of record r3 is 24, and that of r4 is 24, while those of other records are at most 23. Thus, record r5 is the best answer.

Figure 5: Benefits of materializing the union list $U(v)$ for node v with respect to partial keyword $w_m$. Legend: $M(v)$: materialized descendants of v; $T(v)$: subtrie of v; $N(v)$: other leaf nodes (of v) without materialized ancestors.

List Materialization
We can further improve the performance of sorted access for the partial keyword $w_m$ by precomputing and storing the unions of some of the inverted lists on the trie. Let v be a trie node, and $U(v)$ be the union of the inverted lists of v's leaf nodes, sorted by their record weights. If a record appears more than once on these lists, we choose its maximal weight as its weight on list $U(v)$. For example, U("li") = {⟨r3, 9⟩, ⟨r5, 8⟩, ⟨r7, 8⟩, ⟨r4, 7⟩, ⟨r6, 5⟩, ⟨r9, 4⟩, ⟨r2, 3⟩, ⟨r8, 1⟩}.
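The two sorted-access techniques above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `virtual_sorted_list` and `materialize_union` are hypothetical, record IDs are plain integers, and the three inverted lists use weights loosely modeled on the running example. The lazy merge keeps one cursor per inverted list in a max heap (simulated with negated weights, since `heapq` is a min heap) and skips records that were already emitted, so each record keeps its maximal weight.

```python
import heapq
from typing import Iterator, List, Tuple

Posting = Tuple[int, int]  # (record id, weight); illustrative type alias

def virtual_sorted_list(lists: List[List[Posting]]) -> Iterator[Posting]:
    """Lazily merge inverted lists, each sorted by weight in descending order."""
    heap = []
    for list_no, postings in enumerate(lists):
        if postings:
            rid, weight = postings[0]
            # Negate the weight because heapq implements a min heap.
            heap.append((-weight, rid, list_no, 0))
    heapq.heapify(heap)
    seen = set()
    while heap:
        neg_weight, rid, list_no, pos = heapq.heappop(heap)
        # Advance the cursor of the list that produced the popped element.
        if pos + 1 < len(lists[list_no]):
            next_rid, next_weight = lists[list_no][pos + 1]
            heapq.heappush(heap, (-next_weight, next_rid, list_no, pos + 1))
        # The first occurrence of a record carries its maximal weight.
        if rid not in seen:
            seen.add(rid)
            yield rid, -neg_weight

def materialize_union(lists: List[List[Posting]]) -> List[Posting]:
    """Precompute the full union list U(v) by exhausting the lazy merge."""
    return list(virtual_sorted_list(lists))

# Assumed inverted lists for "lin", "liu", and "lui" (illustrative weights).
LIN = [(3, 9), (7, 8), (6, 4), (5, 3), (4, 2)]
LIU = [(5, 8), (4, 7), (6, 5), (9, 4), (2, 3), (8, 1)]
LUI = [(1, 6), (3, 4)]

if __name__ == "__main__":
    merged = virtual_sorted_list([LIN, LIU, LUI])
    print(next(merged))                   # best record for partial keyword "l"
    print(materialize_union([LIN, LIU]))  # union list of node "li"
```

A top-k algorithm would call `next()` on the generator only as often as needed, which is what makes the list virtual; `materialize_union` trades memory for fewer heap operations at query time.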
When using a max heap to retrieve records sorted by their scores for the partial keyword, this materialized list could help us build a max heap with fewer lists and reduce the cost of push/pop operations on the heap. Therefore, this method allows us to utilize additional memory space to answer top-k queries more efficiently. For instance, consider the index in Figure 1 and a query "icdm g". For the partial keyword "g", we access its data keywords "graph", "gray", "gross", and "group", and build a max heap on their inverted lists based on record scores with respect to this query keyword. If we materialize the union lists of "gra" and "gro", we can use their materialized lists, saving the time to traverse the four leaf nodes and some push/pop operations on the heap.

We next give a detailed cost-based analysis to quantify the benefit of materializing a node on the performance of operations on the max heap of $w_m$, for exact type-ahead search. Let B be a budget of storage space we are given to materialize union lists. Given a trie node v, let $U(v)$ be the union of inverted lists of leaf nodes in the subtrie of v. Our goal is to select trie nodes to materialize their union lists for maximizing the performance of queries. The following are naive algorithms for choosing trie nodes:
Random: We randomly select trie nodes.
TopDown: We select nodes top down from the trie root.
BottomUp: We select nodes bottom up from leaf nodes.

Each naive approach keeps choosing trie nodes to materialize their union lists until the sum of their list sizes reaches the space limit B. One main limitation of these approaches is that they do not quantitatively consider the benefits of
materializing a union list. To overcome this limitation, we propose a cost-based method called CostBased to do list materialization. Its main idea is the following. For simplicity we say a node has been "materialized" if its union list has been materialized. For a query Q with a prefix keyword $w_m$, suppose some of the trie nodes have their union lists materialized. Let v be such a materialized node. If we can use $U(v)$ to construct the heap of $w_m$, we need not visit v's descendants and access the inverted lists of v's leaf descendants, and thus achieve the benefit of reducing the time of traversing the subtrie rooted at v and of push/pop operations on the max heap of $w_m$. We say the materialized node v is usable for partial keyword $w_m$.

Next we discuss how to check whether a node v is usable for partial keyword $w_m$. If v is not a descendant of $w_m$, materializing v is unusable to $w_m$; otherwise, if no node on the path from v to $w_m$ (including $w_m$) has been materialized, materializing v is usable to $w_m$. Notice that if v has a materialized ancestor $v'$ on the path from v to $w_m$, then we can use the materialized list $U(v')$ instead of $U(v)$, and the list $U(v)$ will no longer be usable to $w_m$. To summarize, a materialized node v is usable for partial keyword $w_m$ if
1. v is a descendant of $w_m$; and
2. v has no materialized ancestor between v and $w_m$.

For example, consider a query "icdm g". Materializing node "l" is unusable for partial keyword "g" as "l" is not a descendant of "g". Materializing "gr" is usable for "g" if "g" is not materialized. If "g" is materialized, then materializing "gra" is unusable for "g" as we will use the materialized list of "g" to build the max heap of "g", instead of using "gra".

If v is usable for $w_m$, materializing $U(v)$ has the following benefits for the heap of $w_m$. (1) We do not need to traverse the trie to access these leaf nodes and use them to construct the max heap; (2) each push/pop operation on the heap is more efficient since it has fewer lists. Here we present an analysis of the benefits of materializing the usable node v.
In general, for a trie node v, let $T(v)$ denote its subtrie and $|T(v)|$ denote the number of nodes in $T(v)$. The total time of traversing this subtrie is $O(|T(v)|)$. Now we analyze the benefit of materializing node v. As illustrated in Figure 5, suppose v has materialized descendants. Let $M(v)$ be the set of highest materialized descendants of v. These materialized nodes can help reduce the time of accessing the inverted lists of v's leaf nodes in two ways. First, we do not need to traverse the descendants of a materialized node $d \in M(v)$. We can just traverse $|T(v)| - \sum_{d \in M(v)} |T(d)|$ trie nodes. Second, when inserting lists to the max heap of $w_m$, we insert the union list of v into the heap and need not insert the union list of each $d \in M(v)$ and the inverted lists of $d \in N(v)$ into the heap, where $N(v)$ denotes the set of v's leaf descendants having no ancestors in $M(v)$. Let $S(v) = M(v) \cup N(v)$. We quantify the benefits of materializing node v:

1. Reducing traversal time: Since we do not traverse v's descendants, the time reduction is
$$B_1 = O\Big(|T(v)| - \sum_{d \in M(v)} |T(d)|\Big).$$
2. Reducing heap-construction time: When constructing the max heap for keyword $w_m$, we insert the union list $U(v)$ into the heap, instead of the inverted lists of those nodes in $S(v)$. The time reduction is $B_2 = |S(v)|$.
3. Reducing sorted-access time: If we insert the union list $U(v)$ into the max heap of $w_m$, the number of leaf nodes in the heap is $|S(w_m)|$. Otherwise, it is $|S(w_m)| + |S(v)|$. The time reduction of a sorted access is
$$B_3 = O\big(\log(|S(w_m)| + |S(v)|)\big) - O\big(\log(|S(w_m)|)\big).$$

The following is the overall benefit of materializing v for the partial keyword $w_m$:
$$B_v = B_1 + B_2 + A_v \cdot B_3, \quad (3)$$
where $A_v$ is the number of sorted accesses on $U(v)$. $A_v$ can be computed using the number of records in the union list $U(v)$ and the number of keywords in the query. The analysis above is on a query workload. If there is no query workload, we can use the trie structure to estimate the probability of each node being queried and use such information to compute the benefit of materializing a node.
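To make Equation (3) concrete, the following Python sketch evaluates the benefit for given counts. It is a rough illustration only: the function name and parameters are hypothetical, and it assumes that every asymptotic term can be replaced by its argument with constant factor 1 and that logarithms are base 2, which the analysis above does not specify.

```python
import math
from typing import Sequence

def materialization_benefit(
    subtrie_size: int,                        # |T(v)|
    materialized_subtrie_sizes: Sequence[int],  # |T(d)| for each d in M(v)
    s_v: int,                                 # |S(v)|
    s_wm: int,                                # |S(w_m)|, assumed to be >= 1
    sorted_accesses: int,                     # A_v
) -> float:
    """Evaluate B_v = B_1 + B_2 + A_v * B_3 under unit-cost assumptions."""
    b1 = subtrie_size - sum(materialized_subtrie_sizes)  # traversal saved
    b2 = s_v                                             # heap insertions saved
    b3 = math.log2(s_wm + s_v) - math.log2(s_wm)         # saving per sorted access
    return b1 + b2 + sorted_accesses * b3

if __name__ == "__main__":
    # Hypothetical node: 10 subtrie nodes, one materialized descendant of size 3.
    print(materialization_benefit(10, [3], 4, 4, 5))
```

A selection procedure could rank candidate nodes by such an estimate divided by the size of their union lists and pick nodes until the space budget B is used up; that greedy policy is an assumption here, not something stated in the text above.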
In this paper, we employ a setting with no query workload.

4. FUZZY TYPE-AHEAD SEARCH
In this section, we first define the problem of top-k queries in fuzzy type-ahead search [3]. We then develop new techniques to support efficient list access to answer such queries by extending techniques developed for exact search.

4.1 Ranking
As a user types in a query letter by letter, fuzzy type-ahead search finds, on the fly, records with words similar to the query keywords. For example, consider the data in Table 1. Suppose a user types in a query "graph grose". We return r5 as a relevant answer since it has a keyword "gross" similar to the query keyword "grose".

We use edit distance to measure the similarity between strings. Formally, the edit distance between two strings $s_1$ and $s_2$, denoted by $ed(s_1, s_2)$, is the minimum number of single-character edit operations (i.e., insertion, deletion, and substitution) needed to transform $s_1$ to $s_2$. For example, $ed(\text{gross}, \text{grose}) = 1$.

Similarity Function: Let $\pi$ be a function that computes the similarity between a data string s and a query keyword w in $Q = w_1, w_2, \ldots, w_m$. An example is:
$$\pi(s, w) = 1 - \frac{ed(s, w)}{|w|},$$
where $|w|$ is the length of the query keyword w. We normalize the edit distance based on the query-keyword length in order to allow more errors for longer query keywords. Our results in the paper focus on this function, and they can be generalized to other functions using edit distance.

Let d be a keyword in the data set D. For each complete keyword $w_i$ ($i = 1, 2, \ldots, m-1$) in the query, we define the similarity of d to $w_i$ as:
$$Sim(d, w_i) = \pi(d, w_i).$$
Since the last keyword $w_m$ is treated as a prefix condition, we define the similarity of d to $w_m$ as the maximal similarity of d's prefixes using function $\pi$, i.e.:
$$Sim(d, w_m) = \max_{\text{prefix } p \text{ of } d} \{\pi(p, w_m)\}.$$
Let $\tau$ be a similarity threshold. We say a keyword d in D is similar to a query keyword w if $Sim(d, w) \ge \tau$. We say a prefix p of a keyword in D is similar to the query keyword $w_m$ if $\pi(p, w_m) \ge \tau$. We want to find the keywords in the data set that are similar to query keywords, since records with such a keyword could be of interest to the user.
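The similarity definitions above translate directly into code. The Python sketch below is illustrative only: the function names are hypothetical, the edit distance uses the textbook dynamic-programming recurrence rather than any incremental algorithm, and data keywords are assumed to be non-empty.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]

def pi(s: str, w: str) -> float:
    """pi(s, w) = 1 - ed(s, w) / |w|, normalized by the query-keyword length."""
    return 1 - edit_distance(s, w) / len(w)

def sim_complete(d: str, w: str) -> float:
    """Similarity of data keyword d to a complete query keyword w."""
    return pi(d, w)

def sim_partial(d: str, w: str) -> float:
    """Similarity of d to the partial keyword w: best over d's non-empty prefixes."""
    return max(pi(d[:k], w) for k in range(1, len(d) + 1))

def is_similar(similarity: float, tau: float) -> bool:
    """A keyword (or prefix) is similar when its similarity reaches the threshold."""
    return similarity >= tau

if __name__ == "__main__":
    print(edit_distance("gross", "grose"))
    print(sim_complete("gross", "grose"))
    print(sim_partial("lui", "li"))
```

Computing `sim_partial` by enumerating every prefix is quadratic per keyword and is meant only to mirror the definition; a trie-based traversal would share work across keywords with common prefixes.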
Figure 6: Keywords similar to those in query $Q = w_1, w_2, \ldots, w_m$. Each query keyword $w_i$ has similar keywords on leaf nodes. The last prefix keyword $w_m$ has similar prefixes.

Let $\Phi(w_i)$ ($i = 1, \ldots, m$) denote the set of keywords in D similar to $w_i$, and $P(w_m)$ denote the set of prefixes (of keywords in D) similar to $w_m$. We compute the top-k answers to the query Q in two steps. In the first step, for each keyword $w_i$ in the query, we first compute an edit-distance upper bound based on the similarity function, i.e., $(1 - \tau) \cdot |w_i|$, and then compute the similar keywords $\Phi(w_i)$ and similar prefixes $P(w_m)$ on the trie (shown in Figure 6). Ji et al. [3] developed an efficient algorithm for incrementally computing these similar strings as the user modifies the current query. A similar algorithm is developed in [5]. In the second step, we access the inverted lists of these similar data keywords to compute the k best answers.

For example, assume a user types in a query "grose li" letter by letter on the data shown in Table 1. Suppose the similarity threshold $\tau$ is 0.45. The set of prefixes similar to the partial keyword "li" is P("li") = {l, li, lin, liu, lu, lui, i}, and the set of data keywords similar to the partial keyword "li" is Φ("li") = {lin, liu, lui, icdl, icdm}. In particular, "lui" is similar to "li" since $Sim(\text{lui}, \text{li}) = 1 - \frac{ed(\text{lui}, \text{li})}{|\text{li}|} = 0.5 \ge \tau$. The set of similar words for the complete keyword "grose" is Φ("grose") = {gross}. Then we compute top-k answers using the inverted lists of those words in Φ("grose") and Φ("li").

Ranking: We still assume the ranking function has the first property described in Section 2, which computes the score $F(r, Q)$ by applying a monotonic function on the $F(r, w_i)$'s for all the keywords $w_i$ in the query. Given a complete keyword $w_i$ and a record r, for exact search, we can use the weight of $w_i$ in r, i.e., $W(r, w_i)$, to denote their relevance $F(r, w_i)$.
But for fuzzy search, the keyword $w_i$ can be similar to multiple keywords in the record r, and different similar words have different similarities to $w_i$ and different weights in r. A question is how to compute the relevance value of keyword $w_i$ in record r, $F(r, w_i)$. Let d be a keyword in record r such that d is similar to the query keyword $w_i$, i.e., $d \in \Phi(w_i)$. We use $F(r, w_i, d)$ to denote the relevance of this query keyword $w_i$ in the record with respect to keyword d. The value should depend on both the weight of d in r, i.e., $W(r, d)$, as well as the similarity between $w_i$ and d, i.e., $Sim(d, w_i)$. Intuitively, the more similar they are, the more relevant $w_i$ is to r in terms of d. For instance, $F(r, w_i, d) = Sim(d, w_i) \cdot W(r, d)$ is an example ranking function to evaluate the relevance of $w_i$ in the record r with respect to keyword d. We use the following function with the second property in Section 2 to compute $F(r, w_i)$:
$$F(r, w_i) = \max_{\text{keyword } d \text{ (in } r\text{) similar to } w_i} \{F(r, w_i, d)\}. \quad (4)$$

4.2 Efficient Random Access
We first study how to support efficient random access for fuzzy type-ahead search. For simplicity, in the discussion we focus on how to verify whether the record has a keyword with a prefix similar to the partial keyword $w_m$. With minor modifications the discussion extends to the case where we want to verify whether r has a keyword similar to a complete keyword $w_i$ ($1 \le i \le m-1$).

In each random access, given an ID of a record r, we want to retrieve information related to a query keyword $w_i$, which allows us to retrieve $W(r, d)$ for each of $w_i$'s similar words d so as to compute the score $F(r, w_i)$. In particular, for a keyword $w_i$ in the query, does the record r have a keyword similar to $w_i$? One naive way to get the information is to retrieve the original record and go through its keywords. This approach has two limitations. First, if the data is too large to fit into memory and has to reside on hard disks, accessing the original data from the disks may slow down the process significantly. This costly operation will prevent us from achieving an interactive search speed.
The second limitation is that it may require a lot of computation of string similarities based on edit distance, which could be time consuming. In this section, we present two efficient approaches for solving this problem.

Method 1: Probing on Forward Lists: This method verifies whether record r contains a keyword with a prefix similar to $w_m$ as follows. For each prefix p on the trie similar to $w_m$ (computed in the first step of the algorithm as discussed above), we check if there is a keyword ID on the forward list of r in the keyword range $[l_p, u_p]$ of the trie node of p, as discussed in Section 3.

Method 2: Probing on Trie Leaf Nodes: Using this method, for each prefix p similar to $w_m$, we traverse the subtrie of p and identify its leaf nodes. For each leaf node d, we store the fact that for the query Q, this keyword d has a prefix similar to $w_m$ in the query. Specifically, we store ⟨Query ID, partial keyword $w_m$, $Sim(p, w_m)$⟩. We store the query ID in order to differentiate it from other queries in case multiple queries are answered concurrently. We store the similarity between $w_m$ and p to compute the score of this keyword in a candidate record. In case the leaf node has several prefixes similar to $w_m$, we only keep their maximal similarity to $w_m$. For each complete keyword $w_i$, we also store the same information for those trie nodes similar to $w_i$. Therefore, a leaf node might have multiple entries corresponding to different keywords in the same query. We call these entries for the leaf node its collection of relevant query keywords. Notice that this structure needs very little storage space, since the entries of old queries can be quickly reused by new queries, and the number of keywords in a query tends to be small.

We use this additional information to efficiently check if a record r contains a complete word with a prefix similar to the partial keyword $w_m$. We scan the forward list of r. For each of its keyword IDs, we locate the corresponding leaf node, and test whether its collection of relevant query keywords includes this query and
7 p [,4] a [,4] y g [,2] [3,4] [,] s o [3,3] [4,4] u Fowad index [5,6] [7,9] Recod Fowad list l [5,6] [7,8] [9,9],2 ;6,3 i u [5,6],3 ;4,9 ;9,6 2,9 ;5,2 ;8,3 i c d n u,4 ;5,2 ;7,9;9,4 l m ,7 ;4,3;6,9;7,2;8,7 h s p q,lin,.66,9 ;2,8;3,4;6,8;7,3;8,8 q,gose,.8 q,lin, q 2, liu,... q 2,goss, q 2, liu,.66 Figue 7: Pobing on tie leaf nodes. i the keywod w m. If so, we use the stoed sting similaity to compute the scoe of this keywod in the quey. Figue 7 shows how we use this method in ou unning example, whee the use types in a keywod quey q = lin, gose. When computing the simila wods of gose, i.e., goss, we inset the quey ID (shown as q ), the patial keywod gose, and the coesponding pefix similaity to its collection of elevant quey keywods. To veify whethe ecod 5 has a wod with a pefix simila to gose, we scan its fowad list. Its thid keywod is goss. We access its coesponding leaf node, and see that the node s collection of elevant quey keywods includes gose. Thus we know that 5 indeed contains a keywod simila to gose, and can etieve the coesponding pefix similaity. Compaison: The time complexity of the fowadlist based method (Method ) is O ( G log( ) ), whee G is the total numbe of simila pefixes of w m and simila complete wods of w i s fo i m, and is the numbe of distinct keywods in ecod. Since the simila pefixes of w m could have ancestodescendant elationships, we can optimize the step of accessing them by consideing the highest ones. The time complexity of the second method is O( T (p) + Q ). smila pefix p of w m The fist tem coesponds to the time of tavesing the subties of simila pefixes, whee T (p) is the subtie ooted at a simila pefix p. The second tem coesponds to the time of pobing the leaf nodes, whee Q is the numbe of quey keywods. Notice that to identify the answes, we need access the inveted lists of complete wods, thus the fist tem can be emoved fom the complexity. 
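To make the two probing strategies above concrete, the following Python sketch contrasts them on a single record. It is illustrative only: the data layout (a forward list as a list of `(keyword_id, weight)` pairs sorted by keyword ID, and similar prefixes as `(lo, hi, sim)` triples giving a trie node's keyword-ID range and its similarity to the query keyword) and all function names are assumptions made for this example, not the authors' implementation. Both functions return the score of Equation 4 under the example ranking function (similarity times weight), or `None` when the record has no similar keyword.

```python
from bisect import bisect_left


def probe_forward_list(forward_list, similar_prefixes):
    """Method 1 (sketch): binary-search the record's forward list for each similar prefix.

    forward_list: [(keyword_id, weight), ...] sorted by keyword_id (assumed layout).
    similar_prefixes: [(lo, hi, sim), ...], the keyword-ID range [lo, hi] of a trie
        node similar to the query keyword and that node's similarity (assumed layout).
    """
    ids = [kid for kid, _ in forward_list]
    best = None
    for lo, hi, sim in similar_prefixes:
        i = bisect_left(ids, lo)              # first keyword ID >= lo
        while i < len(ids) and ids[i] <= hi:  # every keyword ID inside [lo, hi]
            score = sim * forward_list[i][1]
            if best is None or score > best:
                best = score
            i += 1
    return best


def annotate_leaves(leaf_notes, query_id, keyword, similar_prefixes):
    """Method 2, step 1 (sketch): record on each leaf which query keyword it is similar to.

    leaf_notes maps keyword_id -> {(query_id, keyword): max similarity}.
    """
    for lo, hi, sim in similar_prefixes:
        for kid in range(lo, hi + 1):         # leaves in the subtrie of this prefix
            notes = leaf_notes.setdefault(kid, {})
            key = (query_id, keyword)
            notes[key] = max(sim, notes.get(key, 0.0))  # keep the maximal similarity


def probe_leaf_nodes(forward_list, leaf_notes, query_id, keyword):
    """Method 2, step 2 (sketch): scan the forward list and look up each leaf's annotations."""
    best = None
    for kid, weight in forward_list:
        sim = leaf_notes.get(kid, {}).get((query_id, keyword))
        if sim is not None:
            score = sim * weight
            if best is None or score > best:
                best = score
    return best
```

In this sketch Method 1 pays one binary search per similar prefix for every probed record, while Method 2 pays the annotation cost once per query and then a single lookup per keyword of the record, which mirrors the trade-off discussed in the comparison.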
Method 1 is preferred for data sets where records have a lot of keywords, such as long documents, while Method 2 is preferred for data sets where records have a small number of keywords, such as relational tables with relatively short attribute values.

4.3 Efficient Sorted Access

Heap-Based Method: For a query keyword $w$, we want to support sorted access that can access record IDs based on the relevance of $w$ to these records. As $w$ has multiple similar words, we can support sorted access efficiently by building a max heap on the inverted lists of such similar words, as described in Section 3. Notice that, in exact search, each leaf node has the same similarity to $w$; but for fuzzy search, different leaf nodes could have different similarities. Thus, when pushing a record $r$ from an inverted list of a similar word $d$ to the heap, we maintain $r$ together with $F(r, d)$ in the heap. We push/pop the record on the heap with the maximal $F(r, d)$.

Consider the query "icdm li". Figure 8 shows the two heaps for the two keywords. For illustration purposes, for each keyword we also show the virtual merged list of records with their scores, and this list is only partially computed during the traversal of the underlying lists. Each record on a heap has an associated score of this keyword with respect to the query keyword, computed using Equation 4.

Figure 8: Max heaps for the query keywords "icdm" and "li". Each shaded list is merged from the underlying lists. It is virtual since we do not need to compute the entire list.

List Pruning: As there may be a large number of similar words for a query keyword, especially for the partial keyword, it could be expensive to construct a heap on the fly.
We further improve the performance of sorted access on the virtual sorted list $U(w)$ by using the idea of on-demand heap construction, i.e., we want to avoid constructing a heap for all the inverted lists of keywords similar to a query keyword. Suppose $w$ has $t$ similar words. Each push/pop operation on the heap of these lists takes $O(\log(t))$ time. If we can reduce the number of lists on the heap, we can reduce the cost of its push/pop operations. We have two observations about this pruning method. (1) As a special case, if those keywords matching query keywords exactly have the highest relevance scores, this method allows us to consider these records prior to considering other records with mismatching keywords. (2) The pruning can be more powerful if $w$ is the last partial keyword $w_m$, since many of its similar keywords share the same prefix $p$ on the trie.

Consider the query "icdm li". Figure 8 illustrates how we can prune low-score lists and do on-demand heap constructions. The prefix "li" has several similar keywords. Among them, the two words "lin" and "liu" have the highest similarity value to the query keyword, mainly because they have a prefix matching the keyword exactly. We build a heap using these two lists. To compute the top best answer, the lists of "lui", "icdm", and "icdl" are never included in the heap since their upper bounds are always smaller than the scores of popped records before the traversal terminates.

We next introduce how to do list pruning for the max-heap-based methods in fuzzy type-ahead search. Given a keyword $w$, let $d_1, \ldots, d_t$ be its similar words and $L_1, \ldots, L_t$ be the corresponding inverted lists, respectively. We need not use all the inverted lists to build the max heap of $w$. Instead, we use those with higher similarities to $w$ to build the max heap on demand. We first sort these inverted lists based on the similarities of their keywords to $w$; without loss of generality, suppose $Sim(d_1, w) > \ldots > Sim(d_t, w)$. We first construct the max heap using the lists with the highest similarity values and then include other lists on demand.
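Before the pruning rule is detailed, the plain heap-based sorted access that it refines can be sketched as follows. This is a minimal illustration, assuming each similar word comes with its similarity and an inverted list of `(record_id, weight)` pairs already sorted by weight in descending order; the function name and input layout are hypothetical. The generator yields records in non-increasing score order, and since a record is reported only at its first (highest-scoring) occurrence, the reported score is the maximum of Equation 4.

```python
import heapq


def sorted_access(similar_lists):
    """Lazily merge the inverted lists of all similar words of one query keyword.

    similar_lists: [(sim, inverted_list), ...] where inverted_list is
        [(record_id, weight), ...] sorted by weight descending (assumed layout).
    Yields (score, record_id) with score = sim * weight, in non-increasing order.
    """
    heap = []
    for idx, (sim, lst) in enumerate(similar_lists):
        if lst:
            # heapq is a min-heap, so scores are negated to simulate a max heap.
            heapq.heappush(heap, (-sim * lst[0][1], idx, 0))
    seen = set()
    while heap:
        neg_score, idx, pos = heapq.heappop(heap)
        sim, lst = similar_lists[idx]
        record_id = lst[pos][0]
        if pos + 1 < len(lst):
            # Advance the cursor of the list we just popped from.
            heapq.heappush(heap, (-sim * lst[pos + 1][1], idx, pos + 1))
        if record_id not in seen:
            seen.add(record_id)
            yield (-neg_score, record_id)
```

Because the merged list is produced by a generator, only as much of the virtual list is computed as the top-k algorithm actually consumes.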
Suppose $L_i$ is a list not included in the heap so far. We can derive an upper bound $u_i$ on the score of a record from $L_i$ (with respect to the query keyword $w$) using the largest weight on the list and the string similarity $Sim(d_i, w)$. Let $r$ be the top record on the heap, with a score $F(r, w)$. If $F(r, w) \ge u_i$, then this list does not need to be included in the heap, since it cannot have a record with a higher score. Otherwise, this list needs to be included in the heap. Based on this analysis, each time we pop a record from the heap and push a new record, we compare the score of the new record with the upper bounds of those lists not included in the heap so far. Those lists with an upper bound greater than this score need to be included in the heap from now on. Notice that this checking can be done very efficiently by storing the maximal value of these upper bounds, and ordering these lists based on their upper bounds.

The pruning power can be even more significant if the keyword $w$ is the partial keyword $w_m$, since many of its similar keywords share the same prefix $p$ on the trie similar to $w_m$. We can compute an upper bound of the record scores from these lists and store the bound on the trie node $p$. In this way, we can prune the lists more effectively by comparing the value $F(r, w)$ with this upper bound stored on the trie, without needing to compute the bound on the fly.

List Materialization: For fuzzy search, the partial keyword $w_m$ has multiple similar prefixes and each similar prefix has multiple similar words. The max heap of $w_m$ is built on top of the inverted lists of such similar words. Let $d$ be such a similar word. Recall that the value $F(r, w_m, d)$ of a record $r$ on the list of a similar word $d$ with respect to $w_m$ is based on both $W(r, d)$ and $Sim(d, w_m)$. Let $v$ be a materialized node. To use $U(v)$ to replace the lists of $v$'s leaf nodes in the max heap, the following two conditions need to be satisfied:
(1) All the leaf nodes of $v$ have the same similarity to $w_m$.
(2) All the leaf nodes of $v$ are similar to $w_m$, i.e., their similarity to $w_m$ is no less than the threshold $\tau$.
When the conditions are satisfied, the sorting order of the union list $U(v)$ is also the order of the scores of the records on the leaf-node lists with respect to $w_m$.
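Returning to the list-pruning rule described earlier in this section (a list joins the heap only once its upper bound exceeds the score at the top of the heap), a minimal Python sketch is given below. It is an illustration under stated assumptions rather than the authors' code: each inverted list is assumed to be sorted by weight in descending order, so the upper bound $u_i$ is simply the similarity times the first weight, and the name `top_k_with_pruning` and the returned counter of opened lists are hypothetical.

```python
import heapq


def top_k_with_pruning(similar_lists, k):
    """Sorted access with on-demand heap construction (sketch).

    similar_lists: [(sim, inverted_list), ...] where inverted_list is
        [(record_id, weight), ...] sorted by weight descending (assumed layout).
    Returns (results, opened): the first k distinct (score, record_id) pairs in
    non-increasing score order, and the number of lists that entered the heap.
    """
    # Upper bound of each list = similarity * largest weight; order by bound.
    pending = sorted(
        ((sim * lst[0][1], idx) for idx, (sim, lst) in enumerate(similar_lists) if lst),
        reverse=True,
    )
    next_pending = 0
    heap, seen, results = [], set(), []
    opened = 0
    while len(results) < k:
        # Include a pending list only if its bound beats the current heap top.
        while next_pending < len(pending) and (
            not heap or pending[next_pending][0] > -heap[0][0]
        ):
            bound, idx = pending[next_pending]
            heapq.heappush(heap, (-bound, idx, 0))
            next_pending += 1
            opened += 1
        if not heap:
            break  # every list is exhausted
        neg_score, idx, pos = heapq.heappop(heap)
        sim, lst = similar_lists[idx]
        record_id = lst[pos][0]
        if pos + 1 < len(lst):
            heapq.heappush(heap, (-sim * lst[pos + 1][1], idx, pos + 1))
        if record_id not in seen:
            seen.add(record_id)
            results.append((-neg_score, record_id))
    return results, opened
```

For a small k, lists of weakly similar words may never enter the heap, which keeps the heap small and each push/pop cheap; checking only the largest pending bound is enough because the pending lists are kept in bound order.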
A materialized node $v$ that satisfies the two conditions must be a descendant of a similar prefix of the partial keyword $w_m$. We can prove this by contradiction. Suppose node $v$ is not a descendant of any similar prefix of $w_m$. Then node $v$ and its ancestors are not similar prefixes of $w_m$, that is, the leaf nodes of $v$ are not similar keywords of $w_m$. This contradicts the second condition. Thus a materialized node $v$ that satisfies the two conditions must be a descendant of a similar prefix of $w_m$.

Suppose $p_1, p_2, \ldots, p_n$ are the similar prefixes of $w_m$. We check whether their materialized descendants satisfy the two conditions as follows. Consider a materialized node $v$ which has ancestors among $p_1, p_2, \ldots, p_n$. If node $v$ has no descendants that are similar prefixes of $w_m$, $v$ must satisfy the two conditions; otherwise, suppose $p_j$ is a descendant of $v$ that is a similar prefix of $w_m$ and has the largest similarity to $v$ among all such descendants. Without loss of generality, let $p_i$ be an ancestor of $v$ that has the largest similarity with $v$ among all similar prefixes. If $Sim(v, p_j) \le Sim(v, p_i)$, $v$ satisfies the two conditions; otherwise $v$ does not. Thus we can find usable materialized nodes to construct the max heap of $w_m$, and use the techniques we proposed earlier to do a cost-based analysis to select high-quality nodes for materialization.

5. EXPERIMENTS

We implemented our proposed techniques and compared them with existing methods on three real data sets. (1) DBLP: It included computer science publication records. (2) URL: It included million URLs. (3) Enron: It was an email collection. Table 2 shows details of the data.

Table 2: Data sets and index costs.
Data Set                             URL       DBLP      Enron
# of records (millions)                                  .5
Data size                            . GB      5 MB      .4 GB
Avg. # of words/record
# of distinct keywords (millions)
Trie size                            42 MB     3 MB      28 MB
Size of inverted lists               379 MB    83 MB     342 MB

For the DBLP data set, we selected real queries from the logs of our deployed systems and each query contained 6 keywords (footnote 7).
For the other two data sets, we generated queries with keywords randomly selected from the set of words used in the collection. We assumed the letters of a query were typed in one by one. For each keystroke, we measured the time of computing the top-k answers to this query. For exact search, we measured the total running time. For fuzzy search, we measured the time in two steps: in step 1 we computed keywords on the trie similar to the query keywords (using the algorithm described in [3]); in step 2 we found the top-k answers using the inverted lists of these similar keywords. Unless otherwise specified, k =. We compared our method with the state-of-the-art method [3]. We implemented the NRA algorithm described in [6] if we only do sorted access, and the Threshold Algorithm ("TA") if we can do both sorted access and random access. All the indexes were built offline, preloaded, and fully resident in memory during all querying operations. All experiments were run on an Ubuntu Linux machine with an Intel Core processor (X545 3.GHz and 4 GB RAM).

5.1 Exact Search

Sorted Access Only: We implemented the following methods. (1) BinaryProbe [3]: We considered the inverted lists of the complete query keywords, and the union of the inverted lists for the complete keywords of the partial keyword. We chose the shortest list, and for each of its record IDs, we did binary probings on the other lists. (2) NRA(Heap): We implemented the NRA algorithm using the heap-based technique. (3) NRA(Heap+Materialization) (footnote 8): We implemented the NRA algorithm using the heap-and-materialization-based techniques. Figure 9 shows the results on the Enron dataset, which showed that our method improved search efficiency. For instance, for queries with a partial keyword of length 2, NRA(Heap) reduced the query time of BinaryProbe from 28 ms to ms. NRA(Heap+Materialization) further reduced the time to 2 ms. This is because 1) BinaryProbe first computed all results and then ranked them; 2) BinaryProbe computed the union list of the partial keyword on the fly.
NRA(Heap) used the max heap to generate a sorted partial list, and NRA(Heap+Materialization) used materialized lists to save push/pop operations on the heap.

Sorted Access + Random Access: We implemented the following methods. (1) BinaryProbe (Forward List) [3]: We chose the shortest list, and for each of its record IDs, we verified whether the record contained the other keywords using the forward list. (2) TA(Forward List+Heap): We implemented the TA algorithm using the forward list for random access and the max heap for sorted access. (3) TA(Forward List+Heap+Materialization): We implemented the TA algorithm using the forward list, max heap, and list materialization. Figure 10 shows the results on the DBLP dataset. We can see that the random-access techniques indeed improved efficiency.

Footnote 7: Details are omitted due to double-blind review.
Footnote 8: We used additional 5% space with respect to the inverted index for materialization in the experiments.

Figure 9: Exact search using sorted access (Enron). (a) Varying data size; (b) Varying prefix length. Axes: query time (ms) versus # of records and versus length of the prefix keyword. Methods: BinaryProbe, NRA(Heap), NRA(Heap+Materialization).

5.2 Fuzzy Search

Sorted Access Only: We first evaluated the effect of the list-pruning technique. Figure 11 shows the experimental results (including two steps). We can observe that list pruning indeed improved search efficiency. For the Enron dataset with .5M records, the method with pruning can reduce the time from 3 ms to 7 ms. The pruning technique was more effective on the Enron dataset than on the other two datasets mainly due to two reasons. First, the Enron dataset had more trie nodes due to its large number of distinct keywords in the emails. Thus a query keyword can have more similar prefixes on the trie. Second, the Enron dataset had fewer records, and the inverted lists were relatively shorter. During the list traversal, the NRA algorithm visited fewer records, and the higher score of the top record from the max heap helped us prune more lists.

List Materialization: We evaluated the improvement on sorted access using list materialization for fuzzy type-ahead search. We measured the amount of storage space for storing materialized lists as a percentage of the total size of the inverted lists on the trie. We varied this amount, and measured the average time of finding the top answers using the NRA algorithm. Figure 12 shows the results.
We can see that list materialization improved the search performance. We implemented the different methods for list materialization, namely Random, TopDown, BottomUp, and CostBased, as discussed earlier. Figure 13 shows the results. Among the three naive methods, Random gave the best results. The CostBased algorithm outperformed all the naive methods. This is because CostBased selected high-quality nodes for materialization using a cost-based analysis.

Sorted Access + Random Access: We implemented the TA algorithm using the two methods for random access and list pruning for sorted access (described in Section 4). Figure 14 shows the scalability results on the three datasets. The two random-access methods scaled well. Method 2 (probing on trie leaf nodes) outperformed Method 1 (probing on forward lists). This is because for the three data sets, there were many prefixes similar to the partial keyword, and Method 1 needed to consider all similar prefixes for each record on forward lists.

Figure 10: Exact search using random access (DBLP). (a) Varying data size; (b) Varying prefix length. Axes: query time (ms) versus # of records and versus length of the prefix keyword. Methods: BinaryProbe(Forward List), TA(Forward List+Heap), TA(Forward List+Heap+Materialization).

6. RELATED WORK

There are many studies on autocomplete and phrase prediction for user queries [22, 5, 9, 23, 7]. Google instant search was launched to support type-ahead search. It first suggested relevant queries based on user profiles and query logs and then answered the top queries. Chaudhuri et al. [5] studied how to find similar strings interactively as users type in a query string, using an approach similar to that in [3, 2]. They did not study the case where a query has multiple keywords that need list-intersection operations. The search paradigm studied in this paper is different since we support fuzzy, full-text search as users type in queries. Bast et al. proposed techniques to support type-ahead search in their CompleteSearch systems [2, 3, 1]. Another study [9] is about type-ahead search on relational data graphs. Ji et al. [13] developed algorithms for fuzzy type-ahead search.
Our work extends these studies by developing efficient algorithms to support top-k search. Khoussainova et al. [14] proposed to suggest relevant SQL snippets as users type in SQL queries. Li et al. [18] studied how to use SQL to support type-ahead search in databases. Feng et al. [8] studied fuzzy search on XML data.

There have been many studies on supporting fuzzy search (e.g., [, 7, 4,, 24, 6]). However, these algorithms are inefficient for type-ahead search since they have low pruning power for short strings (partial keywords). The experiments in [3, 5] showed that these approaches are not as efficient as trie-based methods for fuzzy type-ahead search. Theobald et al. [25] proposed a heap-based method for query expansion. They used WordNet words and only utilized sorted access. We consider both sorted access and random access.

7. CONCLUSION

In this paper we studied how to efficiently answer top-k queries in type-ahead search. We focused on an index structure with a trie of keywords in a data set and inverted lists of records on the trie leaf nodes. We studied technical challenges when adopting existing top-k algorithms in the literature: how to efficiently support random access and sorted access on inverted lists? We presented two algorithms for supporting random access, and proposed optimization techniques using list pruning and materialization to support sorted access. Our techniques can be easily extended to support large datasets through data partitioning. For example, we have built a system to search on 2 million MEDLINE publication records using two machines.

Acknowledgement. The authors have financial interest in Bimaple Technology Inc., a company currently commercializing some of the techniques described in this publication. Chen Li is partially supported by the NIH grant R2LM43A and the National Natural Science Foundation of China (No. 6292). Guoliang Li, Jiannan Wang, and Jianhua Feng were partly supported by the National Natural Science Foundation of China (No. 634), the National Grand Fundamental Research 973 Program of China (No. 2CB3226), Tsinghua University (No. 2873), and the NExT Research Center funded by MDA, Singapore (No. WBS:R ).
Figure 11: Fuzzy search using list pruning (similarity threshold τ = .6). Panels: (a) URL, (b) DBLP, (c) Enron. Axes: query time (ms) versus # of records. Series: Without Pruning, Pruning, Computing Similar Keywords.

Figure 12: Fuzzy search using list materialization (sorted access only, with list pruning, threshold τ = .6). Panels: (a) URL, (b) DBLP, (c) Enron. Axes: query time (ms) versus additional space / inverted-index size. Series: queries grouped by number of keywords.

Figure 13: Comparison of different materialization methods (similarity threshold τ = .6). Panels: (a) URL, (b) DBLP, (c) Enron. Axes: query time (ms) versus additional space / inverted-index size. Series: TopDown, BottomUp, Random, CostBased.

Figure 14: Fuzzy search with sorted access ("SA") and random access ("RA") (similarity threshold τ = .6). Panels: (a) URL, (b) DBLP, (c) Enron. Axes: query time (ms) versus # of records. Series: SA+RA (Probing on Forward Lists), SA+RA (Probing on Leaf Nodes), SA, Computing Similar Keywords.

8. REFERENCES

[1] H. Bast, A. Chitea, F. M. Suchanek, and I. Weber. Ester: efficient search on text, entities, and relations. In SIGIR, pages , 27.
[2] H. Bast and I. Weber. Type less, find more: fast autocompletion search with a succinct index. In SIGIR, pages , 26.
[3] H. Bast and I. Weber. The CompleteSearch engine: Interactive, efficient, and towards IR & DB integration. In CIDR, pages 88 95, 27.
[4] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, pages 5 6, 26.
[5] S. Chaudhuri and R. Kaushik. Extending autocompletion to tolerate errors. In SIGMOD Conference, pages 77 78, 29.
[6] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS, pages 2 3, 2.
[7] J. Fan, G. Li, and L. Zhou. Interactive SQL query suggestion: Making databases user-friendly. In ICDE, pages , 2.
[8] J. Feng and G. Li. Efficient fuzzy type-ahead search in XML data. IEEE TKDE, 24(5): , 22.
[9] K. Grabski and T. Scheffer. Sentence completion. In SIGIR, pages , 24.
[10] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 49 5, 2.
[11] M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava. Fast indexes and algorithms for set similarity selection queries. In ICDE, pages , 28.
[12] I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv., 4(4), 28.
[13] S. Ji, G. Li, C. Li, and J. Feng. Efficient interactive fuzzy keyword search. In WWW, pages 37 38, 29.
[14] N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu. Snipsuggest: Context-aware autocompletion for SQL. PVLDB, 4():22 33, 2.
[15] K. Kukich. Techniques for automatically correcting words in text. ACM Comput. Surv., 24(4): , 992.
[16] H. Lee, R. T. Ng, and K. Shim. Extending q-grams to estimate selectivity of string matching with low edit distance. In VLDB, pages 95 26, 27.
[17] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages , 28.
[18] G. Li, J. Feng, and C. Li. Supporting search-as-you-type using SQL in databases. IEEE TKDE, 22.
[19] G. Li, S. Ji, C. Li, and J. Feng. Efficient type-ahead search on relational data: a Tastier approach. In SIGMOD Conference, pages , 29.
[20] G. Li, S. Ji, C. Li, and J. Feng. Efficient fuzzy full-text type-ahead search. VLDB J., 2(4):6764, 2.
[21] N. Mamoulis, K. H. Cheng, M. L. Yiu, and D. W. Cheung. Efficient aggregation of ranked inputs. In ICDE, page 72 83, 26.
[22] H. Motoda and K. Yoshida. Machine learning techniques to make computers easier to use. Artif. Intell., 3(2):295 32, 998.
[23] A. Nandi and H. V. Jagadish. Effective phrase prediction. In VLDB, pages 29 23, 27.
[24] J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD Conference, pages 33 44, 2.
[25] M. Theobald, R. Schenkel, and G. Weikum. Efficient and self-tuning incremental query expansion for top-k query processing. In SIGIR, pages ,