Type Less, Find More: Fast Autocompletion Search with a Succinct Index


Holger Bast
Max-Planck-Institut für Informatik
Saarbrücken, Germany
bast@mpi-inf.mpg.de

Ingmar Weber
Max-Planck-Institut für Informatik
Saarbrücken, Germany
iweber@mpi-inf.mpg.de

ABSTRACT

We consider the following full-text search autocompletion feature. Imagine a user of a search engine typing a query. Then with every letter being typed, we would like an instant display of completions of the last query word which would lead to good hits. At the same time, the best hits for any of these completions should be displayed. Known indexing data structures that apply to this problem either incur large processing times for a substantial class of queries, or they use a lot of space. We present a new indexing data structure that uses no more space than a state-of-the-art compressed inverted index, but that yields an order of magnitude faster query processing times. Even on the large TREC Terabyte collection, which comprises over 25 million documents, we achieve, on a single machine and with the index on disk, average response times of one tenth of a second. We have built a full-fledged, interactive search engine that realizes the proposed autocompletion feature combined with support for proximity search, semi-structured (XML) text, subword and phrase completion, and semantic tags.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Indexing Methods; H.3.3 [Content Analysis and Indexing]: Retrieval Models; H.5.2 [User Interfaces]: Theory and Methods

General Terms
Algorithms, Design, Experimentation, Human Factors, Performance, Theory

Keywords
Autocompletion, Empirical Entropy, Index Data Structure

1. INTRODUCTION

Autocompletion is a widely used mechanism to get to a desired piece of information quickly and with as little knowledge and effort as possible.
One of its early uses was in the Unix Shell, where pressing the tabulator key gives a list of all file names that start with whatever has been typed on the command line after the last space. Nowadays, we find a similar feature in most text editors, and in a large variety of browsing GUIs, for example, in file browsers, in the Microsoft Help suite, or when entering data into a web form. Recently, autocompletion has been integrated into a number of (web and desktop) search engines like Google Suggest or Apple's Spotlight. We discuss more applications in Section 1.2.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. Copyright 2006 ACM.

In the simpler forms of autocompletion, the list of completions is simply a range from a (typically precomputed) list of words. For the Unix Shell, this is the list of all file names in all directories listed in the PATH variable. For the text editors, this is the list of all words entered into the file so far (and maybe also words from related files). In Google Suggest, completions appear to come from a precompiled list of popular queries. For these kinds of applications we can easily achieve fast response times by two binary or B-tree searches in the (pre)sorted list of candidate strings.

More advanced forms of autocompletion take into account the context in which the to-be-completed word has been typed. The problem we propose and discuss in this paper is of this kind. The formal problem definition will be given in Section 2. More informally, imagine a user of a search engine typing a query.
Then with every letter being typed, we would like an instant display of completions of the last query word which would lead to good hits. At the same time, the best hits for any of these completions should be displayed. All this should preferably happen in less time than it takes to type a single letter.

For example, assume a user has typed conference sig. Promising completions might then be sigir, sigmod, etc., but not, for example, signature, assuming that, although signature by itself is a pretty frequent word, the query conference signature leads to only few good hits. See Figure 1 for a screenshot of our search engine responding to that query. For a live demo, see [...].

Figure 1: A screenshot of our search engine for the query conference sig searching the English Wikipedia. The list of completions and hits is updated automatically and instantly after each keystroke, hence the absence of any kind of search button. The number in parentheses after each completion is the number of hits that would be obtained if that completion were typed. Query words need not be completed, however, because the search engine does an implicit prefix search: if, for example, the user continued typing conference sig proc, completions and hits for proc, e.g., proceedings, would be from the 185 hits for conference sig.

1.1 Our results

We have developed a new indexing data structure, named HYB, which uses no more space than a state-of-the-art compressed inverted index, and which can respond to autocompletion queries as described above within a small fraction of a second, even for collection sizes in the Terabyte range. Our main competitor in this paper is the inverted index, referred to as INV in the following. Other data structures that could be directly applied to our problem either use a lot of space or have other limitations; we discuss these in Section 1.2.

We give a rigorous mathematical analysis of HYB and INV with respect to both space usage and query processing times. Our analysis accurately predicts the real behavior on our test collections. Concerning space usage, we define a notion of empirical entropy [11] [22], which captures the inherent space complexity of an index independent of a particular compression scheme. We prove that the empirical entropy of HYB is essentially equal to that of INV, and we find that the actual space usage of our implementation of the two index structures is indeed almost equal, for each of our three test collections. Concerning processing times, we give a precise quantification of the number of operations needed, from which we derive bounds for the worst, best, and average-case behavior of INV and HYB. We also take into account the different latencies of sequential and random access to data [1].

We compare INV and HYB on three test collections with different characteristics. One of our collections has been (semi-)publicly searchable over the last year, so that we have autocompletion queries from real users for it. Our largest collection is the TREC Terabyte benchmark with over 25 million documents [7]. On all three collections and on all the queries we considered, HYB outperforms INV by a factor of [...] in worst-case query processing time, and by a factor of 3-10 in average-case query processing time. In absolute terms, HYB achieves average query processing times of one tenth of a second or less on all collections, on a single machine and with the index on disk (and not in main memory).
We have built a full-fledged search engine that supports autocompletion queries of the described kind combined with support for proximity/phrase search, XML tags, subword and phrase completion, and category information. All of these extensions are described in Section 4.

1.2 Related work

The autocompletion feature as described so far is reminiscent of stemming, in the sense that by stemming, too, prefixes instead of full words are considered [23]. But unlike stemming, our autocompletion feature gives the user feedback on which completions of the prefix typed so far would lead to highly ranked documents. The user can then assess the relevance of these completions to his or her search desire, and decide to (i) type more letters for the last query word, e.g., in the query from Figure 1, type i and r so that the query is then conference sigir, or to (ii) start with the next query word, e.g., type a space and then proc, or to (iii) stop searching as, e.g., the user was actually looking for one of the hits shown in Figure 1. There is no way to achieve this by a stemming preprocessing step, because there is no way to foresee the user's intent. This kind of user interaction is well known to improve retrieval effectiveness in a variety of situations [21].

While our autocompletion feature is for the purpose of finding information, autocompletion has also been employed for the purpose of predicting user input, for example, for typing messages with a mobile phone, for users with disabilities concerning typing, or for the composition of standard letters [6] [14] [20] [8] [15]. In [12], contextual information has been used to select promising extensions for a query. Paynter et al. have devised an interface with a zooming-in property on the word level, and based on the identification of frequent phrases [18]. We get a related feature by the subword/phrase-completion mechanism described in Section 4.4.
Our autocompletion problem is related to but distinctly different from multi-dimensional range searching problems, where the collection consists of tuples (of some fixed dimension, for example, pairs of word prefixes), and queries are asking for all tuples that match a given tuple of ranges [10] [2] [4] [13]. These data structures could be used for our autocompletion problem, provided that we were willing to limit the number of query words. For fast processing times, however, the space consumption of any of these structures is on the order of $N^{1+d}$, where $N$ is the size of an inverted index, and $d > 0$ grows (fast) with the dimension. For our autocompletion queries, we can achieve fast query processing times and space efficiency at the same time because we have the set of documents matching the part of the query before the last word already computed (namely when this part was being typed). In a sense, our autocompletion problem is therefore a 1½-dimensional range searching problem.

Finally, there is a large variety of alternatives to the inverted index in the literature. We have considered those we are aware of with regard to their applicability to our autocompletion problem, but found them either unsuitable or inferior to the inverted index in that respect. For example, approaches that consider document by document are bound to be slow due to a poor locality of access; in contrast, both INV and HYB are mostly scanning long lists; see Section 3. Signature files were found in [24] to be in no way superior to the inverted index in all major respects, while being significantly more complicated. Suffix arrays and related data structures address the issue of full substring search, which is not what we want here (but see Section 4.4); a direct application of a data structure like [11] would have the same efficiency problems as INV, whereas multi-dimensional variants like [10] require super-linear space, as explained above.

2. FORMAL PROBLEM DEFINITION AND DEFINITION OF EMPIRICAL ENTROPY

The following definition of our autocompletion problem takes neither positional information nor ranking of the completions or of the documents into account. We will first, in Section 3, analyze our data structures for this basic setting. In Section 4, we then show how to generalize the data structures and their analysis to cope with positional information, ranking, and a number of other useful enhancements. This generalization will be straightforward.

Definition 1. An autocompletion query is a pair $(D, W)$, where $W$ is a range of words (all possible completions of the last word which the user has started typing) and $D$ is a set of documents (the hits for the preceding part of the query).
To process the query means to compute the subset $W' \subseteq W$ of words that occur in at least one document from $D$, as well as the subset $D' \subseteq D$ of documents that contain at least one of these words.

For our example conference sig, $D$ is the set of all documents containing a word starting with conference (computed when the last letter of this word was typed), and $W$ is the range of all words from the collection starting with sig. For queries with only a single word, e.g., confer, $D$ is simply the set of all documents.

To analyze the inherent space complexity of INV and HYB independently of the specialties of a particular compression scheme, we introduce a notion of empirical entropy. Both INV and HYB are essentially a collection of (multi)sets and sequences. The following definition gives a natural notion of entropy for each such building block, and for arbitrary combinations of them (similar definitions have been made in [11] [22]). The reader might first want to skip the following definition and come back to it when it is first used in the analysis that follows.

Definition 2. We define empirical entropy for the following entities, where $H(p_1, \ldots, p_l) = \sum_{i=1}^{l} (-p_i \log_2 p_i)$ is the $l$-ary entropy function.

(a) For a subset of size $n'$ with elements from a universe of size $n$, the empirical entropy is $n \cdot H(n'/n, 1 - n'/n)$ (include each element of the universe into the subset with probability $n'/n$), which is
$$n' \cdot \log_2 \frac{n}{n'} + (n - n') \cdot \log_2 \frac{n}{n - n'}.$$

(b) For a multisubset of size $n'$ with elements from a universe of size $n$, the empirical entropy is $(n + n') \cdot H(n'/(n + n'), n/(n + n'))$ (consider a bitvector of size $n + n'$, and let a bit be 0 with probability $n'/(n + n')$ and 1 otherwise; the prefix sums at the 0-bits give the multisubset), which is
$$n' \cdot \log_2 \frac{n + n'}{n'} + n \cdot \log_2 \frac{n + n'}{n}.$$

(c) For a sequence of $n$ elements from a universe of size $l$, where the $i$th element occurs $n_i$ times ($n_1 + \cdots + n_l = n$), the empirical entropy is $n \cdot H(n_1/n, \ldots, n_l/n)$ (for each position, pick element $i$ with probability $n_i/n$), which is
$$n_1 \cdot \log_2 \frac{n}{n_1} + \cdots + n_l \cdot \log_2 \frac{n}{n_l}.$$

(d) For a collection of $l$ entities with empirical entropies $H_1, \ldots, H_l$, the empirical entropy is simply $H_1 + \cdots + H_l$.

3. INV, HYB, AND THEIR ANALYSIS

In this section we will describe INV and HYB, and analyze them with respect to their empirical entropy and their processing time for autocompletion queries according to Definition 1. Query processing times will be quantified in terms of all relevant parameters; from this we can easily derive worst-case, best-case, and average-case bounds. Our average-case bounds make simplifying assumptions on the distribution of words in the documents, but nevertheless turn out to predict the actual behavior quite well. Implementation issues and the actual performance of our implementations of INV and HYB will be discussed in Section 5. We briefly comment on index construction times in Section 3.3.

3.1 The inverted index (INV)

The inverted index is the data structure of choice for most search applications: it is relatively easy to implement and extend by other features, it can be compressed well, it is very efficient for short queries, and it has an excellent locality of access [23]. In this paper, by INV we mean the following data structure: for each word store the list of all (ids of) documents containing that word, sorted in ascending order. We do not consider enhancements such as skip pointers [17], which we would expect to give similar benefits for both INV and HYB, however at the price of an increased space usage.

In the following, we first estimate the inherent space efficiency (empirical entropy) of INV. We then analyze the time complexity of processing autocompletion queries with INV, and point out two inherent problems.

Lemma 1. Consider an instance of INV with $n$ documents and $m$ words, and where the $i$th word occurs in $n_i$ distinct documents (so that $\sum_{i=1}^{m} n_i$ is the total number of word-in-document pairs). Let $H_{inv}$ be the empirical entropy according to Definition 2. Then
$$H_{inv} \le \sum_{i=1}^{m} n_i \cdot \left( \frac{1}{\ln 2} + \log_2 \frac{n}{n_i} \right),$$
and for all collections considered in this paper (where most $n_i$ are much smaller than $n$) this bound is tight up to 2%.

Proof. According to Definition 2 (a) and (d), we have
$$H_{inv} = \sum_{i=1}^{m} \left( n_i \cdot \log_2 \frac{n}{n_i} + (n - n_i) \cdot \log_2 \frac{n}{n - n_i} \right).$$
To prove the lemma, it suffices to observe that because $1 + x \le e^x$ for any real $x$,
$$(n - n_i) \cdot \log_2 \frac{n}{n - n_i} = \frac{n - n_i}{\ln 2} \cdot \ln \left( 1 + \frac{n_i}{n - n_i} \right) \le \frac{n_i}{\ln 2}.$$
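To get a feel for how close the bound of Lemma 1 is to the exact entropy, the two formulas can be evaluated numerically. The following is a minimal Python sketch, not the authors' code; the function names, the list lengths, and the collection size of 1000 documents are hypothetical values chosen only for illustration.

```python
import math

def subset_entropy(n_sub, n):
    """Definition 2(a): empirical entropy (in bits) of a subset of size n_sub
    drawn from a universe of size n."""
    if n_sub == 0 or n_sub == n:
        return 0.0  # an empty or full subset carries no information
    return (n_sub * math.log2(n / n_sub)
            + (n - n_sub) * math.log2(n / (n - n_sub)))

def inv_entropy(list_lengths, n):
    """Definition 2(d): H_inv is the sum of the per-list subset entropies."""
    return sum(subset_entropy(n_i, n) for n_i in list_lengths)

def lemma1_bound(list_lengths, n):
    """Upper bound of Lemma 1: sum of n_i * (1/ln 2 + log2(n/n_i))."""
    return sum(n_i * (1 / math.log(2) + math.log2(n / n_i))
               for n_i in list_lengths if n_i > 0)

# Hypothetical toy instance: four inverted lists in a collection of 1000 documents.
lengths = [8, 2, 4, 2]
n_docs = 1000
h_inv = inv_entropy(lengths, n_docs)
bound = lemma1_bound(lengths, n_docs)
print(f"H_inv = {h_inv:.1f} bits, Lemma 1 bound = {bound:.1f} bits")
```

On such sparse lists the two values should lie very close together, which is the regime the lemma describes as "most $n_i$ much smaller than $n$".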

Lemma 1 tells us that if the documents in each list were picked uniformly at random, then a Golomb-encoding of the gaps [23] from one document id to the next (for list $i$, the expected size of a gap would be $n/n_i$) would achieve a space usage very close to $H_{inv}$ bits. In our implementation, we opted to encode gaps with the Simple-9 encoding from [3], which is easy to implement, yet achieves very fast decompression speeds at the price of only a moderate loss in compression efficacy; details are reported in Section 5.

Lemma 2. With INV, an autocompletion query $(D, W)$ can be processed in the following time, where $D_w$ denotes the inverted list for word $w$:
$$|D| \cdot |W| + \sum_{w \in W} |D_w| + \sum_{w \in W} |D \cap D_w| \cdot \log |W|.$$
Assuming that the elements of $W$, $D$, and the $D_w$ are picked uniformly at random from the set of $m$ words and the set of $n$ documents, respectively, this bound has an expected value of
$$|D| \cdot |W| + |W| \cdot \frac{N}{m} + \frac{|D| \cdot |W|}{n} \cdot \frac{N}{m} \cdot \log |W|.$$

Remark. By picking the elements of a set $S$ at random from a set $U$, we mean that each subset of $U$ of size $|S|$ is equally likely for $S$. We are not making any randomness assumption on the sizes of $W$, $D$, and $D_w$ above.

Proof sketch. The obvious way to use an inverted index to process an autocompletion query $(D, W)$ is to compute, for each $w \in W$, the intersections $D \cap D_w$. Then, $W'$ is simply the set of all $w$ for which the intersection was non-empty, and $D'$ is the union of all (non-empty) intersections. The intersections can be computed in time linear in the total input volume $\sum_{w \in W} (|D| + |D_w|)$ (see footnote 1). The union can be computed by a $|W|$-way merge, which requires on the order of $\log |W|$ time per element scanned. With the randomness assumptions, the expected size of $D_w$ is $N/m$, and the expected size of $D \cap D_w$ is $|D|/n \cdot N/m$.

Lemma 2 highlights two problems of INV. The first is that the term $|D| \cdot |W|$ can become prohibitively large: in the worst case, when $D$ is on the order of $n$ (i.e., the first part of the query is not very discriminative) and $W$ is on the order of $m$ (i.e., only few letters of the last query word have been typed), the bound is on the order of $n \cdot m$, that is, quadratic in the collection size.
The second problem is due to the required merging. While the volume $\sum_{w \in W} |D \cap D_w|$ will typically be small once the first query word has been completed, it will be large for the first query word, especially when only few letters have been typed. As we will see in Section 5, INV frequently takes seconds for some queries, which is quite undesirable in an interactive setting, and is exactly what motivated us to develop a more efficient index data structure.

3.2 Our new data structure (HYB)

The basic idea behind HYB is simple: precompute inverted lists for unions of words. Assume an autocompletion query $(D, W)$, where the union of all lists for word range $W$ has been precomputed. We would then get $D'$ with a single intersection (of $D$ with the precomputed list). However, from this precomputed list alone we can no longer infer the set $W'$ of completions leading to a hit. Since $W$ can be an arbitrary word range, it is also not clear which unions should be precomputed, especially when we do not want to use more space than an (optimally compressed) inverted index.

Footnote 1: There are asymptotically faster algorithms for the intersection of two lists [5], but in our experiments, we got the best results with the simple linear-time intersect, which we attribute to its compact code and perfect locality of access.

The analysis given in this section suggests the following approach: group the words in blocks so that the lengths of the inverted lists in each block sum to (approximately) $c \cdot n$, for some constant $c < 1$ (we will later choose $c \approx 0.2$). For each block, store the union of the covered inverted lists as a compressed multiset, using an effective gap encoding scheme just as done for INV (repetitions of the same element in the multiset correspond to a gap of zero). In parallel to each multiset, for each element $x$ store the id of the word that led to the inclusion of (this occurrence of) $x$ in the multiset. This gives a sequence of word ids, the length of which is exactly the size of the multiset.
Encode these word ids with code length (approximately) $\log_2 \big( (\sum_{j=1}^{l} n_j) / n_i \big)$ for the $i$th word, where $n_i$ is the number of documents containing the $i$th word, and $l$ is the number of words in the respective block.

Here is an example. Let one of the blocks comprise four words A, B, C, and D, with inverted lists

A : 3, 5, 6, 8, 9, 11, 12, 15
B : 5, 11
C : 3, 7, 11, 13
D : 3, 8

We would then like to store, in compressed form, the multiset (of document ids) and the sequence (of word ids)

3 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
A C D A B A C A D A  A  B  C  A  C  A

The optimal encoding of the words A, B, C, D would use code lengths $\log_2(16/8) = 1$, $\log_2(16/2) = 3$, $\log_2(16/4) = 2$, $\log_2(16/2) = 3$, respectively, for example A = 0, B = 110, C = 10, D = 111. An optimal encoding of the four gaps 0, 1, 2, 3 that occur in the above multiset of document ids would be 0, 10, 110, 111, respectively. What we actually store are then the two bit vectors (where the | are solely for better readability; the codes in this example are prefix-free)

111|0|0|110|0|10|10|10|0|10|110|0|0|10|10|110
0|10|111|0|110|0|10|0|111|0|0|110|10|0|10|0

Note that due to the two different encodings the two lists end up having different lengths in compressed form (here 30 and 28 bits), and this is also what will happen in reality.

The following analysis will make very clear that (i) one should choose blocks of equal list volume (and not, for example, of equal number of words), (ii) this volume should be a small but substantial fraction of the number of documents (and neither smaller nor larger), and (iii) the lists of document ids should be gap-encoded while the lists of word ids should be entropy-encoded.

As for the space usage, we will first derive a very tight estimate of the entropy of HYB, and then show that, somewhat surprisingly, if we only choose the block volume to be a small enough fraction of the number of documents, the entropy of HYB is almost exactly that of INV. We will then show how HYB, when the blocks are chosen of sufficiently large volume, can be used to process autocompletion queries in time linear in the number of documents, for any reasonable word range.
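The four-word example above can be replayed mechanically. The following Python sketch is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, ties within a document are broken by word order, and the two codebooks are simply the example codes quoted in the text (a real index would use Simple-9 instead, as described later).

```python
import math

def build_block(inverted_lists):
    """Merge the inverted lists of one block into a multiset of document ids
    and a parallel sequence of word ids (sorted by document id, then word)."""
    pairs = sorted((doc, word)
                   for word, docs in inverted_lists.items()
                   for doc in docs)
    return [doc for doc, _ in pairs], [word for _, word in pairs]

def to_gaps(doc_ids):
    """Differences between consecutive ids; a repeated id yields a gap of zero."""
    gaps, previous = [], 0
    for doc in doc_ids:
        gaps.append(doc - previous)
        previous = doc
    return gaps

def encode(symbols, codebook):
    """Concatenate prefix-free codes, with '|' inserted only for readability."""
    return "|".join(codebook[s] for s in symbols)

block = {
    "A": [3, 5, 6, 8, 9, 11, 12, 15],
    "B": [5, 11],
    "C": [3, 7, 11, 13],
    "D": [3, 8],
}
doc_ids, word_ids = build_block(block)
gaps = to_gaps(doc_ids)

# Ideal code length per word: log2(block volume / list length).
volume = len(doc_ids)
code_lengths = {w: math.log2(volume / len(docs)) for w, docs in block.items()}

# Example codes taken from the text above.
word_codes = {"A": "0", "B": "110", "C": "10", "D": "111"}
gap_codes = {0: "0", 1: "10", 2: "110", 3: "111"}
gap_bits = encode(gaps, gap_codes)
word_bits = encode(word_ids, word_codes)
print(gap_bits)
print(word_bits)
```

The two printed strings correspond to the document-id and word-id bit vectors of the example; stripping the separators shows that they differ in length.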
Since HYB essentially scans long lists, without the need for any merging, except when the word range is huge, it also has an excellent locality of access.

Lemma 3. Consider an instance of HYB with $n$ documents and $m$ words, where the $i$th word occurs in $n_i$ documents, and where for each block the sum of the $n_i$ with $i$ from that block is $c \cdot n$, for some $c > 0$. Then the empirical entropy $H_{hyb}$, defined according to Definition 2, satisfies
$$H_{hyb} \le \sum_{i=1}^{m} n_i \cdot \left( \frac{1 + c/2}{\ln 2} + \log_2 \frac{n}{n_i} \right),$$
and the bound is tight as $c \to 0$.

Proof. Consider a fixed block of HYB, and let $n_i$ denote the number of documents containing the $i$th word belonging to that block. Throughout this proof, let $\sum_i n_i$ denote the sum over all these $i$ (so that the sum over all $\sum_i n_i$ from all blocks gives the $\sum_{i=1}^{m} n_i$ from the lemma). According to Definition 2 (b), (c), and (d), the empirical entropy of this block is then
$$\sum_i n_i \cdot \log_2 \frac{n + \sum_i n_i}{\sum_i n_i} + n \cdot \log_2 \frac{n + \sum_i n_i}{n} + \sum_i n_i \cdot \log_2 \frac{\sum_i n_i}{n_i}.$$
Now adding the first and the last term, the arguments of the logarithms partially cancel out (!), and we get
$$\sum_i n_i \cdot \log_2 \frac{n + \sum_i n_i}{n_i} + n \cdot \log_2 \frac{n + \sum_i n_i}{n}.$$
Now using that, by assumption, $\sum_i n_i = c \cdot n$, we obtain
$$\sum_i n_i \cdot \left( (1 + 1/c) \cdot \log_2 (1 + c) + \log_2 \frac{n}{n_i} \right).$$
Since $(1 + 1/c) \cdot \ln(1 + c) \le 1 + c/2$ for all $c > 0$ (not obvious, but true), we can upper bound this (tightly, as $c \to 0$) by
$$\sum_i n_i \cdot \left( \frac{1 + c/2}{\ln 2} + \log_2 \frac{n}{n_i} \right).$$
This bounds the empirical entropy of a single block of HYB (the sum goes over all words from that block). Adding this over all blocks gives us the bound claimed in the lemma.

Comparing Lemma 3 with Lemma 1, we see that if we let the blocks of HYB be of volume at most $c \cdot n$, for some small fraction $c$, then the empirical entropy of HYB is essentially that of an inverted index. In Section 4.2, we will see that when we take positional information into account, the empirical entropy of HYB actually becomes less than that of INV, for any choice of block volumes.

In our implementation of HYB, we compress the lists of document ids by a Simple-9 encoding of the gaps, just as described for INV above. For the lists of word ids, entropy-optimal compression could be achieved by arithmetic encoding [23], but for efficiency reasons, we compress word ids as follows: assuming that the word frequencies in a block have a Zipf-like distribution, it is not hard to see that a universal encoding with $\log x$ bits for number $x$ [17] of the ranks of the words, if sorted in order of descending frequency, is entropy-optimal, too.
We again opted for Simple-9 encoding of these ranks, which gives us a reasonable compression and very fast decompression speed, without the need for any large codebook. We take block sizes as $n/5$, but also take word/prefix boundaries into account such that frequent prefixes like pro, com, the get a block on their own. This is to avoid that a query unnecessarily spans more than one block.

Lemma 4. Using HYB with blocks of volume $N'$, autocompletion queries $(D, W)$ can be processed in the following time, where $D_w$ is the inverted list for word $w$:
$$\sum_{w \in W} |D_w| \cdot \left( 1 + |D| / N' \right) + \sum_{w \in W} |D \cap D_w| \cdot \log \left( \sum_{w \in W} |D_w| / N' \right).$$
For $N' = \Theta(n)$ and $|W| \le m \cdot n / N$, and assuming that the elements of $D$, $D_w$, and $W$ are picked uniformly at random from the set of all $n$ documents or all $m$ words, respectively, the expected processing time is bounded by $O(n)$.

Proof sketch. According to Definition 1, we have to compute, given $(D, W)$, the set $W'$ of words from $W$ contained in documents from $D$, as well as the set $D'$ of documents containing at least one such word. For each block $B$, a straightforward intersection of the given $D$ with the list of document-word pairs from $B$ gives us the set $W'_B$ of all words from $W'$ that lie in block $B$, as well as the set $D'_B$ of all documents from $D'$ which contain a word from $B$. From these, $D'$ can be computed by a $k$-way merge, where $k$ is the number of blocks that contain a word from $W$, and $W'$ can be computed by a simple linear-time sort into $|W|$ buckets (because $W$ is a range). The number $k$ of blocks is $\sum_{w} |D_w| / N'$, which is $O(1)$ in expectation, given the randomness assumptions stated in the lemma.

3.3 Index construction time

While getting from a collection of documents (files) to INV is essentially a matter of one big external sort [23], HYB does not require a full inversion of the data. For our experiments, however, we built the compressed indices for both INV and HYB from an intermediate fully inverted text version of the collection, which takes essentially the same time for both.

4. EXTENSIONS

In this section, we describe a number of extensions of the basic autocompletion facility we have described and analyzed so far.
The first (ranking) is essential for practical usability, the second (proximity search) greatly widens the spectrum of search tasks for which autocompletion can be useful, and the others (support for XML tags, subword and phrase completion, and semantic tags) give advanced search facilities to the expert searcher.

4.1 Ranking

So far, we have considered the following problem (from Definition 1): while the user is typing a query, compute after each keystroke the list of all completions of the last query word that lead to at least one hit, as well as the list of all hits that would be obtained by any of these completions. In practice, only a selection of items from these lists can and will be presented to the user, and it is of course crucial that the most relevant completions and hits are selected.

A standard approach for this task in ad-hoc retrieval is to have a precomputed score for each word-in-document pair, and when a query is being processed, to aggregate these scores for each candidate document, and return documents with the highest such aggregated scores [23]. Both INV and HYB can be easily adapted to implement any such scoring and aggregation scheme: store by each word-in-document pair its precomputed score, and when intersecting, aggregate the scores. A decision has to be made on how to reconcile scores from different completions within the same document. We suggest the following: when merging the intersections (which gives the set $D'$ according to Definition 1), compute for each document in $D'$ the maximal score achieved for some completion in $W'$ contained in that document, and compute for each completion in $W'$ the maximal score achieved for a hit from $D'$ for this completion.

Asymptotically, the inclusion of ranking does not affect the time bounds derived in Lemmas 2 and 4, and our experiments show that ranking never takes more than half of the total query processing time; see Section 5.4. The increase in space usage depends on the selected scoring scheme, and is the same for INV and HYB.
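The max-based reconciliation suggested above can be sketched in a few lines of Python. This is only an illustration: the function name, the triple layout, and the scores are hypothetical, and each score is assumed to be already aggregated with the score of the preceding query part, which the text leaves to the chosen scoring scheme.

```python
def rank(scored_pairs, top_k=3):
    """scored_pairs: (doc_id, completion, score) triples produced while merging
    the intersections. Returns the top_k documents and top_k completions, each
    ranked by the maximal score it achieves."""
    best_for_doc = {}
    best_for_completion = {}
    for doc, word, score in scored_pairs:
        # Per document: best score over all completions it contains.
        best_for_doc[doc] = max(score, best_for_doc.get(doc, float("-inf")))
        # Per completion: best score over all hits it leads to.
        best_for_completion[word] = max(
            score, best_for_completion.get(word, float("-inf")))
    # Sort by descending score; break ties by id or word for determinism.
    top_docs = sorted(best_for_doc,
                      key=lambda d: (-best_for_doc[d], d))[:top_k]
    top_words = sorted(best_for_completion,
                       key=lambda w: (-best_for_completion[w], w))[:top_k]
    return top_docs, top_words

# Hypothetical scored word-in-document pairs for the query "conference sig".
pairs = [
    (1, "sigir", 0.9),
    (1, "sigmod", 0.4),
    (2, "sigmod", 0.7),
    (3, "signature", 0.1),
]
top_docs, top_words = rank(pairs, top_k=2)
print(top_docs, top_words)
```

Because only a running maximum per document and per completion is kept, this adds a constant amount of work per scanned pair, in line with the remark that ranking does not change the asymptotic bounds.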
It is for these reasons that we factored out the ranking aspect from our basic Definition 1 and from our space and time complexity analysis in Section 3.

4.2 Proximity/Phrase searches

With a properly chosen scoring function, such as BM25, mere ranking by score aggregation often gives very satisfactory precision/recall behavior [19]. There are many queries, however, where the decisive cue on whether a particular document is relevant or not lies in whether certain of the query words occur close to each other in that document. See [16] for a recent positive result on the use of proximity information in ad-hoc retrieval.

Our autocompletion feature increases the benefits of a proximity operator, because the use of this operator will strongly narrow down the list of completions displayed to the user, which in turn makes it easier for the user to filter out irrelevant completions. For example, when searching the Wikipedia collection the most relevant completion for the non-proximity query max pl would be place (because max and place are both frequent words), but for the proximity query max..pl it is planck. Here the two dots .. indicate that words should occur within x words of each other, for some user-definable parameter x.

It is not hard to extend both INV and HYB to support proximity search: in the document lists (INV and HYB) as well as in the word lists (HYB only), we duplicate each entry as many times as it occurs in the corresponding document, and store the positions in a parallel array of the same size. Word and document lists are compressed just as before, and the lists of positions are gap-encoded by Simple-9, just like the lists of document ids. The intersection routine is adapted to consider a proximity window as an additional parameter.

As we will see in Section 5.3, the position lists increase the index size by a factor of 4-5, for both INV and HYB (without any kind of stopword removal). We can extend our analysis from Section 3 to predict this factor as follows.
If we replace $n_i$, the number of documents containing the $i$th word, by $N_i$, the total number of occurrences of the $i$th word, and $n$, the number of documents, by $N$, the total number of word occurrences, we can show that (details omitted)
$$H'_{hyb} \le H'_{inv} \le \sum_{i=1}^{m} \left( N_i / \ln 2 + N_i \cdot \log_2 (N / N_i) \right),$$
where $H'_{inv}$ and $H'_{hyb}$ denote the empirical entropy of INV and HYB, respectively, with positional information. That is, with positional information, HYB is always more space-efficient than INV, irrespective of how we divide into blocks. It can be shown that $N_i \cdot \log_2 (N / N_i) \approx 2 N_i \cdot \log_2 (n / n_i)$, and since on average a word occurs about 2-3 times in a document, this is just 4-5 times $n_i \cdot \log_2 (n / n_i)$, which was the corresponding term in the entropy bound for INV or HYB without positional information (Lemmas 1 and 3).

4.3 Semistructured (XML) text

Many documents contain semantic information in the form of tag pairs. We briefly sketch how we can make good use of such tags in our autocompletion scenario. Assume that in the archive of a mailing list, the subject of a mail is enclosed in a <subject>...</subject> tag pair. We can then easily implement an operator = (in a way very similar to our implementation of the proximity operator), such that for a query subject=sig only those completions of sig are displayed which actually occur in the subject line of a mail, and only such documents are displayed as hits.

4.4 Completion to subwords and phrases

Another simple yet often useful extension to the basic autocompletion feature is to consider as potential matches not only the words as they occur in the collection, but also meaningful subwords and phrases. An example involving a subword: for the query normal..vec we might want to see eigenvector as one of the relevant completions. An example involving a phrase: for the query max planck we might want to see the phrase max planck institute as one of the relevant completions. It is not hard to see that the autocompletion according to Definition 1 will automatically provide this feature if only we add the corresponding subwords/phrases to the index.
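The proximity operator of Section 4.2, which the tag operator of Section 4.3 is said to resemble, relies on an intersection that also checks positions. The following Python sketch shows one possible way to do this and is not the authors' routine: it uses a per-document lookup with binary search instead of the linear-time merge described in the paper, and the function name and the posting lists are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

def proximity_intersect(left, right, window):
    """left, right: lists of (doc_id, position) postings sorted by document id
    and position. Returns the postings of `right` that lie within `window`
    positions of some posting of `left` in the same document."""
    positions = defaultdict(list)
    for doc, pos in left:
        positions[doc].append(pos)  # stays sorted because `left` is sorted
    result = []
    for doc, pos in right:
        candidates = positions.get(doc)
        if not candidates:
            continue  # document does not match the first query part at all
        # First left position that is >= pos - window ...
        i = bisect_left(candidates, pos - window)
        # ... matches if it is also <= pos + window.
        if i < len(candidates) and candidates[i] <= pos + window:
            result.append((doc, pos))
    return result

# Hypothetical postings: occurrences of "max" and of completions of "pl".
max_postings = [(1, 5), (2, 3), (2, 40)]
pl_postings = [(1, 6), (1, 30), (2, 20), (3, 1)]
close = proximity_intersect(max_postings, pl_postings, window=2)
print(close)
```

With a window of 2, only the occurrence in document 1 at position 6 survives, since it is the only one within two positions of an occurrence of the first query word.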
4.5 Category information

Our autocompletion feature can be combined with a number of other technologies that enhance the semantics of a corpus. To give just one more example here, assume we have tagged all conference names in a collection. Then assume we duplicate all conference names in the index, with an added prefix of, say, conf:, e.g., conf:sigir. By the way our autocompletion works, for the query seattle conf: we would then get a list of all names of conferences that occur in documents that also mention Seattle.

5. EXPERIMENTS

We implemented both INV and HYB in compressed format, as described in Sections 3.1 and 3.2. Each index is stored in a single file with the individual lists concatenated and an array of list offsets at the end. The vocabulary (which is the same for INV as for HYB) is stored in a separate file. All our code is in C++. All our experiments were run on a Dual Opteron machine, with 2 Intel Xeon 3 GHz processors, 8 GB of main memory, and running Linux. We ensured that the index was not cached in main memory.

5.1 Test collections

We compared the performance of INV and HYB on three collections of different characteristics. The first collection is a mailing-list archive plus several encyclopedias on homeopathic medicine. This collection has been searchable via our engine over the past year by an audience of several hundred people. The second collection consists of the complete dumps of the English and German Wikipedia from December 2005 (search.mpi-inf.mpg.de/wikipedia). The third collection is the large TREC Terabyte collection [7], which served as a stress test for our index structures (and for the authors as well). Details about all three collections are given in Table 1, where the raw size of a collection is the total size of the original, uncompressed files in their original formats (e.g., HTML or PDF).

5.2 Queries

For the Homeopathy collection, we picked 5,732 maximal queries (that is, queries which are not a true prefix of another query) from a fixed time slice of our query log for that collection.
From each of these maximal queries, a sequence of autocompletion queries was generated by typing the query from left to right, with a minimal prefix length of 3. For example, the maximal query acidum phos gives rise to the 6 autocompletion queries aci, acid, acidu, acidum, acidum pho, and acidum phos. For the Wikipedia collection, autocompletion queries were generated in the same manner from a set of 100 randomly generated queries, with a distribution of the number of query words and of the term frequency similar to that of the real queries for the Homeopathy collection. For the Terabyte collection, autocompletion queries were generated, again in the same way but with a minimal prefix length of 4, from the (stemmed) 50 ad-hoc queries of the Robust Track Benchmark [7], e.g., squirrel control protect. For all three collections, we removed queries containing words that had no completion at all in the respective collection.
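The query generation just described can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' script: the function names are hypothetical, and the treatment of words shorter than the minimal prefix length (they are emitted once, in full) is an assumption, since the text does not say how such words were handled.

```python
def maximal_queries(query_log):
    """Keep only queries that are not a proper prefix of another query.

    After sorting, every query that extends q follows q directly, so it
    suffices to compare each query with its successor.
    """
    unique = sorted(set(query_log))
    maximal = []
    for i, query in enumerate(unique):
        has_extension = i + 1 < len(unique) and unique[i + 1].startswith(query)
        if not has_extension:
            maximal.append(query)
    return maximal

def autocompletion_queries(maximal_query, min_prefix_len=3):
    """Simulate typing a query from left to right.

    For each word, emit one query per prefix of length >= min_prefix_len,
    preceded by all fully typed earlier words.
    """
    words = maximal_query.split()
    queries = []
    for k, word in enumerate(words):
        typed_so_far = words[:k]
        # Assumption: a word shorter than min_prefix_len is emitted once, in full.
        start = min(min_prefix_len, len(word))
        for length in range(start, len(word) + 1):
            queries.append(" ".join(typed_so_far + [word[:length]]))
    return queries

print(maximal_queries(["acidum", "acidum phos", "arnica"]))
print(autocompletion_queries("acidum phos"))
```

For the maximal query acidum phos and a minimal prefix length of 3, the second function yields exactly the six autocompletion queries listed in the text.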

For Homeopathy and Wikipedia, all queries were run as proximity queries (using a full positional index according to Section 4.2), while for Terabyte, they were executed as ordinary document-level queries. For all collections, both completions and hits were ranked as described in Section 4.1 (details of the aggregation function are omitted here). Each autocompletion query was processed according to Definition 1, e.g., for acidum pho, we compute all completions of pho that occur in a document which also contains a word starting with acidum, as well as the set of all such documents. The result for each autocompletion query is remembered in a history, so that we do not need to recompute the set of documents matching the first part of the query. E.g., when processing acidum pho, we can take the set of documents matching acidum from the history; see the explanation following Definition 1.

The autocompletion queries with minimal prefix lengths, like aci and acidum pho for Homeopathy, are the most difficult ones. All other queries can be easily processed by what we call filtering. For example, both the completions and hits for the query acidum phos can be obtained by retrieving the list of matching word-in-document pairs for the previously processed query acidum pho from the history, and by keeping, in a linear scan over that list, only those pairs where the word starts with phos. In practice, this is always faster than processing such queries as full autocompletion queries according to Definition 1. Note that this filtering is identical for INV and HYB. We nevertheless include the filtered queries in our experiments, because in reality we will always get a mix of both kinds of queries. Table 3 will provide figures for just the difficult (unfiltered) queries. We remark that the history is useful also for caching purposes, but in our experiments we used it solely for the purpose of filtering.
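The filtering step described above amounts to a single pass over a cached result list. The sketch below illustrates the idea only: the history layout (a dictionary from query string to a list of (word, document id) pairs), the function name, and the sample data are all hypothetical and are not taken from the paper.

```python
def filter_from_history(history, previous_query, last_word_prefix):
    """Answer a longer query by scanning the result of a shorter one.

    `history` maps an already processed query to its list of
    (word, doc_id) pairs. Only pairs whose word starts with the longer
    prefix are kept; completions and hits are read off the kept pairs.
    """
    kept = [(word, doc_id)
            for (word, doc_id) in history[previous_query]
            if word.startswith(last_word_prefix)]
    completions = sorted({word for word, _ in kept})
    hits = sorted({doc_id for _, doc_id in kept})
    return kept, completions, hits

# Hypothetical cached result for the query "acidum pho".
history = {
    "acidum pho": [
        ("phosphoricum", 3),
        ("phosphorus", 3),
        ("phosphoricum", 7),
        ("photo", 9),
    ]
}

# Processing "acidum phos" by filtering instead of a full index lookup.
kept, completions, hits = filter_from_history(history, "acidum pho", "phos")
history["acidum phos"] = kept  # remember the result for the next keystroke
print(completions, hits)
```

Because the scan never touches the index, its cost depends only on the size of the cached list, which is why this step is the same for INV and HYB.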
5.3 Index space

Table 1 shows that INV and HYB use essentially the same space on all three test collections, and that HYB is slightly more compact than INV for a full positional index. This is exactly what Lemmas 1 and 3, and the derivation in Section 4.2 predicted! The sizes for both INV and HYB exceed that predicted by the empirical entropy by about 50%. This is due to our use of the Simple-9 compression scheme, which trades very fast decompression time for about this increase in space usage [3]. A combination of Golomb and arithmetic encoding would give us a space usage closer to the empirical entropy. However, decompression would then become the computational bottleneck for almost all queries, see Table 3. We remark that, by the way we did our analysis, any new compression scheme with an improved compression ratio/decompression speed profile would immediately yield a corresponding improvement for both INV and HYB.

5.4 Query processing time

Table 2 shows that in terms of query processing time, HYB outperforms INV by a large margin on all collections. With respect to maximum processing time, which is especially critical for an interactive application, the improvement is by a factor of [...]. With respect to average processing time, which is critical for throughput in a high-load scenario, the improvement is by a factor of [...].

Table 3 gives interesting insights into where exactly INV loses against HYB. The table shows a breakdown of the running times of those queries for the Terabyte collection which were not answered by filtering as discussed above. (Note that the breakdown of the filtered queries would be identical for both methods.) The table differentiates between 1-word queries like squi, squir, etc. and multi-word queries like squirrel contr or squirrel control prot. For the 1-word queries, no intersections have to be computed for either INV or HYB.
Collection       Homeopathy          Wikipedia            Terabyte
Raw size         452 MB              7.4 GB               426 GB
#documents       44,015              2,866,503            25,204,103
#words           263,817             6,700,119            25,263,176
#items           12 [27] million     0.3 [0.8] billion    3.5 billion
Vocabulary       2.9 MB              73 MB                239 MB
Entropy          6.6 [13.1] bits     9.1 [14.0] bits      8.4 bits
INV index size   13 [70] MB          0.5 [2.2] GB         4.6 GB
 -per item       9.3 [21.5] bits     12.8 [23.2] bits     11.0 bits
HYB index size   14 [62] MB          0.5 [2.0] GB         4.9 GB
 -per item       9.4 [19.2] bits     13.0 [20.7] bits     11.6 bits
 -per doc        3.9 [15.4] bits     4.3 [14.8] bits      5.9 bits
 -per word       5.5 [3.8] bits      8.7 [5.9] bits       5.7 bits

Table 1: Properties of our three test collections, and the space consumption of INV versus HYB. The entries in square brackets are for a full positional index, without any word whatsoever removed.

Collection    Method   mean   90%   99%   max
Homeopathy    INV
              HYB
Wikipedia     INV
              HYB
Terabyte      INV
              HYB

Table 2: Average, 90%-ile, 99%-ile and maximum processing times in seconds for INV versus HYB on our three test collections.

According to Lemma 2, the merging of the intersections then dominates for INV, and this indeed shows in the first column of Table 3. For multi-word queries, the result volume $\sum_{w \in W} |D \cap D_w|$ (Lemmas 2 and 4) goes down, and, according to Lemma 2, the intersection costs dominate for INV, which shows in the third column of Table 3. In contrast, columns two and four demonstrate that HYB achieves a better balance of the costs for reading, uncompressing, and intersecting, and none of these essential operations becomes the bottleneck. HYB avoids merging altogether since, by construction, the potential completions from the given word range $W$ always lie within a single block. The read time of HYB is about 50% larger than that of INV, because HYB always reads a whole block of size $\Theta(n)$, even for small word ranges. This also partially explains why HYB spends more time decompressing than INV; the other factor is that decompression of the word ids is more expensive than decompression of the document ids. As we remarked in Section 4.1, the absolute time for ranking is

the same for both methods. Ranking takes more time on average for the 1-word queries, because these tend to have larger result sets. The comparison with the time needed for the maintenance of the history, which is nothing but memory allocation and copying, shows that all of HYB's operations are essentially fast list scans.

Query size      1-word                     multi-word
Index type      INV          HYB           INV          HYB
average time    ... secs     ... secs      ... secs     ... secs
read            .024 7%      ... %         .032 1%      ... %
decompress      .011 3%      ... %         .023 1%      ... %
intersect                                  ... %        ... %
merging         ... %                      .010 .4%
ranking         ... %        ... %         .007 .3%     .007 4%
history         ... %        ... %         .062 3%      ... %

Table 3: Breakdown of average processing times for INV and HYB, for the difficult (unfiltered) queries on Terabyte.

6. CONCLUSIONS

We have introduced an autocompletion feature for full-text search, and presented a new compact indexing data structure for supporting this feature with very fast response times. We have built a full-fledged search engine around this feature, and we have given arguments why we believe it to be practically useful. Given the interactivity of this engine, the next logical step following this work would be to conduct a user study for verifying that belief. We also see potential for a further speed-up of query processing time by applying techniques from top-k query processing [9], in order to display the most relevant hits and completions without first computing and ranking all of them.

7. ACKNOWLEDGEMENTS

Many thanks to our mentor David Grossman for his encouragement and many valuable comments.

8. REFERENCES

[1] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9).
[2] S. Alstrup, G. S. Brodal, and T. Rauhe. New data structures for orthogonal range searching. In 41st Symposium on Foundations of Computer Science (FOCS '00).
[3] V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Information Retrieval, 8.
[4] L. Arge, V. Samoladas, and J. S. Vitter. On two-dimensional indexability and optimal range search indexing. In 18th Symposium on Principles of Database Systems (PODS '99).
[5] R. Baeza-Yates.
A fast set intersection algorithm for sorted sequences. Lecture Notes in Computer Science, 3109.
[6] S. Bickel, P. Haider, and T. Scheffer. Learning to complete sentences. In 16th European Conference on Machine Learning (ECML '05).
[7] C. L. A. Clarke, N. Craswell, and I. Soboroff. The TREC terabyte retrieval track. SIGIR Forum, 39(1):25.
[8] J. J. Darragh, I. H. Witten, and M. L. James. The reactive keyboard: A predictive typing aid. IEEE Computer, pages 41-49.
[9] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66(4).
[10] P. Ferragina, N. Koudas, S. Muthukrishnan, and D. Srivastava. Two-dimensional substring indexing. Journal of Computer and System Science, 66(4).
[11] P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM, 52(4).
[12] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revisited. In 10th International World Wide Web Conference (WWW10).
[13] V. Gaede and O. Günther. Multidimensional access methods. ACM Computing Surveys, 30(2).
[14] K. Grabski and T. Scheffer. Sentence completion. In 27th Conference on Research and Development in Information Retrieval (SIGIR '04).
[15] M. Jakobsson. Autocompletion in full text transaction entry: a method for humanized input. In Conference on Human Factors in Computing Systems (CHI '86).
[16] D. Metzler, T. Strohman, H. Turtle, and W. B. Croft. Indri at TREC 2004: Terabyte track. In 13th Text Retrieval Conference (TREC '04).
[17] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4).
[18] G. W. Paynter, I. H. Witten, S. J. Cunningham, and G. Buchanan. Scalable browsing for large collections: A case study. In 5th Conference on Digital Libraries (DL '00).
[19] S. E. Robertson, S. Walker, M. M. Beaulieu, M. Gatford, and A. Payne. Okapi at TREC-4. In 4th Text Retrieval Conference (TREC '95), pages 73-96.
[20] T. Stocky, A. Faaborg, and H. Lieberman.
A commonsense approach to predictive text entry. In Conference on Human Factors in Computing Systems (CHI '04).
[21] E. M. Voorhees. Query expansion using lexical-semantic relations. In 17th Conference on Research and Development in Information Retrieval (SIGIR '94).
[22] H. Williams and J. Zobel. Compressing integers for fast file access. Computer Journal, 42(3).
[23] I. H. Witten, T. C. Bell, and A. Moffat. Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition. Morgan Kaufmann.
[24] J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted files versus signature files for text indexing. ACM Transactions on Database Systems, 23(4), 1998.