Optimizing Result Prefetching in Web Search Engines with Segmented Indices (Extended Abstract)


Ronny Lempel, Shlomo Moran
Department of Computer Science, The Technion, Haifa 32000, Israel
email: rlempel, [email protected]

Abstract

We study the process in which search engines with segmented indices serve queries. In particular, we investigate the number of result pages which search engines should prepare during the query processing phase. Search engine users have been observed to browse through very few pages of results for queries which they submit. This behavior of users suggests that prefetching many results upon processing an initial query is not efficient, since most of the prefetched results will not be requested by the user who initiated the search. However, a policy which abandons result prefetching in favor of retrieving just the first page of search results might not make optimal use of system resources either. We argue that for a certain behavior of users, engines should prefetch a constant number of result pages per query. We define a concrete query processing model for search engines with segmented indices, and analyze the cost of such prefetching policies. Based on these costs, we show how to determine the constant which optimizes the prefetching policy. Our results are mostly applicable to local index partitions of the inverted files, but are also applicable to processing of short queries in global index architectures.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002

1 Introduction

The sheer size of the WWW and the efforts of search engines to index significant portions of it [14] have caused many search engines to partition their inverted index of the Web into several disjoint segments (partial indices). The partitioning of the index impacts the manner in which the engines process queries. Most engines also use some form of query result caching, where results of queries that were served are cached for some time. In particular, query results may be prefetched in anticipation of user requests. Such scenarios occur when the engine retrieves (for a certain query) more results than will initially be returned to the user. We examine efficient prefetching policies for search engines. These policies depend on the architecture of the search engine (which, in turn, affects its query processing scheme) and on the behavior patterns of search engine users.

1.1 Search Engine Users

Users submit queries to search engines. From a user's point of view, an engine answers each query with a linked set of ranked result pages, typically with 10 results per page. All users browse the first page of results (the results deemed by the engine's ranking scheme to be the most relevant to the query), and some scan additional result pages, usually in the natural order in which those pages are presented.

Three studies have analyzed the manner in which users query search engines and view result pages: a study by Jansen et al. [7], based on 51,473 queries submitted to the search engine Excite; a study by Markatos [15], based on about a million queries submitted to Excite; and a study by Silverstein et al. [19], based on about a billion queries submitted to the search engine AltaVista. Three findings which these studies share are particularly relevant to this work:

- The queries submitted to WWW search engines are very short, averaging less than 2.4 terms per query, with over half of the queries containing just one or two terms. These results were reported by both [19] and [7]. While the two studies define query terms somewhat differently, the reported term counts may be loosely interpreted as the number of words per query.
- Users browse through very few result pages. The mentioned studies differ in the reported distribution of page views, but agree that at least 58% of the users view only the first page (the top-10 results), and that no more than 12% of users browse through more than 3 result pages.
- The number of distinct information needs of users is very large, as can be seen from the huge variety of queries submitted to search engines. However, popular queries are repeated many times, and the 25 most popular queries account for over 1% of all queries submitted to the engines.

1.2 Caching and Prefetching of Search Results

It is commonly believed that all major search engines perform some sort of search result caching and prefetching. Caching of results was noted in Brin and Page's description of the prototype of the search engine Google [3] as an important optimization technique of search engines. Markatos [15] demonstrated that caching search results can lead to hit ratios of close to 30%.

In addition to storing results that were requested by users in the cache, search engines may also prefetch results that they predict to be requested shortly. An immediate example is prefetching the second page of results whenever a new query is submitted by a user. Since studies [19, 7] indicate that the second page of results is requested shortly after a new query is submitted in at least 15% of cases, search engines may prepare and cache two (or more) result pages per query.

1.3 Index Structure and Query Processing Models

Inverted indices, or inverted lists/files, are regarded as the most widely applied indexing technique [18, 8, 20, 17, 16], and are believed to be used by the major search engines.
As search engines index hundreds of millions of Web pages [14], the size of their inverted indices is measured in terabytes. Ribeiro-Neto and Barbosa [17] mention three hardware configurations that can handle large digital libraries: a powerful central machine, a parallel machine, or a high-speed network of machines (workstations and high-end desktops). However, when considering the size of the indices which search engines maintain, the growth rate of the Web and the large number of queries which search engines answer each day, using a network of machines is considered to be the most cost-effective and scalable architecture [6, 17]. Such networks operate in a shared-nothing memory organization [18] where each machine has its own processing power (one or several CPUs), its own memory and its own secondary storage. The machines communicate by passing messages via the high-speed network that connects them.

There are two well-studied schemes of partitioning an inverted index across several machines:

- Global index organization. In this scheme, the inverted index is partitioned by terms. Each machine holds posting lists for a distinct set of terms (the terms may be partitioned by lexicographic order, for example). The posting list for term t holds entries for all documents that include t.
- Local index organization. In this scheme, the inverted index is partitioned by documents. Each machine is responsible for indexing a distinct set of documents, and will hold posting lists for all terms that appeared in its set of documents.

Some works, such as those by Ribeiro-Neto and Barbosa [17] and Tomasic and Garcia-Molina [20], have compared the (run-time) efficiency of the above schemes. Parallel generation of a global index has been studied in [18], while a system which crawls the Web and builds a distributed local index was presented in [16]. Cahoon et al. [4] evaluated the computational performance of local indices under a variety of workloads, and Hawking [6] examined scalability issues of local index organizations. The prototype of Google was reported as using global index partitioning [3].
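The difference between the two organizations can be sketched in a few lines of Python. This is only an illustrative toy, not the construction used by any engine cited here: the function names, the whitespace tokenization and the use of a CRC32 hash to assign documents or terms to segments are all assumptions made for the example (the text mentions lexicographic partitioning of terms as one alternative).

```python
import zlib
from collections import defaultdict


def build_local_index(docs, m):
    """Partition by documents: each segment indexes only its own documents."""
    segments = [defaultdict(list) for _ in range(m)]
    for doc_id, text in docs.items():
        seg = zlib.crc32(doc_id.encode()) % m      # assumed: hash the document ID
        for term in set(text.split()):
            segments[seg][term].append(doc_id)     # posting list is local to seg
    return segments


def build_global_index(docs, m):
    """Partition by terms: each segment holds complete posting lists for its terms."""
    segments = [defaultdict(list) for _ in range(m)]
    for doc_id, text in sorted(docs.items()):
        for term in set(text.split()):
            seg = zlib.crc32(term.encode()) % m    # assumed: hash the term
            segments[seg][term].append(doc_id)     # all documents containing term
    return segments
```

With a local index a query must visit every segment, since any segment may hold matching documents; with a global index a single-term query touches exactly one segment.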
Many of the above mentioned works [4, 18, 6, 17, 20] describe essentially the same model for processing queries in systems with segmented indices:

- User queries arrive at a certain designated machine, which we will call the Query Integrator, or QI. This machine was called home site in [20], central broker in [18] and [17], user interface (or UIF) in [6] and connection server in [4].
- The QI issues each query to the separate index segments, in a manner which depends on the partitioning scheme of the index. With local index partitioning, the QI will send the query (as submitted by the user) to all segments. With global index partitioning, the QI sends each segment a partial query consisting only of the set of terms whose posting lists are stored in the segment.
- The QI waits for the relevant segments to return their result sets, and merges these result sets with respect to the system's ranking scheme. Again, the two index partitioning schemes imply different merge operations. With local index partitioning, it is usually assumed that each segment has the ability to calculate the global score of each document in its local index with respect to all queries. Since the result sets that are returned by different segments are disjoint, merging the various result sets is straightforward. With global index partitioning, each (relevant) segment returns a ranked document list that may overlap lists returned by other segments, and where each score reflects only the score of the document with respect to the partial query which that segment received. The QI may need to perform set operations on the partial result sets (for queries containing boolean operators), and might need to weigh the scores returned from each segment differently (for example, according to the different idf values of the terms in each partial query).
- The QI returns the merged results to the users.

We consider a cache-augmented process, in which the QI maintains a query-result cache. Upon receiving a query from a user, the QI first checks if the cache contains results for that query. If so, the cached results are returned to the user, without forwarding the query to any of the segments. If the query cannot be answered from the cache, the QI processes the query as described above, and upon completion, caches the merged results.

1.4 This work

When considering the query processing model described above in the context of Web search engines, we note that merged results are returned to users in small batches (typically 10 at a time), in decreasing order of relevance (as ranked by the search engine). The QI, however, may prepare more results than are to populate the first batch, and cache them for future use. This raises the issue of optimizing the number of prefetched results in systems where the cost of processing uncached queries increases with the number of results that are fetched: prefetching a large number of results per query will be costly at first, but may pay off should the user request additional batches of results (since these will already be cached).
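The cache-augmented process with prefetching can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: segments are modeled as plain callables returning (score, doc_id) pairs sorted best first, scores are globally comparable (local partitioning), each segment is asked for as many results as are needed in total (the worst case in which all top results reside on one segment), and cache replacement is ignored. The class and attribute names are hypothetical.

```python
import heapq
from itertools import islice


class QueryIntegrator:
    """Toy QI over locally partitioned segments (hypothetical interface)."""

    def __init__(self, segments, page_size=10, prefetch_pages=2):
        self.segments = segments              # callables: (topic, l) -> hits, best first
        self.page_size = page_size            # results per page
        self.prefetch_pages = prefetch_pages  # pages prepared on every cache miss
        self.cache = {}                       # (topic, page number) -> list of hits
        self.segment_calls = 0                # counts messages sent to segments

    def result_page(self, topic, k=1):
        if (topic, k) in self.cache:          # cache hit: no segment is contacted
            return self.cache[(topic, k)]
        last = k + self.prefetch_pages - 1    # prepare pages k .. last
        need = last * self.page_size          # worst case: one segment holds them all
        lists = [seg(topic, need) for seg in self.segments]
        self.segment_calls += len(self.segments)
        # result sets are disjoint and sorted, so a k-way merge suffices
        merged = heapq.merge(*lists, key=lambda hit: -hit[0])
        top = list(islice(merged, need))
        for page in range(k, last + 1):
            lo = (page - 1) * self.page_size
            self.cache[(topic, page)] = top[lo:lo + self.page_size]
        return self.cache[(topic, k)]
```

A request for page 2 that follows a request for page 1 is then served from the cache when `prefetch_pages` is at least 2, at the price of more work and more cache space on the first request. For simplicity the sketch recomputes from the top on a miss, rather than resuming below a score threshold.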
Note that with the cost of prefetching we also associate the cache space that is occupied by the prefetched results. Assuming a fixed-size cache, increasing the number of prefetched results per query may decrease the number of queries whose results can be simultaneously cached (footnote 4). This may lead to lower cache hit ratios, and to an increase in the load of the engine.

Footnote 4: The maintenance of the cache is not considered in this work. In particular, we do not examine how cached entries are replaced or how the freshness of the results is maintained.

Another issue arising from the query processing model is the relationship between the number of results which the QI decides to prefetch per query, and the number of results which it should ask of each segment. As an example, consider an engine which uses local partitioning into $m$ segments, and whose policy is to prefetch $n$ results per query. How many top results (denoted by $l$) should the QI collect from each segment for each query? It may happen that all of the top $n$ results reside on a particular segment. Therefore, in order to be certain that indeed all top $n$ results are obtained, it is necessary to collect the top-$n$ results of each segment (setting $l = n$). However, assuming that documents are partitioned randomly and independently into the segments, the QI may be able to collect considerably fewer results from each segment and still, with very high probability, obtain all of the top-$n$ results. Thus, when optimizing the number of prefetched results $n$, the behavior of $l$ with respect to $n$ must also be considered.

The tradeoff between the amount (and cost) of result prefetching and the possibility of serving subsequent queries from the cache is the main topic of this paper. As popular search engines process millions of queries every day, efficient prefetching policies can help reduce both the hardware requirements and the response time of the engines.

The rest of this paper is organized as follows. Section 2 formally presents the problems studied and the notations used throughout this paper (the notations are summarized there in Table 1).
We model both the search engine's query service process and the users' behavior. We then define the cost of prefetching a given number of results in terms of a cost function which is analyzed and optimized in later sections. Section 3 presents an algorithm which optimizes the prefetch cost function for two special cases. The first case deals with inverted indices that fit on a single machine. This single machine scenario also models serving single-term queries (which are quite common on the Web) with a globally-partitioned index. The second special case deals with a scenario where the engine guarantees that the users receive absolutely optimal results, using worst-case assumptions on the distribution of relevant documents in local-index partitions. The main body of work is contained in Section 4, which presents algorithms that solve and approximately solve the optimization problem for locally partitioned indices with an arbitrary number of segments, among which the documents are randomly distributed. Section 5 tackles the combinatorial problem of setting the number of results which should be retrieved from each segment in order to provide quality merged results to the users. Section 6 discusses the practical impact that our results may have on search engine engineering. Conclusions and suggestions for future research are brought in Section 7.

2 Notations and Formal Model

2.1 The User

Our work requires a model for the way search engine users view result pages of their searches. Two studies [19, 7] have reported on several aspects of such user behavior by examining the query logs of search engines. For our purposes, the analysis of AltaVista's log [19] did not report in sufficient detail the exact distribution of result page views (citing percentages of users viewing 1, 2 and 3 pages only). In addition, the statistics reported in that paper only considered requests for additional results which arrived within 5 minutes of the previous request made by the same user. The study of Excite's users [7] brings a more elaborate distribution of result page views per query. 58% of the page views were of the first result page, 19% of the views were of the second result page, and the views of result pages 3-9 (21.3% of the views) conformed to a Geometric distribution with a parameter between 0.288 and 0.427.

We chose to model the number of result pages which users view per query as a Geometric random variable $u \sim G(1-p)$. According to this model, users view result pages in their natural order, and the probability of a user viewing exactly result pages $1, \ldots, k$ (not viewing result pages $k+1$ and beyond) equals $(1-p)p^{k-1}$. In other words, upon viewing a result page, the user requests the next page with probability $p$. An important property of the Geometric distribution is the fact that it is memoryless:

$$\Pr(u \ge s+t \mid u \ge s) = p^t \quad \forall s, t \in \mathbb{N}$$

Assume that the complexity of retrieving ranked results is also "memoryless", meaning that the complexity of retrieving the results that rank in places $j, j+1, \ldots, j+(k-1)$ depends only on the number of results retrieved, $k$. As we will see, this assumption holds when the identity of the result that ranks in place $j-1$ is known.
Then, the memoryless behavior of the users and the memoryless cost of retrieval imply that the optimal number of result pages $r_{opt}$ that should be prefetched for a query is independent of the number of result pages requested so far: any time a query cannot be served from the cache, the QI should prepare the next $r_{opt}$ result pages.

2.2 The Index Architecture and the Complexity of Processing Queries

The model to which we refer in most of this paper is that of a local index partitioning scheme in a shared-nothing network. The index is partitioned among $m$ segments. We assume that documents (URLs) are partitioned into segments by a random process which assigns each document to a segment according to the uniform distribution, and independently of all other documents. Such a partitioning can be achieved by hashing every URL into a fixed-size document ID, and mapping these IDs into segments. Such a scheme was mentioned in [1] in the context of building URL repositories, and the same technique can be applied when assigning pages to the segments of an inverted index. Since the number of documents considered is in the hundreds of millions while $m$ is considerably smaller (much less than the square root of the number of documents), the segments will contain roughly the same number of documents (with high probability).

The query processing model is as described in Section 1.3. Throughout the discussion we consider the processing of a "broad topic" query that matches $C$ documents in each segment, where $C$ is much larger than the number of the results a user will actually browse. Let $A$ denote the number of results which the engine presents in each result page (a typical value is $A = 10$). Since results should be prefetched in page units, the number of prefetched results per query should be a multiple of $A$. In what follows we examine the cost of prefetching $n = rA$ results per query, so that in subsequent sections we will be able to optimize the value of $r$, the number of prefetched result pages.

We will denote a user's query by a pair $(t, k)$, where $t$ is the search topic and $k \ge 1$ is the (ordinal) number of the result page requested.
A query can either start a search of a new topic (and then $k = 1$), or ask for additional results in an existing search ($k > 1$). The following discussion addresses both query types.

Preliminaries

Upon receiving a query $(t, k)$ which cannot be answered from the cache, the QI needs to fetch $n$ results for $t$. The first task is to set the value of $l$, the number of results to retrieve from each of the $m$ segments. Let $B(n)$ denote the set of documents that the engine should ideally retrieve for the query: the $n$ documents that attain the best scores for $t$ (according to the engine's ranking function), out of all documents that have not been retrieved for queries $(t, k')$, $k' < k$. Let $R(l, m)$ denote the set of documents that will be retrieved for the query $t$ when each of the $m$ segments returns its $l$ most relevant (and previously unretrieved) matches for $t$. Ideally, $R(l, m)$ would contain $B(n)$, but ensuring that means setting $l$ to equal $n$ (footnote 5). Instead, we assume that the engine employs the following quality policy, based on a probability $q$: the QI sets the value of $l$ with respect to $n$ such that

$$\Pr[B(n) \subseteq R(l, m)] \ge q$$

Footnote 5: This special case is discussed in Section 3.

In other words, the QI should collect enough (previously unretrieved) quality results from each segment so that with probability $q$, the top-$n$ retrieved results will indeed be the best $n$ (previously unretrieved) results for $t$ in the entire index. The relationship between $n$ and $l$ will be studied in Section 5. For the time being, it suffices to note that by the assumption that documents are uniformly distributed among the segments, the above probability depends only on the values of $m$, $n$ and $l$, and is independent of the topic $t$. Let $\tilde{l}_q(n, m)$ denote the minimal number of documents which should be retrieved by each of the $m$ segments so that the quality criterion is satisfied:

$$\tilde{l}_q(n, m) \triangleq \min\{\, l \mid \Pr[B(n) \subseteq R(l, m)] \ge q \,\}$$

Collecting results

The QI sends each segment the topic $t$ and a request for its $\tilde{l}_q(n, m)$ top results for the query. Whenever $k > 1$ (this is not the first batch of results to be retrieved for $t$), segment $i$ also receives $s_i(t, k-1)$, the score of the lowest ranking document that it had contributed to the results of $(t, 1), \ldots, (t, k-1)$ (footnote 6).

We now estimate the cost of serving such requests. By our assumption, the query matches $C$ documents in each segment, where $C$ is much larger than the number of results users will actually browse through, and consequently is much larger than $\tilde{l}_q(n, m)$ (since $\tilde{l}_q(n, m)$ is bounded by $n$, and $n$ is bounded by the number of results that users browse through). We assume that identifying the $C$-sized set of candidate documents can be done in a time that is linear in $C$. This assumption holds for the inverted index structure when the number of query terms is very small, as is the case with broad topic queries on the Web (see Section 1.1). Recall that each segment receives the score of the lowest ranking document that it has retrieved so far for the query, and can thus discard previously retrieved results from the set of candidates. The top-scoring $\tilde{l}_q(n, m)$ documents of the remaining candidates are then found. Each segment will thus spend $\Theta(C + \tilde{l}_q(n, m) \log C)$ processing steps (per query) in order to return $\tilde{l}_q(n, m)$ sorted results to the QI.
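One way a segment could meet the $\Theta(C + l \log C)$ cost is to build a heap over the candidates in linear time and then extract the $l$ best. The sketch below illustrates that idea only; the function name, the (score, doc_id) representation and the assumption that scores within a segment are distinct (so that a score threshold cleanly separates previously returned results) are choices made for this example and are not specified in the text.

```python
import heapq


def segment_top_l(candidates, l, threshold=None):
    """Return the l best (score, doc_id) pairs, best first.

    `threshold` is the score of the lowest-ranking document this segment has
    already contributed for the topic; candidates at or above it are discarded.
    """
    # Linear pass: drop previously returned results, negate scores for a max-heap.
    heap = [(-score, doc_id) for score, doc_id in candidates
            if threshold is None or score < threshold]
    heapq.heapify(heap)                  # O(C)
    best = []
    while heap and len(best) < l:        # l pops, O(log C) each
        neg_score, doc_id = heapq.heappop(heap)
        best.append((-neg_score, doc_id))
    return best
```

For a follow-up batch, the caller passes the score of the last result it received from this segment as `threshold`, so the segment never needs to remember per-query state.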
Merging results

The QI receives $m$ sorted result sets of length $\tilde{l}_q(n, m)$. Reading and buffering these sets takes $\Theta(m\,\tilde{l}_q(n, m))$ operations. It then partially merges the results until it identifies the top $n = rA$ retrieved results that will populate the $r$ result pages. By using Tree Selection Sorting [12] with the $m$ sorted result lists hanging from the leaves of the tree, the merge can be accomplished in time $\Theta(2m + n \log m)$. The overall complexity of this step is thus $\Theta(m\,\tilde{l}_q(n, m) + 2m + n \log m)$.

Footnote 6: We assume that the results of the query $(t, k-1)$ are still cached when the query $(t, k)$ arrives.

Caching results

The $r$ result pages are cached, and the first of those pages is returned to the user. The scores $s_1(t, k), \ldots, s_m(t, k)$ are also noted. The overall space complexity is thus $\Theta(rA + m)$.

The complexity of the query processing model

Our model requires two messages to be passed between the QI and each of the segments: the QI sends the query to each segment, and each segment returns $\tilde{l}_q(n, m)$ results to the QI. The total number of results received by the QI is $m\,\tilde{l}_q(n, m)$, and this amount of data impacts its time complexity. Had we allowed more rounds of communication, we could have managed by sending the QI only $n + (m-1)$ results, lowering the complexity of the merge step above to $\Theta(n + m + n \log m)$. We chose not to do so since minimizing communication rounds between machines (even at the expense of sending larger messages) is likely to improve performance in distributed computations [6].

Note that the complexity of the retrieval model described above is indeed "memoryless" (see discussion in Section 2.1). The model implies the following computational loads on the various resources of the engine, when following a policy of prefetching $r$ result pages per query:

- The QI performs $\Theta(m\,\tilde{l}_q(rA, m) + 2m + rA \log m)$ computation steps.
- Each index segment performs $\Theta(C + \tilde{l}_q(rA, m) \log C)$ computations.
- The cache space required is $\Theta(rA + m)$.

Additionally, we introduce two non-negative coefficients $\alpha$ and $\beta$ which will allow us to assign different weights to the three resources which are consumed during query processing.
Specifically, $\alpha$ will multiply the computations of the QI and $\beta$ will multiply the cache space required (footnote 7). Tuning the values of $\alpha$ and $\beta$ can emphasize memory (cache) limitations, computational bottlenecks (the QI vs. the segments) and response time per query. More on this in Section 6.2.

We are now ready to formulate $W(r)$, the expected cost (or work) of a policy which prefetches $r$ pages for geometric users with parameter $p$. Result pages $ir+1, ir+2, \ldots, (i+1)r$ will be termed the $i$'th batch of result pages. For ease of notation, we introduce $l_q(r, m) \triangleq \tilde{l}_q(rA, m)$.

Footnote 7: The computational loads were expressed using the $\Theta(\cdot)$ notation. For concreteness and simplicity, we will consider the given expressions as the exact complexities. This allows us to avoid tedious notations, and does not affect the ensuing analysis (and results) of the paper.

$$W(r) = \text{Caching overhead} + \sum_{i=0}^{\infty} \Pr(\text{preparing batch } i) \cdot (\text{batch preparation complexity})$$

$$= \beta(Ar + m) + \sum_{i=0}^{\infty} p^{ir}\left[C + l_q(r,m)\log C + \alpha\left(rA\log m + m\,l_q(r,m) + 2m\right)\right]$$

$$= \beta(Ar + m) + \frac{C + l_q(r,m)\log C}{1-p^r} + \frac{\alpha\left(rA\log m + m\,l_q(r,m) + 2m\right)}{1-p^r}$$

Here batch $i$ (for $i \ge 0$) is prepared only if the user reaches page $ir+1$, which happens with probability $p^{ir}$. Rearranging the terms, and ignoring the constant additive term $\beta m$ (which does not depend on $r$ and will not affect the optimization of $W(r)$), we get

$$W(r) = \beta A r + \frac{(C + 2\alpha m) + (\log C + \alpha m)\,l_q(r,m) + (\alpha A \log m)\,r}{1-p^r}$$

To ease the notation, we define the following constants: $a = \beta A$; $b = C + 2\alpha m$; $c = \log C + \alpha m$ and $d = \alpha A \log m$. With this notation,

$$W(r) = ar + \frac{b + c\,l_q(r,m) + dr}{1-p^r}$$

$C$, the number of documents per segment which match a query, is typically a large number, while $A$ and $m$ are typically much smaller. Thus, when the proportionality constants $\alpha$ and $\beta$ are both about 1, typical values of $b$ are large (tens of thousands and beyond), while $a$, $c$, $d$ are relatively small (typically less than 100).

Our mission: Given an $m$-way locally segmented index, geometric-$p$ users and some quality criterion $q$, determine $r_{opt}$, an integral value of $r$ which minimizes $W(r)$. In doing so, determine $l_q(r_{opt}, m)$. The QI will then prepare $r_{opt}$ result pages whenever a query cannot be answered from the cache, asking each of the segments to retrieve its top $l_q(r_{opt}, m)$ results (that score below a certain threshold) for the query being processed. We will strive to obtain exact or almost exact values of $r_{opt}$ and $l_q(r_{opt}, m)$.

3 Simple Special Cases

In this section we show that the problem for a single segment ($m = 1$) and the problem for multiple segments with $q = 1$ behave similarly, and in both cases $r_{opt}$ can be found in $\Theta(\log r_{opt})$ steps.
When the index is stored in a single segment, we can ignore the terms in the complexity function $W(r)$ which deal with the merging of results from different segments (namely the terms involving $\alpha$). In addition, $l_q(r, 1) = rA$ regardless of $q$'s value. Thus, $W(r)$ becomes:

$$W(r) = ar + \frac{C + (A \log C)\,r}{1-p^r}$$

Note that when an index is partitioned globally (each segment holds posting lists for a distinct set of terms), single-term queries are effectively queries to a single segment as described above. Studies [19, 7] indicate that the percentage of single-term queries on the Web is quite large (25%-30%).

For the case where $q = 1$ we again have $l_1(r, m) = rA$, and the complexity function $W(r)$ takes the following form:

$$W(r) = ar + \frac{b + (cA + d)\,r}{1-p^r}$$

Both cases imply a complexity function of the form

$$W(r) = ar + \frac{b' + d'r}{1-p^r}, \quad b', d' > 0$$

The derivative of $W(r)$ is negative at zero and increases for all $r > 0$. Therefore, $W(r)$ (for positive values of $r$) decreases at first until reaching its (unique) minimal value, and then increases. Relying on this behavior, an optimal integral value of $r$ ($r_{opt}$) can be found by applying the following procedure:

1. Find the minimal natural number $j$ such that $W(2^j) < W(2^{j+1})$.
2. Find an optimal value of $r$, using binary search, in the range $2^{j-1}, \ldots, 2^{j+1}$.

Since $j$ will not exceed $1 + \lceil \log r_{opt} \rceil$, the complexity of finding $r_{opt}$ is $\Theta(\log r_{opt})$.

Table 1: Summary of notations

Symbol          Denotes
a               shorthand for $\beta A$
b               shorthand for $(C + 2\alpha m)$
c               shorthand for $(\log C + \alpha m)$
d               shorthand for $\alpha A \log m$
m               number of segments in the index
p               probability of viewing result page k when viewing page k-1
q               quality criterion of the QI
r               number of result pages to fetch
r_opt           optimal integral value of r
A               number of results per result page
C               number of relevant results per segment
W(r)            work required for fetching r result pages per query
l_q(r,m)        number of results to fetch from each segment so that the best rA results are collected with probability at least q; equals $\tilde{l}_q(rA, m)$
$\alpha$        multiplies the computations of the QI in W(r)
$\beta$         multiplies the required caching space in W(r)
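The two-step procedure of this section can be sketched as follows. The function names are hypothetical, and the "binary search" is realized here as a search for the first $r$ at which $W$ stops decreasing, which relies on the unimodality of $W$ stated above. The parameter values in the usage note are illustrative only.

```python
def w_special(r, a, b, d, p):
    """W(r) = a*r + (b' + d'*r) / (1 - p**r) for the two special cases."""
    return a * r + (b + d * r) / (1.0 - p ** r)


def optimal_r(a, b, d, p):
    """Find an integral minimizer of w_special by doubling, then binary search."""
    W = lambda r: w_special(r, a, b, d, p)
    # Step 1: minimal j with W(2^j) < W(2^(j+1)); the minimum lies below 2^(j+1).
    j = 0
    while W(2 ** j) >= W(2 ** (j + 1)):
        j += 1
    lo = 1 if j == 0 else 2 ** (j - 1)
    hi = 2 ** (j + 1)
    # Step 2: smallest r in [lo, hi] with W(r) <= W(r + 1). Because W first
    # decreases and then increases, that r is a minimizer.
    while lo < hi:
        mid = (lo + hi) // 2
        if W(mid) <= W(mid + 1):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For example, `optimal_r(10, 8202, 150, 0.5)` evaluates $W$ at only a handful of points; the number of evaluations grows logarithmically with the answer, in line with the $\Theta(\log r_{opt})$ bound.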

4 Solution for an $m$-way Segmented Local Index

In this section we study the problem of setting the optimal value of $r$ given the quality criterion $q$ ($q < 1$), the engine's architecture parameters $A$, $C$ and $m$, and the complexity parameters $\alpha$ and $\beta$. Subsection 4.1 presents an algorithm for determining the optimal value of $r$, which minimizes the retrieval complexity function $W(r)$. Subsection 4.2 presents an approximation algorithm, which finds a value of $r$ for which $W(r)$ is approximately optimal.

4.1 Optimizing $r$ in Indices With $m$ Segments

First, recall the complexity function from Section 2.2:

$$W(r) = ar + \frac{b + c\,l_q(r,m) + dr}{1-p^r}$$

Clearly, the behavior of $W(r)$ depends on the behavior of $l_q(r, m)$. While we will show how to precisely calculate $l_q(r, m)$ in Section 5, for the purpose of this subsection it suffices to note that if $r' \ge r$ then $l_q(r', m) \ge l_q(r, m)$. In order to facilitate the search for $r_{opt}$, we now set forth to find, for every value of $r$, an upper bound on the set $\{r' \mid W(r') \le W(r)\}$ (footnote 8).

Definition 1. A function $g(r)$ will be called $W$-restrictive if for all $r' \ge g(r)$, $W(r') > W(r)$.

For example, $g_1(r) \triangleq W(r)/a$ is $W$-restrictive, since for all $r' \ge g_1(r)$, we have $W(r') > ar' \ge a\,g_1(r) = W(r)$. Consequently, $r_{opt}$ is not larger than $g_1(1)$. We will use $W$-restrictive functions to bound our search space for $r_{opt}$. For this we now seek a $W$-restrictive function that is better than $g_1$, providing tighter bounds on the size of the search space. The following Proposition is proved in the full version of this paper:

Proposition 1. The function

$$g(r) = r + \frac{p^r\left(b + c\,l_q(r,m) + dr\right)}{(1-p^r)(a+d)}$$

is $W$-restrictive.

Note that the above function reflects all the architectural parameters of the search engine's index, and also the user's behavior (represented by $p$) and the desired quality criterion $q$. Figure 1 displays Algorithm OP for setting the optimal value of $r$. All of the steps except the calculation of $l_q(r, m)$ in 2(a) are trivial; that calculation is the topic of Section 5.
The correctness of the algorithm follows from the $W$-restrictiveness of $g(r)$ (Proposition 1), since we do not need to iterate through values of $r$ for which $W(r)$ is known to be higher than values we have already seen.

Footnote 8: Since $\lim_{r \to \infty} W(r) = \infty$, the set $\{r' \mid W(r') \le W(r)\}$ is finite for all $r$.

1. Initializations: $W_{min} \leftarrow W(1)$; $r_{opt} \leftarrow 1$; $limit \leftarrow \infty$; $r \leftarrow 1$.
2. While $r < limit$:
   (a) Calculate $l = l_q(r, m)$, and use the value to set $W \leftarrow W(r)$; $g \leftarrow g(r)$.
   (b) If $limit > g$: $limit \leftarrow g$.
   (c) If $W < W_{min}$: $W_{min} \leftarrow W$; $r_{opt} \leftarrow r$.
   (d) $r \leftarrow r + 1$.
3. Print $r_{opt}$.

Figure 1: Algorithm OP for optimizing the prefetch policy

The complexity of the algorithm

Algorithm OP needs to be executed relatively few times, when configuring the prefetching policy of the search engine (see discussion in Section 6.2). Therefore, its own complexity does not impact the performance of the engine. Nevertheless, we now prove that its running time is polynomial. We do so by bounding $r_{max}$, the maximal number of iterations which OP may require throughout its course. For this, let

$$\gamma \triangleq \frac{a + d}{b + cA + d}$$

Note that by our assumptions on the relative values of $a$, $b$, $c$ and $d$ (see Section 2.2), $\gamma$ is a small constant. Since $l_q(1, m) \le A$, we have that $g(1) \le 1 + \frac{p}{(1-p)\gamma}$ is a bound on the number of iterations. Thus, $r_{max}$ is bounded by $1 + \frac{2p}{1-p}$ whenever $\gamma \ge 0.5$, and by $1 + \frac{1}{1-p}$ whenever $p \le \gamma$. Next, we bound $r_{max}$ when $p > \gamma$ and $\gamma < 0.5$.

Lemma 1. $r_{max} \le 3\left\lceil \frac{\log \gamma}{\log p} \right\rceil$ whenever $p > \gamma$ and $\gamma < \frac{1}{2}$.

Proof: Let $r = \left\lceil \frac{\log \gamma}{\log p} \right\rceil$. Then, since $1 > p > \gamma$, $r > 1$ and $p^r = 2^{r \log p} \le 2^{\log \gamma} = \gamma$. Thus,

$$g(r) = r + \frac{p^r}{1-p^r} \cdot \frac{b + c\,l_q(r,m) + dr}{a+d} \le r + \frac{\gamma}{1-\gamma} \cdot \frac{b + c\,l_q(r,m) + dr}{a+d}$$

$$\le r + \frac{\gamma}{1-\gamma} \cdot \frac{b + crA + dr}{a+d} \quad (\text{since obviously } l_q(r,m) \le rA)$$

$$< r + \frac{r}{1-\gamma} = \frac{2-\gamma}{1-\gamma}\,r \le 3r = 3\left\lceil \frac{\log \gamma}{\log p} \right\rceil$$

To complete the analysis of the complexity of algorithm OP for finding $r_{opt}$, we show in Section 5 that calculating the values of $l_q(r, m)$ for all $r \in \{1, \ldots, r_{max}\}$ requires $O(m^2 A^2 r_{max}^2)$ steps (regardless of the value of $q$). Since we have already bounded $r_{max}$ by simple functions of $p$ and $\gamma$, bounds on the complexity of the algorithm follow.

Table 2 brings sample results of the algorithm. For every combination of $m$ and $p$, $r_{opt}$ and $r^{act}_{max}$ (the highest value of $r$ for which $W(r)$ was actually calculated during execution) are shown. Figure 2 plots $W(r)$ as a function of $r$, as calculated during the algorithm for three values of $m$ with $p = 0.5$. For all displayed results, we used $q = 0.99$, $\alpha = \beta = 1$, $A = 10$ and $C = 2^{13}$.

Table 2: $r_{opt}$ ($r^{act}_{max}$) values as a function of $m$ and $p$

m \ p     0.3       0.5       0.7
5         4 (5)     7 (9)     11 (15)
25        4 (5)     6 (8)     12 (14)
50        4 (5)     6 (8)     10 (13)

4.2 Approximating the Optimal Solution

In the previous subsection we have shown how to determine $r_{opt}$, the number of pages which minimizes the complexity function $W(r)$. However, if we are willing to settle for nearly optimal solutions, namely finding values of $r$ for which $\frac{W(r_{opt})}{W(r)} \ge 1 - \epsilon$, for small values of $\epsilon$, we can use the following algorithm:

1. Let $r_{max} = \left\lceil \frac{\log \epsilon}{\log p} \right\rceil$.
2. Find the value of $r$ in the range $\{1, \ldots, r_{max}\}$ which minimizes $W(r)$.

Note that $r_{max}$ depends on the user's behavior (as modeled by $p$) but is independent of the engine's architecture and quality policy (which are modeled by $a$, $b$, $c$, $d$ and $q$). Furthermore, the above algorithm is applicable to any work function $\tilde{W}(r)$ such that $(1-p^r)\tilde{W}(r)$ is an increasing function of $r$. Note that $W(r)$ satisfies this condition, since

$$(1-p^r)\,W(r) = (1-p^r)\,ar + b + c\,l_q(r,m) + dr$$

where $a$, $b$, $c$, $d$ are positive constants, and the functions $(1-p^r)$ and $l_q(r, m)$ are nondecreasing functions of $r$. The correctness of the approximation algorithm relies on the following Proposition.

Proposition 2. Let $W(r)$ be any positive function such that $(1-p^r)W(r)$ is an increasing function of $r$. Let $r, t \in \mathbb{N}$ such that $\frac{W(t)}{W(r)} \ge \frac{1}{1-p^t}$. Then, for all $r' \ge t$, $W(r') > W(r)$.

Proof: Since $(1-p^t)W(t) \ge W(r)$, we have for $r' \ge t$

$$W(r') > (1-p^{r'})\,W(r') \ge (1-p^t)\,W(t) \ge W(r)$$

Corollary 1. Let $0 < s < t$.
The W (s) < W (t) (1,p s ). Proof: Sice (1, p r )W (r) icreases with r, we get W (s) < Corollary 2 For all s, W (s) (1, p t ) < W (t) (1, p s ) ifw (1);:::;W(s)g< W(r opt) 1, p s Proof: If 1 r opt s, the clai holds. Otherwise, the result is iplied by Corollary 1, with t = r opt. Substitutig s = r ax = d log e i the last Corollary yields the approxiatio log p algorith: ifw (1);:::;W(s)g < = W(r opt) 1, p d log log p e W (r opt ) 1, 2 log pd log log p e < W (r opt) 1, Table 3 shows the values of r ax for p = 0:1; 0:2;:::;0:9 ad =0:1;0:01 ad 0:001. As etioed earlier, calculatig the values of l q (r;) for all r 2f1;:::;r ax g requires O( 2 A 2 r 2 ax) coputatioal steps, ad thus the tie coplexity of the approxiatio algorith is O( 2 A 2 l log log p 2). Fially, we ote that the results of this subsectio ay be used i practice to iprove the ruig tie of Algorith OP (gure 1), by checkig (betwee W steps 2(b) ad 2(c)) whether 1 Wi 1,p r, ad settig liit r if so (thus teriatig the algorith). Propositio 2 asserts that all future iteratios with larger values of r will result i greater values of W (r), ad so OP ca safely teriate ad output the curret value of r opt. 5 Calculatig l q (r;) This sectio brigs recursive forulae with which l q (r;) ca be calculated i a tie which is polyoial i ; r ad A. We odel the distributio of the top results i the segets by the followig rado process: = 4 ra dieret balls (the top results for a query) are throw radoly ad idepedetly ito dieret cells (the P segets), where i balls are iserted to cell i ( i=1 i = ). We odel the queryig process by takig ifl; i g balls fro cell i for i = 1;:::;. Deote by e ;;l the uber of excess balls that reai i the cells after the queryig process is copleted. I Sectio 5.1 we calculate the probability that e ;;l = 0. This correspods to the case where o cell

contains more than $l$ balls, so that the QI indeed managed to collect the top $n$ results from the $m$ segments. In the full version of this work we also calculate the probability that $e_{n,m,l} = k$. This corresponds to the case where the QI managed to collect just $n-k$ of the top $n$ results. We study this case since the QI may choose to employ a relaxed quality policy, requiring that with high probability, most (but not necessarily all) of the top $n$ results are returned to the user. Subsection 5.2 briefly reviews previous work on related issues.

Figure 2: $W(r)$ as a function of $r$, for $m = 5, 25$ and $50$ ($p = 0.5$)

Table 3: $r_{max}$ as a function of $p$ and $\epsilon$

We first present a rough bound on $l_q(r,m)$ which may suffice when precise calculations are not essential. Clearly, $n/m$ is a lower bound on $l_q(r,m)$ for all $q > 0$. We show that for $q = 1 - \frac{1}{2^{\kappa}}$, $l_q(r,m)$ need not be larger than $\max\{\lceil \kappa + \log m \rceil, \lceil 2en/m \rceil\}$ (see footnote 9).

The probability that exactly $i$ of the $n$ results are inserted to a given segment is $\binom{n}{i}\left(\frac{1}{m}\right)^{i}\left(1-\frac{1}{m}\right)^{n-i}$. Since $\binom{n}{i} \le \left(\frac{en}{i}\right)^{i}$,
\[ \binom{n}{i}\left(\frac{1}{m}\right)^{i}\left(1-\frac{1}{m}\right)^{n-i} < \left(\frac{en}{i}\right)^{i}\left(\frac{1}{m}\right)^{i} = \left(\frac{en}{im}\right)^{i}. \]
Hence, the probability that more than $\ell$ results are inserted into a given segment is bounded by
\[ \sum_{i=\ell+1}^{n}\left(\frac{en}{im}\right)^{i} < \sum_{i=\ell+1}^{\infty}\left(\frac{en}{\ell m}\right)^{i} = \left(\frac{en}{\ell m}\right)^{\ell}\cdot\frac{\frac{en}{\ell m}}{1-\frac{en}{\ell m}}. \]
Whenever $\ell \ge \max\{\kappa + \log m,\ 2en/m\}$, we have $\frac{en}{\ell m} \le \frac{1}{2}$, so this last expression is bounded from above by
\[ 2^{-\ell} \le 2^{-(\kappa + \log m)} = \frac{1}{m\,2^{\kappa}}. \]
Thus, by the union bound, the probability that at least one of the $m$ segments contains more than $\ell$ results is smaller than $\frac{1}{2^{\kappa}}$. The result follows.

9 Sharper (known) asymptotic bounds on $l_q(r,m)$ are discussed in Section 5.2.

5.1 Precise Calculation of $l_q(r,m)$

We now turn to the precise calculation of $l_q(r,m)$. For this we will calculate the probability $P(n,m,l) = \Pr[e_{n,m,l} = 0]$, the probability of throwing $n$ different balls into $m$ different cells so that no cell contains more than $l$ balls. The size of the problem space is $m^n$. We will actually be counting $N(n,m,l)$, the number of ways to throw $n$ different balls into $m$ different cells so that no

cell contains more than $l$ balls, and then $P(n,m,l) = N(n,m,l)/m^n$.

Table 4: $l_q(r,m)$ for $q = 0.99$, $A = 10$ and various values of $r$ and $m$

The following recursive formulae may be used to calculate the $N(n,m,l)$ values:
\[ N(n+1,m,l) = m\sum_{j=0}^{l-1}\binom{n}{j}\,N(n-j,\,m-1,\,l) \]
\[ N(n,m+1,l) = \sum_{j=0}^{l}\binom{n}{j}\,N(n-j,\,m,\,l) \]
However, the recursion that most naturally fits in Algorithm OP from Section 4.1 is:
\[ N(n,m,l) = \sum_{j=0}^{\min\{m,\lfloor n/l \rfloor\}}\binom{m}{j}\binom{n}{l,\ldots,l,\,n-jl}\,N(n-jl,\,m-j,\,l-1) \]
First, we choose some $j$ cells to have exactly $l$ balls. We then choose the balls to populate those cells (the multinomial coefficient has $j$ $l$-terms). The remaining $n-jl$ balls are distributed to the remaining $m-j$ cells, with each such cell collecting no more than $l-1$ balls. As $r$ grows in subsequent iterations of OP, so will the value of $l_q(r,m)$. This recursion naturally uses results of $N(n,m,l)$ from previous iterations in later iterations.

As for the initial values:

1. For all $m, l$: $N(0,m,l) = 1$. Whenever $n > 0$, $N(n,0,l) = N(n,m,0) = 0$.
2. For all $n > 0$, $m > 0$: whenever $l < \lceil n/m \rceil$, $N(n,m,l) = 0$. Also,
\[ N(n,m,\lceil n/m \rceil) = \binom{m}{k}\,\frac{n!}{\left(\lceil n/m \rceil!\right)^{k}\left(\lfloor n/m \rfloor!\right)^{m-k}}, \quad \text{where } k \triangleq n \bmod m. \]

Denoting by $n_{max} \triangleq r_{max}A$ the value of $n$ in the last iteration of OP (and by $l_{max}$ the value of $l$ found in that iteration), the total time spent calculating values of $l_q(r,m)$ is $\Theta(n_{max}\,m^2\,l_{max}) = O(n_{max}^2\,m^2)$. Table 4 shows sample values of $l_q(r,m)$.

5.2 Previous Work

The stochastic properties of the process which randomly throws $n$ balls into $m$ cells have been studied extensively. Two good references are [13] and [10]. Among the properties studied was the distribution of the maximum number of balls in a cell, which we will denote by $L(n,m)$. For example, for $n \ge m$ (more balls than cells), $L(n,m) = \Theta\!\left(\frac{\ln m}{\ln\lceil 1 + \frac{m}{n}\ln m \rceil} + \frac{n}{m}\right)$ with probability $1-o(1)$ [5]. When $n = m$, $L(n,m)$ behaves asymptotically as $(1+o(1))\frac{\ln n}{\ln\ln n}$ with probability $1-o(1)$ [2]. In [13], the distribution of $L(n,m)$ is examined with regard to the behavior of the ratio $\frac{n}{m \ln m}$ as $n, m \to \infty$. Separate results are obtained for the three cases in which this ratio tends to 0, to a positive constant, and to $\infty$. In [9] it was shown that the distribution of $L(n,m)$ may be approximated by the distribution of $n\cdot\frac{\max_j s_j}{\sum_{j=1}^{m} s_j}$, where each $s_j$ is an independent $\chi^2$ variable with $\frac{2(n-1)}{m}$ degrees of freedom.
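The counting argument of Section 5.1 and the approximation algorithm of Section 4.2 can be tied together in a short sketch. The code below is only an illustration under stated assumptions: all function names are hypothetical, the count $N(n,m,l)$ is computed with the second recursion above (conditioning on the content of one cell) and only the simple initial values $N(0,m,l)=1$ and $N(n,0,l)=N(n,m,0)=0$, and the work function follows the abbreviated form $W(r) = ar + (b + c\,l_q(r,m) + dr)/(1-p^r)$ with constants $a, b, c, d$ supplied by the caller. The constants used in the usage lines are arbitrary placeholders, not values from this paper.

```python
from functools import lru_cache
from math import ceil, comb, log


@lru_cache(maxsize=None)
def count_arrangements(n: int, m: int, l: int) -> int:
    """N(n, m, l): ways to throw n distinct balls into m distinct cells
    so that no cell holds more than l balls."""
    if n == 0:
        return 1                       # N(0, m, l) = 1
    if m == 0 or l == 0 or n > m * l:
        return 0                       # no room for n > 0 balls
    # Condition on the number j of balls that land in the last cell.
    return sum(comb(n, j) * count_arrangements(n - j, m - 1, l)
               for j in range(min(l, n) + 1))


def prob_no_overflow(n: int, m: int, l: int) -> float:
    """P(n, m, l) = N(n, m, l) / m**n."""
    return count_arrangements(n, m, l) / m ** n


def l_q(r: int, m: int, A: int = 10, q: float = 0.99) -> int:
    """Smallest l with P(rA, m, l) >= q (assumes m >= 1 and q <= 1)."""
    n = r * A
    l = -(-n // m)                     # ceil(n / m), the trivial lower bound
    while prob_no_overflow(n, m, l) < q:
        l += 1
    return l


def approx_r_max(p: float, eps: float) -> int:
    """r_max = ceil(log(eps) / log(p)), as in step 1 of Section 4.2."""
    return ceil(log(eps) / log(p))


def approx_optimal_r(work, p: float, eps: float) -> int:
    """Step 2 of Section 4.2: the r in {1, ..., r_max} minimizing work(r)."""
    return min(range(1, approx_r_max(p, eps) + 1), key=work)


def make_work(m, p, a, b, c, d, A=10, q=0.99):
    """Assumed abbreviated form W(r) = a*r + (b + c*l_q(r, m) + d*r) / (1 - p**r)."""
    return lambda r: a * r + (b + c * l_q(r, m, A, q) + d * r) / (1 - p ** r)


if __name__ == "__main__":
    # Placeholder constants, chosen only to exercise the code.
    work = make_work(m=5, p=0.5, a=1.0, b=50.0, c=1.0, d=1.0)
    print(approx_optimal_r(work, p=0.5, eps=0.01))
```

The memoized recursion keeps intermediate counts as exact integers and converts to a floating-point probability only at the end, which avoids rounding problems when $m^n$ is very large.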
6 From Theory to Practice

This section attempts to bridge the gap between theory and practice by highlighting the possible practical implications of our model and results.

6.1 The Complexity Function $W(r)$

We first revisit two assumptions we have made while formalizing $W(r)$ (Section 2). These assumptions pertain to the manner by which users view result pages and to the memoryless query processing scheme.

1. "Users view search result pages according to a memoryless geometric process". While this assumption is extremely simplistic, the studies cited in Section 2.1 indicate that it might reasonably approximate the aggregate behavior of users.
2. "When a request for result page $k$ arrives, result page $k-1$ is still cached". We used this assumption to send each segment the score of the lowest result it had contributed to page $k-1$. This, in turn, allowed us to formulate a memoryless query processing scheme. While ignoring cache management issues in this work, the following consideration justifies the intuition behind this assumption: the aim of any policy that prefetches $r$ pages (numbered $k,\ldots,k+r-1$) when processing a request for result page $k$ of some query is to rapidly answer (from the cache) subsequent requests for pages $k+1,\ldots,k+r-1$ of that query. Thus, the prefetching policy implicitly assumes that the

life expectancy of cached entries will allow page $k+r-1$ to be cached until it is requested. In other words, every policy that prefetches $r$ pages assumes that pages will be cached long enough for $r-1$ subsequent requests. We require pages to be cached for $r$ subsequent requests.

The above assumptions allowed us to formulate an exact complexity function for our concrete query processing model. At the end of Section 2.2, the complexity function was abbreviated to the form
\[ W(r) = ar + \frac{b + c\,l_q(r,m) + dr}{1-p^r}. \]
We claim that this abbreviated form (and our results) can accommodate any retrieval model that incurs the following costs when prefetching $r$ pages:

• Cache space that is linear in $r$, the number of prefetched result pages.
• Retrieval complexity that is the sum of (1) a term that depends on the query's breadth (number of matching results), (2) a term that is linear in $l_q(r,m)$, and (3) a term that is linear in $r$.

Thus, our results may apply to index structures and query processing schemes that differ from our model. Furthermore, the results of Section 4.2 apply to any complexity function $\tilde{W}(r)$ where $(1-p^r)\tilde{W}(r)$ is an increasing function of $r$. Finally, the results of Section 5, where we determined the number of results that should be retrieved from each segment ($l_q(r,m)$), are applicable to any search engine that uses a locally segmented index in which documents are partitioned uniformly and independently.

6.2 Implementing a Prefetching Policy

Implementing a prefetching policy for engines with locally segmented indices in the framework of this research requires the following two preprocessing steps:

• Setting the parameters: an approximate value of $p$ is derived from analyzing the engine's query logs, the parameter $q$ is set according to the quality policy, and the values of the two cost weights are set according to the engine's resources. Systems with small caches should give the cache-space weight a high value; when the QI is heavily loaded, the weight of the QI's work should be set to a high value; etc.
• For a range of query breadths (a range of values for the parameter $C$), an algorithm (either optimizing or approximating $r_{opt}$) is executed.
The QI and each segment are then loaded with tables containing the values of $r_{opt}(C)$ and $l_q(r_{opt}(C), m)$ for values of $C$ in the range.

Upon receiving a query $t$, each segment estimates that query's breadth (the value of $C_t$ that corresponds to $t$). This can be done in two ways:

• Many local index implementations incorporate global term statistics in each segment in order to facilitate term-based scoring [1]. These statistics may help in estimating the breadth of certain types of queries.
• By our assumption, each of the $m$ segments contains approximately the same number of results for broad topic queries (when $C \gg m$). Thus, a segment can process the query, and use the number of matches it finds to estimate $C$.

After estimating $C_t$, the segment forwards $l_q(r_{opt}(C_t), m)$ results to the QI, which merges the retrieved results to produce $r_{opt}(C_t)$ result pages.

7 Conclusions and Future Work

This work examined how search engines should prefetch search results for user queries. We started by presenting a concrete query processing model for search engines with locally segmented inverted indices. We argued that for a model which assumes that the number of result pages that users view is distributed geometrically, the optimal engine policy is to prefetch a constant number of result pages $r$. We expressed the computational cost of a policy that prefetches $r$ pages, and suggested an algorithm for finding the optimal value of $r$ (which minimizes the expected cost). We also suggested how to find values of $r$ which imply policies whose cost is approximately optimal.

Several extensions of this work are the following:

• The model presented in this paper ignores overlaps in the information needs of different users. We did not consider, for example, that popular queries may be submitted by multiple users during a short time span, increasing the probability of at least one user requesting additional results. By taking query popularity into account, we may find that popular queries warrant more result prefetching than rare queries do.
• This work did not address cache replacement policies; in particular, we did not suggest which result pages should be removed from the cache upon prefetching results for a new query. As noted in [11] (in the context of buffering of posting lists), knowledge of the access patterns to the query cache should be considered when setting the replacement policy. For example, users usually browse result pages in their natural order. Thus, assuming that the first two result pages of some query are cached and that one of them must be evicted, it seems natural to remove the second page of results (and not the first, as an LRU policy might suggest).
• Most of the results in this paper are applicable to locally segmented indices. Only single-term

queries to global indices are considered. Additional research is required in order to extend our results to multi-term queries to global indices.

Acknowledgments

We thank Andrei Broder (see footnote 10) and Farzin Maghoul from AltaVista for useful discussions and insights on the problems covered in this paper.

10 Andrei Broder is currently with IBM Research.

References

[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the web. ACM Transactions on Internet Technology, 1(1):2-43.
[2] Yossi Azar, Andrei Z. Broder, Anna R. Karlin, and Eli Upfal. Balanced allocations. SIAM Journal of Computing, 29(1):180-200.
[3] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Proc. 7th International WWW Conference.
[4] Brendon Cahoon, Kathryn S. McKinley, and Zhihong Lu. Evaluating the performance of distributed architectures for information retrieval using a variety of workloads. ACM Transactions on Information Systems, 18(1):1-43.
[5] Artur Czumaj and Volker Stemann. Randomized allocations processes. Proc. 38th IEEE Symposium on Foundations of Computer Science, pages 194-203.
[6] David Hawking. Scalable text retrieval for large digital libraries. First European Conference on Digital Libraries.
[7] Bernard J. Jansen, Amanda Spink, and Tefko Saracevic. Real life, real users, and real needs: A study and analysis of user queries on the web. Information Processing and Management, 36(2):207-227.
[8] Byeong-Soo Jeong and Edward Omiecinski. Inverted file partitioning schemes in multiple disk systems. IEEE Transactions on Parallel and Distributed Systems, 6(2):142-153.
[9] N. L. Johnson and D. H. Young. Some applications of two approximations to the multinomial distribution. Biometrika, 47:463-469.
[10] Norman L. Johnson and Samuel I. Kotz. Urn Models and their Application. John Wiley & Sons, Inc.
[11] Björn Þór Jónsson, Michael J. Franklin, and Divesh Srivastava. Interaction of query evaluation and buffer management for information retrieval. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, Seattle, Washington, USA, pages 118-129, June 1998.
[12] Donald E. Knuth. The Art of Computer Programming, Volume 3. Addison-Wesley Publishing Company Inc.
[13] Valentin F. Kolchin, Boris A. Sevast'yanov, and Vladimir P. Chistyakov. Random Allocations. V. H. Winston & Sons.
[14] Steve Lawrence and C. Lee Giles. Searching the world wide web. Science, 280, April.
[15] Evangelos P. Markatos. On caching search engine query results. Proceedings of the 5th International Web Caching and Content Delivery Workshop, May.
[16] Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. Building a distributed full-text index for the web. Proc. 10th International WWW Conference.
[17] B. Ribeiro-Neto and R. Barbosa. Query performance for tightly coupled distributed digital libraries. In Proc. ACM Digital Libraries Conference, 1998, pages 182-190.
[18] B. A. Ribeiro-Neto, J. P. Kitajima, G. Navarro, C. R. G. Sant'Ana, and N. Ziviani. Parallel generation of inverted files for distributed text collections. Proc. 18th International Conference of the Chilean Computer Science Society.
[19] Craig Silverstein, Monika Henzinger, Hannes Marais, and Michael Moricz. Analysis of a very large AltaVista query log. Technical Report, Compaq Systems Research Center, October.
[20] A. Tomasic and H. Garcia-Molina. Performance of inverted indices in shared-nothing distributed text document information retrieval systems. In Proc. Second International Conference on Parallel and Distributed Information Systems, pages 8-17.