Type Less, Find More: Fast Autocompletion Search with a Succinct Index


Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany
Ingmar Weber, Max-Planck-Institut für Informatik, Saarbrücken, Germany

ABSTRACT
We consider the following full-text search autocompletion feature. Imagine a user of a search engine typing a query. Then with every letter being typed, we would like an instant display of completions of the last query word which would lead to good hits. At the same time, the best hits for any of these completions should be displayed. Known indexing data structures that apply to this problem either incur large processing times for a substantial class of queries, or they use a lot of space. We present a new indexing data structure that uses no more space than a state-of-the-art compressed inverted index, but that yields an order of magnitude faster query processing times. Even on the large TREC Terabyte collection, which comprises over 25 million documents, we achieve, on a single machine and with the index on disk, average response times of one tenth of a second. We have built a full-fledged, interactive search engine that realizes the proposed autocompletion feature combined with support for proximity search, semi-structured (XML) text, subword and phrase completion, and semantic tags.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Indexing Methods; H.3.3 [Content Analysis and Indexing]: Retrieval Models; H.5.2 [User Interfaces]: Theory and Methods

General Terms
Algorithms, Design, Experimentation, Human Factors, Performance, Theory

Keywords
Autocompletion, Empirical Entropy, Index Data Structure

1. INTRODUCTION
Autocompletion is a widely used mechanism to get to a desired piece of information quickly and with as little knowledge and effort as possible.
One of its early uses was in the Unix Shell, where pressing the tabulator key gives a list of all file names that start with whatever has been typed on the command line after the last space. Nowadays, we find a similar feature in most text editors, and in a large variety of browsing GUIs, for example, in file browsers, in the Microsoft Help suite, or when entering data into a web form. Recently, autocompletion has been integrated into a number of (web and desktop) search engines like Google Suggest or Apple's Spotlight. We discuss more applications in Section 1.2.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. Copyright 2006 ACM /06/ $5.00.

In the simpler forms of autocompletion, the list of completions is simply a range from a (typically precomputed) list of words. For the Unix Shell, this is the list of all file names in all directories listed in the PATH variable. For the text editors, this is the list of all words entered into the file so far (and maybe also words from related files). In Google Suggest, completions appear to come from a precompiled list of popular queries. For these kinds of applications we can easily achieve fast response times by two binary or B-tree searches in the (pre)sorted list of candidate strings.

More advanced forms of autocompletion take into account the context in which the to-be-completed word has been typed. The problem we propose and discuss in this paper is of this kind. The formal problem definition will be given in Section 2. More informally, imagine a user of a search engine typing a query.
Then with every letter being typed, we would like an instant display of completions of the last query word which would lead to good hits. At the same time, the best hits for any of these completions should be displayed. All this should preferably happen in less time than it takes to type a single letter. For example, assume a user has typed conference sig. Promising completions might then be sigir, sigmod, etc., but not, for example, signature, assuming that, although signature by itself is a pretty frequent word, the query conference signature leads to only few good hits. See Figure 1 for a screenshot of our search engine responding to that query. For a live demo, see [...].

1.1 Our results
We have developed a new indexing data structure, named HYB, which uses no more space than a state-of-the-art compressed inverted index, and which can respond to autocompletion queries as described above within a small fraction of a second, even for collection sizes in the Terabyte range. Our main competitor in this paper is the inverted index, referred to as INV in the following. Other data structures that could be directly applied to our problem either use a lot of space or have other limitations; we discuss these in Section 1.2.

We give a rigorous mathematical analysis of HYB and INV with respect to both space usage and query processing times. Our analysis accurately predicts the real behavior on our test collections. Concerning space usage, we define a notion of empirical
entropy [11] [22], which captures the inherent space complexity of an index independent of a particular compression scheme. We prove that the empirical entropy of HYB is essentially equal to that of INV, and we find that the actual space usage of our implementation of the two index structures is indeed almost equal, for each of our three test collections.

Concerning processing times, we give a precise quantification of the number of operations needed, from which we derive bounds for the worst-, best-, and average-case behavior of INV and HYB. We also take into account the different latencies of sequential and random access to data [1].

We compare INV and HYB on three test collections with different characteristics. One of our collections has been (semi-)publicly searchable over the last year, so that we have autocompletion queries from real users for it. Our largest collection is the TREC Terabyte benchmark with over 25 million documents [7]. On all three collections and on all the queries we considered, HYB outperforms INV by a factor of [...] in worst-case query processing time, and by a factor of 3-10 in average-case query processing time. In absolute terms, HYB achieves average query processing times of one tenth of a second or less on all collections, on a single machine and with the index on disk (and not in main memory).

Figure 1: A screenshot of our search engine for the query conference sig searching the English Wikipedia. The list of completions and hits is updated automatically and instantly after each keystroke, hence the absence of any kind of search button. The number in parentheses after each completion is the number of hits that would be obtained if that completion were typed. Query words need not be completed, however, because the search engine does an implicit prefix search: if, for example, the user continued typing conference sig proc, completions and hits for proc, e.g., proceedings, would be from the 185 hits for conference sig.
We have built a full-fledged search engine that supports autocompletion queries of the described kind combined with support for proximity/phrase search, XML tags, subword and phrase completion, and category information. All of these extensions are described in Section 4.

1.2 Related work
The autocompletion feature as described so far is reminiscent of stemming, in the sense that by stemming, too, prefixes instead of full words are considered [23]. But unlike stemming, our autocompletion feature gives the user feedback on which completions of the prefix typed so far would lead to highly ranked documents. The user can then assess the relevance of these completions to his or her search desire, and decide to (i) type more letters for the last query word, e.g., in the query from Figure 1, type i and r so that the query is then conference sigir, or to (ii) start with the next query word, e.g., type a space and then proc, or to (iii) stop searching as, e.g., the user was actually looking for one of the hits shown in Figure 1. There is no way to achieve this by a stemming preprocessing step, because there is no way to foresee the user's intent. This kind of user interaction is well known to improve retrieval effectiveness in a variety of situations [21].

While our autocompletion feature is for the purpose of finding information, autocompletion has also been employed for the purpose of predicting user input, for example, for typing messages with a mobile phone, for users with disabilities concerning typing, or for the composition of standard letters [6] [14] [20] [8] [15]. In [12], contextual information has been used to select promising extensions for a query. Paynter et al. have devised an interface with a zooming-in property on the word level, based on the identification of frequent phrases [18]. We get a related feature by the subword/phrase-completion mechanism described in Section 4.4.
Our autocompletion problem is related to but distinctly different from multi-dimensional range searching problems, where the collection consists of tuples (of some fixed dimension, for example, pairs of word prefixes), and queries are asking for all tuples that match a given tuple of ranges [10] [2] [4] [13]. These data structures could be used for our autocompletion problem, provided that we were willing to limit the number of query words. For fast processing times, however, the space consumption of any of these structures is on the order of $N^{1+d}$, where $N$ is the size of an inverted index, and $d > 0$ grows (fast) with the dimension. For our autocompletion queries, we can achieve fast query processing times and space efficiency at the same time because we have the set of documents matching the part of the query before the last word already computed (namely when this part was being typed). In a sense, our autocompletion problem is therefore a 1½-dimensional range searching problem.

Finally, there is a large variety of alternatives to the inverted index in the literature. We have considered those we are aware of with regard to their applicability to our autocompletion problem, but found them either unsuitable or inferior to the inverted index in that respect. For example, approaches that consider document by document are bound to be slow due to a poor locality of access; in contrast, both INV and HYB are mostly scanning long lists; see Section 3. Signature files were found to be in no way superior (but significantly more complicated) to the inverted index in all major respects in [24]. Suffix arrays and related data structures address the issue of full substring search, which is not what we want here (but see Section 4.4); a direct application of a data structure like [11] would have the same efficiency problems as INV, whereas multi-dimensional variants like [10] require super-linear space, as explained above.

2. FORMAL PROBLEM DEFINITION AND DEFINITION OF EMPIRICAL ENTROPY
The following definition of our autocompletion problem takes neither positional information nor ranking of the completions or of the documents into account. We will first, in Section 3, analyze our data structures for this basic setting. In Section 4, we then show how to generalize the data structures and their analysis to cope with positional information, ranking, and a number of other useful enhancements. This generalization will be straightforward.

Definition 1. An autocompletion query is a pair $(D, W)$, where $W$ is a range of words (all possible completions of the last word which the user has started typing) and $D$ is a set of documents (the hits for the preceding part of the query).
To process the query means to compute the subset $W' \subseteq W$ of words that occur in at least one document from $D$, as well as the subset $D' \subseteq D$ of documents that contain at least one of these words.

For our example conference sig, $D$ is the set of all documents containing a word starting with conference (computed when the last letter of this word was typed), and $W$ is the range of all words from the collection starting with sig. For queries with only a single word, e.g., confer, $D$ is simply the set of all documents.

To analyze the inherent space complexity of INV and HYB independently of the specialties of a particular compression scheme, we introduce a notion of empirical entropy. Both INV and HYB are essentially a collection of (multi)sets and sequences. The following definition gives a natural notion of entropy for each such building block, and for arbitrary combinations of them (similar definitions have been made in [11] [22]). The reader might first want to skip the following definition and come back to it when it is first used in the analysis that follows.

Definition 2. We define empirical entropy for the following entities, where $H(p_1, \ldots, p_l) = -\sum_{i=1}^{l} p_i \log_2 p_i$ is the $l$-ary entropy function.
(a) For a subset of size $n'$ with elements from a universe of size $n$, the empirical entropy is $n \cdot H(n'/n, 1 - n'/n)$ (include each element of the universe into the subset with probability $n'/n$), which is $n' \log_2 \frac{n}{n'} + (n - n') \log_2 \frac{n}{n - n'}$.
(b) For a multisubset of size $n'$ with elements from a universe of size $n$, the empirical entropy is $(n + n') \cdot H(n/(n + n'), n'/(n + n'))$ (consider a bitvector of size $n + n'$, and let a bit be 0 with probability $n'/(n + n')$ and 1 otherwise; the prefix sums at the 0-bits give the multisubset), which is $n' \log_2 \frac{n + n'}{n'} + n \log_2 \frac{n + n'}{n}$.
(c) For a sequence of $n$ elements from a universe of size $l$, where the $i$th element occurs $n_i$ times ($n_1 + \cdots + n_l = n$), the empirical entropy is $n \cdot H(n_1/n, \ldots, n_l/n)$ (for each position, pick element $i$ with probability $n_i/n$), which is $n_1 \log_2 \frac{n}{n_1} + \cdots + n_l \log_2 \frac{n}{n_l}$.
(d) For a collection of $l$ entities with empirical entropies $H_1, \ldots, H_l$, the empirical entropy is simply $H_1 + \cdots + H_l$.

3. INV, HYB, AND THEIR ANALYSIS
In this section we will describe INV and HYB, and analyze them with respect to their empirical entropy and their processing time for autocompletion queries according to Definition 1. Query processing times will be quantified in terms of all relevant parameters; from this we can easily derive worst-case, best-case, and average-case bounds. Our average-case bounds make simplifying assumptions on the distribution of words in the documents, but nevertheless turn out to predict the actual behavior quite well. Implementation issues and the actual performance of our implementations of INV and HYB will be discussed in Section 5. We briefly comment on index construction times in Section 3.3.

3.1 The inverted index (INV)
The inverted index is the data structure of choice for most search applications: it is relatively easy to implement and extend by other features, it can be compressed well, it is very efficient for short queries, and it has an excellent locality of access [23]. In this paper, by INV we mean the following data structure: for each word store the list of all (ids of) documents containing that word, sorted in ascending order. We do not consider enhancements such as skip pointers [17], which we would expect to give similar benefits for both INV and HYB, however at the price of an increased space usage.

In the following, we first estimate the inherent space efficiency (empirical entropy) of INV. We then analyze the time complexity of processing autocompletion queries with INV, and point out two inherent problems.

Lemma 1. Consider an instance of INV with $n$ documents and $m$ words, and where the $i$th word occurs in $n_i$ distinct documents (so that $N = \sum_{i=1}^{m} n_i$ is the total number of word-in-document pairs). Let $H_{inv}$ be the empirical entropy according to Definition 2. Then

$$H_{inv} \le \sum_{i=1}^{m} n_i \left( \frac{1}{\ln 2} + \log_2 \frac{n}{n_i} \right),$$

and for all collections considered in this paper (where most $n_i$ are much smaller than $n$) this bound is tight up to 2%.

Proof. According to Definition 2 (a) and (d), we have

$$H_{inv} = \sum_{i=1}^{m} \left( n_i \log_2 \frac{n}{n_i} + (n - n_i) \log_2 \frac{n}{n - n_i} \right).$$

To prove the lemma, it suffices to observe that because $1 + x \le e^x$ for any real $x$,

$$(n - n_i) \log_2 \frac{n}{n - n_i} = \frac{n - n_i}{\ln 2} \ln\left( 1 + \frac{n_i}{n - n_i} \right) \le \frac{n - n_i}{\ln 2} \cdot \frac{n_i}{n - n_i} = \frac{n_i}{\ln 2}.$$
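As an illustration of Definition 2 (a), (d) and of the bound in Lemma 1, the following minimal sketch evaluates both quantities for a toy instance. The list lengths and the collection size are made-up values chosen for illustration, and the function names are hypothetical; none of this comes from the paper's implementation.

```python
import math


def subset_entropy(n_sub, n):
    """Definition 2(a): n * H(n'/n, 1 - n'/n), in bits."""
    if n_sub == 0 or n_sub == n:
        return 0.0  # a fixed subset carries no information
    return (n_sub * math.log2(n / n_sub)
            + (n - n_sub) * math.log2(n / (n - n_sub)))


def inv_entropy(list_lengths, n):
    """Definition 2(d): an inverted index is a collection of subsets, so sum up."""
    return sum(subset_entropy(n_i, n) for n_i in list_lengths)


def lemma1_bound(list_lengths, n):
    """Right-hand side of Lemma 1: sum of n_i * (1/ln 2 + log2(n/n_i))."""
    return sum(n_i * (1 / math.log(2) + math.log2(n / n_i))
               for n_i in list_lengths if n_i > 0)


# Hypothetical toy instance: 1000 documents, four short inverted lists.
n = 1000
lengths = [8, 2, 4, 2]
h_inv = inv_entropy(lengths, n)
bound = lemma1_bound(lengths, n)
print(f"H_inv = {h_inv:.3f} bits, Lemma 1 bound = {bound:.3f} bits")
```

Because every list here is much shorter than the number of documents, the two values differ only by a small fraction of a bit, which is the regime in which the lemma calls the bound tight.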
Lemma 1 tells us that if the documents in each list were picked uniformly at random, then a Golomb-encoding of the gaps [23] from one document id to the next (for list $i$, the expected size of a gap would be $n/n_i$) would achieve a space usage very close to $H_{inv}$ bits. In our implementation, we opted to encode gaps with the Simple9 encoding from [3], which is easy to implement, yet achieves very fast decompression speeds at the price of only a moderate loss in compression efficacy; details are reported in Section 5.

Lemma 2. With INV, an autocompletion query $(D, W)$ can be processed in the following time, where $D_w$ denotes the inverted list for word $w$:

$$|D| \cdot |W| + \sum_{w \in W} |D_w| + \sum_{w \in W} |D \cap D_w| \cdot \log |W|.$$

Assuming that the elements of $W$, $D$, and the $D_w$ are picked uniformly at random from the set of $m$ words and the set of $n$ documents, respectively, this bound has an expected value of

$$|D| \cdot |W| + \frac{|W|}{m} \cdot N + \frac{|D| \cdot |W|}{m} \cdot \frac{N}{n} \cdot \log |W|.$$

Remark. By picking the elements of a set $S$ at random from a set $U$, we mean that each subset of $U$ of size $|S|$ is equally likely for $S$. We are not making any randomness assumption on the sizes of $W$, $D$, and $D_w$ above.

Proof sketch. The obvious way to use an inverted index to process an autocompletion query $(D, W)$ is to compute, for each $w \in W$, the intersections $D \cap D_w$. Then, $W'$ is simply the set of all $w$ for which the intersection was non-empty, and $D'$ is the union of all (non-empty) intersections. The intersections can be computed in time linear in the total input volume $\sum_{w \in W} (|D| + |D_w|)$ (footnote 1). The union can be computed by a $|W|$-way merge, which requires on the order of $\log |W|$ time per element scanned. With the randomness assumptions, the expected size of $D_w$ is $N/m$, and the expected size of $D \cap D_w$ is $|D|/n \cdot N/m$.

Lemma 2 highlights two problems of INV. The first is that the term $|D| \cdot |W|$ can become prohibitively large: in the worst case, when $|D|$ is on the order of $n$ (i.e., the first part of the query is not very discriminative) and $|W|$ is on the order of $m$ (i.e., only few letters of the last query word have been typed), the bound is on the order of $n \cdot m$, that is, quadratic in the collection size.
The second problem is due to the required merging. While the volume $\sum_{w \in W} |D \cap D_w|$ will typically be small once the first query word has been completed, it will be large for the first query word, especially when only few letters have been typed. As we will see in Section 5, INV frequently takes seconds for some queries, which is quite undesirable in an interactive setting, and is exactly what motivated us to develop a more efficient index data structure.

Footnote 1: There are asymptotically faster algorithms for the intersection of two lists [5], but in our experiments, we got the best results with the simple linear-time intersect, which we attribute to its compact code and perfect locality of access.

3.2 Our new data structure (HYB)
The basic idea behind HYB is simple: precompute inverted lists for unions of words. Assume an autocompletion query $(D, W)$, where the union of all lists for word range $W$ has been precomputed. We would then get $D'$ with a single intersection (of $D$ with the precomputed list). However, from this precomputed list alone we can no longer infer the set $W'$ of completions leading to a hit. Since $W$ can be an arbitrary word range, it is also not clear which unions should be precomputed, especially when we do not want to use more space than an (optimally compressed) inverted index.

The analysis given in this section suggests the following approach: group the words in blocks so that the lengths of the inverted lists in each block sum to (approximately) $c \cdot n$, for some constant $c < 1$ (we will later choose $c \approx 0.2$). For each block, store the union of the covered inverted lists as a compressed multiset, using an effective gap encoding scheme just as done for INV (repetitions of the same element in the multiset correspond to a gap of zero). In parallel to each multiset, for each element $x$ store the id of the word that led to the inclusion of (this occurrence of) $x$ in the multiset. This gives a sequence of word ids, the length of which is exactly the size of the multiset.
Encode these word ids with code length (approximately) $\log_2 \left( \left( \sum_{i=1}^{l} n_i \right) / n_i \right)$ for the $i$th word, where $n_i$ is the number of documents containing the $i$th word, and $l$ is the number of words in the respective block.

Here is an example. Let one of the blocks comprise four words A, B, C, and D, with inverted lists

A : 3, 5, 6, 8, 9, 11, 12, 15
B : 5, 11
C : 3, 7, 11, 13
D : 3, 8

We would then like to store, in compressed form, the multiset (of document ids) and the sequence (of word ids)

3 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
A C D A B A C A D A A B C A C A

The optimal encoding of the words A, B, C, D would use code lengths $\log_2(16/8) = 1$, $\log_2(16/2) = 3$, $\log_2(16/4) = 2$, $\log_2(16/2) = 3$, respectively, for example A = 0, B = 110, C = 10, D = 111. An optimal encoding of the four gaps 0, 1, 2, 3 that occur in the above multiset of document ids would be 0, 10, 110, 111, respectively. What we actually store are then the two bit vectors (where the | are solely for better readability; the codes in this example are prefix-free)

111|0|0|110|0|10|10|10|0|10|110|0|0|10|10|110
0|10|111|0|110|0|10|0|111|0|0|110|10|0|10|0

Note that due to the two different encodings the two lists end up having different lengths in compressed form, and this is also what will happen in reality.

The following analysis will make very clear that (i) one should choose blocks of equal list volume (and not, for example, of equal number of words), (ii) this volume should be a small but substantial fraction of the number of documents (and neither smaller nor larger), and (iii) the lists of document ids should be gap-encoded while the lists of word ids should be entropy-encoded.

As for the space usage, we will first derive a very tight estimate of the entropy of HYB, and then show that, somewhat surprisingly, if we only choose the block volume to be a small enough fraction of the number of documents, the entropy of HYB is almost exactly that of INV. We will then show how HYB, when the blocks are chosen of sufficiently large volume, can be used to process autocompletion queries in time linear in the number of documents, for any reasonable word range.
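The four-word example block can be reproduced in a few lines. The sketch below builds the two parallel arrays, encodes them with the prefix-free codes quoted in the example, and then answers one autocompletion query both HYB-style (a single scan over the block) and INV-style (one intersection per word). The function names, the query, and the use of Python sets for the intersections are assumptions made for illustration; the paper's implementation is in C++ and intersects sorted lists.

```python
# Inverted lists of the four-word example block (doc ids as given in the text).
LISTS = {
    "A": [3, 5, 6, 8, 9, 11, 12, 15],
    "B": [5, 11],
    "C": [3, 7, 11, 13],
    "D": [3, 8],
}
# Prefix-free codes quoted in the example for word ids and for the gaps 0..3.
WORD_CODE = {"A": "0", "B": "110", "C": "10", "D": "111"}
GAP_CODE = {0: "0", 1: "10", 2: "110", 3: "111"}


def build_block(lists):
    """Merge the lists into two parallel arrays: doc ids (a multiset) and word ids."""
    pairs = sorted((doc, word) for word, docs in lists.items() for doc in docs)
    return [doc for doc, _ in pairs], [word for _, word in pairs]


def gaps(doc_ids):
    """Differences between consecutive doc ids; a repeated doc id gives gap 0."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]


def hyb_query(doc_ids, word_ids, D, lo, hi):
    """One scan over the block, keeping pairs with doc in D and lo <= word <= hi."""
    D = set(D)  # simplification: a set lookup instead of a sorted-list intersection
    hits = [(d, w) for d, w in zip(doc_ids, word_ids) if d in D and lo <= w <= hi]
    return sorted({w for _, w in hits}), sorted({d for d, _ in hits})


def inv_query(lists, D, lo, hi):
    """INV-style: intersect D with every list in the word range, then take the union."""
    D, words, docs = set(D), [], set()
    for w in sorted(lists):
        common = D.intersection(lists[w]) if lo <= w <= hi else set()
        if common:
            words.append(w)
            docs |= common
    return words, sorted(docs)


doc_ids, word_ids = build_block(LISTS)
doc_bits = "|".join(GAP_CODE[g] for g in gaps(doc_ids))
word_bits = "|".join(WORD_CODE[w] for w in word_ids)
print(doc_ids)
print(" ".join(word_ids))
print(doc_bits)
print(word_bits)
# Hypothetical query: D = {3, 6, 13}, W = the word range B..C.
print(hyb_query(doc_ids, word_ids, D=[3, 6, 13], lo="B", hi="C"))
```

Both query functions return the same pair (W', D'); the difference is that the HYB-style scan touches one precomputed list, whereas the INV-style loop touches one list per word in the range and has to combine the results.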
Since HYB essentially scans long lists, without the need for any merging, except when the word range is huge, it also has an excellent locality of access.

Lemma 3. Consider an instance of HYB with $n$ documents and $m$ words, where the $i$th word occurs in $n_i$ documents, and where for each block the sum of the $n_i$ with $i$ from that block is $c \cdot n$, for some $c > 0$. Then the empirical entropy $H_{hyb}$, defined according to Definition 2, satisfies

$$H_{hyb} \le \sum_{i=1}^{m} n_i \left( \frac{1 + c/2}{\ln 2} + \log_2 \frac{n}{n_i} \right),$$

and the bound is tight as $c \to 0$.

Proof. Consider a fixed block of HYB, and let $n_i$ denote the number of documents containing the $i$th word belonging to that block. Throughout this proof, let $\sum_i n_i$ denote the sum over all these $i$ (so that the sum over all $\sum_i n_i$ from all blocks gives the $\sum_{i=1}^{m} n_i$ from the lemma). According to Definition 2 (b), (c), and (d), the empirical entropy of this block is then

$$\sum_i n_i \cdot \log_2 \frac{n + \sum_i n_i}{\sum_i n_i} + n \cdot \log_2 \frac{n + \sum_i n_i}{n} + \sum_i n_i \cdot \log_2 \frac{\sum_i n_i}{n_i}.$$

Now adding the first and the last term, the arguments of the logarithms partially cancel out (!), and we get

$$\sum_i n_i \cdot \log_2 \frac{n + \sum_i n_i}{n_i} + n \cdot \log_2 \frac{n + \sum_i n_i}{n}.$$

Now using that, by assumption, $\sum_i n_i = c \cdot n$, we obtain

$$\sum_i n_i \left( (1 + 1/c) \log_2 (1 + c) + \log_2 \frac{n}{n_i} \right).$$

Since $(1 + 1/c) \ln(1 + c) \le 1 + c/2$ for all $c > 0$ (not obvious, but true), we can upper bound this (tightly, as $c \to 0$) by

$$\sum_i n_i \left( \frac{1 + c/2}{\ln 2} + \log_2 \frac{n}{n_i} \right).$$

This bounds the empirical entropy of a single block of HYB (the sum goes over all words from that block). Adding this over all blocks gives us the bound claimed in the lemma.

Comparing Lemma 3 with Lemma 1, we see that if we let the blocks of HYB be of volume at most $c \cdot n$, for some small fraction $c$, then the empirical entropy of HYB is essentially that of an inverted index. In Section 4.2, we will see that when we take positional information into account, the empirical entropy of HYB actually becomes less than that of INV, for any choice of block volumes.

In our implementation of HYB, we compress the lists of document ids by a Simple9 encoding of the gaps, just as described for INV above. For the lists of word ids, entropy-optimal compression could be achieved by arithmetic encoding [23], but for efficiency reasons, we compress word ids as follows: assuming that the word frequencies in a block have a Zipf-like distribution, it is not hard to see that a universal encoding with $\log x$ bits for number $x$ [17] of the ranks of the words, if sorted in order of descending frequency, is entropy-optimal, too.
We again opted for Simple9 encoding of these ranks, which gives us a reasonable compression and very fast decompression speed, without the need for any large codebook. We take block sizes as $n/5$, but also take word/prefix boundaries into account such that frequent prefixes like pro, com, the get a block on their own. This is to avoid that a query unnecessarily spans more than one block.

Lemma 4. Using HYB with blocks of volume $N'$, autocompletion queries $(D, W)$ can be processed in the following time, where $D_w$ is the inverted list for word $w$:

$$\sum_{w \in W} |D_w| \cdot \left( 1 + |D|/N' \right) + \left( \sum_{w \in W} |D \cap D_w| \right) \cdot \log \left( \sum_{w \in W} |D_w| / N' \right).$$

For $N' = \Theta(n)$ and $|W| \le m \cdot n/N$, and assuming that the elements of $D$, $D_w$, and $W$ are picked uniformly at random from the set of all $n$ documents or all $m$ words, respectively, the expected processing time is bounded by $O(n)$.

Proof sketch. According to Definition 1, we have to compute, given $(D, W)$, the set $W'$ of words from $W$ contained in documents from $D$, as well as the set $D'$ of documents containing at least one such word. For each block $B$, a straightforward intersection of the given $D$ with the list of document-word pairs from $B$ gives us the set $W'_B$ of all words from $W'$ in block $B$, as well as the set $D'_B$ of all documents from $D$ which contain a word from $B$. From these, $D'$ can be computed by a $k$-way merge, where $k$ is the number of blocks that contain a word from $W$, and $W'$ can be computed by a simple linear-time sort into $|W|$ buckets (because $W$ is a range). The number $k$ of blocks is $\sum_{w \in W} |D_w| / N'$, which is $O(1)$ in expectation, given the randomness assumptions stated in the lemma.

3.3 Index construction time
While getting from a collection of documents (files) to INV is essentially a matter of one big external sort [23], HYB does not require a full inversion of the data. For our experiments, however, we built the compressed indices for both INV and HYB from an intermediate fully inverted text version of the collection, which takes essentially the same time for both.

4. EXTENSIONS
In this section, we describe a number of extensions of the basic autocompletion facility we have described and analyzed so far.
The first (ranking) is essential for practical usability, the second (proximity search) greatly widens the spectrum of search tasks for which autocompletion can be useful, and the others (support for XML tags, subword and phrase completion, and semantic tags) give advanced search facilities to the expert searcher.

4.1 Ranking
So far, we have considered the following problem (from Definition 1): while the user is typing a query, compute after each keystroke the list of all completions of the last query word that lead to at least one hit, as well as the list of all hits that would be obtained by any of these completions. In practice, only a selection of items from these lists can and will be presented to the user, and it is of course crucial that the most relevant completions and hits are selected.

A standard approach for this task in ad-hoc retrieval is to have a precomputed score for each word-in-document pair, and when a query is being processed, to aggregate these scores for each candidate document, and return documents with the highest such aggregated scores [23]. Both INV and HYB can be easily adapted to implement any such scoring and aggregation scheme: store by each word-in-document pair its precomputed score, and when intersecting, aggregate the scores. A decision has to be made on how to reconcile scores from different completions within the same document. We suggest the following: when merging the intersections (which gives the set $D'$ according to Definition 1), compute for each document in $D'$ the maximal score achieved for some completion in $W'$ contained in that document, and compute for each completion in $W'$ the maximal score achieved by a hit from $D'$ for this completion.

Asymptotically, the inclusion of ranking does not affect the time bounds derived in Lemmas 2 and 4, and our experiments show that ranking never takes more than half of the total query processing time; see Section 5.4. The increase in space usage depends on the selected scoring scheme, and is the same for INV and HYB.
It is for these reasons that we factored out the ranking aspect from our basic Definition 1 and from our space and time complexity analysis in Section 3.

4.2 Proximity/Phrase searches
With a properly chosen scoring function, such as BM25, mere ranking by score aggregation often gives very satisfactory precision/recall behavior [19]. There are many queries, however, where the decisive cue on whether a particular document is relevant lies in whether certain of the query words occur close to each other in that document. See [16] for a recent positive result on the use of proximity information in ad-hoc retrieval.

Our autocompletion feature increases the benefits of a proximity operator, because the use of this operator will strongly narrow down the list of completions displayed to the user, which in turn makes it easier for the user to filter out irrelevant completions. For example, when searching the Wikipedia collection the most relevant completion for the non-proximity query max pl would be place (because max and place are both frequent words), but for the proximity query max..pl it is planck. Here the two dots .. indicate that words should occur within $x$ words of each other, for some user-definable parameter $x$.

It is not hard to extend both INV and HYB to support proximity search: in the document lists (INV and HYB) as well as in the word lists (HYB only), we duplicate each entry as many times as it occurs in the corresponding document, and store the positions in a parallel array of the same size. Word and document lists are compressed just as before, and the lists of positions are gap-encoded by Simple9, just like the lists of document ids. The intersection routine is adapted to consider a proximity window as an additional parameter.

As we will see in Section 5.3, the position lists increase the index size by a factor of 4-5, for both INV and HYB (without any kind of stopword removal). We can extend our analysis from Section 3 to predict this factor as follows.
If we replace $n_i$, the number of documents containing the $i$th word, by $N_i$, the total number of occurrences of the $i$th word, and $n$, the number of documents, by $N$, the total number of word occurrences, we can show that (details omitted)

$$H'_{hyb} \le H'_{inv} \le \sum_{i=1}^{m} \left( N_i / \ln 2 + N_i \log_2 (N/N_i) \right),$$

where $H'_{inv}$ and $H'_{hyb}$ denote the empirical entropy of INV and HYB, respectively, with positional information. That is, with positional information, HYB is always more space-efficient than INV, irrespective of how we divide into blocks. It can be shown that $N_i \log_2 (N/N_i) \approx 2 N_i \log_2 (n/n_i)$, and since on average a word occurs about 2-3 times in a document, this is just 4-5 times $n_i \log_2 (n/n_i)$, which was the corresponding term in the entropy bound for INV or HYB without positional information (Lemmas 1 and 3).

4.3 Semi-structured (XML) text
Many documents contain semantic information in the form of tag pairs. We briefly sketch how we can make good use of such tags in our autocompletion scenario. Assume that in the archive of a mailing list, the subject of a mail is enclosed in a <subject> ... </subject> tag pair. We can then easily implement an operator = (in a way very similar to our implementation of the proximity operator), such that for a query subject=sig only those completions of sig are displayed which actually occur in the subject line of a mail, and only such documents are displayed as hits.

4.4 Completion to subwords and phrases
Another simple yet often useful extension to the basic autocompletion feature is to consider as potential matches not only the words as they occur in the collection, but also meaningful subwords and phrases. An example involving a subword: for the query normal..vec we might want to see eigenvector as one of the relevant completions. An example involving a phrase: for the query max planck we might want to see the phrase max planck institute as one of the relevant completions. It is not hard to see that autocompletion according to Definition 1 will automatically provide this feature if only we add the corresponding subwords/phrases to the index.
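The proximity operator of Section 4.2, and the = operator of Section 4.3 that is described as being implemented in a similar way, both restrict an intersection by word positions. The sketch below shows one possible form of such a position-aware intersection. The name `proximity_intersect`, the (doc id, position) tuple layout, and the dictionary-plus-binary-search strategy are assumptions made for illustration; the paper only states that the intersection routine takes a proximity window as an additional parameter.

```python
from bisect import bisect_left
from collections import defaultdict


def proximity_intersect(prev_hits, postings, x):
    """Keep the (doc, pos) postings lying within x words of a previous hit in the same doc.

    prev_hits: (doc, pos) pairs matching the preceding part of the query,
               assumed sorted by (doc, pos).
    postings:  (doc, pos) pairs of the list being intersected.
    x:         proximity window, in words.
    """
    positions_by_doc = defaultdict(list)
    for doc, pos in prev_hits:
        positions_by_doc[doc].append(pos)  # stays sorted because prev_hits is sorted

    kept = []
    for doc, pos in postings:
        positions = positions_by_doc.get(doc)
        if not positions:
            continue  # document did not match the preceding part of the query
        # Smallest previous position that is >= pos - x, if there is one.
        i = bisect_left(positions, pos - x)
        if i < len(positions) and positions[i] <= pos + x:
            kept.append((doc, pos))
    return kept


# Hypothetical data: hits for the first query word, postings for a completion.
prev_hits = [(1, 4), (2, 10)]
postings = [(1, 5), (1, 40), (2, 2), (3, 7)]
print(proximity_intersect(prev_hits, postings, x=3))
```

Only the posting in document 1 at position 5 survives in this toy input: the other posting in document 1 and the one in document 2 are too far from the previous hits, and document 3 did not match the first query word at all.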
4.5 Category information
Our autocompletion feature can be combined with a number of other technologies that enhance the semantics of a corpus. To give just one more example here, assume we have tagged all conference names in a collection. Then assume we duplicate all conference names in the index, with an added prefix of, say, conf:, e.g., conf:sigir. By the way our autocompletion works, for the query seattle conf: we would then get a list of all names of conferences that occur in documents that also mention Seattle.

5. EXPERIMENTS
We implemented both INV and HYB in compressed format, as described in Sections 3.1 and 3.2. Each index is stored in a single file with the individual lists concatenated and an array of list offsets at the end. The vocabulary (which is the same for INV as for HYB) is stored in a separate file. All our code is in C++. All our experiments were run on a Dual Opteron machine, with 2 Intel Xeon 3 GHz processors, 8 GB of main memory, and running Linux. We ensured that the index was not cached in main memory.

5.1 Test collections
We compared the performance of INV and HYB on three collections of different characteristics. The first collection is a mailing-list archive plus several encyclopedias on homeopathic medicine (www.homeonet.org). This collection has been searchable via our engine over the past year by an audience of several hundred people. The second collection consists of the complete dumps of the English and German Wikipedia from December 2005 (search.mpi-inf.mpg.de/wikipedia). The third collection is the large TREC Terabyte collection [7], which served as a stress test for our index structures (and for the authors as well). Details about all three collections are given in Table 1, where the raw size of a collection is the total size of the original, uncompressed files in their original formats (e.g., HTML or PDF).

5.2 Queries
For the Homeopathy collection, we picked 5,732 maximal queries (that is, queries which are not a true prefix of another query) from a fixed time slice of our query log for that collection.
From each of these maximal queries, a sequence of autocompletion queries was generated by typing the query from left to right, with a minimal prefix length of 3. In this way, for example, the maximal query acidum phos gives rise to the 6 autocompletion queries aci, acid, acidu, acidum, acidum pho, and acidum phos. For the Wikipedia collection, autocompletion queries were generated in the same manner from a set of 100 randomly generated queries, with a distribution of the number of query words and of the term frequency similar to that of the real queries for the Homeopathy collection. For the Terabyte collection, autocompletion queries were generated, again in the same way but with a minimal prefix length of 4, from the (stemmed) 50 ad-hoc queries of the Robust Track Benchmark [7], e.g., squirrel control protect. For all three collections, we removed queries containing words that had no completion at all in the respective collection.
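The generation of autocompletion queries from a maximal query can be written down in a few lines. This is a minimal sketch; the function name is hypothetical, and the treatment of words shorter than the minimal prefix length (they are emitted as a whole) is an assumption, since the text does not cover that case.

```python
def autocompletion_queries(query, min_prefix_len=3):
    """Simulate typing a query from left to right: for every word, emit
    the query with that word cut to each prefix of at least
    min_prefix_len characters, preceded by the complete earlier words."""
    words = query.split()
    out = []
    for i, word in enumerate(words):
        head = words[:i]
        # Assumption: a word shorter than the minimum is used as a whole.
        start = min(min_prefix_len, len(word))
        for k in range(start, len(word) + 1):
            out.append(" ".join(head + [word[:k]]))
    return out


print(autocompletion_queries("acidum phos"))
print(autocompletion_queries("squirrel control protect", min_prefix_len=4)[:3])
```

With the default length of 3, the first call reproduces the six queries listed for acidum phos.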
For Homeopathy and Wikipedia, all queries were run as proximity queries (using a full positional index according to Section 4.2), while for Terabyte, they were executed as ordinary document-level queries. For all collections, both completions and hits were ranked as described in Section 4.1 (details of the aggregation function are omitted here). Each autocompletion query was processed according to Definition 1; e.g., for acidum pho, we compute all completions of pho that occur in a document which also contains a word starting with acidum, as well as the set of all such documents. The result for each autocompletion query is remembered in a history, so that we do not need to recompute the set of documents matching the first part of the query. E.g., when processing acidum pho, we can take the set of documents matching acidum from the history; see the explanation following Definition 1.

The autocompletion queries with minimal prefix lengths, like aci and acidum pho for Homeopathy, are the most difficult ones. All other queries can be easily processed by what we call filtering. For example, both the completions and hits for the query acidum phos can be obtained by retrieving the list of matching word-in-document pairs for the previously processed query acidum pho from the history, and by keeping, in a linear scan over that list, exactly those pairs where the word starts with phos. In practice, this is always faster than processing such queries as full autocompletion queries according to Definition 1. Note that this filtering is identical for INV and HYB. We nevertheless include the filtered queries in our experiments, because in reality we will always get a mix of both kinds of queries. Table 3 will provide figures for just the difficult (unfiltered) queries. We remark that the history is useful also for caching purposes, but in our experiments we used it solely for the purpose of filtering.
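The filtering step can be sketched as a single linear scan. The sketch assumes that the history is a dictionary from a query string to its list of (word, document id) pairs. That representation, the function name, and the sample data are illustrative and not taken from the paper.

```python
def filter_from_history(history, old_query, new_query):
    """Answer new_query from the stored result of old_query by keeping
    the (word, doc_id) pairs whose word starts with the last, longer
    prefix of new_query."""
    if not new_query.startswith(old_query):
        raise ValueError("new_query must extend old_query")
    last_prefix = new_query.split()[-1]
    pairs = [(w, d) for (w, d) in history[old_query] if w.startswith(last_prefix)]
    completions = sorted({w for w, _ in pairs})
    hits = sorted({d for _, d in pairs})
    return pairs, completions, hits


history = {
    "acidum pho": [("phosphoricum", 3), ("phosphorus", 3),
                   ("phoenix", 7), ("phosphoricum", 9)],
}
pairs, completions, hits = filter_from_history(history, "acidum pho", "acidum phos")
print(completions, hits)
```

No index access is involved, which is why such queries cost the same for any underlying index structure.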
5.3 Index space

Table 1 shows that INV and HYB use essentially the same space on all three test collections, and that HYB is slightly more compact than INV for a full positional index. This is exactly what Lemmas 1 and 3, and the derivation in Section 4.2, predicted. The sizes for both INV and HYB exceed those predicted by the empirical entropy by about 50%. This is due to our use of the Simple9 compression scheme, which trades very fast decompression time for about this increase in space usage [3]. A combination of Golomb and arithmetic encoding would give us a space usage closer to the empirical entropy. However, decompression would then become the computational bottleneck for almost all queries; see Table 3. We remark that, by the way we did our analysis, any new compression scheme with an improved compression ratio/decompression speed profile would immediately yield a corresponding improvement for both INV and HYB.

5.4 Query processing time

Table 2 shows that in terms of query processing time, HYB outperforms INV by a large margin on all collections. With respect to maximum processing time, which is especially critical for an interactive application, the improvement is by a factor of […]. With respect to average processing time, which is critical for throughput in a high-load scenario, the improvement is by a factor of […]. Table 3 gives interesting insights into where exactly INV loses against HYB. The table shows a breakdown of the running times of those queries for the Terabyte collection which were not answered by filtering as discussed above. (Note that the breakdown of the filtered queries would be identical for both methods.) The table differentiates between 1-word queries like squi, squir, etc. and multi-word queries like squirrel contr or squirrel control prot. For the 1-word queries, no intersections have to be computed for either INV or HYB.
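To make the intersect and merge steps of the breakdown concrete, here is a toy, uncompressed, in-memory model of an inverted-index baseline in the spirit of INV. It is a sketch under assumptions: the real INV index is compressed and on disk, and the names `inv_autocomplete`, `postings`, and `docs` are hypothetical.

```python
from bisect import bisect_left


def inv_autocomplete(postings, sorted_vocab, prefix, docs=None):
    """postings maps each word to its sorted list of document ids.
    docs is the set of documents matching the earlier query words,
    or None for a 1-word query (then nothing is intersected)."""
    lo = bisect_left(sorted_vocab, prefix)
    completions, hits = [], set()
    for word in sorted_vocab[lo:]:
        if not word.startswith(prefix):
            break
        plist = postings[word]
        if docs is not None:
            plist = [d for d in plist if d in docs]  # intersect step
        if plist:
            completions.append(word)
            hits.update(plist)                       # merge step
    return completions, sorted(hits)


postings = {
    "acidum": [1, 2, 5],
    "phoenix": [4],
    "phosphoricum": [2, 7],
    "phosphorus": [1, 5],
}
vocab = sorted(postings)
print(inv_autocomplete(postings, vocab, "pho"))
print(inv_autocomplete(postings, vocab, "pho", docs={1, 2, 5}))
```

In this model a 1-word query only merges the lists of all words in the range, while a multi-word query performs one intersection per candidate word before merging.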
| Collection | Homeopathy | Wikipedia | Terabyte |
|---|---|---|---|
| Raw size | 452 MB | 7.4 GB | 426 GB |
| #documents | 44,015 | 2,866,503 | 25,204,103 |
| #words | 263,817 | 6,700,119 | 25,263,176 |
| #items | 12 [27] million | 0.3 [0.8] billion | 3.5 billion |
| Vocabulary | 2.9 MB | 73 MB | 239 MB |
| Entropy | 6.6 [13.1] bits | 9.1 [14.0] bits | 8.4 bits |
| INV index size | 13 [70] MB | 0.5 [2.2] GB | 4.6 GB |
| INV per item | 9.3 [21.5] bits | 12.8 [23.2] bits | 11.0 bits |
| HYB index size | 14 [62] MB | 0.5 [2.0] GB | 4.9 GB |
| HYB per item | 9.4 [19.2] bits | 13.0 [20.7] bits | 11.6 bits |
| HYB per doc | 3.9 [15.4] bits | 4.3 [14.8] bits | 5.9 bits |
| HYB per word | 5.5 [3.8] bits | 8.7 [5.9] bits | 5.7 bits |

Table 1: Properties of our three test collections, and the space consumption of INV versus HYB. The entries in square brackets are for a full positional index, without any word whatsoever removed.

| Collection | Method | mean | 90% | 99% | max |
|---|---|---|---|---|---|
| Homeopathy | INV | | | | |
| Homeopathy | HYB | | | | |
| Wikipedia | INV | | | | |
| Wikipedia | HYB | | | | |
| Terabyte | INV | | | | |
| Terabyte | HYB | | | | |

Table 2: Average, 90%-ile, 99%-ile and maximum processing times in seconds for INV versus HYB on our three test collections.

According to Lemma 2, the merging of the intersections then dominates for INV, and this indeed shows in the first column of Table 3. For multi-word queries, the result volume $\sum_{w} |D \cap D_w|$ (Lemmas 2 and 4) goes down, and, according to Lemma 2, the intersection costs dominate for INV, which shows in the third column of Table 3. In contrast, columns two and four demonstrate that HYB achieves a better balance of the costs for reading, uncompressing, and intersecting, and none of these essential operations becomes the bottleneck. HYB avoids merging altogether since, by construction, the potential completions from the given word range W always lie within a single block. The read time of HYB is about 50% larger than that of INV, because HYB always reads a whole block of size $\Theta(n)$, even for small word ranges. This also partially explains why HYB spends more time decompressing than INV; the other factor is that decompression of the word ids is more expensive than decompression of the document ids. As we remarked in Section 4.1, the absolute time for ranking is
the same for both methods. Ranking takes more time on average for the 1-word queries, because these tend to have larger result sets. The comparison with the time needed for the maintenance of the history, which is nothing but memory allocation and copying, shows that all of HYB's operations are essentially fast list scans.

| | 1-word, INV | 1-word, HYB | multi-word, INV | multi-word, HYB |
|---|---|---|---|---|
| average time (secs) | | | | |
| read | 0.024 (7%) | | 0.032 (1%) | |
| decompress | 0.011 (3%) | | 0.023 (1%) | |
| intersect | | | | |
| merging | | | 0.010 (0.4%) | |
| ranking | | | 0.007 (0.3%) | 0.007 (4%) |
| history | | | 0.062 (3%) | |

Table 3: Breakdown of average processing times for INV and HYB, for the difficult (unfiltered) queries on Terabyte.

6. CONCLUSIONS

We have introduced an autocompletion feature for full-text search, and presented a new compact indexing data structure for supporting this feature with very fast response times. We have built a full-fledged search engine around this feature, and we have given arguments why we believe it to be practically useful. Given the interactivity of this engine, the next logical step following this work would be to conduct a user study for verifying that belief. We also see potential for a further speed-up of query processing time by applying techniques from top-k query processing [9], in order to display the most relevant hits and completions without first computing and ranking all of them.

7. ACKNOWLEDGEMENTS

Many thanks to our mentor David Grossman for his encouragement and many valuable comments.

8. REFERENCES

[1] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9).
[2] S. Alstrup, G. S. Brodal, and T. Rauhe. New data structures for orthogonal range searching. In 41st Symposium on Foundations of Computer Science (FOCS'00).
[3] V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Information Retrieval, 8.
[4] L. Arge, V. Samoladas, and J. S. Vitter. On two-dimensional indexability and optimal range search indexing. In 18th Symposium on Principles of Database Systems (PODS'99).
[5] R. Baeza-Yates.
A fast set intersection algorithm for sorted sequences. Lecture Notes in Computer Science, 3109.
[6] S. Bickel, P. Haider, and T. Scheffer. Learning to complete sentences. In 16th European Conference on Machine Learning (ECML'05).
[7] C. L. A. Clarke, N. Craswell, and I. Soboroff. The TREC terabyte retrieval track. SIGIR Forum, 39(1):25.
[8] J. J. Darragh, I. H. Witten, and M. L. James. The reactive keyboard: A predictive typing aid. IEEE Computer, pages 41–49.
[9] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66(4).
[10] P. Ferragina, N. Koudas, S. Muthukrishnan, and D. Srivastava. Two-dimensional substring indexing. Journal of Computer and System Science, 66(4).
[11] P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM, 52(4).
[12] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revisited. In 10th International World Wide Web Conference (WWW10).
[13] V. Gaede and O. Günther. Multidimensional access methods. ACM Computing Surveys, 30(2).
[14] K. Grabski and T. Scheffer. Sentence completion. In 27th Conference on Research and Development in Information Retrieval (SIGIR'04).
[15] M. Jakobsson. Autocompletion in full text transaction entry: a method for humanized input. In Conference on Human Factors in Computing Systems (CHI'86).
[16] D. Metzler, T. Strohman, H. Turtle, and W. B. Croft. Indri at TREC 2004: Terabyte track. In 13th Text Retrieval Conference (TREC'04).
[17] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4).
[18] G. W. Paynter, I. H. Witten, S. J. Cunningham, and G. Buchanan. Scalable browsing for large collections: A case study. In 5th Conference on Digital Libraries (DL'00).
[19] S. E. Robertson, S. Walker, M. M. Beaulieu, M. Gatford, and A. Payne. Okapi at TREC-4. In 4th Text Retrieval Conference (TREC'95), pages 73–96.
[20] T. Stocky, A. Faaborg, and H. Lieberman.
A commonsense approach to predictive text entry. In Conference on Human Factors in Computing Systems (CHI'04).
[21] E. M. Voorhees. Query expansion using lexical-semantic relations. In 17th Conference on Research and Development in Information Retrieval (SIGIR'94).
[22] H. Williams and J. Zobel. Compressing integers for fast file access. Computer Journal, 42(3).
[23] I. H. Witten, T. C. Bell, and A. Moffat. Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition. Morgan Kaufmann.
[24] J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted files versus signature files for text indexing. ACM Transactions on Database Systems, 23(4), 1998.
More informationHypergeometric Distributions
7.4 Hypergeometric Distributios Whe choosig the startig lieup for a game, a coach obviously has to choose a differet player for each positio. Similarly, whe a uio elects delegates for a covetio or you
More informationCHAPTER 3 THE TIME VALUE OF MONEY
CHAPTER 3 THE TIME VALUE OF MONEY OVERVIEW A dollar i the had today is worth more tha a dollar to be received i the future because, if you had it ow, you could ivest that dollar ad ear iterest. Of all
More informationNow here is the important step
LINEST i Excel The Excel spreadsheet fuctio "liest" is a complete liear least squares curve fittig routie that produces ucertaity estimates for the fit values. There are two ways to access the "liest"
More informationMaximum Likelihood Estimators.
Lecture 2 Maximum Likelihood Estimators. Matlab example. As a motivatio, let us look at oe Matlab example. Let us geerate a radom sample of size 00 from beta distributio Beta(5, 2). We will lear the defiitio
More informationLesson 15 ANOVA (analysis of variance)
Outlie Variability betwee group variability withi group variability total variability Fratio Computatio sums of squares (betwee/withi/total degrees of freedom (betwee/withi/total mea square (betwee/withi
More informationCenter, Spread, and Shape in Inference: Claims, Caveats, and Insights
Ceter, Spread, ad Shape i Iferece: Claims, Caveats, ad Isights Dr. Nacy Pfeig (Uiversity of Pittsburgh) AMATYC November 2008 Prelimiary Activities 1. I would like to produce a iterval estimate for the
More informationPage 2 of 14 = T(2) + 2 = [ T(3)+1 ] + 2 Substitute T(3)+1 for T(2) = T(3) + 3 = [ T(4)+1 ] + 3 Substitute T(4)+1 for T(3) = T(4) + 4 After i
Page 1 of 14 Search C455 Chapter 4  Recursio Tree Documet last modified: 02/09/2012 18:42:34 Uses: Use recursio tree to determie a good asymptotic boud o the recurrece T() = Sum the costs withi each level
More information3. Greatest Common Divisor  Least Common Multiple
3 Greatest Commo Divisor  Least Commo Multiple Defiitio 31: The greatest commo divisor of two atural umbers a ad b is the largest atural umber c which divides both a ad b We deote the greatest commo gcd
More information4.1 Sigma Notation and Riemann Sums
0 the itegral. Sigma Notatio ad Riema Sums Oe strategy for calculatig the area of a regio is to cut the regio ito simple shapes, calculate the area of each simple shape, ad the add these smaller areas
More informationEkkehart Schlicht: Economic Surplus and Derived Demand
Ekkehart Schlicht: Ecoomic Surplus ad Derived Demad Muich Discussio Paper No. 200617 Departmet of Ecoomics Uiversity of Muich Volkswirtschaftliche Fakultät LudwigMaximiliasUiversität Müche Olie at http://epub.ub.uimueche.de/940/
More informationHypothesis Tests Applied to Means
The Samplig Distributio of the Mea Hypothesis Tests Applied to Meas Recall that the samplig distributio of the mea is the distributio of sample meas that would be obtaied from a particular populatio (with
More information(VCP310) 18004186789
Maual VMware Lesso 1: Uderstadig the VMware Product Lie I this lesso, you will first lear what virtualizatio is. Next, you ll explore the products offered by VMware that provide virtualizatio services.
More informationHypothesis testing. Null and alternative hypotheses
Hypothesis testig Aother importat use of samplig distributios is to test hypotheses about populatio parameters, e.g. mea, proportio, regressio coefficiets, etc. For example, it is possible to stipulate
More informationhp calculators HP 12C Statistics  average and standard deviation Average and standard deviation concepts HP12C average and standard deviation
HP 1C Statistics  average ad stadard deviatio Average ad stadard deviatio cocepts HP1C average ad stadard deviatio Practice calculatig averages ad stadard deviatios with oe or two variables HP 1C Statistics
More informationYour organization has a Class B IP address of 166.144.0.0 Before you implement subnetting, the Network ID and Host ID are divided as follows:
Subettig Subettig is used to subdivide a sigle class of etwork i to multiple smaller etworks. Example: Your orgaizatio has a Class B IP address of 166.144.0.0 Before you implemet subettig, the Network
More informationBreaking Undercover: Exploiting Design Flaws and Nonuniform Human Behavior
Breakig Udercover: Exploitig Desig Flaws ad Nouiform Huma Behavior Toi Perković* FESB, Uiversity of Split, Croatia toperkov@fesbhr Shuju Li* Uiversity of Kostaz, Germay shujuli@uikostazde Asma Mumtaz
More informationAutomatic Tuning for FOREX Trading System Using Fuzzy Time Series
utomatic Tuig for FOREX Tradig System Usig Fuzzy Time Series Kraimo Maeesilp ad Pitihate Soorasa bstract Efficiecy of the automatic currecy tradig system is time depedet due to usig fixed parameters which
More informationSequences and Series
CHAPTER 9 Sequeces ad Series 9.. Covergece: Defiitio ad Examples Sequeces The purpose of this chapter is to itroduce a particular way of geeratig algorithms for fidig the values of fuctios defied by their
More informationResearch Method (I) Knowledge on Sampling (Simple Random Sampling)
Research Method (I) Kowledge o Samplig (Simple Radom Samplig) 1. Itroductio to samplig 1.1 Defiitio of samplig Samplig ca be defied as selectig part of the elemets i a populatio. It results i the fact
More information