Fast Set Intersection in Memory

Transcription

1 Fast Set Intersection in Memory Bolin Ding University of Illinois at Urbana-Champaign 2 N. Goodin Avenue Urbana, IL 68, USA bding3@uiuc.edu Arnd Christian König Microsoft Research One Microsoft Way Redmond, WA 9852, USA chriso@microsoft.com ABSTRACT Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear space data structures to represent sets such that their intersection can be computed in a orst-case efficient ay. In general, given (preprocessed) sets, ith totally n elements, e ill sho ho to compute their intersection in expected time O(n/ + r), here r is the intersection size and is the number of bits in a machine-ord. In addition,e introduce a very simple version of this algorithm that has eaer asymptotic guarantees but performs even better in practice; both algorithms outperform the state of the art techniques for both synthetic and real data sets and orloads.. INTRODUCTION Fast processing of set intersections is a ey operation in many query processing tass in the context of databases and information retrieval. For example, in the context of databases, set intersections are used in the context of various forms of data mining, text analytics, and evaluation of conunctive predicates. They are also the ey operations in enterprise and eb search. Many of these applications are interactive, meaning that the latency ith hich query results are displayed is a ey concern. It has been shon in the context of search that query latency is critical to user satisfaction, ith increases in latency directly leading to feer search queries being issued and higher rates of query abandonment [, 7]. As a consequence, significant portions of the sets to be intersected are often cached in main memory. This paper ill study the performance of set intersection algorithms for main-memory resident data. Note that these techniques are also relevant in the context of large dis-based (inverted) indexes, hen large fractions of these reside in a main memory cache. There has been considerable study of set intersection algorithms in information retrieval (e.g., [2, 4, ]). Most of these papers assume that the underlying data structure is an inverted index [23]. Much of this or (e.g., [2, 4]) focuses on adaptive algorithms hich use the number of comparisons as measure of overhead. For in-memory data, additional structures hich encode additional sipping-steps [8], tree-based structures [7], or hash-based algo- Permission to mae digital or hard copies of all or part of this or for personal or classroom use is granted ithout fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume ere invited to present their results at The 37th International Conference on Very Large Data Bases, August 29th - September 3rd 2, Seattle, Washington. Proceedings of the VLDB Endoment, Vol. 4, No. 4 Copyright 2 VLDB Endoment //... $.. rithms become possible, hich often outperform inverted indexes; e.g., using hash-based dictionaries, intersecting to sets L, L 2 requires expected time O(min( L, L 2 )), hich is a factor of Θ(log( + max( L / L 2, L / L 2 ))) better than the best possible orst-case performance of comparison-based algorithms [6]. In this or, e propose ne set intersection algorithms aimed at fast performance.these outperform the competing techniques for most inputs and are also robust in that for inputs here they are not optimal they are close to the best-performing algorithm. The tradeoff for this gain is a slight increase in the size of the data structures, hen compared to an inverted index; hoever, in user-facing scenarios here latency is crucial, this tradeoff is often acceptable.. Contributions Our approach leverages to ey observations: (a) If is the size (in bits) of a machine-ord, e can encode a set from a universe of elements in a single machine ord, alloing for very fast intersections. (b) For the data distributions seen in many real-life examples (in particular search applications), the size of intersections is typically much smaller than the smallest set being intersected. To illustrate the second observation, e analyzed the K most frequent queries issued against the Bing Shopping portal. For 94% of all queries it held that the size of the full intersection as at least one order of magnitude smaller than the document frequency of the least frequent eyord; for 76% of the queries the difference as to orders of magnitude. By exploiting these to observations, e mae the folloing contributions. (i) We introduce linear-space data structures to represent sets such that their intersection can be computed in a orst-case efficient ay. Given sets, ith n elements in total, these data structures allo us to compute their intersection in expected time O(n/ + r), here r is the size of the intersection and is the number of bits in a machine-ord; hen the size of the intersection is an order of magnitude (or more) smaller than the size of the smallest set being intersected, our approach yields significant improvements in execution time over previous approaches. To the best of our noledge, the best asymptotic bound for fast set intersection is achieved by the O ( (n(log 2 ) 2 )/ + r ) algorithm of [6]. Hoever, note that the bound relies on a large value of ; in practice, is small (and constant), and < 2 6 = bits implies / < (log 2 ) 2 /. More importantly, [6] requires complex bit-manipulation, maing it slo in practice, hich e ill demonstrate empirically in Section 4. (ii) We describe a much simpler algorithm that computes the intersection in expected O(n/α m + mn/ + r ) time, here α is a constant determined by, and m is a parameter. This algorithm has eaer guarantees in theory, but performs better in practice, and gives significant improvements over the various data structures typically used, hile being very simple to implement. 255

2 2. BACKGROUND AND RELATED WORK Algorithms based on Ordered Lists: Most or on set intersection focuses on ordered lists as the underlying data structure, in particular algorithms using inverted indexes, hich have become the standard data structure in information retrieval. Here, documents are identified via a document ID, and for each term t, the inverted index stores a sorted list of all document IDs containing t. Using this representation, to sets L, L 2 of similar sizes (i.e., L L 2 ) can be intersected efficiently using a linear merge by scanning both lists in parallel, requiring O( L + L 2 ) operations (the merge step in merge sort). This approach is asteful hen set sizes differ significantly or only small fractions of the sets intersect. For very different set sizes, algorithms have been proposed that exploit this asymmetry, requiring log ( L + L 2 ) L + L comparisons at most (for L < L 2 ) [6]. To improve the performance further, there has recently been significant or on so-called adaptive set-intersection algorithms for set intersections [2, 4, 3,, 2, 5]. These algorithms use the total number of comparisons as measure of the algorithm s complexity and aim to use a number of comparisons as close as possible to the minimum number of comparisons ideally required to establish the intersection. Hoever, the resulting reduction in the number of comparisons does not necessarily result in performance improvements in practice: for example, in [2], binary search based algorithms outperform a parallel scan only hen L 2 < 2 L, even though several times feer comparisons are needed. Hierarchical Representations: There are various algorithms for set intersections based on variants of balanced trees (e.g. [9], treaps [7], and sip-lists [8]), computing the intersection of (preprocessed) sets L, L 2 in O( L log( L 2 / L )) (for L < L 2 ) operations. Hoever, hile some form of sipping is commonly used as part of algorithms based on inverted indexes, sip-lists (or trees) are typically not used in the scenarios outlined above (ith static set data) due to the required space-overhead. A novel and compact to-level representation of posting lists aimed at fast intersections in main memory as proposed in [9]. Algorithms based on Hashing: Using a hash-based representation of sets can speed up the intersection of sets L, L 2 ith L L 2 significantly (expected time O( L ) by looing up all elements of L in the hash-table of L 2); hoever, because of the added indirection, this approach performs poorly for less seed set sizes. A ne hashing-based approach is proposed in [6]: here, the elements in sets L, L 2 are mapped using a hash-function h to smaller (approximate) representations h(l ), h(l 2). These representations are then intersected to compute H = h(l ) h(l 2). Finally, the set of all elements in the original sets that map to H via h are computed and any false positives removed. As the hashed images h(l ), h(l 2) to be intersected are smaller than the original sets (using feer bits), they can be intersected more quicly. Given sets of total size n, their intersection can be computed in expected time O ( (n log 2 )/ + r ), here r = i Li. Score-based pruning: In many IR engines it is possible to avoid computing full intersections by leveraging scoring functions that are monotonic in the individual term-ise scores; this maes it possible to terminate the intersection processing early using approaches such as TA [5] or document-at-a-time (DAAT) processing (e.g., [8]). Hoever, in practice, this is often not possible, either because of the complexity of the scoring function (e.g., nonmonotonic machine-learning based raning functions) or because full intersection results are required. Our approach is based on partitioning the elements in each set into very small ( 8 elements) groups, for hich e have fast intersection schemes. Hence, DAATapproaches can be combined ith our or by using these small groups in place of individual documents. Set intersections using multiple cores: Techniques that exploit multi-core architectures to speed up set intersections are described in [2, 22].The use of multiple cores is orthogonal to our approach in the sense that our algorithms can be parallelized for these architectures as ell; hoever, this is beyond the scope of our paper. 3. OUR APPROACH Notation: We are given a collection of N sets S = {L,..., L N }, here L i Σ and Σ is the universe of elements in the sets; let n i = L i be the size of set L i. Suppose elements in a set are ordered, and for a set L, let inf(l) and sup(l) be the minimum and maximum elements of a set L, respectively. We use to denote the size (number of bits) of a ord on the target processor. Throughout the paper e ill use log to denote log 2. Finally, e use [] to denote the set {,..., }. Our approach can be extended to bag semantics by additionally storing element frequency. Frameor: Our tas is to design data structures such that the intersection of multiple sets can be computed efficiently. We differentiate beteen a pre-processing stage, during hich e reorganize each set and attach additional index structures, and an online processing stage, hich uses the pre-processed data structures to compute intersections. An intersection query is specified via a collection of sets L, L 2,..., L (to simplify notations, e use the offsets, 2,..., to refer to the sets in a query throughout this section); our goal is to compute L L 2... L efficiently. Note that pre-processing is typical of most non-trivial data structures used for computing set intersections; even building simple non-compressed inverted indexes requires sorting the posting lists as a pre-processing step. We require the pre-processing stage to be time/space-efficient in that it does not require more than O(n i log n i) time (necessary for sorting) and linear space O(n i). The size of intersection L L 2 is a loer bound of the time needed to compute the intersection. Our method leverages to ey ideas to approach this loer bound: (i) The intersection of to sets in a small universe can be computed very efficiently; in particular, if the to sets are subsets of {, 2,..., }, e can encode them as single machine-ords and compute their intersection using a bitise-and. (ii) A small number of elements in a large universe can be mapped into a small universe. partitioning via sorting/hashing L L p L h :Σ [] h(l ) h(l 2 ) h(l p ) compute L p h(l p ) h(l q 2) L p L q (ith the help of h(l p L q 2: 2 ) h(l q 2)) h(l 2) h(l 2 2) h(l q 2) L L L q 2 L partitioning via sorting/hashing L 2 Figure : Algorithmic Frameor h :Σ [] We leverage these to ideas by first partitioning each set L i into smaller groups L i s, hich are intersected separately. In the pre-processing stage, e map each small group into a small universe [] = {, 2,..., } using a universal hash function h and encode the image h(l i ) ith a machine-ord. Then, in the on- 256

3 line processing stage, to compute the intersection of to small groups L p and Lq 2, e first use a bitise-and operation to compute H = h(l p ) h(lq 2 ), and then try to recover Lp Lq 2 using the inverse mapping h from H. The union of L p Lq 2 s forms L L 2. Moreover, if the intersection L L 2 is of a small size compared to L and L 2 (seen in practice), a large fraction of the small groups ith overlapping ranges has an empty intersection; thus, by using the ord-representations of H to detect these groups quicly, e can sip much unnecessary computation, resulting in significant speed-up. The resulting algorithmic frameor is illustrated in Figure. Given this overall approach, the ey questions become ho to form groups, hat structures to be used to represent them, and ho to process intersections of these small groups. We ill discuss these details in the folloing sections. All the formal proofs of analytical results are deferred to the appendix. 3. Intersection via Fixed-Width Partitions We first consider the case hen there are only to sets L and L 2 in the intersection query. We ill present a pair of pre-processing and online processing algorithms, hich e use to illustrate the basic ideas of our algorithms. We subsequently refine and extend our techniques to sets in Section 3.2. In the pre-processing stage, L and L 2 are sorted, and partitioned into groups (recall is the ord idth) L, L 2,..., L n /, and L 2, L 2 2,..., L n 2/ 2 of equal size (except the last ones). In the online processing stage (Algorithm ), the small groups are scanned in order. If the ranges of L p and Lq 2 overlap, e may have L p Lq 2. The intersection Lp Lq 2 of each pair of overlapping groups is computed (line 8) in some iteration. And finally, the union of all these intersections is L L 2. Since each group is scanned once, line 2- repeat for O((n + n 2)/ ) iterations. The maor remaining question no becomes ho to compute L p Lq 2 efficiently ith proper pre-processing? For this purpose, e map each group L p or Lq 2 into a small universe for fast intersection, and e leverage single-ord representations to store and manipulate sets from a small universe. Single-Word Representation of Sets: We represent a set A [] = {, 2,..., } using a single machine-ord of idth by setting the y-th bit as iff y A. We refer to this as the ord representation (A) of A. For to sets A and B, the bitise-and (A) (B) (computed in O() time) is the ord representation of A B. Given a ord representation (A), all the elements of A can be retrieved in linear time O( A ). In the rest of this paper, if A [], e use A to denote both a set and its ord representation. Pre-processing Stage: Elements in a set L i are sorted as {x i, x 2 i,..., x n i i } (i.e., x i < x + i ) and L i is partitioned as follos: L i = {x i,..., x i }, L 2 i = {x + i,..., x 2 i },... () L i = {x( ) + i, x ( ) +2 i,..., x i },... (2) For each small group L i, e compute the ord-representation of its image under a universal hash function h : Σ [], i.e., h(l i ) = {h(x) x L i }. In addition, for each position y [] and each small group L i, e also maintain the inverted mapping h (y, L i ) = {x x L i and h(x) = y}, i.e., for each y [] We use the folloing ell-non technique: ( is bitise-xor) (i) lobit = (((A) ) (A)) (A) is the loest -bit of (A). For the smallest element y in A, e have 2 y = lobit. y = log(lobit) A can be computed using the machine instruction NLZ (number of leading zeros) or pre-computed looup tables. (ii) Set (A) as (A) lobit and repeat (i) to scan the next smallest element until (A) becomes. e store the elements in L i ith hash value y, in a short list hich supports ordered access. We ensure that the order of these elements is identical across different h (y, L i ) s and Li s; in this ay, e can intersect these short lists using a linear merge. EXAMPLE 3.. (PRE-PROCESSING AND DATA STRUCTURES) Suppose e have to sets L = {, 2, 4, 9, 6, 27, 43}, L 2 = {, 3, 5, 9,, 6, 22, 32, 34, 49}. And, let = 6 ( = 4). For simplicity, h is selected to be h(x) = (x ) mod 6. L is partitioned into 2 groups: L = {, 2, 4, 9}, L 2 = {6, 27, 43}, and L 2 is partitioned into 3 groups: L 2 = {, 3, 5, 9}, L 2 2 = {, 6, 22, 32}, L 3 2 = {34, 49}. We pre-compute: h(l ) = {, 2, 4, 9}, h(l 2 ) = {, }, h(l 2) = {, 3, 5, 9}, h(l 2 2) = {, 6, }, h(l 3 2) = {, 2}. We also pre-process h (y, L p i ) s: for example, h (, L 2 ) = {6}, h (, L 2 2) = {6, 32}, h (, L 2 ) = {27, 43}, and h (, L 2 2) = {}. : p, q, 2: hile p n and q n 2 do 3: if inf(l q 2 ) > sup(lp ) then 4: p p + 5: else if inf(l p ) > sup(lq 2 6: ) then q q + 7: else 8: compute (L p Lq 2 ) using IntersectSmall 9: (L p Lq 2 ) : if sup(l p ) < sup(lq 2 ) then p p + else q q + : is the result of L L 2 Algorithm : Intersection via fixed-idth partitioning Online Processing Stage: The algorithm used to intersect to sets is shon in Algorithm. Since elements in L i are sorted, Algorithm ensures that if the ranges of any to small groups L p, Lq 2 overlap, their intersection is computed (line 8). After scanning all such pairs, must then contain the intersection of the hole sets. No the question is: ho to compute the intersection of to small groups L p Lq 2 efficiently? For this purpose, e introduce the algorithm IntersectSmall (Algorithm 2), hich: (i) first computes H = h(l p ) h(lq 2 ) using a bitise-and; (ii) for each (-bit) y H, intersects the corresponding inverted mappings using the linear merge algorithm. IntersectSmall(L p, Lq 2 ): computing Lp Lq 2 : Compute H h(l p ) h(lq 2 ) 2: for each y H do 3: Γ Γ (h (y, L p ) h (y, L q 2 )) 4: Γ is the result of L p Lq 2 Algorithm 2: Computing the intersection of small groups EXAMPLE 3.2. (ONLINE PROCESSING) Folloing Example 3., to compute L L 2, e need to compute L L 2, L 2 L 2 2, and L 2 L 3 2 (pairs ith overlapping ranges): for example, for computing L 2 L 2 2, e first compute h(l 2 ) h(l 2 2) = {, }; then L 2 L 2 2 = ( y=, h (y, L 2 ) h (y, L 2 2) ) = {6}. Similarly, e can compute L L 2 = {, 9}. Finally, e find h(l 2 ) h(l 3 2) =, and thus L 2 L 3 2 =. So, e have L L 2 = {, 9} {6}. Note that ord representations and inverted mappings for L i are pre-computed, and ord-representations can be intersected using one operation. So the running time of IntersectSmall is bounded by the number of pairs of elements, one from L p and one from Lq 2, that are mapped to the same hash-value. This number can be shon to be equal (in expectation) to the intersection size plus O() for each group L i. Using this, e obtain Algorithm s running time: 257

4 THEOREM 3.3. ) Algorithm computes L L 2 in expected + r time, here r = L L 2. O( n +n 2 To achieve a better bound, e optimize the group sizes: ith L and L 2 partitioned into groups of sizes s = n /n 2 and s 2 = n2/n, respectively, L L 2 can be computed in expected O( n n 2/ + r) time. A detailed analysis of the effect of group size on running times can be found in Section A... Overhead of Pre-processing: If only the bound in Theorem 3.3 is required, then to pre-process a set L i of size n i, it is obvious that O(n i log n i) time and O(n i) space suffice: e only need to partition a sorted list into small groups of size, and for each small group, construct the ord representation and inverted mapping in linear time using the hash function h. To achieve the better bound O( n n 2/+r), e need multiple resolutions of the partitioning of a set L i. This is because, as discussed above, the optimal group size s = n /n 2 of the set L also depends on the size n 2 of the set L 2 to be intersected ith it. For this purpose, e partition a set L i into small groups of size 2, 4,..., 2, etc. To compute L L 2 for the given to sets, suppose s i is the optimal group size of L i; e then select the actual group size s i = 2 t s.t. s i s i 2s i, obtaining the same bound. A carefully-designed multi-resolution data structure enabling access to these groups consumes only O(n i) space for L i. We ill describe and analyze this structure in Section THEOREM 3.4. To pre-process a set L i of size n i for Algorithm, e need O(n i log n i) time and O(n i) space (in ords). Limitations of Fixed-Width Partitions: The main limitation of the proposed approach is that it is difficult to extend to more than to sets, because the partitioning scheme e use is not ell-aligned for more than to sets: for three sets, e.g., there may be more than O((n + n 2 + n 3)/ ) triples of small groups that overlap. We introduce a different partitioning scheme to address this issue in Section 3.2, hich extends to > 2 sets. 3.2 Intersection via Randomized Partitions In this section, e ill introduce an algorithm based on a randomized partitioning scheme to compute the intersection of to or more sets. The general approach is as follos: instead of fixedidth partitions, e use a hash function g to partition each set into small groups, using the most significant bits of g(x) to group an element x Σ. This reduces the number of combinations (pairs) of small groups e have to intersect, alloing us to prove bounds similar to Theorem 3.3 for computing intersections of > 2 sets. Pre-processing Stage: Let g be a hash function g : Σ {, } mapping an element to a bit-string (or binary number); e use g t(x) to denote the t most significant bits of g(x). We say that for to bit-strings z and z 2, z is a t -prefix of z 2, iff z is identical to the highest t bits in z 2; e.g., is a 4-prefix of. To pre-process a set L i, e partition it into groups L z i = {x x L i and g t(x) = z} for all z {, } t (some t). As before, e compute the ord representation of the image of each L z i under another hash function h : Σ [], and inverted mappings h. Online Processing Stage: This stage is similar to our previous algorithm: to compute the intersection of to sets L and L 2, e compute the intersections of pairs of overlapping small groups, one from each set, and finally tae the union of these intersections. In general, suppose L is partitioned using g t : Σ {, } t, and L 2 is partitioned using g t2 : Σ {, } t 2. Assume n n 2 and t t 2. We no intersect sets L and L 2 using Algorithm 3. The maor improvement of Algorithm 3 compared to Algorithm is that in Algorithm, e need compute L p Lq 2 hen the ranges of L p and Lq 2 overlap; in Algorithm 3, e compute Lz Lz 2 2 (also using Algorithm 2) hen z is a t -prefix of z 2 (this is a necessary condition for L z Lz 2 2 ; so Algorithm 3 is correct). This significantly reduces the number of pairs to be intersected. : for each z 2 {, } t 2 do 2: Let z {, } t be the t -prefix of z 2 3: Compute L z Lz 2 2 using IntersectSmall(Lz 4: Let (L z Lz 2 2 5: ) is the result of L L 2, Lz 2 2 ) Algorithm 3: 2-list Intersection via Randomized Partitioning Based on the choices of parameters t and t 2, e can either partition L and L 2 into the same number of small groups (yielding the bound of Theorem 3.5), or into small groups of the (approximately) identical sizes (yielding Theorem 3.6). ( THEOREM 3.5. Algorithm 3 computes L L 2 in expected n ) n O 2 + r time (r = L L 2 ), ith t = t 2 = log n n 2. THEOREM 3.6. ) Algorithm 3 computes L L 2 in expected + r time (r = L L 2 ), using t = log(n / ) O( n +n 2 and t 2 = log(n 2/ ). Note that hen n n 2, Theorem 3.5 has a better bound than Theorem 3.6. But e can extend Theorem 3.6 to -set intersection. Extension to More Than To Sets: Suppose e ant to compute the intersection of sets L,..., L, here n i = L i and n n 2... n. L i is partitioned into groups L z i s using g ti : Σ {, } t i. Note that g ti s are generated from the same hash function g. We use t i = log(n i/ ) and proceed as in Algorithm 4. Algorithm 4 is almost identical to Algorithm 3, but is generalized to sets: for each z {, } t, e pic the group identifiers z i to be the t i-prefix of z, and e only intersect groups L z, Lz 2 2,..., Lz, here z, z 2,..., z share a prefix of size t. Also, e extend IntersectSmall (Algorithm 2) for groups: e first compute the intersection (bitise-and) of hash images (their ord-representations) of the groups L z i i s; and, if the result H = i= h(lz i i ) is not zero, for each (-bit) y H, e intersect the corresponding inverted mappings h (y, L z i i ) s. Details and analysis are deferred to the appendix. THEOREM 3.7. Using t i = log(n i/ ), Algorithm 4 computes the intersection i= Li of sets in expected O(n/ + r) time, here r = i= Li and n = i= ni = i= Li. : for each z {, } t (t i = log(n i / ) ) do 2: Let z i be the t i -prefix of z for i =,..., 3: Compute i= Lz i i using extended IntersectSmall 4: Let ( i= Lz i i ) 5: is the result of i= L i Algorithm 4: -list Intersection via Randomized Partitioning 3.2. A Multi-resolution Data Structure Recall that in some algorithms (e.g., Theorem 3.5), the selection of the number of small groups used for a set L i depends on the (size of) other sets being intersected ith L i. So by naively precomputing the required structures for each possible group size, e ould incur excessive space requirements. In this section, e describe a data structure that supports access to partitions of L i into 2 t groups for any possible t, using only O(n i) space. It is illustrated in Figure 2. To support the algorithms introduced so far, this structure must also allo us: (i) for each L z i, to retrieve the ord-representation h(l z i ), and 258

5 pointers 2 x L z i x L z i partitioned by g L z i partitioned by g 2 x L z i partitioned by g 3 y pointer to the first element in L z i s.t. h(x) =y next(x) = the smallest x L i s.t. x x and h(x) =h(x ) Figure 2: Multi-Resolution Partition of L i g :Σ {, } g 2 :Σ {, } 2 g 3 :Σ {, } 3 g t :Σ {, } t (ii) for each y [], to access all elements in h (y, L z i ) = {x x L z i and h(x) = y} in time linear in its size h (y, L z i ). Multi-resolution Partitioning: For the ease of explanation, e suppose Σ = {, } and choose g as a random permutation of Σ. To pre-process L i, e first order all the elements x L i according to g(x). Then any small group L z i = {x x L i and g t(x) = z} forms a consecutive interval in L i (partitions of different resolutions are formed for t =, 2,...). Note: in all of our algorithms, universal hash functions and random permutations are almost interchangeable (hen used as g) the differences being that (i) a permutation induces a total ordering of elements (in this data structure, this property is required), hereas hashing may result in collisions (hich e can overcome by using the pre-image to brea ties) and (ii) there is a slight difference in the resulting probability of, e.g., elements being grouped together (hashing results in (limited) independence, hereas permutations result in negative dependence e account for this by using the eaer condition in our proofs). Word Representations of Hash Mappings: No, for each small group L z i, e need to pre-compute and store the ord representation h(l z i ). Note the total number of small groups is n i/2+n i/ n i/2 t +... n i. So this requires O(n i) space. Inverted Mappings: We need to access all elements in h (y, L z i ) in order, for each y []. If e ere to store these mappings for each L z i explicitly, this ould require O(n i log n i) space. Hoever, by storing the inverted mappings h (y, L z i ) s implicitly, e can do better, as follos: For each group L z i, since it corresponds to an interval in L i, e can store the starting and ending positions in L i, denoted by left(l z i ) and right(l z i ). These allo us to determine if an element x belongs to L z i. No, to enable the ordered access to the inverted mappings, e define, for each x L i, next(x) to be the next element x to x on the right s.t. h(x ) = h(x) (i.e., ith minimum g(x ) > g(x) s.t. h(x ) = h(x)). Then, for each L z i and each y [], e store the position first(y, L z i ) of the first element x in L z i s.t. h(x ) = y. No, to access all elements in h (y, L z i ) in order, e can start from the element at first(y, L z i ), and follo the pointers next(x), until passing the right boundary right(l z i ). And, in this ay, all elements in the inverted mapping are retrieved in the same order as g(x) hich e require for IntersectSmall. Space Requirements: For all groups of different sizes, the total space for storing h(l z i ) s, left(l z i ) s, right(l z i ) s, first(y, L z i ) s and next(x) s is O(n i). So the hole multi-resolution data structure requires O(n i) space. A detailed analysis is in the appendix. When the group size t i depends only on n i (e.g., in Algorithm 4), single-resolution in pre-processing suffices, and the above multiresolution scheme (for selecting t i online) is not necessary. THEOREM 3.8. To pre-process a set L i of size n i for Algorithm 3-4, e need O(n i log n i) time and O(n i) space (in ords). 3.3 From Theory to Practice In this section, e describe a more practical version of our methods. This algorithm is simpler, uses significantly less memory, straight-forard data structures, and, hile it has orse theoretical guarantees, is faster in practice. The main difference is that for each small group L z i, e only store the elements in L z i and their images under m hash functions (i.e., e do not maintain inverted mappings, trading off a complex O()-access for a simple scan over a short bloc of data). Also, e use only a single partition for each set L i. Having multiple ord representations of hash images (different hash functions) for each small group allos us to detect empty intersections of small groups ith higher probability. Pre-processing Stage: As before, each set L i is partitioned into groups L z i s using a hash function g ti : Σ {, } t i. We ill sho that a good selection of t i is log(n i/ ), hich depends only on the size of L i. Thus for each set L i, pre-processing ith a single partitioning suffices, saving significant memory. For each group, e compute ord representations of images under m (independent) universal hash functions h,..., h m : Σ []. Note that e only require a small value of m in practice (e.g., m = 2). Online Processing Stage: The algorithm for computing il i e use here (Algorithm 5) is identical to Algorithm 4, ith to exceptions: () When needed, il z i i is directly computed by a linear merge of L z i i s (line 4), using O(Σi Lz i i ) time. (2) We can sip the computation of il z i i if, for some h, the bitise-and of the corresponding ord representations h (L z i i ) s is zero (line 3). : for each z {, } t (t i = log(n i / ) ) do 2: Let z i be the t i -prefix of z for i =,..., 3: if i= h (L z i i ) for all =,..., m then 4: Compute i= Lz i i by a linear merge of L z,..., Lz 5: Let ( i= Lz i i ) 6: is the result of i= L i Algorithm 5: Simple Intersection via Randomized Partitioning Analysis: To see hy Algorithm 5 is efficient, e observe that: if L z Lz 2 2 =, then ith high probability, h(lz ) h(lz 2 2 ) = for some =,..., m. So most empty intersections can be sipped using the test in line 3. With the probability of a successful filtering (i.e. given il z i i =, ih (L z i i ) = for some hash function h, =,..., m) bounded by the Lemmas A. and A.3, e can derive Theorem 3.9. Detailed analysis of this probability (both theoretical and experimental) and overall complexity is deferred to Appendix A.5. THEOREM 3.9. Using t i = log(n i/ ), Algorithm 5 computes ( i= Li in expected O max(n,n ) + mn α() m + r ) time (r= i= L i, n= i= n i, α()= for β() used in Lemma A.3). β() 3.3. Data Structure for Storing L z i In this section, e describe the simple and space-efficient data structure that e use in Algorithm 5. As stated earlier, e only need to partition L i using one hash function g ti ; hence e can represent each L i as an array of small groups L z i s, ordered by z. For each small group, e store the information associated ith it in the structure shon in Figure 3. The first ord in this structure stores z = g ti (L z i ). The second ord stores the structure s length len. The folloing m ords represent the hash images h (L z i ),..., h m(l z i ) of L z i. Finally, e store the elements of L z i as an array in the remaining part. We need n i/ such blocs for 259

6 z len h (L z i ) m ords h m (L z i ) len Figure 3: The Structure for a Pre-processed Small Group L z i L i in total. The first ord z can be also computed on-the-fly, as these small groups are accessed sequentially in Algorithm 5. So, if e store len using one ord, and one ord for each element of L z i, then e need totally m + + L z i ords for each group L z i, and thus n i( + (m + )/ ) ords to store the pre-processed L i. The overhead of the pre-processing is dominated by the cost of sorting L i (the remaining operations are trivial). THEOREM 3.. To pre-process a set L i of size n i for Algorithm 5, e need O(n i(m + log n i)) time, and O(n i( + m/ )) (ords) space. We describe methods for compressing this structure in Appendix B. 3.4 Intersecting Small and Large Sets An important special case for set intersection are asymmetric intersections here the sizes n and n 2 of the sets that are intersected vary significantly (.l.o.g., assume n n 2). In this subsection, using the same multi-resolution data structure as in Section 3.2., e present an algorithm HashBin that computes L L 2 in O(n log(n 2/n )) time. This bound is also achieved by other previous ors, e.g., SmallAdaptive [5], but our algorithm is even simpler in online processing. It is also non that algorithms based on hash-tables only require O(n ) time for this scenario; hoever, unlie HashBin, they are ill-suited for less asymmetric cases. Algorithm HashBin: When intersecting to sets L and L 2 ith sizes n n 2, e focus on the partitioning induced by g t : Σ {, } t, here t = log n for both of them, and g is a random permutation of Σ. To compute L L 2, e compute L z L z 2 for all z {, } t and tae the union. To compute L z L z 2, e iterate over each x L z, and perform a binary search to chec hether x L z 2 using O(log L z 2 ) time. This scheme can be extended to multiple sets by searching for x in L z i if found in L z,..., L z i. THEOREM ( 3.. The algorithm HashBin computes L L 2 in expected O n log n 2 time. The pre-processing of a list L i n ) requires O(n i log n i) time and O(n i) space. The proof of Theorem 3. and ho HashBin uses the multiresolution data structure is deferred to the Section A.6 in the appendix. The advantage of HashBin is that, since it is based on the same structure as the algorithm introduced in Section 3.2, e can mae the choice beteen algorithms online, based on n /n EXPERIMENTAL EVALUATION We evaluate the performance and space requirements of four of the techniques described in this paper: (a) the fixed-idth partition algorithm described in Section 3. (hich e ill refer to as IntGroup); (b) the randomized partition algorithm in Section 3.2 (RanGroup) (c) the simple algorithm based on randomized partitions described in Section 3.3 (RanGroupScan); and (d) the one for intersecting sets of seed sizes in Section 3.4 (HashBin). Setup: All algorithms are implemented using C and evaluated on a 4GB 64-bit 2.4GHz PC. We employ a random permutation of the document IDs for the hash function g and 2-universal hash functions for h (or h s). For RanGroup, e use m = 4 (the number of hash functions h ), unless noted otherise. We compare our techniques to the folloing competitors: (i) set intersection based on a simple parallel scan of inverted indexes: Merge; (ii) set intersection based on sip lists: SipList [8]; (iii) L z i set intersection based on hash tables: Hash (i.e., e iterate over the smallest set L, looing up every element x L in hash-table representations of L 2,... L ); (iv) the algorithm of [6]: BPP; (v) the algorithm proposed for fast intersection in integer inverted indices in main memory [9, 2]: Looup (using B = 32 as the bucetsize, hich is the best value in our and the authors experience); and (vi) various adaptive intersection algorithms based on binary search/galloping search: SvS, Adaptive [2, 3, 3], BaezaYates [, 2], and SmallAdaptive [5]. Note that BaezaYates is generalized to handle more than to sets as in [5]. Implementation: For each competitor, e tried our best to optimize its performance. For example, for Merge e tried to minimize the number of branches in the inner loop; e also store postings in consecutive memory addresses to speed up parallel scans and reduce page als after TLB misses. Our implementation of sip lists follos [8], ith simplifications since e are focusing on static data and do not need fast insertion/deletion. We also simplified the bit-manipulation in BPP [6] so that it ors faster in practice for small. For the algorithms using inverted indexes, e initially do not consider compression on the posting lists, as e do not ant the decompression step to impact the performance reports. In Section 4. e ill study variants of the algorithms incorporating compression. With regards to sip-operations in the index note that since e use uncompressed posting lists, algorithms such as Adaptive can perform arbitrary sips into the index directly. Datasets: To evaluate these algorithms e use both synthetic and real data. For the experiments ith synthetic datasets, sets are generated randomly (and uniformly) from a universe Σ. The real dataset is a collection of more than 8M Wiipedia pages. In each experiment for the synthetic datasets, 2 combinations of sets are randomly generated, and the average time is reported. Intersection Time (ms) M 2M 3M 4M 5M 6M 7M 8M 9M M Set Size Figure 4: Varying the Set Size Merge SipList Hash IntGroup BPP Adaptive Looup RanGroupScan Varying the Set Size: First, e measure the performance hen intersecting only 2 sets; e use synthetic data, the lists are of equal size and the size of the intersection is fixed at % of the list size; the results are shon in Figure 4. We can see that the performance of the different techniques relative to each other does not change ith varying list size. Hash performs orst, as the (relatively) expensive looup operation needs to be performed many times. SipList performs poorly for the same reason. The BPP algorithm is also slo, but this is because of a number of complex operations that need to be performed, hich are hidden as a constant in the O()-notation. The same trend held for the remaining experiments as ell; hence, for readability, e did not include BPP in the subsequent graphs. For the same reason e only sho the best-performing among the adaptive algorithms in the evaluation; if one adaptive algorithm dominates another on all parameter settings in an experiment, e don t plot the orse one. Among the remaining algorithms, RanGroupScan (4%-5% faster than Merge) and IntGroup perform the best (RanGroup performs similarly to IntGroup and is not plotted). Interestingly, 26

7 Intersection TIme (ms) MERGE performs best for L L 2 >.7 L Merge SipList Hash Adaptive SvS Looup IntGroup RanGroup Average Intersection Time (ms) K=2 K=3 K=4 RanGroupScan Intersection Size Figure 5: Varying the Intersection Size Figure 6: Varying the Number of Keyords the simple Merge algorithm is next, outperforming the more sophisticated algorithms, folloed by Looup and the best-performing adaptive algorithm. Varying the Intersection Size: The size of the intersection r is an important factor concerning the performance of the algorithms: larger intersections mean feer opportunities to eliminate small groups early for our algorithms or to sip parts of the set for the adaptive and siplist-based approaches. Here, e use synthetic data, intersecting to sets ith M elements and vary r = L L 2 beteen 5 and M. The results are reported in Figure 5. For r < 7M (7% of the set size) RanGroupScan and IntGroup perform best. Otherise, Merge becomes the fastest and RanGroup- Scan the 2nd-fastest alternative; here, the performance of Ran- GroupScan is very similar to Merge, all the ay to r = M. Among the remaining algorithms, RanGroup slightly outperforms Merge for r < 5M, Looup is the next-best algorithm and SvS and Adaptive perform best among the adaptive algorithms. Varying the Sets Size Ratios: As e illustrated in the introduction, the se in set sizes is also an important factor in performance. When sets are very different in size, algorithms that iterate through the smaller set and are able to locate the corresponding values in the larger set quicly, such as HashBin and Hash, perform ell. In this experiment e use synthetic data and vary the ratio of set sizes, setting L 2 = M and varying L beteen 6K and M. The size of the intersection is set to be % of L and e define the ratio beteen the list sizes as sr = L 2 / L. Here, the differences beteen the algorithms become small ith groing sr (for this reason, e also don t report them in a graph, as too many lines overlap). For sr < 32, RanGroupScan performs best; for larger sr, Looup and Hash perform best, until a ratio of sr for this and larger ratios, Hash outperforms the remaining algorithms, folloed by Looup and HashBin. Generally, both HashBin and RanGroupScan perform close to the best-performing algorithm. The adaptive algorithms require more time than RanGroupScan for sr 2 and more time than HashBin for all values of sr; Siplist and BPP perform orst across all values of sr. Varying the Number of Keyords: In this experiment, e varied the number of sets = 2, 3, 4, fixing L i = M for i =,...,, ith the IDs in the sets being randomly generated using a uniform distribution over [, 2 8 ]; the results are reported in Figure 6. In this experiment, e use m = 2 hash images for RanGroupScan. For multiple sets, RanGroupScan is the fastest, ith the difference becoming more pronounced for 3 and 4 eyords, since, ith additional sets, intersecting the hash-images (ord-representations) yields more empty results, alloing us to sip the corresponding groups. RanGroup is the next-best performing algorithm; e don t include results for IntGroup here, as it is designed for intersections of to sets (see Section 3.). In- terestingly, the simple Merge algorithm again performs very ell hen compared to the more sophisticated techniques; the Looup algorithm is next, folloed by the various adaptive techniques. Size of the Data Structure: The improvements in speed come at the cost of an increase in space: our data structures (ithout compression) require more space than an uncompressed posting list the increase is 37% (RanGroupScan for m = 2), 63% (Ran- GroupScan for m = 4), 75% (IntGroup) or 87% (RanGroup). Normalized Intersection Time Results: All Query Sizes Merge Figure 7: Normalized Execution Time on a Real Worload Experiment on Real Data: In this experiment, e used a orload of the 4 most frequent (measured over a ee in 29) queries against the Bing.com search engine. As the text corpus, e used a set of 8 Million Wiipedia documents. Query characteristics: 68% of the queries contain 2 eyords, 23% 3 eyords and 6% 4 eyords. As e have illustrated before, a ey factor for performance is the ratio of set sizes among the 2-ord queries, the average ratio L / L 2 is.2, for 3-ord queries the average ratio L / L 2 is.3 and the average ratio L / L 3 is.9, and for 4-ord queries, the L / L 2 ratio is.36 and the L / L 4 ratio is.6 note that L L 2 L 3 L 4. The average ratio of intersection size to L is.9. To illustrate the relative performance of the algorithms over all queries e plotted their average running times in Figure 7: here, the running time of Merge is normalized to. Both RanGroup and RanGroupScan significantly outperform Merge, ith the latter performing the best overall; interestingly, hen used for all queries (as opposed to only for the large se case it as designed for) HashBin still performed better than Merge. The remaining algorithms performed in similar order to the earlier experiments, ith the one exception being SvS hich outperformed both Merge and Looup for this more realistic data. Overall, the RanGroupScan as the best-performing algorithm for 6.6% of the queries, folloed by RanGroup (6%) and Hashbin (7.7%) among the remaining algorithms not proposed in this paper, Looup performed best in 6.4% of the queries and SvS for 3.6% of the queries. All of the other techniques ere best for 2.% of the queries or feer. We present additional experiments for this data set in the Appendix C.2. 26

8 4. Experiments on Compressed Structures To illustrate the impact of compression on performance, e repeated the first experiment above, intersecting to sets of identical size, ith the size of the intersection fixed to % of the set size. Varying the set size, e report the execution times and storage requirements for the three algorithms that performed best overall in the earlier experiments Merge, Looup and RanGroupScan (since e are interested in small structures here, e only use m = hash images in RanGroupScan) hen being compressed ith different techniques: e used the standard techniques based on γ- and δ-coding (see [23], p.6) to compress the parts of the posting data stored and accessed sequentially for the three algorithms, and the compression technique described in Appendix B for Ran- GroupScan (hich e refer to as RanGroupScan Lobits). The results are shon in Figure 8; here, e omitted the results for γ- encoding as they ere essentially indistinguishable from ones for δ-coding. RanGroupScan outperforms in terms of speed the other to algorithms using the same compression scheme; the other to algorithms perform similarly to each other, as the decompression no dominates their run-time. Using our encoding scheme of Appendix B improves the performance significantly. Looing at the graph, e can see that the storage requirement for RanGroupScan (using our on encoding) is beteen.3-.9x of the size of the compressed inverted index and beteen.2-.6x of the compressed Looup structure. At the same time, the performance improvements are beteen 7.6-5x (vs. Merge) or 7.4-3x (vs. Looup). Furthermore, by increasing the number of hash images to m = 2, e obtain an algorithm that significantly outperforms the uncompressed Merge, hile requiring less memory. Intersection time (ms) Size of the structure (in ords) Merge_Delta Looup_Delta RanGroupScan_Lobits RanGroupScan_Delta 28K 256K 52K M 2M 4M 8M Number of postings Merge_Delta Looup_Delta RanGroupScan_Lobits RanGroupScan_Delta 28K 256K 52K M 2M 4M 8M Number of postings Figure 8: Running Time and Space Requirement Experiment on Real Data: We repeated this experiment using the real-life data/orload described earlier and the compressed variants of RanGroupScan, Looup and Merge. Again, Ran- GroupScan Lobits performed best, improving run-times by a factor of 8.4x (vs. Merge + δ-coding), 9.x (Merge + γ-coding), 5.7x (Looup + δ-coding), 6.2x (Looup + γ-coding), respectively. Hoever, our approach required the most space (66% of the uncompressed data), hereas Merge (26% / 28% for γ- / δ-coding) and Looup (35% / 37%) required significantly less. Finally, to illustrate the robustness of our techniques, e also measured the orst-case latency for any single query: here, the orst-case latency using Merge + δ-coding as 5.2x the orstcase latency of RanGroupScan Lobits. We sa similar results for Merge + γ-coding (5.6x higher), Looup + δ-coding (4.4x higher), and Looup + γ-coding (4.9x higher). 5. CONCLUSION In this paper e introduced algorithms for set intersection processing for memory-resident data. Our approach provides both novel theoretical orst-case guarantees as ell as very fast performance in practice, at the cost of increased storage space. Our techniques outperform a ide range of existing techniques and are robust in that for inputs for hich they are not the best-performing approach they perform close to the best one. Our techniques have applications in information retrieval and query processing scenarios here performance is of greater concern than space. 6. REFERENCES [] R. A. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In CPM, pages 4 48, 24. [2] R. A. Baeza-Yates and A. Salinger. Experimental Analysis of a Fast Intersection Algorithm for Sorted Sequences. In SPIRE, pages 3 24, 25. [3] J. Barbay and C. Kenyon. Adaptive intersection and t-threshold problems. In SODA, pages , 22. [4] J. Barbay, A. López-Ortiz, and T. Lu. Faster Adaptive Set Intersections for Text Searching. In 5th WEA, pages 46 57, 26. [5] J. Barbay, A. López-Ortiz, T. Lu, and A. Salinger. An experimental investigation of set intersection algorithms for text searching. ACM Journal of Experimental Algorithmics, 4:7 24, 2. [6] P. Bille, A. Pagh, and R. Pagh. Fast Evaluation of Union-Intersection Expressions. In ISAAC, pages , 27. [7] G. E. Blelloch and M. Reid-Miller. Fast Set Operations using Treaps. In ACM SPAA, pages 6 26, 998. [8] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a to-level retrieval process. In CIKM, pages , 23. [9] M. R. Bron and R. E. Taran. A Fast Merging Algorithm. Journal of the ACM, 26(2):2 226, 979. [] J. Brutlag. Speed Matters for Google Web Search [] E. Chiniforooshan, A. Farzan, and M. Mirzazadeh. Worst case optimal union-intersection expression evaluation. In ALENEX, pages 79 9, 2. [2] E. Demaine, A. López-Ortiz, and J. Munro. Adaptive Set Intersections, Unions, and Differences. In SODA, pages , 2. [3] E. Demaine, A. López-Ortiz, and J. Munro. Experiments on Adaptive Set Intersections for Text Retrieval Systems. In ALENEX, pages 9 4, 2. [4] D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge, 29. [5] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleare. In PODS, pages 2 3, 2. [6] F. K. Hang and S. Lin. A Simple Algorithm for Merging To Disoint Linearly Ordered Sets. SIAM Journal, ():3 39, 972. [7] G. Linden. [8] W. Pugh. A sip list cooboo. Technical Report UMIACS-TR , University of Maryland, 99. [9] P. Sanders and F. Transier. Intersection in Integer Inverted Indices. In ALENEX, pages 7 83, 27. [2] S. Tationda, F. Junqueira, B. B. Cambazoglu, and V. Plachouras. On Efficient Posting List Intersection ith Multicore Processors. In ACM SIGIR, pages , 29. [2] F. Transier and P. Sanders. Compressed inverted indexes for in-memory search engines. In ALENEX, pages 3 2, 28. [22] D. Tsirogiannis, S. Guha, and N. Koudas. Improving the performance of list intersection. PVLDB, 2(): , 29. [23] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes - Compressing and Indexing Documents and Images. Morgan Kaufman Publishers,