Fast Set Intersection in Memory


Fast Set Intersection in Memory

Bolin Ding, University of Illinois at Urbana-Champaign, 201 N. Goodwin Avenue, Urbana, IL 61801, USA
Arnd Christian König, Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA

ABSTRACT Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets, with n elements in total, we will show how to compute their intersection in expected time O(n/√w + kr), where r is the intersection size, k is the number of sets, and w is the number of bits in a machine-word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state of the art techniques for both synthetic and real data sets and workloads.

1. INTRODUCTION Fast processing of set intersections is a key operation in many query processing tasks in the context of databases and information retrieval. For example, in the context of databases, set intersections are used in various forms of data mining, text analytics, and the evaluation of conjunctive predicates. They are also the key operations in enterprise and web search. Many of these applications are interactive, meaning that the latency with which query results are displayed is a key concern. It has been shown in the context of search that query latency is critical to user satisfaction, with increases in latency directly leading to fewer search queries being issued and higher rates of query abandonment [, 7]. As a consequence, significant portions of the sets to be intersected are often cached in main memory. This paper will study the performance of set intersection algorithms for main-memory resident data. Note that these techniques are also relevant in the context of large disk-based (inverted) indexes, when large fractions of these reside in a main memory cache.
There has been considerable study of set intersection algorithms in information retrieval (e.g., [2, 4, ]). Most of these papers assume that the underlying data structure is an inverted index [23]. Much of this work (e.g., [2, 4]) focuses on adaptive algorithms which use the number of comparisons as a measure of overhead. For in-memory data, additional structures which encode additional skipping-steps [8], tree-based structures [7], or hash-based algorithms become possible, which often outperform inverted indexes; e.g., using hash-based dictionaries, intersecting two sets L1, L2 requires expected time O(min(|L1|, |L2|)), which is a factor of Θ(log(1 + max(|L1|/|L2|, |L2|/|L1|))) better than the best possible worst-case performance of comparison-based algorithms [6]. In this work, we propose new set intersection algorithms aimed at fast performance. These outperform the competing techniques for most inputs, and are also robust in that for inputs where they are not optimal, they are close to the best-performing algorithm. The trade-off for this gain is a slight increase in the size of the data structures, when compared to an inverted index; however, in user-facing scenarios where latency is crucial, this trade-off is often acceptable.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 37th International Conference on Very Large Data Bases, August 29th - September 3rd 2011, Seattle, Washington. Proceedings of the VLDB Endowment, Vol. 4, No. 4. Copyright 2011 VLDB Endowment.
1.1 Contributions Our approach leverages two key observations: (a) If w is the size (in bits) of a machine-word, we can encode a set from a universe of w elements in a single machine word, allowing for very fast intersections. (b) For the data distributions seen in many real-life examples (in particular search applications), the size of intersections is typically much smaller than the smallest set being intersected. To illustrate the second observation, we analyzed the 10K most frequent queries issued against the Bing Shopping portal. For 94% of all queries it held that the size of the full intersection was at least one order of magnitude smaller than the document frequency of the least frequent keyword; for 76% of the queries the difference was two orders of magnitude. By exploiting these two observations, we make the following contributions. (i) We introduce linear-space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. Given k sets, with n elements in total, these data structures allow us to compute their intersection in expected time O(n/√w + kr), where r is the size of the intersection and w is the number of bits in a machine-word; when the size of the intersection is an order of magnitude (or more) smaller than the size of the smallest set being intersected, our approach yields significant improvements in execution time over previous approaches. To the best of our knowledge, the best asymptotic bound for fast set intersection is achieved by the O((n(log₂ w)²)/w + kr) algorithm of [6]. However, note that this bound relies on a large value of w; in practice, w is small (and constant), and w < 2^16 bits implies 1/√w < (log₂ w)²/w. More importantly, [6] requires complex bit-manipulation, making it slow in practice, which we will demonstrate empirically in Section 4. (ii) We describe a much simpler algorithm that computes the intersection in expected O(n/α(w)^m + mn/√w + kr) time, where α(w) is a constant determined by w, and m is a parameter.
This algorithm has weaker guarantees in theory, but performs better in practice, and gives significant improvements over the various data structures typically used, while being very simple to implement.
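Observation (a) above can be illustrated directly in code: a set drawn from a universe of w elements fits in one machine word, so two such sets intersect with a single bitwise AND. The following is a minimal sketch (not from the paper; Python integers stand in for w-bit machine words, and the names are illustrative):

```python
W = 64                                  # word width in bits

def encode(small_set):
    """Word representation: bit y is set iff y is in the set (0 <= y < W)."""
    word = 0
    for y in small_set:
        word |= 1 << y
    return word

def decode(word):
    """Retrieve all elements of a word representation in O(|A|) steps."""
    out = []
    while word:
        lowbit = word & -word           # isolate the lowest set bit
        out.append(lowbit.bit_length() - 1)
        word ^= lowbit                  # clear it and continue
    return out

a, b = encode({1, 2, 4, 9}), encode({1, 3, 5, 9})
print(decode(a & b))                    # -> [1, 9]
```

The intersection itself costs one machine instruction (`a & b`); only reading out the result takes time proportional to its size.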
2. BACKGROUND AND RELATED WORK

Algorithms based on Ordered Lists: Most work on set intersection focuses on ordered lists as the underlying data structure, in particular algorithms using inverted indexes, which have become the standard data structure in information retrieval. Here, documents are identified via a document ID, and for each term t, the inverted index stores a sorted list of all document IDs containing t. Using this representation, two sets L1, L2 of similar sizes (i.e., |L1| ≈ |L2|) can be intersected efficiently using a linear merge by scanning both lists in parallel, requiring O(|L1| + |L2|) operations (the merge step in merge sort). This approach is wasteful when set sizes differ significantly or only small fractions of the sets intersect. For very different set sizes, algorithms have been proposed that exploit this asymmetry, requiring at most roughly |L1| log(|L2|/|L1|) comparisons (for |L1| < |L2|) [6]. To improve the performance further, there has recently been significant work on so-called adaptive algorithms for set intersections [2, 4, 3, , 2, 5]. These algorithms use the total number of comparisons as a measure of the algorithm's complexity and aim to use a number of comparisons as close as possible to the minimum number ideally required to establish the intersection. However, the resulting reduction in the number of comparisons does not necessarily result in performance improvements in practice: for example, in [2], binary-search-based algorithms outperform a parallel scan only when |L2| > 20|L1|, even though several times fewer comparisons are needed. Hierarchical Representations: There are various algorithms for set intersections based on variants of balanced trees (e.g. [9], treaps [7], and skip lists [8]), computing the intersection of (preprocessed) sets L1, L2 in O(|L1| log(|L2|/|L1|)) (for |L1| < |L2|) operations.
However, while some form of skipping is commonly used as part of algorithms based on inverted indexes, skip lists (or trees) are typically not used in the scenarios outlined above (with static set data) due to the required space overhead. A novel and compact two-level representation of posting lists aimed at fast intersections in main memory was proposed in [9]. Algorithms based on Hashing: Using a hash-based representation of sets can speed up the intersection of sets L1, L2 with |L1| ≪ |L2| significantly (expected time O(|L1|), by looking up all elements of L1 in the hash table of L2); however, because of the added indirection, this approach performs poorly for less skewed set sizes. A new hashing-based approach is proposed in [6]: here, the elements in sets L1, L2 are mapped using a hash function h to smaller (approximate) representations h(L1), h(L2). These representations are then intersected to compute H = h(L1) ∩ h(L2). Finally, the set of all elements in the original sets that map to H via h is computed and any false positives removed. As the hashed images h(L1), h(L2) to be intersected are smaller than the original sets (using fewer bits), they can be intersected more quickly. Given sets of total size n, their intersection can be computed in expected time O((n(log₂ w)²)/w + kr), where r = |∩ᵢ Lᵢ|. Score-based pruning: In many IR engines it is possible to avoid computing full intersections by leveraging scoring functions that are monotonic in the individual term-wise scores; this makes it possible to terminate the intersection processing early using approaches such as TA [5] or document-at-a-time (DAAT) processing (e.g., [8]). However, in practice, this is often not possible, either because of the complexity of the scoring function (e.g., non-monotonic machine-learning based ranking functions) or because full intersection results are required. Our approach is based on partitioning the elements in each set into very small (≤ 8 elements) groups, for which we have fast intersection schemes.
Hence, DAAT approaches can be combined with our work by using these small groups in place of individual documents. Set intersections using multiple cores: Techniques that exploit multicore architectures to speed up set intersections are described in [2, 22]. The use of multiple cores is orthogonal to our approach in the sense that our algorithms can be parallelized for these architectures as well; however, this is beyond the scope of our paper.

3. OUR APPROACH

Notation: We are given a collection of N sets S = {L1, ..., LN}, where Li ⊆ Σ and Σ is the universe of elements in the sets; let ni = |Li| be the size of set Li. Suppose the elements in a set are ordered, and for a set L, let inf(L) and sup(L) be the minimum and maximum elements of L, respectively. We use w to denote the size (number of bits) of a word on the target processor. Throughout the paper we will use log to denote log₂. Finally, we use [w] to denote the set {1, ..., w}. Our approach can be extended to bag semantics by additionally storing element frequencies. Framework: Our task is to design data structures such that the intersection of multiple sets can be computed efficiently. We differentiate between a preprocessing stage, during which we reorganize each set and attach additional index structures, and an online processing stage, which uses the preprocessed data structures to compute intersections. An intersection query is specified via a collection of k sets L1, L2, ..., Lk (to simplify notation, we use the offsets 1, 2, ..., k to refer to the sets in a query throughout this section); our goal is to compute L1 ∩ L2 ∩ ... ∩ Lk efficiently. Note that preprocessing is typical of most nontrivial data structures used for computing set intersections; even building simple non-compressed inverted indexes requires sorting the posting lists as a preprocessing step. We require the preprocessing stage to be time/space-efficient in that it does not require more than O(ni log ni) time (necessary for sorting) and linear space O(ni).
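To make the preprocessing/online split concrete, the simplest instance is the hash-based baseline from Section 2: preprocessing builds a hash table over the larger set, and online processing probes it with every element of the smaller one. A sketch for comparison (this is the baseline, not the paper's method; the function name is illustrative):

```python
def hash_intersect(L1, L2):
    """Expected O(min(|L1|, |L2|)) probes after O(max(|L1|, |L2|)) preprocessing."""
    if len(L1) > len(L2):
        L1, L2 = L2, L1
    table = set(L2)                        # preprocessing: hash table over the larger set
    return [x for x in L1 if x in table]   # online stage: probe with the smaller set

print(hash_intersect([1, 2, 4, 9, 16, 27, 43],
                     [1, 3, 5, 9, 11, 16, 22, 32, 34, 49]))   # -> [1, 9, 16]
```

As noted above, this baseline is fast only when the set sizes are very skewed; every probe pays the cost of a hash lookup, which is what the word-level techniques below avoid.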
The size of the intersection |L1 ∩ L2| is a lower bound on the time needed to compute it. Our method leverages two key ideas to approach this lower bound: (i) The intersection of two sets in a small universe can be computed very efficiently; in particular, if the two sets are subsets of {1, 2, ..., w}, we can encode them as single machine-words and compute their intersection using a bitwise-AND. (ii) A small number of elements in a large universe can be mapped into a small universe.

[Figure 1: Algorithmic Framework — each set L1, L2 is partitioned via sorting/hashing into small groups L1^1, ..., L1^p, ... and L2^1, ..., L2^q, ...; each group is mapped by a hash function h : Σ → [w] to a word representation; the intersection L1^p ∩ L2^q of a pair of groups is computed with the help of h(L1^p) ∧ h(L2^q).]

We leverage these two ideas by first partitioning each set Li into smaller groups L_i^j, which are intersected separately. In the preprocessing stage, we map each small group into a small universe [w] = {1, 2, ..., w} using a universal hash function h and encode the image h(L_i^j) with a machine-word. Then, in the online processing stage, to compute the intersection of two small groups L1^p and L2^q, we first use a bitwise-AND operation to compute H = h(L1^p) ∧ h(L2^q), and then try to recover L1^p ∩ L2^q from H using the inverse mapping h⁻¹. The union of the L1^p ∩ L2^q's forms L1 ∩ L2. Moreover, if the intersection L1 ∩ L2 is small compared to L1 and L2 (as seen in practice), a large fraction of the small groups with overlapping ranges have an empty intersection; thus, by using the word representations H to detect these groups quickly, we can skip much unnecessary computation, resulting in significant speedups. The resulting algorithmic framework is illustrated in Figure 1. Given this overall approach, the key questions become how to form the groups, what structures to use to represent them, and how to process intersections of these small groups. We discuss these details in the following sections. All formal proofs of analytical results are deferred to the appendix.

3.1 Intersection via Fixed-Width Partitions

We first consider the case where there are only two sets L1 and L2 in the intersection query. We present a pair of preprocessing and online processing algorithms, which we use to illustrate the basic ideas of our approach. We subsequently refine and extend our techniques to k sets in Section 3.2. In the preprocessing stage, L1 and L2 are sorted, and partitioned into groups of size √w (recall w is the word width) L1^1, L1^2, ..., L1^⌈n1/√w⌉ and L2^1, L2^2, ..., L2^⌈n2/√w⌉ of equal size (except the last ones). In the online processing stage (Algorithm 1), the small groups are scanned in order. If the ranges of L1^p and L2^q overlap, we may have L1^p ∩ L2^q ≠ ∅. The intersection L1^p ∩ L2^q of each pair of overlapping groups is computed (line 8) in some iteration, and finally, the union of all these intersections is L1 ∩ L2. Since each group is scanned once, lines 2-10 repeat for O((n1 + n2)/√w) iterations. The major remaining question now becomes: how can we compute L1^p ∩ L2^q efficiently with proper preprocessing?
For this purpose, we map each group L1^p or L2^q into a small universe for fast intersection, and we leverage single-word representations to store and manipulate sets from that small universe. Single-Word Representation of Sets: We represent a set A ⊆ [w] = {1, 2, ..., w} using a single machine-word of width w by setting the y-th bit to 1 iff y ∈ A. We refer to this as the word representation w(A) of A. For two sets A and B, the bitwise-AND w(A) ∧ w(B) (computed in O(1) time) is the word representation of A ∩ B. Given a word representation w(A), all elements of A can be retrieved in linear time O(|A|), using the following well-known technique (⊕ is bitwise-XOR): (i) lowbit = (w(A) ⊕ (w(A) − 1)) ∧ w(A) is the lowest set bit of w(A); for the smallest element y in A, we have 2^(y−1) = lowbit, and y = log(lowbit) + 1 can be computed using the machine instruction NLZ (number of leading zeros) or precomputed lookup tables. (ii) Set w(A) to w(A) ⊕ lowbit and repeat (i) to scan the next smallest element, until w(A) becomes 0. In the rest of this paper, if A ⊆ [w], we use A to denote both a set and its word representation. Preprocessing Stage: The elements in a set Li are sorted as {x_i^1, x_i^2, ..., x_i^{ni}} (i.e., x_i^j < x_i^{j+1}) and Li is partitioned as follows:

L_i^1 = {x_i^1, ..., x_i^√w}, L_i^2 = {x_i^{√w+1}, ..., x_i^{2√w}}, ...    (1)
L_i^j = {x_i^{(j−1)√w+1}, x_i^{(j−1)√w+2}, ..., x_i^{j√w}}, ...           (2)

For each small group L_i^j, we compute the word representation of its image under a universal hash function h : Σ → [w], i.e., h(L_i^j) = {h(x) | x ∈ L_i^j}. In addition, for each position y ∈ [w] and each small group L_i^j, we also maintain the inverted mapping h⁻¹(y, L_i^j) = {x | x ∈ L_i^j and h(x) = y}; i.e., for each y ∈ [w], we store the elements of L_i^j with hash value y in a short list which supports ordered access. We ensure that the order of these elements is identical across the different h⁻¹(y, L_i^j)'s and L_i^j's; in this way, we can intersect these short lists using a linear merge.

EXAMPLE 3.1. (PREPROCESSING AND DATA STRUCTURES) Suppose we have two sets L1 = {1, 2, 4, 9, 16, 27, 43} and L2 = {1, 3, 5, 9, 11, 16, 22, 32, 34, 49}.
And let w = 16 (√w = 4). For simplicity, h is selected to be h(x) = x mod 16. L1 is partitioned into 2 groups: L1^1 = {1, 2, 4, 9}, L1^2 = {16, 27, 43}, and L2 is partitioned into 3 groups: L2^1 = {1, 3, 5, 9}, L2^2 = {11, 16, 22, 32}, L2^3 = {34, 49}. We precompute: h(L1^1) = {1, 2, 4, 9}, h(L1^2) = {0, 11}, h(L2^1) = {1, 3, 5, 9}, h(L2^2) = {0, 6, 11}, h(L2^3) = {1, 2}. We also preprocess the h⁻¹(y, L_i^p)'s: for example, h⁻¹(0, L1^2) = {16}, h⁻¹(0, L2^2) = {16, 32}, h⁻¹(11, L1^2) = {27, 43}, and h⁻¹(11, L2^2) = {11}.

1:  p ← 1, q ← 1, Ω ← ∅
2:  while p ≤ ⌈n1/√w⌉ and q ≤ ⌈n2/√w⌉ do
3:    if inf(L2^q) > sup(L1^p) then
4:      p ← p + 1
5:    else if inf(L1^p) > sup(L2^q) then
6:      q ← q + 1
7:    else
8:      compute L1^p ∩ L2^q using IntersectSmall
9:      Ω ← Ω ∪ (L1^p ∩ L2^q)
10:     if sup(L1^p) < sup(L2^q) then p ← p + 1 else q ← q + 1
11: Ω is the result of L1 ∩ L2
Algorithm 1: Intersection via fixed-width partitioning

Online Processing Stage: The algorithm used to intersect two sets is shown in Algorithm 1. Since the elements in Li are sorted, Algorithm 1 ensures that if the ranges of any two small groups L1^p, L2^q overlap, their intersection is computed (line 8). After scanning all such pairs, Ω must then contain the intersection of the whole sets. Now the question is: how do we compute the intersection L1^p ∩ L2^q of two small groups efficiently? For this purpose, we introduce the algorithm IntersectSmall (Algorithm 2), which: (i) first computes H = h(L1^p) ∧ h(L2^q) using a bitwise-AND; (ii) for each (bit) y ∈ H, intersects the corresponding inverted mappings using the linear merge algorithm.

IntersectSmall(L1^p, L2^q): computing L1^p ∩ L2^q
1: Compute H ← h(L1^p) ∧ h(L2^q); Γ ← ∅
2: for each y ∈ H do
3:   Γ ← Γ ∪ (h⁻¹(y, L1^p) ∩ h⁻¹(y, L2^q))
4: Γ is the result of L1^p ∩ L2^q
Algorithm 2: Computing the intersection of small groups

EXAMPLE 3.2. (ONLINE PROCESSING) Following Example 3.1, to compute L1 ∩ L2, we need to compute L1^1 ∩ L2^1, L1^2 ∩ L2^2, and L1^2 ∩ L2^3 (the pairs with overlapping ranges). For example, to compute L1^2 ∩ L2^2, we first compute h(L1^2) ∧ h(L2^2) = {0, 11}; then L1^2 ∩ L2^2 = ∪_{y=0,11} (h⁻¹(y, L1^2) ∩ h⁻¹(y, L2^2)) = {16}.
Similarly, we can compute L1^1 ∩ L2^1 = {1, 9}. Finally, we find h(L1^2) ∧ h(L2^3) = ∅, and thus L1^2 ∩ L2^3 = ∅. So, we have L1 ∩ L2 = {1, 9} ∪ {16}. Note that the word representations and inverted mappings for the L_i^j's are precomputed, and word representations can be intersected using one AND operation. So the running time of IntersectSmall is bounded by the number of pairs of elements, one from L1^p and one from L2^q, that are mapped to the same hash value. This number can be shown to be equal (in expectation) to the intersection size plus O(1) for each pair of groups. Using this, we obtain Algorithm 1's running time:
THEOREM 3.3. Algorithm 1 computes L1 ∩ L2 in expected O((n1 + n2)/√w + r) time, where r = |L1 ∩ L2|.

To achieve a better bound, we optimize the group sizes: with L1 and L2 partitioned into groups of sizes s1 = √(w·n1/n2) and s2 = √(w·n2/n1), respectively, L1 ∩ L2 can be computed in expected O(√(n1·n2/w) + r) time. A detailed analysis of the effect of group size on running times can be found in Section A.1.1. Overhead of Preprocessing: If only the bound in Theorem 3.3 is required, then to preprocess a set Li of size ni, it is obvious that O(ni log ni) time and O(ni) space suffice: we only need to partition a sorted list into small groups of size √w, and for each small group, construct the word representation and inverted mapping in linear time using the hash function h. To achieve the better bound O(√(n1·n2/w) + r), we need multiple resolutions of the partitioning of a set Li. This is because, as discussed above, the optimal group size s1 = √(w·n1/n2) of the set L1 also depends on the size n2 of the set L2 to be intersected with it. For this purpose, we partition a set Li into small groups of size 2, 4, ..., 2^j, etc. To compute L1 ∩ L2 for the given two sets, suppose s*_i is the optimal group size of Li; we then select the actual group size s_i = 2^t s.t. s*_i ≤ s_i ≤ 2s*_i, obtaining the same bound. A carefully-designed multi-resolution data structure enabling access to these groups consumes only O(ni) space for Li. We will describe and analyze this structure in Section 3.2.1.

THEOREM 3.4. To preprocess a set Li of size ni for Algorithm 1, we need O(ni log ni) time and O(ni) space (in words).

Limitations of Fixed-Width Partitions: The main limitation of the proposed approach is that it is difficult to extend to more than two sets, because the partitioning scheme we use is not well-aligned for more than two sets: for three sets, e.g., there may be more than O((n1 + n2 + n3)/√w) triples of small groups that overlap. We introduce a different partitioning scheme to address this issue in Section 3.2, which extends to k > 2 sets.
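The fixed-width scheme of this section can be sketched compactly. The code below uses the toy parameters of Example 3.1 (w = 16, groups of size √w = 4, h(x) = x mod 16); all names are illustrative, and where the paper stores the inverted mappings h⁻¹ implicitly, this sketch simply materializes them as dictionaries:

```python
W = 16                                  # toy word width w
S = 4                                   # group size sqrt(w)
h = lambda x: x % W                     # toy universal hash, as in Example 3.1

def preprocess(L):
    """Sort, cut into groups of size sqrt(w), and store each group's
    (inf, sup, hash-image word, inverted mapping y -> elements)."""
    L = sorted(L)
    groups = []
    for i in range(0, len(L), S):
        grp = L[i:i + S]
        image, inv = 0, {}
        for x in grp:
            y = h(x)
            image |= 1 << y
            inv.setdefault(y, []).append(x)
        groups.append((grp[0], grp[-1], image, inv))
    return groups

def intersect_small(gp, gq):
    """IntersectSmall: AND the hash images, then merge only the short
    lists hanging off the surviving bits."""
    out, H = [], gp[2] & gq[2]
    while H:
        low = H & -H                    # lowest set bit of H
        y = low.bit_length() - 1
        out += [x for x in gp[3][y] if x in gq[3][y]]   # linear merge
        H ^= low
    return out

def intersect(L1, L2):
    A, B = preprocess(L1), preprocess(L2)
    p = q = 0
    out = []
    while p < len(A) and q < len(B):
        if B[q][0] > A[p][1]:   p += 1  # ranges disjoint: advance
        elif A[p][0] > B[q][1]: q += 1
        else:                           # overlapping ranges: intersect groups
            out += intersect_small(A[p], B[q])
            if A[p][1] < B[q][1]: p += 1
            else:                 q += 1
    return sorted(out)

print(intersect([1, 2, 4, 9, 16, 27, 43],
                [1, 3, 5, 9, 11, 16, 22, 32, 34, 49]))   # -> [1, 9, 16]
```

On the running example, the pair (L1^2, L2^3) is skipped after a single AND, which is exactly where the speedup of the scheme comes from.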
3.2 Intersection via Randomized Partitions

In this section, we introduce an algorithm based on a randomized partitioning scheme to compute the intersection of two or more sets. The general approach is as follows: instead of fixed-width partitions, we use a hash function g to partition each set into small groups, using the most significant bits of g(x) to group an element x ∈ Σ. This reduces the number of combinations (pairs) of small groups we have to intersect, allowing us to prove bounds similar to Theorem 3.3 for computing intersections of k > 2 sets. Preprocessing Stage: Let g be a hash function g : Σ → {0, 1}* mapping an element to a bit-string (or binary number); we use g_t(x) to denote the t most significant bits of g(x). We say that for two bit-strings z1 and z2, z1 is a t-prefix of z2 iff z1 is identical to the highest t bits of z2; e.g., 0110 is a 4-prefix of 011010. To preprocess a set Li, we partition it into groups L_i^z = {x | x ∈ Li and g_t(x) = z} for all z ∈ {0, 1}^t (for some t). As before, we compute the word representation of the image of each L_i^z under another hash function h : Σ → [w], and the inverted mappings h⁻¹. Online Processing Stage: This stage is similar to our previous algorithm: to compute the intersection of two sets L1 and L2, we compute the intersections of pairs of overlapping small groups, one from each set, and finally take the union of these intersections. In general, suppose L1 is partitioned using g_{t1} : Σ → {0,1}^{t1}, and L2 is partitioned using g_{t2} : Σ → {0,1}^{t2}. Assume n1 ≤ n2 and t1 ≤ t2. We now intersect sets L1 and L2 using Algorithm 3. The major improvement of Algorithm 3 compared to Algorithm 1 is that in Algorithm 1, we need to compute L1^p ∩ L2^q whenever the ranges of L1^p and L2^q overlap; in Algorithm 3, we compute L1^{z1} ∩ L2^{z2} (also using Algorithm 2) only when z1 is a t1-prefix of z2 (this is a necessary condition for L1^{z1} ∩ L2^{z2} ≠ ∅; so Algorithm 3 is correct). This significantly reduces the number of pairs to be intersected.
1: for each z2 ∈ {0,1}^{t2} do
2:   Let z1 ∈ {0,1}^{t1} be the t1-prefix of z2
3:   Compute L1^{z1} ∩ L2^{z2} using IntersectSmall(L1^{z1}, L2^{z2})
4:   Let Ω ← Ω ∪ (L1^{z1} ∩ L2^{z2})
5: Ω is the result of L1 ∩ L2
Algorithm 3: 2-list Intersection via Randomized Partitioning

Based on the choices of the parameters t1 and t2, we can either partition L1 and L2 into the same number of small groups (yielding the bound of Theorem 3.5), or into small groups of (approximately) identical sizes (yielding Theorem 3.6).

THEOREM 3.5. Algorithm 3 computes L1 ∩ L2 in expected O(√(n1·n2/w) + r) time (r = |L1 ∩ L2|), with t1 = t2 = ⌈log √(n1·n2/w)⌉.

THEOREM 3.6. Algorithm 3 computes L1 ∩ L2 in expected O((n1 + n2)/√w + r) time (r = |L1 ∩ L2|), using t1 = ⌈log(n1/√w)⌉ and t2 = ⌈log(n2/√w)⌉.

Note that when n1 ≪ n2, Theorem 3.5 gives a better bound than Theorem 3.6. But we can extend Theorem 3.6 to k-set intersection. Extension to More Than Two Sets: Suppose we want to compute the intersection of k sets L1, ..., Lk, where ni = |Li| and n1 ≤ n2 ≤ ... ≤ nk. Li is partitioned into groups L_i^{z_i}'s using g_{ti} : Σ → {0,1}^{ti}. Note that the g_{ti}'s are generated from the same hash function g. We use ti = ⌈log(ni/√w)⌉ and proceed as in Algorithm 4. Algorithm 4 is almost identical to Algorithm 3, but is generalized to k sets: for each z ∈ {0,1}^{tk}, we pick the group identifiers zi to be the ti-prefix of z, and we only intersect groups L1^{z1}, L2^{z2}, ..., Lk^{zk}, where z1, z2, ..., zk share a prefix of size t1. Also, we extend IntersectSmall (Algorithm 2) to k groups: we first compute the intersection (bitwise-AND) of the hash images (their word representations) of the groups L_i^{zi}; and, if the result H = ∧_{i=1..k} h(L_i^{zi}) is not zero, for each (bit) y ∈ H, we intersect the corresponding inverted mappings h⁻¹(y, L_i^{zi}). Details and analysis are deferred to the appendix.

THEOREM 3.7. Using ti = ⌈log(ni/√w)⌉, Algorithm 4 computes the intersection ∩_{i=1..k} Li of k sets in expected O(n/√w + kr) time, where r = |∩_{i=1..k} Li| and n = Σ_{i=1..k} ni = Σ_{i=1..k} |Li|.
1: for each z ∈ {0,1}^{tk} (ti = ⌈log(ni/√w)⌉) do
2:   Let zi be the ti-prefix of z, for i = 1, ..., k
3:   Compute ∩_{i=1..k} L_i^{zi} using the extended IntersectSmall
4:   Let Ω ← Ω ∪ (∩_{i=1..k} L_i^{zi})
5: Ω is the result of ∩_{i=1..k} Li
Algorithm 4: k-list Intersection via Randomized Partitioning

3.2.1 A Multi-resolution Data Structure

Recall that in some algorithms (e.g., Theorem 3.5), the selection of the number of small groups used for a set Li depends on the (size of) the other sets being intersected with Li. So by naively precomputing the required structures for each possible group size, we would incur excessive space requirements. In this section, we describe a data structure that supports access to partitions of Li into 2^t groups for any possible t, using only O(ni) space. It is illustrated in Figure 2. To support the algorithms introduced so far, this structure must also allow us: (i) for each L_i^z, to retrieve the word representation h(L_i^z), and
(ii) for each y ∈ [w], to access all elements of h⁻¹(y, L_i^z) = {x | x ∈ L_i^z and h(x) = y} in time linear in its size |h⁻¹(y, L_i^z)|.

[Figure 2: Multi-resolution partition of Li — Li is partitioned by g1 : Σ → {0,1}, g2 : Σ → {0,1}², g3 : Σ → {0,1}³, ..., g_t : Σ → {0,1}^t; first(y, L_i^z) points to the first element x in L_i^z s.t. h(x) = y, and next(x) is the smallest x' ∈ Li s.t. x' follows x in g-order and h(x') = h(x).]

Multi-resolution Partitioning: For ease of explanation, suppose the elements of Σ are fixed-length bit-strings, and choose g as a random permutation of Σ. To preprocess Li, we first order all elements x ∈ Li according to g(x). Then any small group L_i^z = {x | x ∈ Li and g_t(x) = z} forms a consecutive interval in Li (the partitions of different resolutions are formed for t = 1, 2, ...). Note: in all of our algorithms, universal hash functions and random permutations are almost interchangeable (when used as g), the differences being that (i) a permutation induces a total ordering of the elements (in this data structure, this property is required), whereas hashing may result in collisions (which we can overcome by using the pre-image to break ties), and (ii) there is a slight difference in the resulting probability of, e.g., elements being grouped together (hashing results in (limited) independence, whereas permutations result in negative dependence; we account for this by using the weaker condition in our proofs). Word Representations of Hash Mappings: Now, for each small group L_i^z, we need to precompute and store the word representation h(L_i^z). Note that the total number of small groups is ni/2 + ni/4 + ... + ni/2^t + ... ≤ ni, so this requires O(ni) space. Inverted Mappings: We need to access all elements of h⁻¹(y, L_i^z) in order, for each y ∈ [w]. If we were to store these mappings for each L_i^z explicitly, this would require O(ni log ni) space.
However, by storing the inverted mappings h⁻¹(y, L_i^z) implicitly, we can do better, as follows: For each group L_i^z, since it corresponds to an interval in Li, we can store its starting and ending positions in Li, denoted by left(L_i^z) and right(L_i^z). These allow us to determine whether an element x belongs to L_i^z. Now, to enable ordered access to the inverted mappings, we define, for each x ∈ Li, next(x) to be the next element x' after x to the right s.t. h(x') = h(x) (i.e., with minimum g(x') > g(x) s.t. h(x') = h(x)). Then, for each L_i^z and each y ∈ [w], we store the position first(y, L_i^z) of the first element x' in L_i^z s.t. h(x') = y. Now, to access all elements of h⁻¹(y, L_i^z) in order, we can start from the element at first(y, L_i^z), and follow the pointers next(x), until passing the right boundary right(L_i^z). In this way, all elements in the inverted mapping are retrieved in the same order as g(x), which we require for IntersectSmall. Space Requirements: For all groups of different sizes, the total space for storing the h(L_i^z)'s, left(L_i^z)'s, right(L_i^z)'s, first(y, L_i^z)'s and next(x)'s is O(ni). So the whole multi-resolution data structure requires O(ni) space. A detailed analysis is in the appendix. When the group-size parameter ti depends only on ni (e.g., in Algorithm 4), a single resolution in preprocessing suffices, and the above multi-resolution scheme (for selecting ti online) is not necessary.

THEOREM 3.8. To preprocess a set Li of size ni for Algorithms 3-4, we need O(ni log ni) time and O(ni) space (in words).

3.3 From Theory to Practice

In this section, we describe a more practical version of our methods. This algorithm is simpler, uses significantly less memory and straightforward data structures, and, while it has worse theoretical guarantees, is faster in practice.
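The simpler variant described in this section can be sketched as follows: partition each set by a hash prefix, keep m word-sized hash images per group, and merge a pair of groups only when none of the m bitwise ANDs is zero. This is a minimal sketch, not the paper's implementation; the hash functions below are toy stand-ins and all names are illustrative:

```python
import math

W, M = 64, 2      # word width and number of hash images per group (toy values)

def h(k, x):
    """Toy stand-ins for the m universal hash functions h_1..h_m : Sigma -> [w]."""
    return (x * (2654435761 + 40503 * k) >> 4) % W

def g(x):
    """Toy stand-in for the partitioning hash g; yields a 32-bit value."""
    return (x * 2654435761) % (1 << 32)

def preprocess(L, t):
    """Group by the t most significant bits of g(x); keep m hash images per group."""
    buckets = {}
    for x in L:
        buckets.setdefault(g(x) >> (32 - t), []).append(x)
    out = {}
    for z, grp in buckets.items():
        imgs = []
        for k in range(M):
            word = 0
            for x in grp:
                word |= 1 << h(k, x)    # word representation of h_k's image
            imgs.append(word)
        out[z] = (imgs, sorted(grp))
    return out

def intersect(L1, L2):
    # t_i ~ log(n_i / sqrt(w)), clamped so tiny toy inputs still work
    t1 = max(1, math.ceil(math.log2(max(2, len(L1) / math.sqrt(W)))))
    t2 = max(1, math.ceil(math.log2(max(2, len(L2) / math.sqrt(W)))))
    if t1 > t2:
        L1, L2, t1, t2 = L2, L1, t2, t1
    G1, G2 = preprocess(L1, t1), preprocess(L2, t2)
    out = []
    for z2, (imgs2, grp2) in G2.items():
        z1 = z2 >> (t2 - t1)            # the t1-prefix of z2
        if z1 not in G1:
            continue
        imgs1, grp1 = G1[z1]
        # filter: skip the pair unless every one of the m ANDs is nonzero
        if all(i1 & i2 for i1, i2 in zip(imgs1, imgs2)):
            out += sorted(set(grp1) & set(grp2))   # linear merge in the paper
    return sorted(out)

print(intersect([1, 2, 4, 9, 16, 27, 43],
                [1, 3, 5, 9, 11, 16, 22, 32, 34, 49]))   # -> [1, 9, 16]
```

The filter can only produce false positives (a pair that survives all m ANDs but is actually disjoint), never false negatives, so the result is always exact; the m images only decide how much merging work is skipped.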
The main difference is that for each small group L_i^z, we only store the elements of L_i^z and their images under m hash functions (i.e., we do not maintain inverted mappings, trading off a complex O(1)-access for a simple scan over a short block of data). Also, we use only a single partition for each set Li. Having multiple word representations of hash images (under different hash functions) for each small group allows us to detect empty intersections of small groups with higher probability. Preprocessing Stage: As before, each set Li is partitioned into groups L_i^z using a hash function g_{ti} : Σ → {0,1}^{ti}. We will show that a good selection of ti is ⌈log(ni/√w)⌉, which depends only on the size of Li. Thus for each set Li, preprocessing with a single partitioning suffices, saving significant memory. For each group, we compute the word representations of its images under m (independent) universal hash functions h1, ..., hm : Σ → [w]. Note that we only require a small value of m in practice (e.g., m = 2). Online Processing Stage: The algorithm for computing ∩i Li we use here (Algorithm 5) is identical to Algorithm 4, with two exceptions: (1) When needed, ∩i L_i^{zi} is directly computed by a linear merge of the L_i^{zi}'s (line 4), using O(Σi |L_i^{zi}|) time. (2) We can skip the computation of ∩i L_i^{zi} if, for some hj, the bitwise-AND of the corresponding word representations hj(L_i^{zi}) is zero (line 3).

1: for each z ∈ {0,1}^{tk} (ti = ⌈log(ni/√w)⌉) do
2:   Let zi be the ti-prefix of z, for i = 1, ..., k
3:   if ∧_{i=1..k} hj(L_i^{zi}) ≠ 0 for all j = 1, ..., m then
4:     Compute ∩_{i=1..k} L_i^{zi} by a linear merge of L1^{z1}, ..., Lk^{zk}
5:     Let Ω ← Ω ∪ (∩_{i=1..k} L_i^{zi})
6: Ω is the result of ∩_{i=1..k} Li
Algorithm 5: Simple Intersection via Randomized Partitioning

Analysis: To see why Algorithm 5 is efficient, we observe that if L1^{z1} ∩ L2^{z2} = ∅, then with high probability, hj(L1^{z1}) ∧ hj(L2^{z2}) = 0 for some j = 1, ..., m. So most empty intersections can be skipped using the test in line 3. With the probability of a successful filtering (i.e.,
given ∩i L_i^{zi} = ∅, having ∧i hj(L_i^{zi}) = 0 for some hash function hj, j = 1, ..., m) bounded by Lemmas A.1 and A.3, we can derive Theorem 3.9. A detailed analysis of this probability (both theoretical and experimental) and of the overall complexity is deferred to Appendix A.5.

THEOREM 3.9. Using ti = ⌈log(ni/√w)⌉, Algorithm 5 computes ∩_{i=1..k} Li in expected O(max(n1, nk)/α(w)^m + mn/√w + kr) time (r = |∩_{i=1..k} Li|, n = Σ_{i=1..k} ni, α(w) = 1/β(w) for the β(w) used in Lemma A.3).

3.3.1 Data Structure for Storing L_i^z

In this section, we describe the simple and space-efficient data structure that we use in Algorithm 5. As stated earlier, we only need to partition Li using one hash function g_{ti}; hence we can represent each Li as an array of small groups L_i^z, ordered by z. For each small group, we store the information associated with it in the structure shown in Figure 3. The first word in this structure stores z = g_{ti}(L_i^z). The second word stores the structure's length len. The following m words represent the hash images h1(L_i^z), ..., hm(L_i^z) of L_i^z. Finally, we store the elements of L_i^z as an array in the remaining part. We need ⌈ni/√w⌉ such blocks for