Fast Set Intersection in Memory


Bolin Ding
University of Illinois at Urbana-Champaign
201 N. Goodwin Avenue, Urbana, IL 61801, USA

Arnd Christian König
Microsoft Research
One Microsoft Way, Redmond, WA 98052, USA

ABSTRACT

Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear-space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given $k$ (preprocessed) sets with $n$ elements in total, we show how to compute their intersection in expected time $O(n/\sqrt{w} + kr)$, where $r$ is the intersection size and $w$ is the number of bits in a machine word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state-of-the-art techniques for both synthetic and real data sets and workloads.

1. INTRODUCTION

Fast processing of set intersections is a key operation in many query processing tasks in the context of databases and information retrieval. For example, in databases, set intersections are used in various forms of data mining, text analytics, and the evaluation of conjunctive predicates. They are also the key operations in enterprise and web search.

Many of these applications are interactive, meaning that the latency with which query results are displayed is a key concern. It has been shown in the context of search that query latency is critical to user satisfaction, with increases in latency directly leading to fewer search queries being issued and higher rates of query abandonment [, 7]. As a consequence, significant portions of the sets to be intersected are often cached in main memory. This paper studies the performance of set intersection algorithms for main-memory resident data. Note that these techniques are also relevant in the context of large disk-based (inverted) indexes, when large fractions of these reside in a main-memory cache.
There has been considerable study of set intersection algorithms in information retrieval (e.g., [2, 4, ]). Most of these papers assume that the underlying data structure is an inverted index [23]. Much of this work (e.g., [2, 4]) focuses on adaptive algorithms which use the number of comparisons as the measure of overhead. For in-memory data, additional structures which encode additional skipping steps [8], tree-based structures [7], or hash-based algorithms become possible, which often outperform inverted indexes; e.g., using hash-based dictionaries, intersecting two sets $L_1, L_2$ requires expected time $O(\min(|L_1|, |L_2|))$, which is a factor of $\Theta(\log(1 + \max(|L_1|/|L_2|, |L_2|/|L_1|)))$ better than the best possible worst-case performance of comparison-based algorithms [6].

In this work, we propose new set intersection algorithms aimed at fast performance. These outperform the competing techniques for most inputs and are also robust in that, for inputs where they are not optimal, they are close to the best-performing algorithm. The tradeoff for this gain is a slight increase in the size of the data structures when compared to an inverted index; however, in user-facing scenarios where latency is crucial, this tradeoff is often acceptable.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 37th International Conference on Very Large Data Bases, August 29th - September 3rd 2011, Seattle, Washington. Proceedings of the VLDB Endowment, Vol. 4, No. 4. Copyright 2011 VLDB Endowment.
1.1 Contributions

Our approach leverages two key observations:

(a) If $w$ is the size (in bits) of a machine word, we can encode a set from a universe of $w$ elements in a single machine word, allowing for very fast intersections.

(b) For the data distributions seen in many real-life examples (in particular search applications), the size of intersections is typically much smaller than the smallest set being intersected.

To illustrate the second observation, we analyzed the K most frequent queries issued against the Bing Shopping portal. For 94% of all queries, the size of the full intersection was at least one order of magnitude smaller than the document frequency of the least frequent keyword; for 76% of the queries the difference was two orders of magnitude.

By exploiting these two observations, we make the following contributions.

(i) We introduce linear-space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. Given $k$ sets with $n$ elements in total, these data structures allow us to compute their intersection in expected time $O(n/\sqrt{w} + kr)$, where $r$ is the size of the intersection and $w$ is the number of bits in a machine word; when the size of the intersection is an order of magnitude (or more) smaller than the size of the smallest set being intersected, our approach yields significant improvements in execution time over previous approaches. To the best of our knowledge, the best asymptotic bound for fast set intersection is achieved by the $O\big((n \log^2 w)/w + kr\big)$ algorithm of [6]. However, note that the bound relies on a large value of $w$; in practice, $w$ is small (and constant), and $w < 2^{16}$ bits implies $1/\sqrt{w} < (\log^2 w)/w$. More importantly, [6] requires complex bit manipulation, making it slow in practice, which we demonstrate empirically in Section 4.

(ii) We describe a much simpler algorithm that computes the intersection in expected $O(n/\alpha^m + mn/\sqrt{w} + kr\sqrt{w})$ time, where $\alpha$ is a constant determined by $w$, and $m$ is a parameter.
This algorithm has weaker guarantees in theory, but performs better in practice, and gives significant improvements over the various data structures typically used, while being very simple to implement.
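Observation (a) can be made concrete with a minimal sketch. It assumes a 64-bit word, a universe numbered from 0 to 63 (rather than from 1), and Python integers standing in for machine words; the helper names `word_repr` and `elements_of` are hypothetical and not taken from the paper.

```python
W = 64  # assumed machine-word width in bits


def word_repr(elements):
    """Encode a subset of {0, ..., W-1} as a single integer bit mask."""
    word = 0
    for y in elements:
        if not 0 <= y < W:
            raise ValueError("element outside the small universe")
        word |= 1 << y
    return word


def elements_of(word):
    """Yield the members of an encoded set, smallest first."""
    while word:
        lowbit = ((word - 1) ^ word) & word  # isolates the lowest 1-bit
        yield lowbit.bit_length() - 1        # position of that bit
        word ^= lowbit                       # clear it and continue


a = word_repr([1, 5, 9, 40])
b = word_repr([5, 7, 40, 63])
common = list(elements_of(a & b))  # one bitwise AND intersects both sets
```

The intersection itself costs a single AND; only decoding the result takes time proportional to the number of common elements.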
2. BACKGROUND AND RELATED WORK

Algorithms based on Ordered Lists: Most work on set intersection focuses on ordered lists as the underlying data structure, in particular algorithms using inverted indexes, which have become the standard data structure in information retrieval. Here, documents are identified via a document ID, and for each term $t$, the inverted index stores a sorted list of all document IDs containing $t$. Using this representation, two sets $L_1, L_2$ of similar sizes (i.e., $|L_1| \approx |L_2|$) can be intersected efficiently using a linear merge by scanning both lists in parallel, requiring $O(|L_1| + |L_2|)$ operations (the merge step in merge sort). This approach is wasteful when set sizes differ significantly or only small fractions of the sets intersect. For very different set sizes, algorithms have been proposed that exploit this asymmetry, requiring at most $\log\binom{|L_1| + |L_2|}{|L_1|} + |L_1|$ comparisons (for $|L_1| < |L_2|$) [6].

To improve the performance further, there has recently been significant work on so-called adaptive set-intersection algorithms [2, 4, 3,, 2, 5]. These algorithms use the total number of comparisons as the measure of the algorithm's complexity and aim to use a number of comparisons as close as possible to the minimum number of comparisons ideally required to establish the intersection. However, the resulting reduction in the number of comparisons does not necessarily result in performance improvements in practice: for example, in [2], binary-search-based algorithms outperform a parallel scan only for a limited range of size ratios between $L_1$ and $L_2$, even though several times fewer comparisons are needed.

Hierarchical Representations: There are various algorithms for set intersections based on variants of balanced trees (e.g., [9], treaps [7], and skip lists [8]), computing the intersection of (preprocessed) sets $L_1, L_2$ in $O(|L_1| \log(|L_2|/|L_1|))$ operations (for $|L_1| < |L_2|$).
However, while some form of skipping is commonly used as part of algorithms based on inverted indexes, skip lists (or trees) are typically not used in the scenarios outlined above (with static set data) due to the required space overhead. A novel and compact two-level representation of posting lists aimed at fast intersections in main memory was proposed in [9].

Algorithms based on Hashing: Using a hash-based representation of sets can speed up the intersection of sets $L_1, L_2$ with $|L_1| \ll |L_2|$ significantly (expected time $O(|L_1|)$, by looking up all elements of $L_1$ in the hash table of $L_2$); however, because of the added indirection, this approach performs poorly for less skewed set sizes. A new hashing-based approach is proposed in [6]: here, the elements in sets $L_1, L_2$ are mapped using a hash function $h$ to smaller (approximate) representations $h(L_1), h(L_2)$. These representations are then intersected to compute $H = h(L_1) \cap h(L_2)$. Finally, the set of all elements in the original sets that map to $H$ via $h$ is computed and any false positives are removed. As the hashed images $h(L_1), h(L_2)$ to be intersected are smaller than the original sets (using fewer bits), they can be intersected more quickly. Given $k$ sets of total size $n$, their intersection can be computed in expected time $O\big((n \log^2 w)/w + kr\big)$, where $r = |\bigcap_i L_i|$.

Score-based pruning: In many IR engines it is possible to avoid computing full intersections by leveraging scoring functions that are monotonic in the individual term-wise scores; this makes it possible to terminate the intersection processing early using approaches such as TA [5] or document-at-a-time (DAAT) processing (e.g., [8]). However, in practice, this is often not possible, either because of the complexity of the scoring function (e.g., non-monotonic machine-learning-based ranking functions) or because full intersection results are required. Our approach is based on partitioning the elements in each set into very small groups (roughly 8 elements), for which we have fast intersection schemes.
Hence, DAAT approaches can be combined with our work by using these small groups in place of individual documents.

Set intersections using multiple cores: Techniques that exploit multi-core architectures to speed up set intersections are described in [2, 22]. The use of multiple cores is orthogonal to our approach in the sense that our algorithms can be parallelized for these architectures as well; however, this is beyond the scope of our paper.

3. OUR APPROACH

Notation: We are given a collection of $N$ sets $S = \{L_1, \ldots, L_N\}$, where $L_i \subseteq \Sigma$ and $\Sigma$ is the universe of elements in the sets; let $n_i = |L_i|$ be the size of set $L_i$. Suppose elements in a set are ordered, and for a set $L$, let $\inf(L)$ and $\sup(L)$ be the minimum and maximum elements of $L$, respectively. We use $w$ to denote the size (number of bits) of a word on the target processor. Throughout the paper we use $\log$ to denote $\log_2$. Finally, we use $[w]$ to denote the set $\{1, \ldots, w\}$. Our approach can be extended to bag semantics by additionally storing element frequency.

Framework: Our task is to design data structures such that the intersection of multiple sets can be computed efficiently. We differentiate between a preprocessing stage, during which we reorganize each set and attach additional index structures, and an online processing stage, which uses the preprocessed data structures to compute intersections. An intersection query is specified via a collection of $k$ sets $L_1, L_2, \ldots, L_k$ (to simplify notation, we use the offsets $1, 2, \ldots, k$ to refer to the sets in a query throughout this section); our goal is to compute $L_1 \cap L_2 \cap \ldots \cap L_k$ efficiently. Note that preprocessing is typical of most non-trivial data structures used for computing set intersections; even building simple non-compressed inverted indexes requires sorting the posting lists as a preprocessing step. We require the preprocessing stage to be time- and space-efficient in that it does not require more than $O(n_i \log n_i)$ time (necessary for sorting) and linear space $O(n_i)$.
The size of the intersection $|L_1 \cap L_2|$ is a lower bound on the time needed to compute the intersection. Our method leverages two key ideas to approach this lower bound:

(i) The intersection of two sets in a small universe can be computed very efficiently; in particular, if the two sets are subsets of $\{1, 2, \ldots, w\}$, we can encode them as single machine words and compute their intersection using a bitwise AND.

(ii) A small number of elements in a large universe can be mapped into a small universe.

[Figure 1: Algorithmic Framework. Each set is partitioned via sorting/hashing into small groups; each group is mapped into $[w]$ by $h: \Sigma \to [w]$; $L_1^p \cap L_2^q$ is computed with the help of $h(L_1^p) \cap h(L_2^q)$.]

We leverage these two ideas by first partitioning each set $L_i$ into smaller groups $L_i^j$, which are intersected separately. In the preprocessing stage, we map each small group into a small universe $[w] = \{1, 2, \ldots, w\}$ using a universal hash function $h$ and encode the image $h(L_i^j)$ with a machine word. Then, in the online processing stage, to compute the intersection of two small groups $L_1^p$ and $L_2^q$, we first use a bitwise AND operation to compute $H = h(L_1^p) \cap h(L_2^q)$, and then try to recover $L_1^p \cap L_2^q$ using the inverse mapping $h^{-1}$ from $H$. The union of the $L_1^p \cap L_2^q$'s forms $L_1 \cap L_2$. Moreover, if the intersection $L_1 \cap L_2$ is small compared to $L_1$ and $L_2$ (as seen in practice), a large fraction of the small groups with overlapping ranges has an empty intersection; thus, by using the word representations of $H$ to detect these groups quickly, we can skip much unnecessary computation, resulting in significant speed-up. The resulting algorithmic framework is illustrated in Figure 1.

Given this overall approach, the key questions become how to form groups, what structures to use to represent them, and how to process intersections of these small groups. We discuss these details in the following sections. All the formal proofs of analytical results are deferred to the appendix.

3.1 Intersection via Fixed-Width Partitions

We first consider the case when there are only two sets $L_1$ and $L_2$ in the intersection query. We present a pair of preprocessing and online processing algorithms, which we use to illustrate the basic ideas of our algorithms. We subsequently refine and extend our techniques to $k$ sets in Section 3.2.

In the preprocessing stage, $L_1$ and $L_2$ are sorted and partitioned into groups (recall that $w$ is the word width) $L_1^1, L_1^2, \ldots, L_1^{\lceil n_1/\sqrt{w} \rceil}$ and $L_2^1, L_2^2, \ldots, L_2^{\lceil n_2/\sqrt{w} \rceil}$ of equal size $\sqrt{w}$ (except the last ones). In the online processing stage (Algorithm 1), the small groups are scanned in order. If the ranges of $L_1^p$ and $L_2^q$ overlap, we may have $L_1^p \cap L_2^q \neq \emptyset$. The intersection $L_1^p \cap L_2^q$ of each pair of overlapping groups is computed (line 8) in some iteration. Finally, the union of all these intersections is $L_1 \cap L_2$. Since each group is scanned once, line 2 repeats for $O((n_1 + n_2)/\sqrt{w})$ iterations.

The major remaining question now becomes how to compute $L_1^p \cap L_2^q$ efficiently with proper preprocessing.
For this purpose, we map each group $L_1^p$ or $L_2^q$ into a small universe for fast intersection, and we leverage single-word representations to store and manipulate sets from a small universe.

Single-Word Representation of Sets: We represent a set $A \subseteq [w] = \{1, 2, \ldots, w\}$ using a single machine word of width $w$ by setting the $y$-th bit to 1 iff $y \in A$. We refer to this as the word representation $w(A)$ of $A$. For two sets $A$ and $B$, the bitwise AND $w(A) \wedge w(B)$ (computed in $O(1)$ time) is the word representation of $A \cap B$. Given a word representation $w(A)$, all the elements of $A$ can be retrieved in linear time $O(|A|)$ (see the footnote below). In the rest of this paper, if $A \subseteq [w]$, we use $A$ to denote both a set and its word representation.

Footnote: We use the following well-known technique ($\oplus$ is bitwise XOR): (i) $lowbit = ((w(A) - 1) \oplus w(A)) \wedge w(A)$ is the lowest 1-bit of $w(A)$. For the smallest element $y$ in $A$, we have $2^y = lowbit$; $y = \log(lowbit)$ can be computed using the machine instruction NLZ (number of leading zeros) or precomputed lookup tables. (ii) Set $w(A)$ to $w(A) \oplus lowbit$ and repeat (i) to scan the next smallest element until $w(A)$ becomes 0.

Preprocessing Stage: Elements in a set $L_i$ are sorted as $\{x_i^1, x_i^2, \ldots, x_i^{n_i}\}$ (i.e., $x_i^j < x_i^{j+1}$) and $L_i$ is partitioned as follows:

$L_i^1 = \{x_i^1, \ldots, x_i^{\sqrt{w}}\},\quad L_i^2 = \{x_i^{\sqrt{w}+1}, \ldots, x_i^{2\sqrt{w}}\},\ \ldots$ (1)

$L_i^j = \{x_i^{(j-1)\sqrt{w}+1}, x_i^{(j-1)\sqrt{w}+2}, \ldots, x_i^{j\sqrt{w}}\},\ \ldots$ (2)

For each small group $L_i^j$, we compute the word representation of its image under a universal hash function $h: \Sigma \to [w]$, i.e., $h(L_i^j) = \{h(x) \mid x \in L_i^j\}$. In addition, for each position $y \in [w]$ and each small group $L_i^j$, we also maintain the inverted mapping $h^{-1}(y, L_i^j) = \{x \mid x \in L_i^j \text{ and } h(x) = y\}$, i.e., for each $y \in [w]$ we store the elements in $L_i^j$ with hash value $y$ in a short list which supports ordered access. We ensure that the order of these elements is identical across different $h^{-1}(y, L_i^j)$'s and $L_i^j$'s; in this way, we can intersect these short lists using a linear merge.

EXAMPLE 3.1. (PREPROCESSING AND DATA STRUCTURES) Suppose we have two sets $L_1 = \{1001, 1002, 1004, 1009, 1016, 1027, 1043\}$ and $L_2 = \{1001, 1003, 1005, 1009, 1011, 1016, 1022, 1032, 1034, 1049\}$.
And, let $w = 16$ ($\sqrt{w} = 4$). For simplicity, $h$ is selected to be $h(x) = (x - 1000) \bmod 16$. $L_1$ is partitioned into 2 groups: $L_1^1 = \{1001, 1002, 1004, 1009\}$, $L_1^2 = \{1016, 1027, 1043\}$, and $L_2$ is partitioned into 3 groups: $L_2^1 = \{1001, 1003, 1005, 1009\}$, $L_2^2 = \{1011, 1016, 1022, 1032\}$, $L_2^3 = \{1034, 1049\}$. We precompute: $h(L_1^1) = \{1, 2, 4, 9\}$, $h(L_1^2) = \{0, 11\}$, $h(L_2^1) = \{1, 3, 5, 9\}$, $h(L_2^2) = \{0, 6, 11\}$, $h(L_2^3) = \{1, 2\}$. We also preprocess the $h^{-1}(y, L_i^p)$'s: for example, $h^{-1}(0, L_1^2) = \{1016\}$, $h^{-1}(0, L_2^2) = \{1016, 1032\}$, $h^{-1}(11, L_1^2) = \{1027, 1043\}$, and $h^{-1}(11, L_2^2) = \{1011\}$.

Algorithm 1: Intersection via fixed-width partitioning
 1: $p \leftarrow 1$, $q \leftarrow 1$, $\Delta \leftarrow \emptyset$
 2: while $p \le \lceil n_1/\sqrt{w} \rceil$ and $q \le \lceil n_2/\sqrt{w} \rceil$ do
 3:   if $\inf(L_2^q) > \sup(L_1^p)$ then
 4:     $p \leftarrow p + 1$
 5:   else if $\inf(L_1^p) > \sup(L_2^q)$ then
 6:     $q \leftarrow q + 1$
 7:   else
 8:     compute $(L_1^p \cap L_2^q)$ using IntersectSmall
 9:     $\Delta \leftarrow \Delta \cup (L_1^p \cap L_2^q)$
10:     if $\sup(L_1^p) < \sup(L_2^q)$ then $p \leftarrow p + 1$ else $q \leftarrow q + 1$
11: $\Delta$ is the result of $L_1 \cap L_2$

Online Processing Stage: The algorithm used to intersect two sets is shown in Algorithm 1. Since elements in $L_i$ are sorted, Algorithm 1 ensures that if the ranges of any two small groups $L_1^p, L_2^q$ overlap, their intersection is computed (line 8). After scanning all such pairs, $\Delta$ must then contain the intersection of the whole sets.

Now the question is how to compute the intersection of two small groups $L_1^p \cap L_2^q$ efficiently. For this purpose, we introduce the algorithm IntersectSmall (Algorithm 2), which (i) first computes $H = h(L_1^p) \cap h(L_2^q)$ using a bitwise AND; and (ii) for each (1-bit) $y \in H$, intersects the corresponding inverted mappings using the linear merge algorithm.

Algorithm 2: Computing the intersection of small groups
IntersectSmall($L_1^p$, $L_2^q$): computing $L_1^p \cap L_2^q$
 1: Compute $H \leftarrow h(L_1^p) \cap h(L_2^q)$
 2: for each $y \in H$ do
 3:   $\Gamma \leftarrow \Gamma \cup (h^{-1}(y, L_1^p) \cap h^{-1}(y, L_2^q))$
 4: $\Gamma$ is the result of $L_1^p \cap L_2^q$

EXAMPLE 3.2. (ONLINE PROCESSING) Following Example 3.1, to compute $L_1 \cap L_2$, we need to compute $L_1^1 \cap L_2^1$, $L_1^2 \cap L_2^2$, and $L_1^2 \cap L_2^3$ (the pairs with overlapping ranges): for example, for computing $L_1^2 \cap L_2^2$, we first compute $h(L_1^2) \cap h(L_2^2) = \{0, 11\}$; then $L_1^2 \cap L_2^2 = \bigcup_{y = 0, 11} \big(h^{-1}(y, L_1^2) \cap h^{-1}(y, L_2^2)\big) = \{1016\}$.
Similarly, we can compute $L_1^1 \cap L_2^1 = \{1001, 1009\}$. Finally, we find $h(L_1^2) \cap h(L_2^3) = \emptyset$, and thus $L_1^2 \cap L_2^3 = \emptyset$. So, we have $L_1 \cap L_2 = \{1001, 1009\} \cup \{1016\}$.

Note that word representations and inverted mappings for $L_i^j$ are precomputed, and word representations can be intersected using one operation. So the running time of IntersectSmall is bounded by the number of pairs of elements, one from $L_1^p$ and one from $L_2^q$, that are mapped to the same hash value. This number can be shown to be equal (in expectation) to the intersection size plus $O(1)$ for each group $L_i^j$. Using this, we obtain Algorithm 1's running time:
THEOREM 3.3. Algorithm 1 computes $L_1 \cap L_2$ in expected $O\big(\frac{n_1 + n_2}{\sqrt{w}} + r\big)$ time, where $r = |L_1 \cap L_2|$.

To achieve a better bound, we optimize the group sizes: with $L_1$ and $L_2$ partitioned into groups of sizes $s_1^* = \sqrt{w n_1/n_2}$ and $s_2^* = \sqrt{w n_2/n_1}$, respectively, $L_1 \cap L_2$ can be computed in expected $O(\sqrt{n_1 n_2/w} + r)$ time. A detailed analysis of the effect of group size on running times can be found in the appendix.

Overhead of Preprocessing: If only the bound in Theorem 3.3 is required, then to preprocess a set $L_i$ of size $n_i$, $O(n_i \log n_i)$ time and $O(n_i)$ space suffice: we only need to partition a sorted list into small groups of size $\sqrt{w}$ and, for each small group, construct the word representation and inverted mapping in linear time using the hash function $h$.

To achieve the better bound $O(\sqrt{n_1 n_2/w} + r)$, we need multiple resolutions of the partitioning of a set $L_i$. This is because, as discussed above, the optimal group size $s_1^* = \sqrt{w n_1/n_2}$ of the set $L_1$ also depends on the size $n_2$ of the set $L_2$ to be intersected with it. For this purpose, we partition a set $L_i$ into small groups of size $2, 4, \ldots, 2^j$, etc. To compute $L_1 \cap L_2$ for the given two sets, suppose $s_i^*$ is the optimal group size of $L_i$; we then select the actual group size $s_i^{**} = 2^t$ such that $s_i^* \le s_i^{**} \le 2 s_i^*$, obtaining the same bound. A carefully designed multi-resolution data structure enabling access to these groups consumes only $O(n_i)$ space for $L_i$. We describe and analyze this structure in Section 3.2.1.

THEOREM 3.4. To preprocess a set $L_i$ of size $n_i$ for Algorithm 1, we need $O(n_i \log n_i)$ time and $O(n_i)$ space (in words).

Limitations of Fixed-Width Partitions: The main limitation of the proposed approach is that it is difficult to extend to more than two sets, because the partitioning scheme we use is not well aligned for more than two sets: for three sets, e.g., there may be more than $O((n_1 + n_2 + n_3)/\sqrt{w})$ triples of small groups that overlap. We introduce a different partitioning scheme to address this issue in Section 3.2, which extends to $k > 2$ sets.
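The fixed-width scheme of this section (Algorithm 1 together with IntersectSmall) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the word width is set to 16, hash values are numbered from 0, `h(x) = x mod 16` is a toy stand-in for a universal hash function, and the input lists and all function names are chosen here for illustration.

```python
import math

W = 16                  # assumed word width, so each group holds sqrt(W) = 4 elements
GROUP = math.isqrt(W)


def h(x):
    """Toy stand-in for a universal hash function into {0, ..., W-1}."""
    return x % W


def preprocess(sorted_elems):
    """Cut a sorted list into fixed-width groups; keep range, word image, inverted map."""
    groups = []
    for start in range(0, len(sorted_elems), GROUP):
        chunk = sorted_elems[start:start + GROUP]
        image, inv = 0, {}
        for x in chunk:
            image |= 1 << h(x)
            inv.setdefault(h(x), []).append(x)  # each short list stays sorted
        groups.append({"lo": chunk[0], "hi": chunk[-1], "image": image, "inv": inv})
    return groups


def merge_sorted(a, b):
    """Linear merge of two sorted lists, returning their common elements."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            out.append(a[i])
            i += 1
            j += 1
    return out


def intersect_small(g1, g2):
    """IntersectSmall: AND the word images, then merge only the colliding short lists."""
    out = []
    common = g1["image"] & g2["image"]
    while common:
        lowbit = ((common - 1) ^ common) & common
        y = lowbit.bit_length() - 1
        out.extend(merge_sorted(g1["inv"][y], g2["inv"][y]))
        common ^= lowbit
    return out


def intersect(groups1, groups2):
    """Algorithm 1: scan both group sequences, intersecting groups with overlapping ranges."""
    p = q = 0
    out = []
    while p < len(groups1) and q < len(groups2):
        g1, g2 = groups1[p], groups2[q]
        if g2["lo"] > g1["hi"]:
            p += 1
        elif g1["lo"] > g2["hi"]:
            q += 1
        else:
            out.extend(intersect_small(g1, g2))
            if g1["hi"] < g2["hi"]:
                p += 1
            else:
                q += 1
    return sorted(out)


L1 = [1, 2, 4, 9, 16, 27, 43]                  # illustrative input
L2 = [1, 3, 5, 9, 11, 16, 22, 32, 34, 49]      # illustrative input
result = intersect(preprocess(L1), preprocess(L2))
```

With these inputs, the last pair of overlapping groups has disjoint word images, so its inverted mappings are never touched; that skipped work is where the speed-up over a plain merge comes from.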
3.2 Intersection via Randomized Partitions

In this section, we introduce an algorithm based on a randomized partitioning scheme to compute the intersection of two or more sets. The general approach is as follows: instead of fixed-width partitions, we use a hash function $g$ to partition each set into small groups, using the most significant bits of $g(x)$ to group an element $x \in \Sigma$. This reduces the number of combinations (pairs) of small groups we have to intersect, allowing us to prove bounds similar to Theorem 3.3 for computing intersections of $k > 2$ sets.

Preprocessing Stage: Let $g$ be a hash function $g: \Sigma \to \{0, 1\}^w$ mapping an element to a bit string (or binary number); we use $g_t(x)$ to denote the $t$ most significant bits of $g(x)$. We say that for two bit strings $z_1$ and $z_2$, $z_1$ is a $t_1$-prefix of $z_2$ iff $z_1$ is identical to the highest $t_1$ bits in $z_2$; e.g., 1010 is a 4-prefix of 101011. To preprocess a set $L_i$, we partition it into groups $L_i^z = \{x \mid x \in L_i \text{ and } g_t(x) = z\}$ for all $z \in \{0, 1\}^t$ (for some $t$). As before, we compute the word representation of the image of each $L_i^z$ under another hash function $h: \Sigma \to [w]$, and the inverted mappings $h^{-1}$.

Online Processing Stage: This stage is similar to our previous algorithm: to compute the intersection of two sets $L_1$ and $L_2$, we compute the intersections of pairs of overlapping small groups, one from each set, and finally take the union of these intersections.

In general, suppose $L_1$ is partitioned using $g_{t_1}: \Sigma \to \{0, 1\}^{t_1}$, and $L_2$ is partitioned using $g_{t_2}: \Sigma \to \{0, 1\}^{t_2}$. Assume $n_1 \le n_2$ and $t_1 \le t_2$. We now intersect sets $L_1$ and $L_2$ using Algorithm 3. The major improvement of Algorithm 3 over Algorithm 1 is that in Algorithm 1 we need to compute $L_1^p \cap L_2^q$ whenever the ranges of $L_1^p$ and $L_2^q$ overlap; in Algorithm 3, we compute $L_1^{z_1} \cap L_2^{z_2}$ (also using Algorithm 2) only when $z_1$ is a $t_1$-prefix of $z_2$ (this is a necessary condition for $L_1^{z_1} \cap L_2^{z_2} \neq \emptyset$, so Algorithm 3 is correct). This significantly reduces the number of pairs to be intersected.
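The prefix condition can be illustrated with a short sketch for two sets. It assumes 32-bit values for $g$, realized here by multiplication with a random odd constant (a bijection on 32-bit integers, used as a stand-in for the random permutation), and it replaces IntersectSmall by Python's built-in set intersection; `make_g`, `partition`, and `intersect_two` are hypothetical names.

```python
import random

G_BITS = 32  # assumed bit length of g(x)


def make_g(seed=0):
    """Stand-in for g: multiplication by a random odd constant modulo 2^32."""
    multiplier = random.Random(seed).randrange(1, 2 ** G_BITS, 2)
    return lambda x: (multiplier * x) % 2 ** G_BITS


def partition(elems, g, t):
    """Group elements by the t most significant bits of g(x)."""
    groups = {}
    for x in elems:
        groups.setdefault(g(x) >> (G_BITS - t), set()).add(x)
    return groups


def intersect_two(l1, l2, t1, t2, g):
    """Pair every group of l2 with the single l1 group named by its t1-prefix."""
    assert t1 <= t2
    groups1, groups2 = partition(l1, g, t1), partition(l2, g, t2)
    out = set()
    for z2, group2 in groups2.items():
        z1 = z2 >> (t2 - t1)           # the t1-prefix of z2
        group1 = groups1.get(z1)
        if group1:
            out |= group1 & group2     # stand-in for IntersectSmall
    return out
```

Each group of the finer partition is matched against exactly one group of the coarser partition, so the number of group pairs equals the number of non-empty groups of $L_2$, independent of how element ranges interleave.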
Algorithm 3: 2-list Intersection via Randomized Partitioning
 1: for each $z_2 \in \{0, 1\}^{t_2}$ do
 2:   Let $z_1 \in \{0, 1\}^{t_1}$ be the $t_1$-prefix of $z_2$
 3:   Compute $L_1^{z_1} \cap L_2^{z_2}$ using IntersectSmall($L_1^{z_1}$, $L_2^{z_2}$)
 4:   Let $\Delta \leftarrow \Delta \cup (L_1^{z_1} \cap L_2^{z_2})$
 5: $\Delta$ is the result of $L_1 \cap L_2$

Based on the choices of parameters $t_1$ and $t_2$, we can either partition $L_1$ and $L_2$ into the same number of small groups (yielding the bound of Theorem 3.5), or into small groups of (approximately) identical sizes (yielding Theorem 3.6).

THEOREM 3.5. Algorithm 3 computes $L_1 \cap L_2$ in expected $O\big(\sqrt{n_1 n_2/w} + r\big)$ time ($r = |L_1 \cap L_2|$), with $t_1 = t_2 = \lceil \log \sqrt{n_1 n_2/w} \rceil$.

THEOREM 3.6. Algorithm 3 computes $L_1 \cap L_2$ in expected $O\big(\frac{n_1 + n_2}{\sqrt{w}} + r\big)$ time ($r = |L_1 \cap L_2|$), using $t_1 = \lceil \log(n_1/\sqrt{w}) \rceil$ and $t_2 = \lceil \log(n_2/\sqrt{w}) \rceil$.

Note that when $n_1 \ll n_2$, Theorem 3.5 has a better bound than Theorem 3.6. But we can extend Theorem 3.6 to $k$-set intersection.

Extension to More Than Two Sets: Suppose we want to compute the intersection of $k$ sets $L_1, \ldots, L_k$, where $n_i = |L_i|$ and $n_1 \le n_2 \le \ldots \le n_k$. $L_i$ is partitioned into groups $L_i^z$ using $g_{t_i}: \Sigma \to \{0, 1\}^{t_i}$. Note that the $g_{t_i}$'s are generated from the same hash function $g$. We use $t_i = \lceil \log(n_i/\sqrt{w}) \rceil$ and proceed as in Algorithm 4. Algorithm 4 is almost identical to Algorithm 3, but is generalized to $k$ sets: for each $z_k \in \{0, 1\}^{t_k}$, we pick the group identifiers $z_i$ to be the $t_i$-prefix of $z_k$, and we only intersect groups $L_1^{z_1}, L_2^{z_2}, \ldots, L_k^{z_k}$, where $z_1, z_2, \ldots, z_k$ share a prefix of size $t_1$. Also, we extend IntersectSmall (Algorithm 2) to $k$ groups: we first compute the intersection (bitwise AND) of the hash images (their word representations) of the $k$ groups $L_i^{z_i}$; and, if the result $H = \bigcap_{i=1}^{k} h(L_i^{z_i})$ is not zero, for each (1-bit) $y \in H$, we intersect the corresponding inverted mappings $h^{-1}(y, L_i^{z_i})$. Details and analysis are deferred to the appendix.

THEOREM 3.7. Using $t_i = \lceil \log(n_i/\sqrt{w}) \rceil$, Algorithm 4 computes the intersection $\bigcap_{i=1}^{k} L_i$ of $k$ sets in expected $O(n/\sqrt{w} + kr)$ time, where $r = |\bigcap_{i=1}^{k} L_i|$ and $n = \sum_{i=1}^{k} n_i = \sum_{i=1}^{k} |L_i|$.
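A minimal sketch of the $k$-set extension follows, with the extended IntersectSmall inlined. It assumes $w = 64$, 32-bit values for $g$, hash values numbered from 0, and two fixed multiplicative hashes as stand-ins for the random $g$ and the universal $h$; the constants and the names `num_bits`, `preprocess`, and `intersect_k` are choices made here, not the paper's.

```python
import math
from functools import reduce

W = 64        # assumed word width
G_BITS = 32   # assumed bit length of g(x)


def g(x):
    """Stand-in for the random g: an odd multiplier modulo 2^32."""
    return (2654435761 * x) % 2 ** G_BITS


def h(x):
    """Toy stand-in for a universal hash into {0, ..., W-1}."""
    return ((0x9E3779B1 * x + 12345) % 2 ** 32) >> 26


def num_bits(n):
    """t_i = ceil(log2(n_i / sqrt(w))), clamped at 0 for very small sets."""
    return max(0, math.ceil(math.log2(n / math.sqrt(W)))) if n else 0


def preprocess(elems):
    """Partition by the top t bits of g(x); keep a word image and inverted map per group."""
    t = num_bits(len(elems))
    groups = {}
    for x in elems:
        group = groups.setdefault(g(x) >> (G_BITS - t), {"image": 0, "inv": {}})
        group["image"] |= 1 << h(x)
        group["inv"].setdefault(h(x), set()).add(x)
    return t, groups


def intersect_k(sets):
    """For each finest group, pick the prefix-matching group in every other set."""
    pre = sorted((preprocess(s) for s in sets), key=lambda p: p[0])
    t_k, groups_k = pre[-1]                      # partition with the most bits
    out = set()
    for z_k in groups_k:
        chosen = [groups.get(z_k >> (t_k - t)) for t, groups in pre]
        if any(c is None for c in chosen):
            continue                             # some set has no element under this prefix
        common = reduce(lambda a, b: a & b, (c["image"] for c in chosen))
        while common:                            # only hash values present in all k groups
            lowbit = common & -common
            y = lowbit.bit_length() - 1
            out |= set.intersection(*(c["inv"][y] for c in chosen))
            common ^= lowbit
    return out
```

Because every $t_i$ depends only on $n_i$, each set can be preprocessed once and reused across queries, whichever other sets it is later intersected with.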
Algorithm 4: k-list Intersection via Randomized Partitioning
 1: for each $z_k \in \{0, 1\}^{t_k}$ ($t_i = \lceil \log(n_i/\sqrt{w}) \rceil$) do
 2:   Let $z_i$ be the $t_i$-prefix of $z_k$ for $i = 1, \ldots, k-1$
 3:   Compute $\bigcap_{i=1}^{k} L_i^{z_i}$ using the extended IntersectSmall
 4:   Let $\Delta \leftarrow \Delta \cup \big(\bigcap_{i=1}^{k} L_i^{z_i}\big)$
 5: $\Delta$ is the result of $\bigcap_{i=1}^{k} L_i$

3.2.1 A Multi-resolution Data Structure

Recall that in some algorithms (e.g., Theorem 3.5), the selection of the number of small groups used for a set $L_i$ depends on the (size of) other sets being intersected with $L_i$. So by naively precomputing the required structures for each possible group size, we would incur excessive space requirements. In this section, we describe a data structure that supports access to partitions of $L_i$ into $2^t$ groups for any possible $t$, using only $O(n_i)$ space. It is illustrated in Figure 2. To support the algorithms introduced so far, this structure must also allow us: (i) for each $L_i^z$, to retrieve the word representation $h(L_i^z)$, and
(ii) for each $y \in [w]$, to access all elements in $h^{-1}(y, L_i^z) = \{x \mid x \in L_i^z \text{ and } h(x) = y\}$ in time linear in its size $|h^{-1}(y, L_i^z)|$.

[Figure 2: Multi-Resolution Partition of $L_i$. The elements of $L_i$ are shown partitioned by $g_1, g_2, g_3, \ldots, g_t$, with $g_t: \Sigma \to \{0, 1\}^t$; for each $y$, a pointer leads to the first element $x$ in $L_i^z$ such that $h(x) = y$, and $next(x)$ is the smallest $x' \in L_i$ such that $x' > x$ and $h(x') = h(x)$.]

Multi-resolution Partitioning: For ease of explanation, we suppose $\Sigma = \{0, 1\}^w$ and choose $g$ as a random permutation of $\Sigma$. To preprocess $L_i$, we first order all the elements $x \in L_i$ according to $g(x)$. Then any small group $L_i^z = \{x \mid x \in L_i \text{ and } g_t(x) = z\}$ forms a consecutive interval in $L_i$ (partitions of different resolutions are formed for $t = 1, 2, \ldots$).

Note: in all of our algorithms, universal hash functions and random permutations are almost interchangeable (when used as $g$), the differences being that (i) a permutation induces a total ordering of elements (in this data structure, this property is required), whereas hashing may result in collisions (which we can overcome by using the pre-image to break ties), and (ii) there is a slight difference in the resulting probability of, e.g., elements being grouped together (hashing results in (limited) independence, whereas permutations result in negative dependence; we account for this by using the weaker condition in our proofs).

Word Representations of Hash Mappings: Now, for each small group $L_i^z$, we need to precompute and store the word representation $h(L_i^z)$. Note that the total number of small groups is $n_i/2 + n_i/4 + \ldots + n_i/2^t + \ldots \le n_i$. So this requires $O(n_i)$ space.

Inverted Mappings: We need to access all elements in $h^{-1}(y, L_i^z)$ in order, for each $y \in [w]$. If we were to store these mappings for each $L_i^z$ explicitly, this would require $O(n_i \log n_i)$ space.
However, by storing the inverted mappings $h^{-1}(y, L_i^z)$ implicitly, we can do better, as follows. For each group $L_i^z$, since it corresponds to an interval in $L_i$, we can store the starting and ending positions in $L_i$, denoted by $left(L_i^z)$ and $right(L_i^z)$. These allow us to determine whether an element $x$ belongs to $L_i^z$. Now, to enable ordered access to the inverted mappings, we define, for each $x \in L_i$, $next(x)$ to be the next element $x'$ to the right of $x$ such that $h(x') = h(x)$ (i.e., with minimum $g(x') > g(x)$ such that $h(x') = h(x)$). Then, for each $L_i^z$ and each $y \in [w]$, we store the position $first(y, L_i^z)$ of the first element $x'$ in $L_i^z$ such that $h(x') = y$. Now, to access all elements in $h^{-1}(y, L_i^z)$ in order, we can start from the element at $first(y, L_i^z)$ and follow the pointers $next(x)$ until passing the right boundary $right(L_i^z)$. In this way, all elements in the inverted mapping are retrieved in the same order as $g(x)$, which we require for IntersectSmall.

Space Requirements: For all groups of different sizes, the total space for storing the $h(L_i^z)$'s, $left(L_i^z)$'s, $right(L_i^z)$'s, $first(y, L_i^z)$'s and $next(x)$'s is $O(n_i)$. So the whole multi-resolution data structure requires $O(n_i)$ space. A detailed analysis is in the appendix. When the group size $t_i$ depends only on $n_i$ (e.g., in Algorithm 4), a single resolution in preprocessing suffices, and the above multi-resolution scheme (for selecting $t_i$ online) is not necessary.

THEOREM 3.8. To preprocess a set $L_i$ of size $n_i$ for Algorithms 3 and 4, we need $O(n_i \log n_i)$ time and $O(n_i)$ space (in words).

3.3 From Theory to Practice

In this section, we describe a more practical version of our methods. This algorithm is simpler, uses significantly less memory and straightforward data structures, and, while it has worse theoretical guarantees, is faster in practice.
The main difference is that for each small group $L_i^z$, we only store the elements in $L_i^z$ and their images under $m$ hash functions (i.e., we do not maintain inverted mappings, trading off a complex $O(1)$ access for a simple scan over a short block of data). Also, we use only a single partition for each set $L_i$. Having multiple word representations of hash images (under different hash functions) for each small group allows us to detect empty intersections of small groups with higher probability.

Preprocessing Stage: As before, each set $L_i$ is partitioned into groups $L_i^z$ using a hash function $g_{t_i}: \Sigma \to \{0, 1\}^{t_i}$. We will show that a good selection of $t_i$ is $\lceil \log(n_i/\sqrt{w}) \rceil$, which depends only on the size of $L_i$. Thus for each set $L_i$, preprocessing with a single partitioning suffices, saving significant memory. For each group, we compute word representations of images under $m$ (independent) universal hash functions $h_1, \ldots, h_m: \Sigma \to [w]$. Note that we only require a small value of $m$ in practice (e.g., $m = 2$).

Online Processing Stage: The algorithm for computing $\bigcap_i L_i$ that we use here (Algorithm 5) is identical to Algorithm 4, with two exceptions: (1) When needed, $\bigcap_i L_i^{z_i}$ is directly computed by a linear merge of the $L_i^{z_i}$'s (line 4), using $O(\sum_i |L_i^{z_i}|)$ time. (2) We can skip the computation of $\bigcap_i L_i^{z_i}$ if, for some $h_j$, the bitwise AND of the corresponding word representations $h_j(L_i^{z_i})$ is zero (line 3).

Algorithm 5: Simple Intersection via Randomized Partitioning
 1: for each $z_k \in \{0, 1\}^{t_k}$ ($t_i = \lceil \log(n_i/\sqrt{w}) \rceil$) do
 2:   Let $z_i$ be the $t_i$-prefix of $z_k$ for $i = 1, \ldots, k-1$
 3:   if $\bigcap_{i=1}^{k} h_j(L_i^{z_i}) \neq \emptyset$ for all $j = 1, \ldots, m$ then
 4:     Compute $\bigcap_{i=1}^{k} L_i^{z_i}$ by a linear merge of $L_1^{z_1}, \ldots, L_k^{z_k}$
 5:     Let $\Delta \leftarrow \Delta \cup \big(\bigcap_{i=1}^{k} L_i^{z_i}\big)$
 6: $\Delta$ is the result of $\bigcap_{i=1}^{k} L_i$

Analysis: To see why Algorithm 5 is efficient, we observe that if $L_1^{z_1} \cap L_2^{z_2} = \emptyset$, then with high probability $h_j(L_1^{z_1}) \cap h_j(L_2^{z_2}) = \emptyset$ for some $j = 1, \ldots, m$. So most empty intersections can be skipped using the test in line 3. With the probability of a successful filtering (i.e.,
given $\bigcap_i L_i^{z_i} = \emptyset$, $\bigcap_i h_j(L_i^{z_i}) = \emptyset$ for some hash function $h_j$, $j = 1, \ldots, m$) bounded by Lemmas A.1 and A.3, we can derive Theorem 3.9. Detailed analysis of this probability (both theoretical and experimental) and of the overall complexity is deferred to Appendix A.5.

THEOREM 3.9. Using $t_i = \lceil \log(n_i/\sqrt{w}) \rceil$, Algorithm 5 computes $\bigcap_{i=1}^{k} L_i$ in expected $O\big(\frac{\max(n_1, n_k)}{\alpha(w)^m} + \frac{mn}{\sqrt{w}} + kr\sqrt{w}\big)$ time ($r = |\bigcap_{i=1}^{k} L_i|$, $n = \sum_{i=1}^{k} n_i$, $\alpha(w) = 1/\beta(w)$ for the $\beta(w)$ used in Lemma A.3).

3.3.1 Data Structure for Storing $L_i^z$

In this section, we describe the simple and space-efficient data structure that we use in Algorithm 5. As stated earlier, we only need to partition $L_i$ using one hash function $g_{t_i}$; hence we can represent each $L_i$ as an array of small groups $L_i^z$, ordered by $z$. For each small group, we store the information associated with it in the structure shown in Figure 3. The first word in this structure stores $z = g_{t_i}(L_i^z)$. The second word stores the structure's length $len$. The following $m$ words represent the hash images $h_1(L_i^z), \ldots, h_m(L_i^z)$ of $L_i^z$. Finally, we store the elements of $L_i^z$ as an array in the remaining part. We need $n_i/\sqrt{w}$ such blocks for
$L_i$ in total.

[Figure 3: The Structure for a Preprocessed Small Group $L_i^z$. Fields, in order: $z$, len, the $m$ words $h_1(L_i^z), \ldots, h_m(L_i^z)$, and the elements of $L_i^z$; len spans the whole structure.]

The first word $z$ can also be computed on the fly, as these small groups are accessed sequentially in Algorithm 5. So, if we store len using one word and use one word for each element of $L_i^z$, then we need a total of $m + 1 + |L_i^z|$ words for each group $L_i^z$, and thus $n_i(1 + (m+1)/\sqrt{w})$ words to store the preprocessed $L_i$. The overhead of the preprocessing is dominated by the cost of sorting $L_i$ (the remaining operations are trivial).

THEOREM 3.10. To preprocess a set $L_i$ of size $n_i$ for Algorithm 5, we need $O(n_i(m + \log n_i))$ time and $O(n_i(1 + m/\sqrt{w}))$ (words) space.

We describe methods for compressing this structure in Appendix B.

3.4 Intersecting Small and Large Sets
An important special case of set intersection is the asymmetric intersection, where the sizes $n_1$ and $n_2$ of the sets that are intersected vary significantly (w.l.o.g., assume $n_1 \le n_2$). In this subsection, using the same multi-resolution data structure as in Section 3.2, we present an algorithm HashBin that computes $L_1 \cap L_2$ in $O(n_1 \log(n_2/n_1))$ time. This bound is also achieved by other previous works, e.g., SmallAdaptive [5], but our algorithm is even simpler in online processing. It is also known that algorithms based on hash tables only require $O(n_1)$ time for this scenario; however, unlike HashBin, they are ill-suited for less asymmetric cases.

Algorithm HashBin: When intersecting two sets $L_1$ and $L_2$ with sizes $n_1 \le n_2$, we focus on the partitioning induced by $g_t: \Sigma \to \{0,1\}^t$, where $t = \lceil \log n_1 \rceil$ for both of them, and $g$ is a random permutation of $\Sigma$. To compute $L_1 \cap L_2$, we compute $L_1^z \cap L_2^z$ for all $z \in \{0,1\}^t$ and take the union. To compute $L_1^z \cap L_2^z$, we iterate over each $x \in L_1^z$ and perform a binary search to check whether $x \in L_2^z$, using $O(\log |L_2^z|)$ time. This scheme can be extended to multiple sets by searching for $x$ in $L_i^z$ if it was found in $L_1^z, \ldots, L_{i-1}^z$.

THEOREM 3.11. The algorithm HashBin computes $L_1 \cap L_2$ in expected $O\left(n_1 \log \frac{n_2}{n_1}\right)$ time.
The preprocessing of a list $L_i$ requires $O(n_i \log n_i)$ time and $O(n_i)$ space.

The proof of Theorem 3.11 and the way HashBin uses the multi-resolution data structure are deferred to Section A.6 in the appendix. The advantage of HashBin is that, since it is based on the same structure as the algorithm introduced in Section 3.2, we can make the choice between the algorithms online, based on $n_1/n_2$.

4. EXPERIMENTAL EVALUATION
We evaluate the performance and space requirements of four of the techniques described in this paper: (a) the fixed-width partition algorithm described in Section 3.1 (which we will refer to as IntGroup); (b) the randomized partition algorithm in Section 3.2 (RanGroup); (c) the simple algorithm based on randomized partitions described in Section 3.3 (RanGroupScan); and (d) the one for intersecting sets of skewed sizes in Section 3.4 (HashBin).

Setup: All algorithms are implemented using C and evaluated on a 4GB 64-bit 2.4GHz PC. We employ a random permutation of the document IDs for the hash function $g$ and 2-universal hash functions for $h$ (or the $h_j$'s). For RanGroup, we use $m = 4$ (the number of hash functions $h_j$), unless noted otherwise.

We compare our techniques to the following competitors: (i) set intersection based on a simple parallel scan of inverted indexes: Merge; (ii) set intersection based on skip lists: SkipList [18]; (iii) set intersection based on hash tables: Hash (i.e., we iterate over the smallest set $L_1$, looking up every element $x \in L_1$ in hash-table representations of $L_2, \ldots, L_k$); (iv) the algorithm of [6]: BPP; (v) the algorithm proposed for fast intersection in integer inverted indices in main memory [19, 21]: Lookup (using B = 32 as the bucket size, which is the best value in our and the authors' experience); and (vi) various adaptive intersection algorithms based on binary search/galloping search: SvS, Adaptive [3, 12, 13], BaezaYates [1, 2], and SmallAdaptive [5]. Note that BaezaYates is generalized to handle more than two sets as in [5].
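To make the baseline competitors concrete, the following is a minimal Python sketch of three of the generic techniques named above: a parallel scan (Merge), hash-table lookup (Hash), and a galloping-search intersection in the spirit of SvS. It is an illustration only, not the authors' optimized C implementation; the function names and the two-set restriction are assumptions made for brevity.

```python
from bisect import bisect_left

def merge_intersect(a, b):
    """Parallel scan over two sorted lists (the 'Merge' baseline)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def hash_intersect(small, large):
    """Iterate over the smaller set, probing a hash table built on the larger one."""
    table = set(large)
    return [x for x in small if x in table]

def gallop(lst, target, start):
    """Smallest index i >= start with lst[i] >= target (doubling, then binary search)."""
    step, hi = 1, start
    while hi < len(lst) and lst[hi] < target:
        start = hi + 1          # everything up to hi is known to be < target
        hi += step
        step *= 2
    return bisect_left(lst, target, start, min(hi, len(lst)))

def svs_intersect(small, large):
    """Search each element of the smaller sorted list in the larger one by galloping."""
    out, pos = [], 0
    for x in small:
        pos = gallop(large, x, pos)
        if pos == len(large):
            break
        if large[pos] == x:
            out.append(x)
    return out

if __name__ == "__main__":
    a = list(range(0, 100, 3))
    b = list(range(0, 100, 5))
    print(merge_intersect(a, b), hash_intersect(a, b), svs_intersect(a, b))
```

All three return the same elements; they differ only in how much of the larger list they touch, which is what the experiments on skewed set sizes probe.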
Implementation: For each competitor, we tried our best to optimize its performance. For example, for Merge we tried to minimize the number of branches in the inner loop; we also store postings in consecutive memory addresses to speed up parallel scans and reduce page walks after TLB misses. Our implementation of skip lists follows [18], with simplifications since we are focusing on static data and do not need fast insertion/deletion. We also simplified the bit manipulation in BPP [6] so that it works faster in practice for small $w$. For the algorithms using inverted indexes, we initially do not consider compression of the posting lists, as we do not want the decompression step to impact the performance reports. In Section 4.1 we will study variants of the algorithms incorporating compression. With regard to skip operations in the index, note that since we use uncompressed posting lists, algorithms such as Adaptive can perform arbitrary skips into the index directly.

Datasets: To evaluate these algorithms we use both synthetic and real data. For the experiments with synthetic datasets, sets are generated randomly (and uniformly) from a universe $\Sigma$. The real dataset is a collection of more than 8M Wikipedia pages. In each experiment for the synthetic datasets, 2 combinations of sets are randomly generated, and the average time is reported.

[Figure 4: Varying the Set Size. x-axis: Set Size; y-axis: Intersection Time (ms); series: Merge, SkipList, Hash, IntGroup, BPP, Adaptive, Lookup, RanGroupScan.]

Varying the Set Size: First, we measure the performance when intersecting only 2 sets; we use synthetic data, the lists are of equal size, and the size of the intersection is fixed at % of the list size; the results are shown in Figure 4. We can see that the performance of the different techniques relative to each other does not change with varying list size. Hash performs worst, as the (relatively) expensive lookup operation needs to be performed many times. SkipList performs poorly for the same reason.
The BPP algorithm is also slow, but this is because of a number of complex operations that need to be performed, which are hidden as a constant in the $O(\cdot)$-notation. The same trend held for the remaining experiments as well; hence, for readability, we did not include BPP in the subsequent graphs. For the same reason we only show the best-performing among the adaptive algorithms in the evaluation; if one adaptive algorithm dominates another on all parameter settings in an experiment, we don't plot the worse one. Among the remaining algorithms, RanGroupScan (4%5% faster than Merge) and IntGroup perform the best (RanGroup performs similarly to IntGroup and is not plotted). Interestingly,
the simple Merge algorithm is next, outperforming the more sophisticated algorithms, followed by Lookup and the best-performing adaptive algorithm.

[Figure 5: Varying the Intersection Size. x-axis: Intersection Size; y-axis: Intersection Time (ms); series: Merge, SkipList, Hash, Adaptive, SvS, Lookup, IntGroup, RanGroup, RanGroupScan; annotation: Merge performs best for $|L_1 \cap L_2| > 0.7|L_1|$.]

[Figure 6: Varying the Number of Keywords. y-axis: Average Intersection Time (ms); groups: k=2, k=3, k=4.]

Varying the Intersection Size: The size of the intersection $r$ is an important factor concerning the performance of the algorithms: larger intersections mean fewer opportunities to eliminate small groups early for our algorithms, or to skip parts of the set for the adaptive and skip-list-based approaches. Here, we use synthetic data, intersecting two sets with 10M elements each, and vary $r = |L_1 \cap L_2|$ between 5 and 10M. The results are reported in Figure 5. For $r$ < 7M (70% of the set size), RanGroupScan and IntGroup perform best. Otherwise, Merge becomes the fastest and RanGroupScan the second-fastest alternative; here, the performance of RanGroupScan is very similar to that of Merge, all the way to $r$ = 10M. Among the remaining algorithms, RanGroup slightly outperforms Merge for $r$ < 5M, Lookup is the next-best algorithm, and SvS and Adaptive perform best among the adaptive algorithms.

Varying the Set Size Ratios: As we illustrated in the introduction, the skew in set sizes is also an important factor in performance. When sets are very different in size, algorithms that iterate through the smaller set and are able to locate the corresponding values in the larger set quickly, such as HashBin and Hash, perform well. In this experiment we use synthetic data and vary the ratio of set sizes, setting $|L_2|$ = M and varying $|L_1|$ between 6K and M. The size of the intersection is set to be % of $|L_1|$, and we define the ratio between the list sizes as $sr = |L_2|/|L_1|$. Here, the differences between the algorithms become small with growing $sr$ (for this reason, we also don't report them in a graph, as too many lines overlap).
For $sr$ < 32, RanGroupScan performs best; for larger $sr$, Lookup and Hash perform best, until a ratio of sr; for this and larger ratios, Hash outperforms the remaining algorithms, followed by Lookup and HashBin. Generally, both HashBin and RanGroupScan perform close to the best-performing algorithm. The adaptive algorithms require more time than RanGroupScan for sr 2 and more time than HashBin for all values of $sr$; SkipList and BPP perform worst across all values of $sr$.

Varying the Number of Keywords: In this experiment, we varied the number of sets $k$ = 2, 3, 4, fixing $|L_i|$ = M for $i = 1, \ldots, k$, with the IDs in the sets being randomly generated using a uniform distribution over [, 2 8 ]; the results are reported in Figure 6. In this experiment, we use $m = 2$ hash images for RanGroupScan. For multiple sets, RanGroupScan is the fastest, with the difference becoming more pronounced for 3 and 4 keywords, since, with additional sets, intersecting the hash images (word representations) yields more empty results, allowing us to skip the corresponding groups. RanGroup is the next-best performing algorithm; we don't include results for IntGroup here, as it is designed for intersections of two sets (see Section 3.1). Interestingly, the simple Merge algorithm again performs very well when compared to the more sophisticated techniques; the Lookup algorithm is next, followed by the various adaptive techniques.

Size of the Data Structure: The improvements in speed come at the cost of an increase in space: our data structures (without compression) require more space than an uncompressed posting list. The increase is 37% (RanGroupScan for $m = 2$), 63% (RanGroupScan for $m = 4$), 75% (IntGroup) or 87% (RanGroup).

[Figure 7: Normalized Execution Time on a Real Workload. Panel title: Results: All Query Sizes; y-axis: Normalized Intersection Time.]

Experiment on Real Data: In this experiment, we used a workload of the 4 most frequent (measured over a week in 2009) queries against the Bing.com search engine.
As the text corpus, we used a set of 8 million Wikipedia documents. Query characteristics: 68% of the queries contain 2 keywords, 23% contain 3 keywords and 6% contain 4 keywords. As we have illustrated before, a key factor for performance is the ratio of set sizes: among the 2-word queries, the average ratio $|L_1|/|L_2|$ is .2; for 3-word queries, the average ratio $|L_1|/|L_2|$ is .3 and the average ratio $|L_1|/|L_3|$ is .9; and for 4-word queries, the $|L_1|/|L_2|$ ratio is .36 and the $|L_1|/|L_4|$ ratio is .6 (note that $|L_1| \le |L_2| \le |L_3| \le |L_4|$). The average ratio of intersection size to $|L_1|$ is .9.

To illustrate the relative performance of the algorithms over all queries, we plotted their average running times in Figure 7; here, the running time of Merge is normalized to 1. Both RanGroup and RanGroupScan significantly outperform Merge, with the latter performing the best overall; interestingly, when used for all queries (as opposed to only for the large-skew case it was designed for), HashBin still performed better than Merge. The remaining algorithms performed in similar order to the earlier experiments, with the one exception being SvS, which outperformed both Merge and Lookup for this more realistic data. Overall, RanGroupScan was the best-performing algorithm for 6.6% of the queries, followed by RanGroup (6%) and HashBin (7.7%); among the remaining algorithms not proposed in this paper, Lookup performed best for 6.4% of the queries and SvS for 3.6% of the queries. All of the other techniques were best for 2.% of the queries or fewer. We present additional experiments for this data set in Appendix C.2.
4.1 Experiments on Compressed Structures
To illustrate the impact of compression on performance, we repeated the first experiment above, intersecting two sets of identical size, with the size of the intersection fixed to % of the set size. Varying the set size, we report the execution times and storage requirements for the three algorithms that performed best overall in the earlier experiments, Merge, Lookup and RanGroupScan (since we are interested in small structures here, we only use $m = 1$ hash images in RanGroupScan), when being compressed with different techniques: we used the standard techniques based on $\gamma$- and $\delta$-coding (see [23], p.6) to compress the parts of the posting data stored and accessed sequentially for the three algorithms, and the compression technique described in Appendix B for RanGroupScan (which we refer to as RanGroupScan_Lowbits). The results are shown in Figure 8; here, we omitted the results for $\gamma$-coding, as they were essentially indistinguishable from the ones for $\delta$-coding.

RanGroupScan outperforms, in terms of speed, the other two algorithms using the same compression scheme; the other two algorithms perform similarly to each other, as the decompression now dominates their runtime. Using our encoding scheme of Appendix B improves the performance significantly. Looking at the graph, we can see that the storage requirement for RanGroupScan (using our own encoding) is between .3.9x of the size of the compressed inverted index and between .2.6x of the compressed Lookup structure. At the same time, the performance improvements are between 7.65x (vs. Merge) or 7.43x (vs. Lookup). Furthermore, by increasing the number of hash images to $m = 2$, we obtain an algorithm that significantly outperforms the uncompressed Merge, while requiring less memory.
[Figure 8: Running Time and Space Requirement. Two panels plotted against the number of postings: intersection time (ms) and size of the structure (in words); series: Merge_Delta, Lookup_Delta, RanGroupScan_Lowbits, RanGroupScan_Delta.]

Experiment on Real Data: We repeated this experiment using the real-life data/workload described earlier and the compressed variants of RanGroupScan, Lookup and Merge. Again, RanGroupScan_Lowbits performed best, improving runtimes by a factor of 8.4x (vs. Merge + $\delta$-coding), 9.x (Merge + $\gamma$-coding), 5.7x (Lookup + $\delta$-coding), and 6.2x (Lookup + $\gamma$-coding), respectively. However, our approach required the most space (66% of the uncompressed data), whereas Merge (26% / 28% for $\gamma$- / $\delta$-coding) and Lookup (35% / 37%) required significantly less. Finally, to illustrate the robustness of our techniques, we also measured the worst-case latency for any single query: here, the worst-case latency using Merge + $\delta$-coding was 5.2x the worst-case latency of RanGroupScan_Lowbits. We saw similar results for Merge + $\gamma$-coding (5.6x higher), Lookup + $\delta$-coding (4.4x higher), and Lookup + $\gamma$-coding (4.9x higher).

5. CONCLUSION
In this paper we introduced algorithms for set intersection processing for memory-resident data. Our approach provides both novel theoretical worst-case guarantees and very fast performance in practice, at the cost of increased storage space. Our techniques outperform a wide range of existing techniques and are robust in that, for inputs for which they are not the best-performing approach, they perform close to the best one. Our techniques have applications in information retrieval and query processing scenarios where performance is of greater concern than space.

6. REFERENCES
[1] R. A. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In CPM, pages 4 48, 24.
[2] R. A. Baeza-Yates and A. Salinger. Experimental Analysis of a Fast Intersection Algorithm for Sorted Sequences.
In SPIRE, pages 3 24, 25.
[3] J. Barbay and C. Kenyon. Adaptive intersection and t-threshold problems. In SODA, pages , 22.
[4] J. Barbay, A. López-Ortiz, and T. Lu. Faster Adaptive Set Intersections for Text Searching. In 5th WEA, pages 46 57, 26.
[5] J. Barbay, A. López-Ortiz, T. Lu, and A. Salinger. An experimental investigation of set intersection algorithms for text searching. ACM Journal of Experimental Algorithmics, 4:7 24, 2.
[6] P. Bille, A. Pagh, and R. Pagh. Fast Evaluation of Union-Intersection Expressions. In ISAAC, pages , 27.
[7] G. E. Blelloch and M. Reid-Miller. Fast Set Operations using Treaps. In ACM SPAA, pages 6 26, 998.
[8] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In CIKM, pages , 23.
[9] M. R. Brown and R. E. Tarjan. A Fast Merging Algorithm. Journal of the ACM, 26(2):2 226, 979.
[10] J. Brutlag. Speed Matters for Google Web Search.
[11] E. Chiniforooshan, A. Farzan, and M. Mirzazadeh. Worst case optimal union-intersection expression evaluation. In ALENEX, pages 79 9, 2.
[12] E. Demaine, A. López-Ortiz, and J. Munro. Adaptive Set Intersections, Unions, and Differences. In SODA, pages , 2.
[13] E. Demaine, A. López-Ortiz, and J. Munro. Experiments on Adaptive Set Intersections for Text Retrieval Systems. In ALENEX, pages 9 4, 2.
[14] D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge, 29.
[15] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, pages 2 3, 2.
[16] F. K. Hwang and S. Lin. A Simple Algorithm for Merging Two Disjoint Linearly Ordered Sets. SIAM Journal, ():3 39, 972.
[17] G. Linden.
[18] W. Pugh. A skip list cookbook. Technical Report UMIACSTR , University of Maryland, 99.
[19] P. Sanders and F. Transier. Intersection in Integer Inverted Indices. In ALENEX, pages 7 83, 27.
[20] S. Tatikonda, F. Junqueira, B. B. Cambazoglu, and V. Plachouras.
On Efficient Posting List Intersection with Multicore Processors. In ACM SIGIR, pages , 29.
[21] F. Transier and P. Sanders. Compressed inverted indexes for in-memory search engines. In ALENEX, pages 3 2, 28.
[22] D. Tsirogiannis, S. Guha, and N. Koudas. Improving the performance of list intersection. PVLDB, 2(): , 29.
[23] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, 999.
APPENDIX

Acknowledgments
We thank the anonymous reviewers for their numerous insights and suggestions that immensely improved the paper.

A. PROOFS OF THEOREMS

A.1 Analysis of Algorithm 1 (Proof of Theorem 3.3)
There are a total of $O((n_1 + n_2)/\sqrt{w})$ pairs of $L_1^p$ and $L_2^q$ to be checked in Algorithm 1. For each pair, since $H = h(L_1^p) \cap h(L_2^q)$ can be computed in $O(1)$ time and the elements in $H$ can be enumerated in linear time, the cost of computing $L_1^p \cap L_2^q$ is dominated by computing $h^{-1}(y, L_1^p) \cap h^{-1}(y, L_2^q)$ for every $y \in H$, the cost of which is in turn determined by the number of pairs of elements that are mapped to the same location by $h$; we denote this set as $I = \{(x_1, x_2) \mid x_1 \in L_1^p,\ x_2 \in L_2^q,\ \text{and } h(x_1) = h(x_2)\}$. Let $I_= = \{(x_1, x_2) \mid x_1 = x_2\} \subseteq I$ denote the pairs of identical elements (i.e., elements in the intersection) in $I$, and $I_{\neq} = \{(x_1, x_2) \mid x_1 \neq x_2\} \subseteq I$ the remaining pairs of elements that are hashed to the same value by $h$ but are not identical. Obviously, $|I_=| = |L_1^p \cap L_2^q|$. If we can show $E[|I_{\neq}|] = O(1)$, the proof is completed: this is because, for a total of $O((n_1 + n_2)/\sqrt{w})$ pairs of $L_1^p$ and $L_2^q$ to be checked, the total running time is

$\sum_{p,q} O(E[|I|] + 1) = \left(\sum_{p,q} |I_=| + \sum_{p,q} (E[|I_{\neq}|] + 1)\right) \cdot O(1) = O(r) + (n_1 + n_2)/\sqrt{w} \cdot O(1).$ (3)

Indeed, we can show for each pair of $L_1^p$ and $L_2^q$ that

$E[|I_{\neq}|] = \sum_{x_1 \in L_1^p,\ x_2 \in L_2^q,\ x_1 \neq x_2} \Pr[h(x_1) = h(x_2)] = O(1)$ (4)

for a universal hash function $h$, which completes the proof.

A.1.1 Group Size and Optimizing Running Time
In Algorithm 1, the group size is selected as the magical number $\sqrt{w}$ (i.e., $|L_1^p| = |L_2^q| = \sqrt{w}$). To explain this choice, we now explore the effect of group size on the running time of Algorithm 1. Suppose that, in general, $L_i$ is partitioned into groups of size $s_i$. Extending Equation (4) a bit, we have $E[|I_{\neq}|] = O(1)$ as long as $s_1 s_2 \le w$. Then, following the same argument as in (3), a total of $O(n_1/s_1 + n_2/s_2)$ pairs are to be checked, and the expected running time of Algorithm 1 is $O(T(s_1, s_2))$, where $T(s_1, s_2) = n_1/s_1 + n_2/s_2 + r$.
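As a worked check of this trade-off, the short derivation below sketches where balanced group sizes come from. It treats $s_1, s_2$ as continuous and reads the constraint as $s_1 s_2 \le w$; both are assumptions of this sketch rather than statements taken from the proof.

```latex
\begin{align*}
T(s_1, s_2) &= \frac{n_1}{s_1} + \frac{n_2}{s_2} + r
  \;\ge\; 2\sqrt{\frac{n_1 n_2}{s_1 s_2}} + r   && \text{(AM--GM)} \\
 &\ge\; 2\sqrt{\frac{n_1 n_2}{w}} + r           && (s_1 s_2 \le w).
\end{align*}
% Equality requires n_1/s_1 = n_2/s_2 and s_1 s_2 = w, i.e.
%   s_1 = \sqrt{w\, n_1 / n_2}, \qquad s_2 = \sqrt{w\, n_2 / n_1}.
% With s_1 = s_2 = \sqrt{w} one gets instead T = (n_1 + n_2)/\sqrt{w} + r.
```

Since $2\sqrt{n_1 n_2} \le n_1 + n_2$, the balanced choice is never worse than $s_1 = s_2 = \sqrt{w}$, and the gap widens as the set sizes become more skewed.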
Minimizing $T(s_1, s_2)$ under the constraint $s_1 s_2 \le w$ yields optimal group sizes of $s_1^* = \sqrt{w n_1/n_2}$ and $s_2^* = \sqrt{w n_2/n_1}$, and the optimal running time is $O(T(s_1^*, s_2^*)) = O(\sqrt{n_1 n_2/w} + r)$. If we now use the group sizes $s_1 = s_2 = \sqrt{w}$, as in the proof of Theorem 3.3, we obtain a running time of $O(T(s_1, s_2)) = O((n_1 + n_2)/\sqrt{w} + r)$. $O(\sqrt{n_1 n_2/w} + r)$ is better than $O((n_1 + n_2)/\sqrt{w} + r)$ when the set sizes are skewed (e.g., $n_1 \ll n_2$ or $n_1 = \sqrt{n_2}$). To achieve the better bound, we leverage that the group size $s_1^* = \sqrt{w n_1/n_2}$ of the set $L_1$ depends on the size $n_2$ of the set $L_2$ to be intersected with it, and use a multi-resolution structure which keeps different partitions of a set, as discussed at the end of Section 3.1.

A.2 Analysis of Algorithm 3 (Proof of Theorem 3.5)
Similar to the proof of Theorem 3.3, the cost of computing $L_1^{z_1} \cap L_2^{z_2}$ using IntersectSmall for each pair of small groups $L_1^{z_1}$ and $L_2^{z_2}$ is determined by the size of $I = \{(x_1, x_2) \mid x_1 \in L_1^{z_1},\ x_2 \in L_2^{z_2},\ \text{and } h(x_1) = h(x_2)\}$. As in A.1, let $I_= = \{(x_1, x_2) \mid x_1 = x_2\} \subseteq I$ and $I_{\neq} = \{(x_1, x_2) \mid x_1 \neq x_2\} \subseteq I$. Obviously, $|I_=| = |L_1^{z_1} \cap L_2^{z_2}|$, and $I_{\neq}$ is the set of element pairs that result in a hash collision. If we can show $E[|I_{\neq}|] \le O(1)$, the proof is complete: since $t_1 = t_2 = \lceil \log \sqrt{n_1 n_2/w} \rceil$, there are $O(\sqrt{n_1 n_2/w})$ pairs of $z_1$ and $z_2$ to be considered (we have $z_1 = z_2 = z$ in every iteration), and thus the total running time is

$\sum_{z \in \{0,1\}^{t_2}} O(E[|I|] + 1) = \left(\sum_{z \in \{0,1\}^{t_2}} |I_=| + \sum_{z \in \{0,1\}^{t_2}} (E[|I_{\neq}|] + 1)\right) \cdot O(1) \le O(r) + \sqrt{n_1 n_2/w} \cdot O(1).$

Now we prove that, for each pair $(z_1, z_2)$, $E[|I_{\neq}|] = O(1)$. Letting $S_1^{z_1} = L_1^{z_1} \setminus L_2^{z_2}$ and $S_2^{z_2} = L_2^{z_2} \setminus L_1^{z_1}$, if $L_1^{z_1}$ and $L_2^{z_2}$ are fixed, we have (similar to (4) in the proof of Theorem 3.3): $E_h[|I_{\neq}| \mid S_1^{z_1}, S_2^{z_2}] = |S_1^{z_1}| \cdot |S_2^{z_2}| / w$. $S_1^{z_1}$ and $S_2^{z_2}$ are random variables determined by the hash function $g$. From their definition and the property of 2-universal (2-independent) hashing, we can prove $E_g[|S_1^{z_1}| \cdot |S_2^{z_2}|] \le E_g[|S_1^{z_1}|] \cdot E_g[|S_2^{z_2}|]$ (using a random permutation $g$ yields the same result). Also, $E_g[|S_1^{z_1}|] \le E_g[|L_1^{z_1}|] = O(n_1)/2^{t_1} = O(\sqrt{w n_1/n_2})$, and similarly $E_g[|S_2^{z_2}|] \le O(\sqrt{w n_2/n_1})$.
Therefore, $E_g[|S_1^{z_1}| \cdot |S_2^{z_2}|] = E_g[|S_1^{z_1}|] \cdot E_g[|S_2^{z_2}|] \le O(w)$, and thus,

$E[|I_{\neq}|] = E_g\left[E_h[|I_{\neq}| \mid S_1^{z_1}, S_2^{z_2}]\right] = E_g\left[|S_1^{z_1}| \cdot |S_2^{z_2}| / w\right] \le O(w)/w = O(1),$ (5)

which completes the argument.

A.3 Analysis of Algorithm 4 (Proof of Theorem 3.6 and Theorem 3.7)
Theorem 3.6 is a special case of Theorem 3.7 for two-set intersection, so we only present the proof of Theorem 3.7 below. Consider any element $x \in L_i^{z_i}$ for each set $L_i$ involved in the intersection computation, i.e., the extended IntersectSmall in line 3 of Algorithm 4, where we compute:

$H = \bigcap_{i=1}^{k} h(L_i^{z_i})$, and $\bigcap_{i=1}^{k} L_i^{z_i} = \bigcup_{y \in H} \left(\bigcap_{i=1}^{k} h^{-1}(y, L_i^{z_i})\right).$

Denote the set of all such elements (with $h(x) = y \in H$) by $\Gamma$. The number of such elements $|\Gamma|$ dominates the cost of Algorithm 4. We first differentiate two cases of elements in $\Gamma$:
(i) $x \in \bigcap_{i=1}^{k} L_i$: These $r$ elements are scanned $k$ times, and thus contribute a factor of $O(kr)$ to the time complexity overall.
(ii) $x \notin \bigcap_{i=1}^{k} L_i$: We group all these elements into $k-1$ sets, $D_2, \ldots, D_k$ (an element $x$ may belong to multiple $D_i$'s): $D_i = \{x \in \Gamma \mid x \in L_i \cap L_{i+1} \cap \ldots \cap L_k \wedge x \notin L_j \text{ for some } j < i\}$.

Now focus on $D_i \cap L_i^{z_i}$ for each $z_i \in \{0,1\}^{t_i}$. For any $x \in L_i$ with $x \notin L_j$ for some $j < i$, letting $z_j$ be the $t_j$-prefix of $z_i$, $x \in D_i \cap L_i^{z_i}$ implies that $h(x) \in H$ and thus there exists $x' (\neq x) \in L_j^{z_j}$ such that $h(x) = h(x')$; so for such an $x$,

$\Pr[x \in D_i \cap L_i^{z_i} \mid x \in L_i^{z_i}] = \Pr[h(x) \in H \mid x \in L_i^{z_i}] \le \sum_{x' \in L_j^{z_j}} \Pr[h(x) = h(x')] \le |L_j^{z_j}|/w.$
Generalizing Equation (5) in the proof of Theorem 3.5, we have

$E[|D_i \cap L_i^{z_i}|] = E_g\left[E_h[|D_i \cap L_i^{z_i}| \mid L_j^{z_j} \text{ for all } z\text{'s}]\right] \le \sum_{x \in L_i} E_g[|L_j^{z_j}|/w] \cdot \Pr[x \in L_i^{z_i}] = O(1)$

(as $E_g[|L_j^{z_j}|] = \sqrt{w}$ for any $j$, and $\Pr[x \in L_i^{z_i}] = \sqrt{w}/n_i$). So $E[|D_i|] \le O(n_i/\sqrt{w})$, as $L_i$ is partitioned into $2^{t_i} = n_i/\sqrt{w}$ groups $L_i^{z_i}$ over all iterations of Algorithm 4. Then we have

$E\left[\sum_{i=2}^{k} |D_i|\right] \le O\left(\sum_{i=2}^{k} n_i/\sqrt{w}\right) = O(n/\sqrt{w}).$ (6)

Running Time: As the $|D_i|$'s are bounded as above, a naive implementation of Algorithm 4 requires $O(k n_k/\sqrt{w} + kr)$ time in expectation. The iteration of lines 1-4 in Algorithm 4 repeats $n_k/\sqrt{w}$ times (suppose $n_1 \le n_2 \le \ldots \le n_k$). In each iteration, we compute $H$ in $O(k)$ time, and each element in $D_i$ needs $O(k)$ comparisons to be eliminated. $n/\sqrt{w}$ is potentially smaller than $k n_k/\sqrt{w}$, especially for sets with skewed sizes. With careful memoization of the partial results $\bigcap_{i=1}^{j} h(L_i^{z_i})$ and $\bigcap_{i=j}^{k} h^{-1}(y, L_i^{z_i})$ in Algorithm 4, from (i) and (ii), we now prove the promised running time $O(n/\sqrt{w} + kr)$:

The major cost of Algorithm 4 comes from the computation of (a) $H = \bigcap_{i=1}^{k} h(L_i^{z_i})$ and (b) $\bigcap_{i=1}^{k} h^{-1}(y, L_i^{z_i})$ for each $y \in H$. Assume $n_1 \le n_2 \le \ldots \le n_k$. For (a), as $z_j$ is the $t_j$-prefix of $z_i$ if $j \le i$, we can memoize $\bigcap_{i=1}^{j} h(L_i^{z_i})$ for each $z_j$. Then, for example, we reuse $h(L_1^{z_1}) \cap h(L_2^{z_2})$ when computing $h(L_1^{z_1}) \cap h(L_2^{z_2}) \cap h(L_3^{z_3})$ for every group $z_3$ that has $z_2$ as its prefix. In this way, the computation of $H$ for different combinations of $z_1, \ldots, z_k$ requires $\sum_i O(n_i/\sqrt{w}) = O(n/\sqrt{w})$ time. For (b), for each combination of $z_1, \ldots, z_k$, we compute the result in the inverse order (from $i = k$ to $i = 1$): the partial results $\bigcap_{i=j}^{k} h^{-1}(y, L_i^{z_i})$ (for all $y \in H$, all $z_j$'s, and some $j$) have their total size bounded by $\sum_i |D_i| + kr$. Using the hash-table-based approach to compute the intersection, the total running time is bounded by the total size of the partial results. So from (6), the total running time is $O(n/\sqrt{w} + kr)$ in expectation.

A.4 Analysis of the Multi-resolution Structure (Proof of Theorem 3.8)
The time bound is trivial, because we only require sorting and scanning of each set.
The total space for storing the $h(L_i^z)$'s, $\mathrm{left}(L_i^z)$'s, and $\mathrm{right}(L_i^z)$'s is $O(n_i)$, as there are $O(n_i + n_i/2 + n_i/4 + \ldots) = O(n_i)$ groups of different sizes. For the $\mathrm{next}(x)$'s we also only need $O(n_i)$ space, as there are $n_i$ elements in the set. We now analyze the space needed for the $\mathrm{first}(y, L_i^z)$'s to complete the proof. To store $\mathrm{first}(y, L_i^z)$ for each $y \in [w]$ and each $z$, storing the difference between $\mathrm{first}(y, L_i^z)$ and $\mathrm{left}(L_i^z)$ suffices; so we need $O(\log |L_i^z|)$ bits. To store the $\mathrm{first}(y, L_i^z)$'s for all $y \in [w]$ in a group $L_i^z$, we need $O(w \log |L_i^z| / w) = O(\log |L_i^z|)$ words. Consider the partitioning induced by $g_t: \Sigma \to \{0,1\}^t$ for some $t$; letting $t' = \log n_i - t$, there are $O(n_i/2^{t'})$ groups $L_i^z$ generated by $g_t$, so the space we need for all these groups is (as $\log(\cdot)$ is concave):

$O\left(\sum_{z \in \{0,1\}^t} \log |L_i^z|\right) \le O(2^t \log(n_i/2^t)) = O((n_i/2^{t'}) \cdot t').$

Therefore, for all resolutions $t = 1, 2, \ldots, \log n_i$, the total space needed for the $\mathrm{first}(y, L_i^z)$'s is $O(\sum_{t'} t' \cdot n_i/2^{t'}) \le O(n_i)$.

A.5 Analysis of Algorithm 5 (Proof of Theorem 3.9)

A.5.1 Probability of Successful Filtering
Recall that in Algorithm 5, sets are partitioned into small groups by the hash function $g$, and $m$ universal hash functions $h_1, \ldots, h_m$ are used to test whether the intersection of small groups is empty. It is efficient because of the following observation: if $L_1^{z_1} \cap L_2^{z_2} = \emptyset$, then $h_j(L_1^{z_1}) \cap h_j(L_2^{z_2}) = \emptyset$ for some $j = 1, \ldots, m$ (so-called "successful filtering") with high probability. But once a false positive happens (i.e., $\bigcap_i L_i^{z_i} = \emptyset$ but $\bigcap_i h_j(L_i^{z_i}) \neq \emptyset$ for every hash function $h_1, \ldots, h_m$), we have to scan the two or $k$ small groups for the intersection. So to analyze Algorithm 5, the key point we need to establish is that successful filtering happens with a constant probability for two or $k$ small groups. We first verify the above intuition, by assuming that $|L_i^{z_i}|$ is $\sqrt{w}$:

LEMMA A.1.
For to small groups L z and Lz 2 2 ith Lz = L z 2 2 =, given a universal hash function h : Σ [], if L z Lz 2 2 =, then h(l z ) h(lz 2 2 ) = ith probability at least ( ) (.3436 for = 64). PROOF. Since L z Lz 2 2 =, for each x 2 L z 2 2, e have h(x 2) / h(l z ) holds ith probability h(lz )/. So, Pr [h(l z ) h(l z 2) = ] ( h(lz ) ) L z 2 2 ( Lz ) L z 2 2, as h(l z ) Lz. So, hen Lz = Lz 2 2 =, e have ( ) Pr [h(l z ) h(l z 2) = ]. In general, although the sizes of the small groups L z i are random variables, they are unliely to deviate from by much. This is important since groups of larger sizes result in poorer filtering performance of the ord representations h (L z i i ) s (incurring more false positives). Using Chernoff bounds e can sho that: PROPOSITION A.2. For any group L z i i defined in Algorithm 5 (i.e., partition L i by g ti : Σ {, } t i ith t i = log(n i/ ) ), e have: (i) E [ L z i 2 i ] ; (ii) Pr [ L z i i ( + ɛ) ] exp (iii) Pr [ L z i i δ() ] ( 6 ln(4 ) + 4 ) /2 ( for = 64). ( ɛ 2 3 ), for < ɛ < ;, here the constant δ()= PROOF. In this proof e use a random permutation as g : Σ Σ, and define g t(x) to be the t most significant bits of g(x). Hoever, note that e can use a hash function here as ell, if e use the preimage to brea any ties resulting from hash collisions (thereby resulting in a total ordering). For the group L z i i, define Yx = if x Lz i i (i.e., g ti (x) = z i), and Y x = otherise. So L z i i = x Yx. Then (i) is from the fact that Pr [Y x = ] = /2 t i and the linearity of expectation. For a random permutation g, e can prove that the {Y x x L i} are negatively associated [4], so the Chernoff bounds can be still applied. As in (i), e have µ L = /2 µ = E [ L z i ] i = µh. To prove (iii), e use the Chernoff bound: Pr [ L z i i > ( + ɛ)µ] < exp( µɛ2 /3) exp( µ Lɛ 2 /3) [4]. To prove (ii), e can use a tighter bound: for < ɛ <, Pr [ L z i i > ( + ɛ)µh] < exp( µhɛ2 /3) [4]. Note that the same bounds hold hen a hash function is used as g. 
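The constant quoted in Lemma A.1 can be checked numerically. The sketch below evaluates the bound $(1 - 1/\sqrt{w})^{\sqrt{w}}$ for $w = 64$ and compares it with a small simulation. The simulation models $h$ as a fully random function on two disjoint groups of $\sqrt{w}$ elements each, which is an idealization assumed here for illustration and not the 2-universal family used in the experiments.

```python
import math
import random

def lemma_a1_bound(w):
    """Lower bound (1 - 1/sqrt(w))^sqrt(w) on the probability of successful filtering."""
    s = math.sqrt(w)
    return (1.0 - 1.0 / s) ** s

def empirical_filter_rate(w=64, trials=20000, seed=0):
    """Fraction of trials in which two disjoint groups get disjoint w-bit hash images."""
    rng = random.Random(seed)
    size = math.isqrt(w)            # group size sqrt(w)
    filtered = 0
    for _ in range(trials):
        image1 = 0
        for _ in range(size):       # each element sets one random bit of the word
            image1 |= 1 << rng.randrange(w)
        image2 = 0
        for _ in range(size):
            image2 |= 1 << rng.randrange(w)
        if image1 & image2 == 0:    # bitwise-AND of word representations is zero
            filtered += 1
    return filtered / trials

if __name__ == "__main__":
    print("bound for w = 64:", round(lemma_a1_bound(64), 4))
    print("simulated rate  :", empirical_filter_rate())
```

For $w = 64$ the bound is $0.875^8$, which rounds to the 0.3436 stated in the lemma; the simulated rate is expected to lie somewhat above it because $|h(L_1^{z_1})|$ is often smaller than $|L_1^{z_1}|$.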
Lemma A.3 extends Lemma A.1 for $k$ groups, whose sizes are random variables determined by the hash function $g$.
11 LEMMA A.3. For groups L z i i s (for i =,...,, partition L i by g ti : Σ {, } t i here t i = log(n i/ ) ), if il z i i =, then ih(l z i i ) = ith at least constant probability ( ) ( ) δ() β() = +δ() 4 δ() (or β2() belo), here δ() is a constant determined by, as in Proposition A.2. PROOF. Since il z i i =, for any x L z, there exists some L z s.t. x / L z ; no, for this small group Lz, if for any x L z e have h(x ) h(x ), i.e., x / h(l z ), e say that x is collisionfree. If L z δ(), from the union bound, z L Pr [x is collisionfree] δ(), (7) here δ() is defined in Proposition A.2. Note that ih(l z i i ) = implies that every x in L z is collisionfree. So, if furthermore L z δ(), e have Pr [ ih(l z i i ) = ] ( δ() ) δ(). The derivation of (7) assumes independence of the randomized hash function h. If h is generated from a random permutation p, i.e., taing the prefix of p(x) as h(x), then by considering negative dependence [4], a similar (a bit eaer) bound can be derived. From Proposition A.2(iii), ith probability at least 4, e have L z i i δ() for a group L z i i. Given Lz δ(), there are at most min{, δ() } L z s involved in the analysis of (7). From the union bound, ith probability at least +δ() 4, e have L z δ() for all of these +δ() groups. So, ith probability at least ( ) ( ) δ() β () = +δ() 4 δ(), (notice the independence beteen g and h) e have ih(l z i i ) =. If e use Proposition A.2(ii) to bound the probability of L z 3 /2 (then there are at most min{, 3 /2} L z s involved in the analysis of (7)), e can derive a tighter bound in a similar ay: ( ( β 2() = exp ) ) ( ) 3 /2 3 δ() 2 8. Thus, e have shon the probability of successful filtering (β () or β 2() as a conservative loer bound) is at least a constant depending only on the machineord idth (but independent on the number and the sizes of sets), and increases ith. It can be magnified to ( β()) m by using m > ord images of independent hash functions for filtering. 
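To connect the filtering argument to the algorithm it supports, here is a minimal Python sketch of the RanGroupScan idea (Algorithm 5): partition each set by the leading bits of $g$, keep $m$ one-word hash images per group, skip a group combination when any bitwise-AND is zero, and otherwise run a linear merge. The concrete hash families, the 32-bit universe, and all function names are assumptions of this sketch and do not reproduce the authors' C implementation.

```python
import math
import random
from functools import reduce

W = 64               # machine-word width w; hash images are bitmaps over [W]
UNIVERSE_BITS = 32   # assumed universe size 2^32 (illustrative)

def make_g(seed):
    """Odd multiplier mod 2^32 is a bijection; stands in for the random permutation g."""
    rng = random.Random(seed)
    a = rng.randrange(1 << UNIVERSE_BITS) | 1
    b = rng.randrange(1 << UNIVERSE_BITS)
    return lambda x: (a * x + b) & ((1 << UNIVERSE_BITS) - 1)

def make_h(seed):
    """Hypothetical stand-in for a universal hash function into [W]."""
    rng = random.Random(seed)
    p = (1 << 61) - 1
    a, b = rng.randrange(1, p), rng.randrange(p)
    return lambda x: ((a * x + b) % p) % W

def preprocess(elements, g, hs):
    """Return (t, groups); groups maps z to ([m word images], sorted elements)."""
    n = len(elements)
    t = max(0, math.ceil(math.log2(n / math.sqrt(W)))) if n else 0
    groups = {}
    for x in sorted(elements):
        z = g(x) >> (UNIVERSE_BITS - t)          # t most significant bits of g(x)
        images, members = groups.setdefault(z, ([0] * len(hs), []))
        for j, h in enumerate(hs):
            images[j] |= 1 << h(x)
        members.append(x)
    return t, groups

def merge2(a, b):
    """Linear merge of two sorted lists, keeping common elements."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect(preprocessed, m):
    pre = sorted(preprocessed, key=lambda p: p[0])   # order by t_i
    t_k, groups_k = pre[-1]
    result = []
    for z_k in groups_k:
        blocks = []
        for t_i, groups_i in pre:
            block = groups_i.get(z_k >> (t_k - t_i))  # t_i-prefix of z_k
            if block is None:
                break                                  # an empty group: nothing to do
            blocks.append(block)
        else:
            # line 3: every one of the m ANDed word images must be non-zero
            if all(reduce(lambda u, v: u & v, (b[0][j] for b in blocks)) for j in range(m)):
                result.extend(reduce(merge2, (b[1] for b in blocks)))  # line 4
    return sorted(result)
```

A usage example would preprocess each set once with the same `g` and the same list of `m` hash functions, then call `intersect` on the preprocessed structures. Raising `m` trades extra words per group for a higher chance, $1 - (1 - \beta(w))^m$ in the notation above, of skipping an empty group combination.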
A.5.2 Filtering Performance in Practice

[Figure 9: Filtering Performance in Experiments. y-axis: Pr("Successful Filtering"); measured probability on synthetic data and on real-life data for m=1, m=2, m=4, m=6, m=8.]

In this section we evaluate the efficiency of the word images for filtering. In Figure 9, we have plotted the probability that, for different numbers $m$ of hash functions, a given pair of small groups with an empty intersection is filtered; as before, we use $w = 64$. As the datasets, we use the synthetic data from the first experiment in Section 4 (with an intersection size of % of the set size) and the 2-word queries described in the experiments on real data derived from Bing/Wikipedia. As we can see, the probabilities are very similar for both datasets, with slightly better filtering performance for the asymmetric real data. Moreover, the real-life successful-filtering probabilities are significantly better than the theoretical bounds derived in Lemma A.1 and Lemma A.3 (where $m = 1$).

A.5.3 Proof of Theorem 3.9
For any $z_k \in \{0,1\}^{t_k}$, let $z_i$ be its $t_i$-prefix (as in Algorithm 5). For computing $\bigcap_i L_i^{z_i}$, there are two cases: (i) If $\bigcap_i L_i^{z_i} = \emptyset$, from Lemma A.3, we have $\bigcap_i h_j(L_i^{z_i}) \neq \emptyset$ with probability at most $1 - \beta(w)$, and thus $\bigcap_i h_j(L_i^{z_i}) \neq \emptyset$ for all $j = 1, \ldots, m$ with probability at most $(1 - \beta(w))^m$. So we wastefully compute $\bigcap_i L_i^{z_i}$ with probability at most $(1 - \beta(w))^m$. (ii) If $\bigcap_i L_i^{z_i} \neq \emptyset$, we must have $\bigcap_i h_j(L_i^{z_i}) \neq \emptyset$ for all $j$, and compute $\bigcap_i L_i^{z_i}$. Case (ii) happens at most $r = |\bigcap_i L_i|$ times. We compute $\bigcap_i L_i^{z_i}$ using the linear merge algorithm in linear time $O(\sum_i |L_i^{z_i}|)$, or $O(k\sqrt{w})$ time in expectation. In case (i), since there are $n_k/\sqrt{w}$ groups in $L_k$, for all groups this contributes a factor of $O(\max(n, k n_k)(1 - \beta(w))^m)$; and in case (ii), this contributes a factor of $O(kr\sqrt{w})$ (since (ii) happens at most $r$ times). We also need to test whether $\bigcap_{i=1}^{k} h_j(L_i^{z_i}) \neq \emptyset$ for all $j = 1, \ldots, m$.
Since there are $n_i/\sqrt{w}$ groups $L_i^{z_i}$ for each $i$, with careful memoization of partial results (e.g., reusing $h_j(L_1^{z_1}) \cap h_j(L_2^{z_2})$ when computing $h_j(L_1^{z_1}) \cap h_j(L_2^{z_2}) \cap h_j(L_3^{z_3})$ and $h_j(L_1^{z_1}) \cap h_j(L_2^{z_2}) \cap h_j(L_3^{z_3'})$), this test contributes a term $O(mn/\sqrt{w})$ in total. So from the above analysis, Algorithm 5 needs a total of

$$O\!\left(\frac{\max(n, k n_k)}{\alpha(w)^m} + \frac{mn}{\sqrt{w}} + kr\sqrt{w}\right) \qquad (8)$$

time in expectation, where $\alpha(w) = 1/(1 - \beta(w))$.

A.6 Analysis of Algorithm HashBin (Proof of Theorem 3.)

For HashBin, the intuition is as follows: in the resulting partitioning, we have $O(1)$ elements in each group $L_1^z$, and $O(n_2/n_1)$ elements in each group $L_2^z$. The expected running time is:

$$
\begin{aligned}
\mathrm{E}\Big[\sum_{z \in \{0,1\}^t} |L_1^z| \log |L_2^z|\Big]
&= \sum_{z \in \{0,1\}^t} \mathrm{E}\big[|L_1^z|\big] \log |L_2^z| && \text{(suppose the } L_2^z \text{ are fixed)} \\
&= \sum_{z \in \{0,1\}^t} \log |L_2^z| && \text{(because } \mathrm{E}[|L_1^z|] = 1\text{)} \\
&\le n_1 \log(n_2/n_1) && \text{(} \textstyle\sum_z |L_2^z| = n_2 \text{ and } \log(\cdot) \text{ is concave).}
\end{aligned}
$$

A.6.1 HashBin using the Multi-resolution Structure

Algorithm HashBin works on a simplified version of the multi-resolution data structure (Figure 2) introduced earlier in the paper. Here, we use a random permutation $g : \Sigma \to \Sigma$ to partition sets into small groups. To preprocess $L_i$, we first order all the elements in $L_i$ according to $g(x)$. Then any small group $L_i^z = \{x \mid x \in L_i \text{ and } g_t(x) = z\}$ (for any $t$) corresponds to a consecutive interval in $L_i$. For each small group $L_i^z$, we only need to store its starting position $\mathrm{left}(L_i^z)$ and ending position $\mathrm{right}(L_i^z)$.

For each $x \in L_1^z$, we need to check whether $x \in L_2^z$. Suppose $L_2^z = \{x_1, x_2, \ldots, x_s\}$. Although the elements in $L_2^z$ are not sorted in their own order, they are ordered as $g(x_1) < g(x_2) < \ldots < g(x_s)$ by the preprocessing. So to check whether $x \in L_2^z$, we can binary-search for $g(x)$ in $\{g(x_1), g(x_2), \ldots, g(x_s)\}$, since the random permutation $g$ is a one-to-one mapping from $\Sigma$ to $\Sigma$.
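The HashBin lookup described above can be sketched in Python for two sets. This is only an illustration under stated assumptions: the bijection `x -> (a*x + b) mod 2^bits` with odd `a` stands in for the random permutation $g$ (it is a permutation, but not a uniformly random one), elements are assumed to be integers in $[0, 2^{32})$, and all function names are hypothetical. The dictionary `inverse` is a convenience for reporting original elements; the structure in the paper compares $g$-values directly.

```python
import bisect
import random


def make_permutation(universe_bits=32, seed=0):
    """Hypothetical stand-in for g: an odd multiplier mod 2^bits is a bijection."""
    rng = random.Random(seed)
    a = rng.randrange(1, 1 << universe_bits, 2)  # odd multiplier
    b = rng.randrange(1 << universe_bits)
    mask = (1 << universe_bits) - 1
    return lambda x: (a * x + b) & mask


def group_starts(keys, t, universe_bits):
    """starts[z] and starts[z + 1] delimit the group with t-bit prefix z."""
    shift = universe_bits - t
    starts = [0] * ((1 << t) + 1)
    for key in keys:
        starts[(key >> shift) + 1] += 1
    for z in range(1 << t):
        starts[z + 1] += starts[z]
    return starts


def hashbin_intersect(small, large, universe_bits=32, seed=0):
    g = make_permutation(universe_bits, seed)
    inverse = {g(x): x for x in small}
    k1 = sorted(inverse)                 # small set ordered by g(x)
    k2 = sorted(g(x) for x in large)     # large set ordered by g(x)
    t = max(len(small), 1).bit_length() - 1  # about log2(n1) prefix bits
    starts = group_starts(k2, t, universe_bits)
    result = []
    for key in k1:
        z = key >> (universe_bits - t)
        left, right = starts[z], starts[z + 1]
        # Binary search for g(x) inside the interval of group z only.
        pos = bisect.bisect_left(k2, key, left, right)
        if pos < right and k2[pos] == key:
            result.append(inverse[key])
    return sorted(result)
```

With roughly $n_1$ groups, each interval of the larger set holds about $n_2/n_1$ keys on average, which is where the $\log(n_2/n_1)$ cost per lookup in the analysis comes from.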
[Figure 10: Preprocessing Overhead. Construction time (ms) versus set size for HashBin, IntGroup, RanGroup, RanGroupScan4, and Sorting.]

[Figure 11: Preprocessing Overhead (with compression). Construction time (ms) versus set size for Sorting, RanGroupScan_Lowbits, RanGroupScan_Gamma, RanGroupScan_Delta, Merge_Gamma, and Merge_Delta.]

B. COMPRESSION FOR ALGORITHM 5

For each small group $L_i^z$, we can use the standard techniques based on γ- and δ-coding (see [23], p.6) to compress the elements stored sequentially at the end of the block associated with $L_i^z$. However, the decoding of γ- and δ-coding is expensive. As an alternative, we describe a simple but effective (i.e., efficient in decoding) compression technique for Algorithm 5 in the following:

(i) Instead of storing the length len of each structure, we can store the size $|L_i^z|$, since the structure length len can be derived from $|L_i^z|$. As proved in Proposition A.2, $|L_i^z|$ is usually very small, so we store it using a unary code.

(ii) Only if $|L_i^z| > 0$, we store $h_1(L_i^z), h_2(L_i^z), \ldots, h_m(L_i^z)$ in the following $m$ words.

(iii) To store the elements of $L_i^z$ in the remaining part of this block, we can use the standard techniques based on γ- or δ-coding. We present another compression technique here, which is specifically designed for our algorithm. Its decoding is much more efficient than that of γ- or δ-coding.

First, for the purpose of partitioning sets into small groups, we use a random permutation as $g$. Then, assuming $|\Sigma| = 2^w$ (the worst case), instead of storing each $x \in L_i^z$, we store $\mathrm{lowbits}_{t_i}(x) = g(x) \bmod 2^{w - t_i}$, i.e., the lowest $w - t_i$ bits of $g(x)$; the remaining highest $t_i$ bits of $g(x)$ correspond to $z = g_{t_i}(x)$. In online processing, decoding in this compression scheme can be done efficiently: to get $g(x)$ for an element $x \in L_i^z$, we concatenate $z = g_{t_i}(x)$ with $\mathrm{lowbits}_{t_i}(x)$. Since $g$ is a one-to-one mapping from $\Sigma$ to $\Sigma$, the intersection of $L_1$ and $L_2$ is equivalent to the intersection of $g(L_1)$ and $g(L_2)$.
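As a small illustration of the low-bits scheme just described, the following Python sketch splits each $g(x)$ into its group prefix and its remaining low bits, and reassembles $g(x)$ by concatenation. It assumes a toy word width of 32 bits and takes the values $g(x)$ as given; the function names are hypothetical and the bit-level packing of the block is not modeled.

```python
W = 32  # assumed word width w for this toy example (|Sigma| = 2**W)


def encode_groups(gvalues, t):
    """Map each prefix z (top t bits) to the sorted low (W - t) bits of its members."""
    low_width = W - t
    mask = (1 << low_width) - 1
    groups = {}
    for gx in sorted(gvalues):
        groups.setdefault(gx >> low_width, []).append(gx & mask)
    return groups


def decode_group(z, lowbits_list, t):
    """Recover g(x) by concatenating the prefix z with each stored low-bits value."""
    low_width = W - t
    return [(z << low_width) | lb for lb in lowbits_list]


if __name__ == "__main__":
    values = [0x12345678, 0x1234FFFF, 0xABCD0001]
    encoded = encode_groups(values, t=16)
    decoded = sorted(
        gx for z, lows in encoded.items() for gx in decode_group(z, lows, 16)
    )
    print(decoded == sorted(values))
```

Decoding needs only a shift and a bitwise OR per element, which is the reason this scheme decodes faster than a variable-length code.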
The following basic analysis establishes an upper bound on the space consumed by our compression technique. Recall that $t_i = \lceil \log(n_i/\sqrt{w}) \rceil$. There are $n_i/\sqrt{w}$ small groups in $L_i$ in total. Storing all of them requires: (i) $n_i + n_i/\sqrt{w}$ bits for the sizes $|L_i^z|$ (since $\sum_z |L_i^z| = n_i$); (ii) at most $m \cdot w \cdot n_i/\sqrt{w}$ bits for the word images $h_j(L_i^z)$; and (iii) $(w - t_i) \cdot n_i$ bits for all elements (we store $g(x) \bmod 2^{w - t_i}$).

C. ADDITIONAL EXPERIMENTS

C.1 Preprocessing Overhead

In this section, we evaluate the time taken to construct the novel structures when given a set $L_i$ as input. Our approach is similar to inverted indexes (and nearly all of the competing algorithms) in that the elements have to be sorted during preprocessing; thus, to put the construction overhead in perspective, we also measure and plot the overhead of sorting using an in-memory quicksort (averaging the time over random instances). Figure 10 shows the construction time for the data structures without compression for different set sizes $|L_i|$. Note that we use a log scale on the y-axis to better separate the different graphs. As we can see, the additional construction overhead is generally a small fraction of the sorting overhead.

Figure 11 shows the overhead for constructing the different compressed structures. We also plot the overhead for compressing the sets without additional hash images (resulting in the structures used in the compressed Merge, i.e., Merge_Gamma and Merge_Delta). Again, the required overhead is only a small fraction of the sorting overhead; also, the preprocessing time for the Lowbits compression scheme, which yields the best intersection performance in Section 4, is significantly lower than that of the alternatives.

C.2 More Experiments on Real Data

In this section, we present a breakdown of the experiments on real data in Section 4. To understand how the number of keywords in a query affects the relative performance in this scenario, we plotted the distribution of average intersection times for 2-, 3-, and 4-keyword queries separately in Figure 12.
As we can see, the relative performance is similar to that seen earlier, with three exceptions: (a) the Merge algorithm performs worse with an increasing number of keywords (as it cannot leverage the asymmetry in any way), (b) in contrast, Hash performs increasingly better, but still remains (close to) the worst performer, and (c) for 4-keyword queries, RanGroup slightly outperforms RanGroupScan.

[Figure 12: Normalized Execution Time on a Real Workload. Three panels: Results for 2-Keyword Queries, 3-Keyword Queries, and 4-Keyword Queries; y-axis: Normalized Intersection Time.]
More informationFaster Set Intersection with SIMD instructions by Reducing Branch Mispredictions
Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions Hiroshi Inoue, Moriyoshi Ohara, and Kenjiro Taura IBM Research Tokyo, University of Tokyo {inouehrs, ohara}@jp.ibm.com,
More information