Fast Set Intersection in Memory

Bolin Ding
University of Illinois at Urbana-Champaign
201 N. Goodwin Avenue, Urbana, IL 61801, USA

Arnd Christian König
Microsoft Research
One Microsoft Way, Redmond, WA 98052, USA

Articles from this volume were invited to present their results at the 37th International Conference on Very Large Data Bases, August 29th - September 3rd, 2011, Seattle, Washington. Proceedings of the VLDB Endowment, Vol. 4, No. 4. Copyright 2011 VLDB Endowment.

ABSTRACT

Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear-space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets with n elements in total, we show how to compute their intersection in expected time O(n/√w + kr), where r is the intersection size and w is the number of bits in a machine word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state-of-the-art techniques for both synthetic and real data sets and workloads.

1. INTRODUCTION

Fast processing of set intersections is a key operation in many query processing tasks in databases and information retrieval. For example, in the context of databases, set intersections are used in various forms of data mining, text analytics, and the evaluation of conjunctive predicates. They are also the key operations in enterprise and web search. Many of these applications are interactive, meaning that the latency with which query results are displayed is a key concern. It has been shown in the context of search that query latency is critical to user satisfaction, with increases in latency directly leading to fewer search queries being issued and higher rates of query abandonment [10, 17]. As a consequence, significant portions of the sets to be intersected are often cached in main memory. This paper studies the performance of set intersection algorithms for main-memory resident data. These techniques are also relevant in the context of large disk-based (inverted) indexes, when large fractions of them reside in a main-memory cache.

There has been considerable study of set intersection algorithms in information retrieval (e.g., [2, 4, 11]). Most of these papers assume that the underlying data structure is an inverted index [23]. Much of this work (e.g., [2, 4]) focuses on adaptive algorithms which use the number of comparisons as the measure of overhead. For in-memory data, additional structures that encode skipping steps [18], tree-based structures [7], or hash-based algorithms become possible, and these often outperform inverted indexes; e.g., using hash-based dictionaries, intersecting two sets L1, L2 requires expected time O(min(|L1|, |L2|)), which is a factor of Θ(log(1 + max(|L1|/|L2|, |L2|/|L1|))) better than the best possible worst-case performance of comparison-based algorithms [16].
In this work, we propose new set intersection algorithms aimed at fast performance. They outperform the competing techniques for most inputs and are also robust: for inputs where they are not optimal, they are close to the best-performing algorithm. The trade-off for this gain is a slight increase in the size of the data structures compared to an inverted index; however, in user-facing scenarios where latency is crucial, this trade-off is often acceptable.

1.1 Contributions

Our approach leverages two key observations: (a) If w is the size (in bits) of a machine word, we can encode a set from a universe of w elements in a single machine word, allowing for very fast intersections. (b) For the data distributions seen in many real-life examples (in particular search applications), the size of intersections is typically much smaller than the smallest set being intersected. To illustrate the second observation, we analyzed the 10K most frequent queries issued against the Bing Shopping portal. For 94% of all queries, the size of the full intersection was at least one order of magnitude smaller than the document frequency of the least frequent keyword; for 76% of the queries, the difference was two orders of magnitude.

By exploiting these two observations, we make the following contributions. (i) We introduce linear-space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. Given k sets with n elements in total, these data structures allow us to compute their intersection in expected time O(n/√w + kr), where r is the size of the intersection and w is the number of bits in a machine word; when the size of the intersection is an order of magnitude (or more) smaller than the size of the smallest set being intersected, our approach yields significant improvements in execution time over previous approaches. To the best of our knowledge, the best asymptotic bound previously known for fast set intersection is the O(n(log w)²/w + kr) algorithm of [6]. Note, however, that this bound relies on a large value of w: in practice, w is small (and constant), and w < 2^16 bits implies 1/√w < (log w)²/w. More importantly, [6] requires complex bit manipulation, making it slow in practice, as we demonstrate empirically in Section 4. (ii) We describe a much simpler algorithm that computes the intersection in expected O(n/α^m + mn/√w + r) time, where α > 1 is a constant determined by w, and m is a parameter. This algorithm has weaker guarantees in theory, but performs better in practice, and gives significant improvements over the various data structures typically used, while being very simple to implement.
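To make observation (a) concrete, the following is a minimal, self-contained C sketch (ours, not taken from the paper's implementation; for convenience the universe is indexed from 0 rather than 1) of how a set drawn from a universe of w = 64 elements is encoded in one machine word and intersected with a single bitwise-AND:

    #include <stdint.h>
    #include <stdio.h>

    /* Encode a set A ⊆ {0,...,63} as a 64-bit word: bit y is set iff y ∈ A. */
    static uint64_t encode(const int *elems, int n) {
        uint64_t word = 0;
        for (int i = 0; i < n; i++)
            word |= 1ULL << elems[i];
        return word;
    }

    int main(void) {
        int a[] = {1, 9, 16, 42};
        int b[] = {9, 13, 42, 63};
        /* One AND instruction intersects the two encoded sets. */
        uint64_t both = encode(a, 4) & encode(b, 4);
        printf("intersection word: 0x%llx\n", (unsigned long long)both);
        return 0;
    }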

2. BACKGROUND AND RELATED WORK

Algorithms based on Ordered Lists: Most work on set intersection focuses on ordered lists as the underlying data structure, in particular algorithms using inverted indexes, which have become the standard data structure in information retrieval. Here, documents are identified via a document ID, and for each term t, the inverted index stores a sorted list of all document IDs containing t. Using this representation, two sets L1, L2 of similar sizes (i.e., |L1| ≈ |L2|) can be intersected efficiently using a linear merge that scans both lists in parallel, requiring O(|L1| + |L2|) operations (the merge step in merge sort). This approach is wasteful when set sizes differ significantly or when only small fractions of the sets intersect. For very different set sizes, algorithms have been proposed that exploit this asymmetry, requiring at most log₂((|L1|+|L2|) choose |L1|) + |L1| comparisons (for |L1| ≤ |L2|) [16]. To improve performance further, there has recently been significant work on so-called adaptive set-intersection algorithms [12, 4, 3, 11, 2, 5]. These algorithms use the total number of comparisons as the measure of complexity and aim to use a number of comparisons as close as possible to the minimum ideally required to establish the intersection. However, the resulting reduction in the number of comparisons does not necessarily translate into performance improvements in practice: for example, in [2], a parallel scan outperforms binary-search based algorithms whenever |L2| < 20|L1|, even though the latter need several times fewer comparisons.

Hierarchical Representations: There are various algorithms for set intersections based on variants of balanced trees (e.g., [9], treaps [7], and skip lists [18]), computing the intersection of (preprocessed) sets L1, L2 in O(|L1| log(|L2|/|L1|)) operations (for |L1| < |L2|). However, while some form of skipping is commonly used as part of algorithms based on inverted indexes, skip lists (or trees) are typically not used in the scenarios outlined above (with static set data) due to the required space overhead. A novel and compact two-level representation of posting lists aimed at fast intersections in main memory was proposed in [19].

Algorithms based on Hashing: Using a hash-based representation of sets can speed up the intersection of sets L1, L2 with |L1| ≪ |L2| significantly (expected time O(|L1|), by looking up all elements of L1 in a hash table of L2); however, because of the added indirection, this approach performs poorly for less skewed set sizes. A new hashing-based approach is proposed in [6]: the elements of L1, L2 are mapped using a hash function h to smaller (approximate) representations h(L1), h(L2). These representations are then intersected to compute H = h(L1) ∩ h(L2). Finally, the set of all elements in the original sets that map to H via h is computed and any false positives are removed. As the hashed images h(L1), h(L2) are smaller than the original sets (using fewer bits), they can be intersected more quickly. Given k sets of total size n, their intersection can be computed in expected time O(n(log w)²/w + kr), where r = |∩i Li|.

Score-based pruning: In many IR engines it is possible to avoid computing full intersections by leveraging scoring functions that are monotonic in the individual term-wise scores; this makes it possible to terminate the intersection processing early, using approaches such as TA [15] or document-at-a-time (DAAT) processing (e.g., [8]).
However, in practice this is often not possible, either because of the complexity of the scoring function (e.g., non-monotonic machine-learning based ranking functions) or because full intersection results are required. Our approach is based on partitioning the elements of each set into very small (√w ≈ 8 element) groups, for which we have fast intersection schemes. Hence, DAAT approaches can be combined with our work by using these small groups in place of individual documents.

Set intersections using multiple cores: Techniques that exploit multi-core architectures to speed up set intersections are described in [20, 22]. The use of multiple cores is orthogonal to our approach, in the sense that our algorithms can be parallelized for these architectures as well; however, this is beyond the scope of our paper.

3. OUR APPROACH

Notation: We are given a collection of N sets S = {L1, ..., LN}, where Li ⊆ Σ and Σ is the universe of elements in the sets; let ni = |Li| be the size of set Li. Suppose elements in a set are ordered, and for a set L, let inf(L) and sup(L) be the minimum and maximum elements of L, respectively. We use w to denote the size (number of bits) of a word on the target processor. Throughout the paper we use log to denote log₂. Finally, we use [w] to denote the set {1, ..., w}. Our approach can be extended to bag semantics by additionally storing element frequencies.

Framework: Our task is to design data structures such that the intersection of multiple sets can be computed efficiently. We differentiate between a pre-processing stage, during which we reorganize each set and attach additional index structures, and an online processing stage, which uses the pre-processed data structures to compute intersections. An intersection query is specified via a collection of k sets L1, L2, ..., Lk (to simplify notation, we use the offsets 1, 2, ..., k to refer to the sets in a query throughout this section); our goal is to compute L1 ∩ L2 ∩ ... ∩ Lk efficiently. Note that pre-processing is typical of most non-trivial data structures used for computing set intersections; even building simple non-compressed inverted indexes requires sorting the posting lists as a pre-processing step. We require the pre-processing stage to be time/space-efficient in that it takes no more than O(ni log ni) time (necessary for sorting) and linear space O(ni).

The size of the intersection |L1 ∩ L2| is a lower bound on the time needed to compute the intersection. Our method leverages two key ideas to approach this lower bound: (i) The intersection of two sets in a small universe can be computed very efficiently; in particular, if the two sets are subsets of {1, 2, ..., w}, we can encode them as single machine words and compute their intersection using a bitwise-AND. (ii) A small number of elements in a large universe can be mapped into a small universe.

[Figure 1: Algorithmic Framework. Each set Li is partitioned via sorting/hashing into small groups L_i^p; each group is mapped by h : Σ → [w] to a word h(L_i^p), and L1^p ∩ L2^q is recovered with the help of h(L1^p) ∧ h(L2^q).]

We leverage these two ideas by first partitioning each set Li into smaller groups Li^j, which are intersected separately. In the pre-processing stage, we map each small group into a small universe [w] = {1, 2, ..., w} using a universal hash function h and encode the image h(Li^j) as a machine word.

Then, in the online processing stage, to compute the intersection of two small groups L1^p and L2^q, we first use a bitwise-AND operation to compute H = h(L1^p) ∧ h(L2^q), and then try to recover L1^p ∩ L2^q from H using the inverse mapping h⁻¹. The union of the L1^p ∩ L2^q's forms L1 ∩ L2. Moreover, if the intersection L1 ∩ L2 is small compared to |L1| and |L2| (as seen in practice), a large fraction of the small groups with overlapping ranges have an empty intersection; by using the word representations of H to detect these groups quickly, we can skip much unnecessary computation, resulting in a significant speed-up. The resulting algorithmic framework is illustrated in Figure 1.

Given this overall approach, the key questions become how to form the groups, what structures to use to represent them, and how to process the intersections of these small groups. We discuss these details in the following sections. All formal proofs of analytical results are deferred to the appendix.

3.1 Intersection via Fixed-Width Partitions

We first consider the case where there are only two sets L1 and L2 in the intersection query. We present a pair of pre-processing and online processing algorithms, which we use to illustrate the basic ideas behind our approach; we subsequently refine and extend our techniques to k sets in Section 3.2.

In the pre-processing stage, L1 and L2 are sorted and partitioned into groups (recall that w is the word width) L1^1, L1^2, ..., L1^⌈n1/√w⌉ and L2^1, L2^2, ..., L2^⌈n2/√w⌉ of equal size √w (except the last ones). In the online processing stage (Algorithm 1), the small groups are scanned in order. If the ranges of L1^p and L2^q overlap, we may have L1^p ∩ L2^q ≠ ∅. The intersection L1^p ∩ L2^q of each pair of overlapping groups is computed (line 8) in some iteration, and finally the union of all these intersections is L1 ∩ L2. Since each group is scanned once, lines 2-10 repeat for O((n1 + n2)/√w) iterations. The major remaining question is: how do we compute L1^p ∩ L2^q efficiently with proper pre-processing? For this purpose, we map each group L1^p or L2^q into a small universe for fast intersection, and leverage single-word representations to store and manipulate sets from that small universe.

Single-Word Representation of Sets: We represent a set A ⊆ [w] = {1, 2, ..., w} using a single machine word of width w by setting the y-th bit to 1 iff y ∈ A. We refer to this as the word representation w(A) of A. For two sets A and B, the bitwise-AND w(A) ∧ w(B) (computed in O(1) time) is the word representation of A ∩ B. Given a word representation w(A), all elements of A can be retrieved in linear time O(|A|), using the following well-known technique (⊕ is bitwise-XOR): (i) lobit = ((w(A) − 1) ⊕ w(A)) ∧ w(A) is the lowest 1-bit of w(A); for the smallest element y of A we have 2^(y−1) = lobit, and y − 1 = log(lobit) can be computed using the machine instruction NLZ (number of leading zeros) or pre-computed lookup tables. (ii) Set w(A) to w(A) ⊕ lobit and repeat (i) to scan the next smallest element, until w(A) becomes 0. In the rest of this paper, if A ⊆ [w], we use A to denote both a set and its word representation.
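The bit-scanning technique in (i)-(ii) maps to a few machine instructions. Below is a small C sketch of it (ours; it assumes GCC/Clang's __builtin_ctzll as the counterpart of an NLZ instruction or lookup table):

    #include <stdint.h>
    #include <stdio.h>

    /* Enumerate the elements of A ⊆ {1,...,64} from its word representation;
       bit y-1 of the word encodes element y. */
    static void scan_elements(uint64_t wA) {
        while (wA != 0) {
            uint64_t lobit = ((wA - 1) ^ wA) & wA; /* lowest set bit, as in (i) */
            int y = __builtin_ctzll(lobit) + 1;    /* its position, i.e., the element */
            printf("%d ", y);
            wA ^= lobit;                           /* clear it and continue, as in (ii) */
        }
        printf("\n");
    }

    int main(void) {
        scan_elements((1ULL << 0) | (1ULL << 8) | (1ULL << 41)); /* prints 1 9 42 */
        return 0;
    }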
Pre-processing Stage: The elements of a set Li are sorted as {x_i^1, x_i^2, ..., x_i^ni} (i.e., x_i^j < x_i^{j+1}), and Li is partitioned as follows:

    Li^1 = {x_i^1, ..., x_i^√w}, Li^2 = {x_i^{√w+1}, ..., x_i^{2√w}}, ...,   (1)
    Li^j = {x_i^{(j−1)√w+1}, x_i^{(j−1)√w+2}, ..., x_i^{j√w}}, ...           (2)

For each small group Li^j, we compute the word representation of its image under a universal hash function h : Σ → [w], i.e., h(Li^j) = {h(x) | x ∈ Li^j}. In addition, for each position y ∈ [w] and each small group Li^j, we maintain the inverted mapping h⁻¹(y, Li^j) = {x | x ∈ Li^j and h(x) = y}; that is, for each y ∈ [w], we store the elements of Li^j with hash value y in a short list which supports ordered access. We ensure that the order of these elements is identical across the different h⁻¹(y, Li^j)'s and Li^j's; in this way, we can intersect these short lists using a linear merge.

EXAMPLE 3.1 (Pre-processing and data structures). Suppose we have two sets L1 = {1, 2, 4, 9, 16, 27, 43} and L2 = {1, 3, 5, 9, 11, 16, 22, 32, 34, 49}, and let w = 16 (√w = 4). For simplicity, h is chosen as h(x) = x mod 16 (so hash values here range over {0, ..., 15}). L1 is partitioned into 2 groups: L1^1 = {1, 2, 4, 9}, L1^2 = {16, 27, 43}; and L2 is partitioned into 3 groups: L2^1 = {1, 3, 5, 9}, L2^2 = {11, 16, 22, 32}, L2^3 = {34, 49}. We pre-compute: h(L1^1) = {1, 2, 4, 9}, h(L1^2) = {0, 11}, h(L2^1) = {1, 3, 5, 9}, h(L2^2) = {0, 6, 11}, h(L2^3) = {1, 2}. We also pre-process the h⁻¹(y, Li^p)'s: for example, h⁻¹(0, L1^2) = {16}, h⁻¹(0, L2^2) = {16, 32}, h⁻¹(11, L1^2) = {27, 43}, and h⁻¹(11, L2^2) = {11}.

Algorithm 1: Intersection via fixed-width partitioning
    1:  p ← 1, q ← 1
    2:  while p ≤ ⌈n1/√w⌉ and q ≤ ⌈n2/√w⌉ do
    3:    if inf(L2^q) > sup(L1^p) then
    4:      p ← p + 1
    5:    else if inf(L1^p) > sup(L2^q) then
    6:      q ← q + 1
    7:    else
    8:      compute L1^p ∩ L2^q using IntersectSmall
    9:      Δ ← Δ ∪ (L1^p ∩ L2^q)
    10:     if sup(L1^p) < sup(L2^q) then p ← p + 1 else q ← q + 1
    11: Δ is the result of L1 ∩ L2

Online Processing Stage: The algorithm used to intersect the two sets is shown as Algorithm 1. Since the elements in each Li are sorted, Algorithm 1 ensures that if the ranges of any two small groups L1^p, L2^q overlap, their intersection is computed (line 8). After scanning all such pairs, Δ must contain the intersection of the whole sets. The remaining question is how to compute the intersection L1^p ∩ L2^q of two small groups efficiently. For this purpose, we introduce the algorithm IntersectSmall (Algorithm 2), which (i) first computes H = h(L1^p) ∧ h(L2^q) using a bitwise-AND, and (ii) for each (1-bit) y ∈ H, intersects the corresponding inverted mappings using a linear merge.

Algorithm 2: IntersectSmall(L1^p, L2^q): computing L1^p ∩ L2^q
    1: Compute H ← h(L1^p) ∧ h(L2^q)
    2: for each y ∈ H do
    3:   Γ ← Γ ∪ (h⁻¹(y, L1^p) ∩ h⁻¹(y, L2^q))
    4: Γ is the result of L1^p ∩ L2^q

EXAMPLE 3.2 (Online processing). Following Example 3.1, to compute L1 ∩ L2 we need to compute L1^1 ∩ L2^1, L1^2 ∩ L2^2, and L1^2 ∩ L2^3 (the pairs with overlapping ranges). For example, to compute L1^2 ∩ L2^2, we first compute h(L1^2) ∧ h(L2^2) = {0, 11}; then L1^2 ∩ L2^2 = ∪_{y=0,11} (h⁻¹(y, L1^2) ∩ h⁻¹(y, L2^2)) = {16}. Similarly, we can compute L1^1 ∩ L2^1 = {1, 9}. Finally, we find h(L1^2) ∧ h(L2^3) = ∅, and thus L1^2 ∩ L2^3 = ∅. So we have L1 ∩ L2 = {1, 9} ∪ {16}.

Note that the word representations and inverted mappings for each Li are pre-computed, and word representations can be intersected with one AND operation. So the running time of IntersectSmall is bounded by the number of pairs of elements, one from L1^p and one from L2^q, that are mapped to the same hash value. This number can be shown to equal (in expectation) the intersection size plus O(1) per pair of groups. Using this, we obtain Algorithm 1's running time, stated as Theorem 3.3 below.
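Before the formal statement, here is a hedged C sketch of IntersectSmall (our illustration: the paper stores the inverted mappings implicitly, see Section 3.2.1, whereas here each group's buckets are materialized explicitly; names like struct Group are ours):

    #include <stdint.h>
    #include <stdio.h>

    #define W 64          /* word width */
    #define SQRT_W 8      /* group size */

    /* A pre-processed group: its hash image and, for each y, the sorted
       elements hashing to y (explicit buckets, for illustration only). */
    struct Group {
        uint64_t image;               /* word representation of h(group) */
        int bucket[W][SQRT_W];        /* bucket[y] = h^-1(y, group) */
        int bucket_len[W];
    };

    static int h(int x) { return x % W; } /* toy universal hash */

    static void group_init(struct Group *g, const int *elems, int n) {
        g->image = 0;
        for (int y = 0; y < W; y++) g->bucket_len[y] = 0;
        for (int i = 0; i < n; i++) {       /* input assumed sorted */
            int y = h(elems[i]);
            g->image |= 1ULL << y;
            g->bucket[y][g->bucket_len[y]++] = elems[i];
        }
    }

    /* IntersectSmall: AND the images, then linearly merge the buckets of the
       surviving bits. Output is grouped by hash value, not globally sorted. */
    static int intersect_small(const struct Group *a, const struct Group *b,
                               int *out) {
        int r = 0;
        uint64_t H = a->image & b->image;
        while (H != 0) {
            int y = __builtin_ctzll(H);
            H &= H - 1;                    /* clear lowest set bit */
            int i = 0, j = 0;
            while (i < a->bucket_len[y] && j < b->bucket_len[y]) {
                if (a->bucket[y][i] < b->bucket[y][j]) i++;
                else if (a->bucket[y][i] > b->bucket[y][j]) j++;
                else { out[r++] = a->bucket[y][i]; i++; j++; }
            }
        }
        return r;
    }

    int main(void) {
        struct Group g1, g2;
        int a[] = {16, 27, 43}, b[] = {11, 16, 22, 32}, out[SQRT_W];
        group_init(&g1, a, 3);
        group_init(&g2, b, 4);
        int r = intersect_small(&g1, &g2, out);
        for (int i = 0; i < r; i++) printf("%d ", out[i]); /* prints 16 */
        printf("\n");
        return 0;
    }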

THEOREM 3.3. Algorithm 1 computes L1 ∩ L2 in expected O((n1 + n2)/√w + r) time, where r = |L1 ∩ L2|.

To achieve a better bound, we optimize the group sizes: with L1 and L2 partitioned into groups of sizes s1 = √(w·n1/n2) and s2 = √(w·n2/n1), respectively, L1 ∩ L2 can be computed in expected O(√(n1·n2/w) + r) time. A detailed analysis of the effect of group size on running time can be found in Section A.1.1.

Overhead of Pre-processing: If only the bound in Theorem 3.3 is required, then to pre-process a set Li of size ni, O(ni log ni) time and O(ni) space clearly suffice: we only need to partition a sorted list into small groups of size √w and, for each small group, construct the word representation and inverted mapping in linear time using the hash function h. To achieve the better bound O(√(n1·n2/w) + r), we need multiple resolutions of the partitioning of a set Li. This is because, as discussed above, the optimal group size s1 = √(w·n1/n2) of the set L1 also depends on the size n2 of the set L2 to be intersected with it. For this purpose, we partition each set Li into small groups of sizes 2, 4, ..., 2^t, etc. To compute L1 ∩ L2 for two given sets, suppose s_i is the optimal group size of Li; we then select the actual group size s̄_i = 2^t such that s_i ≤ s̄_i ≤ 2·s_i, obtaining the same bound. A carefully designed multi-resolution data structure enabling access to these groups consumes only O(ni) space for Li. We describe and analyze this structure in Section 3.2.1.

THEOREM 3.4. To pre-process a set Li of size ni for Algorithm 1, we need O(ni log ni) time and O(ni) space (in words).

Limitations of Fixed-Width Partitions: The main limitation of the proposed approach is that it is difficult to extend to more than two sets, because the partitioning scheme we use is not well aligned for more than two sets: for three sets, e.g., there may be more than O((n1 + n2 + n3)/√w) triples of small groups that overlap. We introduce a different partitioning scheme to address this issue in Section 3.2, which extends to k > 2 sets.

3.2 Intersection via Randomized Partitions

In this section, we introduce an algorithm based on a randomized partitioning scheme to compute the intersection of two or more sets. The general approach is as follows: instead of fixed-width partitions, we use a hash function g to partition each set into small groups, using the most significant bits of g(x) to group an element x ∈ Σ. This reduces the number of combinations (pairs) of small groups we have to intersect, allowing us to prove bounds similar to Theorem 3.3 for computing intersections of k > 2 sets.

Pre-processing Stage: Let g be a hash function mapping each element of Σ to a (sufficiently long) bit string; we use g_t(x) to denote the t most significant bits of g(x). We say that, for two bit strings z1 and z2, z1 is a t1-prefix of z2 iff z1 is identical to the highest t1 bits of z2; e.g., 0101 is a 4-prefix of 010111. To pre-process a set Li, we partition it into groups Li^z = {x | x ∈ Li and g_t(x) = z} for all z ∈ {0, 1}^t (for some t). As before, we compute the word representation of the image of each Li^z under another hash function h : Σ → [w], as well as the inverted mappings h⁻¹.

Online Processing Stage: This stage is similar to our previous algorithm: to compute the intersection of two sets L1 and L2, we compute the intersections of pairs of overlapping small groups, one from each set, and finally take the union of these intersections. In general, suppose L1 is partitioned using g_t1 : Σ → {0, 1}^t1, and L2 is partitioned using g_t2 : Σ → {0, 1}^t2. Assume n1 ≤ n2 and t1 ≤ t2. We now intersect sets L1 and L2 using Algorithm 3.
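Algorithms 3-5 below all rely on extracting g_t(x) and testing t1-prefixes, and both operations reduce to shifts on a hashed value. The following hedged C sketch is our illustration only; mix64 is an assumed stand-in for g (any well-mixing 64-bit hash would do):

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed stand-in for g: a 64-bit mixing function (Murmur3 finalizer). */
    static uint64_t mix64(uint64_t x) {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        x ^= x >> 33; return x;
    }

    /* g_t(x): the t most significant bits of g(x), i.e., x's group id. */
    static uint64_t g_t(uint64_t x, int t) { return mix64(x) >> (64 - t); }

    /* Is z1 (t1 bits) the t1-prefix of z2 (t2 bits)? Assumes 1 <= t1 <= t2. */
    static int is_prefix(uint64_t z1, int t1, uint64_t z2, int t2) {
        return (z2 >> (t2 - t1)) == z1;
    }

    int main(void) {
        uint64_t x = 123456;
        uint64_t z1 = g_t(x, 5), z2 = g_t(x, 9);
        /* By construction, x's 5-bit group id prefixes its 9-bit group id. */
        printf("%d\n", is_prefix(z1, 5, z2, 9)); /* prints 1 */
        return 0;
    }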
The major improvement of Algorithm 3 over Algorithm 1 is the following: in Algorithm 1, we must compute L1^p ∩ L2^q whenever the ranges of L1^p and L2^q overlap; in Algorithm 3, we compute L1^{z1} ∩ L2^{z2} (again using Algorithm 2) only when z1 is a t1-prefix of z2 (this is a necessary condition for L1^{z1} ∩ L2^{z2} ≠ ∅, so Algorithm 3 is correct). This significantly reduces the number of pairs to be intersected.

Algorithm 3: 2-list intersection via randomized partitioning
    1: for each z2 ∈ {0, 1}^t2 do
    2:   Let z1 ∈ {0, 1}^t1 be the t1-prefix of z2
    3:   Compute L1^{z1} ∩ L2^{z2} using IntersectSmall(L1^{z1}, L2^{z2})
    4:   Let Δ ← Δ ∪ (L1^{z1} ∩ L2^{z2})
    5: Δ is the result of L1 ∩ L2

Based on the choice of the parameters t1 and t2, we can either partition L1 and L2 into the same number of small groups (yielding the bound of Theorem 3.5), or into small groups of (approximately) identical sizes (yielding Theorem 3.6).

THEOREM 3.5. Algorithm 3 computes L1 ∩ L2 in expected O(√(n1·n2/w) + r) time (r = |L1 ∩ L2|), with t1 = t2 = ⌈log √(n1·n2/w)⌉.

THEOREM 3.6. Algorithm 3 computes L1 ∩ L2 in expected O((n1 + n2)/√w + r) time (r = |L1 ∩ L2|), using t1 = ⌈log(n1/√w)⌉ and t2 = ⌈log(n2/√w)⌉.

Note that when n1 ≪ n2, Theorem 3.5 gives a better bound than Theorem 3.6; but Theorem 3.6 is the one we can extend to k-set intersection.

Extension to More Than Two Sets: Suppose we want to compute the intersection of k sets L1, ..., Lk, where ni = |Li| and n1 ≤ n2 ≤ ... ≤ nk. Each Li is partitioned into groups Li^z using g_ti : Σ → {0, 1}^ti; note that the g_ti's are generated from the same hash function g. We use ti = ⌈log(ni/√w)⌉ and proceed as in Algorithm 4. Algorithm 4 is almost identical to Algorithm 3, but is generalized to k sets: for each z ∈ {0, 1}^tk, we pick the group identifiers zi to be the ti-prefixes of z, and we only intersect groups L1^{z1}, L2^{z2}, ..., Lk^{zk} whose identifiers share a common prefix. Also, we extend IntersectSmall (Algorithm 2) to k groups: we first compute the intersection (bitwise-AND) of the hash images (word representations) of the groups Li^{zi}; and, if the result H = ∧_{i=1}^k h(Li^{zi}) is not zero, for each (1-bit) y ∈ H we intersect the corresponding inverted mappings h⁻¹(y, Li^{zi}). Details and analysis are deferred to the appendix.

THEOREM 3.7. Using ti = ⌈log(ni/√w)⌉, Algorithm 4 computes the intersection ∩_{i=1}^k Li of k sets in expected O(n/√w + kr) time, where r = |∩_{i=1}^k Li| and n = Σ_{i=1}^k ni.

Algorithm 4: k-list intersection via randomized partitioning
    1: for each z ∈ {0, 1}^tk (ti = ⌈log(ni/√w)⌉) do
    2:   Let zi be the ti-prefix of z for i = 1, ..., k
    3:   Compute ∩_{i=1}^k Li^{zi} using the extended IntersectSmall
    4:   Let Δ ← Δ ∪ (∩_{i=1}^k Li^{zi})
    5: Δ is the result of ∩_{i=1}^k Li

3.2.1 A Multi-resolution Data Structure

Recall that in some algorithms (e.g., Theorem 3.5), the selection of the number of small groups used for a set Li depends on the sizes of the other sets being intersected with Li. Naively pre-computing the required structures for each possible group size would therefore incur excessive space requirements. In this section, we describe a data structure that supports access to partitions of Li into 2^t groups for any possible t, using only O(ni) space; it is illustrated in Figure 2. To support the algorithms introduced so far, this structure must also allow us: (i) for each Li^z, to retrieve the word representation h(Li^z); and

(ii) for each y ∈ [w], to access all elements of h⁻¹(y, Li^z) = {x | x ∈ Li^z and h(x) = y} in time linear in its size |h⁻¹(y, Li^z)|.

[Figure 2: Multi-Resolution Partition of Li. Li is partitioned by g1 : Σ → {0,1}, g2 : Σ → {0,1}², g3 : Σ → {0,1}³, ..., g_t : Σ → {0,1}^t; first(y, Li^z) points to the first element x in Li^z with h(x) = y, and next(x) is the smallest x' ∈ Li following x with h(x') = h(x).]

Multi-resolution Partitioning: For ease of explanation, suppose the elements of Σ are bit strings and choose g as a random permutation of Σ. To pre-process Li, we first order all elements x ∈ Li according to g(x). Then any small group Li^z = {x | x ∈ Li and g_t(x) = z} forms a consecutive interval in Li (partitions of different resolutions are formed for t = 1, 2, ...). Note: in all of our algorithms, universal hash functions and random permutations are almost interchangeable (when used as g), the differences being that (i) a permutation induces a total ordering of the elements (in this data structure, this property is required), whereas hashing may result in collisions (which we can overcome by using the pre-image to break ties), and (ii) there is a slight difference in the resulting probability of, e.g., elements being grouped together (hashing results in (limited) independence, whereas permutations result in negative dependence; we account for this by using the weaker condition in our proofs).

Word Representations of Hash Mappings: For each small group Li^z, we need to pre-compute and store the word representation h(Li^z). The total number of small groups is ni/2 + ni/4 + ... + ni/2^t + ... ≤ ni, so this requires O(ni) space.

Inverted Mappings: We need to access all elements of h⁻¹(y, Li^z) in order, for each y ∈ [w]. If we were to store these mappings for each Li^z explicitly, this would require O(ni log ni) space. However, by storing the inverted mappings h⁻¹(y, Li^z) implicitly, we can do better, as follows. For each group Li^z, since it corresponds to an interval in Li, we can store its starting and ending positions in Li, denoted by left(Li^z) and right(Li^z); these allow us to determine whether an element x belongs to Li^z. Now, to enable ordered access to the inverted mappings, we define, for each x ∈ Li, next(x) to be the next element x' to the right of x such that h(x') = h(x) (i.e., with minimum g(x') > g(x) s.t. h(x') = h(x)). Then, for each Li^z and each y ∈ [w], we store the position first(y, Li^z) of the first element x' in Li^z with h(x') = y. To access all elements of h⁻¹(y, Li^z) in order, we start from the element at first(y, Li^z) and follow the pointers next(x) until passing the right boundary right(Li^z). In this way, all elements of the inverted mapping are retrieved in the same order as g(x), which is what IntersectSmall requires.

Space Requirements: Over all groups of different sizes, the total space for storing the h(Li^z)'s, left(Li^z)'s, right(Li^z)'s, first(y, Li^z)'s and next(x)'s is O(ni). So the whole multi-resolution data structure requires O(ni) space; a detailed analysis is in the appendix. When the group-size parameter ti depends only on ni (e.g., in Algorithm 4), a single resolution suffices in pre-processing, and the above multi-resolution scheme (for selecting ti online) is not necessary.

THEOREM 3.8. To pre-process a set Li of size ni for Algorithms 3-4, we need O(ni log ni) time and O(ni) space (in words).
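A hedged C sketch of the implicit inverted-mapping traversal just described (our illustration; the array layout and field names are ours). A group is an interval over the g-ordered array, and h⁻¹(y, ·) is walked via the first/next pointers:

    #include <stdio.h>

    #define N 8   /* toy set size */

    /* Li stored in g-order; next[i] = index of the next element with the same
       hash value, or N if none. first(y, group) is passed in directly here. */
    static void walk_inverted(const int *elem, const int *next,
                              int first, int right) {
        for (int i = first; i < right; i = next[i])
            printf("%d ", elem[i]);   /* h^-1(y, group), in g-order */
        printf("\n");
    }

    int main(void) {
        /* Toy data: suppose h maps the elements at positions 0, 2, 5 to the
           same value y, so the next-chain is 0 -> 2 -> 5 -> N. */
        int elem[N] = {16, 3, 32, 7, 21, 48, 9, 5};
        int next[N] = {2, N, 5, N, N, N, N, N};
        walk_inverted(elem, next, 0, 6);  /* group interval [0,6): prints 16 32 48 */
        return 0;
    }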
3.3 From Theory to Practice

In this section, we describe a more practical version of our methods. This algorithm is simpler, uses significantly less memory and straightforward data structures, and, while it has worse theoretical guarantees, is faster in practice. The main difference is that for each small group Li^z, we store only the elements of Li^z and their images under m hash functions (i.e., we do not maintain inverted mappings, trading a complex O(1)-access structure for a simple scan over a short block of data). Also, we use only a single partition for each set Li. Having multiple word representations of hash images (under different hash functions) for each small group allows us to detect empty intersections of small groups with higher probability.

Pre-processing Stage: As before, each set Li is partitioned into groups Li^z using a hash function g_ti : Σ → {0, 1}^ti. We will show that a good selection of ti is ⌈log(ni/√w)⌉, which depends only on the size of Li. Thus, for each set Li, pre-processing with a single partitioning suffices, saving significant memory. For each group, we compute the word representations of its images under m (independent) universal hash functions h1, ..., hm : Σ → [w]. Note that we only require a small value of m in practice (e.g., m = 2).

Online Processing Stage: The algorithm we use here to compute ∩i Li (Algorithm 5) is identical to Algorithm 4, with two exceptions: (1) when needed, ∩i Li^{zi} is computed directly by a linear merge of the Li^{zi}'s (line 4), using O(Σi |Li^{zi}|) time; (2) we can skip the computation of ∩i Li^{zi} if, for some hj, the bitwise-AND of the corresponding word representations hj(Li^{zi}) is zero (line 3).

Algorithm 5: Simple intersection via randomized partitioning
    1: for each z ∈ {0, 1}^tk (ti = ⌈log(ni/√w)⌉) do
    2:   Let zi be the ti-prefix of z for i = 1, ..., k
    3:   if ∧_{i=1}^k hj(Li^{zi}) ≠ 0 for all j = 1, ..., m then
    4:     Compute ∩_{i=1}^k Li^{zi} by a linear merge of L1^{z1}, ..., Lk^{zk}
    5:     Let Δ ← Δ ∪ (∩_{i=1}^k Li^{zi})
    6: Δ is the result of ∩_{i=1}^k Li

Analysis: To see why Algorithm 5 is efficient, we observe that if L1^{z1} ∩ L2^{z2} = ∅, then with high probability hj(L1^{z1}) ∧ hj(L2^{z2}) = 0 for some j = 1, ..., m; so most empty intersections can be skipped using the test in line 3. With the probability of a successful filtering (i.e., given ∩i Li^{zi} = ∅, that ∧i hj(Li^{zi}) = 0 for some hash function hj, j = 1, ..., m) bounded by Lemmas A.1 and A.3, we can derive Theorem 3.9. A detailed analysis of this probability (both theoretical and experimental) and of the overall complexity is deferred to Appendix A.5.

THEOREM 3.9. Using ti = ⌈log(ni/√w)⌉, Algorithm 5 computes ∩_{i=1}^k Li in expected O(max(n1, n/α(w)^m) + mn/√w + r) time, where r = |∩_{i=1}^k Li|, n = Σ_{i=1}^k ni, and α(w) = 1/(1 − β(w)) for the β(w) used in Lemma A.3.

3.3.1 Data Structure for Storing the Li^z

In this section, we describe the simple and space-efficient data structure that we use in Algorithm 5. As stated earlier, we only need to partition Li using one hash function g_ti; hence we can represent each Li as an array of small groups Li^z, ordered by z. For each small group, we store the information associated with it in the structure shown in Figure 3. The first word in this structure stores z = g_ti(Li^z). The second word stores the structure's length len. The following m words hold the hash images h1(Li^z), ..., hm(Li^z) of Li^z. Finally, we store the elements of Li^z as an array in the remaining part. We need ⌈ni/√w⌉ such blocks for Li in total.
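Here is a minimal C sketch of this group block together with the line-3 filter of Algorithm 5, for the two-set case (our illustration: Figure 3 packs the fields into consecutive words, which we flatten into a struct; m = 2, and h1/h2 are toy stand-ins for the 2-universal hj):

    #include <stdint.h>
    #include <stdio.h>

    #define M 2           /* number of hash images per group */
    #define MAXG 16       /* max elements per group, for the sketch */

    /* One pre-processed small group (cf. Figure 3): group id z, m hash
       images, and the group's elements in sorted order. */
    struct Block {
        uint64_t z;
        uint64_t image[M];    /* image[j] = word representation of h_j(group) */
        int len;
        int elem[MAXG];
    };

    static int h1(int x) { return x % 64; }
    static int h2(int x) { return (31 * x + 7) % 64; }

    static void block_init(struct Block *b, uint64_t z, const int *e, int n) {
        b->z = z; b->len = n;
        b->image[0] = b->image[1] = 0;
        for (int i = 0; i < n; i++) {
            b->elem[i] = e[i];
            b->image[0] |= 1ULL << h1(e[i]);
            b->image[1] |= 1ULL << h2(e[i]);
        }
    }

    /* Lines 3-4 of Algorithm 5 for k = 2: skip if any AND of images is zero,
       otherwise fall back to a linear merge of the short sorted arrays. */
    static int intersect_groups(const struct Block *a, const struct Block *b,
                                int *out) {
        for (int j = 0; j < M; j++)
            if ((a->image[j] & b->image[j]) == 0)
                return 0;                     /* filtered: provably empty */
        int i = 0, k = 0, r = 0;
        while (i < a->len && k < b->len) {
            if (a->elem[i] < b->elem[k]) i++;
            else if (a->elem[i] > b->elem[k]) k++;
            else { out[r++] = a->elem[i]; i++; k++; }
        }
        return r;
    }

    int main(void) {
        struct Block b1, b2;
        int e1[] = {16, 27, 43}, e2[] = {11, 16, 22, 32}, out[MAXG];
        block_init(&b1, 0, e1, 3);
        block_init(&b2, 0, e2, 4);
        int r = intersect_groups(&b1, &b2, out);
        for (int i = 0; i < r; i++) printf("%d ", out[i]); /* prints 16 */
        printf("\n");
        return 0;
    }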

[Figure 3: The Structure for a Pre-processed Small Group Li^z: the word z, the length len, the m words h1(Li^z), ..., hm(Li^z), and the len elements of Li^z.]

The first word z can also be computed on the fly, as these small groups are accessed sequentially in Algorithm 5. So, if we store len in one word and one word per element of Li^z, we need m + 1 + |Li^z| words in total for each group Li^z, and thus ni(1 + (m + 1)/√w) words to store the pre-processed Li. The overhead of pre-processing is dominated by the cost of sorting Li (the remaining operations are trivial).

THEOREM 3.10. To pre-process a set Li of size ni for Algorithm 5, we need O(ni(m + log ni)) time and O(ni(1 + m/√w)) space (in words).

We describe methods for compressing this structure in Appendix B.

3.4 Intersecting Small and Large Sets

An important special case of set intersection are asymmetric intersections, where the sizes n1 and n2 of the sets being intersected differ significantly (w.l.o.g., assume n1 ≪ n2). In this subsection, using the same multi-resolution data structure as in Section 3.2.1, we present an algorithm HashBin that computes L1 ∩ L2 in O(n1 log(n2/n1)) time. This bound is also achieved by previous work, e.g., SmallAdaptive [5], but our algorithm has even simpler online processing. It is also known that algorithms based on hash tables only require O(n1) time in this scenario; however, unlike HashBin, they are ill-suited for less asymmetric cases.

Algorithm HashBin: When intersecting two sets L1 and L2 with sizes n1 ≪ n2, we focus on the partitioning induced by g_t : Σ → {0, 1}^t with t = ⌈log n1⌉, for both sets, where g is a random permutation of Σ. To compute L1 ∩ L2, we compute L1^z ∩ L2^z for all z ∈ {0, 1}^t and take the union. To compute L1^z ∩ L2^z, we iterate over each x ∈ L1^z and perform a binary search to check whether x ∈ L2^z, using O(log |L2^z|) time. This scheme can be extended to multiple sets by searching for x in Li^z only if x was found in L2^z, ..., L_{i−1}^z.

THEOREM 3.11. The algorithm HashBin computes L1 ∩ L2 in expected O(n1 log(n2/n1)) time. The pre-processing of a list Li requires O(ni log ni) time and O(ni) space.

The proof of Theorem 3.11 and the details of how HashBin uses the multi-resolution data structure are deferred to Section A.6 in the appendix. An advantage of HashBin is that, since it is based on the same structure as the algorithm introduced in Section 3.2, we can choose between the two algorithms online, based on n1/n2.

4. EXPERIMENTAL EVALUATION

We evaluate the performance and space requirements of four of the techniques described in this paper: (a) the fixed-width partition algorithm described in Section 3.1 (which we refer to as IntGroup); (b) the randomized partition algorithm of Section 3.2 (RanGroup); (c) the simple algorithm based on randomized partitions described in Section 3.3 (RanGroupScan); and (d) the algorithm for intersecting sets of skewed sizes from Section 3.4 (HashBin).

Setup: All algorithms are implemented in C and evaluated on a 4GB 64-bit 2.4GHz PC. We employ a random permutation of the document IDs for the hash function g and 2-universal hash functions for h (or the hj's). For RanGroupScan, we use m = 4 (the number of hash functions hj), unless noted otherwise. We compare our techniques to the following competitors: (i) set intersection based on a simple parallel scan of inverted indexes: Merge; (ii) set intersection based on skip lists: SkipList [18]; (iii) set intersection based on hash tables: Hash (i.e., we iterate over the smallest set L1, looking up every element x ∈ L1 in hash-table representations of L2, ..., Lk);
(iv) the algorithm of [6]: BPP; (v) the algorithm proposed for fast intersection of integer inverted indexes in main memory [19, 21]: Lookup (using B = 32 as the bucket size, which is the best value in both our and the authors' experience); and (vi) various adaptive intersection algorithms based on binary/galloping search: SvS, Adaptive [12, 3, 13], BaezaYates [1, 2], and SmallAdaptive [5]. Note that BaezaYates is generalized to handle more than two sets as in [5].

Implementation: For each competitor, we tried our best to optimize its performance. For example, for Merge we tried to minimize the number of branches in the inner loop; we also store postings at consecutive memory addresses to speed up parallel scans and reduce page walks after TLB misses. Our implementation of skip lists follows [18], with simplifications since we focus on static data and do not need fast insertion/deletion. We also simplified the bit manipulation in BPP [6] so that it works faster in practice for small k. For the algorithms using inverted indexes, we initially do not apply compression to the posting lists, as we do not want the decompression step to distort the reported performance; in Section 4.1 we study variants of the algorithms incorporating compression. With regard to skip operations in the index, note that since we use uncompressed posting lists, algorithms such as Adaptive can perform arbitrary skips into the index directly.

Datasets: To evaluate these algorithms we use both synthetic and real data. For the experiments with synthetic datasets, sets are generated randomly (and uniformly) from a universe Σ. The real dataset is a collection of more than 8M Wikipedia pages. In each experiment on the synthetic datasets, 20 combinations of sets are randomly generated, and the average time is reported.

[Figure 4: Varying the Set Size. Intersection time (ms) for set sizes 1M-10M; algorithms: Merge, SkipList, Hash, IntGroup, BPP, Adaptive, Lookup, RanGroupScan.]

Varying the Set Size: First, we measure the performance when intersecting only 2 sets; we use synthetic data, the lists are of equal size, and the size of the intersection is fixed at 10% of the list size; the results are shown in Figure 4. We can see that the performance of the different techniques relative to each other does not change with varying list size. Hash performs worst, as the (relatively) expensive lookup operation needs to be performed many times. SkipList performs poorly for the same reason. The BPP algorithm is also slow, but because of a number of complex operations that are hidden as a constant in the O-notation. The same trend held for the remaining experiments as well; hence, for readability, we did not include BPP in the subsequent graphs. For the same reason we only show the best-performing of the adaptive algorithms in each evaluation; if one adaptive algorithm dominates another for all parameter settings in an experiment, we do not plot the worse one. Among the remaining algorithms, RanGroupScan (40%-50% faster than Merge) and IntGroup perform best (RanGroup performs similarly to IntGroup and is not plotted).

[Figure 5: Varying the Intersection Size. Intersection time (ms) vs. intersection size; Merge performs best for |L1 ∩ L2| > 0.7|L1|.]
[Figure 6: Varying the Number of Keywords. Average intersection time (ms) for k = 2, 3, 4.]

Interestingly, the simple Merge algorithm is next, outperforming the more sophisticated algorithms, followed by Lookup and the best-performing adaptive algorithm.

Varying the Intersection Size: The size r of the intersection is an important factor in the performance of the algorithms: larger intersections mean fewer opportunities to eliminate small groups early (for our algorithms) or to skip parts of a set (for the adaptive and skip-list based approaches). Here, we use synthetic data, intersecting two sets with 10M elements each, and vary r = |L1 ∩ L2| between 10^5 and 10M. The results are reported in Figure 5. For r < 7M (70% of the set size), RanGroupScan and IntGroup perform best. Otherwise, Merge becomes the fastest and RanGroupScan the second-fastest alternative; the performance of RanGroupScan remains very similar to Merge all the way to r = 10M. Among the remaining algorithms, RanGroup slightly outperforms Merge for r < 5M, Lookup is the next-best algorithm, and SvS and Adaptive perform best among the adaptive algorithms.

Varying the Set-Size Ratios: As illustrated in the introduction, skew in the set sizes is also an important factor in performance. When sets are very different in size, algorithms that iterate through the smaller set and can locate the corresponding values in the larger set quickly, such as HashBin and Hash, perform well. In this experiment we use synthetic data and vary the ratio of set sizes, setting |L2| = 10M and varying |L1| between 16K and 10M. The size of the intersection is set to 10% of |L1|, and we define the ratio between the list sizes as sr = |L2|/|L1|. Here, the differences between the algorithms become small with growing sr (for this reason, we do not report them in a graph, as too many lines overlap). For sr < 32, RanGroupScan performs best; for larger sr, Lookup and Hash perform best, up to a ratio of sr = 100; for this and larger ratios, Hash outperforms the remaining algorithms, followed by Lookup and HashBin. Generally, both HashBin and RanGroupScan perform close to the best-performing algorithm. The adaptive algorithms require more time than RanGroupScan for sr ≤ 20 and more time than HashBin for all values of sr; SkipList and BPP perform worst across all values of sr.

Varying the Number of Keywords: In this experiment, we varied the number of sets k = 2, 3, 4, fixing |Li| = 1M for i = 1, ..., k, with the IDs in the sets randomly generated using a uniform distribution over [1, 2·10^8]; the results are reported in Figure 6. In this experiment, we use m = 2 hash images for RanGroupScan. For multiple sets, RanGroupScan is the fastest, with the difference becoming more pronounced for 3 and 4 keywords, since, with additional sets, intersecting the hash images (word representations) yields more empty results, allowing us to skip the corresponding groups. RanGroup is the next-best performing algorithm; we do not include results for IntGroup here, as it is designed for intersections of two sets (see Section 3.1). Interestingly, the simple Merge algorithm again performs very well compared to the more sophisticated techniques; the Lookup algorithm is next, followed by the various adaptive techniques.
Size of the Data Structure: The improvements in speed come at the cost of an increase in space: our data structures (without compression) require more space than an uncompressed posting list; the increase is 37% (RanGroupScan for m = 2), 63% (RanGroupScan for m = 4), 75% (IntGroup) or 87% (RanGroup).

[Figure 7: Normalized Execution Time on a Real Workload. Results over all query sizes, normalized to Merge.]

Experiment on Real Data: In this experiment, we used a workload of the 400 most frequent queries (measured over a week in 2009) against the Bing.com search engine. As the text corpus, we used a set of 8 million Wikipedia documents. Query characteristics: 68% of the queries contain 2 keywords, 23% contain 3 keywords, and 6% contain 4 keywords. As illustrated before, a key factor for performance is the ratio of set sizes: among the 2-word queries, the average ratio |L1|/|L2| is 0.12; for 3-word queries, the average ratio |L1|/|L2| is 0.13 and the average ratio |L1|/|L3| is 0.09; for 4-word queries, the |L1|/|L2| ratio is 0.36 and the |L1|/|L4| ratio is 0.06 (note that |L1| ≤ |L2| ≤ |L3| ≤ |L4|). The average ratio of intersection size to |L1| is 0.09.

To illustrate the relative performance of the algorithms over all queries, we plot their average running times in Figure 7, with the running time of Merge normalized to 1. Both RanGroup and RanGroupScan significantly outperform Merge, with the latter performing best overall; interestingly, when used for all queries (as opposed to only the high-skew case it was designed for), HashBin still performed better than Merge. The remaining algorithms performed in a similar order as in the earlier experiments, the one exception being SvS, which outperformed both Merge and Lookup on this more realistic data. Overall, RanGroupScan was the best-performing algorithm for 61.6% of the queries, followed by RanGroup (16%) and HashBin (7.7%); among the algorithms not proposed in this paper, Lookup performed best for 6.4% of the queries and SvS for 3.6%. All of the other techniques were best for 2.0% of the queries or fewer. We present additional experiments on this data set in Appendix C.2.

4.1 Experiments on Compressed Structures

To illustrate the impact of compression on performance, we repeated the first experiment above, intersecting two sets of identical size, with the size of the intersection fixed to 10% of the set size. Varying the set size, we report the execution times and storage requirements for the three algorithms that performed best overall in the earlier experiments (Merge, Lookup and RanGroupScan; since we are interested in small structures here, we use only m = 1 hash image in RanGroupScan) when compressed with different techniques: we used the standard techniques based on γ- and δ-coding (see [23], p. 116) to compress the parts of the posting data stored and accessed sequentially for the three algorithms, and the compression technique described in Appendix B for RanGroupScan (which we refer to as RanGroupScan Lowbits). The results are shown in Figure 8; we omit the results for γ-coding as they were essentially indistinguishable from those for δ-coding. RanGroupScan outperforms, in terms of speed, the other two algorithms under the same compression scheme; the other two algorithms perform similarly to each other, as decompression now dominates their run time. Using our encoding scheme of Appendix B improves performance significantly. Looking at the graph, we can see that the storage requirement for RanGroupScan (using our own encoding) is between 1.3-1.9x the size of the compressed inverted index and between 1.2-1.6x that of the compressed Lookup structure. At the same time, the performance improvements are between 7.6-15x (vs. Merge) and 7.4-13x (vs. Lookup). Furthermore, by increasing the number of hash images to m = 2, we obtain an algorithm that significantly outperforms the uncompressed Merge, while requiring less memory.

[Figure 8: Running Time and Space Requirement. Intersection time (ms) and structure size (in words) for 128K-8M postings; variants: Merge_Delta, Lookup_Delta, RanGroupScan_Lowbits, RanGroupScan_Delta.]

Experiment on Real Data: We repeated this experiment using the real-life data/workload described earlier and the compressed variants of RanGroupScan, Lookup and Merge. Again, RanGroupScan Lowbits performed best, improving run times by factors of 8.4x (vs. Merge + δ-coding), 9.1x (Merge + γ-coding), 5.7x (Lookup + δ-coding), and 6.2x (Lookup + γ-coding), respectively. However, our approach required the most space (66% of the uncompressed data), whereas Merge (26% / 28% for γ- / δ-coding) and Lookup (35% / 37%) required significantly less. Finally, to illustrate the robustness of our techniques, we also measured the worst-case latency for any single query: the worst-case latency using Merge + δ-coding was 5.2x the worst-case latency of RanGroupScan Lowbits. We saw similar results for Merge + γ-coding (5.6x higher), Lookup + δ-coding (4.4x higher), and Lookup + γ-coding (4.9x higher).

5. CONCLUSION

In this paper we introduced algorithms for set intersection over memory-resident data. Our approach provides both novel theoretical worst-case guarantees and very fast performance in practice, at the cost of increased storage space. Our techniques outperform a wide range of existing techniques and are robust in that, for inputs on which they are not the best-performing approach, they perform close to the best one. They have applications in information retrieval and query processing scenarios where performance is of greater concern than space.
6. REFERENCES
[1] R. A. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In CPM, pages 400-408, 2004.
[2] R. A. Baeza-Yates and A. Salinger. Experimental analysis of a fast intersection algorithm for sorted sequences. In SPIRE, pages 13-24, 2005.
[3] J. Barbay and C. Kenyon. Adaptive intersection and t-threshold problems. In SODA, 2002.
[4] J. Barbay, A. López-Ortiz, and T. Lu. Faster adaptive set intersections for text searching. In 5th WEA, pages 146-157, 2006.
[5] J. Barbay, A. López-Ortiz, T. Lu, and A. Salinger. An experimental investigation of set intersection algorithms for text searching. ACM Journal of Experimental Algorithmics, 14, 2010.
[6] P. Bille, A. Pagh, and R. Pagh. Fast evaluation of union-intersection expressions. In ISAAC, pages 739-750, 2007.
[7] G. E. Blelloch and M. Reid-Miller. Fast set operations using treaps. In ACM SPAA, pages 16-26, 1998.
[8] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In CIKM, pages 426-434, 2003.
[9] M. R. Brown and R. E. Tarjan. A fast merging algorithm. Journal of the ACM, 26(2):211-226, 1979.
[10] J. Brutlag. Speed matters for Google web search. Google, 2009.
[11] E. Chiniforooshan, A. Farzan, and M. Mirzazadeh. Worst case optimal union-intersection expression evaluation. In ALENEX, pages 179-190, 2011.
[12] E. Demaine, A. López-Ortiz, and J. Munro. Adaptive set intersections, unions, and differences. In SODA, pages 743-752, 2000.
[13] E. Demaine, A. López-Ortiz, and J. Munro. Experiments on adaptive set intersections for text retrieval systems. In ALENEX, pages 91-104, 2001.
[14] D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge, 2009.
[15] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS, pages 102-113, 2001.
[16] F. K. Hwang and S. Lin. A simple algorithm for merging two disjoint linearly ordered sets. SIAM Journal on Computing, 1(1):31-39, 1972.
[17] G. Linden. Marissa Mayer at Web 2.0. Blog post, 2006.
[18] W. Pugh. A skip list cookbook. Technical Report UMIACS-TR-89-72.1, University of Maryland, 1990.
[19] P. Sanders and F. Transier. Intersection in integer inverted indices. In ALENEX, pages 71-83, 2007.
[20] S. Tatikonda, F. Junqueira, B. B. Cambazoglu, and V. Plachouras. On efficient posting list intersection with multicore processors. In ACM SIGIR, 2009.
[21] F. Transier and P. Sanders. Compressed inverted indexes for in-memory search engines. In ALENEX, pages 3-12, 2008.
[22] D. Tsirogiannis, S. Guha, and N. Koudas. Improving the performance of list intersection. PVLDB, 2(1), 2009.
[23] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999.

APPENDIX

Acknowledgments: We thank the anonymous reviewers for their numerous insights and suggestions that immensely improved the paper.

A. PROOFS OF THEOREMS

A.1 Analysis of Algorithm 1 (Proof of Theorem 3.3)

There are a total of O((n1 + n2)/√w) pairs of groups L1^p and L2^q to be checked in Algorithm 1. For each pair, since H = h(L1^p) ∧ h(L2^q) can be computed in O(1) time and the elements of H can be enumerated in linear time, the cost of computing L1^p ∩ L2^q is dominated by computing h⁻¹(y, L1^p) ∩ h⁻¹(y, L2^q) for every y ∈ H, the cost of which is in turn determined by the number of pairs of elements that are mapped to the same location by h; we denote this set by I = {(x1, x2) | x1 ∈ L1^p, x2 ∈ L2^q, and h(x1) = h(x2)}. Let I= = {(x1, x2) | x1 = x2} ∩ I denote the pairs of identical elements (i.e., elements of the intersection) in I, and I≠ = {(x1, x2) | x1 ≠ x2} ∩ I the remaining pairs of elements that are hashed to the same value by h but are not identical. Obviously, |I=| = |L1^p ∩ L2^q|. If we can show E[|I≠|] = O(1), the proof is complete: over the O((n1 + n2)/√w) pairs of L1^p and L2^q to be checked, the total running time is

    Σ_{p,q} O(E[|I|] + 1) = Σ_{p,q} (|I=| + E[|I≠|] + 1)·O(1) = O(r) + (n1 + n2)/√w · O(1).   (3)

Indeed, we can show for each pair L1^p and L2^q that

    E[|I≠|] = Σ_{x1 ∈ L1^p, x2 ∈ L2^q, x1 ≠ x2} Pr[h(x1) = h(x2)] = O(1)   (4)

for a universal hash function h (each of the at most |L1^p|·|L2^q| = w pairs collides with probability 1/w), which completes the proof.

A.1.1 Group Size and Optimizing Running Time

In Algorithm 1, the group size is selected as the magical number √w (i.e., |L1^p| = |L2^q| = √w). To explain this choice, we now explore the effect of group size on the running time of Algorithm 1. Suppose in general Li is partitioned into groups of size si. Extending Equation (4) a bit, we have E[|I≠|] = O(1) as long as s1·s2 ≤ w. Then, following the same argument as in (3), a total of O(n1/s1 + n2/s2) pairs are to be checked, and the expected running time of Algorithm 1 is O(T(s1, s2)), where T(s1, s2) = n1/s1 + n2/s2 + r. Minimizing T(s1, s2) under the constraint s1·s2 ≤ w yields the optimal group sizes s1 = √(w·n1/n2) and s2 = √(w·n2/n1), and the optimal running time O(T(s1, s2)) = O(√(n1·n2/w) + r). If we instead use the group sizes s1 = s2 = √w, as in the proof of Theorem 3.3, we obtain a running time of O(T(s1, s2)) = O((n1 + n2)/√w + r). O(√(n1·n2/w) + r) is better than O((n1 + n2)/√w + r) when the set sizes are skewed (e.g., n1 ≪ n2); the two coincide for n1 = n2. To achieve the better bound, we leverage the fact that the group size s1 = √(w·n1/n2) of the set L1 depends on the size n2 of the set L2 to be intersected with it, and use a multi-resolution structure which keeps different partitions of a set, as discussed at the end of Section 3.1.

A.2 Analysis of Algorithm 3 (Proof of Theorem 3.5)

Similar to the proof of Theorem 3.3, the cost of computing L1^{z1} ∩ L2^{z2} using IntersectSmall for a pair of small groups L1^{z1} and L2^{z2} is determined by the size of I = {(x1, x2) | x1 ∈ L1^{z1}, x2 ∈ L2^{z2}, and h(x1) = h(x2)}. As in A.1, let I= = {(x1, x2) | x1 = x2} ∩ I and I≠ = {(x1, x2) | x1 ≠ x2} ∩ I. Obviously, |I=| = |L1^{z1} ∩ L2^{z2}|, and I≠ is the set of element pairs that cause a hash collision. If we can show E[|I≠|] ≤ O(1), the proof is complete: since t1 = t2 = ⌈log √(n1·n2/w)⌉, there are O(√(n1·n2/w)) pairs of z1 and z2 to be considered (we have z1 = z2 = z in every iteration), and thus the total running time is

    Σ_{z ∈ {0,1}^t2} O(E[|I|] + 1) = Σ_z (|I=| + E[|I≠|] + 1)·O(1) = O(r) + √(n1·n2/w) · O(1).

We now prove that, for each pair (z1, z2), E[|I≠|] = O(1).
Letting S1^{z1} = |L1^{z1} \ L2^{z2}| and S2^{z2} = |L2^{z2} \ L1^{z1}|, if L1^{z1} and L2^{z2} are fixed, we have (similar to (4) in the proof of Theorem 3.3):

    E_h[|I≠| | S1^{z1}, S2^{z2}] = S1^{z1} · S2^{z2} / w.

S1^{z1} and S2^{z2} are random variables determined by the hash function g. From their definition and the properties of 2-universal (2-independent) hashing, we can prove E_g[S1^{z1} · S2^{z2}] ≤ E_g[S1^{z1}] · E_g[S2^{z2}] (using a random permutation as g yields the same result). Also, E_g[S1^{z1}] ≤ E_g[|L1^{z1}|] = O(n1/2^{t1}) = O(√(w·n1/n2)), and similarly E_g[S2^{z2}] ≤ O(√(w·n2/n1)). Therefore, E_g[S1^{z1} · S2^{z2}] ≤ E_g[S1^{z1}] · E_g[S2^{z2}] ≤ O(w), and thus

    E[|I≠|] = E_g[E_h[|I≠| | S1^{z1}, S2^{z2}]] = E_g[S1^{z1} · S2^{z2} / w] ≤ O(w)/w = O(1),   (5)

which completes the argument.

A.3 Analysis of Algorithm 4 (Proofs of Theorems 3.6 and 3.7)

Theorem 3.6 is the special case of Theorem 3.7 for two-set intersection, so we only present the proof of Theorem 3.7 below. Consider any element x ∈ Li for a set Li involved in the intersection computation, i.e., in the extended IntersectSmall in line 3 of Algorithm 4, where we compute

    H = ∧_{i=1}^k h(Li^{zi})   and   ∩_{i=1}^k Li^{zi} = ∪_{y ∈ H} (∩_{i=1}^k h⁻¹(y, Li^{zi})).

Denote by Γ the set of all such elements x (with h(x) = y ∈ H). The number of such elements |Γ| dominates the cost of Algorithm 4. We differentiate two cases of elements in Γ:

(i) x ∈ ∩_{i=1}^k Li: these r elements are scanned k times each, and thus contribute a term of O(kr) to the overall time complexity.

(ii) x ∉ ∩_{i=1}^k Li: we group all these elements into sets D2, ..., Dk (an element x may belong to multiple Di's): Di = {x ∈ Γ | x ∈ Li ∩ L_{i+1} ∩ ... ∩ Lk, but x ∉ L_ℓ for some ℓ < i}.

Now focus on Di ∩ Li^{zi} for each zi ∈ {0, 1}^{ti}. For any x ∈ Li with x ∉ L_ℓ for some ℓ < i, letting z_ℓ be the t_ℓ-prefix of zi, x ∈ Di ∩ Li^{zi} implies that h(x) ∈ H, and thus there exists some x' (≠ x) ∈ L_ℓ^{z_ℓ} such that h(x) = h(x'); so, for such an x,

    Pr[x ∈ Di ∩ Li^{zi} | x ∈ Li^{zi}] ≤ Pr[h(x) ∈ h(L_ℓ^{z_ℓ}) | x ∈ Li^{zi}] ≤ Σ_{x' ∈ L_ℓ^{z_ℓ}} Pr[h(x) = h(x')] ≤ |L_ℓ^{z_ℓ}| / w.

Generalizing Equation (5) in the proof of Theorem 3.5, we have

    E[|Di ∩ Li^{zi}|] ≤ Σ_{x ∈ Li} E_g[|L_ℓ^{z_ℓ}|/w] · Pr[x ∈ Li^{zi}] = O(1)

(as E_g[|L_ℓ^{z_ℓ}|] = √w for any ℓ, and Pr[x ∈ Li^{zi}] = √w/ni). So E[|Di|] ≤ O(ni/√w), as Li is partitioned into 2^{ti} = ni/√w groups Li^{zi} over all iterations of Algorithm 4. Then we have

    E[Σ_{i=2}^k |Di|] ≤ Σ_{i=2}^k O(ni/√w) = O(n/√w).   (6)

Running Time: With the Di's bounded as above, a naive implementation of Algorithm 4 requires O(k·nk/√w + kr) time in expectation: the iteration of lines 1-4 repeats nk/√w times (suppose n1 ≤ n2 ≤ ... ≤ nk); in each iteration, we compute H in O(k) time, and each element of D needs O(k) comparisons to be eliminated. Note that n/√w is potentially smaller than k·nk/√w, especially for sets with skewed sizes. With careful memoization of the partial results ∧ h(Li^{zi}) and ∩ h⁻¹(y, Li^{zi}) in Algorithm 4, using (i) and (ii), we now prove the promised running time O(n/√w + kr).

The major cost of Algorithm 4 comes from the computation of (a) H = ∧_{i=1}^k h(Li^{zi}) and (b) ∩_{i=1}^k h⁻¹(y, Li^{zi}) for each y ∈ H. Assume n1 ≤ n2 ≤ ... ≤ nk. For (a), as zi is the ti-prefix of z whenever ti ≤ t, we can memoize the prefix AND ∧_{i=1}^ℓ h(Li^{zi}); then, for example, we reuse h(L1^{z1}) ∧ h(L2^{z2}) when computing both h(L1^{z1}) ∧ h(L2^{z2}) ∧ h(L3^{z3}) and h(L1^{z1}) ∧ h(L2^{z2}) ∧ h(L3^{z3'}). In this way, the computation of H for all the different combinations of z1, ..., zk requires Σi O(ni/√w) = O(n/√w) time. For (b), for each combination of z1, ..., zk, we compute the result in inverse order (from i = k down to i = 1): the partial results ∩_{i=ℓ}^k h⁻¹(y, Li^{zi}) (for all y ∈ H, all z's, and each ℓ) have total size bounded by Σi |Di| + kr. Using the hash-table based approach to compute the intersection, the total running time is bounded by the total size of the partial results. So, from (6), the total running time is O(n/√w + kr) in expectation.

A.4 Analysis of the Multi-resolution Structure (Proof of Theorem 3.8)

The time bound is trivial, because we only require sorting and scanning of each set. The total space for storing the h(Li^z)'s, left(Li^z)'s, and right(Li^z)'s is O(ni), as there are O(ni/2 + ni/4 + ...) = O(ni) groups over the different resolutions. For the next(x)'s we also only need O(ni) space, as there are ni elements in the set. We now analyze the space needed for the first(y, Li^z)'s to complete the proof. To store first(y, Li^z) for each y ∈ [w] and each z, it suffices to store the difference between first(y, Li^z) and left(Li^z), which needs O(log |Li^z|) bits; to store first(y, Li^z) for all y ∈ [w] in a group Li^z, we thus need O(w·log|Li^z| / w) = O(log |Li^z|) words. Consider the partitioning induced by g_t : Σ → {0, 1}^t for some t, and let t̄ = log ni − t; there are O(ni/2^t̄) groups Li^z generated by g_t, so the space we need for all these groups is (since log(·) is concave)

    O(Σ_{z ∈ {0,1}^t} log |Li^z|) ≤ O(2^t · log(ni/2^t)) = O((ni/2^t̄) · t̄).

Therefore, over all resolutions t = 1, 2, ..., log ni, the total space needed for the first(y, Li^z)'s is O(Σ_{t̄ ≥ 1} t̄·ni/2^t̄) ≤ O(ni).

A.5 Analysis of Algorithm 5 (Proof of Theorem 3.9)

A.5.1 Probability of Successful Filtering

Recall that in Algorithm 5, sets are partitioned into small groups by the hash function g, and m universal hash functions h1, ..., hm are used to test whether the intersection of small groups is empty. The algorithm is efficient because of the following observation: if L1^{z1} ∩ L2^{z2} = ∅, then hj(L1^{z1}) ∧ hj(L2^{z2}) = 0 for some j = 1, ..., m (so-called successful filtering) with high probability.
But once a false positive happens (i.e., $\bigcap_i L_i^{z_i} = \emptyset$ but $\bigcap_i h_j(L_i^{z_i}) \ne \emptyset$ for every hash function $h_1, \ldots, h_m$), we have to scan the two or $k$ small groups for the intersection. So to analyze Algorithm 5, the key point we need to establish is that successful filtering happens with a constant probability for two or $k$ small groups. We first verify the above intuition, assuming that $|L_i^{z_i}|$ is exactly $\sqrt{w}$:

LEMMA A.1. For two small groups $L_1^{z_1}$ and $L_2^{z_2}$ with $|L_1^{z_1}| = |L_2^{z_2}| = \sqrt{w}$, given a universal hash function $h: \Sigma \to [w]$, if $L_1^{z_1} \cap L_2^{z_2} = \emptyset$, then $h(L_1^{z_1}) \cap h(L_2^{z_2}) = \emptyset$ with probability at least $(1 - 1/\sqrt{w})^{\sqrt{w}}$ ($\approx 0.3436$ for $w = 64$).

PROOF. Since $L_1^{z_1} \cap L_2^{z_2} = \emptyset$, for each $x_2 \in L_2^{z_2}$, $h(x_2) \notin h(L_1^{z_1})$ holds with probability $1 - |h(L_1^{z_1})|/w$. So,
$$\Pr[h(L_1^{z_1}) \cap h(L_2^{z_2}) = \emptyset] \ge \big(1 - |h(L_1^{z_1})|/w\big)^{|L_2^{z_2}|} \ge \big(1 - |L_1^{z_1}|/w\big)^{|L_2^{z_2}|},$$
as $|h(L_1^{z_1})| \le |L_1^{z_1}|$. So, when $|L_1^{z_1}| = |L_2^{z_2}| = \sqrt{w}$, we have
$$\Pr[h(L_1^{z_1}) \cap h(L_2^{z_2}) = \emptyset] \ge \big(1 - 1/\sqrt{w}\big)^{\sqrt{w}}.$$

In general, although the sizes of the small groups $L_i^z$ are random variables, they are unlikely to deviate from $\sqrt{w}$ by much. This is important since groups of larger sizes result in poorer filtering performance of the word representations $h_j(L_i^{z_i})$ (incurring more false positives). Using Chernoff bounds we can show that:

PROPOSITION A.2. For any group $L_i^{z_i}$ defined in Algorithm 5 (i.e., partition $L_i$ by $g_{t_i}: \Sigma \to \{0,1\}^{t_i}$ with $t_i = \lceil \log(n_i/\sqrt{w}) \rceil$), we have:
(i) $E[|L_i^{z_i}|] \le \sqrt{w}$;
(ii) $\Pr[|L_i^{z_i}| \ge (1+\epsilon)\sqrt{w}] \le \exp(-\sqrt{w}\,\epsilon^2/3)$, for $0 < \epsilon < 1$;
(iii) $\Pr[|L_i^{z_i}| \ge \delta(w)\sqrt{w}] \le 1/(4w)$, where the constant $\delta(w) = 1 + \sqrt{6\ln(4w)/\sqrt{w}}$ ($\approx 3.04$ for $w = 64$).

PROOF. In this proof we use a random permutation as $g: \Sigma \to \Sigma$, and define $g_t(x)$ to be the $t$ most significant bits of $g(x)$. However, note that we can use a hash function here as well, if we use the pre-image to break any ties resulting from hash collisions (thereby obtaining a total ordering). For the group $L_i^{z_i}$, define $Y_x = 1$ if $x \in L_i^{z_i}$ (i.e., $g_{t_i}(x) = z_i$), and $Y_x = 0$ otherwise. So $|L_i^{z_i}| = \sum_x Y_x$. Then (i) follows from the fact that $\Pr[Y_x = 1] = 1/2^{t_i}$ and the linearity of expectation. For a random permutation $g$, we can prove that the $\{Y_x \mid x \in L_i\}$ are negatively associated [4], so the Chernoff bounds can still be applied. As in (i), we have $\mu_L = \sqrt{w}/2 \le \mu = E[|L_i^{z_i}|] \le \sqrt{w} = \mu_H$. To prove (iii), we use the Chernoff bound $\Pr[|L_i^{z_i}| > (1+\epsilon)\mu] < \exp(-\mu\epsilon^2/3) \le \exp(-\mu_L\epsilon^2/3)$ [4]. To prove (ii), we can use a tighter bound: for $0 < \epsilon < 1$, $\Pr[|L_i^{z_i}| > (1+\epsilon)\mu_H] < \exp(-\mu_H\epsilon^2/3)$ [4]. Note that the same bounds hold when a hash function is used as $g$.

Lemma A.3 extends Lemma A.1 to $k$ groups, whose sizes are random variables determined by the hash function $g$.
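Before moving to the general case, a quick Monte Carlo sanity check of Lemma A.1 (this is not one of the paper's experiments): hash two disjoint groups of size $\sqrt{w} = 8$ with a truly random $h$ into $[w]$ and count how often their word images are disjoint.

```python
import random

def filtering_rate(w=64, trials=200_000):
    s = int(round(w ** 0.5))
    ok = 0
    for _ in range(trials):
        img1 = 0
        for _ in range(s):          # word image of the first group
            img1 |= 1 << random.randrange(w)
        img2 = 0
        for _ in range(s):          # word image of the second group
            img2 |= 1 << random.randrange(w)
        ok += (img1 & img2) == 0    # successful filtering
    return ok / trials

print(filtering_rate())  # around 0.36, above the (1 - 1/8)^8 = 0.3436 bound
```

The measured rate sits slightly above the bound, as expected, since $|h(L_1^{z_1})|$ is typically smaller than $|L_1^{z_1}|$.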

LEMMA A.3. For groups $L_i^{z_i}$ (for $i = 1, \ldots, k$, partition $L_i$ by $g_{t_i}: \Sigma \to \{0,1\}^{t_i}$, where $t_i = \lceil \log(n_i/\sqrt{w}) \rceil$), if $\bigcap_i L_i^{z_i} = \emptyset$, then $\bigcap_i h(L_i^{z_i}) = \emptyset$ with at least constant probability
$$\beta_1(w) = \Big(1 - \frac{1 + \delta(w)\sqrt{w}}{4w}\Big)\Big(1 - \frac{\delta(w)}{\sqrt{w}}\Big)^{\delta(w)\sqrt{w}}$$
(or $\beta_2(w)$ below), where $\delta(w)$ is a constant determined by $w$, as in Proposition A.2.

PROOF. Since $\bigcap_i L_i^{z_i} = \emptyset$, for any $x \in L_1^{z_1}$, there exists some $L_j^{z_j}$ s.t. $x \notin L_j^{z_j}$; now, for this small group $L_j^{z_j}$, if for every $x' \in L_j^{z_j}$ we have $h(x') \ne h(x)$, i.e., $h(x) \notin h(L_j^{z_j})$, we say that $x$ is collision-free. If $|L_j^{z_j}| \le \delta(w)\sqrt{w}$, from the union bound,
$$\Pr[x \text{ is collision-free}] \ge 1 - \delta(w)/\sqrt{w}, \qquad (7)$$
where $\delta(w)$ is defined in Proposition A.2. Note that $\bigcap_i h(L_i^{z_i}) = \emptyset$ holds if every $x$ in $L_1^{z_1}$ is collision-free. So, if furthermore $|L_1^{z_1}| \le \delta(w)\sqrt{w}$, we have
$$\Pr\big[\textstyle\bigcap_i h(L_i^{z_i}) = \emptyset\big] \ge \big(1 - \delta(w)/\sqrt{w}\big)^{\delta(w)\sqrt{w}}.$$
The derivation of (7) assumes independence of the randomized hash function $h$. If $h$ is generated from a random permutation $p$, i.e., taking a prefix of $p(x)$ as $h(x)$, then by considering negative dependence [4], a similar (a bit weaker) bound can be derived. From Proposition A.2(iii), with probability at least $1 - 1/(4w)$, we have $|L_i^{z_i}| \le \delta(w)\sqrt{w}$ for a group $L_i^{z_i}$. Given $|L_1^{z_1}| \le \delta(w)\sqrt{w}$, there are at most $\min\{k, \delta(w)\sqrt{w}\}$ groups $L_j^{z_j}$ involved in the analysis of (7). From the union bound, with probability at least $1 - (1 + \delta(w)\sqrt{w})/(4w)$, we have $|L_j^{z_j}| \le \delta(w)\sqrt{w}$ for all of these at most $1 + \delta(w)\sqrt{w}$ groups. So, with probability at least
$$\beta_1(w) = \Big(1 - \frac{1 + \delta(w)\sqrt{w}}{4w}\Big)\Big(1 - \frac{\delta(w)}{\sqrt{w}}\Big)^{\delta(w)\sqrt{w}}$$
(notice the independence between $g$ and $h$), we have $\bigcap_i h(L_i^{z_i}) = \emptyset$. If we instead use Proposition A.2(ii) to bound the probability of $|L_j^{z_j}| \le 3\sqrt{w}/2$ (then there are at most $\min\{k, 3\sqrt{w}/2\}$ groups $L_j^{z_j}$ involved in the analysis of (7)), we can derive a tighter bound in a similar way:
$$\beta_2(w) = \Big(1 - \big(1 + \tfrac{3}{2}\sqrt{w}\big)\exp\big(-\sqrt{w}/12\big)\Big)\Big(1 - \frac{3}{2\sqrt{w}}\Big)^{3\sqrt{w}/2}.$$

Thus, we have shown that the probability of successful filtering ($\beta_1(w)$ or $\beta_2(w)$ as a conservative lower bound) is at least a constant depending only on the machine-word width $w$ (but independent of the number and the sizes of the sets), and increases with $w$. It can be amplified to $1 - (1 - \beta(w))^m$ by using $m > 1$ word images of independent hash functions for filtering.

A.5.2 Filtering Performance in Practice

Figure 9: Filtering Performance in Experiments (Pr("Successful Filtering") measured on synthetic and real-life data, for m = 1, 2, 4, 6, 8)

In this section we evaluate the efficiency of the word images for filtering. In Figure 9, we have plotted the probability that, for different numbers $m$ of hash functions, a given pair of small groups with an empty intersection is filtered; as before, we use $w = 64$. As the datasets, we use the synthetic data from the first experiment in Section 4 (with an intersection size of 1% of the set size) and the 2-keyword queries described in the experiments on real data derived from Bing/Wikipedia. As we can see, the probabilities are very similar for both datasets, with slightly better filtering performance for the asymmetric real data. Moreover, the real-life successful-filtering probabilities are significantly better than the theoretical bounds derived in Lemma A.1 and Lemma A.3 (where $m = 1$).

A.5.3 Proof of Theorem 3.9

For any $z \in \{0,1\}^{t_k}$, let $z_i$ be its $t_i$-prefix (as in Algorithm 5). For computing $\bigcap_i L_i^{z_i}$, there are two cases: (i) If $\bigcap_i L_i^{z_i} = \emptyset$, from Lemma A.3, we have $\bigcap_i h_j(L_i^{z_i}) \ne \emptyset$ with probability at most $1 - \beta(w)$, and thus $\bigcap_i h_j(L_i^{z_i}) \ne \emptyset$ for all $j = 1, \ldots, m$ with probability at most $(1 - \beta(w))^m$. So we wastefully compute $\bigcap_i L_i^{z_i}$ with probability at most $(1 - \beta(w))^m$. (ii) If $\bigcap_i L_i^{z_i} \ne \emptyset$, we must have $\bigcap_i h_j(L_i^{z_i}) \ne \emptyset$ for all $j$, and compute $\bigcap_i L_i^{z_i}$. Case (ii) happens at most $r = |\bigcap_i L_i|$ times. We compute $\bigcap_i L_i^{z_i}$ using the linear merge algorithm in linear time $O(\sum_i |L_i^{z_i}|)$, or $O(k\sqrt{w})$ time in expectation.
In case (i), since there are $n_k/\sqrt{w}$ groups in $L_k$, over all groups this contributes a term $O(\max(n_1, n_k)(1 - \beta(w))^m)$; and in case (ii), this contributes a term $O(kr\sqrt{w})$ (since (ii) happens at most $r$ times). We also need to test whether $\bigcap_{i=1}^{k} h_j(L_i^{z_i}) \ne \emptyset$ for all $j = 1, \ldots, m$. Since there are $n/\sqrt{w}$ groups $L_i^{z_i}$, with careful memoization of partial results (e.g., reusing $h_j(L_1^{z_1}) \cap h_j(L_2^{z_2})$ when computing $h_j(L_1^{z_1}) \cap h_j(L_2^{z_2}) \cap h_j(L_3^{z_3})$ for both extensions $z_3$ of $z_2$), this contributes a term $O(mn/\sqrt{w})$ in total. So from the above analysis, Algorithm 5 needs a total of
$$O\Big(\frac{\max(n_1, n_k)}{\alpha(w)^m} + \frac{mn}{\sqrt{w}} + kr\sqrt{w}\Big) \qquad (8)$$
time in expectation, where $\alpha(w) = 1/(1 - \beta(w))$.

A.6 Analysis of Algorithm HashBin (Proof of Theorem 3.10)

For HashBin, the intuition is: in the resulting partitioning (with $2^t = n_1$ groups), we have $O(1)$ elements in each group $L_1^z$, and $O(n_2/n_1)$ elements in each group $L_2^z$. The expected running time is:
$$E\Big[\sum_{z \in \{0,1\}^t} |L_1^z| \log|L_2^z|\Big] = \sum_{z \in \{0,1\}^t} E[|L_1^z|] \log|L_2^z| \quad \text{(suppose the } L_2^z\text{'s are fixed)}$$
$$= \sum_{z \in \{0,1\}^t} \log|L_2^z| \quad \text{(because } E[|L_1^z|] = 1\text{)}$$
$$\le n_1 \log(n_2/n_1) \quad \text{(since } \textstyle\sum_z |L_2^z| = n_2 \text{ and } \log(\cdot) \text{ is concave)}.$$

A.6.1 HashBin using the Multi-resolution Structure

Algorithm HashBin works on a simplified version of the multi-resolution data structure (Figure 2) introduced in Section 3. Here, we use a random permutation $g: \Sigma \to \Sigma$ to partition sets into small groups. To pre-process $L_i$, we first order all the elements $x \in L_i$ according to $g(x)$. Then any small group $L_i^z = \{x \mid x \in L_i \text{ and } g_t(x) = z\}$ (for any $t$) corresponds to a consecutive interval in $L_i$. For each small group $L_i^z$, we only need to store its starting position $left(L_i^z)$ and ending position $right(L_i^z)$. For each $x \in L_1^z$, we need to check whether $x \in L_2^z$. Suppose $L_2^z = \{x_1, x_2, \ldots, x_s\}$. Although the elements in $L_2^z$ are not sorted in their own order, they are ordered as $g(x_1) < g(x_2) < \ldots < g(x_s)$ by the preprocessing. So to check whether $x \in L_2^z$, we can binary-search whether $g(x)$ is in $\{g(x_1), g(x_2), \ldots, g(x_s)\}$, since the random permutation $g$ is a one-to-one mapping from $\Sigma$ to $\Sigma$.
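A minimal sketch of this HashBin idea follows. It assumes $w = 64$-bit keys and replaces the random permutation $g$ with a multiplicative hash using an odd constant, which is a bijection on $[0, 2^{64})$ (an illustrative stand-in, not the paper's choice):

```python
from bisect import bisect_left

W = 64
MASK = (1 << W) - 1

def g(x):
    return (x * 0x9E3779B97F4A7C15) & MASK  # odd multiplier => bijection

class HashBinSet:
    def __init__(self, elems):
        self.order = sorted(elems, key=g)        # elements ordered by g(x)
        self.gvals = [g(x) for x in self.order]  # sorted g-values

    def group(self, t, z):
        """Interval [left, right) of L^z = {x : top t bits of g(x) equal z}."""
        left = bisect_left(self.gvals, z << (W - t))
        right = bisect_left(self.gvals, (z + 1) << (W - t))
        return left, right

def hashbin_intersect(A, B, t):
    out = []
    for z in range(1 << t):                      # for each pair of small groups
        lo1, hi1 = A.group(t, z)
        lo2, hi2 = B.group(t, z)
        for i in range(lo1, hi1):
            # binary-search g(x) inside the g-ordered interval of B's group
            j = bisect_left(B.gvals, A.gvals[i], lo2, hi2)
            if j < hi2 and B.gvals[j] == A.gvals[i]:
                out.append(A.order[i])
    return out

A = HashBinSet([3, 7, 19, 42, 100])
B = HashBinSet(range(0, 101, 7))                 # multiples of 7 up to 98
print(sorted(hashbin_intersect(A, B, t=2)))      # [7, 42]
```

Because $g$ is one-to-one, comparing $g$-values inside a group is equivalent to comparing the elements themselves, which is exactly what the binary search exploits.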

Figure 10: Preprocessing Overhead (construction time in ms vs. set size, 1M-10M; methods: HashBin, IntGroup, RanGroup, RanGroupScan, Sorting)

Figure 11: Preprocessing Overhead (with compression) (construction time in ms vs. set size, 64K-8M; methods: Sorting, RanGroupScan_Lobits, RanGroupScan_Gamma, RanGroupScan_Delta, Merge_Gamma, Merge_Delta)

B. COMPRESSION FOR ALGORITHM 5

For each small group $L_i^z$, we can use the standard techniques based on $\gamma$- and $\delta$-coding (see [23], p.6) to compress the elements stored sequentially at the end of the block associated with $L_i^z$. However, decoding $\gamma$- and $\delta$-codes is expensive. As an alternative, we describe a simple but effective (i.e., efficient in decoding) compression technique for Algorithm 5 in the following:
(i) Instead of storing the length len of each structure, we can store the size $|L_i^z|$, since the structure length len can be derived from $|L_i^z|$. As proved in Proposition A.2, $|L_i^z|$ is usually very small, so we store it using a unary code (e.g., 110 encodes 2).
(ii) Only if $|L_i^z| > 0$, we store $h_1(L_i^z), h_2(L_i^z), \ldots, h_m(L_i^z)$ in the following $m$ words.
(iii) To store the elements of $L_i^z$ in the remaining part of this block, we can use the standard techniques based on $\gamma$- or $\delta$-coding.

We present another compression technique here, which is specifically designed for our algorithm; its decoding is much more efficient than that of $\gamma$- or $\delta$-coding. First, for the purpose of partitioning sets into small groups, we use a random permutation as $g$. Then, assuming $|\Sigma| = 2^w$ (the worst case), instead of storing each $x \in L_i^z$, we store $lobits_{t_i}(x) = g(x) \bmod 2^{w-t_i}$, i.e., the lowest $w - t_i$ bits of $g(x)$; the remaining highest $t_i$ bits of $g(x)$ correspond to $z = g_{t_i}(x)$. In online processing, decoding in this compression scheme can be done efficiently: to get $g(x)$ for an element $x \in L_i^z$, we concatenate $z = g_{t_i}(x)$ with $lobits_{t_i}(x)$. Since $g$ is a one-to-one mapping from $\Sigma$ to $\Sigma$, the intersection of $L_1$ and $L_2$ is equivalent to the intersection of $g(L_1)$ and $g(L_2)$.

The following basic analysis establishes an upper bound on the space consumed by our compression technique. Recall $t_i = \lceil \log(n_i/\sqrt{w}) \rceil$, so there are $n_i/\sqrt{w}$ small groups in $L_i$ in total. Storing all of them requires: (i) $n_i + n_i/\sqrt{w}$ bits for the $|L_i^z|$'s (since $\sum_z |L_i^z| = n_i$); (ii) at most $m w \cdot n_i/\sqrt{w}$ bits for the $h_j(L_i^z)$'s; and (iii) $(w - t_i) \cdot n_i$ bits for all elements (we store $g(x) \bmod 2^{w-t_i}$).

C. ADDITIONAL EXPERIMENTS

C.1 Preprocessing Overhead

In this section, we evaluate the time taken to construct the novel structures when given a set $L_i$ as input. Our approach is similar to inverted indexes (and nearly all of the competing algorithms) in that the elements have to be sorted during pre-processing; thus, to put the construction overhead in perspective, we also measure and plot the overhead of sorting using an in-memory quicksort (averaging the time over random instances). Figure 10 shows the construction time for the data structures without compression for different set sizes $|L_i|$. Note that we use a log-scale on the y-axis to better separate the different graphs. As we can see, the additional construction overhead is generally a small fraction of the sorting overhead. Figure 11 shows the overhead for constructing the different compressed structures. We also plot the overhead for compressing the sets without additional hash images (resulting in the structures used in the compressed Merge, i.e., Merge_Gamma and Merge_Delta).
Again, the required overhead is only a small fraction of the sorting overhead; also, the preprocessing time for the Lobits compression scheme, which yields the best intersection performance in Section 4.1, is significantly lower than that of the alternatives.

C.2 More Experiments on Real Data

In this section, we present a breakdown of the experiments on real data in Section 4; to understand how the number of keywords in a query affects the relative performance in this scenario, we plot the distribution of average intersection times for 2-, 3- and 4-keyword queries separately in Figure 12. As we can see, the relative performances are similar to those seen earlier, with three exceptions: (a) the Merge algorithm performs worse with an increasing number of keywords (as it cannot leverage the asymmetry in any way); (b) in contrast, Hash performs increasingly better, but still remains (close to) the worst performer; and (c) for 4-keyword queries, RanGroup slightly outperforms RanGroupScan.

Figure 12: Normalized Execution Time on a Real Workload (panels: results for 2-, 3-, and 4-keyword queries; y-axis: normalized intersection time)
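Finally, to make the Lobits scheme from Appendix B concrete, here is a toy encode/decode sketch under the same assumptions as the HashBin sketch above ($w = 64$, a multiplicative-hash bijection standing in for the random permutation $g$; the packing of the low bits into each group's block is omitted):

```python
W = 64
MASK = (1 << W) - 1

def g(x):
    return (x * 0x9E3779B97F4A7C15) & MASK  # odd multiplier => bijection

def encode(x, t):
    """Split g(x) into the group id z (top t bits) and its low w - t bits."""
    gx = g(x)
    return gx >> (W - t), gx & ((1 << (W - t)) - 1)

def decode(z, lobits, t):
    """Concatenate z with the stored low bits to recover g(x)."""
    return (z << (W - t)) | lobits

# Round trip: since g is one-to-one, intersecting the g-images of two sets
# is equivalent to intersecting the sets themselves.
z, lo = encode(12345, t=10)
assert decode(z, lo, t=10) == g(12345)
```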
