Fast Set Intersection in Memory


Bolin Ding
University of Illinois at Urbana-Champaign
201 N. Goodwin Avenue, Urbana, IL 61801, USA
bding3@uiuc.edu

Arnd Christian König
Microsoft Research
One Microsoft Way, Redmond, WA 98052, USA
chrisko@microsoft.com

ABSTRACT

Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear-space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets with n elements in total, we will show how to compute their intersection in expected time O(n/√w + kr), where r is the intersection size and w is the number of bits in a machine word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state-of-the-art techniques for both synthetic and real data sets and workloads.

1. INTRODUCTION

Fast processing of set intersections is a key operation in many query processing tasks in the context of databases and information retrieval. For example, in the context of databases, set intersections are used in various forms of data mining, text analytics, and the evaluation of conjunctive predicates. They are also the key operations in enterprise and web search. Many of these applications are interactive, meaning that the latency with which query results are displayed is a key concern. It has been shown in the context of search that query latency is critical to user satisfaction, with increases in latency directly leading to fewer search queries being issued and higher rates of query abandonment [, 7]. As a consequence, significant portions of the sets to be intersected are often cached in main memory. This paper will study the performance of set intersection algorithms for main-memory resident data.
Note that these techniques are also relevant in the context of large disk-based (inverted) indexes, when large fractions of these reside in a main memory cache. There has been considerable study of set intersection algorithms in information retrieval (e.g., [2, 4, ]). Most of these papers assume that the underlying data structure is an inverted index [23]. Much of this work (e.g., [2, 4]) focuses on adaptive algorithms, which use the number of comparisons as a measure of overhead. For in-memory data, additional structures which encode skipping steps [8], tree-based structures [7], or hash-based algorithms become possible, which often outperform inverted indexes; e.g., using hash-based dictionaries, intersecting two sets L1, L2 requires expected time O(min(|L1|, |L2|)), which is a factor of Θ(log(1 + max(|L1|/|L2|, |L2|/|L1|))) better than the best possible worst-case performance of comparison-based algorithms [6]. In this work, we propose new set intersection algorithms aimed at fast performance. These outperform the competing techniques for most inputs and are also robust in that, for inputs where they are not optimal, they are close to the best-performing algorithm.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 37th International Conference on Very Large Data Bases, August 29th - September 3rd 2011, Seattle, Washington. Proceedings of the VLDB Endowment, Vol. 4, No. 4. Copyright 2011 VLDB Endowment ... $10.00.
The tradeoff for this gain is a slight increase in the size of the data structures when compared to an inverted index; however, in user-facing scenarios where latency is crucial, this tradeoff is often acceptable.

1.1 Contributions

Our approach leverages two key observations: (a) If w is the size (in bits) of a machine word, we can encode a set from a universe of w elements in a single machine word, allowing for very fast intersections. (b) For the data distributions seen in many real-life examples (in particular search applications), the size of intersections is typically much smaller than the smallest set being intersected. To illustrate the second observation, we analyzed the 10K most frequent queries issued against the Bing Shopping portal. For 94% of all queries it held that the size of the full intersection was at least one order of magnitude smaller than the document frequency of the least frequent keyword; for 76% of the queries the difference was two orders of magnitude. By exploiting these two observations, we make the following contributions.

(i) We introduce linear-space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. Given k sets with n elements in total, these data structures allow us to compute their intersection in expected time O(n/√w + kr), where r is the size of the intersection and w is the number of bits in a machine word; when the size of the intersection is an order of magnitude (or more) smaller than the size of the smallest set being intersected, our approach yields significant improvements in execution time over previous approaches. To the best of our knowledge, the best asymptotic bound for fast set intersection is achieved by the O((n log² w)/w + r) algorithm of [6]. However, note that this bound relies on a large value of w; in practice, w is small (and constant), and w ≤ 2^6 = 64 bits implies 1/√w < (log w)²/w.
More importantly, [6] requires complex bit-manipulation, making it slow in practice, which we will demonstrate empirically in Section 4. (ii) We describe a much simpler algorithm that computes the intersection in expected O(n/α^m + mn/w + kr) time, where α is a constant determined by w, and m is a parameter. This algorithm has weaker guarantees in theory, but performs better in practice, and gives significant improvements over the various data structures typically used, while being very simple to implement.
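Observation (a) can be made concrete with a short sketch (ours, not from the paper): a subset of a universe of w = 64 elements fits into one machine word, so intersecting two such subsets is a single bitwise AND. The helper names are hypothetical.

```python
W = 64  # bits per machine word assumed in this sketch

def word_of(s):
    """Encode a subset of {1, ..., W} as a W-bit integer bitmask."""
    word = 0
    for y in s:
        assert 1 <= y <= W
        word |= 1 << (y - 1)
    return word

def elements_of(word):
    """Decode a bitmask back into the subset it represents."""
    return {y + 1 for y in range(W) if (word >> y) & 1}

a = word_of({1, 2, 4, 9, 16})
b = word_of({2, 9, 33, 64})
print(sorted(elements_of(a & b)))  # [2, 9]: one AND intersects both sets
```

The rest of the paper is about reducing intersections of arbitrary sets to many such single-word intersections.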

2. BACKGROUND AND RELATED WORK

Algorithms based on Ordered Lists: Most work on set intersection focuses on ordered lists as the underlying data structure, in particular algorithms using inverted indexes, which have become the standard data structure in information retrieval. Here, documents are identified via a document ID, and for each term t, the inverted index stores a sorted list of all document IDs containing t. Using this representation, two sets L1, L2 of similar sizes (i.e., |L1| ≈ |L2|) can be intersected efficiently using a linear merge by scanning both lists in parallel, requiring O(|L1| + |L2|) operations (the merge step in merge sort). This approach is wasteful when set sizes differ significantly or only small fractions of the sets intersect. For very different set sizes, algorithms have been proposed that exploit this asymmetry, requiring at most log C(|L1| + |L2|, |L1|) comparisons (for |L1| < |L2|) [6]. To improve the performance further, there has recently been significant work on so-called adaptive algorithms for set intersections [2, 4, 3, , 2, 5]. These algorithms use the total number of comparisons as the measure of an algorithm's complexity and aim to use a number of comparisons as close as possible to the minimum number ideally required to establish the intersection. However, the resulting reduction in the number of comparisons does not necessarily result in performance improvements in practice: for example, in [2], a parallel scan outperforms binary search based algorithms unless |L2| > 20 |L1|, even though the latter need several times fewer comparisons.

Hierarchical Representations: There are various algorithms for set intersections based on variants of balanced trees (e.g., [9], treaps [7], and skip-lists [8]), computing the intersection of (preprocessed) sets L1, L2 in O(|L1| log(|L2|/|L1|)) operations (for |L1| < |L2|).
However, while some form of skipping is commonly used as part of algorithms based on inverted indexes, skip-lists (or trees) are typically not used in the scenarios outlined above (with static set data) due to the required space overhead. A novel and compact two-level representation of posting lists aimed at fast intersections in main memory was proposed in [9].

Algorithms based on Hashing: Using a hash-based representation of sets can speed up the intersection of sets L1, L2 with |L1| ≪ |L2| significantly (expected time O(|L1|) by looking up all elements of L1 in the hash table of L2); however, because of the added indirection, this approach performs poorly for less skewed set sizes. A new hashing-based approach is proposed in [6]: here, the elements in sets L1, L2 are mapped using a hash function h to smaller (approximate) representations h(L1), h(L2). These representations are then intersected to compute H = h(L1) ∩ h(L2). Finally, the set of all elements in the original sets that map to H via h is computed and any false positives are removed. As the hashed images h(L1), h(L2) to be intersected are smaller than the original sets (using fewer bits), they can be intersected more quickly. Given sets of total size n, their intersection can be computed in expected time O((n log² w)/w + r), where r = |∩_i Li|.

Score-based pruning: In many IR engines it is possible to avoid computing full intersections by leveraging scoring functions that are monotonic in the individual term-wise scores; this makes it possible to terminate the intersection processing early using approaches such as TA [5] or document-at-a-time (DAAT) processing (e.g., [8]). However, in practice, this is often not possible, either because of the complexity of the scoring function (e.g., non-monotonic machine-learning based ranking functions) or because full intersection results are required. Our approach is based on partitioning the elements in each set into very small (≤ 8 elements) groups, for which we have fast intersection schemes.
Hence, DAAT approaches can be combined with our work by using these small groups in place of individual documents.

Set intersections using multiple cores: Techniques that exploit multi-core architectures to speed up set intersections are described in [2, 22]. The use of multiple cores is orthogonal to our approach in the sense that our algorithms can be parallelized for these architectures as well; however, this is beyond the scope of our paper.

3. OUR APPROACH

Notation: We are given a collection of N sets S = {L1, ..., LN}, where Li ⊆ Σ and Σ is the universe of elements in the sets; let ni = |Li| be the size of set Li. Suppose the elements in a set are ordered, and for a set L, let inf(L) and sup(L) be the minimum and maximum elements of L, respectively. We use w to denote the size (number of bits) of a word on the target processor. Throughout the paper we will use log to denote log2. Finally, we use [w] to denote the set {1, ..., w}. Our approach can be extended to bag semantics by additionally storing element frequencies.

Framework: Our task is to design data structures such that the intersection of multiple sets can be computed efficiently. We differentiate between a pre-processing stage, during which we reorganize each set and attach additional index structures, and an online processing stage, which uses the pre-processed data structures to compute intersections. An intersection query is specified via a collection of k sets L1, L2, ..., Lk (to simplify notation, we use the offsets 1, 2, ..., k to refer to the sets in a query throughout this section); our goal is to compute L1 ∩ L2 ∩ ... ∩ Lk efficiently. Note that pre-processing is typical of most non-trivial data structures used for computing set intersections; even building simple non-compressed inverted indexes requires sorting the posting lists as a pre-processing step. We require the pre-processing stage to be time/space-efficient in that it does not require more than O(ni log ni) time (necessary for sorting) and linear space O(ni).
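The pre-processing stage can be sketched as follows; this is our illustration under stated assumptions, not the paper's implementation: we assume w = 64, a multiply-shift-style hash standing in for the universal hash function h : Σ → [w], and hypothetical helper names.

```python
import random

W = 64  # assumed machine word width (bits)

class UniversalHashIntoW:
    """A multiply-shift-style hash h : Sigma -> [W]; a stand-in for the
    universal hash family the framework assumes."""
    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.a = rng.randrange(1, 1 << 61, 2)  # random odd multiplier
        self.b = rng.randrange(1 << 61)

    def __call__(self, x):
        return ((self.a * x + self.b) >> 55) % W + 1  # value in {1,...,W}

def preprocess(L, group_size, h):
    """Pre-processing stage: sort L, cut it into fixed-width groups, and
    attach to each group the word representation of its image under h."""
    xs = sorted(L)
    groups = []
    for i in range(0, len(xs), group_size):
        grp = xs[i:i + group_size]
        image = 0
        for x in grp:
            image |= 1 << (h(x) - 1)  # set bit h(x) in the group's word
        groups.append((grp, image))
    return groups

h = UniversalHashIntoW(seed=42)
groups = preprocess(range(1000), group_size=W, h=h)
print(len(groups))  # 1000 elements cut into ceil(1000/64) = 16 groups
```

The online stage then intersects the per-group image words instead of the groups themselves, falling back to the elements only when an image AND is non-zero.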
The size of the intersection |L1 ∩ L2| is a lower bound on the time needed to compute the intersection. Our method leverages two key ideas to approach this lower bound: (i) The intersection of two sets in a small universe can be computed very efficiently; in particular, if the two sets are subsets of {1, 2, ..., w}, we can encode them as single machine words and compute their intersection using a bitwise-AND. (ii) A small number of elements in a large universe can be mapped into a small universe.

[Figure 1: Algorithmic Framework — each set Li is partitioned via sorting/hashing into groups Li^1, Li^2, ...; each group is mapped by h : Σ → [w] to a hash image h(Li^p); the group intersection L1^p ∩ L2^q is computed with the help of h(L1^p) ∧ h(L2^q).]

We leverage these two ideas by first partitioning each set Li into smaller groups Li^p, which are intersected separately. In the pre-processing stage, we map each small group into a small universe [w] = {1, 2, ..., w} using a universal hash function h, and encode the image h(Li^p) with a machine word. Then, in the online processing stage, to compute the intersection of two small groups L1^p and L2^q, we first use a bitwise-AND operation to compute H = h(L1^p) ∧ h(L2^q), and then try to recover L1^p ∩ L2^q from H using the inverse mapping h^-1. The union of the L1^p ∩ L2^q's forms L1 ∩ L2. Moreover, if the intersection L1 ∩ L2 is small compared to L1 and L2 (as seen in practice), a large fraction of the small groups with overlapping ranges have an empty intersection; thus, by using the word representations of H to detect these groups quickly, we can skip much unnecessary computation, resulting in significant speed-up. The resulting algorithmic framework is illustrated in Figure 1. Given this overall approach, the key questions become how to form the groups, what structures to use to represent them, and how to process intersections of these small groups. We discuss these details in the following sections. All formal proofs of analytical results are deferred to the appendix.

3.1 Intersection via Fixed-Width Partitions

We first consider the case when there are only two sets L1 and L2 in the intersection query. We present a pair of pre-processing and online processing algorithms, which we use to illustrate the basic ideas of our approach. We subsequently refine and extend our techniques to k sets in Section 3.2. In the pre-processing stage, L1 and L2 are sorted and partitioned into groups (recall that w is the word width) L1^1, L1^2, ..., L1^⌈n1/w⌉ and L2^1, L2^2, ..., L2^⌈n2/w⌉ of equal size w (except the last ones). In the online processing stage (Algorithm 1), the small groups are scanned in order. If the ranges of L1^p and L2^q overlap, we may have L1^p ∩ L2^q ≠ ∅. The intersection L1^p ∩ L2^q of each pair of overlapping groups is computed (line 8) in some iteration, and finally, the union of all these intersections is L1 ∩ L2. Since each group is scanned once, lines 2-10 repeat for O((n1 + n2)/w) iterations. The major remaining question now becomes: how to compute L1^p ∩ L2^q efficiently with proper pre-processing?
For this purpose, we map each group L1^p or L2^q into a small universe for fast intersection, and we leverage single-word representations to store and manipulate sets from a small universe.

Single-Word Representation of Sets: We represent a set A ⊆ [w] = {1, 2, ..., w} using a single machine word of width w by setting the y-th bit to 1 iff y ∈ A. We refer to this as the word representation w(A) of A. For two sets A and B, the bitwise-AND w(A) ∧ w(B) (computed in O(1) time) is the word representation of A ∩ B. Given a word representation w(A), all the elements of A can be retrieved in linear time O(|A|).¹ In the rest of this paper, if A ⊆ [w], we use A to denote both a set and its word representation.

¹We use the following well-known technique (⊕ is bitwise-XOR): (i) lobit = ((w(A) - 1) ⊕ w(A)) ∧ w(A) is the lowest 1-bit of w(A). For the smallest element y in A, we have 2^(y-1) = lobit; y = log(lobit) + 1 can be computed using the machine instruction NLZ (number of leading zeros) or pre-computed lookup tables. (ii) Set w(A) to w(A) ⊕ lobit and repeat (i) to scan the next smallest element, until w(A) becomes 0.

Pre-processing Stage: The elements in a set Li are sorted as {xi^1, xi^2, ..., xi^ni} (i.e., xi^j < xi^(j+1)), and Li is partitioned as follows:

Li^1 = {xi^1, ..., xi^w}, Li^2 = {xi^(w+1), ..., xi^(2w)}, ...    (1)
Li^p = {xi^((p-1)w+1), xi^((p-1)w+2), ..., xi^(pw)}, ...    (2)

For each small group Li^p, we compute the word representation of its image under a universal hash function h : Σ → [w], i.e., h(Li^p) = {h(x) | x ∈ Li^p}. In addition, for each position y ∈ [w] and each small group Li^p, we also maintain the inverted mapping h^-1(y, Li^p) = {x | x ∈ Li^p and h(x) = y}; i.e., for each y ∈ [w], we store the elements in Li^p with hash value y in a short list which supports ordered access. We ensure that the order of these elements is identical across the different h^-1(y, Li^p)'s and Li^p's; in this way, we can intersect these short lists using a linear merge.

EXAMPLE 3.1. (PRE-PROCESSING AND DATA STRUCTURES) Suppose we have two sets L1 = {1, 2, 4, 9, 16, 27, 43} and L2 = {1, 3, 5, 9, 11, 16, 22, 32, 34, 49}.
And let w = 16 (√w = 4). For simplicity, h is selected to be h(x) = x mod 16. L1 is partitioned into 2 groups: L1^1 = {1, 2, 4, 9}, L1^2 = {16, 27, 43}; and L2 is partitioned into 3 groups: L2^1 = {1, 3, 5, 9}, L2^2 = {11, 16, 22, 32}, L2^3 = {34, 49}. We pre-compute: h(L1^1) = {1, 2, 4, 9}, h(L1^2) = {0, 11}, h(L2^1) = {1, 3, 5, 9}, h(L2^2) = {0, 6, 11}, h(L2^3) = {1, 2}. We also pre-process the h^-1(y, Li^p)'s: for example, h^-1(0, L1^2) = {16}, h^-1(0, L2^2) = {16, 32}, h^-1(11, L1^2) = {27, 43}, and h^-1(11, L2^2) = {11}.

1: p ← 1, q ← 1, R ← ∅
2: while p ≤ ⌈n1/w⌉ and q ≤ ⌈n2/w⌉ do
3:   if inf(L2^q) > sup(L1^p) then
4:     p ← p + 1
5:   else if inf(L1^p) > sup(L2^q) then
6:     q ← q + 1
7:   else
8:     compute (L1^p ∩ L2^q) using IntersectSmall
9:     R ← R ∪ (L1^p ∩ L2^q)
10:    if sup(L1^p) < sup(L2^q) then p ← p + 1 else q ← q + 1
11: R is the result of L1 ∩ L2
Algorithm 1: Intersection via fixed-width partitioning

Online Processing Stage: The algorithm used to intersect two sets is shown in Algorithm 1. Since the elements in Li are sorted, Algorithm 1 ensures that if the ranges of any two small groups L1^p, L2^q overlap, their intersection is computed (line 8). After scanning all such pairs, R must then contain the intersection of the whole sets. Now the question is: how to compute the intersection of two small groups L1^p ∩ L2^q efficiently? For this purpose, we introduce the algorithm IntersectSmall (Algorithm 2), which: (i) first computes H = h(L1^p) ∧ h(L2^q) using a bitwise-AND; (ii) for each 1-bit y ∈ H, intersects the corresponding inverted mappings using the linear merge algorithm.

IntersectSmall(L1^p, L2^q): computing L1^p ∩ L2^q
1: Compute H ← h(L1^p) ∧ h(L2^q); Γ ← ∅
2: for each 1-bit y ∈ H do
3:   Γ ← Γ ∪ (h^-1(y, L1^p) ∩ h^-1(y, L2^q))
4: Γ is the result of L1^p ∩ L2^q
Algorithm 2: Computing the intersection of small groups

EXAMPLE 3.2. (ONLINE PROCESSING) Following Example 3.1, to compute L1 ∩ L2, we need to compute L1^1 ∩ L2^1, L1^2 ∩ L2^2, and L1^2 ∩ L2^3 (the pairs with overlapping ranges). For example, to compute L1^2 ∩ L2^2, we first compute h(L1^2) ∧ h(L2^2) = {0, 11}; then L1^2 ∩ L2^2 = ∪_{y=0,11} (h^-1(y, L1^2) ∩ h^-1(y, L2^2)) = {16}.
Similarly, we can compute L1^1 ∩ L2^1 = {1, 9}. Finally, we find h(L1^2) ∧ h(L2^3) = ∅, and thus L1^2 ∩ L2^3 = ∅. So we have L1 ∩ L2 = {1, 9} ∪ {16}.

Note that the word representations and inverted mappings for the Li^p's are pre-computed, and word representations can be intersected using one AND operation. So the running time of IntersectSmall is bounded by the number of pairs of elements, one from L1^p and one from L2^q, that are mapped to the same hash value. This number can be shown to be equal (in expectation) to the intersection size plus O(1) for each group Li^p. Using this, we obtain Algorithm 1's running time:

THEOREM 3.3. Algorithm 1 computes L1 ∩ L2 in expected O((n1 + n2)/√w + r) time, where r = |L1 ∩ L2|.

To achieve a better bound, we optimize the group sizes: with L1 and L2 partitioned into groups of sizes s1 = √(w·n1/n2) and s2 = √(w·n2/n1), respectively, L1 ∩ L2 can be computed in expected O(√(n1·n2/w) + r) time. A detailed analysis of the effect of group size on running times can be found in Section A.1.1.

Overhead of Pre-processing: If only the bound in Theorem 3.3 is required, then to pre-process a set Li of size ni, it is obvious that O(ni log ni) time and O(ni) space suffice: we only need to partition a sorted list into small groups of size w, and for each small group, construct the word representation and inverted mapping in linear time using the hash function h. To achieve the better bound O(√(n1·n2/w) + r), we need multiple resolutions of the partitioning of a set Li. This is because, as discussed above, the optimal group size s1 = √(w·n1/n2) of the set L1 also depends on the size n2 of the set L2 to be intersected with it. For this purpose, we partition a set Li into small groups of sizes 2, 4, ..., 2^j, etc. To compute L1 ∩ L2 for the given two sets, suppose si is the optimal group size for Li; we then select the actual group size si' = 2^t s.t. si ≤ si' ≤ 2si, obtaining the same bound. A carefully-designed multi-resolution data structure enabling access to these groups consumes only O(ni) space for Li. We will describe and analyze this structure in Section 3.2.1.

THEOREM 3.4. To pre-process a set Li of size ni for Algorithm 1, we need O(ni log ni) time and O(ni) space (in words).

Limitations of Fixed-Width Partitions: The main limitation of the proposed approach is that it is difficult to extend to more than two sets, because the partitioning scheme we use is not well-aligned for more than two sets: for three sets, e.g., there may be more than O((n1 + n2 + n3)/√w) triples of small groups that overlap. We introduce a different partitioning scheme to address this issue in Section 3.2, which extends to k > 2 sets.
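As a concrete sketch (ours; all helper names are hypothetical), the following runnable Python mirrors Algorithms 1 and 2 on the data of Examples 3.1-3.2, with w = 16 and h(x) = x mod 16. The 1-bit scan of H uses the low-bit trick from the footnote in Section 3.1.

```python
W = 16
h = lambda x: x % W  # the example's hash into the small universe (0..15 here)

def preprocess(L, s):
    """Partition sorted L into groups of size s; per group, store the
    image word h(group) and the inverted mapping h^-1(y, group)."""
    xs = sorted(L)
    groups = []
    for i in range(0, len(xs), s):
        g = xs[i:i + s]
        image, inv = 0, {}
        for x in g:
            image |= 1 << h(x)
            inv.setdefault(h(x), []).append(x)
        groups.append((g, image, inv))
    return groups

def intersect_small(gp, gq):
    """Algorithm 2: AND the image words, then merge the short inverted lists."""
    (_, ip, invp), (_, iq, invq) = gp, gq
    H, out = ip & iq, []
    while H:                        # scan 1-bits of H, smallest first
        lobit = ((H - 1) ^ H) & H   # footnote (i): isolate the lowest 1-bit
        y = lobit.bit_length() - 1
        out += [x for x in invp.get(y, []) if x in invq.get(y, [])]
        H ^= lobit                  # footnote (ii): clear it and repeat
    return out

def intersect(L1, L2, s1, s2):
    """Algorithm 1: advance two group cursors like a linear merge."""
    A, B = preprocess(L1, s1), preprocess(L2, s2)
    p = q = 0
    R = []
    while p < len(A) and q < len(B):
        ga, gb = A[p][0], B[q][0]
        if gb[0] > ga[-1]:          # inf(L2^q) > sup(L1^p)
            p += 1
        elif ga[0] > gb[-1]:        # inf(L1^p) > sup(L2^q)
            q += 1
        else:
            R += intersect_small(A[p], B[q])
            if ga[-1] < gb[-1]:
                p += 1
            else:
                q += 1
    return sorted(R)

L1 = [1, 2, 4, 9, 16, 27, 43]
L2 = [1, 3, 5, 9, 11, 16, 22, 32, 34, 49]
print(intersect(L1, L2, 4, 4))  # [1, 9, 16], matching Example 3.2
```

The inverted mappings here are plain Python dicts for readability; the paper's multi-resolution structure (Section 3.2.1) stores them implicitly in O(ni) words.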
3.2 Intersection via Randomized Partitions

In this section, we introduce an algorithm based on a randomized partitioning scheme to compute the intersection of two or more sets. The general approach is as follows: instead of fixed-width partitions, we use a hash function g to partition each set into small groups, using the most significant bits of g(x) to group an element x ∈ Σ. This reduces the number of combinations (pairs) of small groups we have to intersect, allowing us to prove bounds similar to Theorem 3.3 for computing intersections of k > 2 sets.

Pre-processing Stage: Let g : Σ → {0, 1}^∞ be a hash function mapping an element to a bit-string (or binary number); we use g_t(x) to denote the t most significant bits of g(x). We say that for two bit-strings z1 and z2, z1 is a t1-prefix of z2 iff z1 is identical to the highest t1 bits of z2; e.g., 0110 is a 4-prefix of 0110010. To pre-process a set Li, we partition it into groups Li^z = {x | x ∈ Li and g_t(x) = z} for all z ∈ {0, 1}^t (for some t). As before, we compute the word representation of the image of each Li^z under another hash function h : Σ → [w], and the inverted mappings h^-1.

Online Processing Stage: This stage is similar to our previous algorithm: to compute the intersection of two sets L1 and L2, we compute the intersections of pairs of overlapping small groups, one from each set, and finally take the union of these intersections. In general, suppose L1 is partitioned using g_t1 : Σ → {0, 1}^t1, and L2 is partitioned using g_t2 : Σ → {0, 1}^t2. Assume n1 ≤ n2 and t1 ≤ t2. We now intersect the sets L1 and L2 using Algorithm 3. The major improvement of Algorithm 3 compared to Algorithm 1 is that in Algorithm 1, we need to compute L1^p ∩ L2^q whenever the ranges of L1^p and L2^q overlap; in Algorithm 3, we compute L1^z1 ∩ L2^z2 (also using Algorithm 2) when z1 is a t1-prefix of z2 (this is a necessary condition for L1^z1 ∩ L2^z2 ≠ ∅; so Algorithm 3 is correct). This significantly reduces the number of pairs to be intersected.
1: for each z2 ∈ {0, 1}^t2 do
2:   Let z1 ∈ {0, 1}^t1 be the t1-prefix of z2
3:   Compute L1^z1 ∩ L2^z2 using IntersectSmall(L1^z1, L2^z2)
4:   R ← R ∪ (L1^z1 ∩ L2^z2)
5: R is the result of L1 ∩ L2
Algorithm 3: 2-list Intersection via Randomized Partitioning

Based on the choice of the parameters t1 and t2, we can either partition L1 and L2 into the same number of small groups (yielding the bound of Theorem 3.5), or into small groups of (approximately) identical sizes (yielding Theorem 3.6).

THEOREM 3.5. Algorithm 3 computes L1 ∩ L2 in expected O(√(n1·n2/w) + r) time (r = |L1 ∩ L2|), with t1 = t2 = ⌈log √(n1·n2/w)⌉.

THEOREM 3.6. Algorithm 3 computes L1 ∩ L2 in expected O((n1 + n2)/√w + r) time (r = |L1 ∩ L2|), using t1 = ⌈log(n1/√w)⌉ and t2 = ⌈log(n2/√w)⌉.

Note that when n1 ≪ n2, Theorem 3.5 gives a better bound than Theorem 3.6. But we can extend Theorem 3.6 to k-set intersection.

Extension to More Than Two Sets: Suppose we want to compute the intersection of k sets L1, ..., Lk, where ni = |Li| and n1 ≤ n2 ≤ ... ≤ nk. Li is partitioned into groups Li^z using g_ti : Σ → {0, 1}^ti. Note that the g_ti's are generated from the same hash function g. We use ti = ⌈log(ni/√w)⌉ and proceed as in Algorithm 4. Algorithm 4 is almost identical to Algorithm 3, but is generalized to k sets: for each z ∈ {0, 1}^tk, we pick the group identifiers zi to be the ti-prefix of z, and we only intersect groups L1^z1, L2^z2, ..., Lk^zk, where z1, z2, ..., zk share a prefix of size t1. Also, we extend IntersectSmall (Algorithm 2) to k groups: we first compute the intersection (bitwise-AND) of the hash images (their word representations) of the groups Li^zi; and, if the result H = ∧_{i=1..k} h(Li^zi) is not zero, for each 1-bit y ∈ H, we intersect the corresponding inverted mappings h^-1(y, Li^zi). Details and analysis are deferred to the appendix.

THEOREM 3.7. Using ti = ⌈log(ni/√w)⌉, Algorithm 4 computes the intersection ∩_{i=1..k} Li of k sets in expected O(n/√w + kr) time, where r = |∩_{i=1..k} Li| and n = Σ_{i=1..k} ni = Σ_{i=1..k} |Li|.
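The k-set extension just described can be sketched as follows; this is our toy Python rendering, with a fixed multiplicative mix standing in for the random permutation g, w = 16, and a heuristic choice of the t_i (the paper uses ti = ⌈log(ni/√w)⌉ with a universal hash family).

```python
W, B = 16, 32   # toy word width and key width assumed for this sketch

def g(x):
    """Stand-in for the paper's random permutation g: an invertible
    multiplicative mix of 32-bit keys (odd multiplier => bijection)."""
    return (x * 2654435761) & 0xFFFFFFFF

def h(x):
    return g(x) % W   # hash position in the small universe

def preprocess(L, t):
    """Group elements by the top t bits of g(x); per group, keep the
    hash-image word and the inverted mapping h^-1(y, group)."""
    parts = {}
    for x in L:
        z = g(x) >> (B - t)
        entry = parts.setdefault(z, [0, {}])
        entry[0] |= 1 << h(x)
        entry[1].setdefault(h(x), set()).add(x)
    return parts

def intersect(sets):
    """k-set intersection in the spirit of Algorithm 4: iterate over the
    finest partition; match groups whose identifiers share prefixes."""
    ts = [max(1, (len(L) // 4).bit_length()) for L in sets]  # heuristic t_i
    parts = [preprocess(L, t) for L, t in zip(sets, ts)]
    k = max(range(len(sets)), key=lambda i: ts[i])
    R = set()
    for z in parts[k]:
        groups = []
        for t, part in zip(ts, parts):
            zi = z >> (ts[k] - t)          # the t_i-prefix of z
            if zi not in part:
                break                      # some set has no matching group
            groups.append(part[zi])
        else:
            H = groups[0][0]
            for image, _ in groups[1:]:
                H &= image                 # AND of all k hash images
            y = 0
            while H:                       # recover elements per 1-bit y
                if H & 1:
                    cand = set(groups[0][1][y])
                    for _, inv in groups[1:]:
                        cand &= inv.get(y, set())
                    R |= cand
                H >>= 1
                y += 1
    return sorted(R)

A, B2, C = range(0, 60, 2), range(0, 60, 3), range(0, 60, 5)
print(intersect([list(A), list(B2), list(C)]))  # [0, 30]
```

Because candidates are checked against the actual inverted-mapping element sets, hash collisions cannot introduce false positives; they only cost extra comparisons, which is exactly what the expected-time analysis bounds.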
1: for each z ∈ {0, 1}^tk (with ti = ⌈log(ni/√w)⌉) do
2:   Let zi be the ti-prefix of z, for i = 1, ..., k
3:   Compute ∩_{i=1..k} Li^zi using the extended IntersectSmall
4:   R ← R ∪ (∩_{i=1..k} Li^zi)
5: R is the result of ∩_{i=1..k} Li
Algorithm 4: k-list Intersection via Randomized Partitioning

3.2.1 A Multi-resolution Data Structure

Recall that in some algorithms (e.g., Theorem 3.5), the selection of the number of small groups used for a set Li depends on the (size of) other sets being intersected with Li. So by naively precomputing the required structures for each possible group size, we would incur excessive space requirements. In this section, we describe a data structure that supports access to the partitions of Li into 2^t groups for any possible t, using only O(ni) space. It is illustrated in Figure 2. To support the algorithms introduced so far, this structure must also allow us: (i) for each Li^z, to retrieve the word representation h(Li^z), and

[Figure 2: Multi-Resolution Partition of Li — Li is partitioned at successive resolutions by g1 : Σ → {0, 1}, g2 : Σ → {0, 1}^2, g3 : Σ → {0, 1}^3, ..., g_t : Σ → {0, 1}^t; for each y, a pointer leads to the first element x in Li^z s.t. h(x) = y, and next(x) is the smallest x' ∈ Li s.t. x' > x and h(x') = h(x).]

(ii) for each y ∈ [w], to access all elements in h^-1(y, Li^z) = {x | x ∈ Li^z and h(x) = y} in time linear in its size |h^-1(y, Li^z)|.

Multi-resolution Partitioning: For ease of explanation, we suppose Σ = {0, 1}^ℓ and choose g as a random permutation of Σ. To pre-process Li, we first order all the elements x ∈ Li according to g(x). Then any small group Li^z = {x | x ∈ Li and g_t(x) = z} forms a consecutive interval in Li (partitions of different resolutions are formed for t = 1, 2, ...). Note: in all of our algorithms, universal hash functions and random permutations are almost interchangeable (when used as g), the differences being that (i) a permutation induces a total ordering of the elements (in this data structure, this property is required), whereas hashing may result in collisions (which we can overcome by using the pre-image to break ties), and (ii) there is a slight difference in the resulting probability of, e.g., elements being grouped together (hashing results in (limited) independence, whereas permutations result in negative dependence; we account for this by using the weaker condition in our proofs).

Word Representations of Hash Mappings: Now, for each small group Li^z, we need to pre-compute and store the word representation h(Li^z). Note that the total number of small groups is ni/2 + ni/4 + ... + ni/2^t + ... ≤ ni. So this requires O(ni) space.

Inverted Mappings: We need to access all elements in h^-1(y, Li^z) in order, for each y ∈ [w]. If we were to store these mappings for each Li^z explicitly, this would require O(ni log ni) space.
However, by storing the inverted mappings h^-1(y, Li^z) implicitly, we can do better, as follows: For each group Li^z, since it corresponds to an interval in Li, we can store its starting and ending positions in Li, denoted by left(Li^z) and right(Li^z). These allow us to determine whether an element x belongs to Li^z. Now, to enable ordered access to the inverted mappings, we define, for each x ∈ Li, next(x) to be the next element x' to the right of x s.t. h(x') = h(x) (i.e., with minimum g(x') > g(x) s.t. h(x') = h(x)). Then, for each Li^z and each y ∈ [w], we store the position first(y, Li^z) of the first element x' in Li^z s.t. h(x') = y. Now, to access all elements in h^-1(y, Li^z) in order, we can start from the element at first(y, Li^z) and follow the pointers next(x), until passing the right boundary right(Li^z). In this way, all elements in the inverted mapping are retrieved in the same order as g(x), which is what we require for IntersectSmall.

Space Requirements: For all groups of different sizes, the total space for storing the h(Li^z)'s, left(Li^z)'s, right(Li^z)'s, first(y, Li^z)'s and next(x)'s is O(ni). So the whole multi-resolution data structure requires O(ni) space. A detailed analysis is in the appendix. When the group size ti depends only on ni (e.g., in Algorithm 4), a single resolution suffices in pre-processing, and the above multi-resolution scheme (for selecting ti online) is not necessary.

THEOREM 3.8. To pre-process a set Li of size ni for Algorithms 3-4, we need O(ni log ni) time and O(ni) space (in words).

3.3 From Theory to Practice

In this section, we describe a more practical version of our methods. This algorithm is simpler, uses significantly less memory and straightforward data structures, and, while it has worse theoretical guarantees, is faster in practice.
The main difference is that for each small group Li^z, we only store the elements of Li^z and their images under m hash functions (i.e., we do not maintain inverted mappings, trading off a complex O(1)-access for a simple scan over a short block of data). Also, we use only a single partition for each set Li. Having multiple word representations of hash images (under different hash functions) for each small group allows us to detect empty intersections of small groups with higher probability.

Pre-processing Stage: As before, each set Li is partitioned into groups Li^z using a hash function g_ti : Σ → {0, 1}^ti. We will show that a good selection of ti is ⌈log(ni/w)⌉, which depends only on the size of Li. Thus for each set Li, pre-processing with a single partitioning suffices, saving significant memory. For each group, we compute the word representations of its images under m (independent) universal hash functions h1, ..., hm : Σ → [w]. Note that we only require a small value of m in practice (e.g., m = 2).

Online Processing Stage: The algorithm for computing ∩_i Li that we use here (Algorithm 5) is identical to Algorithm 4, with two exceptions: (1) When needed, ∩_i Li^zi is directly computed by a linear merge of the Li^zi's (line 4), using O(Σ_i |Li^zi|) time. (2) We can skip the computation of ∩_i Li^zi if, for some hj, the bitwise-AND of the corresponding word representations hj(Li^zi) is zero (line 3).

1: for each z ∈ {0, 1}^tk (with ti = ⌈log(ni/w)⌉) do
2:   Let zi be the ti-prefix of z, for i = 1, ..., k
3:   if ∧_{i=1..k} hj(Li^zi) ≠ 0 for all j = 1, ..., m then
4:     Compute ∩_{i=1..k} Li^zi by a linear merge of L1^z1, ..., Lk^zk
5:     R ← R ∪ (∩_{i=1..k} Li^zi)
6: R is the result of ∩_{i=1..k} Li
Algorithm 5: Simple Intersection via Randomized Partitioning

Analysis: To see why Algorithm 5 is efficient, we observe that if L1^z1 ∩ L2^z2 = ∅, then with high probability, hj(L1^z1) ∧ hj(L2^z2) = 0 for some j = 1, ..., m. So most empty intersections can be skipped using the test in line 3. With the probability of a successful filtering (i.e.,
given ∩_i L_i^{z_i} = ∅, that ∧_i h_j(L_i^{z_i}) = 0 for some hash function h_j, j = 1, ..., m) bounded by Lemmas A.1 and A.3, we can derive Theorem 3.9. A detailed analysis of this probability (both theoretical and experimental) and of the overall complexity is deferred to Appendix A.5.

THEOREM 3.9. Using t_i = log(n_i/√w), Algorithm 5 computes ∩_{i=1}^k L_i in expected O(max(n_1, n_2)/√w + mn·α(w)^m + r) time (r = |∩_{i=1}^k L_i|, n = Σ_{i=1}^k n_i, α(w) = 1/β(w) for the β(w) used in Lemma A.3).

3.3.1 Data Structure for Storing L_i^z

In this section, we describe the simple and space-efficient data structure that we use in Algorithm 5. As stated earlier, we only need to partition L_i using one hash function g_{t_i}; hence we can represent each L_i as an array of small groups L_i^z, ordered by z. For each small group, we store the information associated with it in the structure shown in Figure 3. The first word in this structure stores z = g_{t_i}(L_i^z). The second word stores the structure's length len. The following m words store the hash images h_1(L_i^z), ..., h_m(L_i^z) of L_i^z. Finally, we store the elements of L_i^z as an array in the remaining part. We need n_i/√w such blocks for

Figure 3: The Structure for a Pre-processed Small Group L_i^z (one word for z, one word for len, m words for h_1(L_i^z), ..., h_m(L_i^z), followed by the len elements)

L_i in total. The first word z can also be computed on-the-fly, as the small groups are accessed sequentially in Algorithm 5. So, if we store len using one word, and one word for each element of L_i^z, then we need in total m + 1 + |L_i^z| words for each group L_i^z, and thus n_i(1 + (m + 1)/√w) words to store the pre-processed L_i. The overhead of the pre-processing is dominated by the cost of sorting L_i (the remaining operations are trivial).

THEOREM 3.10. To pre-process a set L_i of size n_i for Algorithm 5, we need O(n_i(m + log n_i)) time, and O(n_i(1 + m/√w)) words of space.

We describe methods for compressing this structure in Appendix B.

3.4 Intersecting Small and Large Sets

An important special case of set intersection are asymmetric intersections, where the sizes n_1 and n_2 of the sets that are intersected vary significantly (w.l.o.g., assume n_1 ≤ n_2). In this subsection, using the same multi-resolution data structure as in Section 3.2.1, we present an algorithm HashBin that computes L_1 ∩ L_2 in O(n_1 log(n_2/n_1)) time. This bound is also achieved by previous work, e.g., SmallAdaptive [5], but our algorithm has even simpler online processing. It is also known that algorithms based on hash tables require only O(n_1) time in this scenario; however, unlike HashBin, they are ill-suited for less asymmetric cases.

Algorithm HashBin: When intersecting two sets L_1 and L_2 with sizes n_1 ≤ n_2, we focus on the partitioning induced by g_t: Σ → {0,1}^t, where t = log n_1, for both of them; g is a random permutation of Σ. To compute L_1 ∩ L_2, we compute L_1^z ∩ L_2^z for all z ∈ {0,1}^t and take the union. To compute L_1^z ∩ L_2^z, we iterate over each x ∈ L_1^z and perform a binary search to check whether x ∈ L_2^z, using O(log |L_2^z|) time. This scheme can be extended to multiple sets by searching for x in L_{i+1}^z only if x was found in L_2^z, ..., L_i^z.

THEOREM 3.11. The algorithm HashBin computes L_1 ∩ L_2 in expected O(n_1 log(n_2/n_1)) time.
The pre-processing of a list L_i requires O(n_i log n_i) time and O(n_i) space.

The proof of Theorem 3.11 and the way HashBin uses the multi-resolution data structure are deferred to Section A.6 in the appendix. The advantage of HashBin is that, since it is based on the same structure as the algorithm introduced in Section 3.2, we can make the choice between the algorithms online, based on n_1/n_2.

4. EXPERIMENTAL EVALUATION

We evaluate the performance and space requirements of four of the techniques described in this paper: (a) the fixed-width partition algorithm described in Section 3.1 (which we will refer to as IntGroup); (b) the randomized partition algorithm in Section 3.2 (RanGroup); (c) the simple algorithm based on randomized partitions described in Section 3.3 (RanGroupScan); and (d) the one for intersecting sets of skewed sizes in Section 3.4 (HashBin).

Setup: All algorithms are implemented in C and evaluated on a 4GB 64-bit 2.4GHz PC. We employ a random permutation of the document IDs for the hash function g and 2-universal hash functions for h (or the h_j's). For RanGroup, we use m = 4 (the number of hash functions h_j), unless noted otherwise. We compare our techniques to the following competitors: (i) set intersection based on a simple parallel scan of inverted indexes: Merge; (ii) set intersection based on skip lists: SkipList [18]; (iii) set intersection based on hash tables: Hash (i.e., we iterate over the smallest set L_1, looking up every element x ∈ L_1 in hash-table representations of L_2, ..., L_k); (iv) the algorithm of [6]: BPP; (v) the algorithm proposed for fast intersection in integer inverted indices in main memory [19, 21]: Lookup (using B = 32 as the bucket size, which is the best value in our and the authors' experience); and (vi) various adaptive intersection algorithms based on binary search/galloping search: SvS, Adaptive [12, 3, 13], BaezaYates [1, 2], and SmallAdaptive [5]. Note that BaezaYates is generalized to handle more than two sets as in [5].
Implementation: For each competitor, we tried our best to optimize its performance. For example, for Merge we tried to minimize the number of branches in the inner loop; we also store postings in consecutive memory addresses to speed up parallel scans and reduce page walks after TLB misses. Our implementation of skip lists follows [18], with simplifications since we are focusing on static data and do not need fast insertion/deletion. We also simplified the bit-manipulation in BPP [6] so that it works faster in practice for small k. For the algorithms using inverted indexes, we initially do not consider compression of the posting lists, as we do not want the decompression step to impact the reported performance. In Section 4.1 we will study variants of the algorithms incorporating compression. With regard to skip-operations in the index, note that since we use uncompressed posting lists, algorithms such as Adaptive can perform arbitrary skips into the index directly.

Datasets: To evaluate these algorithms we use both synthetic and real data. For the experiments with synthetic datasets, sets are generated randomly (and uniformly) from a universe Σ. The real dataset is a collection of more than 8M Wikipedia pages. In each experiment for the synthetic datasets, 20 combinations of sets are randomly generated, and the average time is reported.

Figure 4: Varying the Set Size (algorithms shown: Merge, SkipList, Hash, IntGroup, BPP, Adaptive, Lookup, RanGroupScan)

Varying the Set Size: First, we measure the performance when intersecting only 2 sets; we use synthetic data, the lists are of equal size, and the size of the intersection is fixed at 1% of the list size; the results are shown in Figure 4. We can see that the performance of the different techniques relative to each other does not change with varying list size. Hash performs worst, as the (relatively) expensive lookup operation needs to be performed many times. SkipList performs poorly for the same reason.
The BPP algorithm is also slow, but this is because of a number of complex operations that need to be performed, which are hidden as a constant in the O-notation. The same trend held for the remaining experiments as well; hence, for readability, we did not include BPP in the subsequent graphs. For the same reason we only show the best-performing among the adaptive algorithms in the evaluation; if one adaptive algorithm dominates another on all parameter settings in an experiment, we don't plot the worse one. Among the remaining algorithms, RanGroupScan (40%-50% faster than Merge) and IntGroup perform the best (RanGroup performs similarly to IntGroup and is not plotted). Interestingly,

Figure 5: Varying the Intersection Size (Merge performs best for |L_1 ∩ L_2| > 0.7|L_1|). Figure 6: Varying the Number of Keywords (K = 2, 3, 4).

the simple Merge algorithm is next, outperforming the more sophisticated algorithms, followed by Lookup and the best-performing adaptive algorithm.

Varying the Intersection Size: The size of the intersection r is an important factor for the performance of the algorithms: larger intersections mean fewer opportunities to eliminate small groups early (for our algorithms) or to skip parts of the set (for the adaptive and skip-list-based approaches). Here, we use synthetic data, intersecting two sets with 10M elements each and varying r = |L_1 ∩ L_2| between 5 and 10M. The results are reported in Figure 5. For r < 7M (70% of the set size) RanGroupScan and IntGroup perform best. Otherwise, Merge becomes the fastest and RanGroupScan the 2nd-fastest alternative; here, the performance of RanGroupScan is very similar to Merge, all the way to r = 10M. Among the remaining algorithms, RanGroup slightly outperforms Merge for r < 5M, Lookup is the next-best algorithm, and SvS and Adaptive perform best among the adaptive algorithms.

Varying the Set Size Ratios: As we illustrated in the introduction, the skew in set sizes is also an important factor in performance. When sets are very different in size, algorithms that iterate through the smaller set and are able to locate the corresponding values in the larger set quickly, such as HashBin and Hash, perform well. In this experiment we use synthetic data and vary the ratio of set sizes, setting |L_2| = 10M and varying |L_1| between 16K and 10M. The size of the intersection is set to be 1% of |L_1| and we define the ratio between the list sizes as sr = |L_2|/|L_1|. Here, the differences between the algorithms become small with growing sr (for this reason, we also don't report them in a graph, as too many lines overlap).
For sr < 32, RanGroupScan performs best; for larger sr, Lookup and Hash perform best, until a certain ratio of sr; from there on, Hash outperforms the remaining algorithms, followed by Lookup and HashBin. Generally, both HashBin and RanGroupScan perform close to the best-performing algorithm. The adaptive algorithms require more time than RanGroupScan for sr ≤ 2 and more time than HashBin for all values of sr; SkipList and BPP perform worst across all values of sr.

Varying the Number of Keywords: In this experiment, we varied the number of sets K = 2, 3, 4, fixing |L_i| = 10M for i = 1, ..., K, with the IDs in the sets being randomly generated using a uniform distribution over [1, 2·10^8]; the results are reported in Figure 6. In this experiment, we use m = 2 hash images for RanGroupScan. For multiple sets, RanGroupScan is the fastest, with the difference becoming more pronounced for 3 and 4 keywords, since, with additional sets, intersecting the hash images (word representations) yields more empty results, allowing us to skip the corresponding groups. RanGroup is the next-best performing algorithm; we don't include results for IntGroup here, as it is designed for intersections of two sets (see Section 3.1). Interestingly, the simple Merge algorithm again performs very well when compared to the more sophisticated techniques; the Lookup algorithm is next, followed by the various adaptive techniques.

Size of the Data Structure: The improvements in speed come at the cost of an increase in space: our data structures (without compression) require more space than an uncompressed posting list; the increase is 37% (RanGroupScan for m = 2), 63% (RanGroupScan for m = 4), 75% (IntGroup) or 87% (RanGroup).

Figure 7: Normalized Execution Time on a Real Workload

Experiment on Real Data: In this experiment, we used a workload of the 4 most frequent queries (measured over a week in 2009) against the Bing.com search engine.
As the text corpus, we used a set of 8 million Wikipedia documents. Query characteristics: 68% of the queries contain 2 keywords, 23% 3 keywords, and 6% 4 keywords. As we have illustrated before, a key factor for performance is the ratio of set sizes: among the 2-word queries, the average ratio |L_1|/|L_2| is 0.2; for 3-word queries, the average ratio |L_1|/|L_2| is 0.3 and the average ratio |L_1|/|L_3| is 0.09; and for 4-word queries, the |L_1|/|L_2| ratio is 0.36 and the |L_1|/|L_4| ratio is 0.06 (note that |L_1| ≤ |L_2| ≤ |L_3| ≤ |L_4|). The average ratio of intersection size to |L_1| is 0.09. To illustrate the relative performance of the algorithms over all queries, we plot their average running times in Figure 7; here, the running time of Merge is normalized to 1. Both RanGroup and RanGroupScan significantly outperform Merge, with the latter performing best overall; interestingly, when used for all queries (as opposed to only the highly skewed cases it was designed for), HashBin still performed better than Merge. The remaining algorithms performed in a similar order to the earlier experiments, the one exception being SvS, which outperformed both Merge and Lookup on this more realistic data. Overall, RanGroupScan was the best-performing algorithm for 61.6% of the queries, followed by RanGroup (16%) and HashBin (7.7%); among the remaining algorithms not proposed in this paper, Lookup performed best in 6.4% of the queries and SvS in 3.6% of the queries. All of the other techniques were best for 2.0% of the queries or fewer. We present additional experiments for this data set in Appendix C.2.

4.1 Experiments on Compressed Structures

To illustrate the impact of compression on performance, we repeated the first experiment above, intersecting two sets of identical size, with the size of the intersection fixed to 1% of the set size. Varying the set size, we report the execution times and storage requirements for the three algorithms that performed best overall in the earlier experiments, Merge, Lookup and RanGroupScan (since we are interested in small structures here, we only use m = 1 hash image in RanGroupScan), when compressed with different techniques: we used the standard techniques based on γ- and δ-coding (see [23]) to compress the parts of the posting data stored and accessed sequentially for the three algorithms, and the compression technique described in Appendix B for RanGroupScan (which we refer to as RanGroupScan Lowbits). The results are shown in Figure 8; here, we omitted the results for γ-encoding as they were essentially indistinguishable from the ones for δ-coding. RanGroupScan outperforms, in terms of speed, the other two algorithms using the same compression scheme; the other two algorithms perform similarly to each other, as decompression now dominates their run-time. Using our encoding scheme of Appendix B improves the performance significantly. Looking at the graph, we can see that the storage requirement for RanGroupScan (using our own encoding) is between 1.3-1.9x the size of the compressed inverted index and between 1.2-1.6x that of the compressed Lookup structure. At the same time, the performance improvements are between 7.6-15x (vs. Merge) or 7.4-13x (vs. Lookup). Furthermore, by increasing the number of hash images to m = 2, we obtain an algorithm that significantly outperforms the uncompressed Merge, while requiring less memory.
Figure 8: Running Time and Space Requirement (Merge_Delta, Lookup_Delta, RanGroupScan_Lowbits, RanGroupScan_Delta; 128K-8M postings)

Experiment on Real Data: We repeated this experiment using the real-life data/workload described earlier and the compressed variants of RanGroupScan, Lookup and Merge. Again, RanGroupScan Lowbits performed best, improving run-times by a factor of 8.4x (vs. Merge + δ-coding), 9.1x (Merge + γ-coding), 5.7x (Lookup + δ-coding), and 6.2x (Lookup + γ-coding), respectively. However, our approach required the most space (66% of the uncompressed data), whereas Merge (26% / 28% for γ-/δ-coding) and Lookup (35% / 37%) required significantly less. Finally, to illustrate the robustness of our techniques, we also measured the worst-case latency for any single query: here, the worst-case latency using Merge + δ-coding was 5.2x the worst-case latency of RanGroupScan Lowbits. We saw similar results for Merge + γ-coding (5.6x higher), Lookup + δ-coding (4.4x higher), and Lookup + γ-coding (4.9x higher).

5. CONCLUSION

In this paper we introduced algorithms for set intersection processing on memory-resident data. Our approach provides both novel theoretical worst-case guarantees as well as very fast performance in practice, at the cost of increased storage space. Our techniques outperform a wide range of existing techniques and are robust in that, for inputs for which they are not the best-performing approach, they perform close to the best one. Our techniques have applications in information retrieval and query processing scenarios where performance is of greater concern than space.

6. REFERENCES
[1] R. A. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In CPM, pages 400-408, 2004.
[2] R. A. Baeza-Yates and A. Salinger.
Experimental Analysis of a Fast Intersection Algorithm for Sorted Sequences. In SPIRE, pages 13-24, 2005.
[3] J. Barbay and C. Kenyon. Adaptive intersection and t-threshold problems. In SODA, 2002.
[4] J. Barbay, A. López-Ortiz, and T. Lu. Faster Adaptive Set Intersections for Text Searching. In 5th WEA, pages 146-157, 2006.
[5] J. Barbay, A. López-Ortiz, T. Lu, and A. Salinger. An experimental investigation of set intersection algorithms for text searching. ACM Journal of Experimental Algorithmics, 14, 2010.
[6] P. Bille, A. Pagh, and R. Pagh. Fast Evaluation of Union-Intersection Expressions. In ISAAC, 2007.
[7] G. E. Blelloch and M. Reid-Miller. Fast Set Operations using Treaps. In ACM SPAA, pages 16-26, 1998.
[8] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In CIKM, 2003.
[9] M. R. Brown and R. E. Tarjan. A Fast Merging Algorithm. Journal of the ACM, 26(2):211-226, 1979.
[10] J. Brutlag. Speed Matters for Google Web Search. 2009.
[11] E. Chiniforooshan, A. Farzan, and M. Mirzazadeh. Worst case optimal union-intersection expression evaluation. In ALENEX, pages 179-190, 2011.
[12] E. Demaine, A. López-Ortiz, and J. Munro. Adaptive Set Intersections, Unions, and Differences. In SODA, 2000.
[13] E. Demaine, A. López-Ortiz, and J. Munro. Experiments on Adaptive Set Intersections for Text Retrieval Systems. In ALENEX, pages 91-104, 2001.
[14] D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge, 2009.
[15] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, pages 102-113, 2001.
[16] F. K. Hwang and S. Lin. A Simple Algorithm for Merging Two Disjoint Linearly Ordered Sets. SIAM Journal on Computing, 1(1):31-39, 1972.
[17] G. Linden.
[18] W. Pugh. A skip list cookbook. Technical Report UMIACS-TR, University of Maryland, 1990.
[19] P. Sanders and F. Transier. Intersection in Integer Inverted Indices. In ALENEX, pages 71-83, 2007.
[20] S. Tatikonda, F.
Junqueira, B. B. Cambazoglu, and V. Plachouras. On Efficient Posting List Intersection with Multicore Processors. In ACM SIGIR, 2009.
[21] F. Transier and P. Sanders. Compressed inverted indexes for in-memory search engines. In ALENEX, pages 3-12, 2008.
[22] D. Tsirogiannis, S. Guha, and N. Koudas. Improving the performance of list intersection. PVLDB, 2(1), 2009.
[23] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, 1999.


MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS 1. SYSTEMS OF EQUATIONS AND MATRICES 1.1. Representation of a linear system. The general system of m equations in n unknowns can be written a 11 x 1 + a 12 x 2 +

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

Chapter Objectives. Chapter 9. Sequential Search. Search Algorithms. Search Algorithms. Binary Search

Chapter Objectives. Chapter 9. Sequential Search. Search Algorithms. Search Algorithms. Binary Search Chapter Objectives Chapter 9 Search Algorithms Data Structures Using C++ 1 Learn the various search algorithms Explore how to implement the sequential and binary search algorithms Discover how the sequential

More information

Symbol Tables. Introduction

Symbol Tables. Introduction Symbol Tables Introduction A compiler needs to collect and use information about the names appearing in the source program. This information is entered into a data structure called a symbol table. The

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES Contents 1. Random variables and measurable functions 2. Cumulative distribution functions 3. Discrete

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

Lecture 3: Finding integer solutions to systems of linear equations

Lecture 3: Finding integer solutions to systems of linear equations Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture

More information

Storage Systems Autumn 2009. Chapter 6: Distributed Hash Tables and their Applications André Brinkmann

Storage Systems Autumn 2009. Chapter 6: Distributed Hash Tables and their Applications André Brinkmann Storage Systems Autumn 2009 Chapter 6: Distributed Hash Tables and their Applications André Brinkmann Scaling RAID architectures Using traditional RAID architecture does not scale Adding news disk implies

More information

1 Solving LPs: The Simplex Algorithm of George Dantzig

1 Solving LPs: The Simplex Algorithm of George Dantzig Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

More information

Polynomial Degree and Lower Bounds in Quantum Complexity: Collision and Element Distinctness with Small Range

Polynomial Degree and Lower Bounds in Quantum Complexity: Collision and Element Distinctness with Small Range THEORY OF COMPUTING, Volume 1 (2005), pp. 37 46 http://theoryofcomputing.org Polynomial Degree and Lower Bounds in Quantum Complexity: Collision and Element Distinctness with Small Range Andris Ambainis

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) 3 4 4 7 5 9 6 16 7 8 8 4 9 8 10 4 Total 92.

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) 3 4 4 7 5 9 6 16 7 8 8 4 9 8 10 4 Total 92. Name: Email ID: CSE 326, Data Structures Section: Sample Final Exam Instructions: The exam is closed book, closed notes. Unless otherwise stated, N denotes the number of elements in the data structure

More information

A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES

A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES ULFAR ERLINGSSON, MARK MANASSE, FRANK MCSHERRY MICROSOFT RESEARCH SILICON VALLEY MOUNTAIN VIEW, CALIFORNIA, USA ABSTRACT Recent advances in the

More information

Adaptive Online Gradient Descent

Adaptive Online Gradient Descent Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

More information

Chapter 3. if 2 a i then location: = i. Page 40

Chapter 3. if 2 a i then location: = i. Page 40 Chapter 3 1. Describe an algorithm that takes a list of n integers a 1,a 2,,a n and finds the number of integers each greater than five in the list. Ans: procedure greaterthanfive(a 1,,a n : integers)

More information

Graph Database Proof of Concept Report

Graph Database Proof of Concept Report Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment

More information

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com

More information

The Advantages and Disadvantages of Network Computing Nodes

The Advantages and Disadvantages of Network Computing Nodes Big Data & Scripting storage networks and distributed file systems 1, 2, in the remainder we use networks of computing nodes to enable computations on even larger datasets for a computation, each node

More information

Notes from Week 1: Algorithms for sequential prediction

Notes from Week 1: Algorithms for sequential prediction CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking

More information

Load Balancing. Load Balancing 1 / 24

Load Balancing. Load Balancing 1 / 24 Load Balancing Backtracking, branch & bound and alpha-beta pruning: how to assign work to idle processes without much communication? Additionally for alpha-beta pruning: implementing the young-brothers-wait

More information

Lecture 2: Universality

Lecture 2: Universality CS 710: Complexity Theory 1/21/2010 Lecture 2: Universality Instructor: Dieter van Melkebeek Scribe: Tyson Williams In this lecture, we introduce the notion of a universal machine, develop efficient universal

More information

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.

More information

DATABASE DESIGN - 1DL400

DATABASE DESIGN - 1DL400 DATABASE DESIGN - 1DL400 Spring 2015 A course on modern database systems!! http://www.it.uu.se/research/group/udbl/kurser/dbii_vt15/ Kjell Orsborn! Uppsala Database Laboratory! Department of Information

More information

ECON 459 Game Theory. Lecture Notes Auctions. Luca Anderlini Spring 2015

ECON 459 Game Theory. Lecture Notes Auctions. Luca Anderlini Spring 2015 ECON 459 Game Theory Lecture Notes Auctions Luca Anderlini Spring 2015 These notes have been used before. If you can still spot any errors or have any suggestions for improvement, please let me know. 1

More information

Dynamic TCP Acknowledgement: Penalizing Long Delays

Dynamic TCP Acknowledgement: Penalizing Long Delays Dynamic TCP Acknowledgement: Penalizing Long Delays Karousatou Christina Network Algorithms June 8, 2010 Karousatou Christina (Network Algorithms) Dynamic TCP Acknowledgement June 8, 2010 1 / 63 Layout

More information

A Comparison of General Approaches to Multiprocessor Scheduling

A Comparison of General Approaches to Multiprocessor Scheduling A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University

More information

The WebGraph Framework:Compression Techniques

The WebGraph Framework:Compression Techniques The WebGraph Framework: Compression Techniques Paolo Boldi Sebastiano Vigna DSI, Università di Milano, Italy The Web graph Given a set U of URLs, the graph induced by U is the directed graph having U as

More information

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1. MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

More information

REDUCING RISK OF HAND-ARM VIBRATION INJURY FROM HAND-HELD POWER TOOLS INTRODUCTION

REDUCING RISK OF HAND-ARM VIBRATION INJURY FROM HAND-HELD POWER TOOLS INTRODUCTION Health and Safety Executive Information Document HSE 246/31 REDUCING RISK OF HAND-ARM VIBRATION INJURY FROM HAND-HELD POWER TOOLS INTRODUCTION 1 This document contains internal guidance hich has been made

More information

Vectors Math 122 Calculus III D Joyce, Fall 2012

Vectors Math 122 Calculus III D Joyce, Fall 2012 Vectors Math 122 Calculus III D Joyce, Fall 2012 Vectors in the plane R 2. A vector v can be interpreted as an arro in the plane R 2 ith a certain length and a certain direction. The same vector can be

More information

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS STEVEN P. LALLEY AND ANDREW NOBEL Abstract. It is shown that there are no consistent decision rules for the hypothesis testing problem

More information

The LCA Problem Revisited

The LCA Problem Revisited The LA Problem Revisited Michael A. Bender Martín Farach-olton SUNY Stony Brook Rutgers University May 16, 2000 Abstract We present a very simple algorithm for the Least ommon Ancestor problem. We thus

More information

1. The memory address of the first element of an array is called A. floor address B. foundation addressc. first address D.

1. The memory address of the first element of an array is called A. floor address B. foundation addressc. first address D. 1. The memory address of the first element of an array is called A. floor address B. foundation addressc. first address D. base address 2. The memory address of fifth element of an array can be calculated

More information

Why? A central concept in Computer Science. Algorithms are ubiquitous.

Why? A central concept in Computer Science. Algorithms are ubiquitous. Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online

More information

Storage Management for Files of Dynamic Records

Storage Management for Files of Dynamic Records Storage Management for Files of Dynamic Records Justin Zobel Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia. jz@cs.rmit.edu.au Alistair Moffat Department of Computer Science

More information

LZ77. Example 2.10: Let T = badadadabaab and assume d max and l max are large. phrase b a d adadab aa b

LZ77. Example 2.10: Let T = badadadabaab and assume d max and l max are large. phrase b a d adadab aa b LZ77 The original LZ77 algorithm works as follows: A phrase T j starting at a position i is encoded as a triple of the form distance, length, symbol. A triple d, l, s means that: T j = T [i...i + l] =

More information

Multi-dimensional index structures Part I: motivation

Multi-dimensional index structures Part I: motivation Multi-dimensional index structures Part I: motivation 144 Motivation: Data Warehouse A definition A data warehouse is a repository of integrated enterprise data. A data warehouse is used specifically for

More information

7 Gaussian Elimination and LU Factorization

7 Gaussian Elimination and LU Factorization 7 Gaussian Elimination and LU Factorization In this final section on matrix factorization methods for solving Ax = b we want to take a closer look at Gaussian elimination (probably the best known method

More information

I. GROUPS: BASIC DEFINITIONS AND EXAMPLES

I. GROUPS: BASIC DEFINITIONS AND EXAMPLES I GROUPS: BASIC DEFINITIONS AND EXAMPLES Definition 1: An operation on a set G is a function : G G G Definition 2: A group is a set G which is equipped with an operation and a special element e G, called

More information

Protocols for Efficient Inference Communication

Protocols for Efficient Inference Communication Protocols for Efficient Inference Communication Carl Andersen and Prithwish Basu Raytheon BBN Technologies Cambridge, MA canderse@bbncom pbasu@bbncom Basak Guler and Aylin Yener and Ebrahim Molavianjazi

More information

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012 Binary numbers The reason humans represent numbers using decimal (the ten digits from 0,1,... 9) is that we have ten fingers. There is no other reason than that. There is nothing special otherwise about

More information

Approximate Search Engine Optimization for Directory Service

Approximate Search Engine Optimization for Directory Service Approximate Search Engine Optimization for Directory Service Kai-Hsiang Yang and Chi-Chien Pan and Tzao-Lin Lee Department of Computer Science and Information Engineering, National Taiwan University, Taipei,

More information

Longest Common Extensions via Fingerprinting

Longest Common Extensions via Fingerprinting Longest Common Extensions via Fingerprinting Philip Bille, Inge Li Gørtz, and Jesper Kristensen Technical University of Denmark, DTU Informatics, Copenhagen, Denmark Abstract. The longest common extension

More information

Understanding the Benefits of IBM SPSS Statistics Server

Understanding the Benefits of IBM SPSS Statistics Server IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster

More information

Scheduling Shop Scheduling. Tim Nieberg

Scheduling Shop Scheduling. Tim Nieberg Scheduling Shop Scheduling Tim Nieberg Shop models: General Introduction Remark: Consider non preemptive problems with regular objectives Notation Shop Problems: m machines, n jobs 1,..., n operations

More information

Oracle8i Spatial: Experiences with Extensible Databases

Oracle8i Spatial: Experiences with Extensible Databases Oracle8i Spatial: Experiences with Extensible Databases Siva Ravada and Jayant Sharma Spatial Products Division Oracle Corporation One Oracle Drive Nashua NH-03062 {sravada,jsharma}@us.oracle.com 1 Introduction

More information

Elements of probability theory

Elements of probability theory 2 Elements of probability theory Probability theory provides mathematical models for random phenomena, that is, phenomena which under repeated observations yield di erent outcomes that cannot be predicted

More information

CSE 135: Introduction to Theory of Computation Decidability and Recognizability

CSE 135: Introduction to Theory of Computation Decidability and Recognizability CSE 135: Introduction to Theory of Computation Decidability and Recognizability Sungjin Im University of California, Merced 04-28, 30-2014 High-Level Descriptions of Computation Instead of giving a Turing

More information

A Static Analyzer for Large Safety-Critical Software. Considered Programs and Semantics. Automatic Program Verification by Abstract Interpretation

A Static Analyzer for Large Safety-Critical Software. Considered Programs and Semantics. Automatic Program Verification by Abstract Interpretation PLDI 03 A Static Analyzer for Large Safety-Critical Software B. Blanchet, P. Cousot, R. Cousot, J. Feret L. Mauborgne, A. Miné, D. Monniaux,. Rival CNRS École normale supérieure École polytechnique Paris

More information

Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions

Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions Hiroshi Inoue, Moriyoshi Ohara, and Kenjiro Taura IBM Research Tokyo, University of Tokyo {inouehrs, ohara}@jp.ibm.com,

More information

Record Storage and Primary File Organization

Record Storage and Primary File Organization Record Storage and Primary File Organization 1 C H A P T E R 4 Contents Introduction Secondary Storage Devices Buffering of Blocks Placing File Records on Disk Operations on Files Files of Unordered Records

More information

A Catalogue of the Steiner Triple Systems of Order 19

A Catalogue of the Steiner Triple Systems of Order 19 A Catalogue of the Steiner Triple Systems of Order 19 Petteri Kaski 1, Patric R. J. Östergård 2, Olli Pottonen 2, and Lasse Kiviluoto 3 1 Helsinki Institute for Information Technology HIIT University of

More information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

More information

Topic 4: Introduction to Labour Market, Aggregate Supply and AD-AS model

Topic 4: Introduction to Labour Market, Aggregate Supply and AD-AS model Topic 4: Introduction to Labour Market, Aggregate Supply and AD-AS model 1. In order to model the labour market at a microeconomic level, e simplify greatly by assuming that all jobs are the same in terms

More information

Electronic Document Management Using Inverted Files System

Electronic Document Management Using Inverted Files System EPJ Web of Conferences 68, 0 00 04 (2014) DOI: 10.1051/ epjconf/ 20146800004 C Owned by the authors, published by EDP Sciences, 2014 Electronic Document Management Using Inverted Files System Derwin Suhartono,

More information

PRIME FACTORS OF CONSECUTIVE INTEGERS

PRIME FACTORS OF CONSECUTIVE INTEGERS PRIME FACTORS OF CONSECUTIVE INTEGERS MARK BAUER AND MICHAEL A. BENNETT Abstract. This note contains a new algorithm for computing a function f(k) introduced by Erdős to measure the minimal gap size in

More information

Victor Shoup Avi Rubin. fshoup,rubing@bellcore.com. Abstract

Victor Shoup Avi Rubin. fshoup,rubing@bellcore.com. Abstract Session Key Distribution Using Smart Cards Victor Shoup Avi Rubin Bellcore, 445 South St., Morristown, NJ 07960 fshoup,rubing@bellcore.com Abstract In this paper, we investigate a method by which smart

More information