Load Balancing and Skew Resilience for Parallel Joins


Aleksandar Vitorovic, Mohammed Elseidy, and Christoph Koch
École Polytechnique Fédérale de Lausanne

ABSTRACT

We address the problem of accurate and efficient load balancing for parallel join processing. We show that the distribution of the input data received and of the output data produced by worker machines are both important to performance. As a result, previous work, which ignores or makes unrealistic assumptions about the properties of the output, is ineffective for efficient parallel join processing. To resolve this problem, we propose a load-balancing algorithm that considers the properties of both the input and the output data through sampling of the original join matrix. To do this efficiently, we propose a novel category of histograms, which we call equi-weight histograms. To build them, we take advantage of state-of-the-art computational geometry algorithms for rectangle tiling. To our knowledge, we are the first to employ tiling algorithms for join load balancing. In addition, we propose a novel tiling algorithm that exploits the output distribution properties of common joins such as equi-, band- and inequality-joins, and that has drastically lower time and space complexity than existing algorithms. We experimentally demonstrate that our scheme outperforms state-of-the-art techniques by a significant factor.

1. INTRODUCTION

Load balancing is crucial for efficient and scalable parallel processing of large amounts of data, as the total execution time depends on the slowest machine. In this paper, we develop algorithms and techniques for efficient and accurate load balancing for parallel joins.

Data skew occurs frequently in industrial applications [7, 8]. Load balancing becomes challenging in the presence of skew []. First, redistribution skew (RS) refers to skewed values in the join keys. For hash joins, RS leads to uneven input data partitions and thus affects performance significantly.
The 1-Bucket scheme [2] addresses RS and achieves a significant speedup compared to a hash join. Secondly, join product skew (JPS) [] refers to the variance in join selectivity. A partitioning scheme which does not address JPS causes a few machines to produce a huge portion of the output. These machines become a bottleneck, severely hindering performance. Depending on the input and output sizes, JPS may impede performance more than RS. Our evaluation shows that our scheme, which addresses both RS and JPS, achieves a significant speedup compared to state-of-the-art schemes.

Previous work. We classify previous work as follows. JPS-avoidance techniques balance the output-related work among the machines, regardless of the join selectivity. However, they replicate the input tuples, which causes high network and memory consumption, high input-related work per machine, and thus high execution time. Examples include the 1-Bucket [2] and PRPD [7] schemes. The 1-Bucket scheme [2] supports both equi- and non-equi-joins, but it replicates each input tuple. The PRPD [7] scheme supports only equi-joins, but it replicates tuples selectively, depending on the join key frequencies.

Assumption-based techniques make assumptions about the output distribution. The M-Bucket scheme [2] supports both equi- and non-equi-joins, but it does not estimate the output distribution. Thus, it cannot address JPS. The V-optimal scheme [2] supports only equi-joins, and it estimates the output distribution from the input relations' histograms. It is susceptible to JPS, except for the special case of a uniform join key frequency distribution within a bucket. Such assumptions lead to huge estimation errors [4]. Hence, these schemes cause high output-related work.

In general, previous work does not capture the output distribution, as this requires a sample of the output. Building such a sample is hard, as a join between uniform random samples of the input relations is not a uniform random sample of the join output [].

Our scheme.
We are the first to provide a scheme which is both input- and output-optimal, as Table 1 shows. (i) In contrast to most previous work [4, 2, 7], which is limited to equi-joins, we support a broad class of monotonic joins that includes equi-, band- and inequality-joins. (ii) In contrast to the JPS-avoidance techniques, our scheme achieves load balancing with minimal work per machine, which includes both input- and output-related work. This results in better execution times. (iii) Assumption-based techniques result in high execution times when their assumptions do not hold. In contrast, we provide accurate load balancing without any prior assumptions or knowledge about the data.

Our solution is twofold. First, we propose an efficient parallel scheme for capturing the output distribution. We represent the input and output distribution as a matrix, where each dimension of the matrix corresponds to the join keys of an input relation. Second, based on these distributions, we optimally assign portions of the matrix (called regions) to machines. To do so, we introduce a novel family of histograms, which we call equi-weight histograms, and a novel histogram algorithm to build them. An equi-weight histogram is a partitioning of the matrix into regions of almost the same weight (the region weight corresponds to the machine's work). Thus, a partitioning scheme based on the equi-weight histogram provides accurate load balancing.

Our histogram algorithm builds on computational geometry (CG) algorithms for rectangle tiling. We are the first to employ CG for join load balancing. Using existing CG algorithms requires O(n^5 log n) time to produce an accurate partitioning. This is impractical, as it is more costly than executing the parallel join itself.

Partitioning Scheme   Input-Optimal   Output-Optimal
1-Bucket [2]          no              yes
PRPD [7]*             no              yes
M-Bucket [2]          yes             no
V-optimal [2]*        yes             no
EWH (ours)            yes             yes
* Limited to equi-joins only.
Table 1: Comparison with the most important related work.

In contrast, our algorithm runs in O(n) time, while providing accurate load balancing close to that of the baseline CG algorithm. We achieve efficiency and accuracy as follows. First, we devise a novel CG algorithm that employs domain-specific knowledge about monotonic joins (the properties of the join output distribution). This algorithm drastically reduces the time and space complexity compared to the baseline CG algorithm, while providing the same accuracy. Second, we devise a 3-stage histogram algorithm (sampling, coarsening and regionalization), where the output of each stage is the input to the next one in the chain. Each stage reduces the input size of the next one, while providing guarantees for its output. As later stages have more coarse-grained input, we employ more precise algorithms for them to preserve accuracy. As the more precise (and expensive) algorithms work on smaller inputs, we preserve efficiency. Third, setting the right output size for each stage is a challenging task. Namely, the size must be small enough to keep the algorithm's running time low. On the other hand, the size must be big enough, as insufficient output resolution leads to inaccurate load balancing (one machine is assigned much more work than the others).

The main contributions of this paper are:
1. To provide efficient and accurate load balancing, we devise a multi-stage histogram algorithm which contains a novel, join-specialized computational geometry algorithm.
2. Our scheme achieves minimal work per machine, without imposing any assumptions about the data distribution.
3. We experimentally validate our scheme. Compared to the state of the art, our scheme achieves significant speedups and is significantly more efficient in terms of resource consumption.
The rest of this paper is organized as follows. First, 2 introduces the background and defines the problem statement; 3 presents the algorithm for building equi-weight histograms; 4 shows how to incorporate the histograms in a join operator; 5 discusses related work and, finally, 6 evaluates the performance and validates the effectiveness of equi-weight histograms in achieving load balancing.

2. BACKGROUND & PRELIMINARIES

Next, we introduce the definitions used in this paper. Then, we discuss the schemes from Table 1 which go beyond equi-joins, and highlight the benefits of our scheme on a concrete example. Finally, we define the problem statement. The important symbols used in this paper are summarized in Table 2.

Symbol   Description                            Value
R1, R2   Input relations
J        The number of machines
n        Max. input relation size
m        Join output size
ρ_oi     Output/input ratio                     m/n
w(r)     Weight of region r
M        Original join matrix
M_S      Sample matrix of size n_s x n_s        n_s = sqrt(2nJ)
s_i      Input sample size                      Θ(n_s log n)
s_o      Output sample size                     Θ(n_s)
M_C      Coarsened matrix of size n_c x n_c     n_c = Θ(J)
M_H      Equi-weight histogram
Table 2: Summary of the notation used in the paper.

2.1 Definitions

Join Model. We model a join between relations R1 and R2 as a join matrix M. For row i and column j, cell M(i, j) represents a potential output tuple. M(i, j) is 1 if and only if the tuples r_i and s_j satisfy the join predicate. Figure 1a shows a matrix for a band-join with a join condition |R1.A - R2.A| <= 1. We focus on the joins which are common in practice, and for which state-of-the-art techniques perform poorly, that is, low-selectivity joins. These joins have sparse join matrices, where only a small portion of the Cartesian space produces output tuples.

Regions. We execute a join using J machines in a shared-nothing architecture. We refer to a set of cells (that is, the corresponding input tuples) assigned to a single machine for local processing as a region.
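The join-matrix model above can be sketched in a few lines; a minimal sketch assuming a band-join of width 1 (the helper names `band_pred` and `build_join_matrix` are ours, not from the paper):

```python
# Sketch of the join-matrix model: M(i, j) = 1 iff tuple i of R1 joins
# with tuple j of R2. Helper names are illustrative, not from the paper.

def band_pred(a, b, width=1):
    """Band-join predicate |R1.A - R2.A| <= width."""
    return abs(a - b) <= width

def build_join_matrix(r1_keys, r2_keys, pred):
    """Materialize the join matrix for two (sorted) key columns."""
    return [[1 if pred(a, b) else 0 for b in r2_keys] for a in r1_keys]

# For sorted inputs, the 1-cells cluster around the diagonal: a sparse matrix.
M = build_join_matrix([1, 2, 5, 9], [1, 3, 6, 9], band_pred)
```

For low-selectivity joins this matrix is sparse; materializing it explicitly is only for illustration here, since the schemes discussed below never build M in full.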
We adhere to rectangular regions, as opposed to rectilinear or non-contiguous regions, to incur minimal storage and communication costs [7].

Input and Output metrics. A region's input is its semi-perimeter. Processing an input tuple consists of receiving the tuple (network traffic and demarshalling costs) and of the join computation i. The output is the number of output tuples (frequency) of a region. The processing cost of an output tuple mainly comes from a post-processing stage (writing the output to disk or transferring it over the network to the next operator in the query plan). For example, in Figure 1b, a region's input is its semi-perimeter and its output is the number of shaded cells it covers.

Load balancing is defined as minimizing the maximum work per machine. As each machine is assigned a region, we represent the machine's work as a weight function of the input and output costs: w(r) = c_i(r) + c_o(r). As these costs depend on the local join implementation and the hardware/software architecture, c_i(r) and c_o(r) naturally mimic the actual costs of processing the input(r) and output(r) tuples, respectively. Thus, the goal is to minimize the maximum w(r). Next, we discuss whether different schemes achieve this goal.

2.2 Content-Insensitive Partitioning Scheme

The content-insensitive partitioning scheme, CI (called 1-Bucket in [2, 7]), illustrated in Figure 1b, assigns all cells to machines, regardless of the join condition. Thus, regions cover the entire join matrix. An incoming tuple from R1 (R2) randomly picks a row (column) in the join matrix, and is assigned to all the regions which intersect with that row (column). For example, a tuple with join key 7 from R2 randomly picks column 6, and is assigned to all the regions which intersect with that column. The scheme achieves almost perfect load balancing for the output by ensuring that regions have almost the same area. In particular, due to the random tuple distribution, almost equal-area regions have almost equal output.
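The content-insensitive assignment can be sketched as follows, assuming J = g*g machines laid out as a g x g grid of equal-area regions (the layout and function names are our illustrative choices, not the paper's):

```python
# CI-style (1-Bucket-like) assignment on an n x n join matrix split into a
# g x g grid of equal-area regions (region ids 0..g*g-1, row-major).
# A tuple from R1 picks a random row and is replicated to every region
# intersecting that row, i.e. one full horizontal band of g regions;
# tuples from R2 are handled symmetrically by column.

def regions_for_row(row, n, g):
    """Regions intersecting matrix row `row` (for R1 tuples)."""
    band = row * g // n               # which horizontal band of regions
    return [band * g + c for c in range(g)]

def regions_for_col(col, n, g):
    """Regions intersecting matrix column `col` (for R2 tuples)."""
    band = col * g // n               # which vertical band of regions
    return [r * g + band for r in range(g)]
```

Each input tuple is thus replicated g = sqrt(J) times, which is exactly the input-replication cost that the content-sensitive schemes below avoid.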
However, CI incurs prohibitively high input costs for low-selectivity joins. Namely, as this scheme assigns all the cells (regardless of whether they produce an output tuple) to machines, CI suffers from excessive input tuple replication. We compare the partitioning schemes in Figures 1b, 1c and 1d using the weight function w(r) = input(r) + output(r). Due to excessive tuple replication, CI has the highest maximum w(r), compared to w(r*) = 28 and w(r*) = 20 of the other schemes.

i The computation can partially belong to the output (see 6.).

In fact, CI works well only if the output

costs are much bigger than the input costs, that is, when input tuple replication has a small effect on the work per machine.

Figure 1: Different partitioning schemes (of J machines) on a band-join with a join condition |R1.A - R2.A| <= 1. Shaded cells represent output tuples. (a) Matrix for a band-join. (b) CI (1-Bucket). (c) CSI (M-Bucket): I = 14; O = 14. (d) CSIO: I = 10; O = 10. In (b)-(d), I_r is the input and O_r the output metric of the region r* with maximum weight w(r*) = I_r + O_r.

2.3 Content-Sensitive Partitioning Scheme

A content-sensitive scheme addresses the excessive tuple replication problem. It assigns an input tuple to a machine (or machines) according to its content (join key). CSI (Figure 1c), called M-Bucket in [2], is a content-sensitive scheme that uses the input statistics. To simplify notation, we denote both relation sizes as n ii. CSI builds approximate equi-depth histograms iii with p buckets over the join keys of each input relation (p < n), and creates a grid of size p x p over the join matrix. In Figure 1c, p = n/2 = 8, and each grid cell contains h = (n/p)^2 = 4 matrix cells. We denote a grid cell which may produce an output tuple as a candidate cell (marked with diagonal lines). To efficiently check whether a grid cell is a candidate (in O(1) time), CSI requires a join condition that permits candidacy checking by examining only the join keys on the grid cell boundaries. This holds for monotonic joins, including equi-, band- and inequality-joins, as the boundary join keys are sorted. For example, grid cell (0, ) in Figure 1c is a non-candidate, as the distance between the boundary keys from R2 and R1 (= 2) is larger than the width of the band-join (1).

CSI optimizes the input costs, as it assigns only candidate cells to machines, safely disregarding large contiguous portions of the join matrix that do not produce any output. However, as regions are rectangular, CSI assigns some non-candidates to machines.
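The O(1) candidacy check for a band-join can be sketched as follows, assuming each grid cell knows its boundary join keys (lo/hi per relation); the function name is ours:

```python
def band_cell_is_candidate(lo1, hi1, lo2, hi2, width):
    """O(1) candidacy check for a band-join |R1.A - R2.A| <= width: a grid
    cell can produce output iff the R1 key interval [lo1, hi1], widened by
    the band, overlaps the R2 key interval [lo2, hi2]. Only the boundary
    keys are examined, which is valid for monotonic joins where the
    boundary join keys are sorted."""
    return lo2 <= hi1 + width and lo1 - width <= hi2
```

For example, a cell with R1 keys in [0, 3] and R2 keys in [6, 9] is a non-candidate for width 1, since the gap between the boundary keys exceeds the band width.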
For example, although grid cell (0, ) in Figure 1c is a non-candidate, it is assigned to a region. In Figure 1, the maximum input(r) of CSI is only I = 14, compared to that of CI. The gap between the two schemes deepens as the number of machines J increases, since the number of non-candidates grows. However, CSI is susceptible to JPS, as it ignores the actual number of output tuples (output) and assigns a constant to each candidate cell. In practice, the output of a grid cell varies from 0 to the grid cell area h (h = 4 in our example). In Figure 1c, regions r1 and r2 have the same number of candidate cells (4), but vastly different numbers of output tuples. This is why the maximum w(r) in CSI (w(r*) = 28) only slightly improves on that of CI. In fact, CSI works well only if the input costs are much bigger than the output costs, that is, when JPS does not affect the work much.

2.4 Equi-Weight Histogram Scheme

We propose a novel, equi-weight histogram scheme, CSIO (Figure 1d), which achieves the best of both worlds: it avoids both JPS and excessive tuple replication. CSIO is a content-sensitive scheme that accurately estimates the number of output tuples per candidate cell via sampling (see 4.) iv. This is why the maximum w(r) in CSIO (w(r*) = 20) is much smaller than that of CSI (w(r*) = 28). In practice, the gap between the two schemes is much deeper. Namely, to keep the time for building the CSI scheme acceptable, it must hold that p << n [2]. This increases the candidate grid cell area h, making CSI much more susceptible to JPS.

In contrast to CI and CSI, which work well only if the output or the input costs dominate, our CSIO is close to optimal for a wide range of output/input cost ratios. We build such a scheme using a novel family of equi-weight histograms. The histogram contains buckets of almost equal weight, where each bucket corresponds to a region. Figure 2a depicts weight histograms for the different schemes from Figure 1.
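For a small materialized matrix, the region weight w(r) = input(r) + output(r) used in these comparisons can be computed directly (a sketch with our own names; the actual schemes estimate these quantities rather than scanning M):

```python
def region_weight(M, top, left, bottom, right):
    """w(r) = input(r) + output(r) for a rectangular region with inclusive
    bounds: input is the semi-perimeter (#rows + #cols of input tuples the
    machine receives), output is the number of 1-cells inside the region."""
    rows = bottom - top + 1
    cols = right - left + 1
    output = sum(M[i][j]
                 for i in range(top, bottom + 1)
                 for j in range(left, right + 1))
    return rows + cols + output
```

Minimizing the maximum of this quantity over all regions is exactly the load-balancing goal formalized next.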
CSIO is based on equi-weight histograms and is the only scheme that minimizes the maximum region weight (the machine's work), providing accurate load balancing. Next, we formalize the histogram construction problem.

Problem definition. Given a sparse matrix M[1..n, 1..n] with cell values {0, 1}, partition it into J << n non-overlapping rectangular regions r_j in R, such that each 1-cell is covered by exactly one region, and each 0-cell is covered by at most one region. The goal is to minimize the maximum weight of a region, that is, min{max_{r_j in R} w(r_j)}. M is the original join matrix, where 1-cells represent output tuples and 0-cells depict empty entries, w(r) represents the work of a machine assigned to region r, and J is the number of machines.

ii Our analysis also holds when the sizes differ.
iii For the sake of this example, we assume an exact histogram.
iv For the sake of this example, we assume exact statistics.

3. HISTOGRAM ALGORITHM

In this section, we show how our efficient histogram algorithm achieves accurate load balancing. We first provide a high-level overview of the different stages of our algorithm, and then give more details in the subsequent sections.

Previous work. The histogram construction problem is NP-hard. The best known approximation algorithm is BSP [], a tiling algorithm which runs in O(n^5) time and has an approximation ratio of 2. To create the histogram (that is, the regions), we need to perform a binary search over the BSP (see 3.3). We denote

          BSP over M      BSP over M_S         BSP over M_C           MonotonicBSP over M_C
Time      O(n^5 log n)    O((nJ)^2.5 log n)    O(n^(5/3) / log^9 n)   O(n)
Table 3: The time complexity improvements.

Figure 2: a) Weight histograms. b) Non-monotonicity in cells 1 and 2 due to the candidate cells marked in black. c) r_min1 (r_min2) is a minimal candidate rectangle for r_1 (r_2).

the entire process as regionalization, and it takes O(n^5 log n) time. This is impractical, as it is more costly than the join.

Our solution. We propose a histogram algorithm that takes O(n) time on a single machine, while providing load balancing close to that of the BSP []. Our idea is twofold. First, we reduce the input matrix size of the regionalization (originally, the regionalization takes the original matrix M as its input). Second, we drastically improve the regionalization running time by using a novel tiling algorithm which we call MonotonicBSP. These ideas allow us to be the first to use tiling algorithms for join load balancing.

1. Reducing the regionalization input. We introduce a sampling stage, which creates a sample matrix M_S of size n_s x n_s. M_S is much smaller than M (n_s << n). To provide load balancing, n_s needs to be (at least) sqrt(2nJ) (see 3.1.2). If the regionalization takes M_S as the input, it runs in O(n_s^5 log n) = O((nJ)^2.5 log n) time. This computation cost is still too high. To that end, we introduce a coarsening stage, which further reduces the regionalization input. The coarsening takes M_S as the input and creates M_C of size n_c x n_c (n_c < n_s). This stage reduces the regionalization input by using the distribution of the M_S cell weights (that is, we represent multiple small M_S cells as one M_C cell). To provide load balancing, n_c = 2J (3.2). If the regionalization takes M_C as the input, it runs in O(J^5 log n) time.
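The binary search over the weight bound δ, which turns the dual "minimize the number of regions given δ" solver into a "minimize the maximum weight given J machines" solver, can be illustrated on a 1-D analogue, with a greedy contiguous partitioner standing in for BSP (all names here are ours; the real regionalization searches over BSP on the 2-D matrix):

```python
def parts_needed(weights, delta):
    """Greedy count of contiguous groups whose weight stays <= delta
    (a 1-D stand-in for the dual tiling solver)."""
    count, cur = 1, 0
    for w in weights:
        if w > delta:
            return float("inf")       # infeasible: one cell alone exceeds delta
        if cur + w > delta:
            count, cur = count + 1, 0
        cur += w
    return count

def min_max_weight(weights, J):
    """Binary search over delta until at most J groups suffice, mirroring
    how the regionalization searches over the BSP's weight bound."""
    lo, hi = max(weights), sum(weights)
    while lo < hi:
        mid = (lo + hi) // 2
        if parts_needed(weights, mid) <= J:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

The same monotonicity argument (a larger δ never needs more regions) is what makes the binary search over the 2-D BSP correct.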
As in practice J = O(n^(1/3) / log^2 n) v, the regionalization takes O(n^(5/3) / log^9 n) time. This is still expensive compared to the join costs.

2. MonotonicBSP: a novel tiling algorithm. In contrast to BSP, which takes O(J^5 log n) time, our MonotonicBSP runs in only O(J^3 log^2 n) time. To do so, MonotonicBSP exploits the output properties of monotonic joins (see 3.3). As in practice J = O(n^(1/3) / log^2 n) vi, the regionalization based on MonotonicBSP, along with the sampling and coarsening, takes only O(n) time (see 3.1-3.3). Table 3 summarizes all the complexity improvements.

v Due to parallelization overhead, adding machines after a certain point provides no additional performance benefits. Our formula for J captures this observation and states that, for example, if n is hundreds of millions, it is then sufficient to use hundreds of machines.
vi See footnote v.

Putting everything together. Figure 3 illustrates all the histogram algorithm stages (the sampling, coarsening and regionalization) for w(r) = input(r) + output(r). The sampling stage builds M_S of size n_s x n_s (n_s = 16 in Figure 3a) using small input and output samples from M. M_S preserves the weight distribution of M. That is, with high probability, region weights in M_S are very close to the corresponding weights in M. The coarsening stage creates a non-uniform grid of size n_c x n_c over M_S (n_c = 8 in Figure 3b), such that each grid cell becomes an M_C cell. The frequency (output) of an M_C cell is the sum of the frequencies of the M_S cells it covers, and its weight combines the corresponding input and output costs. The regionalization coalesces multiple M_C cells into a region (see Figure 3c). This stage uses hierarchical partitioning (recursively dividing rectangles into 2 sub-rectangles) over its input (M_C). Hierarchical partitioning has a higher degree of freedom than the grid partitioning (of the coarsening).
For example, the grid partitioning cannot produce the hierarchical one from Figure 3c, as r_1 and r_2 partially overlap over the y-axis.

The main ideas in our histogram algorithm that allow for efficient and accurate load balancing are:
1) We avoid imposing any assumptions about the distribution within a cell, as doing so leads to incorrect weights and inaccurate load balancing. Thus, we create M_C cells (regions) at the granularity of an M_S (M_C) cell.
2) A careful choice of the matrix sizes n_s and n_c.
3) Using more precise algorithms when it matters more for the accuracy of load balancing, that is, when the cell weights in the stage's input matrix are higher. In particular, as we move forward in the chain, the matrix size drops and the maximum cell weight grows (for example, in Figure 3, the matrix sizes are n_s = 16 and n_c = 8, and the maximum cell weight in M_C is much larger than that in M_S). We account for this by using more precise algorithms as we move forward in the chain. In particular, the regionalization is more precise than the coarsening, as it explores all hierarchical partitionings. More precise algorithms are also more expensive per input matrix cell (see 3.1-3.3). However, as the more expensive algorithms work on smaller input matrices (recall n_s = 16, n_c = 8), the histogram algorithm is efficient.
4) Devising MonotonicBSP, a novel tiling algorithm.
5) All the stages use a weight function which accurately estimates the processing costs, and each stage provides guarantees for minimizing the maximum cell weight. We prove an upper bound on the M_S cell weight (Lemma 3.1). The coarsening (regionalization) guarantees partitionings within a factor of 2 from the optimal grid (arbitrary vii) partitioning of the given M_S (M_C) matrix.

Next, we discuss the details of each stage.

3.1 Sampling

This stage efficiently builds the sample matrix M_S, which provides for accurate load balancing.
3.1.1 Accurate load balancing

The histogram algorithm requires precise region weight information to achieve accurate load balancing. Thus, M_S preserves the region weights of the original matrix M. A region r_s from M_S corresponds to a region r from the original matrix M if and only if they share all the region boundary keys. M_S ensures that for any two corresponding regions r_s and r, it holds that w(r_s) ≈ w(r) with very high probability.

Previous work on sample data structures mainly concerns multi-attribute single-relation histograms, which are used

vii It allows any partitioning into rectangles.

for answering range queries (e.g. [26]). As the algorithms for building these data structures consider only the frequency (output), they cannot preserve the w(r_s) ≈ w(r) property.

Figure 3: Histogram algorithm stages, for the weight function w(r) = c_i(r) + c_o(r) = input(r) + output(r). a) Sample matrix of size n_s x n_s = 16 x 16. b) Rounded matrix of size n_c x n_c = 8 x 8. c), d) Equi-weight histogram.

Building M_S. In contrast, we build M_S of size n_s x n_s (in Figure 3a, n_s = 16), which keeps the w(r_s) ≈ w(r) property by preserving both the input and the output distribution, as we discuss next. a) To preserve the input distribution, we build an approximate equi-depth histogram [0] with n_s buckets on each input relation. The histogram boundaries form a grid of size n_s x n_s over the original matrix. Each such grid cell corresponds to an M_S cell. A region's input is the product of its semi-perimeter in M_S and the expected bucket size n/n_s. b) To preserve the output distribution, we take a uniform random sample of the join output. To do so efficiently, we propose a parallel sampling scheme (4.). Once the sample is in place, we increment the corresponding M_S cell for each sampled output tuple. A region's output is the product of the fraction of output sample tuples within the region and the total output size m (we show how to estimate m in 4.).

3.1.2 Efficiency and Accuracy Considerations

Setting the M_S size n_s is crucial for efficiency and accuracy. To keep the running time of the coarsening stage low, n_s must be small enough. On the other hand, decreasing n_s may affect the accuracy of load balancing.
In particular, if we decrease n_s, the M_S cells correspond to bigger portions of the original matrix and thus the M_S cell weights grow. For example, the maximum cell weight in Figure 3a is some value σ. Assume a sample matrix that differs from the M_S in Figure 3a only by replacing n_s = 16 with n_s = 4. Then the maximum cell weight is much larger. As we avoid imposing assumptions within an M_S cell, a region contains at least one M_S cell. Thus, the maximum region weight in the partitioning scheme with n_s = 4 is at least this larger cell weight, making such a scheme suboptimal compared to the one from Figures 3c,d. Hence, a small n_s leads to weighty regions and hurts accurate load balancing.

It is challenging to set n_s, as in the sampling stage we do not know the maximum w(r) of the partitioning scheme. To that end, we find a lower bound on the maximum w(r) of the optimal partitioning scheme, which we call w_OPT. We compute w_OPT by dividing a lower bound on the total join work (w(M), where input(M) = 2n and output(M) = m) equally among the machines. In fact, w(M) is a lower bound as it assumes no input tuple replication. To ensure load balancing given that the coarsening and regionalization use approximate algorithms, we require that σ <= 0.5 w_OPT. This holds independently of the join condition and the join key distribution if n_s = sqrt(2nJ), as the following lemma shows. The proofs are in our technical report [4].

Lemma 3.1. n_s = sqrt(2nJ) is the minimum M_S size such that the maximum cell weight σ in M_S is at most half of the maximum region weight of the optimal M_H partitioning. This holds independently of the join condition and the join key distribution, given that m >= n viii.

Next, we briefly discuss the sizes of the input and output samples, which are required for building M_S. Due to space constraints, we omit the details. The detailed analysis is in [4].
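The equi-depth boundaries that define the rows and columns of M_S can be obtained from sorted sample quantiles; a minimal single-machine sketch (the paper builds them approximately and in parallel, and `equi_depth_boundaries` is our illustrative name):

```python
def equi_depth_boundaries(key_sample, n_buckets):
    """Return n_buckets - 1 split keys so that each bucket holds roughly
    len(key_sample) / n_buckets sampled keys: an approximate equi-depth
    histogram over one relation's join keys."""
    s = sorted(key_sample)
    step = len(s) / n_buckets
    return [s[int(round((b + 1) * step)) - 1] for b in range(n_buckets - 1)]
```

With one such histogram per relation, the two boundary lists induce the n_s x n_s grid over the join matrix whose cells become the M_S cells.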
The input sample size is s_i = Θ(n_s log n). We obtain the output sample size s_o = Θ(n_s) using Kolmogorov's statistics []. Next, we prove that, given n_s = sqrt(2nJ), the sampling stage has a low running time.

Lemma 3.2. The time complexity of the sampling stage is O(n_s log n_s). Given n_s = sqrt(2nJ), if J = O(n^(1/3) / log^2 n) ix, the time complexity is O(n/J).

3.2 Coarsening

The coarsening stage creates a matrix of a smaller size (the coarsened matrix M_C) by imposing a grid of size n_c x n_c over the input matrix M_S. Figure 3b illustrates M_C with n_c = 8. The goal is to minimize the maximum cell weight in M_C. This is an NP-hard tiling problem, but it admits approximate solutions. The best one is the coarsening algorithm [28], which has an approximation ratio of 2.

Deciding on n_c. To keep the running time low, while achieving accurate load balancing in the regionalization, we opt for n_c = 2J. We discuss how such an n_c brings accuracy in 3.4. The following lemma proves the low time complexity.

Lemma 3.3. The running time of the coarsening algorithm is O((n_s + n_c^2 log n_s) n_c log n_s). Given n_c = 2J, if J = O(n^(1/3) / log^2 n_s) ix, the time complexity becomes O(n).

viii This typically holds in practice. We relax it in [4].
ix See footnote v.

Monotonicity is a property of the join output distribution. The property holds for a large portion of join conditions, including equi-joins, band-joins and inequality-joins. It states

Algorithm 1 BSP.
1: function BSP(rectangles)
2:   for each rectangle r in rectangles do
3:     r_min = MinimalCandidateRectangle(r)
4:     if w(r_min) <= δ then
5:       r_min.regions = {r_min}
6:     else
7:       for each splitter in r_min do
8:         {r_1, r_2} = r_min.split
9:         r_min.regions = min(r_min.regions, r_1.regions ∪ r_2.regions)
10: end function

that if cell (i, j) is not a candidate cell, then either all cells (k, l) with k <= i, l >= j, or all cells (k, l) with k >= i, l <= j are also not candidate cells [2]. In other words, candidate cells are consecutive per row/column. All the join matrices in Figure 1 are monotonic, while the one in Figure 2b is not, due to the candidate cells marked in black.

MonotonicCoarsening. Based on monotonicity, we propose an improvement over the coarsening algorithm. The coarsening algorithm iteratively improves the coarsened matrix. To do so, in each iteration it computes the weights of all M_C cells. As non-candidate cells have weight 0, it suffices to compute the weights of the candidate cells. Monotonicity allows us to look only at the candidate cells. This improves the algorithm's running time in practice, although the complexity does not change asymptotically.

3.3 Regionalization

This stage creates the equi-weight histogram M_H, which consists of (at most) J rectangular regions over the M_C cells. Figure 3c illustrates M_H with J = 4. The goal is to minimize the maximum region weight (denoted as δ), while covering all the candidate M_C cells with regions x. The best such algorithm is Binary Space Partition (BSP) [, 27]. However, BSP solves a dual problem: given the maximum region weight δ, it minimizes the number of regions. To that end, we perform a binary search over δ until BSP returns a partitioning with J regions (machines).

BSP [, 27] is a tiling algorithm based on dynamic programming. BSP (Algorithm 1) analyzes each rectangle in M_C as follows. If the rectangle weight is below δ, we cover the rectangle with a single region (line 5). Otherwise, BSP splits the rectangle by a horizontal or
Otherwise, BSP splits the rectangle by a horizontal or vertical line such that the total number of regions used for the two sub-rectangles is minimized (lines 7-). We obtain a minimal set of regions for a rectangle by using the previously found splitters for each sub-rectangle. We acquire the final regions by extracting them from the rectangle encompassing the entire M C. Extending BSP to join load balancing. As we do not need to assign non-candidate M C cells to machines, we minimize a rectangle so that no candidate cell is omitted (line ). We denote such a rectangle as minimal candidate rectangle. For example, in Figure 2c, rectangle r (r 2) contains no candidate cells on the left and lower (right and upper) boundaries. Thus, before computing the weights, we minimize r (r 2) to its minimal candidate rectangle r min (r min2 ). As the input for the BSP is of size n c n c, and n c = 2J (see.2), BSP runs in O(J ) time. With binary search, it takes O(J log n) time. As in practice J = O( n/ log 2 n) xi, the regionalization based on BSP runs in O(n / log n) time. This is expensive compared to the join costs. x Regions are rectangular: they cover some non-candidates. xi See footnote v. Algorithm 2 MonotonicBSP. : function MonotonicBSP 2: rectangles m = GenerateCandidateRectangles() : Sort rectangles m according to the semi-perimeter 4: BSPCandidates(rectangles m) : end function 6: function GenerateCandidateRectangles 7: for x = to p do 8: for y = a cand. cell index in row x do : for x 2 = to p do 0: for y 2 = a cand. 
cell index in row x 2 do : rectangles m.add(x, y, x 2, y 2 ) 2: return rectangles m : end function 4: function BSPCandidates(rectangles m) : for each rectangle r m in rectangles m do 6: if w(r m) δ then 7: r m.regions = {r m} 8: else : for each splitter in r m do 20: {r, r 2 } = r m.split 2: r m = MinimalCandidateRectangle(r ) 22: r m2 = MinimalCandidateRectangle(r 2 ) 2: r m.regions = min(r m.regions r m2.regions) 24: end function BSP also suffers from high space complexity, which is is proportional to the total number of rectangles in the input matrix (M C). As a rectangle is defined by two corners, and each corner is defined by two points (M C cells), the space complexity is O(n 4 c) = O(J 4 ). MonotonicBSP. We propose MonotonicBSP, a novel tiling algorithm which drastically reduces both time and space complexity of the BSP. To do so, MonotonicBSP exploits the output distribution properties for monotonic joins. In BSP, a large portion of the time/space complexity is due to enumerating all the possible rectangles in M C (O(n 4 c)). However, for monotonic joins, only a small portion of these rectangles are minimal candidates. The main idea of MonotonicBSP is to analyze only minimal candidate rectangles. The main challenge is to enumerate them without even looking at all the rectangles, as this requires O(n 4 c) time. To do so, we use the following lemma. Lemma.4. A rectangle is defined by the upper left and the lower right corner. For monotonic joins, each defining corner of a minimal candidate rectangle is a candidate cell, yielding O(n 2 c) minimal candidate rectangles in total. Thus, by using each pair of candidate cells as the rectangle defining corners, the MonotonicBSP (Algorithm 2) generates all the minimal candidate rectangles (lines 6-). Then, the algorithm sorts the rectangles by their semi-perimeter (line ). Finally, it invokes a version of the BSP which considers only minimal candidate rectangles (lines 4-24). Lemma.. 
The regionalization stage based on MonotonicBSP runs in O(n_c log n_c log n) time. Given n_c = 2J, if J = O(√n / log² n)^xii, the stage takes O(n) time.

The space complexity of MonotonicBSP is O(n_c²), as there are n_c² minimal candidate rectangles (see Lemma .4). MonotonicBSP significantly outperforms the baseline BSP for monotonic joins, in terms of both space and time complexity. Namely, MonotonicBSP requires only O(n_c²) space and O(n_c log n_c log n) time, whereas the baseline BSP runs in O(n_c⁴) space and O(n_c log n) time.

^xii See footnote v.
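The enumeration behind Lemma .4 can be made concrete: every ordered pair of candidate cells defines one potential minimal candidate rectangle, and sorting by semi-perimeter ensures sub-rectangles are processed before the rectangles that split into them. A small sketch with illustrative names (0-indexed cells; not the paper's code):

```python
def minimal_candidate_rectangles(candidates):
    """Generate the rectangles whose two defining corners (upper-left and
    lower-right) are both candidate cells, then sort them by
    semi-perimeter -- O(|candidates|^2) rectangles instead of O(n_c^4)."""
    rects = []
    for (x1, y1) in candidates:
        for (x2, y2) in candidates:
            if x1 <= x2 and y1 <= y2:  # well-formed corner pair
                rects.append((x1, y1, x2, y2))
    # sub-rectangles (smaller semi-perimeter) come first
    rects.sort(key=lambda r: (r[2] - r[0]) + (r[3] - r[1]))
    return rects
```

With candidate cells `[(0, 0), (1, 1)]` this yields the two single-cell rectangles followed by the enclosing rectangle `(0, 0, 1, 1)`.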

.4 Putting it all together

We compare our histogram algorithm against the best known algorithm [], that is, the regionalization based on BSP over the original matrix M.

The computation cost. Directly applying previous work, computing the histogram requires O(n log n) time. In contrast, our histogram algorithm runs in only O(n) time.

Theorem .. The time complexity of the histogram algorithm is O(n).

The proof follows directly from Lemmas .2-.5.

The accuracy of load balancing. As our regionalization creates regions at the granularity of M_C cells, we next discuss how the coarsening stage affects load balancing. For output-only weight functions, Wang [8] showed that partitioning over a grid is within a factor of 4 of arbitrary partitioning over the original data. Applied to our case, if n_c ≥ J and the input matrix of the coarsening stage is M, then coarsening followed by regionalization produces a partitioning that is at most a factor of 4 worse than the one produced by regionalization alone. Sampling minimally affects load balancing, as with very high probability, M_S preserves the weight distribution of M ( ..). We lessen the factor of 4 by choosing n_c = 2J (rather than n_c = J) for the M_C size. Furthermore, we minimize the effect of this factor on load balancing by ensuring that the maximum cell weight in M_S is at most 0. of the maximum region weight in M_H (Lemma .). We experimentally validate the accuracy of our scheme (see 6).

4. JOIN OPERATOR

Next, we integrate our scheme into a join operator. We show how to collect the statistics (input and output samples) required by the histogram algorithm. Our join operator works as follows: First, we collect samples of the input and output tuples ( 4.1). Then, we build the equi-weight histogram (see ). Finally, we distribute and process the data according to the histogram.

Local Join Algorithm. Each machine processes a region using a local join algorithm.
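For concreteness, a local band-join over a region can probe a sorted index on one input, in the spirit of the tree-based local indexes used later in the evaluation (this sketch and its names are illustrative, not our actual implementation):

```python
import bisect

def local_band_join(r1_keys, r2_keys, beta):
    """Local band-join |a1 - a2| <= beta: build a sorted index over one
    side, probe it with binary search for each key of the other side."""
    idx = sorted(r2_keys)
    out = []
    for a in r1_keys:
        lo = bisect.bisect_left(idx, a - beta)    # first key >= a - beta
        hi = bisect.bisect_right(idx, a + beta)   # first key >  a + beta
        out.extend((a, b) for b in idx[lo:hi])
    return out
```

An equi-join is the special case β = 0 (where a hashmap index is cheaper); any local algorithm with these semantics can be plugged in.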
Our partitioning scheme is orthogonal to the local join implementation, as long as the implementation is suitable for low-selectivity joins and all machines employ the same implementation.

Sampling the Input Tuples. As described in ., we need a uniform random sample of size s_i from each relation. We build the input sample in one pass, in parallel, using Bernoulli sampling [8] with a sampling rate of q_i = s_i / n.

4.1 Sampling the Output Tuples

According to [], we cannot obtain a uniform random sample of the join output by joining uniform random samples of the input relations. Alternatively, performing the entire join and then sampling the output defeats the purpose of building the equi-weight histogram. The Stream-Sample algorithm [] provides a uniform random output sample without performing the entire join. However, it is a single-machine algorithm. To that end, we devise a parallel version of Stream-Sample. Next, we discuss its efficiency for join load balancing. Then, we explain the details of the baseline and parallel Stream-Sample.

Efficiency. The cost of Parallel Stream-Sample, which mainly comes from scanning the input relations, is small compared to the cost of the parallel join, due to the following: 1) The benefits of using the collected statistics easily surpass the scanning overhead, both in MapReduce [2, , 24] and in distributed databases [0]. In both cases, scanning involves repartitioning on the join keys [2, 0]. Our experiments ( 6) also show that scanning pays off, as JPS affects performance much more than scanning does. 2) The sample tuples contain only join keys, as we use the samples only for building M_S (and not for propagating them further in the query plan). This reduces the network traffic. 3) The output sample size is much smaller than the input relation size (s_o = Θ(n_s) = Θ(√(nJ)) ≪ n).

Stream-Sample. The Stream-Sample algorithm [] works only for equi-joins, but we extend it to band- and inequality-joins.
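The input-sampling step described above is embarrassingly parallel: every worker scans its own partition once and keeps each tuple independently with probability q_i = s_i / n. A minimal sketch:

```python
import random

def bernoulli_sample(partition, q, rng=random):
    """One-pass Bernoulli sampling: keep each tuple independently with
    probability q, so the sample size concentrates around q * |partition|."""
    return [t for t in partition if rng.random() < q]
```

Per-worker samples are simply concatenated; no coordination is needed beyond agreeing on the rate q.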
First, we introduce the notation. We denote the base relations as R_1 and R_2, and a sample of R_1 as S_1. Given a tuple t_1 ∈ R_1 with join key t_1.A, the joinable set of t_1 consists of all the tuples of R_2 that are joinable with t_1. For equi-joins, the joinable set of t_1 comprises all the tuples of R_2 with t_1.A as the join key. For band-joins and inequality-joins, the joinable set of t_1 contains all the tuples of R_2 with a join key within a certain distance (specified by the join condition) of t_1.A. We denote the joinable set size as d_2(t_1.A), and the ensemble of these sizes as d_2. Using the keys of d_2 with an equality predicate yields d_2equi. WR (WoR) stands for sampling With (Without) Replacement.

The Stream-Sample algorithm works as follows. We take a WR weighted sample S_1 of size s_o from R_1, where the weight of t_1 ∈ R_1 is d_2(t_1.A). Then, for each t_s ∈ S_1, we randomly choose t_2 ∈ R_2 from the joinable set of t_s and produce an output tuple t_s ⋈ t_2.

Parallelization. We design a parallel, scalable version of Stream-Sample, which runs efficiently on the same number of machines as the join itself. For ease of presentation, we describe it in terms of MapReduce [] jobs.

1. We build d_2equi from R_2 in a single MapReduce job. To reduce the work, we designate R_2 to be the smaller of the two relations. To partition the work evenly, we assign the R_2 tuples to the machines according to their join keys and the approximate equi-depth histogram on R_2.

2. In this step, we build d_2 and S_1. We create d_2 as follows: Each reducer obtains a range of sorted join keys from d_2equi, along with their multiplicities. As mentioned before, d_2(t_1.A) is the sum of the multiplicities of the join keys from R_2 which are within a certain distance of t_1.A (according to the join condition). Each time a reducer moves in the sorted key sequence such that the joinable set changes (adding or removing a tuple), a new d_2 key-value pair is created. We build a WR weighted sample S_1 from R_1, where the weights are based on d_2.
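For a band-join of width β, the reducer's sweep that derives d_2 from the sorted (key, multiplicity) pairs of d_2equi is equivalent to a prefix-sum range query per key; a sketch (the names and the prefix-sum formulation are ours):

```python
import bisect

def joinable_set_sizes(r1_keys, d2_equi, beta):
    """d_2(a) for a band-join |A1 - A2| <= beta: total multiplicity of the
    R2 join keys within distance beta of a. d2_equi is a sorted list of
    (key, multiplicity) pairs; prefix sums answer each query in O(log n)."""
    keys2 = [k for k, _ in d2_equi]
    pre = [0]
    for _, mult in d2_equi:
        pre.append(pre[-1] + mult)  # running total of multiplicities
    return {a: pre[bisect.bisect_right(keys2, a + beta)]
               - pre[bisect.bisect_left(keys2, a - beta)]
            for a in r1_keys}
```

Setting β = 0 recovers the equi-join case, where d_2(a) is just the multiplicity of key a in R_2.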
To do so, we use a parallel one-pass algorithm for WoR weighted sampling [6], which works as follows: It puts each t_1 ∈ R_1 into a priority queue of size s_o, using a priority computed as a function of d_2(t_1.A). After each reducer produces its reservoir, we merge the reservoirs into a single one using the same priority function. Finally, we transform S_1 from a WoR to a WR sample using []. We build d_2 and S_1 together in a single MapReduce job. We assign the d_2equi and R_1 tuples to the machines according to their join keys and the approximate equi-depth histogram on R_1, for the following reasons: First, d_2equi tends to be much smaller than R_1. Second, doing so balances the work of computing S_1.

3. Finally, we produce a uniform random output tuple for each tuple t_s ∈ S_1. As we use the output tuple only for building M_S, it contains only a concatenation of the join keys. This relieves us from choosing an R_2 tuple uniformly at random from the joinable set of t_s, which would require processing R_2 again. Instead, we randomly choose a join

key from the joinable set of t_s, with probability directly proportional to the key's multiplicity. These multiplicities are available in d_2equi. As S_1 is typically much smaller than d_2equi, we assign the S_1 and d_2equi tuples to the machines according to their join keys and the partitioning of d_2equi from step 1. Thus, we sort S_1 and use a Map-only job for this step.

Optimization. If an input relation has a predicate which filters out many tuples, we reduce the scanning overhead of the actual join by materializing the filtered relation during the statistics scan.

Synergy. We first build the equi-depth histograms on R_1 and R_2. Then, we sample the input and output tuples in parallel by sharing mappers (the former requires only one reducer).

Parameters. To build the sample matrix, we need to know m (see .). We obtain m from the Parallel Stream-Sample algorithm. In particular, as we iterate over the entire relation R_1 in step 2 of the algorithm, we compute m by summing d_2(t_1.A) over all tuples t_1 ∈ R_1.

4.2 Discussion and Generalization

System architecture. As [2] shows, systems designed for main-memory processing are very popular nowadays (e.g., Shark-Spark [6], Dremel [2]). These systems can fit the data in memory because they use many servers [2] and/or servers with memory capacities in the terabytes [0]. For that reason, recent parallel joins are main-memory operators [0, 2]. We follow the same reasoning and implement our operator in a main-memory parallel system. Our operator would certainly work in a disk-based system as well, but with smaller performance benefits, due to the increased cost of collecting the input and output samples.

Multi-way joins. Our scheme assumes two-way joins. As our scheme enhances performance especially when the output cost (e.g., transferring tuples between operators) matters a lot, a multi-way join can be efficiently executed using a sequence of our two-way joins.
In the future, we plan to extend our approach to support a multi-way join in a single join operator, akin to how [] or [2] extend [2].

Heterogeneous clusters. In heterogeneous clusters, we assign work to machines proportionally to their capacity. To do so, we set the number of regions (J) in the histogram algorithm higher than the number of machines.

5. RELATED WORK

Sample data structures. The data structure most similar to our sample matrix M_S is the single-relation multi-attribute histogram [26, , 2, 7], which is used for selectivity estimation. These histograms represent the frequency distribution over a multi-dimensional space. In our M_S, the frequency represents the number of output tuples for the corresponding segments of the input relations. However, multi-attribute histograms cannot provide load balancing, as they capture the frequency rather than the weight distribution, and they can neither guarantee the maximum cell weight nor decide on the M_S size n_s.

Load balancing has been extensively studied both in distributed databases and in MapReduce. The predominant join type in MapReduce is the repartition join [6], which moves each input tuple over the network. In distributed databases, the broadcast join duplicates one relation on all the machines. This is efficient only if the duplicated relation is very small [7]. The directed join moves portions of one relation to the corresponding locations of the other relation. It typically requires that one relation be physically partitioned on the join key [6]. This is a limitation when we join a relation with other relations using different join keys [6]. We propose a novel repartition join, as repartition joins are the most widely applicable.

1. Distributed databases. Track Join [0] is a dedicated solution for minimizing the total network cost rather than for load balancing. Other previous work models skew as a task scheduling problem.
In particular, relations are partitioned into small hash buckets [2] or small units of range [4] and are distributed among machines in an attempt to balance the load. Techniques based on hash buckets are susceptible to RS. The VP-RR scheme [4], which assigns tasks (ranges) to machines in a round-robin fashion, addresses RS of one (build) relation, but ignores JPS and RS of the other (probe) relation. The VP-PS scheme [4] and the V-optimal scheme [2] try to address both RS and JPS by estimating the join cost of each task. To do so, both schemes [4, 2] estimate the join output size of a task using the input sizes and assuming a uniform join key frequency distribution within a task. This can result in huge cost estimation errors (the estimated value diverges from the actual task output), which leads to inaccurate load balancing. This is in accordance with [4], where the authors show that VP-PS, which attempts to address JPS, performs worse than VP-RR due to the cost estimation errors. In comparison, all the previous work: (i) is restricted to equi-joins, whereas our scheme supports non-equi-joins as well; (ii) is susceptible to JPS, as it makes assumptions about the join key frequency distributions. In contrast, our scheme addresses both RS and JPS without any prior assumptions or knowledge about the data distribution.

Hybrid operators perform different types of joins for different join keys. The Partial Redistribution Partial Duplication (PRPD) scheme [7] and the F-SkewJoin scheme [8] perform a hash repartition join for keys with low frequency in both relations (LL), and a broadcast join for keys with low frequency in one relation and high frequency in the other (LH). The benefit of such an approach is keeping H keys local, saving network traffic. However, as explained in [8], this approach can still result in JPS if the distribution of H keys is non-uniform. This is often the case, especially if the input relation is the result of another join.
To that end, the PRPD scheme repartitions all the keys, including the H keys. F-SkewJoin [8] improves over [7] by broadcasting tuples to a subset of the machines, rather than to all of them. F-SkewJoin also proposes new outer join operators. The main differences to our work are as follows. First, these schemes support only equi-joins. For equi-joins, we could extend our work using some ideas from [7, 8]; that is, by using the hash repartition join for LL keys, we reduce the join matrix size, which further reduces the cost of building our scheme. Second, these schemes [7, 8] consider only discrete points in the relative frequency of a key (HL and LL). However, in real-world applications a key often has moderate or high frequency in both relations, for example, when joining relations on their foreign keys. Due to broadcasting (or multicasting) a large number of tuples, these schemes incur high network cost. In contrast, our scheme addresses JPS without incurring high network cost. We do so by using the join matrix model and by covering the entire spectrum of relative key frequencies. Third, it is hard to choose the best boundary for deciding whether a key is H or L. When doing so, F-SkewJoin [8] builds a cost model which ensures that the parallel join execution is faster than the serial execution. This is different from minimizing the maximum work per machine, which is our objective.

2. MapReduce Joins. Previous work focuses primarily on equi-join implementations [, 6] that partition the input through hashing. Okcan et al. [2] support other join predicates as well. We discuss [2] in detail in 2.2 and 2.3. In contrast to the 1-Bucket scheme [2], our scheme achieves load balancing with minimal work per machine. In contrast to the M-Bucket scheme [2], we address JPS.

3. Adaptive load balancing. Adaptive solutions for skew handling have been proposed for hash joins (e.g., [20]) and for general-purpose MapReduce applications (e.g., [2]). In general, adaptive solutions work as follows: when a task is idle, it takes over some of the work of the busiest task. This implies moving tuples over the network multiple times, which increases network costs. In contrast, we ensure that, after we build the partitioning scheme, each tuple is repartitioned exactly once.

6. EVALUATION

This section compares our operator with state-of-the-art operators. After describing the experimental setup, we evaluate the performance of the operators in terms of execution time and resource consumption, i.e., memory requirements and network communication. Then, we assess the scalability of each operator. Finally, we evaluate the accuracy of our partitioning scheme, along with the efficiency of building it.

6.1 Experimental Setup

Environment. We perform our experiments on an Oracle Blade 6000 server with 10 Oracle X6270 M2 blades. Each blade has two Ghz 6-core Intel Xeon X67 CPUs. Out of 120 cores, 64 are available exclusively for our experiments. Each blade runs Ubuntu 12.04 and has 72GB of DDR RAM and a Gbit Ethernet interface. A machine assigned to an operator is a core with an exclusively assigned portion of the blade's main memory.

Datasets. We experiment on both TPC-H [] and synthetic datasets. We employ the TPC-H generator [], which creates databases with Zipf distributions set through the skew parameter z. We set z to 0.2 to demonstrate how large JPS can be even when RS is moderate. The synthetic dataset has two relations, each with two segments.
The first and second segments contain 80% and 20% of all the tuples, respectively. In the first segment, we randomly sample the join keys from a domain which is 4 times bigger than the sample size in the segment. Similarly, in the second segment, we randomly sample join keys from a domain which is 6 times smaller. The domains for the same segment are identical across both relations. As a result, we obtain a skewed workload in which a small portion of the keys has high multiplicity. Such a workload frequently occurs in practice [7].

Operators. We evaluate three different operator partitioning schemes: (i) C_I (the 1-Bucket scheme) [2], (ii) CS_I (the M-Bucket scheme) [2], and finally, (iii) CS_IO, which is our equi-weight histogram scheme. The local join algorithm utilizes indexes to compute the joins, i.e., balanced binary trees for band-joins and hashmaps for equi-joins.

Configuration. We run the join queries on J = 32 machines, whereas for the scalability experiments we use 16 to 64 machines. For the joins over the TPC-H data, we set the scale factor to 160, and for the scalability experiments, we set it between 80 and 320 (i.e., up to 320GBs). We set n_c = 2J in the histogram algorithm of CS_IO. In the histogram algorithm of CS_I^xiii, we set the number of buckets p to . In the scalability experiments, we scale p proportionally to J.

^xiii This histogram algorithm does not create equi-weight histograms; it is a heuristic that builds a partitioning scheme.

Name    Relations          Join Predicate        input    output
B_ICD   Orders ⋈ Orders    Band-join (β = 2)     480M     26M
B_CB    R1 ⋈ R2            Band-join (β = 1)     2M       48M
B_CB    R1 ⋈ R2            Band-join (β = 2)     2M       80M
B_CB    R1 ⋈ R2            Band-join (β = 3)     2M       82M
B_CB    R1 ⋈ R2            Band-join (β = 4)     2M       044M
B_CB    R1 ⋈ R2            Band-join (β = 8)     2M       72M
B_CB    R1 ⋈ R2            Band-join (β = 16)    2M       828M
E_OCD   Orders ⋈ Orders    Equi-join             6.8M     2000M
* For TPC-H joins, the database size is 160G and z = 0.2.
Table 4: Joins' input and output sizes in millions of tuples. β is the width of the band.

Programming model.
We use an in-memory MapReduce-like system^xiv, which is based on Squall^xv. The system runs in Java JRE v1.7. Mappers shuffle the input tuples according to the partitioning scheme of the operator. Reducers perform the actual join and randomly shuffle the output tuples to the mappers of the next stage (e.g., join, aggregation). Thus, each mapper performs the same amount of work, and it suffices to balance the load among the reducers of a job. The job execution time includes the time for sending the tuples over the network to the next stage.

Cost model. For our experiments, we define the weight function ( 2.) for load balancing among the reducers as^xvi:

w(r) = c_i(r) + c_o(r) = w_i · input(r) + w_o · output(r)

where w_i and w_o are the average time costs of processing a single input and output tuple, respectively. We determine the values of w_i and w_o using linear regression over several benchmark runs. The regression method automatically divides all the communication and computation costs into input and output costs. The regression results in our system suggest the values w_i =  and w_o = 0.2 for band-joins, and w_i =  and w_o = 0. for equi-joins. The difference is due to the local index used (balanced binary trees or hashmaps).

Joins. We evaluate an equi-join (E_OCD) and a band-join (B_ICD) on the TPC-H dataset, and a band-join (B_CB) with 6 different widths of the band (β) on the synthetic dataset. These joins represent various positions in the cost distribution spectrum, where the input cost is c_i = w_i · input and the output cost is c_o = w_o · output. Accordingly, we distinguish input-cost dominated, cost-balanced, and output-cost dominated joins. (i) At one end, B_ICD is an input-cost dominated join, as c_i = 8. c_o. (ii) At the other end, E_OCD is an output-cost dominated join, as c_i = 0.06 c_o. (iii) In between, B_CB is a cost-balanced join, as the input and output costs are comparable (0.2 c_o ≤ c_i ≤ 2.7 c_o, depending on β). Table 4 summarizes the characteristics of the joins.
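The regression for w_i and w_o amounts to a two-variable least-squares fit of measured run times against per-machine input and output tuple counts. A sketch via the no-intercept normal equations (illustrative; not the exact procedure used in our system):

```python
def fit_cost_weights(runs):
    """Least-squares fit of t ~ w_i * n_in + w_o * n_out over benchmark
    runs given as (n_in, n_out, time) triples (normal equations for the
    two unknowns, no intercept term)."""
    sxx = syy = sxy = sxt = syt = 0.0
    for n_in, n_out, t in runs:
        sxx += n_in * n_in
        syy += n_out * n_out
        sxy += n_in * n_out
        sxt += n_in * t
        syt += n_out * t
    det = sxx * syy - sxy * sxy   # assumes runs are not collinear
    w_i = (sxt * syy - syt * sxy) / det
    w_o = (syt * sxx - sxt * sxy) / det
    return w_i, w_o
```

Given runs whose times are exactly 1·n_in + 0.2·n_out, the fit recovers w_i = 1 and w_o = 0.2.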
For the scalability experiments, we run the joins with input sizes up to 480 million tuples and output sizes up to 8.8 billion tuples. The joins are defined in our technical report [4].

6.2 Performance Analysis

In this section, we evaluate the operators' performance in terms of execution time and resource consumption, i.e., memory requirements and network communication. We show that the execution time mainly depends on the join output/input ratio, which we call ρ_oi.

Total execution time is shown in Figure 4a as the sum of the time for building the partitioning scheme ("stats time") and the join execution time ("join time"). C_I has only

^xiv See 4.2 for a discussion of alternative architectures.
^xv
^xvi The model can be flexibly adapted to represent any realistic cost function, as described in 2.4.

[Figure 4: Performance Evaluation. (a) Total Execution Time; (b) Normalized Execution Time, B_CB vs. β; (c) Cluster Memory Consumption; (d) B_CB Scalability: Execution Time; (e) B_CB Scalability: Memory Consumption; (f) E_OCD Scalability: Execution Time; (g) E_OCD Scalability: Memory Consumption; (h) Maximum Region Weight. Annotations: ρ_oi = .2, .62, 6.04, 8.46, 0.87, 20.4, .88, 08.7 for B_ICD, B_CB-1, B_CB-2, B_CB-3, B_CB-4, B_CB-8, B_CB-16, E_OCD, respectively.]

join time, as it has no preprocessing overhead. Figure 4b illustrates the normalized total execution times for B_CB. The operators' execution times are highly correlated with the output/input ratio: (i) For low values of ρ_oi, i.e., at one end of the spectrum, the input costs dominate the join execution time. Thus, C_I, which replicates each input tuple to 6 machines, performs poorly for B_ICD. CS_I avoids this problem, but due to the lack of output statistics it suffers from JPS. (ii) For high values of ρ_oi, i.e., at the other end of the spectrum, the output costs dominate the join execution time. Thus, for E_OCD, the effect of JPS on the join execution time of CS_I escalates, causing CS_I to perform poorly. C_I achieves almost perfect load balancing of the output costs by randomly assigning the tuples to the machines. However, it still suffers from unnecessary input replication. (iii) As B_CB is a cost-balanced join, both existing operators perform poorly.
Increasing the width of the band β in B_CB increases ρ_oi, such that the output costs grow relative to the input costs. This improves the performance of C_I and degrades the performance of CS_I relative to CS_IO. Our CS_IO outperforms the other operators, as it is close to optimal on the total work per machine, which includes both input and output costs. CS_IO captures the output distribution and avoids high input tuple replication. Thus, in terms of the join execution time, CS_IO achieves from .04× (B_ICD) to 7.22× (E_OCD) speedup compared to CS_I, and from . (E_OCD) to 4.× (B_ICD) speedup compared to C_I. Similarly, in terms of the total execution time, CS_IO achieves up to .6× (E_OCD) speedup compared to CS_I, and up to .6× (B_ICD) speedup compared to C_I. In fact, as C_I for B_ICD runs out of memory, we extrapolate its total execution time using the cost model parameters and the percentages of processed input and output tuples.

Choosing between CS_I and C_I. We cannot choose the better of the existing CS_I and C_I without knowing the input/output sizes. However, output size estimation using precomputed statistics is error-prone due to possible predicate correlations [22]. In fact, the estimate can be orders of magnitude off []. This might lead to choosing the worse of C_I and CS_I. As a result, our CS_IO achieves from 2.2× (B_CB-4) to .6× (E_OCD) speedup.

More detailed input statistics in CS_I. We show that more detailed input statistics (an increased number of buckets p) in CS_I can cure neither the lack of output statistics nor JPS. Table 5 shows the join execution and histogram algorithm times for different p. For both E_OCD and B_CB, increasing p leads to increased histogram algorithm time, and thus increased time for building the partitioning scheme. Increasing p also decreases the join execution time.
However, even if CS_I is given more time for building the partitioning scheme than CS_IO, its total execution time is still much longer than that of CS_IO. For instance, for E_OCD and p = 24000, the histogram algorithm time grows to 4 seconds, making the time for building the partitioning scheme .8×

[Table 5: Join execution and histogram algorithm time (s) of CS_I for different numbers of buckets p, for E_OCD and B_CB (β = ).]

longer than that of CS_IO. Moreover, in that case, the total execution time of CS_I is 8.× longer than that of CS_IO.

Resource consumption. In general, C_I has better execution times than CS_I, but it is very resource-inefficient compared to both CS_I and CS_IO. Figure 4c illustrates the cluster memory consumption, which also represents the network traffic (the input data sent from mappers to reducers). For B_ICD and B_CB, C_I requires around 4× more memory than CS_I and CS_IO. This is due to the fact that C_I suffers from excessive tuple replication (the replication factor is 6). C_I requires around 4× rather than 6× more memory because both CS_I and CS_IO replicate some input tuples (see Figures c, d). CS_IO uses slightly more memory than CS_I, because CS_IO balances the total work, so it assigns more input to the regions with relatively small output. As C_I for B_ICD runs out of memory, we extrapolate its memory consumption using the percentage of the processed input tuples. C_I in E_OCD does not have high memory consumption, as the input size is smaller than for B_ICD and B_CB.

6.3 Scalability

Next, we evaluate the weak scalability of the operators by scaling the data size and the number of machines evenly. We show that our CS_IO, in contrast to CS_I and C_I, scales well both in total execution time and in resource consumption.

B_CB total execution time is shown in Figure 4d. We experiment on various input/output/J settings, more specifically, 6M/406M/16, 2M/8M/32 and 84M/.62B/64, where M and B stand for millions and billions of tuples. C_I scales worse than CS_I and CS_IO. This is expected, as the replication factor increases from 4 (J = 16) to 8 (J = 64), which results in doubling the input costs on each machine. Namely, for J = 16 and J = 64, C_I has .66× and .
longer total execution times than CS_IO, respectively. In fact, as C_I with J = 64 runs out of memory, we extrapolate its total execution time.

B_CB resource consumption. Figure 4e shows the cluster memory consumption for B_CB. As the replication factor grows from 4 (J = 16) to 8 (J = 64), C_I requires increasingly more memory compared to CS_I and CS_IO. Namely, for J = 16 and J = 64, C_I requires .× and .2× more memory than CS_IO, respectively. As C_I with J = 64 runs out of memory, we extrapolate its memory consumption.

E_OCD total execution time is shown in Figure 4f. The input/output/J settings are 2.2M/62M/16, 6.8M/2B/32 and 62M/8.8B/64. As increasing J from 16 to 64 (4×) causes the output size to grow 4.46×, CS_IO and C_I achieve good scalability. For C_I, this is due to the fact that the output cost becomes much larger than the input cost. For CS_IO, this validates the efficiency of our scheme even for highly output-cost dominated joins (for J = 64, ρ_oi = 28.4). On the other hand, CS_I scales very poorly, as JPS causes only a few machines to produce most of the output. Namely, for J = 16, J = 32 and J = 64, CS_I has 7.2×, .6× and .4× longer total execution times than CS_IO, respectively.

E_OCD resource consumption. Figure 4g shows the cluster memory consumption for E_OCD. The gap between C_I and the other two operators is smaller than for B_CB due to the following: as the number of machines J grows from J = 16 to J = 64 (4×), the input size grows only 2.2×. Thus, for J = 64, C_I takes 2.82× more memory than CS_IO.

Scalability summary. Overall, in terms of total execution time and resource consumption, only CS_IO scales well for both B_CB and E_OCD.

6.4 Accuracy and Efficiency of CS_IO

This section evaluates the accuracy of our partitioning scheme, as well as the efficiency of building it. Building the partitioning scheme consists of building the input and output samples and running our histogram algorithm.

Accuracy of the cost model.
The cost model is accurate if the region weight corresponds to the machine's work, that is, if the maximum region weight corresponds to the join execution time. Figure 4h validates the model accuracy: for each join among B_ICD, B_CB and E_OCD, the maximum region weights of the different schemes are proportional to the corresponding join execution times. The proportionality holds only within the same join, as a weight unit corresponds to a different amount of work in different joins (due to different indexes and/or join conditions). We obtain the weights after the join execution by computing the weight function on the number of input and output tuples per machine.

Accuracy of our CS_IO. Figure 4h shows that the estimated maximum region weight in our histogram algorithm, marked as CS_IO-est., is at most 6% off the value computed after the execution. Thus, our scheme is accurate.

The time for building the partitioning scheme is illustrated in Figure 4a as "stats time". It includes the time to collect the statistics (the input statistics for CS_I, and the input and output statistics for CS_IO) and the running time of the histogram algorithm. We evaluate the efficiency of building our partitioning scheme and compare it to that of CS_I. For CS_IO and CS_I, as ρ_oi grows, the ratio of the time to build the partitioning scheme to the total execution time drops. This is expected, as building the partitioning scheme requires processing only the input relations and does not depend on the output size. Namely, Figure 4a shows that building CS_IO takes from % (B_ICD) to only .8% (E_OCD) of its total execution time. Similarly, building CS_I takes from 24.% (B_ICD) to only .6% (E_OCD) of the corresponding CS_IO total execution time. Building the partitioning scheme for CS_IO is at most 6.7% (B_ICD) more expensive than that of CS_I, in terms of the CS_IO total execution time. This is due to the following reasons: (i) Collecting the input statistics is much cheaper in CS_IO than in CS_I.
CS_I requires two MapReduce stages, while CS_IO requires only one. This is because CS_I needs many more buckets than CS_IO to account for the error caused by the lack of output statistics. In the worst case, the required number of buckets for CS_I for a join with high JPS is Θ(n). In contrast, the number of buckets for CS_IO does not depend on skew at all, and it grows only slowly with n (as O(√n)). (ii) The time to collect the output statistics for CS_IO is not much longer than the second pass of collecting input statistics for CS_I. In addition to a scan over the input relations, which both operators require for building the partitioning scheme, CS_IO performs a scan over d_2equi (step 2 in 4.) and produces the output sample (step 3). The former is cheap, as d_2equi tends to be much smaller than its originating relation, which is the smaller one. The latter is very cheap, as the output

sample size is very small compared to n (s_o = Θ(√(nJ))).

Worst-case scenario. As we already saw, building CS_I compares best against building CS_IO for very small ρ_oi (B_ICD). In that case, CS_IO achieves minimal speedups in the join execution time (1.04×) compared to CS_I. This is because the output cost is 8. smaller than the input cost, so JPS minimally affects performance. In fact, joins with very small ρ_oi behave almost as if there were no JPS at all. Thus, in the worst case (B_ICD), the total execution time of CS_IO is 1.04× higher than that of CS_I. This is negligible compared to the speedups that our scheme achieves for the other joins.

Accuracy/Efficiency summary. Overall, our equi-weight histogram scheme is practical, as it provides both accurate and efficient load balancing.

6.5 Summary
Joins lie on a spectrum of cost distribution: at one end input costs dominate the join cost, at the other end output costs do. Previous work, i.e., CS_I and CI, performs well only at the extreme ends of this output/input spectrum. This is because CI suffers from excessive input tuple replication (which worsens as the number of joiners increases), while CS_I cannot capture the output cost distribution. Due to errors in output size estimation, especially for non-equi joins, it becomes likely that the wrong operator among CS_I and CI is chosen, causing severe performance degradation. In contrast, our CS_IO is close to optimal on the total work per machine, which includes both input and output costs. Thus, it performs very well over a wide spectrum of output/input ratios, and it scales with increasing data sizes. CS_IO achieves up to .2 improvement in resource consumption and up to . speedup compared to CI. Moreover, CS_IO achieves up to .6 speedup compared to CS_I. As these speedups refer to the total execution time, they also validate the efficiency of building our scheme.

7. REFERENCES
[1] The TPC-H benchmark.
[2] A. Aboulnaga and S. Chaudhuri.
Self-tuning histograms: building histograms without looking at data. In SIGMOD, 1999.
[3] F. Afrati and J. Ullman. Optimizing joins in a MapReduce environment. In EDBT, 2010.
[4] P. Berman, B. DasGupta, S. Muthukrishnan, and S. Ramaswami. Improved approximation algorithms for rectangle tiling and packing. In SODA, 2001.
[5] P. Berman, B. DasGupta, and S. Muthukrishnan. Slice and dice: a simple, improved approximate tiling recipe. In SODA.
[6] S. Blanas, J. Patel, V. Ercegovac, J. Rao, E. Shekita, and Y. Tian. A comparison of join algorithms for log processing in MapReduce. In SIGMOD, 2010.
[7] N. Bruno, S. Chaudhuri, and L. Gravano. STHoles: a multidimensional workload-aware histogram. In SIGMOD, 2001.
[8] N. Bruno, Y. Kwon, and M. Wu. Advanced join strategies for large-scale distributed computation. In VLDB, 2014.
[9] S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. In SIGMOD, 1999.
[10] S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: how much is enough? In SIGMOD, 1998.
[11] S. Chaudhuri and V. Narasayya. TPC-D data generation with skew.
[12] S. Chu, M. Balazinska, and D. Suciu. From theory to practice: efficient join query evaluation in a parallel database system. In SIGMOD, 2015.
[13] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, 2004.
[14] D. DeWitt, J. Naughton, D. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In VLDB, 1992.
[15] C. Doulkeridis and K. Norvag. A survey of large-scale analytical query processing in MapReduce. VLDBJ, 23(3), 2014.
[16] P. Efraimidis and P. Spirakis. Weighted random sampling with a reservoir. Information Processing Letters, 97(5), 2006.
[17] M. Elseidy, A. Elguindy, A. Vitorovic, and C. Koch. Scalable and adaptive online joins. In VLDB, 2014.
[18] R. Gemulla, P. Haas, and W. Lehner. Non-uniformity issues and workarounds in bounded-size sampling. VLDBJ, 22(6), 2013.
[19] J. Gibbons. Nonparametric Methods for Quantitative Analysis.
Holt, Rinehart and Winston, New York, 1976.
[20] L. Harada and M. Kitsuregawa. Dynamic join product skew handling for hash-joins in shared-nothing database systems. In DASFAA.
[21] K. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In VLDB, 1991.
[22] Y. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results. In SIGMOD, 1991.
[23] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. SkewTune: mitigating skew in MapReduce applications. In SIGMOD, 2012.
[24] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. A study of skew in MapReduce applications. In Open Cirrus Summit, 2011.
[25] S. Melnik, A. Gubarev, J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of web-scale datasets. In VLDB, 2010.
[26] M. Muralikrishna and D. DeWitt. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In SIGMOD, 1988.
[27] S. Muthukrishnan, V. Poosala, and T. Suel. On rectangular partitionings in two dimensions: algorithms, complexity, and applications. In ICDT, 1999.
[28] S. Muthukrishnan and T. Suel. Approximation algorithms for array partitioning problems. Journal of Algorithms, 54(1), 2005.
[29] A. Okcan and M. Riedewald. Processing theta-joins using MapReduce. In SIGMOD, 2011.
[30] O. Polychroniou, R. Sen, and K. Ross. Track join: distributed joins with minimal network traffic. In SIGMOD, 2014.
[31] V. Poosala and Y. Ioannidis. Selectivity estimation without the attribute value independence assumption. In VLDB, 1997.
[32] V. Poosala and Y. Ioannidis. Estimation of query-result distribution and its application in parallel-join load balancing. In VLDB, 1996.
[33] M. Stillger, G. Lohman, V. Markl, and M. Kandil. LEO - DB2's learning optimizer. In VLDB, 2001.
[34] A. Vitorovic, M. Elseidy, and C. Koch. Load balancing and skew resilience for parallel joins. EPFL-REPORT 2066 Technical Report, 2015.
[35] C. Walton, A. Dale, and R. Jenevein.
A taxonomy and performance model of data skew effects in parallel joins. In VLDB, 1991.
[36] R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. In SIGMOD, 2013.
[37] Y. Xu, P. Kostamaa, X. Zhou, and L. Chen. Handling data skew in parallel joins in shared-nothing systems. In SIGMOD, 2008.
[38] Y. Wang. Relations between two common types of rectangular tilings. International Journal of Computational Geometry & Applications, 15(2), 2005.
[39] X. Zhang, L. Chen, and M. Wang. Efficient multi-way theta-join processing using MapReduce. VLDBJ, 5(11):1184-1195, 2012.

APPENDIX
A. THE HISTOGRAM ALGORITHM: DETAILS AND PROOFS

A.1 Sampling
The following lemmas require that w(r) is monotonic xvii (a region weighs at least as much as any of its subregions) and that c_i(r) and c_o(r) are superadditive, that is, processing x + y input (output) tuples is at least as expensive as the sum of the processing costs for x and y input (output) tuples. This holds for realistic cost models.

Lemma 5.1. n_s = √(2nJ) is the minimum M_S size such that the maximum cell weight σ in M_S is at most half of the maximum region weight of the M_H optimum partitioning. This holds independently of the join condition and the join key distribution, given that m ≥ n xviii.

Proof. An M_S cell corresponds to a region in the original matrix with dimensions n/n_s × n/n_s, where n_s is the number of buckets in the equi-depth histogram. Because n_s = √(2nJ), the semi-perimeter of each cell is max_cell(s_p(cell)) = 2n/n_s = √(2n/J). The maximum frequency of an M_S cell is given by the Cartesian product between the encompassed input tuples from the two relations, that is, max_cell(f(cell)) ≤ (n/n_s)² = n/(2J). Because m ≥ n, it follows that max_cell(f(cell)) ≤ m/(2J). It holds that σ = max_cell(w(cell)) ≤ c_i(√(2n/J)) + c_o(m/(2J)). As J ≪ n (it suffices that J ≤ n/2), it follows that √(2n/J) ≤ n/J and σ ≤ c_i(n/J) + c_o(m/(2J)). We denote the maximum region weight of the M_H optimum partitioning by w_OPT. It holds that w_OPT ≥ (c_i(2n) + c_o(m))/J, as each incoming tuple is assigned to at least one region. Since c_i and c_o are superadditive, it follows that w_OPT ≥ c_i(2n/J) + c_o(m/J) and thus σ ≤ w_OPT/2. More precisely, as the bucket sizes are probabilistic, we can conclude that, with high probability, these bounds are very close to the actual bounds.

For the next lemma, we need to define the input and output sample sizes.

Input sample size s_i. For each relation, we build an approximate equi-depth histogram [10].
Namely, for each relation, we take a uniform random sample of size s_i, sort the sampled tuples by the join key, and then build an equi-depth histogram over them with n_s < s_i buckets. According to [10], for a given n_s, s_i needs to be at least 4 n_s ln(2n/γ)/e², where e is the maximum error on the bucket size with probability at least γ. This implies that a small sample of size s_i = Θ(n_s log n) is sufficient for building an approximate equi-depth histogram.

Output sample size s_o. In [26], the authors show that the sample size is not a function of the actual data size, and that it can be obtained from standard tables based on Kolmogorov's statistics [19]. For example, for an error on the region output within % and confidence of at least %, the standard tables only require a sample size of at least 06. On the other hand, the sample size should be a small integer multiple of the number of scrutinized categories in the population. In our case, this is the number of candidate M_S cells (n_sc), as non-candidate cells never produce an output tuple. Thus, the output sample size is s_o ≥ max(06, n_sc) for the specified error margin and confidence interval. Consequently, we need an output sample of size s_o = Θ(n_sc).

To determine n_sc, we define a low-selectivity join precisely. As 2.2 states, the content-insensitive (CI) scheme achieves (almost) perfect load balancing for output. Thus, if the output size m > ρ_oi·n (ρ_oi is a constant), CI works very well. Hence, it pays off to use a content-sensitive scheme only if m ≤ ρ_oi·n xix, that is, if m = O(n). This condition defines a low-selectivity join. We require a similar condition to hold between the input and output of the sample matrix xx:

n_sc = O(n_s)    (1)

Thus, we need a small output sample: s_o = Θ(n_s) = Θ(√(nJ)).

xvii Monotonicity on the join output and monotonicity of the weight function are different notions and should not be confused.
xviii This typically holds in practice. We relax it in A.5.
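The per-relation sampling step described above (uniform sample, sort by join key, equi-depth bucket boundaries) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and the list-based key representation are made up:

```python
import random

def equi_depth_boundaries(keys, n_s, sample_size):
    """Approximate equi-depth histogram: sample the join keys uniformly,
    sort the sample, and take every (len(sample)/n_s)-th key as a
    bucket boundary, so each bucket holds roughly the same number
    of sampled tuples."""
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / n_s
    # n_s - 1 interior boundaries split the key domain into n_s buckets.
    return [sample[int(round((b + 1) * step)) - 1] for b in range(n_s - 1)]
```

With a sample covering all keys the boundaries are exact; with a smaller sample they are approximate, with the error/confidence trade-off governed by the bound from [10] quoted above.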
We next prove that, given n_s = √(2nJ) from Lemma 5.1, the time complexity of the sampling stage is low.

Lemma 5.2 [Extended version]. The total sample size collected for building M_S is Θ(n_s log n). The sampling stage runs in O(n_s log n_s) time. Given n_s = √(2nJ), if J = O(n^(1/3)/log² n) xxi, the sample size and the time complexity are both O(n/J).

Proof. From J = O(n^(1/3)/log² n), it follows that

log n = O(√(n/J³))    (2)

As the input sample size is s_i = Θ(n_s log n) and the output sample size is s_o = Θ(n_s), the total sample size is dominated by the input. By substituting log n from Equation 2, the total sample size is s_i = Θ(n_s log n) = Θ(√(nJ) log n) = O(n/J).

Let us discuss the time complexity of building M_S. To create the approximate equi-depth histogram on the input, we need to sort the input sample tuples. We do this on the sites providing the samples, incurring O(log² s_i) = O(log²(n/J)) time. For each sampled output tuple (s_o of them), we use binary search to find the M_S cell to increment. Thus, processing the output takes O(s_o log n_s) = O(n_s log n_s) time. As n_s ≤ n, it follows that log n_s ≤ log n, and from Equation 2, log n_s = O(√(n/J³)). Hence, O(n_s log n_s) = O(n/J). Thus, the total time complexity of building M_S is bounded by O(n/J).

A.2 Coarsening
The coarsening algorithm [28] solves the Rtile (rectangle tiling) problem with grid (n_c × n_c) partitioning and the Max-Weight-Id metric. It is an approximation algorithm with an approximation ratio of 2 [28]. That is, given M_S and n_c (the size of M_C), where the maximum M_C cell weight of the optimum grid partitioning is φ₀, the algorithm returns an M_C with maximum cell weight φ ≤ 2φ₀. For sparse matrices, if range trees are used for computing the prefix sums, the coarsening algorithm [28] runs in

O(n_sc log n_sc + (n_s + n_c² log n_sc) · n_c log n_s)    (3)

time, where n_s is the M_S size and n_sc is the number of candidate cells in M_S.
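Both the coarsening and regionalization stages repeatedly query the total weight of axis-aligned rectangles; over a dense matrix, a one-pass prefix-sum precomputation answers each such query in O(1) (the algorithm above uses range trees for the sparse case). A minimal dense-matrix sketch, with illustrative names:

```python
def prefix_sums(cells):
    """2-D inclusive prefix sums: P[i][j] = sum of cells[0..i-1][0..j-1]."""
    rows, cols = len(cells), len(cells[0])
    P = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            P[i + 1][j + 1] = (cells[i][j] + P[i][j + 1]
                               + P[i + 1][j] - P[i][j])
    return P

def rect_sum(P, r1, c1, r2, c2):
    """Total weight of the rectangle [r1..r2] x [c1..c2], in O(1)
    via inclusion-exclusion on the prefix-sum array."""
    return P[r2 + 1][c2 + 1] - P[r1][c2 + 1] - P[r2 + 1][c1] + P[r1][c1]
```

The precomputation costs one scan of the matrix; every subsequent rectangle-weight query is four array lookups.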
xix In our experimental setup, our scheme works well even if m is two orders of magnitude bigger than n (see 6.). Thus, we cover a wide range of joins in practice.
xx If any of these assumptions does not hold, we fall back to the content-insensitive operator. However, we experimentally show that the assumptions hold for many interesting joins.
xxi See footnote v.

MonotonicCoarsening. Monotonicity (consecutiveness of candidate cells) allows us to visit all n_cc candidate cells in O(n_cc) time. Along the lines of Equation 1, we assume n_cc = O(n_c). Consequently, for monotonic joins the coarsening algorithm performs only O(n_c), rather than O(n_c²), weight computations per iteration. The complexity from Equation 3 then becomes:

O(n_sc log n_sc + (n_s + n_c log n_sc) · n_c log n_s)    (4)

Lemma 5.3. The running time of the coarsening algorithm is O((n_s + n_c² log n_s) · n_c log n_s). Given n_c = 2J, if J = O(n^(1/3)/log² n_s) xxii, the time complexity becomes O(n).

Proof. Equation 3 shows the total running time for building the coarsened matrix. By substituting n_sc = O(n_s) (Equation 1), log n_s from J = O(n^(1/3)/log² n_s), n_c = Θ(J) and n_s = Θ(√(nJ)) into Equation 3, it follows that the complexity is O(n).

A.3 Regionalization
Regionalization is an Rtile problem with arbitrary partitioning [27] and the Max-Weight-Id metric. Algorithms exist for this problem in the restricted, output-only case, e.g., [4]. However, they are not applicable to the general case, which entails support for: a) monotonic metrics (including the weight function) and b) segments that may or may not be covered by a region (0-cells). By design, these algorithms generate prolate regions and thus incur excessive input costs. Hence, they can be arbitrarily worse in weight than the optimum. The best algorithm that works for the general case is Binary Space Partitioning (BSP) [5, 27].

BSP. Binary Space Partitioning [5] is a dynamic programming algorithm that creates an optimum hierarchical partitioning, within a factor of 2 of an optimum arbitrary partitioning. We use the Max-Weight-Id metric. Next, we prove the lemmas from 5.

Lemma 5.4. A rectangle is defined by its upper-left and lower-right corners. For monotonic joins, each defining corner of a minimal candidate rectangle is a candidate cell, yielding O(n_c²) minimal candidate rectangles in total.

Proof.
We prove the lemma by contradiction, assuming that a defining corner of a minimal candidate region is not a candidate cell. Consider the position of the upper-left corner. If it lies before the first candidate cell in its row of M_C, the left boundary of the rectangle is empty (see rectangles r1 and r_min1 in Figure 2c). Thus, the rectangle is not a minimal candidate. If the upper-left corner lies after the last candidate cell in the row, the upper boundary of the rectangle is empty (see rectangles r2 and r_min2 in Figure 2c). Again, the rectangle is not a minimal candidate. Consequently, the upper-left corner must be a candidate cell. The proof for the lower-right corner is symmetric. Thus, both defining corners of a minimal candidate rectangle are candidate cells. Consequently, there are n_cc² minimal candidate rectangles, where n_cc is the number of candidate cells in M_C. Along the lines of Equation 1, we assume n_cc = O(n_c). Thus, there are O(n_c²) minimal candidate rectangles in total.

Lemma 5.5. The regionalization stage based on MonotonicBSP runs in O(n_c³ log n_c log n) time. Given n_c = 2J, if J = O(n^(1/3)/log² n), the stage takes O(n) time.

xxii See footnote v.

Proof. Generating r_N minimal candidate rectangles and sorting them takes O(r_N log r_N) time. Then, for each rectangle (there are r_N of them), we: a) compute its weight, which takes O(1) time after an O(n_c²) prefix-sum precomputation (line 6), and b) for each splitter line within the rectangle, O(n_c) of them, find for both subrectangles the corresponding minimal candidate rectangle, which takes O(log n_c) time using binary search (lines -2). Step b yields O(n_c log n_c) time per rectangle. Overall, this requires a total time of O(n_c² + r_N(log r_N + n_c log n_c)). From Lemma 5.4, we know that r_N = O(n_c²). Thus, MonotonicBSP runs in O(n_c³ log n_c) time. Due to the transformation from DRtile to Rtile, the regionalization stage takes O(n_c³ log n_c log n) time.
Given n_c = Θ(J), log J ≤ log n and J = O(n^(1/3)/log² n), it follows that the stage runs in O(n) time.

A.4 Putting it all together
Theorem 5.1 [Extended version]. The histogram algorithm runs in O(n) local time with O(n/J) sample tuples collected.

Proof. Lemmas 5.2, 5.3 and 5.5 directly imply Theorem 5.1.

We next discuss why this cost is affordable. As a parallel join takes Ω((n + m)/J) communication time (m is the join output size), the histogram algorithm can afford O(n/J) time for collecting sample tuples and O(n) local processing time (which is much cheaper than the communication time).

A.5 Generalization and Discussion
We next relax some assumptions and outline how we address them while preserving all the guarantees.

A small number of output tuples. We relax the assumption m ≥ n from Lemma 5.1. If m < n, the frequency of a sample matrix M_S cell can surpass m/(2J), breaking the bounds of the Lemma. We distinguish two cases. (i) If m = Θ(n) = cn, where c < 1 is a constant, we increase n_s to preserve the bounds. More precisely, it must hold that (n/n_s)² ≤ m/(2J), that is, (n/n_s)² ≤ cn/(2J). It follows that n_s ≥ √(2nJ/c). Thus, n_s grows only by a constant factor of 1/√c. (ii) If c ≪ 1 (m ≪ n), to avoid a huge increase in n_s, and thus in the histogram algorithm complexity, we adjust M_S such that each cell weight is below the required threshold (w_OPT/2, where w_OPT is the maximum region weight of the M_H optimum partitioning). Namely, we divide only the row and/or column of the overweight cell(s). Then, we reassign the affected output sample tuples to the new M_S cells.

Reducing the sample matrix size n_s is an important optimization, as it decreases the running time of the histogram algorithm. Lemma 5.1 decides on n_s using the conservative assumption that ρ_oi ≥ 1 in m = ρ_oi·n, that is, m ≥ n. (We covered the case m < n earlier in this section.) Using the actual value of ρ_oi decreases n_s from √(2nJ) to √(2nJ/ρ_oi), without losing any guarantees. We know m, and thus ρ_oi, from sampling the output tuples (see 4.).
This optimization requires rebuilding the sample matrix once m is known (the input and output samples are collected as before). Reducing n_s is very useful when ρ_oi is sufficiently bigger than 1 and when the input relations are very large. We use it for B_CB.

Parameters. The output sample size is s_o = O(n_sc). In our experiments we set s_o = 2·n_sc. We compute n_sc by counting the candidate M_S cells right after taking a sample of input tuples.
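The hierarchical partitioning behind the regionalization stage of A.3 can be illustrated with a small memoized search over guillotine cuts that minimizes the maximum region weight with at most J regions. This is only a didactic sketch on a tiny dense matrix: the actual MonotonicBSP restricts attention to minimal candidate rectangles and works with the full Max-Weight-Id metric, and all names here are made up:

```python
from functools import lru_cache

def bsp_min_max_weight(weights, J):
    """Hierarchical (guillotine-cut) partitioning of a weight matrix into
    at most J regions, minimizing the maximum region weight, via
    memoization over (sub-rectangle, number of regions)."""
    R, C = len(weights), len(weights[0])
    # Prefix sums give O(1) rectangle weights.
    P = [[0] * (C + 1) for _ in range(R + 1)]
    for i in range(R):
        for j in range(C):
            P[i+1][j+1] = weights[i][j] + P[i][j+1] + P[i+1][j] - P[i][j]

    def w(r1, c1, r2, c2):  # rows r1..r2-1, cols c1..c2-1
        return P[r2][c2] - P[r1][c2] - P[r2][c1] + P[r1][c1]

    @lru_cache(maxsize=None)
    def solve(r1, c1, r2, c2, k):
        if k == 1:
            return w(r1, c1, r2, c2)
        best = w(r1, c1, r2, c2)  # using fewer regions is always allowed
        for r in range(r1 + 1, r2):          # horizontal cuts
            for k1 in range(1, k):
                best = min(best, max(solve(r1, c1, r, c2, k1),
                                     solve(r, c1, r2, c2, k - k1)))
        for c in range(c1 + 1, c2):          # vertical cuts
            for k1 in range(1, k):
                best = min(best, max(solve(r1, c1, r2, c, k1),
                                     solve(r1, c, r2, c2, k - k1)))
        return best

    return solve(0, 0, R, C, J)
```

On the skewed 2x2 matrix [[4,1],[1,4]], one cut balances the halves (max weight 5 for J = 2), and two further cuts isolate the heavy cells (max weight 4 for J = 4), which is the min-max behavior the regionalization stage targets.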

B. JOINS
The joins are defined as follows:

B_ICD:
SELECT * FROM ORDERS O1, ORDERS O2
WHERE ABS(O1.orderkey - 0 * O2.custkey) <= 2

B_CB:
SELECT * FROM R1, R2
WHERE ABS(R1.key - R2.key) <= β

E_OCD:
SELECT * FROM ORDERS O1, ORDERS O2
WHERE O1.custkey = O2.custkey
  AND O1.order-priority = "4-NOT SPECIFIED"
  AND O2.order-priority = "1-URGENT"
  AND O1.totalprice BETWEEN γ AND
  AND O2.totalprice BETWEEN γ AND

For B_CB, we experiment with different widths of the band β: 1, 2, 3, 4, 8 and 16. Scaling out E_OCD using the same γ leads to a highly disturbed output/input ratio ρ_oi. As the relative performance of the different operators highly depends on ρ_oi, we set γ such that ρ_oi remains on the same order of magnitude. Namely, we set γ to , and for the scale factors of 80, 160 and 320, respectively.
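For a band join such as B_CB, sorting both inputs by join key makes the candidate cells of the sample matrix form a contiguous band around the diagonal; this is the monotonicity property that MonotonicCoarsening and MonotonicBSP exploit. A small illustrative sketch (the bucket boundaries and β values are made up):

```python
def candidate_cells(bounds_r, bounds_s, beta):
    """Mark sample-matrix cells that can produce output for the band join
    |r.key - s.key| <= beta; bounds_* are sorted (lo, hi) key ranges,
    one per equi-depth bucket. For sorted inputs, the candidates form
    a contiguous diagonal band."""
    cand = []
    for i, (rlo, rhi) in enumerate(bounds_r):
        for j, (slo, shi) in enumerate(bounds_s):
            # The two key intervals can join iff they come within beta.
            if rlo - beta <= shi and slo - beta <= rhi:
                cand.append((i, j))
    return cand
```

With β = 0 this degenerates to the equi-join case (candidates only on the diagonal); growing β widens the band, which is how the band width drives n_sc and hence the output sample size s_o = Θ(n_sc).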


Report: Declarative Machine Learning on MapReduce (SystemML) Report: Declarative Machine Learning on MapReduce (SystemML) Jessica Falk ETH-ID 11-947-512 May 28, 2014 1 Introduction SystemML is a system used to execute machine learning (ML) algorithms in HaDoop,

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

Load Balance Strategies for DEVS Approximated Parallel and Distributed Discrete-Event Simulations

Load Balance Strategies for DEVS Approximated Parallel and Distributed Discrete-Event Simulations Load Balance Strategies for DEVS Approximated Parallel and Distributed Discrete-Event Simulations Alonso Inostrosa-Psijas, Roberto Solar, Verónica Gil-Costa and Mauricio Marín Universidad de Santiago,

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014 Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)

More information

A Comparison of General Approaches to Multiprocessor Scheduling

A Comparison of General Approaches to Multiprocessor Scheduling A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA [email protected] Michael A. Palis Department of Computer Science Rutgers University

More information

Differential privacy in health care analytics and medical research An interactive tutorial

Differential privacy in health care analytics and medical research An interactive tutorial Differential privacy in health care analytics and medical research An interactive tutorial Speaker: Moritz Hardt Theory Group, IBM Almaden February 21, 2012 Overview 1. Releasing medical data: What could

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

A Comparative Analysis of Join Algorithms Using the Hadoop Map/Reduce Framework

A Comparative Analysis of Join Algorithms Using the Hadoop Map/Reduce Framework A Comparative Analysis of Join Algorithms Using the Hadoop Map/Reduce Framework Konstantina Palla E H U N I V E R S I T Y T O H F G R E D I N B U Master of Science School of Informatics University of Edinburgh

More information

Using In-Memory Computing to Simplify Big Data Analytics

Using In-Memory Computing to Simplify Big Data Analytics SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

More information

Policy Distribution Methods for Function Parallel Firewalls

Policy Distribution Methods for Function Parallel Firewalls Policy Distribution Methods for Function Parallel Firewalls Michael R. Horvath GreatWall Systems Winston-Salem, NC 27101, USA Errin W. Fulp Department of Computer Science Wake Forest University Winston-Salem,

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

Load-Balancing the Distance Computations in Record Linkage

Load-Balancing the Distance Computations in Record Linkage Load-Balancing the Distance Computations in Record Linkage Dimitrios Karapiperis Vassilios S. Verykios Hellenic Open University School of Science and Technology Patras, Greece {dkarapiperis, verykios}@eap.gr

More information

PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo.

PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo. PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo. VLDB 2009 CS 422 Decision Trees: Main Components Find Best Split Choose split

More information

Clustering & Visualization

Clustering & Visualization Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Michael Bauer, Srinivasan Ravichandran University of Wisconsin-Madison Department of Computer Sciences {bauer, srini}@cs.wisc.edu

More information

Performance Tuning for the Teradata Database

Performance Tuning for the Teradata Database Performance Tuning for the Teradata Database Matthew W Froemsdorf Teradata Partner Engineering and Technical Consulting - i - Document Changes Rev. Date Section Comment 1.0 2010-10-26 All Initial document

More information

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Binary Image Reconstruction

Binary Image Reconstruction A network flow algorithm for reconstructing binary images from discrete X-rays Kees Joost Batenburg Leiden University and CWI, The Netherlands [email protected] Abstract We present a new algorithm

More information

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m. Universal hashing No matter how we choose our hash function, it is always possible to devise a set of keys that will hash to the same slot, making the hash scheme perform poorly. To circumvent this, we

More information

Research Statement Immanuel Trummer www.itrummer.org

Research Statement Immanuel Trummer www.itrummer.org Research Statement Immanuel Trummer www.itrummer.org We are collecting data at unprecedented rates. This data contains valuable insights, but we need complex analytics to extract them. My research focuses

More information

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation: CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

Graph Database Proof of Concept Report

Graph Database Proof of Concept Report Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment

More information

Java Modules for Time Series Analysis

Java Modules for Time Series Analysis Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series

More information

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows TECHNISCHE UNIVERSITEIT EINDHOVEN Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows Lloyd A. Fasting May 2014 Supervisors: dr. M. Firat dr.ir. M.A.A. Boon J. van Twist MSc. Contents

More information

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big

More information

How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory)

How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory) WHITE PAPER Oracle NoSQL Database and SanDisk Offer Cost-Effective Extreme Performance for Big Data 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Abstract... 3 What Is Big Data?...

More information

On the Interaction and Competition among Internet Service Providers

On the Interaction and Competition among Internet Service Providers On the Interaction and Competition among Internet Service Providers Sam C.M. Lee John C.S. Lui + Abstract The current Internet architecture comprises of different privately owned Internet service providers

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

2009 Oracle Corporation 1

2009 Oracle Corporation 1 The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material,

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

Big Graph Processing: Some Background

Big Graph Processing: Some Background Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

MATHEMATICAL METHODS OF STATISTICS

MATHEMATICAL METHODS OF STATISTICS MATHEMATICAL METHODS OF STATISTICS By HARALD CRAMER TROFESSOK IN THE UNIVERSITY OF STOCKHOLM Princeton PRINCETON UNIVERSITY PRESS 1946 TABLE OF CONTENTS. First Part. MATHEMATICAL INTRODUCTION. CHAPTERS

More information

Data Management in the Cloud

Data Management in the Cloud Data Management in the Cloud Ryan Stern [email protected] : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server

More information

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011 Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

More information

Recommendations for Performance Benchmarking

Recommendations for Performance Benchmarking Recommendations for Performance Benchmarking Shikhar Puri Abstract Performance benchmarking of applications is increasingly becoming essential before deployment. This paper covers recommendations and best

More information

Applied Algorithm Design Lecture 5

Applied Algorithm Design Lecture 5 Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design

More information

Unprecedented Performance and Scalability Demonstrated For Meter Data Management:

Unprecedented Performance and Scalability Demonstrated For Meter Data Management: Unprecedented Performance and Scalability Demonstrated For Meter Data Management: Ten Million Meters Scalable to One Hundred Million Meters For Five Billion Daily Meter Readings Performance testing results

More information

PartJoin: An Efficient Storage and Query Execution for Data Warehouses

PartJoin: An Efficient Storage and Query Execution for Data Warehouses PartJoin: An Efficient Storage and Query Execution for Data Warehouses Ladjel Bellatreche 1, Michel Schneider 2, Mukesh Mohania 3, and Bharat Bhargava 4 1 IMERIR, Perpignan, FRANCE [email protected] 2

More information

Overview of Storage and Indexing

Overview of Storage and Indexing Overview of Storage and Indexing Chapter 8 How index-learning turns no student pale Yet holds the eel of science by the tail. -- Alexander Pope (1688-1744) Database Management Systems 3ed, R. Ramakrishnan

More information

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct

More information

Datenbanksysteme II: Implementation of Database Systems Implementing Joins

Datenbanksysteme II: Implementation of Database Systems Implementing Joins Datenbanksysteme II: Implementation of Database Systems Implementing Joins Material von Prof. Johann Christoph Freytag Prof. Kai-Uwe Sattler Prof. Alfons Kemper, Dr. Eickler Prof. Hector Garcia-Molina

More information

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult

More information

Classification On The Clouds Using MapReduce

Classification On The Clouds Using MapReduce Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal [email protected] Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal [email protected]

More information

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 349 ISSN 2229-5518

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 349 ISSN 2229-5518 International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 349 Load Balancing Heterogeneous Request in DHT-based P2P Systems Mrs. Yogita A. Dalvi Dr. R. Shankar Mr. Atesh

More information

MAPREDUCE [1] is proposed by Google in 2004 and

MAPREDUCE [1] is proposed by Google in 2004 and IEEE TRANSACTIONS ON COMPUTERS 1 Improving MapReduce Performance Using Smart Speculative Execution Strategy Qi Chen, Cheng Liu, and Zhen Xiao, Senior Member, IEEE Abstract MapReduce is a widely used parallel

More information