
Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework

Fariha Atta
MSc Informatics
School of Informatics
University of Edinburgh
2010


Abstract

The Map/Reduce framework -- a parallel processing paradigm -- is widely used for large-scale distributed data processing. Map/Reduce can perform typical relational database operations such as selection, aggregation, and projection. However, binary relational operators like join, cartesian product, and set operations are difficult to implement with Map/Reduce. Map/Reduce processes homogeneous data streams easily but does not provide direct support for handling multiple heterogeneous input data streams. Thus the binary relational join operator does not have an efficient implementation in the Map/Reduce framework. Some implementations of the join operator exist for the Hadoop distribution of the Map/Reduce framework, but these implementations do not perform well in the case of heavily skewed data. Skew in the input data affects the performance of the join operator in a parallel environment where data is distributed among parallel sites for independent joins. Data skew can severely limit the effectiveness of parallel architectures when some processing units (PUs) are overloaded during data distribution and hence take longer to complete than other PUs; it also wastes the resources of the idle PUs. As data skew occurs naturally in many applications, handling it is an important issue for improving the performance of the join operation. We implement a hash join algorithm which is a hybrid of the map-side and the reduce-side joins of Hadoop with the ability to handle skew, and we compare its performance to the other join algorithms of Hadoop.

Acknowledgements

My heartfelt gratitude goes to my supervisor Stratis Viglas, who provided guidance and support throughout the project. I especially appreciate his willingness to help at any time. I would like to acknowledge Chris Cooke for administering the Hadoop cluster and answering my queries regarding the cluster. Finally, I would like to thank my family and friends for their continuous moral support and encouragement.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Fariha Atta)

Table of Contents

Chapter 1. Introduction
  1.1 Motivation
  1.2 Related Work
  1.3 Problem Statement and Aims
  1.4 Thesis Outline

Chapter 2. Background
  2.1 The Map/Reduce Framework
  2.2 The Hadoop Distribution
  2.3 Impact of parallelization on the join operation
  2.4 Hash Join algorithms
    2.4.1 Join algorithms on a single processor machine
    2.4.2 Hash join algorithm for parallel implementation
  2.5 Skew and its impact on the join operation
    2.5.1 Hash Partitioning and its skew sensitivity
    2.5.2 Range Partitioning and its skew sensitivity
  2.6 Pre-processing for joining
    2.6.1 Semi-join using selection
    2.6.2 Semi-join using Hash Operator

Chapter 3. Join Algorithms on Hadoop
  Reduce-side joins
  Map-side partitioned joins
  Memory-backed joins
  The hybrid Hadoop join

Chapter 4. Evaluation
  4.1 The testbed
  4.2 Data Sets
  Tests
    Test 1: To determine time for sampling and finding the splitters
    Test 2: To determine the size of the bloom filter for pre-processing
    Test 3: To determine the effectiveness of pre-processing
    Test 4: To determine the effect of increasing partitions on the join of skewed relations
    Test 5: To determine the number of partitions for the virtual processor range partitioning
    Test 6: Comparison of algorithms in the case of skewed and non-skewed joins
  Discussion on the results of the comparison tests

Chapter 5. Conclusion and Future Works

Bibliography

List of Figures

Figure 1: The Map/Reduce dataflow
Figure 2: The HDFS Architecture
Figure 3: Hadoop Map/Reduce dataflow (source [43])
Figure 4: Partitioning R and S on join column R.r=S.s in three partitions using hash(k)=k%3
Figure 5: Grace Join Algorithm (source [44])
Figure 6: Joining on parallel machines
Figure 7: Example of data skew -- Patent and Cite Tables
Figure 8: A bloom filter with m=15, k=3 for a set {A, B, C}
Figure 9: Bit setting in bloom filter in case of collision
Figure 10: Construction of a Bloom Filter
Figure 11: Data flow of the reduce-side join
Figure 12: Key-Value pairs before and after group partitioning
Figure 13: Data flow of the map-side partitioned join
Figure 14: Data flow of the memory-backed join
Figure 15: Data flow of the hybrid Hadoop join
Figure 16: Custom range partitioning for Hadoop
Figure 17: Time trend for taking samples and selecting splitters
Figure 18: Execution time of join algorithms with and without pre-processing in low selectivity situations
Figure 19: Effect of pre-processing in high selectivity situations
Figure 20: Execution time of the HHH algorithm for different numbers of partitions (i.e. the reduce tasks)
Figure 21: The execution times of the reducers handling the skewed keys for different number of partitions
Figure 22: Stages of Execution for Join Algorithms
Figure 23: Join Results for Input1xInput1, output records produced: 4,001,
Figure 24: Join Results for Input10KxInput1, output records produced: 2,000,
Figure 25: Join Results for Input100KxInput1, output records produced: 2,002,
Figure 26: Join Results for Input300KxInput1, output records produced: 1,799,
Figure 27: Join Results for Input400KxInput1, output records produced: 2,000,
Figure 28: Join Results for Input500KxInput1, output records produced: 1,999,
Figure 29: Join Results for Input600KxInput1, output records produced: 1,798,
Figure 30: Join Results for Input300KxInput10, output records produced: 4,503,
Figure 31: Join Results for Input500KxInput10, output records produced: 6,501,
Figure 32: Join Results for Input600KxInput10, output records produced: 7,199,
Figure 33: Join Results for Input100KxInput100, output records produced: 11,899,
Figure 34: Join Results for Input300KxInput100, output records produced: 31,499,
Figure 35: Time taken by the memory-backed join for different number of tuples in the build relation (*DNF=Does Not Finish)

List of Tables

Table 1: Books Relation
Table 2: Authors Relation
Table 3: Filtered Book Relation
Table 4: Conventions for representing the algorithms
Table 5: Effects of increasing size of the bloom filter
Table 6: Time taken by reducers in case of skewed relation (6 partitions)
Table 7: Time taken by reducers in case of skewed relation (12 partitions)
Table 8: Time taken by reducers in case of skewed relation (18 partitions)
Table 9: Time taken by reducers in case of skewed relation (24 partitions)
Table 10: Time taken by reducers in case of skewed relation (30 partitions)
Table 11: Effect of the number of partitions on the execution time of skewed reducers for the virtual processor range partitioning

Chapter 1
Introduction

1.1 Motivation

Gone are the days when applications processed kilobytes or megabytes of data; this is the age of gigabyte-, terabyte-, or even petabyte-scale processing. Applications today deal with gigantic amounts of data that neither fit in the main memory of a single machine nor lie within the processing power of one machine. For example, the data volume of the National Climatic Data Centre (NCDC) [34] is 350 gigabytes; eBay maintains 17 trillion records with a total size of 6.5 terabytes [36]; Facebook manages more than 25 terabytes of data per day for logging [35]; and the Sloan Digital Sky Survey (SDSS) [33] maintains about 42 terabytes of image data and 18 terabytes of catalogue data. Processing this massive amount of data is not an easy feat. Data-intensive applications use a distributed infrastructure containing clusters of computers and employ distributed parallel algorithms to process huge volumes of data efficiently.

One of the fairly recent advancements in distributed processing is the development of the Map/Reduce paradigm [1]. Map/Reduce, a programming framework developed at Google, provides a cost-effective, scalable, flexible, and fault-tolerant platform for large-scale distributed data processing across a cluster of hundreds or even thousands of nodes. It allows huge volumes of data, up to multiple petabytes, to be processed in parallel [1, 2]. Computations are moved to the machines holding the data on which processing has to be carried out, rather than data being moved to the machines that can perform the computation (the traditional parallel processing approach). Map/Reduce makes use of a large number of shared-nothing, cheap commodity machines, which results in an inexpensive solution to large data processing problems. Replication of data on multiple nodes ensures availability and reliability on top of the unreliable underlying hardware. Map/Reduce takes care of data movement, load balancing, fault-tolerance, job scheduling, and the other nitty-gritty details of parallel processing. Users of the Map/Reduce framework only have to concentrate on the data processing algorithms, which have to be implemented in the map and reduce functions of the framework.

Map and Reduce are the two primitives provided by the framework for distributed data processing. The signatures of these primitives are:

Map: (k1, v1) -> [(k2, v2)]
Reduce: (k2, [v2]) -> [v3]

The map function converts input key-value pairs into intermediate key-value pairs, which are distributed among reduce functions for further aggregation. In simple terms, data is distributed among nodes for processing during the map phase and the result is aggregated in the reduce phase. The algorithms for processing the distributed data have to be supplied in these primitives of the framework, which simplifies the development of programs for parallel settings.

When it comes to the challenge of processing vast amounts of data, the distinction between structured and unstructured data does not matter much for Map/Reduce. The perceived rival of Map/Reduce, the parallel DBMS, is suitable for efficiently processing large volumes of structured data because of its ability to distribute relational tables among processing nodes, compression, indexing, query optimization, result caching, and so on. However, the parallel DBMS has some inherent limitations. Once it is deployed and the data is distributed among the nodes, adding more nodes to scale the parallel DBMS becomes very difficult. Moreover, the parallel DBMS is not fault-tolerant [5]: in case of a fault at one node during processing, the whole processing sequence has to be restarted at each node. For these reasons, the parallel DBMS cannot cope well with the demand of processing tremendously large amounts of data. In addition, processing unstructured data is outside the jurisdiction of the parallel DBMS. For such situations, Map/Reduce comes to the rescue: it can process bulks of unstructured data and can also handle the structured relations of conventional databases. All a user of the Map/Reduce framework needs to do is implement the processing logic in the map and reduce primitives; users can specify algorithms for selection, projection, aggregation, grouping, joining, or other similar processing tasks of relational databases in these primitives. Of all the available implementations of Map/Reduce, e.g. Greenplum [37], Aster Data [38], Qizmt [39], Disco [40], and Skynet [41], Apache's implementation (called Hadoop [31]) is the most commonly and widely used for educational and research purposes because of its open-source and platform-independent nature.

Hadoop makes it easy to develop robust, scalable, and efficient programs for distributed data processing.

Hadoop is based on a client-server architecture. Centralized management by the master of the Hadoop cluster makes tasks far simpler and more organized. The master distributes the workload among the worker/slave nodes, which report to the master on completion of the requested task; the master then decides the next course of action. The storage of the vast amounts of distributed data and its dissemination to the worker nodes in the cluster is managed by the Hadoop Distributed File System (HDFS) [32].

As mentioned earlier, different sorts of operations can conveniently be performed on structured as well as unstructured data using Map/Reduce. Among these operations, joining two heterogeneous datasets is the most important and challenging one. Many Map/Reduce applications require joining data from multiple sources. For example, a search engine maintains many databases, such as crawler, log, and webgraph databases. It constructs the index database using both the crawler and webgraph databases, so it requires joining these two datasets. However, joining large heterogeneous datasets is quite challenging. Firstly, processing two massive datasets in parallel to find matches on some attribute is intimidating even if a large computational cluster is available. Secondly, in a distributed setting, the datasets involved in the join are stored at distributed sites. Thirdly, although the database world is full of different techniques for the join operation, Map/Reduce itself is not built for processing multiple data streams in parallel: by its very nature, Map/Reduce processes a single stream of data at a time and hence does not have an efficient formulation for joining two parallel streams.

In the Hadoop implementation of the framework, some strategies have been documented for the join operation. These are the map-side, reduce-side, and memory-backed join techniques, but they have their own inherent limitations. When it comes to joining skewed datasets, the performance of these join techniques degrades. Skew in the distribution of the join attribute's values can limit the effectiveness of parallel execution, and the variation in the processing time of parallel join tasks affects the maximum speedup that can be achieved by virtue of parallel execution. In their popular paper "Map-Reduce: A major step backwards", DeWitt and Stonebraker criticize Map/Reduce on various aspects, one of which is its inability to handle skew. They state: "One factor that Map/Reduce advocates seem to have overlooked is the issue of skew... The problem occurs in the map phase when there is wide variance in the distribution of records with the same key. This variance, in turn, causes some reduce instances to take much longer to run than others, resulting in the execution time for the computation being the running time of the slowest reduce instance. The parallel database community has studied this problem extensively and has developed solutions that the Map/Reduce community might want to adopt."

Therefore, this project aims at developing a hash join algorithm for the Map/Reduce framework that can handle large skew in the data, and at analyzing its performance with respect to the implementations provided by Hadoop.

1.2 Related Work

The join operation is one of the fundamental, most difficult, and hence most researched query operations. It is an important operation that facilitates combining data from two sources on the basis of some common key. The database literature is full of discussions on techniques, performance, and optimization of this operation. Nested loops, sort/merge, and hash joins are the most commonly used join techniques. [3] and [20] provide a general discussion of join processing in relational databases for single-processor systems. They determine that the nested loops algorithm is useful only when the datasets to be joined are relatively small. When the datasets are large, hash-based algorithms are superior to sort-merge joins, provided that the final result need not be sorted before being presented to the user. The optimization of join techniques for multiprocessor environments has also been widely researched. [10], [12], and [15] discuss join algorithms for multiprocessor databases. The performance of the sort-merge, grace, simple, and hybrid join techniques on a shared-nothing multiprocessor setup, GAMMA, is presented in [4]; the authors show that the hybrid hash join algorithm dominates the other algorithms for all degrees of memory availability.

Different join techniques for Map/Reduce have also been researched. Hadoop implements map-side and reduce-side joins [26], [27], in which the join operation is carried out in the mappers and the reducers respectively. Set-similarity joins using Map/Reduce are discussed in [9]; they use the reduce-side join with custom partitioning to group together the most similar tuples. [11] discusses the optimization of the join operator for multi-way and star joins using the Map/Reduce framework.

For a 3-way join among R, S, and T, the tuples of R and T are replicated across a number of reducer nodes to avoid communicating the result of the first join; the join is performed on the reduce side. They prove that a multi-way join using a single Map/Reduce job is more efficient than cascading a number of Map/Reduce jobs, each performing a 2-way join.

A modification to the Map/Reduce framework for the join operation, called Map-Reduce-Merge, is presented in [9]. It introduces a new stage called merge, where matching tuples from multiple sources are joined. The modified primitives are:

Map: (k1, v1)_α -> [(k2, v2)]_α
Reduce: (k2, [v2])_α -> (k2, [v3])_α
Merge: ((k2, [v3])_α, (k3, [v4])_β) -> [(k4, v5)]_γ

where k is a key, v is a value, and α, β, γ are the lineages. A map function converts a key-value pair (k1, v1) from lineage α into an intermediate key-value pair. The reduce operation puts all intermediate values related to k2 into the list [v3]. Another map-reduce operation does the same with a key-value pair from lineage β, and the subsequent reduce produces a key-value pair (k3, [v4]). Depending on the values of the keys k2 and k3, the merge operator performs the join and combines the two reduced outputs into another lineage γ. A user-defined module, the partition selector, determines from which reducers the merge operator gets its data for joining. Joining is thus carried out after the reduce stage in an additional merge phase, where each merger receives the corresponding data of multiple datasets from the reduce phase. Sort-merge, block-nested, or hash joins can be performed in the merge phase. Hence, all the processing is completed in one Map/Reduce job. However, the Map-Reduce-Merge implementation requires changes to the basic Map/Reduce framework.

Many techniques have been developed for handling skew in the parallel join operation. [18] presents a partitioning strategy for skewed data which assumes that the skewed keys are known in advance, which is not the case in practical situations. Range partitioning has long been studied for assigning tuples to partitions on the basis of ranges rather than hash values of a join key. [23] sorts the input datasets according to the join keys. Depending on the processing capability of the system, the number of tuples T to be allocated to each partition is determined. The sorted datasets are then divided into n partitions, each containing T tuples, and the partitions are assigned to processing units (PUs) in a round-robin fashion.

The tuples of the second dataset are assigned to PUs on the basis of the partition ranges of the first dataset. This approach sorts the input dataset to determine appropriate ranges for partitioning, which is quite costly if the datasets are very large. [24], [28], [29] determine the ranges for partitioning after a full scan of the input data to find the skewed keys; the scanning cost may overshadow the benefits obtained from the range partitioning. [25] determines the ranges for partitioning by randomly sampling the input datasets. It also presents the virtual processor partitioning approach, which divides an input dataset into a greater number of partitions than PUs in order to scatter the skewed keys. To our knowledge, no work has yet been done on handling skew in the join operation for Hadoop. We attempt to add a skew handling capability to Hadoop using range partitioning.

1.3 Problem Statement and Aims

Although several join techniques are available in the database literature, implementing them for the Map/Reduce framework is, if not impossible, not so easy because of the very nature of the Map/Reduce framework. All the join techniques implemented for the Map/Reduce framework discussed in the Related Work section have their limitations. Map-Reduce-Merge, for example, introduces a new merge stage that can be an implementation overhead and hence is not an efficient solution. The reduce-side join implemented by Hadoop incurs space and time overheads as a result of tagging each tuple with source information. Although the performance of the map-side join is better than that of the reduce-side join, the prerequisite for the map-side join is that both input datasets must be pre-partitioned and properly structured before being input to the mappers. Moreover, these algorithms use hash partitioning for distributing the two datasets among the worker nodes. Hash partitioning is sensitive to skew in the input data when some values appear multiple times: the repeated values are all routed to one single processing node under the hash partitioning scheme. As a result, the worker nodes handling the repeated values are overloaded with too many records to be joined. Hence, the algorithms using hash partitioning for data distribution are prone to degraded performance when the input datasets to be joined contain skewed keys. Since all the Hadoop join algorithms use hash partitioning for data distribution, they suffer from a performance hit in the case of skewed data.

Problem Statement: We consider the equi-join of two data sources R and S, with cardinalities |R| and |S|, on the basis of a single join column such that R.r=S.s. To simplify the situation, we do not perform any selection or projection on the data sources, although these can easily be incorporated during either the map or the reduce phase while emitting key-value pairs at the nodes. To keep things simple, we perform an inner join on the datasets for the evaluation. We assume that both datasets are stored in HDFS. Our major tasks in this project are:

1. We provide a detailed discussion of the various join algorithms supplied by the Hadoop implementation. We analyze the pros and cons of each of these algorithms, i.e. the map-side, reduce-side, and memory-backed joins.
2. We discuss our hybrid algorithm, which is a combination of the map-side and the reduce-side join.
3. We discuss some pre-processing techniques, such as the semi-join through bit-filtering and the semi-join through selection, which can reduce the sizes of the datasets to be joined by removing those keys that do not take part in the join operation. We experimentally determine the performance of both techniques for filtering the input datasets.
4. We present different partitioning strategies for workload distribution among the nodes performing the join operation. The default hash-based partitioning of Hadoop overloads some nodes in the case of skewed keys. We discuss how to avoid this by using the range partitioning approach, and we incorporate range partitioning in our algorithm for handling skew.
5. We conduct experiments to compare the performance of the map-side, reduce-side, and memory-backed join algorithms with our hybrid algorithm (in both versions: hash partitioning and range partitioning) for handling skewed data.

1.4 Thesis Outline

Chapter 2 equips the reader with the background information needed to better understand the problem statement; an overview of Map/Reduce, its Hadoop implementation, hash join techniques, partitioning strategies, and filtering techniques is presented in this chapter. Chapter 3 discusses the Hadoop implementation of the map-side, reduce-side, and memory-backed joins and presents our hybrid algorithm with the range and hash partitionings. Chapter 4 presents the experiments conducted to compare the performance of these algorithms in the case of skewed and non-skewed input datasets and discusses the results. Chapter 5 concludes the findings and suggests some future extensions to the project.

Chapter 2
Background

2.1 The Map/Reduce Framework

The Map/Reduce framework consists of two operations, map and reduce, which are executed on a cluster of shared-nothing commodity nodes. In a map operation, the input data, available through a distributed file system (e.g. GFS [6] or HDFS [32]), is distributed among a number of nodes in the cluster in the form of key-value pairs. Each of these mapper nodes transforms a key-value pair into a list of intermediate key-value pairs (the output keys need not be the same as the input keys). Depending on the map operation, the list may contain 0, 1, or many key-value pairs (shown in Figure 1). The signature of the map operation is:

Map(k1, v1) -> list(k2, v2)

The intermediate key-value pairs are propagated to the reducer nodes such that each reduce process receives the values related to one key. The values are processed and the result is written to the file system. Depending on the reduce operation, the result may contain 0, 1, or many values. The signature of the reduce operation is:

Reduce(k2, list(v2)) -> list(v3)

An example of a Map/Reduce job is counting the number of times each word appears in some input data [1]. The input to this problem is a data file, the contents of which are distributed among the nodes of a cluster in the form of splits. Each node receives lines one by one from its input split in the form of key-value pairs; the key in this case is the byte offset of the line and the value is the line itself. A map operation is performed on each key-value pair, producing a list of intermediate key-value pairs, in this case words and their counts, e.g. ("Hadoop", 1), ("join", 1). The value 1 indicates that the word appeared once in a particular line. The key-value pairs from all the mappers are stored in buckets such that the values related to one key (word) are gathered in one bucket.

Figure 1: The Map/Reduce dataflow

A set of such buckets, called a partition, is then provided to each reducer. A reducer calls a reduce operation for every bucket; the reduce operation aggregates the values to determine the total number of times a word appears in the input file. Thus each call to a reduce operation produces one output for its key.

An optional combine phase can be employed by each mapper to minimize the traffic on the network. In this phase, each mapper locally aggregates its output for each key. For example, instead of transferring the key-value pair ("Hadoop", 1) thirty times, a mapper sends ("Hadoop", 30) only once. This reduces the traffic when a reducer picks up the value during the shuffling phase.
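To make the word-count dataflow concrete, here is a minimal sketch of the map and reduce functions using the Hadoop Java API (the job-driver boilerplate is omitted, and the class names are chosen for the illustration; this is not code from the thesis implementation):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: (byte offset, line) -> ("word", 1) for every word in the line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit an intermediate (k2, v2) pair
                }
            }
        }
    }

    // Reducer: ("word", [1, 1, ...]) -> ("word", total count).
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();                 // aggregate all values of one key
            }
            context.write(word, new IntWritable(sum));
        }
    }

The same reducer class can also be registered as the optional combiner (via job.setCombinerClass) to perform the local aggregation described above.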

2.2 The Hadoop Distribution

The Hadoop distribution is built around HDFS, a file system with a master-slave architecture. In a Hadoop cluster of n nodes, one node is the master node, called the NameNode (NN). The other nodes are worker nodes, called DataNodes (DN). The NN maintains metadata about the file system. Files are broken down into splits of 64 MB and distributed among the DNs. The larger split size, compared to the block size of conventional file systems, reduces the amount of metadata to be maintained for each file. Each split is replicated on three DNs to ensure fault-tolerance. Hadoop also ensures locality of data, i.e. processes are scheduled on the nodes which possess the data on which processing has to be performed; that is, computation is moved to the nodes containing the data rather than data being moved to the nodes capable of doing the computation. This reduces the amount of data transferred among nodes and hence improves performance.

The DNs constantly send heartbeat messages to the NN along with the status of the task they have been assigned. If the NN does not get any information from a node within a threshold time, it re-schedules the task on another node which contains a replica of the data. Similarly, if tasks are distributed among DataNodes and all of the nodes have finished processing but a straggler node is still working, the NN schedules the same task on an idle node. Whichever node returns the result earlier has its output used by the NN, and the duplicate processes on the other nodes are killed. The general architecture of HDFS is shown in Figure 2.

Figure 2: The HDFS Architecture

The jobtracker runs on the master node and tasktrackers run on the worker nodes to manage jobs and tasks respectively. When a Map/Reduce job is submitted to the master, the jobtracker divides it into m tasks and assigns a task to each mapper. The following is the sequence of steps for the conversion of input to output on Hadoop:

1. Mapping Phase: Each mapper works on the non-overlapping input splits assigned to it by the NN. An input split consists of a number of records, which can be in different formats depending on the InputFormat of the input file. A RecordReader for that particular InputFormat reads every record, determines the key and value for the record, and supplies the key-value pairs to the map function where the actual processing takes place (Figure 3). Each mapper applies a user-defined function to the key-value pairs and converts them to intermediate key-value pairs. The intermediate results of the mappers are written to the local file system in sorted order.

2. Partitioning Phase: A partitioner determines which reducer an intermediate key-value pair should be directed to. The default partitioner provided by Hadoop computes a hash value for the key and assigns the partition on the basis of the function (hash_value_of_key) mod (total_number_of_partitions). A sketch of such a partitioner is given after this list.

3. Shuffling Phase: Each map process, in its heartbeat message, sends information to the master about the location of its partitioned data. The master informs each reducer about the locations of the mappers from which it has to pick its partition. This process of moving data to the appropriate reducer nodes is called shuffling.

4. Sorting Phase: Each reducer, on receiving its partitions from all mappers, performs a merge sort to order the tuples on the basis of their keys. Since the keys within partitions were already sorted by each mapper, the partitions only have to be merged so that equal keys are grouped together.

5. Reduce Phase: A user-defined reduce operation is applied to each group of keys and the result is written to HDFS.
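As an illustration of step 2, the following minimal partitioner mirrors the behaviour of Hadoop's default hash partitioner (the sign bit of the hash value is masked so that the modulus is never negative); it is a sketch, not the partitioner used in the thesis:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Assigns an intermediate (key, value) pair to one of numPartitions reducers
    // using (hash_value_of_key) mod (total_number_of_partitions).
    public class HashKeyPartitioner<V> extends Partitioner<Text, V> {
        @Override
        public int getPartition(Text key, V value, int numPartitions) {
            // Mask the sign bit so the modulus result is never negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

A job selects such a partitioner with job.setPartitionerClass(...); the range partitioning discussed in later chapters plugs in at exactly this point.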

Figure 3: Hadoop Map/Reduce dataflow (source [43])

2.3 Impact of parallelization on the join operation

The massive growth in the input data to be processed hampers the performance of applications executing on uni-processor machines. If it curtails the performance of single-stream operators (selection, projection, aggregation, etc.), it doubles the trouble for the join operator, which handles two data streams in parallel. Matching the records of gigantic data streams is clearly overwhelming; in large data warehousing applications, this may mean joining trillions of records. Multi-processor or distributed processing is the solution to this problem and significantly improves the response time.

In a multi-processor or distributed setting, the performance of the join operation can be improved by using partition-wise joins [14]. The input data is partitioned among a number of machines such that the processing at the parallel machines can be carried out independently.

For parallel evaluation of the join operator, the two datasets are partitioned by applying the same hash function to both, so that each machine handles a subset of the keys which can be joined independently. The partitioning of the datasets is performed on the basis of the join key so that a machine gets all tuples with the same join key from both datasets. Partitioning the input datasets thus scatters them across a number of machines, where a partition-wise join is carried out (Figure 4). This partition-wise join is the key to achieving scalability for massive join operations as it reduces the response time. The degree of parallelism of the partition-wise join is limited by the number of partition-wise joins that can be executed concurrently: the greater the number of such concurrent partition-wise joins, the greater the degree of parallelism. If the number of parallel partition-wise joins is 8, the degree of parallelism is 8.

Figure 4: Partitioning R and S on join column R.r=S.s in three partitions using hash(k)=k%3

Sort-merge and hash joins are the natural choices for the join operation in distributed environments, since both of these join techniques can operate independently on subsets of the join keys. As each partition contains the same join keys from both datasets, employing sort-merge or hash join techniques exploits the parallel partitioning and hence provides scalability and divisibility. New partitions can be added for processing without affecting the on-going processing. In comparison, however, the performance of the hash join is better than that of the sort-merge join, since hash joins have linear cost as long as a minimum amount of memory is available [22]. For the rest of our discussion and for the implementation of our algorithm, we will consider only the hash join algorithm.
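As a small illustration of Figure 4, the following sketch distributes the join keys of two relations into three partitions with hash(k) = k % 3; the key values are made up for the example:

    import java.util.ArrayList;
    import java.util.List;

    public class HashPartitionDemo {
        // Number of partition-wise joins, i.e. the degree of parallelism.
        static final int NUM_PARTITIONS = 3;

        // Distribute join-key values into partitions using hash(k) = k % 3.
        static List<List<Integer>> partition(int[] joinKeys) {
            List<List<Integer>> partitions = new ArrayList<>();
            for (int i = 0; i < NUM_PARTITIONS; i++) partitions.add(new ArrayList<>());
            for (int k : joinKeys) {
                partitions.get(k % NUM_PARTITIONS).add(k);
            }
            return partitions;
        }

        public static void main(String[] args) {
            int[] rKeys = {1, 2, 3, 4, 5, 6, 7, 8, 9};   // join column R.r (example values)
            int[] sKeys = {2, 3, 3, 5, 6, 9};            // join column S.s (example values)
            // Because both relations use the same hash function, partition i of R
            // only ever needs to be joined with partition i of S.
            System.out.println("R partitions: " + partition(rKeys));
            System.out.println("S partitions: " + partition(sKeys));
        }
    }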

2.4 Hash Join algorithms

Several variations of the hash join algorithm exist, such as the classical hash join, the grace hash join, and the hybrid hash join, with small differences between them. In essence, all of them build a hash table on the keys of the inner relation; the keys of the outer relation are then iterated over and matched against the hash table entries, and the matching tuples are written to an output table. The different flavours of the hash join algorithm differ in the granularity of data used for the join operation. We discuss these join algorithms briefly for a single-processor machine and then elaborate only the grace join algorithm for the distributed setting.

2.4.1 Join algorithms on a single processor machine

Let us consider that relations R and S have to be joined on the join column R.r = S.s. We assume S to be the smaller, inner (build) relation and R to be the outer (probe) relation. We also assume an inner join between the relations.

1. Classical hash join

This simplest and most basic hash join consists of the following steps:

1) For each tuple t_s of the inner build relation S:
   a. Add it to an in-memory hash table on the basis of a hash function h applied to the key, i.e. h(key).
   b. If no more tuples can be added to the hash table because the memory of the system is exhausted:
      i. For each tuple t_r of the outer probe relation R, apply the same hash function h(key) to the key. Use the result as an index into the in-memory hash table built in step 1 to find a match.
      ii. If a match is found, a joined record is produced and the result is output.
      iii. Reset the hash table.
2) In a final scan of the relation R, the resulting joined tuples are written to the output.
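A minimal in-memory sketch of the build and probe phases is shown below. For brevity it assumes the build relation S fits in memory in one pass, so the overflow handling of step 1b is omitted, and tuples are reduced to a key plus an opaque payload:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ClassicalHashJoin {
        record Tuple(int key, String payload) {}        // simplified tuple: join key + rest of the row

        static List<String> join(List<Tuple> buildS, List<Tuple> probeR) {
            // Build phase: hash every tuple of the smaller relation S on its join key.
            Map<Integer, List<Tuple>> hashTable = new HashMap<>();
            for (Tuple s : buildS) {
                hashTable.computeIfAbsent(s.key(), k -> new ArrayList<>()).add(s);
            }
            // Probe phase: look up every tuple of R and emit a joined record per match.
            List<String> output = new ArrayList<>();
            for (Tuple r : probeR) {
                for (Tuple s : hashTable.getOrDefault(r.key(), List.of())) {
                    output.add(r.key() + ": " + r.payload() + " | " + s.payload());
                }
            }
            return output;
        }
    }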

2. Grace hash join

A refinement of the classical hash join is the grace hash join algorithm. The grace hash join partitions the relations first and carries out the above hash join technique for each partition rather than for the whole relation (Figure 5). This reduces the memory requirement for keeping the in-memory hash table.

1) R and S are divided into n partitions by applying a hash function h1 to the join column of each tuple t_r and t_s of relations R and S respectively. Both partitioned relations are written out to disk.
2) For the smaller relation S, a partition is loaded and an in-memory hash table is built using a hash function h2 on the join column of each tuple t_s in the build phase. Hash function h2 divides the partition into a number of buckets so that matching with the probing tuples becomes efficient.
3) During the probe phase, the corresponding partition from the relation R is read. The same hash function h2 is applied to the join attribute of each tuple t_r and matched against the entries in the in-memory hash table.
4) If a match is found, a joined record is produced and written to disk; otherwise the tuple t_r is discarded.

Figure 5: Grace Join Algorithm (source [44])
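A compact sketch of the grace join is given below. It reuses the build/probe routine of the previous sketch and, for readability, keeps the partitions in memory where a real implementation would spill them to disk files; only the structure of the algorithm is illustrated, and the partition count is an arbitrary choice:

    import java.util.ArrayList;
    import java.util.List;

    public class GraceHashJoin {
        static final int NUM_PARTITIONS = 8;            // n, chosen so each partition of S fits in memory

        // Phase 1: split a relation into n partitions with hash function h1(key) = key mod n.
        // (In the real algorithm each partition would be a file on disk.)
        static List<List<ClassicalHashJoin.Tuple>> partition(List<ClassicalHashJoin.Tuple> relation) {
            List<List<ClassicalHashJoin.Tuple>> parts = new ArrayList<>();
            for (int i = 0; i < NUM_PARTITIONS; i++) parts.add(new ArrayList<>());
            for (ClassicalHashJoin.Tuple t : relation) {
                parts.get(Math.floorMod(t.key(), NUM_PARTITIONS)).add(t);
            }
            return parts;
        }

        // Phases 2-4: join corresponding partitions with the classical build/probe,
        // which internally plays the role of the second hash function h2.
        static List<String> join(List<ClassicalHashJoin.Tuple> s, List<ClassicalHashJoin.Tuple> r) {
            List<List<ClassicalHashJoin.Tuple>> sParts = partition(s);
            List<List<ClassicalHashJoin.Tuple>> rParts = partition(r);
            List<String> output = new ArrayList<>();
            for (int i = 0; i < NUM_PARTITIONS; i++) {
                output.addAll(ClassicalHashJoin.join(sParts.get(i), rParts.get(i)));
            }
            return output;
        }
    }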

3. Hybrid hash join

A minor refinement of the grace join algorithm is to keep one partition in memory, instead of writing it out to disk, and use it for joining straight away. This is a hybrid of the classical and grace join algorithms. The hybrid hash join follows these steps:

1) When partitioning the relation S, all partitions except the first one are written to disk. An in-memory hash table is built for this first partition.
2) When partitioning the relation R, all partitions except the first one are written to disk. The tuples of this first partition are used to probe the in-memory hash table of step 1 for matches. If a match is found, the joined record is written to disk.
3) After exhaustion of the first partition, the procedure of the grace join algorithm is carried out for the rest of the partitions.

Since the first partitions of both relations are never written to disk and are processed on the fly, this avoids the cost of reading these partitions back from disk into memory.

2.4.2 Hash join algorithm for parallel implementation

Because of their nature, the grace and hybrid hash join algorithms can easily be parallelized. The difference between the single-processor and the multi-processor/parallel variants of these algorithms is that, in the parallel variant, the partitions are processed in parallel by multiple processors (Figure 6). Below are the steps for the grace hash join algorithm on a multi-processor system (a thread-based sketch of the parallel execution is given after Figure 6):

1) The input relation R is horizontally divided into n partitions such that each partition carries approximately |R|/n tuples. A hash function h1 is applied to the distribution key. Here we make the join key the distribution key so that tuples with the same join key are propagated to the same partition. The range of this hash function is 0 to n-1 so that keys can be directed to one of the n nodes. The n partitions of R formed as a result of the hash distribution are written to disk.
2) A similar process is carried out for relation S. It is divided into n partitions, each carrying about |S|/n tuples, by applying the same hash function h1. This ensures that partition x of the relation S contains the same join keys as partition x of the relation R. The partitions of S are also written to disk.
3) Each processor reads, in parallel, a partition of relation S from disk and creates an in-memory hash table for the partition using a hash function h2.

4) A corresponding partition of relation R is also read in parallel from disk by each processor. For each tuple in this partition, the processor probes the in-memory hash table for a match, and for each matching tuple a joined record is output to disk.

Since all n partitions of a relation are completely independent of each other, they can be processed in parallel. Each processor handles the corresponding partitions of both relations and writes the joined records for the matching tuples. Parallelizing the join operation thus improves the performance by a factor of up to the number of PUs.

Figure 6: Joining on parallel machines
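The parallel variant can be sketched by running the partition-wise joins of the previous examples concurrently. Here a thread pool stands in for the separate processors; this only illustrates the independence of the partition-wise joins and is not the distributed Hadoop implementation discussed later:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelGraceJoin {
        static List<String> join(List<ClassicalHashJoin.Tuple> s,
                                 List<ClassicalHashJoin.Tuple> r,
                                 int numProcessors) throws Exception {
            // Steps 1-2: hash-distribute both relations on the join key (h1 = key mod n).
            List<List<ClassicalHashJoin.Tuple>> sParts = GraceHashJoin.partition(s);
            List<List<ClassicalHashJoin.Tuple>> rParts = GraceHashJoin.partition(r);

            // Steps 3-4: each "processor" builds and probes its own pair of partitions independently.
            ExecutorService pool = Executors.newFixedThreadPool(numProcessors);
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int i = 0; i < sParts.size(); i++) {
                final int p = i;
                Callable<List<String>> task = () -> ClassicalHashJoin.join(sParts.get(p), rParts.get(p));
                futures.add(pool.submit(task));
            }
            List<String> output = new ArrayList<>();
            for (Future<List<String>> f : futures) output.addAll(f.get());   // gather joined records
            pool.shutdown();
            return output;
        }
    }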

2.5 Skew and its impact on the join operation

In databases, it is common for certain attribute values to occur more frequently than others [19], [21]. This is referred to as data skew. Skew in the input data can limit the effectiveness of parallelization of the join query [14]. As discussed earlier, parallelizing a join operation consists of the following steps:

1) Tuples are read from disk.
2) Selection and projection are carried out on the basis of the query.
3) Tuples are partitioned among the parallel sites.
4) The tuples of the partitions on each site are joined.

Skew can occur at any of these stages and is hence categorized as tuple placement skew, selectivity skew, redistribution skew, and join product skew for each of the above stages respectively [17]. The initial placement of tuples in partitions may vary, giving rise to tuple placement skew. Selectivity skew results from the fact that applying a selection predicate to different partitions may leave a varying number of selected tuples in each partition. Redistribution skew is caused by a varying number of tuples in the partitions after applying the redistribution scheme for partitioning. Join product skew is the result of differences in join selectivity at each node. For our implementation, we do not consider tuple placement skew, since Map/Reduce creates file splits of almost even sizes. Selectivity skew is also ignored because, firstly, it does not have any considerable impact on the performance and, secondly, we assume in our programs that selection and projection predicates are not applied. Join product skew cannot be avoided because it becomes evident only after the partitions of the two relations are joined. Redistribution skew is the most important type of skew that impacts the load distribution among nodes; it is caused by the selection of an inappropriate redistribution strategy for partitioning. In further discussions of skew, we will be referring only to the redistribution skew, and it is the only skew we handle in our implementation.

After the redistribution of tuples into partitions, the hash join algorithm is applied to the partitions of the two datasets at each node. Although the hash join algorithm is easily divisible and scalable, it is very sensitive to skew. Skew in the keys results in variance in the time taken by the processing nodes. If some key appears very frequently in the input relation, this overly used key is sent to only one processing node by the hash partitioning. This results in an uneven distribution of the keys, since the partitions receiving overly used keys will contain too many tuples. As a result, the nodes processing these partitions take too much time to complete and hence become a performance bottleneck. The performance of the whole distributed system is adversely affected by these heavy-hitter nodes while some other nodes remain underutilized. To take full benefit of the parallel distributed environment, it is therefore important that the redistribution strategy be selected in such a way that partitions are of comparable size and evenly distributed, to avoid load imbalance.

Current partitioning strategies are divided into two categories: hash partitioning and range partitioning [16]. The two partitioning strategies have different sensitivities to different degrees of skew in the input keys. In the following discussion, we examine their sensitivities and conclude which partitioning strategy is most effective for skew handling and should be incorporated in our algorithm.

2.5.1 Hash Partitioning and its skew sensitivity

As discussed earlier, during the redistribution phase a partitioning function is applied to the keys of the input datasets to distribute the workload among a number of nodes for parallel join computation. Hash partitioning is the most commonly used partitioning strategy. It distributes tuples to PUs on the basis of the hash value of the redistribution key:

PU_no = h(k) mod N_pu

Here, k belongs to the domain of the redistribution key attribute and N_pu is the number of PUs in the system. PU_no determines the PU to which a tuple with key k should be forwarded. However, it is exactly this hash partitioning that may end the redistribution phase with skewed partitions. With hash partitioning, the number of tuples that hash to a given range of values cannot be accurately determined in advance. Whenever relations are distributed among the parallel processing nodes on the basis of hashing, a key that is skewed will be directed to one and only one processing node. Selecting a good hash function is not a solution to the problem of data skew: even a perfect hash function will map every join attribute with the same value to one partition, and hence a partition that receives all of these overly used keys will be overloaded.

Let us consider an example to understand the situation that arises from skewed partitions generated by hash partitioning. Two relations, patent and cite, are to be joined (Figure 7). The patent relation lists each patent ID and its grant year. The fields in the cite table are the citing patent and the cited patent. Data in the patent relation is unique, i.e. there is one row for each patent. However, one patent may cite one or many other patents, so the cite relation may contain more than one entry for the same patent. Some patents are very popular and hence are cited by a large number of other patents; on the other hand, some patents are cited only once or not at all. This presents a possibility for data skew. Now, for each patent cited in the cite table, we may need to determine information about the cited patent. This can be done by joining on the patent ID attribute in the cite and patent relations. This presents a case of single skew, since only one of the relations contains the skewed data.

Figure 7: Example of data skew -- Patent and Cite Tables

For example, suppose a particular patent is cited by 5,000 other patents, so that the cite table contains 5,000 entries for this patent. When partitioning the cite relation by applying a hash function to the patent ID attribute, one single node would receive at least those 5,000 tuples, irrespective of how many tuples are directed to the other nodes. Selecting a good hash function has a negligible impact on the skew in the partitions. Although a perfect hash function may prevent two different join attribute values from being hashed into the same partition, the imbalance discussed above may still be present even with this ideal hash function, since hash partitioning directs equal keys to one partition. So applying a hash function such that only this patent's key lands in one partition would still overload that node with 5,000 tuples while other nodes may not have sufficient load. There is no smarter hash function that can avoid the imbalance caused by key repetition, since it is the very nature of a hash function to direct equal keys to the same partition. The resulting uneven distribution nullifies the gains achievable by a parallel infrastructure.
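The effect can be demonstrated in a few lines. The key distribution below is invented for the illustration, with one heavily repeated key standing in for the popular patent:

    import java.util.Arrays;

    public class HashSkewDemo {
        public static void main(String[] args) {
            int numPUs = 4;
            // Synthetic cited-patent IDs: key 42 is "popular" and appears far more often than the rest.
            int[] citedIds = new int[5_060];
            for (int i = 0; i < 5_000; i++) citedIds[i] = 42;               // 5,000 citations of one patent
            for (int i = 5_000; i < citedIds.length; i++) citedIds[i] = i;  // 60 other, unique patents

            // PU_no = h(k) mod N_pu, with h(k) = Integer.hashCode(k).
            int[] tuplesPerPU = new int[numPUs];
            for (int k : citedIds) {
                int pu = Math.floorMod(Integer.hashCode(k), numPUs);
                tuplesPerPU[pu]++;
            }
            // One PU ends up with at least the 5,000 repeated tuples, regardless of the hash function.
            System.out.println("Tuples per PU: " + Arrays.toString(tuplesPerPU));
        }
    }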

A more practical example of heavy data skew is the data received from the sensors in a sensor network. The sensors continuously send sensed values (which can be raw bytes, complex records, uncompressed images, etc.) to a monitoring station where the values are logged. Let us consider a log dataset L that records, for each sensor, the sensor ID, a time stamp, and the sensed humidity of the place being monitored. A sensor dataset S stores, for each sensor ID, the name of the place being monitored and the sensor manufacturer. A monitoring station may need to join the relations L and S. In practice, some sensors may sample at a very high frequency and send data for logging very often, while others may not. The join operation using hash partitioning would then clearly overload the partitions handling the high-frequency sensors, which will eventually throttle the performance of the distributed system.

2.5.2 Range Partitioning and its skew sensitivity

As we have seen, the hash functions used in the redistribution phase may result in imbalanced partitions in the case of heavy skew in the input data. A good redistribution strategy should distribute the overly used keys to more than one partition. However, the overly used keys must first be identified, and this information can then be used for deciding the partition boundaries. Two strategies for partitioning datasets on the basis of their key distribution are the simple range partitioning and the virtual processor partitioning.

1. Simple range-based partitioner

As opposed to the hash partitioner, which at best allocates a single attribute value to a partition, a range partitioner allocates a sub-range of the join attribute values to one partition. In the simple range partitioning, the number of partitions is equal to the number of PUs. Since one partition is handled by one PU, allocating only a sub-range of the values to one partition reduces the burden on any single PU in the case of heavy skew. A split vector determines the boundaries for the distribution of values among partitions. The entries of the split vector need not divide the value range into equally spaced blocks; this has, in fact, a positive impact, since the entries may be chosen in such a way as to equalize the number of tuples mapped to each partition. Given p PUs, the split vector contains p-1 entries {e1, e2, e3, ..., e(p-1)}. This split vector determines the ranges for the range partitioning. From it, each PU is assigned a lower bound and an upper bound (except the first PU, which does not have a lower bound, and the last PU, which does not have an upper bound). All the tuples whose join key attribute falls in a particular range are sent to the PU associated with that range, i.e. keys ≤ e1 are routed to processor 1, e1 < keys ≤ e2 are directed to processor 2, and keys > e(p-1) find their way to processor p.

How is this split vector selected? A good split vector can be obtained by sampling the input relations so that an estimate of the distribution of the join attribute values in the data can be obtained.

Sorting the input dataset R, with cardinality |R|, and then selecting split values at a step size of |R|/p is one solution. However, this is inefficient, since a dataset containing trillions of records takes too long to sort. Randomly sampling the input relation on the join attribute value is a better alternative, which gives an estimate of the join key distribution without parsing the whole input relation. x samples are taken randomly from the input relation (x need not be the same as p, the number of processors). Theoretically, the greater the number of random samples taken, the better the idea we get of the distribution of the join attribute values in the input data. [13] experimentally determines that O(√n) is the optimal number of samples for efficient random sampling. The resulting sample table T, containing the randomly selected join attribute values, is sorted. Since the size of this sample table is many times smaller than the input dataset, sorting T does not take significant time. A split vector is determined from this table by collecting values at a step size of |T|/p. Since the input dataset is randomly sampled, there is a fair chance that a skewed attribute value will occur more than once in the split vector. Let us consider a split vector containing {e1, e2, e2, e3} for a 5-node distributed system. Keys ≤ e1 are assigned to node 1, but keys in the range e1 < key ≤ e2 can be directed to either node 2 or node 3. When a key can be directed to more than one node, a node is selected at random from the candidate nodes and the tuple is sent to it. In this way, the skewed key value e2 is distributed among more than one node, and no single machine is penalized for the skew.

A question arises as to which of the input relations should be sampled: the building relation or the probing relation? The building relation is more sensitive to data skew than the probing relation, since an in-memory hash table has to be built for the building relation. Thus the building relation is sampled and then partitioned according to the resultant split vector. The same split vector is used for the probing relation as well. As discussed above, in the case of a repetition in the split vector, a tuple of the building relation is sent randomly to any one of the nodes that serve this value (the candidate nodes). For the probing relation, a value that falls in more than one range is sent to all the candidate nodes. For the split vector {e1, e2, e2, e3} in the example above, key e2 of the probing relation will be sent to both node 2 and node 3. The procedure can be reversed as well, i.e. an alternative is to send a tuple of the building relation whose join key belongs to more than one range to all of the candidate nodes and randomly send the corresponding tuple from the probing relation to only one of the candidate nodes. However, this is not an effective alternative, since the replication of the build keys on multiple nodes would increase the size of the in-memory hash table on all such nodes.

In effect, we want to keep the build relation as small as possible to reduce the memory requirement of the in-memory hash table. Thus, for our implementation, we duplicate the key of the probing relation whenever a key falls in multiple ranges.

2. Virtual processor partitioner

In the simple range partitioner, each PU handles only one partition and hence the number of partitions is the same as the number of PUs. An improvement on the range partitioning can be achieved by making the number of partitions greater than the number of PUs. Skew can be handled efficiently if we have a large number of partitions. We then assign these partitions to the processors either in a round-robin fashion, or we dynamically feed the processors with partitions as they finish their earlier workload. The key idea behind this virtual processor approach is that, if the data is skewed, having a large number of partitions will spread the skewed data over several partitions. As a result, work gets more evenly distributed and the system does not suffer from the inefficiencies caused by skew. The number of partitions in the virtual processor approach should be a multiple of the number of PUs; so in the case of a 5-PU system, the number of partitions should be 10, 15, 20, and so on. Load imbalance may occur if the number of partitions is not a multiple of the number of PUs. For example, in the case of 11 partitions, each PU handles two partitions except for one, which handles three. While this PU is processing its last partition, all the other PUs remain idle.

For the example discussed earlier, if we take 2 as the factor for the virtual processor partitioning, our split vector defines 10 instead of 5 partitions, and becomes {e1, e2, e2, e2, e3, e4, e5, e6, e7}. A key falling in a particular range is allocated to its associated partition, e.g. keys ≤ e1 are assigned to partition 1; e1 < keys ≤ e2 are assigned to partition 2, 3, or 4; e2 < keys ≤ e3 are allocated to partition 5, and so on. As can be observed, join keys equal to e2 are spread over three partitions instead of two, which reduces the accumulation of skewed keys in a single partition. The evaluation section experimentally determines the number of virtual processors that provides optimal performance.

As is evident from the discussion, the hash join does not efficiently handle the situation where the input data is heavily skewed on a particular key. For such situations, the range partitioning performs better, since it spreads the skewed keys over more than one partition.
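To summarize Section 2.5.2, the sketch below derives a split vector by randomly sampling the build relation and then assigns build-side keys to partitions, sending a key that falls on a repeated splitter to a random candidate partition; the probing relation would instead be replicated to all candidate partitions. Sample size, data types, and method names are assumptions made for the illustration:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class RangePartitioner {
        final int[] splitVector;           // sorted boundaries: p-1 entries, or more for virtual processors
        final Random rnd = new Random();

        // Assumes a non-empty build relation and sampleSize >= numPartitions.
        RangePartitioner(List<Integer> buildKeys, int numPartitions, int sampleSize) {
            // Randomly sample the build relation and sort the sample table T.
            List<Integer> sample = new ArrayList<>();
            for (int i = 0; i < sampleSize; i++) {
                sample.add(buildKeys.get(rnd.nextInt(buildKeys.size())));
            }
            Collections.sort(sample);
            // Pick numPartitions-1 splitters at a step size of |T|/numPartitions.
            splitVector = new int[numPartitions - 1];
            int step = sample.size() / numPartitions;
            for (int i = 0; i < splitVector.length; i++) {
                splitVector[i] = sample.get((i + 1) * step);
            }
        }

        // Build-relation side: a key on a repeated splitter goes to ONE random candidate partition.
        int partitionForBuildKey(int key) {
            int lo = lowestCandidate(key);
            int hi = highestCandidate(key);
            return lo == hi ? lo : lo + rnd.nextInt(hi - lo + 1);
        }

        // First partition whose upper boundary covers the key (the last partition if none does).
        int lowestCandidate(int key) {
            int p = 0;
            while (p < splitVector.length && key > splitVector[p]) p++;
            return p;
        }

        // When the splitter value is repeated, every partition sharing that boundary is a candidate;
        // this is how a skewed key gets spread over several partitions.
        int highestCandidate(int key) {
            int lo = lowestCandidate(key);
            if (lo == splitVector.length) return lo;    // key beyond the last splitter: last partition only
            int p = lo;
            while (p + 1 < splitVector.length && splitVector[p + 1] == splitVector[lo]) p++;
            return p;
        }
    }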

For the implementation of our algorithm, we use the range partitioning strategy to handle skewed keys.

2.6 Pre-processing for joining

Sometimes a dataset may be very large while a big portion of it does not participate in the join operation. One example of this scenario is a log table L that stores the logs of activities on Facebook for one hour, while another table U contains information about Facebook users. We have to join these two tables to associate the log of a user's activities with some additional information about that user. Facebook currently has 400 million registered users [42], but not all of these users may be active in a given hour. Loading the complete user table U to join it with the log table L is definitely a waste of resources, since the complete U table will be shuffled across the network but eventually a big portion of it will not be used in the join. The performance of the join operation can be enhanced by filtering the U table down to only those users whose activities are logged in L, and then joining L with the reduced U table. This pre-processing significantly reduces the amount of data distributed over the interconnection network and also the sizes of the tables to be joined. The pre-processing can be achieved through semi-joins using selection or bit-filtering.

The semi-join is a relational operator used to reduce the processing cost of queries involving binary operators (e.g. the join). A semi-join from S to R on attribute X is denoted R ⋉ S. Its result is the subset of the tuples of R for which there are matching tuples in S on the attribute X. Mathematically, the semi-join from S to R is:

R ⋉ S = { r ∈ R | ∃ s ∈ S : r.X = s.X }

The general computational steps are:

1) Project S on attribute X and keep only the unique values of the attribute (S_X).
2) Reduce R by the unique keys S_X. This eliminates the unnecessary tuples of R that are not going to take part in an operation between R and S.

Let us consider an example of the semi-join for two relations, Books and Authors, shown in Table 1 and Table 2.

Table 1: Books Relation
Title                                           | ISBN | Author
Pro Hadoop                                      |      | Jason Venner
Hadoop in Action                                |      | Chuck Lam
Hadoop: The Definitive Guide                    |      | Tom White
Data Intensive Text Processing with Map/Reduce  |      | Jimmy Lin

Table 2: Authors Relation
Author        | Home page
Jason Venner  |
Tom White     |

The semi-join of the Authors relation to the Books relation lists only those books for which there is an author in the Authors relation. The result is shown in Table 3.

Table 3: Filtered Books Relation
Title                          | ISBN | Author
Pro Hadoop                     |      | Jason Venner
Hadoop: The Definitive Guide   |      | Tom White

Since both relations now contain only those tuples that take part in the actual join, the data to be processed by the join operation is reduced. This clearly does not make any significant difference for the trivial example above. However, for massive datasets containing billions of records from which only

hundreds or thousands are going to take part in the actual operation, the reduction in the size of the datasets significantly reduces the transmission, storage, and computation overhead. The result is an improved performance of the actual operation. However, this pre-processing incurs some cost. We determine in the evaluation section whether the improvement in the performance of the actual operation is worth the pre-processing cost. Semi-joins can be computed through two methods: using selection or using bit-filtering. We discuss both of them here with respect to their Hadoop implementation.

2.6.1 Semi-join using selection

If a relation R is semi-joined by a relation S, R is filtered down to only the tuples whose keys are present in S. The semi-join is completed in three stages, each representing one Map/Reduce job:

1) Stage 1: We determine the unique keys of relation S. Each mapper in a Map/Reduce job receives an input split of S. An in-memory hash table is built as tuples are read from the input split. If a tuple with a new key arrives, we place the key in the hash table and also emit the key. If the key is already present in the hash table, we do not emit anything. Since we do not need any value here, we emit null instead of the value. In the reduce phase, we emit (key, null) for each key group. We restrict the number of reducers to 1 so that we get a single file S_keys listing all the unique keys of S.

2) Stage 2: The second stage filters the relation R with the unique keys of relation S, stored in S_keys. The file S_keys is broadcast to all mappers through the distributed cache, an in-memory hash table is built from it, and each mapper emits a record of R only if its join key is present in the hash table. The reducer here is an identity reducer since no further operation is needed. The result, written to HDFS, is r part files for relation R, where r is the number of reducers. In this stage, we emit the records such that the emitted key is the join key, and we apply the same hash function for partitioning as will be used for partitioning the relation S in Stage 3. This saves one Map/Reduce job that would otherwise be needed to partition the relation R on the join attribute for the map-side, reduce-side, and our hybrid join.

3) Stage 3: Any of the hash join algorithms is used to compute the actual join between the filtered R relation produced in Stage 2 and the relation S.
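As an illustration of Stage 1, a mapper along the following lines could emit each distinct join key exactly once per input split; a single reducer would then write the global list of unique keys (S_keys) to HDFS. The comma-separated record layout with the join key first and the class name are assumptions made for this sketch, not the exact code of our implementation.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Stage 1 of the semi-join using selection: emit each join key of S once per split.
public class UniqueKeyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {

    // keys already seen by this map task
    private final Set<String> seen = new HashSet<String>();

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, NullWritable> out, Reporter reporter)
            throws IOException {
        // assumption: the join key is the first comma-separated field of the record
        String joinKey = line.toString().split(",", 2)[0];
        if (seen.add(joinKey)) {                 // true only the first time the key is seen
            out.collect(new Text(joinKey), NullWritable.get());
        }
    }
}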

The semi-join using selection has some inherent limitations. An in-memory hash table has to be built to record the unique keys. Each entry of the hash table is the key itself, which may occupy several bytes. This data structure can therefore occupy a large amount of space if the keys are big and the input dataset contains a huge number of records.

2.6.2 Semi-join using Hash Operator

The semi-join using a hash operator produces a search filter using a bit array. The bit array maintains a hashed projection of the unique keys of S, and the tuples of R are filtered against this search filter. One such filter is the Bloom filter.

Bloom Filter: The bloom filter, named after its inventor Burton Howard Bloom, is an efficient mechanism to test membership of an element in a set. It is a bit array in which n elements are mapped into m bits using k hash functions. A bloom filter may have false positives but no false negatives, i.e. an element can falsely be reported as a member of the set; however, if an element is a member of the set, the membership test always returns true. To construct a bloom filter, the k hash functions are applied to each member of the set. For example, for an element e, the k hash functions h1(e), h2(e), h3(e), ..., hk(e) produce k index values in the range 1 to m, where m is the size of the filter. These k indices in the bit array are set to 1. In order to test the containment of an element in the set, the same k hash functions are applied to the element to determine the indices in the bit array. If all of those bit array locations are set, either the element is contained in the set (true positive) or the bits were set for some other elements (false positive). However, if any of these bits is 0, the element is not contained in the set. The generation of indices for setting the bit values using hash functions is depicted in Figure 8. If an index returned by a hash function is already set (for some earlier element), the bit remains set. The changes in the bit filter as new elements are added over time are shown in Figure 9.

Figure 8: A bloom filter with m=15, k=3 for a set {A, B, C}

Figure 9: Bit setting in a bloom filter in case of collision

The size of a bloom filter is fixed at the beginning and remains constant, no matter how many elements are added to it. However, adding more elements increases the false positive rate, since each bit then represents a larger number of elements. The accuracy can be improved by increasing the size of the bit array: the bigger the bit array, the smaller the probability that two elements are represented by the same bit indices. However, increasing the size results in a space overhead. The number of hash functions k also affects the accuracy, since these hash functions map the elements of the set onto the bits. Increasing k (up to a point) reduces collisions and hence decreases the false positive rate. The k hash functions should be chosen to be as independent as possible; there should be little, if any, correlation, so that the indices to which an element is mapped are close to unique. When constructing a bloom filter, the trade-off between accuracy, the size of the filter, and the number of hash functions should be taken into account.
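The trade-off just described is usually quantified by the standard false-positive analysis for Bloom filters. In the notation above (n elements, m bits, k hash functions), the textbook result, quoted here only to make the trade-off explicit, is:

p \approx \left(1 - e^{-kn/m}\right)^{k}

k_{\mathrm{opt}} = \frac{m}{n}\,\ln 2, \qquad p_{\min} \approx (0.6185)^{\,m/n}

For instance, with m/n = 10 bits per inserted element the optimal choice is k ≈ 7, giving a false-positive rate of roughly 0.8%.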

One of the major advantages of the bloom filter is its space compactness. The space requirement of a bloom filter is the smallest compared to other data structures used for testing containment, such as linked lists, hash maps, sets, and arrays. Therefore, compared to the semi-join through selection (which uses Java's HashMap to store the unique keys), the bit array of a bloom filter occupies little space and hence adds to the efficiency. In our test, we make a bloom filter of 50,000,000 bits, which occupies about 5.96MB of disk space. For the semi-join through selection, we store the unique keys in a HashMap (the size of each key is 8 bytes). For 10,000,000 records, this results in about 76MB of space (if we assume that all keys are unique). Thus the space requirement of the semi-join through selection is about 12 times greater than that of the semi-join through the bloom filter. This difference increases quite drastically as the key size grows. For example, for a key size of 100 bytes, the semi-join through selection in the above case would need 953MB of storage, which is about 160 times more than that required for the bloom filter.

Bloom Filter using Map/Reduce: Since a bloom filter reduces the amount of data to be processed, we can construct a bloom filter for one of the relations (let us call it S) and then distribute this filter to all the map tasks that process the input splits of the second relation (R). In each map task, every record of the relation R is checked for containment in the bloom filter. If the record does not pass through the filter, it is discarded and hence never processed. Applying a bloom filter is carried out in two independent stages, each being a separate Map/Reduce job:

1) Stage 1 - Construction of the bloom filter: We want to construct a bloom filter that records the unique join keys of the relation S (where S is the smaller of the two relations). Each mapper implements a bloom filter of 50,000,000 bits using Java's BitSet (a minimal sketch of such a filter is given at the end of this section). For each record of its input split of S, the mapper applies k hash functions to the key to generate k indices, and these k indices are turned on in the bloom filter to add the key to the set. Depending on the size of the input dataset, the framework can initiate more than one mapper; in that case, each mapper generates one bit filter (Figure 10). To combine all these bit filters into one, we employ only one reducer so that the bit filters from all the mappers are processed by the same reducer. The reducer takes the union of all the bit filters and writes the result in binary format to HDFS.

2) Stage 2 - Filtering using the Bloom Filter: For the Map/Reduce job processing the relation R, each mapper needs the bloom filter constructed in Stage 1. The bloom filter file is distributed to the mappers using the Distributed Cache mechanism of the framework. Each mapper reads the binary file and reconstructs the bloom filter in the configure() method, i.e. before the start of the map task. In each map task, an input record of relation R is checked for containment in the bloom filter by applying the same hash functions to the key of the record. If the key passes through the filter, the record is processed further; otherwise it is discarded and never used.

Figure 10: Construction of a Bloom Filter

In the evaluation section, we determine which semi-join method provides better performance. That method is then used for the subsequent experiments.
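A minimal sketch of a BitSet-based filter of the kind described in Stage 1 is shown below. The double-hashing scheme used to derive the k indices and the string-keyed interface are assumptions of this sketch; the actual implementation may derive its indices differently.

import java.util.BitSet;

// Minimal Bloom filter over string join keys: m bits, k hash functions.
// The indices are derived by double hashing (h1 + i*h2), one common scheme.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits
    private final int k;   // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (key + "#salt").hashCode();          // second, roughly independent hash
        int combined = h1 + i * h2;
        return (combined & Integer.MAX_VALUE) % m;    // non-negative index in [0, m)
    }

    public void add(String key) {
        for (int i = 0; i < k; i++) {
            bits.set(index(key, i));
        }
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(key, i))) {
                return false;                         // definitely not in the set
            }
        }
        return true;                                  // in the set, or a false positive
    }

    // Merging per-mapper filters in the single reducer is a bitwise OR (union).
    public void union(SimpleBloomFilter other) {
        bits.or(other.bits);
    }
}

In the Map/Reduce setting described above, each mapper would call add() for the join key of every record in its split of S, and the single reducer would union() the per-mapper filters before writing the bit array to HDFS in binary form.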

Chapter 3
Join Algorithms on Hadoop

A number of different join algorithms are provided for Hadoop: map-side, reduce-side, and memory-backed joins. In this chapter, we discuss these algorithms, present our own algorithm, and provide some important implementation-specific details with respect to Hadoop.

3.1 Reduce-side joins

As the name suggests, a reduce-side join is carried out on the reducer nodes. A two-way join using the reduce-side join algorithm is completed in one Map/Reduce job. The key idea behind the reduce-side join is that each mapper tags the input tuples from both datasets with their source information and generates tagged key-value pairs such that the emitted key is the join key. All the tagged key-value pairs are shuffled across the network, and each reducer receives the tagged key-value pairs from both datasets that share the same join key attribute. Each reducer then carries out the join between the records of the two datasets and outputs the joined records.

Implementation details: Consider two relations R(A, B) and S(B, C), stored in separate files, that have to be joined on the key B. Keys in Map/Reduce are not essentially the same as in relational databases: Map/Reduce keys are not unique; they are just the attributes used to distribute data among reduce processes. To keep things simple, we consider an inner join between R and S. The sequence of steps to join R and S using Hadoop is as follows (depicted in Figure 11):

1) Splitting: The input files of both datasets are split by Hadoop into manageable splits that are assigned to a collection of map processes.

2) Mapping: Each map process receives key-value pairs from the input datasets and generates intermediate key-value pairs such that, in each pair,

the emitted key is the join key. Before emitting, the mapper also tags the intermediate key-value pairs with information about their source relation. Thus, the intermediate key-value pair emitted by a mapper consists of the join key tagged with the data source and a value tagged with the data source. In the case of relation R, a map process turns each tuple (a, b) from the input split of R into a key-value pair with composite key (b, 0) and value (a, 0). Similarly, for relation S, each map process turns a tuple (b, c) from S into a key-value pair with key (b, 1) and value (c, 1). The numbering 0, 1, 2, etc. indicates the source of the data. Instead of tagging each key-value pair with the full name of its dataset, we use this numbering scheme to save some bytes, since the tag has to be included with every key-value pair. The numbering scheme is also used for sorting the pairs during the reduce phase, which will be discussed later on. Tagging is important because in the reduce phase we want tuples from relations R and S to be joined together, rather than joining R tuples with R tuples or S tuples with S tuples. Having an identifier for each relation helps distinguish the tuples according to their data sources so that tuples from different relations are joined.

Figure 11: Data flow of the reduce-side join

3) Partitioning: After producing (composite_key, value) pairs, partitioning has to be done so that pairs with the same join key reach the same reducer. If we partition the key-value pairs on the basis of the composite key, values belonging to keys (b, 0) and (b, 1) will be routed to different reducers. We want the partitioning to be based on just the join key, i.e. b in this case. Since the output of the map phase has a composite key, i.e. (join key, data source), we have to implement our own partitioner that extracts the join key and assigns an appropriate partition number to the tuple depending on the hash value of this join key. In order to override the default partitioner, we write a custom partitioner class based on the Partitioner class provided by Hadoop. In the getPartition() function of this custom partitioner, we extract the join key from the composite key and, on the basis of the hash code of this join key, assign the pair a particular partition number. The number of partitions depends on the number of reducers available. If there are k reduce tasks, a hash function h is applied to each key emitted by a mapper such that the key-value pair maps to one of the k partitions on the basis of h(b). (A sketch of such a partitioner and the comparators used in steps 5 and 6 is given at the end of this section.)

4) Shuffling: The intermediate key-value pairs of the mappers are shuffled across the network such that each reducer gets the key-value pairs of its partition.

5) Sorting: In each reducer, we want the incoming tuples to be sorted in such a way that tuples from one dataset come ahead of the tuples from the other dataset, so that joining the tuples becomes easy. However, such an ordering is not guaranteed in the given scenario. Tuples from relation R may come before the tuples from relation S in some reducers and vice versa in others because of the shuffling of the key-value pairs. Since key-value pairs arrive at a reducer from different map processes, the ordering of keys within a group is unpredictable. Map/Reduce implements a sort mechanism over the key-value pairs received by a reducer; since we have a composite key, we need to sort on the basis of this composite key and hence need to implement a custom key comparator. We implement a key comparator derived from the WritableComparator class of Hadoop and set it in the job configuration through the setOutputKeyComparatorClass parameter. Now, in each reduce process, the key-value pairs are presented in such a way that the tuples tagged with 0, i.e. belonging to the relation R, are always ahead of

the tuples tagged with 1, i.e. belonging to the relation S. This is also known as a secondary sort.

6) Grouping: Each reducer groups the keys within a partition and presents each group to a separate reduce process. For example, the key-value pairs ((b1, 0), (65, 0)) and ((b1, 0), (43, 0)) would be presented to one reduce process, while the key-value pairs ((b1, 1), (13, 1)), ((b1, 1), (4, 1)), and ((b1, 1), (58, 1)) would be assigned to a separate reduce process. This behaviour is due to the fact that grouping, like sorting, is done on the basis of the composite key. This is not what we want: all key-value pairs with the same join attribute should be received by one reduce process. This can be accomplished by grouping the keys according to the join attribute only (Figure 12). To achieve this, we implement a custom group comparator, derived from the WritableComparator class of Hadoop, which considers only the join attribute for grouping. We set the comparator for grouping through the JobConf property setOutputValueGroupingComparator. Thus, the key partitioner ensures that all key-value pairs with the same join attribute are directed to the same reduce task, and the group comparator ensures that all key-value pairs with the same join attribute are routed to the same reduce process inside that reduce task.

Figure 12: Key-value pairs before and after group partitioning

7) Reduction: The groups ((b1, 0), <(65,0), (43,0), (13,1), (4,1), (58,1)>) and ((b2, 0), <(3,1), (19,1), (7,1)>) in the example above are provided to separate reduce processes. In each reduce process, a join is computed between the values tagged 0 and the values tagged 1, after decoupling the tags from the

(value, tag) pairs. The joined records produced by this cross product are written to the output part file of the reducer.

The above steps describe the reduce-side join in very simple terms. However, three cases can be encountered while joining the records:

1) The simplest case is when there is at most one tuple with a given join key in each of the datasets. Each reducer then gets at most two values associated with a key, one of which must be from R and the other from S. Both values are joined together into one joined record. If there is only one value associated with a key, there is no value with the same join key in the other dataset, and the reducer does not emit anything.

2) The second scenario arises when a one-to-many join is to be performed: relation R contains one value and relation S contains many values for the same join key. In this case, the first value arriving at a reducer belongs to relation R and the rest belong to relation S. The reducer buffers this first value and then creates a cross product with the subsequent values encountered (which are guaranteed to belong to relation S because of the sorting process).

3) The third and last case is the situation when the first relation contains many values for a particular key while the second relation contains one or many values for that key. This results in many-to-one or many-to-many joins. If there are w tuples in relation R and z tuples in relation S sharing the same join key, the join result is the cross product of the w tuples of R and the z tuples of S, producing w x z joined records. For this case, each reducer buffers in memory all w tuples from relation R; then, whenever a tuple belonging to relation S is encountered, a joined record is produced with each of the buffered w tuples from R. The assumption here is that the w tuples fit in memory, since we are buffering them.

Implementing case (3) ensures that cases (1) and (2) are handled as well, so we handle case (3) in our implementation.

As evident from the description above, the key-value pairs from both datasets have to be shuffled across the network and sorted at each reducer. Moreover, tagging the key-value pairs with source information results in a time and space overhead. Also, in the case of skewed data, all the records for a particular skewed key are sent to one reducer, since a hash partitioner is used. This causes some reducers to be overloaded and hence results in workload imbalance, eventually limiting the benefits of the parallel architecture.
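To make steps 3, 5, and 6 concrete, the following sketch shows one possible shape of the custom partitioner and the two comparators against the old org.apache.hadoop.mapred API. It assumes the composite key is serialised as a Text of the form joinKey + "#" + tag; this encoding and the class names are assumptions of the sketch rather than the exact classes of our implementation.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Assumed composite key layout: "joinKey#tag", where tag is 0 for R and 1 for S.
public class CompositeKeyHelpers {

    private static String joinKeyOf(Text composite) {
        String s = composite.toString();
        int sep = s.lastIndexOf('#');
        return sep < 0 ? s : s.substring(0, sep);
    }

    // Step 3: partition on the join key only, so (b, 0) and (b, 1) meet at the same reducer.
    public static class JoinKeyPartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) { }

        public int getPartition(Text compositeKey, Text value, int numPartitions) {
            return (joinKeyOf(compositeKey).hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Step 5: sort on the full composite key, so tuples tagged 0 (relation R)
    // come ahead of tuples tagged 1 (relation S) within the same join key.
    public static class CompositeKeyComparator extends WritableComparator {
        public CompositeKeyComparator() { super(Text.class, true); }

        public int compare(WritableComparable a, WritableComparable b) {
            return a.toString().compareTo(b.toString());
        }
    }

    // Step 6: group on the join key only, so all values for one join key
    // reach a single reduce call regardless of their tag.
    public static class JoinKeyGroupingComparator extends WritableComparator {
        public JoinKeyGroupingComparator() { super(Text.class, true); }

        public int compare(WritableComparable a, WritableComparable b) {
            return joinKeyOf((Text) a).compareTo(joinKeyOf((Text) b));
        }
    }
}

These classes would be registered on the JobConf through setPartitionerClass, setOutputKeyComparatorClass, and setOutputValueGroupingComparator, as described above.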

3.2 Map-side partitioned joins

If both datasets are already partitioned, the shuffling and sorting incurred by the reduce-side join can be avoided by carrying out the join in the map phase. The two datasets are read in parallel and joined together. This is known as a map-side join since there is no need for the reduce phase and the join is completed in the map phase. An essential condition for this join is that the two datasets should already be partitioned and sorted on the same key, i.e. the join key. If the datasets are partitioned alike, we can be sure that a particular partition x of both datasets contains the same join keys, and hence the matching partitions from the two datasets can be assigned to a worker node for computing the join. If the datasets do not fulfil this condition, two additional Map/Reduce jobs have to be carried out to partition the relations.

Consider again the relations R(A, B) and S(B, C) that have to be joined on the key B, and assume that R and S are not partitioned on the join key. The following is the sequence of steps for carrying out the map-side join:

1) Splitting: The input files of both datasets are split by Hadoop into manageable splits that are assigned to a collection of map processes.

2) Partitioning: In two separate Map/Reduce jobs, we pass the input splits of both datasets through the same map and reduce functions and apply the same partitioner, so that the same number of partitions is generated for the two relations on the join key. The number of partitions depends on the number of reducers, which can be set through the JobConf object of Hadoop. The size of a partition should be such that it can fit in the memory of a mapper. The result of partitioning is written to HDFS as part files named part-%05d (e.g. part-00000 for the output of reducer 0, and so on). We write the partitions of the two datasets into two separate folders, R_dir and S_dir, in HDFS.

3) Mapping: A Map/Reduce job takes R_dir, which contains the partitions of dataset R, as its input. Since we want each mapper to handle exactly one part file in its entirety without splitting it, we enforce this by setting the mapred.min.split.size property of the JobConf object to Long.MAX_VALUE. As a result of this allocation, each mapper handles one partition of relation R. Each mapper determines the file name of the input split it is dealing with by using the map.input.file property of the JobConf object. Recall from the earlier discussion that this file name itself encodes the partition number, e.g. the file part-00005 stores the result of partition 5 from the partitioning phase. Knowing which partition number it is dealing with, each mapper can load the corresponding partition from S_dir in HDFS and build an in-memory hash table for that partition of relation S. Each mapper and reducer provides a configure() function that is called before the start of a map or reduce task, and any initialisation can be performed inside this function. We want the in-memory hash table to be built before the map task starts reading tuples from the partition of the probing relation R, so the hash table is built in the configure() function. While reading each record from the corresponding partition of relation R, we probe the record against this hash table. If a match is found, the join is computed and the result is written to HDFS (Figure 13). Since we have to build an in-memory hash table for the corresponding partition of relation S, we select S to be the smaller of the two relations.

A map-side join is more efficient than a reduce-side join since we avoid the shuffling and sorting phases and hence the whole reduce stage. However, there are strict prerequisites for carrying out this sort of join. If the data is already available in partitioned and sorted order, well and good. In some cases, the datasets to be joined are produced by previous Map/Reduce jobs; if the partitioning of the output emitted by such a previous job is based on the join key of the next join job, the partitioning requirement for the map-side join is met. In such situations, the performance of this algorithm is far better than that of the reduce-side join. However, if we have to run additional jobs to put the data in the required order, the performance of this technique suffers. Moreover, since a hash function is used for partitioning both datasets, the map-side partitioned join is prone to taking a performance hit in the case of skewed datasets. We present experimental proof of this argument in the evaluation section.

Figure 13: Data flow of the map-side partitioned join
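As an illustration of the mapping step just described, a mapper might locate and load the matching partition of S in its configure() method roughly as follows. The part-file naming, the comma-separated record layout, and the S_dir location are assumptions carried over from the description above, not verbatim code from our implementation.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Map-side partitioned join: each map task handles one part file of R and
// builds an in-memory hash table for the corresponding part file of S.
public class MapSideJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    // join key -> list of S tuples with that key (multi-map for 1:n matches)
    private final Map<String, List<String>> hashTable = new HashMap<String, List<String>>();

    public void configure(JobConf job) {
        try {
            // e.g. ".../R_dir/part-00005" -> "part-00005"
            String inputFile = new Path(job.get("map.input.file")).getName();
            Path sPartition = new Path("S_dir/" + inputFile);    // assumed layout

            FileSystem fs = FileSystem.get(job);
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(sPartition)));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",", 2);            // assumed: key,rest
                List<String> bucket = hashTable.get(fields[0]);
                if (bucket == null) {
                    bucket = new ArrayList<String>();
                    hashTable.put(fields[0], bucket);
                }
                bucket.add(fields.length > 1 ? fields[1] : "");
            }
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException("could not load the matching S partition", e);
        }
    }

    public void map(LongWritable offset, Text rTuple,
                    OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        String[] fields = rTuple.toString().split(",", 2);
        List<String> matches = hashTable.get(fields[0]);
        if (matches != null) {
            for (String sRest : matches) {                       // one joined record per match
                out.collect(new Text(fields[0]),
                            new Text((fields.length > 1 ? fields[1] : "") + "," + sRest));
            }
        }
    }
}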

3.3 Memory-backed joins

A memory-backed join is essentially a map-side join in the sense that it, too, is carried out completely on the map side and the reduce phase is skipped altogether. However, as opposed to the map-side partitioned join described earlier, a memory-backed join does not require both datasets to be already partitioned and sorted on the join key; the keys can be in any order.

There are further cases in this join technique. The simplest case is when one of the datasets is small enough to fit in the memory of each mapper. If this condition is fulfilled, the smaller dataset is treated as the inner relation. This dataset has to be provided to each mapper so that an in-memory hash table can be built for it and then probed by the outer relation. We use the Distributed Cache mechanism of Hadoop to copy this smaller relation to each mapper. The distributed cache is provided by the Map/Reduce framework to efficiently distribute files needed by all worker machines. Files to be placed in the distributed cache should be on HDFS. Once per job, these files are registered with the distributed cache, and the framework then copies their contents to each worker machine before any task is started on that machine. Again, in the configure() function of each mapper, an in-memory hash table is

built from the contents copied by the distributed cache (which is the inner relation in this case, i.e. relation S). Each mapper is assigned one input split of the outer relation R (Figure 14). A record reader hands the records of this split to the map task one by one. Each record is probed against the hash table to find a match; if a match is found, a join is computed and the joined record is emitted as output, which is eventually stored on HDFS.

Figure 14: Data flow of the memory-backed join

A more complicated case of this technique is the scenario when the inner relation is not small enough to fit in main memory. In this situation, we divide the inner relation into n partitions such that each partition is small enough to fit in memory, and carry out the above memory-backed hash join for each of the n partitions of the inner relation. However, this means executing n jobs, one for joining each of the partitions with the whole outer relation. Since this clearly results in a higher execution cost, we ignore this alternative for the rest of the discussion.

As is clear from the discussion above, a memory-backed join avoids the sorting and shuffling phases of the reduce-side join. Unlike the map-side partitioned join, it does not require the datasets to be partitioned. Instead, the inner relation is copied completely into the memory of each mapper and is used for the join. Although the memory-backed join takes the least amount of time for joining, the performance of this algorithm is strictly conditioned by the size of the inner relation: if the inner relation does not fit in the main memory of a mapper node, the join operation cannot be completed. Since no partitioning is required for the memory-backed join, it is insensitive to skew. However, as stated earlier, the biggest drawback of this join algorithm is the condition on the size of the build relation.
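The distributed-cache plumbing for this join could look roughly as follows; the HDFS path of S, the helper names, and the callback interface are assumptions of this sketch.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Distributing the small relation S to every mapper for the memory-backed join.
public class MemoryBackedJoinSetup {

    // Job submission side: register the HDFS copy of S with the distributed cache.
    public static void addInnerRelationToCache(JobConf job) throws IOException {
        DistributedCache.addCacheFile(URI.create("/data/S.txt"), job);   // assumed HDFS path
    }

    // Mapper side (called from configure()): open the local copy and hand each
    // line to a caller-supplied loader that fills the in-memory hash table.
    public static void readCachedInnerRelation(JobConf job, LineHandler handler)
            throws IOException {
        Path[] localFiles = DistributedCache.getLocalCacheFiles(job);
        BufferedReader reader = new BufferedReader(new FileReader(localFiles[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            handler.handle(line);
        }
        reader.close();
    }

    // Minimal callback used by readCachedInnerRelation.
    public interface LineHandler {
        void handle(String line);
    }
}

A mapper would call readCachedInnerRelation() from its configure() method, filling a multi-map keyed on the join key, and then probe that map for every record of R inside map().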

3.4 The hybrid Hadoop join

We present a join technique, the hybrid Hadoop join, that is a combination of the map-side and reduce-side joins. Unlike a map-side partitioned join, which performs a full partition-wise join, our algorithm performs only a partial partition-wise join. Like the map-side partitioned join, we require the relations to be blocked into the same number of partitions on the basis of the join key. Like the reduce-side join, we carry out the join in the reduce phase. However, unlike the map-side partitioned join, we require only one of the relations to be pre-partitioned, while the other relation is partitioned on the fly while carrying out the join; this dynamic partitioning of the second relation saves one Map/Reduce job. Unlike the reduce-side join, we do not require tagging of the input relations to distinguish the tuples of the two sources for joining, and hence we avoid the time and space overhead incurred by tagging. Since these overheads are avoided, we expect this algorithm to perform better than the map-side partitioned join and the reduce-side join. We compare the performance of our algorithm with the other algorithms in the evaluation section.

The following are the steps to produce joined records using our algorithm:

1) Step 1: We assume that the smaller of the two relations (S) is already partitioned into x partitions on the basis of the join key (if it is not, we run a Map/Reduce job to partition it).

2) Step 2: In a second job, we read the input splits of the outer relation R and pass them through an identity mapper, since we do not need to do anything in the mapper; we only want the data to pass through the partitioning stage. The partitioning function should generate x partitions on the join key so that each partition of R has a corresponding partition of S containing the same keys. For this purpose, the partitioning function must be the same for both relations.

3) Step 3: After the partitioning stages, each reducer receives one partition of the outer relation R (Figure 15). The reducer now needs the corresponding partition of the relation S, which is stored in HDFS. For each reducer, we determine the partition number of relation R it is dealing with by using the mapred.task.partition property of the JobConf object; this property returns an integer representing the partition number the reducer is handling. Using this partition number, we construct an HDFS path for the corresponding partition of the relation S and load that partition from HDFS.

4) Step 4: An in-memory hash table is built for the loaded partition. We want this hash table to be built before any reduce call is carried out (which is where we perform the actual join), so we complete the loading of the partition and the building of the hash table in the configure() function.

5) Step 5: As far as the partition of relation R is concerned, each reducer is now handling the same partition as the one of S for which we have built the in-memory hash table. The framework provides to each reduce call a key and the set of all values related to that key. The key is probed against the hash table; if one or more matches are found, a join is computed with all the associated values and the result is written to HDFS.

Figure 15: Data flow of the hybrid Hadoop join
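One possible shape of such a reducer, written against the old org.apache.hadoop.mapred API, is sketched below. The HDFS path pattern for the pre-partitioned S, the comma-separated record layout, and the class name are assumptions of this sketch rather than the exact code of our implementation.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hybrid Hadoop join, reduce side: each reducer loads the S partition with the
// same partition number as the R partition it receives, then probes it.
public class HybridJoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    // join key -> all S tuples with that key (multi-map for 1:n and n:m joins)
    private final Map<String, List<String>> sPartition = new HashMap<String, List<String>>();

    public void configure(JobConf job) {
        try {
            int partition = job.getInt("mapred.task.partition", 0);
            // assumed layout: S was pre-partitioned into S_dir/part-00000, part-00001, ...
            Path path = new Path(String.format("S_dir/part-%05d", partition));

            FileSystem fs = FileSystem.get(job);
            BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",", 2);            // assumed: key,rest
                List<String> bucket = sPartition.get(fields[0]);
                if (bucket == null) {
                    bucket = new ArrayList<String>();
                    sPartition.put(fields[0], bucket);
                }
                bucket.add(fields.length > 1 ? fields[1] : "");
            }
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException("could not load the corresponding S partition", e);
        }
    }

    public void reduce(Text joinKey, Iterator<Text> rValues,
                       OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        List<String> matches = sPartition.get(joinKey.toString());
        if (matches == null) {
            return;                                              // no matching S tuples
        }
        while (rValues.hasNext()) {                              // cross product of the R values
            String rRest = rValues.next().toString();            // with the matching S values
            for (String sRest : matches) {
                out.collect(joinKey, new Text(rRest + "," + sRest));
            }
        }
    }
}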

We use multi-maps to implement the hash tables in order to cope with one-to-many and many-to-many joins. A record in the outer table can have a number of matching records in the inner table. In order to store all the values related to a particular key, we keep an ArrayList inside the hash table for each key, where every value in that ArrayList belongs to the same key. If a key match is found, a joined record is produced with each entry of the ArrayList. For consistency with our implementation, we use multi-maps for the map-side and memory-backed joins as well (which are also based on the hash join).

Since we are using a hash function for partitioning the datasets, our algorithm can also suffer from an uneven distribution of workload in the case of skewed keys. In two other versions of our algorithm, we therefore incorporate the simple range partitioner and the virtual processor partitioner. In the evaluation section, we compare the performance of the hash-partitioned and range-partitioned implementations of our algorithm with the other three algorithms discussed above and determine which algorithm performs best in the case of skewed and non-skewed data.

1. Simple Range partitioner

In order to partition the keys using the split vector obtained by sampling the input dataset, we need to implement a custom range partitioner. This range partitioner replaces the hash partitioner of Hadoop and assigns keys to partitions on the basis of their ranges. The steps for partitioning a build relation using the custom range partitioner are as follows:

1) Map/Reduce divides an input dataset into a number of splits of size 64MB (the default), depending on the size of the input dataset. In order to sample the input dataset, the master node is directed to collect samples of the join key attribute from each of the input splits before the start of the job. We assume that the join key attributes are randomly distributed over the input datasets and hence over the input splits. For the number of samples x specified by the user, the master retrieves x/y samples from each of the y input splits.

2) These samples of the join key attribute are stored in sorted order in a temporary data structure. We then retrieve only p of these x samples at a step size of x/p, where p is the number of partitions.

3) The samples are written to a file and stored in HDFS.

4) The sampling file is distributed to all the mapper nodes using the distributed cache mechanism.

Figure 16: Custom range partitioning for Hadoop

5) Each mapper node uses the sampling file to make the partitioning decisions in the partitioning phase (Figure 16). Each node reads the sampling file from the distributed cache and builds a range map, containing the ranges and their associated partition numbers, for looking up range values. All of this is done in the configure() function of the map tasks; recall that configure() is called before the start of a map task. Whenever a mapper reads a key-value pair, it checks which range the key attribute falls into (here, we assume that the key is the actual join key). If the key belongs to a range, the key-value pair is allocated the partition number associated with that range in the range map. If a join key attribute belongs to more than one partition, the mapper assigns the key randomly to one of the candidate partitions. The partition number is encoded with the (key, value) pair. When this (key, value) pair is received by the partitioner, the partitioner decodes the partition number and returns it to the framework. For this purpose, we build a custom partitioner based on the Partitioner class of Hadoop.
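The encode-and-decode trick described in step 5, and reused for the probing relation below, might be implemented roughly as follows; the separator character and the class names are assumptions of this sketch.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// The mapper decides the partition number from the range map and prefixes it to
// the emitted key; the partitioner only strips off and returns that number.
public class EncodedPartitionHelpers {

    // Mapper side: key "42" destined for partition 7 becomes "7|42".
    public static Text encode(int partition, String joinKey) {
        return new Text(partition + "|" + joinKey);
    }

    // Partitioner side: decode the prefix and hand it back to the framework.
    public static class DecodingPartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) { }

        public int getPartition(Text encodedKey, Text value, int numPartitions) {
            String s = encodedKey.toString();
            int sep = s.indexOf('|');
            return Integer.parseInt(s.substring(0, sep)) % numPartitions;
        }
    }
}

For a probe-relation key that falls into several candidate ranges, the mapper would call encode() once per candidate partition and emit the same value each time, as described below. In a full implementation the prefix would, of course, have to be stripped off again before the join key is used on the reduce side.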

For partitioning the probing relation:

1) We use the same sampling file that was created for the build relation. The file is distributed to each mapper using the distributed cache.

2) Since the key of a tuple may fall in more than one range, we allocate such a tuple to all of the candidate partitions. However, the partitioner class can assign only a single partition number to a key and return it to the framework. To work around this behaviour, we determine the partition numbers for a key inside the map function (just as above) and encode the partition number with the (key, value) pair to be emitted. Where a key falls in more than one range, we have to assign it to all the associated candidate partitions; in this situation, we emit the key-value pair multiple times (once per candidate partition), encoding one partition number with each copy. In the partitioner, the partition number is decoded and returned to the Map/Reduce framework, which then directs the appropriate reducer to pick up the tuple.

2. Virtual processor partitioner

The Hadoop implementation of the virtual processor partitioner is the same as the simple range partitioner, except that the user specifies the number of partitions as a multiple of the number of reduce nodes. Hadoop schedules the partitions on the reduce nodes itself: initially it allocates a single partition to each node, and when a node completes the join of one pair of partitions, the master allocates a partition from the pool of pending partitions to the idle reduce node. This enables dynamic load balancing and hence skew is handled effectively.

Chapter 4
Evaluation

In this chapter, we experimentally evaluate the performance of the join algorithms discussed in Chapter 3 for handling different degrees of skew. To compare the performance, we consider the absolute runtimes of the algorithms.

4.1 The testbed

For the purpose of experimentation, we use a Hadoop cluster of eight nodes. Of these eight nodes, six are datanodes, one is the namenode responsible for managing the distributed file system, and one is the jobtracker node responsible for assigning the map and reduce tasks to the worker nodes. Each node is a Dell PowerEdge SC1425 with two Intel Xeon 3.2 GHz CPUs and 256MB/16GB ECC DDR SDRAM memory. The secondary storage of each node is an 80GB SATA drive running at 7200rpm. The nodes are connected to an HP ProCurve 2650 switch at 100BaseTx-FD. The cluster contains two racks with three datanodes in each rack; the racks are connected by a 1Gbps link. Each node runs Scientific Linux 5.5, Hadoop 0.18, and Java 1.6. The block size is the default 64MB. The heap memory is increased to 1024MB. Two map and two reduce tasks can run on each node, providing a task capacity of four tasks per node.

4.2 Data Sets

We generate the datasets such that the join keys are distributed randomly. In order to test skew, we construct datasets with varying degrees of skew. To keep things simple, we use simple numerical values as the join key. We construct datasets of cardinality 2,000,000 with join keys ranging from 1 to 2,000,000. We take 1 to be the skewed key wherever a dataset is skewed. For datasets with different degrees of skew, we vary the number of 1s in those datasets. The convention used to represent the datasets is as follows: Input1 represents an input

dataset that has only one 1 as a join key; all the other join key attributes are randomly assigned values from 2 to 2,000,000. Similarly, Input10 represents an input dataset with ten 1s, where the remaining 1,999,990 values are randomly assigned from 2 to 2,000,000; Input100 has 100 1s; Input1K has 1,000 1s; Input20K has 20,000 1s, and so on. We represent a join as, for example, Input1 x Input1K to denote a join between two input datasets, one consisting of a single 1 and the other containing 1,000 1s. When we write a join as Input1 x Input10, the first of these datasets is the build relation and the second is the probe relation.

Each tuple in the datasets consists of a join key, two random date values, and two random strings. The average size of a tuple is 80 bytes, with the join key occupying 8 bytes. Some example input tuples are:

,Sun Nov 14 02:39:40 BST 1943,Wed Jun 29 08:38:42 BST 1977,faukiwevvy,vdniyg
,Sat Jul 23 02:48:22 BST 1938,Mon Jun 04 09:12:37 BST 1917,hmeenobxao,qjynum
,Sat Apr 03 02:57:25 GMT 1954,Tue Nov 16 09:44:33 GMT 1926,hhdphumifv,agftml
,Sat Mar 29 05:26:36 GMT 1913,Fri Aug 04 06:02:21 BST 1916,gwhkisdgqn,svwojx
,Tue Dec 03 09:26:03 GMT 1912,Fri Dec 14 01:43:11 GMT 1934,fkcaiypevc,cqojdc
,Wed Sep 17 04:41:07 BST 1947,Tue May 03 05:01:16 BST 1932,xscyphkgwe,ticdll

The relations are horizontally partitioned by the Map/Reduce framework into chunks of cardinalities Ri and Si across the i nodes. The conventions for representing the algorithms are presented in Table 4.

Table 4: Conventions for representing the algorithms
Abb.  | Algorithm
RJ    | Reduce-Side Join
MPJ   | Map-Side Partitioned Join
MBJ   | Memory-Backed Join
HHH   | Hybrid Hadoop Join with Hash Partitioning
HHR   | Hybrid Hadoop Join with Simple Range Partitioning
HHVP  | Hybrid Hadoop Join with Virtual Processor Range Partitioning

4.3 Tests

Before carrying out the actual comparison of the performance of the join algorithms, we carry out some tests (Tests 1-5) to determine appropriate values for some essential parameters required for the optimal execution of the join algorithms. Test 6 shows the results of the performance comparison of the join algorithms.

Test 1: To determine the time for sampling and finding the splitters

This test examines whether a significant amount of time is taken for sampling the input datasets and selecting the split values for the range vector in the range partitioning approach. From a dataset of size N, we take x samples and then select s splitters. For these experiments, we have fixed the number of splitters, s, to 24. For an input dataset of 14,000,000 records and 1.1 GB in size, we vary the number of samples picked from the input splits. The time required for the splitter-finding algorithm is shown in Figure 17.

Figure 17: Time trend for taking samples and selecting splitters (sampling time in ms against the number of samples)

As is obvious from the figure, the time required for sampling and building a split vector increases with the number of samples taken. However, this time is not very significant: even taking 100,000 samples takes less than 3 seconds.
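For reference, the splitter-selection step that this test measures amounts to little more than sorting the sampled keys and taking evenly spaced values from them. A plain-Java sketch is given below; the exact indexing and the handling of the lowest sample block are assumptions of the sketch.

import java.util.Arrays;

// Pick the splitters from x sampled join keys: sort the samples and take
// every (x/p)-th value. The resulting split vector is written to HDFS and
// shipped to the mappers through the distributed cache.
public class SplitterSelection {

    public static long[] selectSplitters(long[] sampledKeys, int numPartitions) {
        long[] sorted = sampledKeys.clone();
        Arrays.sort(sorted);

        int numSplitters = numPartitions - 1;          // p partitions need p-1 boundaries
        long[] splitters = new long[numSplitters];
        int step = sorted.length / numPartitions;      // assumes x >= numPartitions
        for (int i = 0; i < numSplitters; i++) {
            splitters[i] = sorted[(i + 1) * step];     // skip the lowest block of samples
        }
        return splitters;
    }
}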

The splitter-finding algorithm uses a temporary array of length x (where x is the number of samples taken from the input data) to store the join keys sampled from the input splits. Since each key is 8 bytes, this intermediate data structure takes 8x bytes; for 100,000 samples it takes just 800,000 bytes (around 0.8MB) of main memory, which is not significant. Even if the key size increases, the temporary data structure does not occupy too much space. From the sampled keys, only s keys are selected with a step size of x/s (where s is the number of partitions into which the tuples have to be range-partitioned). The resulting split file of 8s bytes is written to HDFS; if s = 100, the split file is only 800 bytes, so only a trivial amount of disk space is required to store it. This shows that our splitter-finding algorithm is not costly in terms of main-memory space, disk space, or time, and hence does not affect the overall performance of the algorithm that uses the splitter-finding program for partitioning.

Test 2: To determine the size of the bloom filter for pre-processing

In Test 3, we determine whether the semi-join through selection or the semi-join through bit-filtering should be used for filtering the input datasets. For this purpose, we first have to select an appropriate size for the bloom filter; in this test we determine what size of bloom filter results in the best performance. The size of a bloom filter is the length of the bit array whose selected bits are set after applying the hash functions to a key. As discussed earlier, the accuracy of a bloom filter depends on the length of the filter. However, a large bloom filter can result in a greater processing time for filtration. The reason is that the bloom filter is created from the join keys of the probing relation and saved in binary format in HDFS; for filtration, each map task processing an input split of the build relation has to retrieve the binary file from the distributed cache, rebuild the bit array from that file, and then use it to filter the keys of the build relation. The bigger the bloom filter, the longer it takes to reconstruct the bit array from the binary file. On the other hand, a small bloom filter results in too many false positives, and those false tuples are propagated to the join stage. We execute the reduce-side join algorithm to join two datasets while varying the size of the bloom filter. In our experiment, the total

number of keys of dataset 1 filtered by the unique keys of dataset 2 should be 1,264,336. Table 5 lists the actual number of tuples that pass the filter test for different sizes of the bloom filter.

Table 5: Effects of increasing the size of the bloom filter
Bit array size | Size on disk (MB) | Execution time (sec) | Tuples filtered | % error

As evident from the table, although a bloom filter of 1,000,000 bits occupies little disk space, its false positive rate is too high and hence a lot of tuples pass the filter falsely. This increases the overall execution time of the join algorithm, since these falsely identified tuples have to be shuffled across the network (we performed the test using the reduce-side join). Although the falsely identified tuples are discarded in the join stage (i.e. the reduce stage), the cost of shuffling them still increases the overall execution time of the algorithm. In comparison, the larger bloom filters let through only a small number of false tuples, the rate decreasing as the size of the bit filter increases. However, while increasing the size of the bit filter decreases the number of false tuples to be shuffled across the network, the overall execution time increases, because reconstructing a bigger filter from the binary file takes more time. As a trade-off between the filter-construction time and the error rate, we select a filter size of 50,000,000 bits for our experiments.

Test 3: To determine the effectiveness of pre-processing

In this test we compare the performance of all the join algorithms with and without pre-processing. The two pre-processing techniques used are the semi-join through

bit-filtering and the semi-join through selection. As discussed earlier, the effects of pre-processing are noticeable when the join keys of the smaller dataset are only a subset of the join keys of the larger dataset. For testing the effects of pre-processing, we generate another dataset with 10,000,000 keys ranging from 1 to 10,000,000. The input with 2,000,000 keys (described earlier) now contains only a subset of the keys of the bigger dataset (recall that the keys of that dataset range from 1 to 2,000,000). We filter the bigger dataset with the keys of the smaller dataset and then use the filtered dataset for joining with the smaller dataset. We determine the execution time of all the join algorithms without pre-processing as well as with both pre-processing techniques (the semi-join using selection and the semi-join using the bloom filter) and observe the effect. We conducted each experiment three times and present the mean of those values here.

Figure 18: Execution time of the join algorithms (RJ, MPJ, HHH, MBJ) with and without pre-processing in low-selectivity situations; the three series are no pre-processing, semi-join using the bloom filter, and semi-join using selection

The memory-backed join without pre-processing did not finish because of the prohibitively long time required to build the in-memory hash table for the larger dataset, so it is not included in the figure. As evident from Figure 18, pre-processing always lowers the execution time of each of the algorithms. Compared to the semi-join through the bloom filter, the semi-join through selection yields better results. This is because the semi-join through selection constructs a hash table in main memory to test the containment of keys, and only one lookup is required for this

test. In the case of the bloom filter, although the lookup time is constant, each containment test requires k lookups, where k is the number of hash functions used for the bloom filter.

We conduct another experiment to observe the effects of pre-processing when the selectivity factor is high, i.e. when a significant number of tuples of the dataset to be filtered actually take part in the join and hence are passed through to the join stage. The results are presented in Figure 19. It is clear from the figure that in situations where filtering does not discard a significant proportion of the tuples in the dataset to be filtered, pre-processing does not improve the execution time of the join algorithms; instead, applying the pre-processing adds an overhead. Thus the semi-join loses its benefit when the increase in processing delay due to the semi-join phase outweighs the reduction in communication delay. Nevertheless, as a generalisation, we will use the semi-join through selection for pre-processing in the rest of our experiments.

Figure 19: Effect of pre-processing in high-selectivity situations (execution time of RJ, MPJ, HHH, and MBJ with no pre-processing, with the semi-join using the bloom filter, and with the semi-join using selection)

Test 4: To determine the effect of increasing partitions on the join of skewed relations

We determine whether increasing the number of partitions has any impact on the performance of skewed joins when we use hash partitioning. In Map/Reduce, each partition is processed by one reduce task, so the number of partitions is equal

to the number of reduce tasks. We can specify the number of reduce tasks in the job configuration and hence can set the number of partitions. We compute the join Input400K x Input10 with the HHH algorithm and determine the amount of time taken by each reduce task (where the join takes place) while varying the number of reduce tasks. We want to observe whether increasing the number of reduce tasks (i.e. partitions) spreads the workload relatively evenly in the case of skewed joins. We take the number of partitions as a multiple of the number of nodes in the cluster.

Table 6: Time taken by reducers in the case of a skewed relation (6 partitions)
Reducer:    R0  R1  R2  R3  R4  R5
Time (sec):

Using hash partitioning, key 1 is mapped to reducer R1 (hash_code_of_1 % no_of_partitions). As evident from Table 6, R1 is the most loaded partition since it receives all the tuples with key 1 and hence takes the longest to complete. If we increase the number of partitions, the skewed key 1 is still directed to only one reduce task, i.e. R1, since all keys with the same hash code are assigned to the same reduce task. Tables 7, 8, 9, and 10 present the execution times of the reduce tasks when the number of reduce tasks is increased to 12, 18, 24, and 30 respectively. In all these cases, the reduce task R1 is the most overloaded. Moreover, the tables show that increasing the number of partitions increases the time difference between R1 and the remaining reduce tasks: the non-skewed keys get spread further apart among the reduce tasks as their number grows, so the execution time of those reduce tasks keeps decreasing, but the overloaded reduce task remains unaffected since the skewed key 1 does not get spread. On the other hand, increasing the number of reduce tasks creates a scheduling overhead, since there are only six worker nodes in the cluster and the master has to schedule the pending tasks on a node whenever it finds one idle. Figure 20 shows the total execution time of the join algorithm for different numbers of reduce tasks. It is clear from the figure that although increasing the number of reduce tasks decreases the execution time of each reduce task (except the overloaded one), the total execution time of the algorithm increases slightly because of the

scheduling overhead.

Table 7: Time taken by reducers in the case of a skewed relation (12 partitions)
Reducer:    R0  R1  R2  R3  R4  R5
Time (sec):
Reducer:    R6  R7  R8  R9  R10 R11
Time (sec):

Table 8: Time taken by reducers in the case of a skewed relation (18 partitions)
Reducer:    R0  R1  R2  R3  R4  R5
Time (sec):
Reducer:    R6  R7  R8  R9  R10 R11
Time (sec):
Reducer:    R12 R13 R14 R15 R16 R17
Time (sec):

Table 9: Time taken by reducers in the case of a skewed relation (24 partitions)
Reducer:    R0  R1  R2  R3  R4  R5
Time (sec):
Reducer:    R6  R7  R8  R9  R10 R11
Time (sec):
Reducer:    R12 R13 R14 R15 R16 R17
Time (sec):
Reducer:    R18 R19 R20 R21 R22 R23
Time (sec):

Table 10: Time taken by reducers in the case of a skewed relation (30 partitions)
Reducer:    R0  R1  R2  R3  R4  R5
Time (sec):
Reducer:    R6  R7  R8  R9  R10 R11
Time (sec):
Reducer:    R12 R13 R14 R15 R16 R17
Time (sec):
Reducer:    R18 R19 R20 R21 R22 R23
Time (sec):
Reducer:    R24 R25 R26 R27 R28 R29
Time (sec):

Figure 20: Execution time of the HHH algorithm for different numbers of partitions (i.e. reduce tasks), plotted as execution time against the number of partitions

Test 5: To determine the number of partitions for the virtual processor range partitioning

As discussed earlier, the range partitioner has a tendency to spread the skewed keys over more than one partition. Increasing the number of partitions for the virtual processor range partitioning improves the performance of skewed joins because the skewed key gets spread over a larger number of ranges and hence is assigned to separate partitions, which are handled by separate reduce tasks. In this experiment, we observe the effect of increasing the number of partitions for the virtual processor range partitioning approach while performing the join Input400K x Input10 with the HHVP algorithm. Table 11 shows the effect of increasing the number of partitions on the distribution of the skewed keys among the reducers and on the execution time of each reducer.

The split vector represents the boundaries of the ranges of join keys in each partition. For example, in the case of six partitions the split vector is {1, , , , }. Keys ≤ 1 belong to partition 1 (which is assigned to reducer 1), keys greater than 1 and up to the next boundary are assigned to partition 2 (which is assigned to reducer 2), and so on. The corresponding vector of reducer execution times is {139, 52, 55, 54, 62, 62}, showing that the reducer handling all the tuples with join key 1 takes 139 seconds to compute the join. The second reducer, handling join keys between 1 (exclusive) and the second boundary (inclusive), takes 52 seconds, and so on. Join keys greater than the last boundary are handled by reducer 6, which takes 62 seconds to compute the join. In the case of split vectors containing a replicated join key, the tuples with that join key can be routed to any of the reducers handling that key; for example, in the case of 12 partitions, the tuples with join key 1 can be forwarded to reducer 1, 2, or 3.

As evident from Table 11, the greater the number of partitions, the more partitions the skewed key is spread over. The execution time of the reducers handling the skewed keys also decreases as the number of partitions increases. However, with the increase in the number of partitions, the cost of searching the range map to determine the appropriate partition number also increases. When this cost starts to become significant, the performance gains obtained from the reduced cost of the skewed join start to diminish. This is evident from the total execution time column of Table 11.

Table 11: Effect of the number of partitions on the execution time of the skewed reducers for the virtual processor range partitioning

Figure 21 graphically summarises the results of Table 11. It depicts the number of reducers handling the skewed key 1, along with their execution times, as the number of partitions is varied. It is clear from the figure that increasing the number of partitions increases the number of reducers handling the skewed key and lowers the execution time of each such reducer.

Figure 21: The execution times of the reducers handling the skewed keys for different numbers of partitions (time in seconds against the number of partitions)

Test 6: Comparison of the algorithms in the case of skewed and non-skewed joins

In order to better understand the performance comparison of the join algorithms, we first provide an insight into how much of the total execution time each algorithm spends on the actual join operation. Figure 22 depicts the time taken by the partition, map, shuffle, and reduce stages of the join algorithms discussed in Chapter 3 when executed on non-skewed datasets. The total execution time of each algorithm is also shown in the figure.

Figure 22: Stages of Execution for Join Algorithms

The actual join takes place at a different stage in each of these algorithms. In the reduce-side join and our hybrid Hadoop join, the join takes place during the reduce stage; in both, the join itself takes little time compared to the time required to shuffle the data across the network. In the memory-backed and map-side partitioned joins, the two datasets are joined during the map phase and no shuffling across the network is involved, so the time of the shuffle stage is saved. However, in the map-side partitioned join, the pre-processing step of partitioning the two datasets takes a significant share of the total time, since the partitioning itself involves shuffling across the network. Compared to the map-side partitioned join, our hybrid Hadoop join partitions only one dataset before the second job, which performs the join, is carried out; thus its partitioning phase takes much less time. The other dataset is partitioned dynamically before its tuples reach the reduce phase, where the actual join takes place. As evident from Figure 22, the memory-backed join finishes the earliest, because it involves neither shuffling nor partitioning. However, the memory-backed join cannot be used in every case, since there are restrictions on the size of the build relation: if the build relation is too large to fit in memory (which is the usual situation for Map/Reduce applications with billions of records), the memory-backed join breaks down. The other three algorithms, the map-side partitioned, reduce-side, and hybrid Hadoop joins, do not place such stringent conditions on the sizes of the input datasets and hence can be used in general.
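As a concrete illustration of a join carried out in the reduce stage, the following is a minimal sketch of a reduce-side style join reducer rather than the exact code used in this project: it assumes each map output value has been tagged by its mapper with a hypothetical 'L'/'R' prefix identifying the source relation, and it emits the cross product of matching tuples.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch of a reduce-side join: all tuples sharing a join key arrive at the
    // same reducer; values are tagged by the mappers with their source relation.
    public class ReduceSideJoinSketch extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text joinKey, Iterable<Text> taggedValues, Context context)
                throws IOException, InterruptedException {
            List<String> left = new ArrayList<>();
            List<String> right = new ArrayList<>();

            // Separate the tuples of the two relations using the source tag.
            // Values are copied to Strings because Hadoop reuses the Text object.
            for (Text value : taggedValues) {
                String v = value.toString();
                if (v.startsWith("L")) {
                    left.add(v.substring(1));
                } else {
                    right.add(v.substring(1));
                }
            }

            // Emit the cross product of the matching tuples; a heavily skewed key
            // concentrates all of this work on a single reducer.
            for (String l : left) {
                for (String r : right) {
                    context.write(joinKey, new Text(l + "\t" + r));
                }
            }
        }
    }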

We now present the performance comparison of the reduce-side (RJ), memory-backed (MBJ), and map-side partitioned (MPJ) join algorithms against the hash-partitioned (HHH) and range-partitioned (HHR, HHVP) versions of our hybrid Hadoop join algorithm for varying degrees of skew in the input data. All of these algorithms pre-process the input datasets using the semi-join through selection. The number of partitions for RJ, MBJ, MPJ, and HHH is set to 12. The HHR join algorithm uses the same number of partitions as the number of nodes, i.e. 6. For the virtual processor range partitioning in the HHVP join algorithm, we set the virtual processor factor to 2, i.e. 12 partitions in total. We conduct each experiment three times and report the mean of the measured values. The results of the joins over datasets of varying skew are presented in Figures 23 to 34.

Figure 23: Join Results for Input1xInput1, output records produced: 4,001,864

Figure 24: Join Results for Input10KxInput1, output records produced: 2,000,

Figure 25: Join Results for Input100KxInput1, output records produced: 2,002,328

Figure 26: Join Results for Input300KxInput1, output records produced: 1,799,

Figure 27: Join Results for Input400KxInput1, output records produced: 2,000,015

Figure 28: Join Results for Input500KxInput1, output records produced: 1,999,

Figure 29: Join Results for Input600KxInput1, output records produced: 1,798,816

Figure 30: Join Results for Input300KxInput10, output records produced: 4,503,238

Figure 31: Join Results for Input500KxInput10, output records produced: 6,501,763

Figure 32: Join Results for Input600KxInput10, output records produced: 7,199,

Figure 33: Join Results for Input100KxInput100, output records produced: 11,899,500

Figure 34: Join Results for Input300KxInput100, output records produced: 31,499,

Discussion on the results of the comparison tests

It is quite evident from the results presented in Figures 23 to 34 that, for any degree of skew, the memory-backed join finishes the earliest. The reason is that neither hash partitioning nor range partitioning is applied to distribute the datasets among the worker nodes computing the join (the mappers). Instead, each mapper works on an input split of the probe relation assigned to it by the Map/Reduce framework, and the input splits are almost equal in size. Each mapper obtains the whole build relation from the distributed cache, builds an in-memory hash table for it, and then probes each tuple of its input split of the probe relation against the hash table. This prevents load imbalance at the nodes computing the join. However, as discussed earlier, there are stringent conditions on the size of the build relation for the memory-backed join: it does not finish when the build relation does not fit in main memory. In the experiments conducted for this comparison the build relation happens to fit in main memory, but practical, real-world datasets consisting of trillions of records easily exceed the size of main memory; in those situations the memory-backed join is out of the competition. In Figure 35, we present the results of the join operation using the memory-backed algorithm for build relations with varying numbers of tuples, keeping the number of tuples in the probe relation constant.
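Before turning to those results, here is a minimal sketch of the broadcast-style map-side hash join just described; it is an illustration rather than the project's exact code, and it assumes the build relation has been shipped via the distributed cache as a hypothetical tab-separated file named build.txt, with the join key in the first column.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of a memory-backed (broadcast) join: every mapper loads the whole
    // build relation into an in-memory hash table and probes it with each tuple
    // of its input split of the probe relation. No shuffle phase is needed.
    // For simplicity the sketch assumes unique join keys in the build relation;
    // a real implementation would keep a list of build tuples per key.
    public class MemoryBackedJoinSketch extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, String> buildTable = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // "build.txt" is assumed to have been distributed to every node and
            // symlinked into the task's working directory via the distributed cache.
            try (BufferedReader reader = new BufferedReader(new FileReader("build.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split("\t", 2);
                    buildTable.put(fields[0], fields.length > 1 ? fields[1] : "");
                }
            }
        }

        @Override
        protected void map(LongWritable offset, Text probeTuple, Context context)
                throws IOException, InterruptedException {
            String[] fields = probeTuple.toString().split("\t", 2);
            String match = buildTable.get(fields[0]);
            if (match != null) {
                // Emit the joined tuple; the join completes entirely in the map phase.
                context.write(new Text(fields[0]),
                        new Text(match + "\t" + (fields.length > 1 ? fields[1] : "")));
            }
        }
    }

Because the hash table lives entirely in each mapper's heap, the approach fails as soon as the build relation outgrows the available memory, which is exactly the behaviour reported below for the 10,000,000-record build relation.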

For smaller numbers of tuples in the build relation, the algorithm runs to completion. However, when the build relation consists of 10,000,000 records, the join operation never finishes, because the nodes computing the join run out of memory while building the in-memory hash table for such a large dataset. Therefore, in general, the memory-backed join is not suitable for joining huge datasets, whereas the other join algorithms discussed in Chapter 3 do not have this limitation.

Figure 35: Time taken by the memory-backed join for different numbers of tuples in the build relation (*DNF = Does Not Finish)

Among the remaining algorithms, the reduce-side join performs better in little- or no-skew situations. This is because all the other algorithms must partition the datasets on the join key so that identical keys from both datasets accumulate in partitions with the same partition number; the two corresponding partitions of the two datasets are then joined in the join stage. Since the reduce-side join skips this partitioning phase, it outperforms the other two algorithms. However, as the skew increases, the performance of the reduce-side join starts to degrade. The reason is that keys are distributed among the reduce nodes, where the join takes place, according to the hash code of the join key. The reduce nodes receiving the skewed keys are overloaded compared to the other reduce nodes and hence take more time to compute the join, thereby degrading the performance.
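This skew sensitivity follows directly from how hash partitioning assigns keys to reducers. The following minimal sketch, written in the style of Hadoop's default HashPartitioner and shown here for illustration rather than as this project's code, makes the point: every tuple carrying the same skewed join key hashes to the same partition, so one reducer receives the entire skewed workload.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hash partitioning in the style of Hadoop's default HashPartitioner: the
    // partition depends only on the key's hash code, so all tuples with the same
    // (possibly heavily skewed) join key land on a single reducer.
    public class HashPartitionerSketch extends Partitioner<Text, Text> {

        @Override
        public int getPartition(Text joinKey, Text value, int numPartitions) {
            // Mask the sign bit so the result is non-negative, then take the modulus.
            return (joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

In contrast, the range-partitioned variants can split the tuples of a replicated skewed key over several partitions, as sketched earlier for the virtual processor range partitioner.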

Between the map-side partitioned join and the hash-partitioned version of our hybrid Hadoop join algorithm, the hybrid Hadoop join generally shows slightly better performance, because it only partially partitions the input data whereas the map-side partitioned join involves full partitioning. However, since both of these algorithms use hash partitioning, skewed joins cause load imbalance: some worker nodes are overloaded and the system takes longer to compute the join. It is evident from the results that, in the case of heavy skew, the virtual processor partitioning approach (HHVP) performs the best among all algorithms except the memory-backed one, since it distributes the skewed keys over more partitions. The load imbalance is prevented and the system does not have to wait for the heavy-hitter nodes to complete their processing. For heavily skewed joins, the simple range partitioning version (which uses the same number of partitions as the number of nodes) performs better than the algorithms using hash partitioning, but its performance is always below that of the virtual processor partitioning approach, because the virtual processor partitioning spreads the skewed keys further apart than simple range partitioning does. It is also clear from the results that for non-skewed or lightly skewed joins, hash partitioning performs better than range partitioning. We therefore modify our algorithm so that it can detect skew and dynamically select an appropriate partitioning strategy.
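A minimal sketch of the kind of decision logic this implies is shown below; the sampling input, the skew threshold, and the class and method names other than the standard Job API call are hypothetical, and the actual detection mechanism used in the project may differ.

    import java.util.Map;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch of choosing a partitioning strategy from a sample of the join keys:
    // if any single key accounts for a large fraction of the sample, switch from
    // hash partitioning to (virtual processor) range partitioning.
    public class PartitionerSelectorSketch {

        // Hypothetical threshold: a key owning more than 20% of the sample is "skewed".
        private static final double SKEW_THRESHOLD = 0.20;

        public static void configurePartitioner(Job job,
                                                Map<String, Integer> keyFrequencies,
                                                int sampleSize) {
            int maxFrequency = 0;
            for (int frequency : keyFrequencies.values()) {
                maxFrequency = Math.max(maxFrequency, frequency);
            }
            double heaviestShare = (double) maxFrequency / sampleSize;

            if (heaviestShare > SKEW_THRESHOLD) {
                // Significant skew: spread the heavy key over several ranges.
                job.setPartitionerClass(RangePartitionerSketch.class);
            } else {
                // Little or no skew: plain hash partitioning is cheaper and balanced.
                job.setPartitionerClass(HashPartitionerSketch.class);
            }
        }
    }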

Chapter 5
Conclusion and Future Works

5.1 Conclusion

The Map/Reduce framework facilitates parallel processing of data distributed among the processing nodes of a computing cluster. For massive datasets, parallel processing significantly reduces the response time, since independent subsets of the work are processed independently at distributed sites. As with other processing tasks, the parallel joining of datasets benefits from parallelization. However, parallel joins are vulnerable to skew in the datasets being joined. If the datasets are sufficiently skewed on some keys, some parallel sites take longer to complete their share of the work than others, and the whole system has to wait for those overloaded sites. Such load imbalances swamp the benefits achievable through parallelization. These load imbalances originate in the partitioning stage, where the input datasets are distributed among the processing sites for the join operation; selecting an inappropriate partitioning strategy results in load imbalance. The hash partitioning strategy is found to be inefficient at distributing the load when the data is skewed. If range partitioning is employed instead, the datasets are distributed among the processing nodes on the basis of the characteristics of the data itself; if the input data is sufficiently skewed, the overused keys are spread over a number of processing nodes and the overall performance of the system improves. The existing Hadoop implementations of the join operation do not handle significantly skewed datasets efficiently, since they rely on hash partitioning, whereas range partitioning handles skewed joins better. In this project, we presented a join algorithm that is a hybrid of the map-side partitioned and reduce-side joins and is capable of handling skew in the input datasets. Since hash partitioning performs well on non-skewed data and range partitioning performs better when the input data is skewed, our algorithm dynamically selects an appropriate partitioning strategy on the basis of the characteristics of the input data.

In the case of non-skewed data it selects the hash partitioning strategy, while if significant skew is detected in the input data it selects the range partitioning strategy for the partitioning phase. As is evident from the results, this algorithm outperforms the other algorithms in the case of heavily skewed input data.

5.2 Future Works

The current algorithm computes two-way joins, i.e. joins between two input datasets. The project can be extended in the future to handle multi-way joins as well. A multi-way join can already be computed as a sequence of two-way joins using the current implementation, but each two-way join would then be a separate Map/Reduce job; a more efficient implementation would compute the multi-way join in a single run, i.e. in one Map/Reduce job. Implementing cyclic and star joins with this algorithm is also a candidate for future work. The numbers of partitions for the range partitioning and the hash partitioning, determined through the tests, provide good performance for the input datasets used in this project, but there is no guarantee that these settings would work equally well for other datasets. Future work could dynamically determine the parameters of the system, such as the number of nodes, their processing capabilities, and their memory sizes, and then select an appropriate number of partitions on the basis of these characteristics.

