Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework


Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework

Fariha Atta
MSc Informatics
School of Informatics
University of Edinburgh
2010


Abstract

The Map/Reduce framework -- a parallel processing paradigm -- is widely used for large-scale distributed data processing. Map/Reduce can perform typical relational database operations such as selection, aggregation, and projection. However, binary relational operators like join, Cartesian product, and set operations are difficult to implement with Map/Reduce. Map/Reduce can process homogeneous data streams easily but does not provide direct support for handling multiple heterogeneous input data streams. Thus the binary relational join operator does not have an efficient implementation in the Map/Reduce framework. Some implementations of the join operator exist for the Hadoop distribution of the Map/Reduce framework. However, these implementations do not perform well in the case of heavily skewed data. Skew in the input data affects the performance of the join operator in a parallel environment where data is distributed among parallel sites for independent joins. Data skew can severely limit the effectiveness of parallel architectures when some processing units (PUs) are overloaded during data distribution and hence take longer to complete than the other PUs. This also wastes the resources of the idle PUs. As data skew occurs naturally in many applications, handling it is an important issue for improving the performance of the join operation. We implement a hash join algorithm that is a hybrid of the map-side and the reduce-side joins of Hadoop with the ability to handle skew, and we compare its performance to the other join algorithms of Hadoop.

Acknowledgements

My heartfelt gratitude goes to my supervisor Stratis Viglas, who provided me with guidance and support throughout the project. I especially appreciate his willingness to help at any time. I would like to acknowledge Chris Cooke for administering the Hadoop cluster and answering my queries regarding the cluster. Finally, I would like to thank my family and friends for their continuous moral support and encouragement.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Fariha Atta)

Table of Contents

Chapter 1. Introduction
    Motivation
    Related Work
    Problem Statement and Aims
    Thesis Outline
Chapter 2. Background
    The Map/Reduce Framework
    The Hadoop Distribution
    Impact of parallelization on the join operation
    Hash Join algorithms
        Join algorithms on a single processor machine
        Hash join algorithm for parallel implementation
    Skew and its impact on the join operation
        Hash Partitioning and its skew sensitivity
        Range Partitioning and its skew sensitivity
    Pre-processing for joining
        Semi-join using selection
        Semi-join using Hash Operator
Chapter 3. Join Algorithms on Hadoop
    Reduce-side joins
    Map-side partitioned joins
    Memory-backed joins
    The hybrid Hadoop join
Chapter 4. Evaluation
    The testbed
    Data Sets
    Tests
        Test 1: To determine time for sampling and finding the splitters
        Test 2: To determine the size of the bloom filter for pre-processing
        Test 3: To determine the effectiveness of pre-processing
        Test 4: To determine the effect of increasing partitions on the join of skewed relations
        Test 5: To determine the number of partitions for the virtual processor range partitioning
        Test 6: Comparison of algorithms in the case of skewed and non-skewed joins
    Discussion on the results of the comparison tests
Chapter 5. Conclusion and Future Works
Bibliography

List of Figures

Figure 1: The Map/Reduce dataflow
Figure 2: The HDFS Architecture
Figure 3: Hadoop Map/Reduce dataflow (source [43])
Figure 4: Partitioning R and S on join column R.r=S.s in three partitions using hash(k)=k%3
Figure 5: Grace Join Algorithm (source [44])
Figure 6: Joining on parallel machines
Figure 7: Example of data skew -- Patent and Cite Tables
Figure 8: A bloom filter with m=15, k=3 for a set {A, B, C}
Figure 9: Bit setting in bloom filter in case of collision
Figure 10: Construction of a Bloom Filter
Figure 11: Data flow of the reduce-side join
Figure 12: Key-Value pairs before and after group partitioning
Figure 13: Data flow of the map-side partitioned join
Figure 14: Data flow of the memory-backed join
Figure 15: Data flow of the hybrid Hadoop join
Figure 16: Custom range partitioning for Hadoop
Figure 17: Time trend for taking samples and selecting splitters
Figure 18: Execution time of join algorithms with and without pre-processing in low selectivity situations
Figure 19: Effect of pre-processing in high selectivity situations
Figure 20: Execution time of the HHH algorithm for different numbers of partitions (i.e. the reduce tasks)
Figure 21: The execution times of the reducers handling the skewed keys for different number of partitions
Figure 22: Stages of Execution for Join Algorithms
Figure 23: Join Results for Input1xInput1, output records produced: 4,001,
Figure 24: Join Results for Input10KxInput1, output records produced: 2,000,
Figure 25: Join Results for Input100KxInput1, output records produced: 2,002,
Figure 26: Join Results for Input300KxInput1, output records produced: 1,799,
Figure 27: Join Results for Input400KxInput1, output records produced: 2,000,
Figure 28: Join Results for Input500KxInput1, output records produced: 1,999,
Figure 29: Join Results for Input600KxInput1, output records produced: 1,798,
Figure 30: Join Results for Input300KxInput10, output records produced: 4,503,
Figure 31: Join Results for Input500KxInput10, output records produced: 6,501,
Figure 32: Join Results for Input600KxInput10, output records produced: 7,199,
Figure 33: Join Results for Input100KxInput100, output records produced: 11,899,
Figure 34: Join Results for Input300KxInput100, output records produced: 31,499,
Figure 35: Time taken by the memory-backed join for different number of tuples in the build relation (*DNF=Does Not Finish)

List of Tables

Table 1: Books Relation
Table 2: Authors Relation
Table 3: Filtered Book Relation
Table 4: Conventions for representing the algorithms
Table 5: Effects of increasing size of the bloom filter
Table 6: Time taken by reducers in case of skewed relation (6 partitions)
Table 7: Time taken by reducers in case of skewed relation (12 partitions)
Table 8: Time taken by reducers in case of skewed relation (18 partitions)
Table 9: Time taken by reducers in case of skewed relation (24 partitions)
Table 10: Time taken by reducers in case of skewed relation (30 partitions)
Table 11: Effect of the number of partitions on the execution time of skewed reducers for the virtual processor range partitioning

Chapter 1
Introduction

1.1 Motivation

Gone are the days when applications processed kilobytes or megabytes of data. This is the age of processing data at the gigabyte, terabyte, or even petabyte scale. Applications these days deal with gigantic amounts of data that do not fit in the main memory of a single machine and are also beyond the processing power of one machine. For example, the data volume of the National Climatic Data Centre (NCDC) [34] is 350 gigabytes; eBay maintains 17 trillion records with a total size of 6.5 terabytes [36]; Facebook manages more than 25 terabytes of logging data per day [35]; and the Sloan Digital Sky Survey (SDSS) [33] maintains about 42 terabytes of image data and 18 terabytes of catalogue data. Processing this massive amount of data is not an easy feat. Data-intensive applications use a distributed infrastructure containing clusters of computers and employ distributed parallel algorithms to process huge volumes of data efficiently. One of the fairly recent advancements in distributed processing is the development of the Map/Reduce paradigm [1]. Map/Reduce, a programming framework developed at Google, provides a cost-effective, scalable, flexible, and fault-tolerant distributed platform for large-scale data processing across a cluster of hundreds or even thousands of nodes. It allows huge volumes of data, up to multiple petabytes, to be processed in parallel [1, 2]. Computations are moved to the machines holding the data on which processing has to be carried out, rather than moving data to the machines that can perform the computation (the traditional parallel processing approach). Map/Reduce makes use of a large number of shared-nothing, cheap commodity machines. The use of commodity hardware results in an inexpensive solution to large data processing problems. Replication of data on multiple nodes ensures availability and reliability on top of the unreliable underlying hardware. Map/Reduce takes care of data movement, load balancing, fault tolerance, job scheduling, and the other nitty-gritty details of parallel processing. Users of the Map/Reduce framework just have to concentrate on the data processing algorithms, which have to be

implemented in the map and reduce functions of the framework. Map and Reduce are the two primitives provided by the framework for distributed data processing. The signatures of these primitives are:

Map: (k1, v1) -> [(k2, v2)]
Reduce: (k2, [v2]) -> [v3]

The map function converts input key-value pairs into intermediate key-value pairs, which are distributed among reduce functions for further aggregation. In simple terms, data is distributed among nodes for processing during the map phase and the result is aggregated in the reduce phase. Algorithms for processing distributed data have to be supplied in these primitives of the framework. This simplifies the development of programs for parallel settings. When it comes to the challenge of processing vast amounts of data, the distinction between structured and unstructured data does not matter much for Map/Reduce. The perceived rival of Map/Reduce, the parallel DBMS, is suitable for efficiently processing large volumes of structured data because of its ability to distribute relational tables among processing nodes, compression, indexing, query optimization, result caching, and so on. However, the parallel DBMS has some inherent limitations. Once it is deployed and the data is distributed among nodes, adding more nodes to scale the parallel DBMS becomes very difficult. Moreover, the parallel DBMS is not fault-tolerant [5]. In case of a fault at one node during processing, the whole processing sequence has to be restarted at each node. Because of these reasons, the parallel DBMS cannot cope well with the demand of processing tremendously large amounts of data. In addition, processing unstructured data is outside the jurisdiction of the parallel DBMS. For such situations, Map/Reduce comes to the rescue. Map/Reduce can process bulk unstructured data and can also handle the structured relations of conventional databases. All a user of the Map/Reduce framework needs to do is implement the processing logic in the map and reduce primitives of the framework. Users can specify algorithms for selection, projection, aggregation, grouping, joining, or other similar processing tasks of relational databases in these primitives. Of all the available implementations of Map/Reduce, e.g. GreenPlum [37], Aster Data [38], Qizmt [39], Disco [40], and Skynet [41], Apache's implementation (called Hadoop [31]) is the most widely used for educational and research purposes because of its

open-source and platform-independent nature. Hadoop allows robust, scalable, and efficient programs for distributed data processing to be developed easily. Hadoop is based on a client-server architecture. Centralized management by the master of the Hadoop cluster makes tasks far simpler and more organized. The master distributes the workload among worker/slave nodes, which report to the master on completion of the requested task. The master then decides the next course of action to be taken. The storage of the vast amount of distributed data and its dissemination to worker nodes in the cluster is managed by the Hadoop Distributed File System (HDFS) [32]. As mentioned earlier, different sorts of operations can conveniently be performed on structured as well as unstructured data using Map/Reduce. Among these operations, joining two heterogeneous datasets is the most important and challenging one. Many Map/Reduce applications require joining data from multiple sources. For example, a search engine maintains many databases, such as crawler, log, and webgraph databases. It constructs the index database using both the crawler and webgraph databases, so it requires joining these two datasets. However, joining large heterogeneous datasets is quite challenging. Firstly, processing two massive datasets in parallel to find matches on some attribute is intimidating even if a large computational cluster is available. Secondly, in a distributed setting, the datasets involved in the join are stored at distributed sites. Thirdly, although the database world is full of different techniques for the join operation, Map/Reduce itself is not built for processing multiple data streams in parallel. By its very nature, Map/Reduce processes a single stream of data at a time and hence does not have an efficient formulation for joining two parallel streams. In the Hadoop implementation of the framework, some strategies have been documented for the join operation. These are the map-side, reduce-side, and memory-backed join techniques, but they have their own inherent limitations. When it comes to joining skewed datasets, the performance of these join techniques degrades. Skew in the distribution of the join attribute's values can limit the effectiveness of parallel execution. The variation in the processing time of parallel join tasks affects the maximum speedup that can be achieved by virtue of parallel execution. In their popular article "Map-Reduce: A major step backwards", DeWitt and Stonebraker criticize Map/Reduce on various aspects, one of which is its inability to handle skew. They state: "One factor that Map/Reduce advocates seem to have overlooked is the issue of skew. The problem occurs in the

map phase when there is wide variance in the distribution of records with the same key. This variance, in turn, causes some reduce instances to take much longer to run than others, resulting in the execution time for the computation being the running time of the slowest reduce instance. The parallel database community has studied this problem extensively and has developed solutions that the Map/Reduce community might want to adopt."

Therefore, this project aims at developing a hash join algorithm for the Map/Reduce framework that can handle large skew in the data, and at analyzing its performance with respect to the implementations provided by Hadoop.

1.2 Related Work

The join operation is one of the most fundamental, most difficult, and hence most researched query operations. It is an important operation that facilitates combining data from two sources on the basis of some common key. The database literature is full of discussions on techniques, performance, and optimization of this operation. Nested loops, sort/merge, and hash joins are the most commonly used join techniques. [3] and [20] provide a general discussion of join processing in relational databases for single-processor systems. They determine that the nested loops algorithm is useful only when the datasets to be joined are relatively small. When the datasets are large, hash-based algorithms are superior to sort-merge joins, provided that the final result need not be sorted before presenting it to the user. The optimization of join techniques for multiprocessor environments has also been widely researched. [10], [12], and [15] discuss join algorithms for multiprocessor databases. The performance of the sort-merge, grace, simple, and hybrid join techniques on a shared-nothing multiprocessor setup, GAMMA, is presented in [4]. They show that the hybrid hash join algorithm dominates the other algorithms for all degrees of memory availability. Different join techniques for Map/Reduce have also been researched. Hadoop implements map-side and reduce-side joins [26], [27], in which the join operation is carried out in the mappers and reducers respectively. Set-similarity joins using Map/Reduce are discussed in [9]. They use the reduce-side join with custom partitioning to group together the most similar tuples. [11] discusses the optimization

of the join operator for multi-way and star joins using the Map/Reduce framework. For a 3-way join among R x S x T, the tuples of R and T are replicated across a number of reducer nodes to avoid communicating the result of the first join. The join is performed on the reduce side. They show that a multi-way join using a single Map/Reduce job is more efficient than cascading a number of Map/Reduce jobs, each performing a 2-way join. A modification of the Map/Reduce framework for the join operation, called Map-Reduce-Merge, is presented in [9]. It introduces a new stage called merge where matching tuples from multiple sources are joined. The modified primitives are:

Map: (k1, v1)_α -> [(k2, v2)]_α
Reduce: (k2, [v2])_α -> (k2, [v3])_α
Merge: ((k2, [v3])_α, (k3, [v4])_β) -> [(k4, v5)]_γ

where k is a key, v is a value, and α, β, γ are the lineages. A map function converts the key-value pair (k1, v1) from lineage α into an intermediate key-value pair. The reduce operation puts all intermediate values related to k2 in the list [v3]. Another map-reduce operation does the same with a key-value pair from lineage β, and the subsequent reduce produces a key-value pair (k3, [v4]). Depending on the values of the keys k2 and k3, the merge operator performs the join and combines the two reduced outputs into another lineage γ. A user-defined module, the partition selector, determines from which reducers the merge operator gets its data for joining. Thus joining is carried out after the reduce stage in an additional merge phase, where each merger receives the corresponding data of multiple datasets from the reduce phase. Sort-merge, block nested-loops, or hash joins can be performed in the merge phase. Hence, all the processing is completed in one Map/Reduce job. However, the Map-Reduce-Merge implementation requires changes to the basic Map/Reduce framework. Many techniques have been implemented for handling skew in parallel join operations. [18] presents a partitioning strategy for skewed data which assumes that the skewed keys are known in advance, which is not the case in practical situations. Range partitioning has long been studied for assigning tuples to partitions on the basis of ranges rather than hash values of the join key. [23] sorts the input datasets according to the join keys. Depending on the processing capability of the system, the number of tuples T to be allocated to each partition is determined. The sorted datasets are then divided into n partitions, each containing T tuples, and the partitions

are assigned to processing units (PUs) in a round-robin fashion. The tuples of the second dataset are assigned to PUs on the basis of the partition ranges of the first dataset. This approach sorts the input dataset to determine appropriate ranges for partitioning, which is quite costly if the datasets are very large. [24], [28], and [29] determine the ranges for partitioning after a full scan of the input data to find the skewed keys. The scanning cost may overshadow the benefits obtained from range partitioning. [25] determines the ranges for partitioning by randomly sampling the input datasets. It also presents the virtual processor partitioning approach, which divides an input dataset into a greater number of partitions than PUs to scatter the skewed keys. To our knowledge, no work has yet been done on handling skew in the join operation for Hadoop. We attempt to add skew-handling capability to Hadoop using range partitioning.

1.3 Problem Statement and Aims

Although several join techniques are available in the database literature, implementing them for the Map/Reduce framework is, if not impossible, not so easy because of the very nature of the framework. All the join techniques implemented for the Map/Reduce framework discussed in the Related Work section have their limitations. Map-Reduce-Merge, for example, introduces a new merge stage, which adds implementation overhead and hence is not an efficient solution. The reduce-side join implemented by Hadoop incurs a space and time overhead as a result of tagging each tuple with source information. Although the performance of the map-side join is better than that of the reduce-side join, the prerequisite for the map-side join is that both input datasets must be pre-partitioned and properly structured before being input to the mappers. Moreover, these algorithms use hash partitioning for distributing the two datasets among the worker nodes. Hash partitioning is sensitive to skew in the input data, where some values appear many times. The repeated values are all routed to one single processing node by the hash partitioning scheme. As a result, the worker nodes handling the repeated values are overloaded with too many records to be joined. Hence, the algorithms that use hash partitioning for data distribution are prone to degraded performance when the input datasets to be joined contain skewed keys. Since all the Hadoop join algorithms use

hash partitioning for data distribution, they suffer a performance hit in the case of skewed data.

Problem Statement: We consider the equi-join of two data sources R and S, with cardinalities |R| and |S|, on the basis of a single join column such that R.r = S.s. To simplify the situation, we do not perform any selection or projection on the data sources, although these can easily be incorporated during either the map or the reduce phase while emitting key-value pairs at the nodes. To keep things simple, we perform an inner join on the datasets for evaluation. We assume that both datasets are stored in HDFS. Our major tasks in this project are:

1. We provide a detailed discussion of the various join algorithms supplied by the Hadoop implementation. We analyze the pros and cons of each of these algorithms, i.e. the map-side, reduce-side, and memory-backed joins.
2. We present our hybrid algorithm, which is a combination of the map-side and the reduce-side join.
3. We discuss some pre-processing techniques, such as the semi-join through bit-filtering and the semi-join through selection, which can reduce the sizes of the datasets to be joined by removing those keys that do not take part in the join operation. We experimentally determine the performance of both techniques for filtering the input datasets.
4. We present different partitioning strategies for workload distribution among the nodes performing the join operation. The default hash-based partitioning of Hadoop overloads some nodes in the case of skewed keys. We discuss how to avoid this by using the range partitioning approach. We incorporate range partitioning in our algorithm for handling skew.
5. We conduct experiments to compare the performance of the map-side, reduce-side, and memory-backed join algorithms with our hybrid algorithm (both versions: hash partitioning and range partitioning) for handling skewed data.

1.4 Thesis Outline

Chapter 2 equips the reader with the background information needed to better understand the problem statement. An overview of Map/Reduce, its Hadoop implementation, hash join techniques, partitioning strategies, and filtering techniques is presented in this chapter. Chapter 3 discusses the Hadoop implementations of the map-side, reduce-side, and memory-backed joins and presents our hybrid algorithm with the range and hash partitionings. Chapter 4 presents the experiments conducted to compare the performance of these algorithms in the case of skewed and non-skewed input datasets and discusses the results. Chapter 5 summarizes the findings and suggests some future extensions to the project.

Chapter 2
Background

2.1 The Map/Reduce Framework

The Map/Reduce framework consists of two operations, map and reduce, which are executed on a cluster of shared-nothing commodity nodes. In a map operation, the input data, available through a distributed file system (e.g. GFS [6] or HDFS [32]), is distributed among a number of nodes in the cluster in the form of key-value pairs. Each of these mapper nodes transforms a key-value pair into a list of intermediate key-value pairs (the output keys need not be the same as the input keys). Depending on the map operation, the list may contain 0, 1, or many key-value pairs (shown in Figure 1). The signature of the map operation is:

Map(k1, v1) -> list(k2, v2)

The intermediate key-value pairs are propagated to the reducer nodes such that each reduce process receives the values related to one key. The values are processed and the result is written to the file system. Depending on the reduce operation, the result may contain 0, 1, or many values (a list of values). The signature of the reduce operation is:

Reduce(k2, list(v2)) -> list(v3)

Figure 1: The Map/Reduce dataflow

An example of a map-reduce job is counting the number of times each word appears in some input data [1]. The input to this problem is a data file, the contents of which are distributed among the nodes of a cluster in the form of splits. Each node receives lines one by one from its input split in the form of key-value pairs; the key in this case is the byte offset of the line and the value is the line in the file. A map operation is performed on each key-value pair and produces a list of intermediate key-value pairs, in this case a word and its count, e.g. ("Hadoop", 1), ("join", 1), etc. The value 1 indicates that the word appears once in a particular line. The key-value pairs from all the mappers are stored in buckets such that the values related to one key (word) are gathered in one bucket. A set of such buckets, called a partition, is then provided to each reducer. A reducer calls a reduce operation for every bucket; the reduce operation aggregates the values to determine the total number of times a word appears in the input file. Thus each call to a reduce operation produces one output for the key. An optional combine phase can be employed by each mapper to minimize the traffic on the network. In this phase, each mapper locally aggregates its output for each key. Hence, for example, instead of transferring the key-value pair ("Hadoop", 1) thirty times, a mapper sends ("Hadoop", 30) only once to the file system. This reduces the traffic when a reducer picks up the value during the shuffling phase.

2.2 The Hadoop Distribution

The Hadoop distribution is built on HDFS, a file system with a master-slave architecture. In a Hadoop cluster of n nodes, one node is the master node, called the NameNode (NN). The other nodes are worker nodes, called DataNodes (DNs). The NN maintains metadata about the file system. Files are broken down into splits of 64 MB and distributed among the DNs. The larger split size, compared to the block size of conventional file systems, reduces the amount of metadata to be maintained for each file. Each split is replicated on three DNs to ensure fault tolerance. Hadoop also ensures locality of data, i.e. processes are scheduled

on the nodes that possess the data on which processing has to be performed; that is, computation is moved to the nodes containing the data rather than data being moved to the nodes capable of doing the computation. This reduces the amount of data transferred among nodes and hence improves performance. The DNs constantly send a heartbeat message to the NN along with the status of the task they have been assigned. If the NN doesn't get any information from a node for a threshold time, it re-schedules the task on another node which contains a replica of the data. Similarly, if tasks are distributed among the DNs and all of the nodes have finished processing but a straggler node is still working, the NN re-schedules the same task on an idle node. Whichever node returns the result first has its output used by the NN, and the similar processes on the other nodes are killed. The general architecture of HDFS is shown in Figure 2.

Figure 2: The HDFS Architecture

The jobtracker and tasktrackers run on the master and worker nodes respectively to handle jobs and tasks. When a Map/Reduce job is submitted to the master, the jobtracker divides it into m tasks and assigns a task to each mapper. The following is the sequence of steps for converting input to output on Hadoop:

1. Mapping Phase: Each mapper works on the non-overlapping input splits assigned to it by the NN. An input split consists of a number of records. The records can be in different formats depending on the InputFormat of the input file. A RecordReader for that particular InputFormat reads each record, determines the key and value for the record, and supplies the key-value pairs to the map functions where the actual processing takes place (Figure 3). Each mapper applies a user-defined function to the key-value pairs and converts them into intermediate key-value pairs. The intermediate results of the mappers are written to the local file system in sorted order.
2. Partitioning Phase: A partitioner determines which reducer an intermediate key-value pair should be directed to. The default partitioner provided by Hadoop computes a hash value for the key and assigns the partition on the basis of the function (hash_value_of_key) mod (total_number_of_partitions).
3. Shuffling Phase: Each map process, in its heartbeat message, sends information to the master about the location of its partitioned data. The master informs each reducer about the location of the mappers from which it has to pick its partition. This process of moving data to the appropriate reducer nodes is called shuffling.
4. Sorting Phase: Each reducer, on receiving its partitions from all the mappers, performs a sort-merge to order the tuples by key. Since the keys within each partition were already sorted by the mapper, the partitions only have to be merged so that equal keys are grouped together.
5. Reduce Phase: A user-defined reduce operation is applied to each group of keys and the result is written to HDFS.
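To make the above sequence concrete, the following is a minimal word-count sketch (our own illustration, not taken from the thesis) written against the standard org.apache.hadoop.mapreduce API. The default HashPartitioner supplies step 2, and the framework carries out the shuffling and sorting of steps 3 and 4; the same reducer class could also be registered as a combiner for the optional combine phase mentioned earlier.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapping phase (step 1): the key is the byte offset of a line, the value is the line itself.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit ("word", 1)
            }
        }
    }
}

// Reduce phase (step 5): all values for one key arrive together after shuffling and sorting.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));   // total occurrences of the word
    }
}
```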

Figure 3: Hadoop Map/Reduce dataflow (source [43])

2.3 Impact of parallelization on the join operation

The massive growth in the input data to be processed hampers the performance of applications executing on uni-processor machines. If it curtails the performance of single-stream operators (selection, projection, aggregation, etc.), it doubles the trouble for the join operator, which handles two data streams at once. Matching the records of gigantic data streams is clearly overwhelming; in large data warehousing applications, this may mean joining trillions of records. Multi-processor or distributed processing is the solution to this problem and significantly improves the response time. In a multi-processor or distributed setting, the performance of the join operation can be improved by using partition-wise joins [14]. The input data is partitioned among a number of machines such that processing at the parallel machines

can be carried out independently. For parallel evaluation of the join operator, the two datasets are partitioned by applying the same hash function to both, so that each machine handles a subset of the keys which can be joined independently. The partitioning of the datasets is performed on the basis of the join key so that a machine gets all the tuples with the same join key from both datasets. Thus partitioning the input datasets scatters them across a number of machines where a partition-wise join is carried out (Figure 4). This partition-wise join is key to achieving scalability for massive join operations as it reduces the response time. The degree of parallelism of the partition-wise join is limited by the number of partition-wise joins that can be executed concurrently: the greater the number of concurrent partition-wise joins, the greater the degree of parallelism. If the number of parallel partition-wise joins is 8, the degree of parallelism is 8.

Figure 4: Partitioning R and S on join column R.r=S.s in three partitions using hash(k)=k%3

Sort-merge and hash joins are the natural choices for the join operation in distributed environments since both of these join techniques can operate independently on subsets of the join keys. As each partition contains the same join keys from both datasets, employing sort-merge or hash join techniques exploits the parallel partitioning and hence provides scalability and divisibility. New partitions can be added for processing without affecting the ongoing processing. In comparison, however, the performance of the hash join is better than that of the sort-merge join, since hash joins have linear cost as long as a minimum amount of memory is available [22]. For the rest of our discussion and for the implementation of our algorithm, we will consider only the hash join algorithm.
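As a toy illustration of the co-partitioning shown in Figure 4 (a sketch of our own, with made-up keys), the snippet below assigns the tuples of R and S to three partitions using hash(k) = k % 3:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CoPartitionExample {
    // hash(k) = k % numPartitions, assuming non-negative integer join keys
    static int partitionOf(int joinKey, int numPartitions) {
        return joinKey % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        int[] rKeys = {1, 3, 4, 6, 7, 9};   // join-key column of R (made up)
        int[] sKeys = {3, 4, 5, 7, 9, 9};   // join-key column of S (made up)

        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (int k : rKeys)
            partitions.computeIfAbsent(partitionOf(k, numPartitions), p -> new ArrayList<>()).add("R:" + k);
        for (int k : sKeys)
            partitions.computeIfAbsent(partitionOf(k, numPartitions), p -> new ArrayList<>()).add("S:" + k);

        // Tuples of R and S with the same join key land in the same partition,
        // so each partition can be joined independently on a separate machine.
        for (Map.Entry<Integer, List<String>> e : partitions.entrySet())
            System.out.println("partition " + e.getKey() + " -> " + e.getValue());
    }
}
```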

2.4 Hash Join algorithms

Several variations of the hash join algorithm exist, such as the classical hash join, the grace hash join, and the hybrid hash join, with minor differences. In essence, all of them build a hash table on the keys of the inner relation. The keys of the outer relation are then iterated over and matched against the hash table entries. The matching tuples are written to an output table. The different flavors of the hash join algorithm differ in the granularity of the data used for the join operation. We briefly discuss all of these join algorithms for a single-processor machine and then elaborate only the grace join algorithm for the distributed setting.

Join algorithms on a single processor machine

Let us consider that relations R and S have to be joined on the join column R.r = S.s. We assume S to be the smaller, inner relation and R to be the outer relation. We also assume an inner join between the relations.

1. Classical hash join

This simplest and most basic hash join consists of the following steps:

1) For each tuple t_s of the inner build relation S:
   a. Add it to an in-memory hash table on the basis of a hash function h applied to the key, i.e. h(key).
   b. If no more tuples can be added to the hash table because the memory of the system is exhausted:
      i. For each tuple t_r of the outer probe relation R, apply the same hash function h(key) to the key. Use the result as an index into the in-memory hash table built in step a to find a match.
      ii. If a match is found, a joined record is produced and the result is output.
      iii. Reset the hash table.
2) In a final scan of the relation R, the resulting join tuples are written to the output.
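The build-and-probe idea behind all of these variants is easiest to see when the build relation fits entirely in memory. The following is a minimal single-pass sketch (not from the thesis; relation contents and names are illustrative) that builds a hash table on S and probes it with R:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMemoryHashJoin {
    // Joins R and S on their integer keys, assuming S (the build relation) fits in memory.
    // Tuples are modelled as int[]{joinKey, payload}.
    static List<String> hashJoin(List<int[]> r, List<int[]> s) {
        // Build phase: hash every tuple of the inner relation S on its join key.
        Map<Integer, List<int[]>> hashTable = new HashMap<>();
        for (int[] tS : s)
            hashTable.computeIfAbsent(tS[0], k -> new ArrayList<>()).add(tS);

        // Probe phase: look each tuple of the outer relation R up in the hash table.
        List<String> joined = new ArrayList<>();
        for (int[] tR : r) {
            List<int[]> matches = hashTable.get(tR[0]);
            if (matches == null) continue;          // no matching join key: tuple is discarded
            for (int[] tS : matches)
                joined.add("(" + tR[0] + ", " + tR[1] + ", " + tS[1] + ")");
        }
        return joined;
    }

    public static void main(String[] args) {
        List<int[]> r = Arrays.asList(new int[]{1, 10}, new int[]{2, 20}, new int[]{2, 21});
        List<int[]> s = Arrays.asList(new int[]{2, 200}, new int[]{3, 300});
        System.out.println(hashJoin(r, s));   // joined (key, r-payload, s-payload) tuples
    }
}
```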

2. Grace hash join

A refinement of the classical hash join is the grace hash join algorithm. The grace hash join partitions the relations and carries out the above hash join technique for each partition rather than for the whole relation (Figure 5). This reduces the memory requirements for keeping the in-memory hash table.

1) R and S are divided into n partitions by applying a hash function h1 to the join column of each tuple t_r and t_s of relations R and S respectively. Both partitioned relations are written out to disk.
2) For the smaller relation S, a partition is loaded and an in-memory hash table is built, using a hash function h2 on the join column of each tuple t_s, in the build phase. The hash function h2 divides the partition into a number of buckets so that matching against the probing tuples becomes efficient.
3) During the probe phase, the corresponding partition of the relation R is read. The same hash function h2 is applied to the join attribute of each tuple t_r, which is matched against the entries in the in-memory hash table.
4) If a match is found, a joined record is produced and written to disk. Otherwise the tuple t_r is discarded.

Figure 5: Grace Join Algorithm (source [44])

3. Hybrid Hash Join

A minor refinement of the grace join algorithm is to keep one partition in memory, instead of writing it out to disk, and then use it for joining. This is a hybrid of the classical and grace join algorithms. The hybrid hash join follows these steps:

1) When partitioning the relation S, all partitions except the first one are written to disk. An in-memory hash table is built for this first partition.
2) When partitioning the relation R, all partitions except the first one are written to disk. The tuples of this first partition are used to probe the in-memory hash table of step 1 for matches. If a match is found, the joined record is written to disk.
3) After the first partition is exhausted, the procedure of the grace join algorithm is carried out for the remaining partitions.

Since the first partitions of both relations are never written to disk and are processed on the fly, this avoids the cost of reading these partitions back from disk into memory.

Hash join algorithm for parallel implementation

Because of their nature, the grace and hybrid hash join algorithms can easily be parallelized. The difference between the single-processor and multi-processor/parallel variants of these algorithms is that in the parallel variant the partitions are processed in parallel by multiple processors (Figure 6). Below are the steps for the grace hash join algorithm on a multi-processor system:

1) The input relation R is horizontally divided into n partitions such that each partition carries approximately |R|/n tuples. A hash function h1 is applied to the distribution key. Here we make the join key the distribution key so that tuples with the same join key are propagated to the same partition. The range of this hash function is from 0 to n-1 so that keys can be directed to one of the n nodes. The n partitions of R formed as a result of the hash distribution are written to disk.
2) A similar process is carried out for relation S. It is divided into n partitions, each partition carrying about |S|/n tuples, by applying the same hash function h1. This ensures that partition x of the relation S contains the same join keys as partition x of the relation R. The partitions of S are also written to disk.
3) Each processor reads, in parallel, a partition of relation S from disk. It creates an in-memory hash table for the partition using a hash function h2.

4) The corresponding partition of relation R is also read in parallel from disk by each processor. For each tuple in this partition, the processor probes the in-memory hash table for a match. For each matching tuple, a joined record is output to disk.

Since all n partitions of a relation are completely independent of each other, they can be processed in parallel. Each processor handles the corresponding partitions from both relations and writes the joined records for the matching tuples. Thus parallelizing the join operation can improve performance by up to a factor of the number of PUs.

Figure 6: Joining on parallel machines

2.5 Skew and its impact on the join operation

In databases, it is common that certain attribute values occur more frequently than others [19], [21]. This is referred to as data skew. Skew in the input data can limit the effectiveness of parallelizing the join query [14]. As discussed earlier, parallelizing a join operation consists of the following steps:

1) Tuples are read from disk.
2) Selection and projection are carried out on the basis of the query.
3) Tuples are partitioned among the parallel sites.
4) The tuples of the partitions on each site are joined.

Skew can occur at any of these stages and is hence categorized as tuple placement skew, selectivity skew, redistribution skew, and join product skew for each of the above stages respectively [17]. The initial placement of tuples in partitions may vary, giving rise to tuple placement skew. Selectivity skew results from the fact that applying a selection predicate to different partitions may leave a varying number of selected tuples in each partition. Redistribution skew is caused by a varying number of tuples in the partitions after the redistribution scheme for partitioning has been applied. Join product skew is the result of differences in join selectivity at each node. For our implementation, we do not consider tuple placement skew since Map/Reduce creates file splits of almost even sizes. Selectivity skew is also ignored because, firstly, it does not have any considerable impact on performance and, secondly, we assume in our programs that no selection and projection predicates are applied. Join product skew cannot be avoided because it becomes evident only after the partitions of the two relations are joined. Redistribution skew is the most important type of skew that affects the load distribution among nodes. This skew is caused by the selection of an inappropriate redistribution strategy for partitioning. In further discussions of skew, we will be referring only to redistribution skew and will handle only this skew in our implementation. After the redistribution of tuples into partitions, the hash join algorithm is applied to the partitions of the two datasets at each node. Although the hash join algorithm is easily divisible and scalable, it is very sensitive to skew. Skew in the keys results in variance in the time taken by the processing nodes. If some keys appear very frequently in the input relation, an overly used key is still sent to only one processing node under hash partitioning. This results in an uneven distribution of the keys, since the partitions receiving overly used keys will contain too many tuples. As a result, the nodes processing these partitions take too much time to complete and hence become a performance bottleneck. The performance of the whole distributed system is adversely affected by these heavy-hitter nodes while other nodes remain underutilized. To take full benefit of the parallel distributed environment, it is therefore important that the redistribution strategy be selected in such a way that partitions are of comparable size and evenly distributed to avoid load imbalances. Current partitioning strategies are divided into two categories: hash partitioning and range partitioning [16]. The two partitioning strategies have different sensitivities to different degrees of skew in the input keys. In the following

discussion, we examine their sensitivities and conclude which partitioning strategy is most effective for skew handling and should be incorporated in our algorithm.

Hash Partitioning and its skew sensitivity

As discussed earlier, during the redistribution phase a partitioning function is applied to the keys of the input datasets to distribute the workload among a number of nodes for parallel join computation. Hash partitioning is the most commonly used partitioning strategy. It distributes tuples to PUs on the basis of the hash value of the redistribution key:

PU_no = h(k) mod N_pu

Here, k belongs to the domain of the redistribution key attribute and N_pu is the number of PUs in the system. PU_no determines the PU to which a tuple with key k should be forwarded. However, it is hash partitioning that may leave the redistribution phase with skewed partitions. With hash partitioning, the number of tuples that hash to a given range of values cannot be accurately determined in advance. Whenever relations are distributed among the parallel processing nodes on the basis of hashing, a key that is skewed will be directed to one and only one processing node. Selection of a good hash function is not a solution to the problem of data skew. Even a perfect hash function will map every join attribute with the same value to one partition, and hence a partition that receives all of these overly used keys will be overloaded. Let us consider an example to understand the situation that arises from skewed partitions generated by hash partitioning. Two relations, patent and cite, are to be joined (Figure 7). The patent relation lists the patent ID and its grant year. The fields in the cite table are the citing patent and the cited patent. Data in the patent relation is unique, i.e. there is one row per patent. However, one patent may cite one or many other patents; therefore the cite relation may contain more than one entry for the same patent. Some patents are very popular and hence are cited by a large number of other patents. On the other hand, some patents are cited once or not at all by other patents. This presents a possibility for data skew. Now, for each patent referenced in the cite table, we may need to determine information about that patent. This can be done by joining on the patent ID attribute of the cite and patent relations. This presents the case of single skew, since only one of the relations contains the skewed data.
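Before continuing with the patent example, the following toy sketch (our own, with made-up numbers) illustrates the formula above: under hash partitioning, every occurrence of a heavily repeated join key is routed to the same PU, so that PU receives a disproportionate share of the tuples no matter which hash function is chosen.

```java
import java.util.Random;

public class HashSkewDemo {
    public static void main(String[] args) {
        int numPUs = 4;
        int hotKey = 12345;                     // an overly used join key (made up)
        long[] tuplesPerPU = new long[numPUs];

        Random rnd = new Random(42);
        for (int i = 0; i < 100_000; i++) {
            // Half of the tuples carry the hot key, the rest are drawn uniformly.
            int key = (i % 2 == 0) ? hotKey : rnd.nextInt(1_000_000);
            int pu = key % numPUs;              // PU_no = h(k) mod N_pu, with h the identity hash here
            tuplesPerPU[pu]++;
        }

        // Every copy of the hot key maps to the same PU, so one PU ends up with
        // at least half of all tuples while the others stay underloaded.
        for (int pu = 0; pu < numPUs; pu++)
            System.out.println("PU " + pu + ": " + tuplesPerPU[pu] + " tuples");
    }
}
```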

Figure 7: Example of data skew -- Patent and Cite Tables

For example, suppose a particular patent is cited by 5,000 other patents, so the cite table contains 5,000 entries for this patent. When partitioning the cite relation by applying a hash function to the patent ID attribute, one single node would receive at least those 5,000 tuples, irrespective of how many tuples are directed to the other nodes. Selection of a good hash function has negligible impact on the skew in the partitions. Although a perfect hash function may prevent two different join attribute values from being hashed into the same partition, the imbalance discussed above may still be present even with this ideal hash function, since hash partitioning directs identical keys to one partition. So applying a hash function such that only this patent is in one partition would still overload that node with 5,000 tuples while other nodes may not have sufficient load. There is no smarter hash function that can avoid the imbalance caused by key repetition, since it is the very nature of a hash function to direct equal keys to the same partition. The resulting uneven distribution nullifies the gains achievable from the parallel infrastructure. A more practical example of heavy data skew is the data received from sensors in a sensor network. These sensors continuously send the sensed values (which can be raw bytes, complex records, uncompressed images, etc.) to a monitoring station where the values are logged. Let us consider a log dataset L that logs, for each sensor, the sensor ID, a time stamp, and the sensed humidity of the place being monitored. A sensor dataset S stores information about the sensor ID, the name of the place being monitored, and the sensor manufacturer. A monitoring station may need to join the relations L and S. In practice, some sensors may be of very high frequency and send data for logging very frequently, while others may not

have very high frequency. Therefore, a join operation using hash partitioning would clearly overload the partitions handling the high-frequency sensors. This will eventually throttle the performance of the distributed system.

Range Partitioning and its skew sensitivity

As we have seen, hash functions used in the redistribution phase may result in imbalanced partitions in case of heavy skew in the input data. A good redistribution strategy should distribute the overly used keys to more than one partition. However, the overly used keys must first be determined, and then this information can be used for deciding the partition boundaries. Two strategies for partitioning datasets on the basis of their key distribution are simple range partitioning and virtual processor partitioning.

1. Simple Range-based Partitioner

As opposed to the hash partitioner, which at best allocates a single attribute value to a partition, a range partitioner may allocate to one partition only a sub-range of the tuples carrying a single join attribute value. In simple range partitioning, the number of partitions is equal to the number of PUs. Since one partition is handled by one PU, allocating only a sub-range of a single value to one partition reduces the burden on that PU in the case of heavy skew. A split vector determines the boundaries for the distribution of values among partitions. The entries of the split vector need not divide the value range into equally spaced blocks. This has, in fact, a positive impact, since these entries may be chosen in such a way as to equalize the number of tuples mapped to each partition. Given p PUs, the split vector contains p-1 entries {e_1, e_2, e_3, ..., e_(p-1)}. This split vector determines the ranges for range partitioning. From this split vector, each PU is assigned a lower bound and an upper bound for the range partitioning (except the first PU, which does not have a lower bound, and the last PU, which does not have an upper bound). All the tuples whose join key attribute falls in a particular range are sent to the PU associated with that range, i.e. keys <= e_1 are routed to processor 1, keys with e_1 < key <= e_2 are directed to processor 2, and so on, and keys > e_(p-1) find their way to processor p. How is this split vector selected? A good split vector can be selected by sampling the input relations so that an estimate of the distribution of join attribute values in the data can be obtained. Sorting the input dataset R with cardinality |R| and then
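As a minimal sketch of the routing rule just described (our own illustration; the splitter values are made up and integer join keys are assumed), a PU can be chosen by binary-searching the sorted split vector:

```java
import java.util.Arrays;

public class RangePartitionerSketch {
    // Split vector with p-1 entries for p PUs: keys <= splitters[0] go to PU 0,
    // splitters[0] < key <= splitters[1] go to PU 1, ..., keys > splitters[p-2] go to PU p-1.
    private final int[] splitters;

    RangePartitionerSketch(int[] splitters) {
        this.splitters = splitters.clone();
        Arrays.sort(this.splitters);
    }

    int partitionOf(int joinKey) {
        int pos = Arrays.binarySearch(splitters, joinKey);
        // binarySearch returns the index if the key is found, or (-(insertion point) - 1) otherwise.
        return pos >= 0 ? pos : -pos - 1;
    }

    public static void main(String[] args) {
        // A made-up split vector for 4 PUs: ranges (-inf,10], (10,100], (100,5000], (5000,+inf).
        RangePartitionerSketch rp = new RangePartitionerSketch(new int[]{10, 100, 5000});
        for (int key : new int[]{3, 10, 57, 4999, 123456})
            System.out.println("key " + key + " -> PU " + rp.partitionOf(key));
    }
}
```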


Big Data and Hadoop. Sreedhar C, Dr. D. Kavitha, K. Asha Rani Big Data and Hadoop Sreedhar C, Dr. D. Kavitha, K. Asha Rani Abstract Big data has become a buzzword in the recent years. Big data is used to describe a massive volume of both structured and unstructured

More information

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

More information

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex

More information

Processing of Hadoop using Highly Available NameNode

Processing of Hadoop using Highly Available NameNode Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

MAPREDUCE Programming Model

MAPREDUCE Programming Model CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Hadoop Cluster Applications

Hadoop Cluster Applications Hadoop Overview Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday

More information

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2 Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2 1 PDA College of Engineering, Gulbarga, Karnataka, India rlrooparl@gmail.com 2 PDA College of Engineering, Gulbarga, Karnataka,

More information

Data-intensive computing systems

Data-intensive computing systems Data-intensive computing systems Hadoop Universtity of Verona Computer Science Department Damiano Carra Acknowledgements! Credits Part of the course material is based on slides provided by the following

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

More information

BBM467 Data Intensive ApplicaAons

BBM467 Data Intensive ApplicaAons Hace7epe Üniversitesi Bilgisayar Mühendisliği Bölümü BBM467 Data Intensive ApplicaAons Dr. Fuat Akal akal@hace7epe.edu.tr Problem How do you scale up applicaaons? Run jobs processing 100 s of terabytes

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

MapReduce Job Processing

MapReduce Job Processing April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

The Hadoop Framework

The Hadoop Framework The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Map Reduce / Hadoop / HDFS

Map Reduce / Hadoop / HDFS Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview

More information

A Comparison of Join Algorithms for Log Processing in MapReduce

A Comparison of Join Algorithms for Log Processing in MapReduce A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel Computer Sciences Department University of Wisconsin-Madison {sblanas,jignesh}@cs.wisc.edu Vuk Ercegovac,

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Hadoop. History and Introduction. Explained By Vaibhav Agarwal Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow

More information

HADOOP PERFORMANCE TUNING

HADOOP PERFORMANCE TUNING PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

Google Bing Daytona Microsoft Research

Google Bing Daytona Microsoft Research Google Bing Daytona Microsoft Research Raise your hand Great, you can help answer questions ;-) Sit with these people during lunch... An increased number and variety of data sources that generate large

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

low-level storage structures e.g. partitions underpinning the warehouse logical table structures DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: low-level storage structures e.g. partitions underpinning the warehouse logical table structures low-level structures

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

Big Application Execution on Cloud using Hadoop Distributed File System

Big Application Execution on Cloud using Hadoop Distributed File System Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Detection of Distributed Denial of Service Attack with Hadoop on Live Network Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION Speed- Up Extension To Hadoop System- A Survey Of HDFS Data Placement Sayali Ashok Shivarkar, Prof.Deepali Gatade Computer Network, Sinhgad College of Engineering, Pune, India 1sayalishivarkar20@gmail.com

More information

PERFORMANCE ENHANCEMENT OF BIG DATA PROCESSING IN HADOOP MAP/REDUCE

PERFORMANCE ENHANCEMENT OF BIG DATA PROCESSING IN HADOOP MAP/REDUCE PERFORMANCE ENHANCEMENT OF BIG DATA PROCESSING IN HADOOP MAP/REDUCE A report submitted in partial fulfillment of the requirements for the award of the degree of MASTER OF TECHNOLOGY in COMPUTER SCIENCE

More information

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,

More information

Apache Hadoop FileSystem and its Usage in Facebook

Apache Hadoop FileSystem and its Usage in Facebook Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs

More information

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang, Ali Dasdan Yahoo! Sunnyvale, CA, USA {hcyang,dasdan}@yahoo-inc.com Ruey-Lung Hsiao, D. Stott Parker Computer Science

More information

HadoopRDF : A Scalable RDF Data Analysis System

HadoopRDF : A Scalable RDF Data Analysis System HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information