SARAH: Statistical Analysis for Resource Allocation in Hadoop

Appeared in: 3rd IEEE Conference on Big Data Science and Engineering (BSDE14), September 2014.


Bruce Martin
Cloudera, Inc.
Palo Alto, California, USA

Abstract: Improving the performance of big data applications requires understanding the size and distribution of the input and intermediate data sets. Obtaining this understanding and then translating it into resource settings is challenging. SARAH provides a set of tools that analyze input and intermediate data sets and recommend configuration settings and performance optimizations. Statistics generated by SARAH are persistently stored, incrementally updated and usable across the several processing frameworks available in Apache Hadoop. In this paper we present the SARAH tool set, describe several Hadoop use cases for utilizing statistics and illustrate the effectiveness of utilizing statistics for balancing the reduce workload of Map-Reduce jobs on web server log file data.

Keywords: big data; statistical analysis; Hadoop; Map-Reduce; performance tuning

I. INTRODUCTION

The performance of big data applications is typically a function of the size and distribution of input, intermediate and output data sets. The Apache Hadoop platform [2] offers developers, system administrators, data scientists and analysts (throughout the paper we refer to the developer, administrator, data scientist and analyst collectively as the user) dozens of configuration parameters to specify the cluster resources needed by a big data application and to influence how the big data application executes. While taking advantage of such flexibility can result in a finely tuned system, the challenge of effectively setting those parameters is great. It requires understanding the size and distribution of input and intermediate data sets and the algorithms of the big data application. It also requires understanding the operation and configuration of Hadoop processing frameworks, the available resources of a given cluster and the overall workload of the cluster.

Consider the problem in the Map-Reduce [8] framework of determining the number of reducers that an application needs and balancing the load of intermediate data across those reducers [10]. In the Map-Reduce and Hive [3] frameworks, the user sets a property to specify the number of reducers. In the Pig [1] and Spark [11] frameworks, the user specifies an optional parameter to commands that typically run in reducers. To come up with a meaningful value, the user needs to understand the size and distribution of records in the intermediate data sets and, given that understanding, have a way to influence the assignment of records to reducers. At best, the informed user understands the intermediate data and can carefully calculate the number of reducers. At worst, the uninformed user accepts system defaults or makes a random guess.

Advanced relational database systems gather and utilize statistics about tables in query optimization [16]. Such systems are closed systems; they offer a single relational model and a single query language, and the storage format is defined by the system. Hadoop, on the other hand, supports unlimited storage formats defined by the user, multiple models and multiple processing frameworks with varying degrees of metadata. The Pig, Hive and Impala frameworks utilize some statistics about a job's data, but the statistics generated in one framework are not usable in the others [14], [15], [18].

Computing statistics for big data sets is expensive.
It makes little sense to spend more time computing statistics than the statistics save through more efficient use of resources. On the other hand, if the statistics are persistently saved, used across subsequent executions of applications and available in multiple processing frameworks, then this cost can be amortized over time. Furthermore, if updates to analyzed data sets only require an incremental statistical analysis cost, then the cost of generating statistics can be amortized over a long time. We view persistence, cross-framework access and incremental update as requirements for any big data environment that gathers and utilizes big data statistics.

SARAH (Statistical Analysis for Resource Allocation in Hadoop) is a test bed we have built to experiment with the generation of statistics about big data and the use of those statistics at runtime. SARAH-generated statistics are used to enhance performance and to help the user set resource properties. Statistics generated by SARAH are persistently saved, incrementally updatable and usable across processing frameworks.

Concretely, SARAH is a set of tools run by users on their data sets for their big data applications. SARAH contains a tool set for each supported processing framework to generate, store and update statistics. The statistics generated by the tool in one framework can be used at runtime in other frameworks. Having systems automatically generate, update and use statistics without user involvement is appealing. SARAH takes a more pragmatic approach, requiring user involvement, but in a high-level, productive fashion.

User input is needed to determine when statistics should be gathered and incrementally updated, and to map those statistics across platforms.

II. HADOOP USE CASES FOR STATISTICAL ANALYSIS

The Hadoop Map-Reduce, Pig, Hive, Impala and Spark frameworks have many configurations that allow or require users to specify resources. Our goal is that SARAH-generated statistics support these and other use cases.

A. Smart Input Split

Hadoop processing frameworks divide the input data set into subsets of records for parallel processing. The default behavior is to split each file into 64 MB blocks and assign each block to a map task. This approach is often adequate because the amount of work each parallel map task does is not sensitive to the distribution of the data, as it is with reducers. Each map task is given 64 MB of data. There is overhead in creating and initializing a task, most notably the overhead of creating and initializing a Java Virtual Machine. If a task has too little work, this overhead dominates and a larger block size is appropriate. Statistical analysis of the cost of applying the map function to the input data can estimate an effective value for the block size. An input data set that consists of many small files, that is, files that are smaller than 64 MB, results in too many small map tasks because the default behavior in this case is to assign one map task to each file. Statistical analysis of the input data set can recognize this.

B. Appropriate Number of Balanced Reducers

Users can specify the number of reducers to use in a Map-Reduce job, including those executed by the Hive framework. Similarly, users of the Pig and Spark frameworks can set an additional parameter in the commands that are usually executed in parallel reducers. Statistical analysis of the intermediate data can estimate the number of reducers. Such analysis needs to take into account the size of the intermediate data and the overhead of creating and initializing a task. Estimating the number of reducers using only the size of the intermediate data is insufficient. Intermediate data is susceptible to data skew. Statistical analysis of the distribution of the intermediate data can break the intermediate data into similarly sized partitions. We describe this use case with SARAH in more detail in Section IV.
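As a point of reference for this use case, the sketch below shows the standard knob for setting the reducer count in a plain Map-Reduce job written in Java; the value used is a placeholder, not a SARAH recommendation. Hive exposes an equivalent session property, Pig a PARALLEL clause on statements such as GROUP and JOIN, and Spark an optional partition-count argument to operations such as reduceByKey.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "weblog analysis");

            // The reducer count is fixed per job before it runs; without
            // statistics on the intermediate data the user must guess it.
            job.setNumReduceTasks(5); // placeholder value

            // ... set mapper, reducer, input and output paths as usual ...
        }
    }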
C. Skewed Joins

Joining two large data sets can result in unbalanced parallel reducers if the joined data is skewed [7]. Statistical analysis of both data sets can estimate the number of reducers. Furthermore, by analyzing the distribution of the joined data, multiple reducers can be assigned to popular join keys. This approach requires replicating some of the records across reducers. Pig does this kind of analysis for skewed joins [14]; however, the analysis does not persist, cannot be incrementally updated and is not available across processing frameworks.

D. Combiner Benefit

In the Hadoop processing frameworks, a combiner is a function that is applied to subsets of intermediate data. For large intermediate data, a combiner almost always improves performance and reduces network utilization. When a reduce function is not commutative and associative, the reduce function cannot simply be reused as a combiner. Instead, the user must program a separate function. Statistical analysis of the input and intermediate data can advise on the benefits of coding an additional combiner function.

In the Pig framework, combiners are automatically determined by the execution plan. The Pig framework does not apply combiners when the script invokes a user-defined function, because it treats the function as a black box. Pig does apply combiners, however, if the user code is declared as algebraic and provided as initial, intermediate and final functions [13]. Again, statistical analysis of the input and intermediate data can advise on the benefits of this additional coding.

E. Task Memory Allocation

Hadoop processing frameworks define several properties that specify task memory requirements. These properties are defined prior to executing the job. The properties include a map task's heap size, the size of the map task's in-memory buffer for intermediate data and a reduce task's heap size. Statistical analysis of input and intermediate data can estimate values for these memory specifications. While not required by the Hadoop framework, some reducers buffer all of the values associated with a key. Statistical analysis of intermediate data can estimate an upper bound on the amount of memory such a reduce function requires.

F. Balanced Total Order Sort

The Map-Reduce framework sorts partitions by key. It does not, however, sort across all the partitions. The total order partitioner [6] ensures that the sorted keys in one partition are less than the sorted keys in the next partition, effectively producing a total sort of the data set. The user provides keys that divide the partitions and the partitioner builds the partitions at runtime. The partitions can be unbalanced, since they depend on the keys provided by the user. Statistical analysis of the intermediate data can estimate the distribution of the keys and calculate keys for balanced partitions.

G. Compressed Intermediate Data

Hadoop processing frameworks transmit intermediate data over the network. The user can specify whether this data should be compressed and the compression algorithm that should be used. If the amount of intermediate data is large and the overhead of compressing and decompressing the data is not too great, then compressing the data improves performance. Statistical analysis of intermediate data can estimate whether compression is worthwhile.
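For reference, the properties below control compression of the intermediate (map output) data in the Map-Reduce framework; whether turning them on pays off is exactly what the statistical analysis is meant to estimate. The choice of the Snappy codec here is only an assumption for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class IntermediateCompressionExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Compress map output before it is shuffled over the network.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compressed shuffle example");
            // ... configure mapper, reducer and paths as usual ...
        }
    }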

H. Parallel Data Transfer

Hadoop processing frameworks transmit intermediate data over the network in parallel. In the Map-Reduce framework, reducers pull sorted intermediate data from multiple mappers and merge the sorted data. A property controls how many streams are received and sorted in parallel. Statistical analysis of the intermediate data can estimate appropriate values for this property.

I. Estimating Cluster Workload

The previous use cases utilize statistical analysis of input, intermediate and output data sets and algorithms to optimize the performance of a single job. Data and algorithm statistics can also be used across jobs and over time. Job and data statistics are useful in expanding a cluster, that is, in determining additional hardware to deploy. The analysis can also be useful in determining service level agreements and scheduling policies.

III. THE SARAH TEST BED

SARAH is a test bed for generating and using cross-framework, persistent and incremental statistics for Hadoop. Some of the SARAH tools generate statistics; other tools produce artifacts from the generated statistics. Some of the artifacts are used at runtime for better resource utilization, others are used to configure a job, others are used when developing and testing software and still others are used for informational purposes.

SARAH tools are framework specific. For Hadoop's Map-Reduce framework, SARAH provides a set of generic Map-Reduce jobs to compute statistics and generate artifacts. For the Pig framework, SARAH provides a set of parameterized Pig scripts. For the Hive and Impala frameworks, SARAH provides a set of parameterized HiveQL scripts. For the Spark framework, SARAH provides a Scala API for computing statistics on intermediate RDDs.

Metadata differ between frameworks. In Hive and Impala, data sets are completely described as tables, and the table definitions are stored in the Hive metastore. In the Map-Reduce, Pig and Spark frameworks, metadata are embedded in programs and incomplete. Such differences necessitate separate tools for generating statistics. SARAH artifact-generating tools are also framework specific because execution costs differ between frameworks and many of the artifacts themselves are framework specific.

Since generating statistics on big data sets is costly, SARAH saves generated statistics persistently. Furthermore, SARAH tracks changes to input data sets, and users can request SARAH to incrementally update the generated statistics. All of the tools represent generated statistics in a common format. Statistics generated by one tool set can be used in another framework. In particular, artifact-generating tools from one framework can use statistics generated in another framework.

A. Input Data Sets

Hadoop processing frameworks typically operate on sets of files stored in HDFS. Hive and Impala equate tables with HDFS directories and support partitioning of tables as subdirectories. Other frameworks are more flexible, allowing the user to define an input data set as an arbitrary set of files. SARAH requires users to name and define data sets. Data sets are either a directory in the file system or an explicitly defined set of files. SARAH then bases incremental updates to generated statistics on the addition or deletion of files in the input data set. SARAH does not track changes within a file, since HDFS files are immutable.

SARAH calculates and stores basic statistics from input data sets, including the number of records in the data set, the size of the data set in bytes, the minimum, maximum and average record size in bytes and the distribution of record sizes.
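A minimal sketch of the kind of map-only pass such a tool can make over an input data set is shown below; it tallies only the record count and total bytes with Hadoop counters, whereas SARAH also derives the minimum, maximum, average and full size distribution. The class and counter names are illustrative, not SARAH's.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only pass over the input data set: every record is read once and
    // simple data-set-level statistics are accumulated in job counters.
    public class InputStatsMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            long bytes = record.getLength();
            context.getCounter("input-stats", "records").increment(1);
            context.getCounter("input-stats", "bytes").increment(bytes);
            // A record-size distribution could be kept by bucketing the size,
            // e.g. incrementing a counter named "size-bucket-" + (bytes / 128).
        }
    }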
B. Intermediate Data Sets and Functions

An intermediate data set results from applying a function to an input data set. A function is named and realized in different ways in different frameworks. A given named function can have multiple realizations. By naming different realizations of the function with the same name, the user indicates that they produce the same intermediate data set, given the same input data set. In the Map-Reduce framework, a map method in a Java mapper class is an example of a function; it produces an intermediate data set. In Hive and Impala, a simple query that operates on a single record is an example. In Pig, a simple script that operates on a single record is an example. In Spark, a Scala or Java function that is defined on a single record is an example.

Unlike input data sets, intermediate data sets are not necessarily realized as files in HDFS. An intermediate data set in the Map-Reduce framework is initially generated and partitioned at all the mappers and then transmitted over the network to the reducers. It is always a distributed data structure, never stored in a single place. In Spark, an intermediate data set may live only in the cluster-wide cache.

Advanced database systems, Hive [5] and Impala [18] generate so-called column statistics. Such systems are essentially generating statistics on intermediate data sets produced by simple column-selection functions.

An intermediate data set can represent a resource-intensive state of a big data application. It represents the result of applying a function to an input data set. Statistics about the intermediate data set are useful in different resource allocation contexts in different frameworks. For each function applied to an input data set, SARAH calculates and stores basic statistics from the resulting intermediate data set, including the average execution time of the function, the number of intermediate records, the size of the intermediate data set in bytes, the minimum, maximum and average intermediate record size in bytes and the distribution of intermediate records.
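To make the notion of a named function concrete, the mapper below is one possible Map-Reduce realization of a hypothetical months() function over weblog records, of the kind referred to in the discussion of Figure 1 below. The record format and field positions are assumptions made only for this sketch.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One realization of a hypothetical function named "months": it maps each
    // weblog record to its month, producing an intermediate data set of
    // (month, record) pairs whose size and distribution can then be analyzed.
    public class MonthsMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // Assumed Apache common log format: the timestamp is the field in
            // square brackets, e.g. [10/Oct/2014:13:55:36 -0700].
            String line = record.toString();
            int open = line.indexOf('[');
            int close = line.indexOf(']', open + 1);
            if (open < 0 || close < 0) {
                return; // skip malformed records in this sketch
            }
            String timestamp = line.substring(open + 1, close);
            String month = timestamp.split("/")[1]; // e.g. "Oct"
            context.write(new Text(month), record);
        }
    }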

C. Artifacts

Once statistics are generated for input and intermediate data sets, SARAH tools can generate useful artifacts from those statistics. Some of the artifacts are used at runtime for better resource utilization, others are used to configure a job, others are used when developing and testing software and still others are used for informational purposes.

An example of a runtime artifact in the Map-Reduce framework is an interval file that can be used with the Total Order Partitioner [6] to balance or sort the load across reducers. The number of reducers in a Map-Reduce job, the value of a PARALLEL parameter in a Pig statement and an estimate of the amount of memory that a task needs to process a data set are all examples of configuration artifacts.

SARAH computes random samples of input and intermediate data sets for generating statistics. Users specify a sample percentage between 0 and 100%. The generated samples are saved artifacts and users can use them for development, testing and analysis. Besides random samples, SARAH can create other kinds of samples of input and intermediate data, including samples with outliers and samples of data sets to be joined that reflect the resulting distribution of joined data. Such samples are useful in development, testing and analysis of big data applications. The join samples are useful in understanding and addressing skewed joins.

SARAH also generates artifacts that help a user understand the input and intermediate data sets. SARAH can generate visualizations of data distributions. Figure 1 visualizes the distribution obtained by applying the months() function to a weblog input data set. SARAH generates the distribution data as well as an R [17] script to create the graphic.

IV. USING SARAH TO BALANCE LOAD ACROSS REDUCERS

We now illustrate the use of SARAH to estimate an appropriate number of reducers in the Map-Reduce framework and to balance the load across those reducers. While this use case is also relevant to Hive, Pig and Spark, we limit the description to generating statistics and artifacts for Hadoop Map-Reduce jobs.

To begin, the user issues the following command:

    sarah statistics [sample-%] input-data-set function1 .. functionN

The statistics command generates or updates statistics for an input data set and the n intermediate data sets defined by function1 .. functionN. The user can specify an optional sample percentage. The first phase of the statistics command computes and saves the n+1 samples in a single map-only job. The mappers randomly select records in the input to generate the input sample. The mappers also apply each function to each record in the input data set and randomly select records in each intermediate data set for each intermediate sample. The first, sample-generating phase is efficient: it requires only a single, parallel processing of the input data. Each record is considered once, in parallel, and the n functions are applied to it.

The second phase of the statistics command generates statistics from the samples. For samples that are small enough to fit in memory, SARAH generates all statistics in a single, efficient map-only job; each mapper processes a single sample. For larger samples, SARAH generates multiple Map-Reduce jobs to compute the statistics for all of the data sets.

To obtain SARAH's recommendations for the number of reducers and artifacts for balancing the load across reducers, the user issues the following command:

    sarah balanced-reducers [split-values] input-data-set function1 .. functionN

For each function, SARAH estimates a number of reducers by simply dividing the estimated number of records in the associated intermediate data set by the ideal partition size; that is, it divides the estimate by the number of records ideally processed by each reducer. The latter is a constant.
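A sketch of this arithmetic is shown below, using the figure of roughly 6 million intermediate records from the experiment in Section IV.A; the ideal partition size of 1.2 million records per reducer is an assumed value for illustration, not a constant stated by SARAH.

    public class ReducerEstimate {
        public static void main(String[] args) {
            // Estimated number of records in the intermediate data set,
            // taken from the generated statistics for one function.
            long estimatedIntermediateRecords = 6_000_000L;

            // Ideal partition size: records each reducer should ideally handle.
            // This is a constant; 1,200,000 is an assumed value for illustration.
            long idealRecordsPerReducer = 1_200_000L;

            long reducers = (long) Math.ceil(
                    (double) estimatedIntermediateRecords / idealRecordsPerReducer);

            System.out.println("Recommended reducers: " + reducers); // prints 5
        }
    }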
For each function, SARAH also generates an interval file. The interval file is an artifact that serves as input to a get_partition function used by the Map-Reduce framework to assign records to reducers. By default, SARAH generates an interval file that can be provided to the TotalOrderPartitioner in the Map-Reduce framework. This partitioner does not split the set of values that are associated with a key. This limits the ability to balance the load across reducers.
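For context, this is roughly how an interval (partition) file is wired into a Map-Reduce job with the stock TotalOrderPartitioner; the file path is a placeholder, and SARAH's extended partitioner described below would be substituted for the stock class when values are split.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class IntervalFileExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "balanced reducers");

            job.setMapOutputKeyClass(Text.class);
            job.setNumReduceTasks(5); // the recommended reducer count

            // Point the partitioner at the interval (partition) file artifact.
            // The path below is a placeholder.
            TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                    new Path("/sarah/artifacts/weblog/months.intervals"));
            job.setPartitionerClass(TotalOrderPartitioner.class);

            // ... set mapper, reducer, input and output paths as usual ...
        }
    }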

If the number of values associated with a single key is greater than the ideal partition size, the partitioning is less than optimal. If the user sets the optional split-values parameter to true, the intermediate data set is exactly balanced over all of the reducers. For keys with a split set of values, the interval file also contains the percentage of records that are included in each partition. SARAH provides an extension to the TotalOrderPartitioner that uses the percentages to split the set of values associated with a single key. The extended partitioner violates the rule that all values associated with a single key are provided to a single reducer. If the user wishes to do this, the application must accommodate the non-standard partitioning.

A. Measuring SARAH Effectiveness in Balancing Reducers

Measuring the value of SARAH requires comparing the performance and resource utilization of big data applications that have been configured using SARAH artifacts to those that have been configured manually. We compare the utilization of reducers on a Map-Reduce job using SARAH artifacts to the same job configured manually. We consider three cases of manual configuration for the balanced reducer use case:

- The naïve user understands neither the Map-Reduce job being executed nor the input and intermediate data sets. The naïve user accepts all of the defaults for running the job. Since the default number of reducers is 1, there is no need to partition the intermediate data set.
- The rule-of-thumb user learned some simple rule for allocating reducers. For our purposes the rule is based on the size of the input data set; Hive and Pig provide this as a default [9], [12]. The job configured by the rule-of-thumb user runs with the default HashPartitioner, which simply assigns records to reducers by hashing the key.
- The educated-guess user attempts to understand the intermediate data, but the understanding is incomplete. In particular, the user measures the size of the intermediate data set by running the mapper but fails to measure its distribution. Again, the job configured by the educated-guess user runs with the default HashPartitioner.

We ran a Map-Reduce job that processes an 18-million-record weblog data set generated by an Apache web server. The job analyzes the weblog data to understand the differences by month of users who access the web server in the evening. The intermediate data consisted of 6 million records.

The naïve user ran the job with a single reducer. The single reducer processed all 6 million records. This approach obviously does not scale: the intermediate data set grows as a function of the input data set, eventually overloading the single reducer.

The rule-of-thumb user ran the job with a number of reducers derived from the input size, namely 12 reducers. Figure 2 illustrates the result of running the job configured by the rule-of-thumb user. The partitioning of data to reducers was not very good. Three reducers had no records to process, and the workload of the remaining reducers was not balanced. Furthermore, the balanced reducers were underutilized, processing fewer than 600,000 records each.

We used SARAH to generate a sample of the intermediate data and used the tool's recommendation of 5 reducers and the interval file generated by the tool. We did not split the set of values, so an exact partitioning is not possible. Figure 3 illustrates the distribution of records to reducers that SARAH generated. The first four reducers are fairly balanced.
The fifth reducer handled a single key with 1.7 million records.

Since we did not choose to break up sets of records, the fifth reducer had more work than the others.

Finally, the educated-guess user ran the job with 5 reducers but without considering the distribution of the data. Figure 4 illustrates the distribution of records to reducers as a result of running the job configured by the educated-guess user. The number of reducers is appropriate, but the skew in the intermediate data is not handled. Notice that in this case the third, rather than the fifth, reducer processed the 1.7 million records associated with the same key.

V. CONCLUSIONS AND FUTURE WORK

The SARAH test bed produces statistics for large input and intermediate data sets. The statistics are persistently saved, incrementally updated and usable across frameworks. From these statistics, SARAH generates artifacts that are useful for allocating resources in the desired framework. We illustrated this with the problem of allocating reducers and balancing the workload across those reducers. The analogous problem exists in the Pig, Hive and Spark frameworks.

We continue to develop the SARAH test bed for the Map-Reduce, Hive, Impala, Pig and Spark frameworks. We continue experimenting with different algorithms and use cases. We continue to expand the SARAH test cases to better experiment with the algorithms. To date, we have concentrated on using SARAH to estimate resources for a single computation. We have not yet addressed any of the use cases that use collected data set statistics over time to estimate cluster-wide sizing and scheduling. Finally, in building SARAH we realized the need to define a runtime API and service that makes the statistics and generated artifacts available to Hadoop applications and frameworks.

VI. REFERENCES

[1] A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan and U. Srivastava, "Building a High Level Dataflow System on top of Map-Reduce: The Pig Experience," Proceedings of the VLDB Endowment, vol. 2, no. 2.
[2] Apache Hadoop Project.
[3] A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu and R. Murthy, "Hive - A Petabyte Scale Data Warehouse Using Hadoop," 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), 2010.
[4] Cloudera Impala Project.
[5] Column Statistics in Apache Hive.
[6] D. Miner and A. Shook, MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems, O'Reilly Media, December.
[7] D. J. DeWitt, J. F. Naughton, D. A. Schneider and S. Seshadri, "Practical Skew Handling in Parallel Joins," Proceedings of the 18th International Conference on Very Large Data Bases, pp. 27-40, August 23-27, 1992.
[8] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, 2008.
[9] E. Capriolo, D. Wampler and J. Rutherglen, Programming Hive, O'Reilly Media, September.
[10] L. Kolb, A. Thor and E. Rahm, "Load Balancing for MapReduce-based Entity Resolution," Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, April 1-5, 2012.
[11] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker and I. Stoica, "Spark: Cluster Computing with Working Sets," Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10-10, June 22-25, 2010, Boston, MA.
[12] Pig Reducer Estimation.
[13] Pig User-defined Functions.
[14] PigSkewedJoinSpec.
[15] Statistics in Hive.
[16] S. Chakkappen, T. Cruanes, B. Dageville, L. Jiang, U. Shaft, H. Su and M. Zait, "Efficient and Scalable Statistics Gathering for Large Databases in Oracle 11g," Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, June 9-12, 2008, Vancouver, Canada.
[17] The R Project for Statistical Computing.
[18] Tuning Impala for Performance, docs/cdh5/latest/impala/installing-and-using-Impala/ciiu_performance.html.


More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Native Connectivity to Big Data Sources in MSTR 10

Native Connectivity to Big Data Sources in MSTR 10 Native Connectivity to Big Data Sources in MSTR 10 Bring All Relevant Data to Decision Makers Support for More Big Data Sources Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single

More information

Hadoop: The Definitive Guide

Hadoop: The Definitive Guide FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction

More information

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Buzzwords Berlin - 2015 Big data analytics / machine

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

More information

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce The Hadoop

More information