Hadoop Performance Diagnosis By Post-execution Log Analysis
cs598 Project Proposal
Zhijin Li, Chris Cai, Haitao Liu
University of Illinois at Urbana-Champaign

Abstract

Vaidya is a rule-based post-execution diagnosis tool for Hadoop MapReduce jobs. Its purpose is to provide feedback to the Hadoop administrator after analyzing the job logs and configuration files of MapReduce tasks. Vaidya analyzes the performance of Hadoop MapReduce jobs by testing job execution statistics and comparing them to predefined thresholds set by the user. Each Vaidya test targets a specific performance issue. Depending on whether a test passes its check, Vaidya gives a prescription as feedback, with instructions for the system administrator to follow. Vaidya can identify potential Hadoop system inefficiencies and produce a feedback report in a short period of time. In this project we extend Vaidya by introducing new sets of test rules that aim to provide a more complete diagnosis of Hadoop systems. We also improve Vaidya by providing more informative feedback for resolving Hadoop system inefficiencies.

1 Introduction

Hadoop [1] is a data-intensive computing framework inspired by Google's MapReduce [7] and Google File System [8] papers. As one component of Hadoop, the MapReduce distributed computation framework provides a comfortable environment for programmers to develop MapReduce applications for analyzing large-scale datasets, whose size can range from several terabytes to petabytes. Hadoop also provides its own distributed file system, HDFS [6]. The extensibility of Hadoop allows programmers to build different computation paradigms on top of it to serve different research and commercial needs. Nowadays, many companies including Yahoo, Facebook and Amazon use Hadoop to process business data. The scalability, efficiency and fault tolerance that Hadoop supports make it possible to perform computation on large amounts of data in a short time. However, the distributed nature of Hadoop and its large scale make it hard for administrators to identify potential data-processing bottlenecks and improve performance through appropriate optimization. For example, frequent re-execution of jobs can be due to failure of certain nodes, corruption of data on an old disk, or a link failure in the network among clusters. In some jobs, certain nodes responsible for performing reduce work might receive significantly more work than other reduce nodes; the reason could be either an incorrect job specification or an infrastructure problem in which the nodes are set up inappropriately. An informative prescription is essential after such performance inefficiencies occur, so that Hadoop users can identify which category the problem falls into and the corresponding corrective and
optimization actions can be performed more quickly.

Hadoop Vaidya [2] is a rule-based post-execution diagnosis tool for Hadoop MapReduce jobs. Vaidya analyzes the execution statistics of MapReduce jobs through job history and job configuration files. It runs predefined rule-based tests against these job execution statistics to diagnose performance problems. Each test targets a specific MapReduce job performance problem, and users can write their own customized tests to deal with specific performance issues. After running all tests, Vaidya summarizes the feedback according to the given rules and writes the evaluation results as an XML report. Each test rule in Vaidya is associated with a value called importance, a declarative value specifying the overall importance of the test. Users can choose a value among high, medium and low, depending on how much weight they place on the performance test. For each test rule, the user defines an evaluate function, which calculates the impact level specifying the degree of the problem the job has with respect to the condition being evaluated. The user also sets a SuccessThreshold, the threshold value Vaidya uses to decide whether the Hadoop system performance indicated by the job log passes this specific test. If the calculated impact level of a test is less than its SuccessThreshold, the test is declared passed. For each test that fails, Vaidya writes a Prescription to the XML report as targeted advice, written by the test case adviser, for the user to optimize the current Hadoop configuration.

2 Motivation

The motivation for our class project consists of several parts. First, inspired by Vaidya itself and other similar log analysis tools (see details in Section 3), we have found that Hadoop job diagnosis by post-execution log analysis is significantly helpful for users tuning MapReduce jobs. However, there are some insufficiencies in the existing tools, such as uncovered failure detection methods and incomplete running statistics. Those tools overlap yet are not complementary enough in terms of functionality. As such, we are motivated to improve and extend one of these tools, Vaidya. We believe this will benefit Hadoop users in industry as well as researchers in academia by providing a convenient and easy-to-use way of inspecting the MapReduce jobs they have run. Second, the current Vaidya release only comes with five default test rules, which cover a limited number of the potential inefficiencies that can occur during Hadoop MapReduce jobs. The MapReduce job and task statistics extracted by the Vaidya API are not fully utilized; there is more useful information we can retrieve from the statistics extracted from Hadoop logs to improve performance. For example, the Vaidya API extracts task running statistics for both map and reduce tasks, but only the statistics for map tasks are used, in one of the default tests. We can imagine that the same type of inefficiency can occur on both the map and reduce sides of the MapReduce process. We believe our extension to the Vaidya tool will benefit a wide range of Hadoop users. Users who are setting up a new Hadoop cluster can use our new rules to figure out the optimal configuration. Users of aged Hadoop clusters can identify potentially outdated hardware and replace old disks, network switches, etc. Third, the appropriate threshold is hard to find. In Vaidya, whether the Hadoop job log passes a test depends on the thresholds set for the specific test.
For example, in one of the default tests, if the map-side disk spill exceeds 30 percent of the total map output, the test is considered to have failed the check. Finding the right threshold requires testing with realistic MapReduce programs and datasets. We will explore different thresholds and look for values suitable for practical purposes.
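To make the threshold mechanism concrete, the snippet below sketches the pass/fail decision for the map-side spill example above. It is a minimal illustration in plain Java: the counter values, the helper name and the 0.30 threshold are assumptions taken from the running example, not Vaidya's actual implementation.

// Illustrative sketch of the impact-vs-threshold decision,
// using the map-side disk spill example from the text.
public final class SpillThresholdExample {

    // Impact level: fraction of map output bytes spilled to local disk.
    static double spillImpact(long spilledBytes, long mapOutputBytes) {
        if (mapOutputBytes == 0) {
            return 0.0; // nothing produced, so nothing could spill
        }
        return (double) spilledBytes / (double) mapOutputBytes;
    }

    public static void main(String[] args) {
        final double successThreshold = 0.30;                // 30 percent, as in the default test
        double impact = spillImpact(40L << 20, 100L << 20);  // hypothetical counter values
        // Convention from Section 1: an impact below the SuccessThreshold passes.
        if (impact < successThreshold) {
            System.out.println("Map-side spill test passed, impact = " + impact);
        } else {
            System.out.println("Map-side spill test failed, impact = " + impact);
        }
    }
}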
For our project we are going to extend Vaidya by adding tests that examine more critical Hadoop MapReduce performance issues, including disk spill, the map/reduce nodes ratio, intermediate data processing time, etc. We are also going to conduct experiments to compare different thresholds and look for suitable ones. Finally, if time permits, we will design a user-friendly interface for Vaidya to make the feedback for job logs more informative and lucid.

3 Related Work

After a distributed system has run for a certain period of time or some task executions have completed, all the configurations, runtime statistics and running results are usually saved in the form of logs. Execution logs of distributed system software are highly valuable information, as they can be used for post-execution analysis by examining and mining the logs. The analysis can benefit the system by suggesting configurations for better performance, or by detecting failures, errors and anomalies. Some tools for analyzing and reporting Hadoop job performance have been built. However, there are certain problems they haven't solved and insufficiencies they haven't covered. In this section, we survey the existing works and tools for post-execution log analysis, particularly on Hadoop, and, more importantly, relate them to Vaidya, the tool we work on, to better guide our working directions.

Mochi [10] is a log-analysis-based tool for Hadoop debugging. It produces visualizations with which users can reason about and debug performance issues. Mochi analyzes Hadoop's behavior in terms of space, time and volume, and extracts a model of data flow from the cluster nodes at the MapReduce-level abstraction. Mochi constructs views of the cluster from the execution logs of MapReduce tasks. It then correlates the execution of the task trackers and data nodes in time to determine data read/write operations on HDFS. Mochi provides three kinds of visualizations: Swimlanes for task progress in time and space, MIROS plots for data flows in space, and Realized Execution Path for volume-duration correlations. Visualization is very helpful for exposing the running statistics of data-intensive jobs, and we might consider using visualization as a friendly user interface in one of our Vaidya extensions if time allows.

Rumen [4] is a tool for data extraction and analysis based on Hadoop JobHistory logs. Useful MapReduce-related information extracted from JobHistory logs is stored in a digest that can be easily parsed and accessed. The raw trace data from MapReduce logs are often insufficient for simulation, emulation and benchmarking, as these tools often attempt to measure conditions that did not occur in the source data. Vaidya experiences the same problem, but from our inspection it can overcome this insufficiency by extracting more statistics, which is one part of our work. Rumen performs a statistical analysis of the digest to estimate the variables the trace does not provide. Rumen generates cumulative distribution functions for the MapReduce task runtimes, which can be used to infer the runtime of incomplete and missing tasks. We adopt a similar inference procedure to help users better tune their MapReduce jobs.

Chukwa [5] is devoted to large-scale log collection and analysis. Chukwa is built on top of the Hadoop distributed file system (HDFS) and the MapReduce framework, aiming to provide a flexible and powerful platform for distributed data collection and rapid data processing. Chukwa is structured as a pipeline of collection and processing stages, with clean and narrow interfaces between stages.
It has three primary components: agents that run on each machine and emit data, collectors that receive data from the agents and write it to stable storage, and MapReduce jobs for parsing and archiving the data. Vaidya does not have such pipelining, as it is not needed, but it would be a good idea to use these tools collectively: for example, use the pipeline for rapid data preprocessing and then use Vaidya for post-analysis.
GridMix [3] is a benchmark and simulation tool for Hadoop clusters. It imposes synthetic jobs (in the form of binary readers) that model a profile mined from production loads onto Hadoop, for saturation and stress at scale. A MapReduce job trace describing the job mix for a given cluster, usually generated by Rumen, is required to run GridMix. The most recent version, GridMix3, takes task distribution, submission interval, input dataset, user diversity and job complexity into account for benchmarking.

Vaidya, the diagnosis tool we are working on, is not the first tool that deals with Hadoop log analysis. Vaidya is designed so that it and the tools mentioned above are complementary and mutually supportive within the Hadoop framework. It is valuable to know how the other tools work, what they have achieved, and what remains to be achieved. On the one hand, we will delve into some of them to make our extensions to Vaidya better and more meaningful for tool integration. On the other hand, bringing good ideas from other tools (such as statistics processing and anomaly detection) into our extensions is a promising direction. A good understanding and inspection of the above tools will better guide how we extend Vaidya.

4 Project Plan

Since MapReduce jobs typically run on large-scale distributed systems, it is normal for users to suffer failures and low performance while executing jobs. We cannot expect to create a perfect system with no errors and high performance in one shot; instead, we can do much better by learning from the failures and problems we have gone through previously. Hadoop Vaidya is a powerful diagnosis tool: it provides predefined rules that give the user prescriptions for improving MapReduce jobs. The input data it uses are stored in a log.txt file and a configuration conf.xml file, both generated automatically by Hadoop after job execution. However, Vaidya is still far from perfect, which means there is a lot we can do to improve it further. Currently, only a few test rules have been set up to analyze a job, which is severely insufficient. So the first goal of the project is to design more test rules to make Vaidya more comprehensive. Specifically, three functions must be filled in for each new test: evaluate(), which sets up the formulas and rules to calculate the impact level; getPrescription(), which gives new advice and suggestions for improving performance; and getReferenceDetails(), which shows some key statistics from the job execution (a skeleton of this structure is sketched at the end of this section). New prescriptions will also be given depending on the analysis of the data. The data are another part that could be supplemented, because the current number of counters is not enough to indicate the problems of a specific job. Based on this insufficiency, other researchers have added new job/task counters, such as the number of spills and the maximum memory used, in Hadoop Performance Monitoring [9]. Along the same lines, we believe we can explore more interesting features. The whole project is divided into five steps. In the first week, all group members will install Hadoop and set up configurations on a virtual machine. The next task is to look up references, read books and study the open-source code to understand how MapReduce jobs and Vaidya work in Java. In the next step, we will spend about two weeks focusing on designing new test rules based on JobStatistics and DiagnoseTest.
Besides this, we will work on composing the prescriptions that help users improve job performance. Then we will move on to the next step, finishing the 2/3 milestone report and doing further research on the design of the new counters.
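To illustrate the structure described above, the following is a minimal sketch, in plain Java, of how a rule with evaluate(), getPrescription() and getReferenceDetails() could be organized and driven. The interface, the JobStats holder and the driver are simplified stand-ins of our own, not Vaidya's actual classes or report format (Vaidya writes an XML report; the driver here just prints).

// Simplified stand-ins for a Vaidya-style rule and test driver.
// These are illustrative types, not the real org.apache.hadoop.vaidya API.
import java.util.List;

interface DiagnosticRule {
    double evaluate(JobStats stats);   // impact level of the problem, in [0, 1]
    double successThreshold();         // an impact below this value passes
    String getPrescription();          // advice reported when the test fails
    String getReferenceDetails();      // key statistics behind the verdict
}

// Minimal holder for the job statistics a rule might need.
final class JobStats {
    final long launchedMaps, failedMaps, launchedReduces, failedReduces;
    JobStats(long launchedMaps, long failedMaps, long launchedReduces, long failedReduces) {
        this.launchedMaps = launchedMaps;
        this.failedMaps = failedMaps;
        this.launchedReduces = launchedReduces;
        this.failedReduces = failedReduces;
    }
}

final class TestDriver {
    // Run every rule and print a prescription for each failed test,
    // mirroring how Vaidya summarizes its evaluation results.
    static void run(JobStats stats, List<DiagnosticRule> rules) {
        for (DiagnosticRule rule : rules) {
            double impact = rule.evaluate(stats);
            if (impact < rule.successThreshold()) {
                System.out.println("PASSED: " + rule.getReferenceDetails());
            } else {
                System.out.println("FAILED: " + rule.getReferenceDetails());
                System.out.println("  Prescription: " + rule.getPrescription());
            }
        }
    }
}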
5 Challenges

There is a challenge in our design of diagnostic tests. Recall that the rule-based tests can fail or pass depending on the job execution and the given parameters (e.g. MaxMapFailureRatio). The parameters play important roles in the diagnosis process in the form of function arguments; unsuccessful settings of the parameters may directly lead to test design failures that make the tests meaningless. The challenge comes from tuning those parameters. Different kinds of jobs may present different characteristics, and even for the same job, when it is executed under different circumstances (e.g. network, cluster, number of mappers and reducers, input data, etc.), the job log can be quite different. But for the tests we write, the write-once-use-many-times requirement demands that we set parameters useful for as many jobs as possible, at least for jobs running in a relatively common environment. We will simulate the different running environments mentioned above and carefully determine the parameters accordingly.

Another challenge is the limited information provided by current Hadoop logs. The range of potential performance inefficiencies that can be detected by Vaidya tests depends on how much information the Hadoop logs can provide. The information in the Hadoop logs can be enriched by adding counters to the Hadoop MapReduce framework. Counters are responsible for recording specific job and task execution information, such as the number of bytes written by mappers and reducers. We plan to extend Hadoop by adding new counters that record useful information for detecting more Hadoop MapReduce performance inefficiencies.

6 Tests

In this section, we introduce the six tests we have added to Vaidya. For each test we include the motivation behind it, the algorithm we use to determine whether the log passes the test, the prescription we give as feedback to the Hadoop user, and the test parameters that will be tuned by experiments.

6.1 Desirable Reduces/Maps Ratio

Motivation: For maximum performance of MapReduce jobs, the number of reducers should be slightly less than the number of reduce slots in the cluster. This allows the reducers to finish in one batch and fully utilizes the cluster during the reduce phase. The user can tune the settings better by knowing how many maps and reduces have actually been launched and their ratio.

Algorithm 1: determine whether the reduces/maps ratio is desirable
  1: mapFailRatio <- failedMaps / launchedMaps
  2: redFailRatio <- failedReds / launchedReds
  3: if mapFailRatio > maxMapFailRatio && redFailRatio > maxRedFailRatio then return "Test is not meaningful!"
  4: end if
  5: ratio <- reds / maps
  6: if ratio < successThreshold then return Test Passed
  7: else return Test Failed
  8: end if

Prescription: (1) Try to set fewer reducers. (2) Try to set more mappers if the cluster is still underloaded. (3) This test may not be meaningful because either the map or the reduce failure ratio is greater than the user-specified thresholds; before running this test, try to capture other errors or pass the other tests first.

Parameters: SuccessThreshold: 0.9; MaxMapFailureRatio: 0.20; MaxReduceFailureRatio:
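As a rough illustration of how Algorithm 1 could be evaluated against the extracted counters, the sketch below computes the failure ratios and the reduces/maps ratio in plain Java; the method and parameter names are our own, and the threshold arguments reuse the values listed above.

// Sketch of the reduces/maps ratio test (Algorithm 1), using illustrative names.
final class ReducesMapsRatioCheck {
    static String check(long launchedMaps, long failedMaps,
                        long launchedReds, long failedReds,
                        double successThreshold,   // 0.9 above
                        double maxMapFailRatio,    // 0.20 above
                        double maxRedFailRatio) {
        double mapFailRatio = launchedMaps == 0 ? 0.0 : (double) failedMaps / launchedMaps;
        double redFailRatio = launchedReds == 0 ? 0.0 : (double) failedReds / launchedReds;
        // Too many failures on both sides: the ratio says nothing useful.
        if (mapFailRatio > maxMapFailRatio && redFailRatio > maxRedFailRatio) {
            return "Test is not meaningful!";
        }
        double ratio = launchedMaps == 0 ? 0.0 : (double) launchedReds / launchedMaps;
        return ratio < successThreshold ? "Test Passed" : "Test Failed";
    }
}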
6.2 Balanced Shuffling From Map To Reduce

Motivation: When a MapReduce job finishes its map process, the Hadoop framework executes a shuffling algorithm to distribute and move the output of the mappers to the reducers over the network. If shuffling is unbalanced, due to unbalanced partitioning or problematic network latency, some reducers waste a large amount of time before they can begin their work, which greatly slows down completion of the entire job. Therefore, we want to detect unbalanced shuffling and advise users to inspect the potential problems, mostly problematic network latency.

Algorithm 2: determine whether the shuffling time for the reducers is balanced
  1: if job is MAP_ONLY then return Test Passed
  2: end if
  3: reds <- redTasksList()
  4: // shuffle time for each task is computed as shuffle finished time minus task started time
  5: sdShflTime <- stdDevShflTime(reds)
  6: ratio <- sdShflTime / maxTime
  7: if ratio < successThreshold then return Test Passed
  8: else return Test Failed
  9: end if

Prescription: (1) Use an appropriate shuffling function. (2) First, check whether the reduce partition is balanced. (3) If it is balanced, check whether there is unbalanced latency in retrieving data from the map outputs to the reduces. (4) Investigate unnecessary network latency in data transmission.

Parameters: SuccessThreshold: 0.2; MaxShuffleTimeStandardDeviation:
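A sketch of the balance computation in Algorithm 2 is shown below: it takes the per-reducer shuffle times (shuffle finish time minus task start time), computes their standard deviation, and compares the deviation-to-maximum ratio against the threshold. The method name and the use of a plain array are our own simplifications of what the job history would provide.

// Sketch of the shuffle-balance check (Algorithm 2) over per-reducer shuffle times.
final class ShuffleBalanceCheck {
    // shuffleTimes[i] = shuffle finished time - task started time, for reducer i.
    static String check(double[] shuffleTimes, double successThreshold /* 0.2 above */) {
        if (shuffleTimes.length == 0) {
            return "Test Passed"; // map-only job: nothing to shuffle
        }
        double mean = 0.0, max = 0.0;
        for (double t : shuffleTimes) {
            mean += t;
            max = Math.max(max, t);
        }
        mean /= shuffleTimes.length;
        double variance = 0.0;
        for (double t : shuffleTimes) {
            variance += (t - mean) * (t - mean);
        }
        double stdDev = Math.sqrt(variance / shuffleTimes.length);
        double ratio = max == 0.0 ? 0.0 : stdDev / max;
        return ratio < successThreshold ? "Test Passed" : "Test Failed";
    }
}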
6.3 Reduce Side Disk Spill

Motivation: When map tasks complete, the map outputs are copied to the reduce tasktracker's memory. If the tasktracker's memory buffer reaches a threshold size, or holds a threshold number of map outputs, the map outputs are merged and spilled to disk. Reduce-side disk spill causes unnecessary storage waste and slows down the reduce phase.

Algorithm 3: calculating the reduce-side disk spill amount
  1: totalLocalBytesWrittenByRed <- sum over all reduce tasks of LocalBytesWrittenByRedTask
  2: redSideDiskSpill <- totalLocalBytesWrittenByRed
  3: ratio <- redSideDiskSpill / jobMapOutputBytes
  4: if ratio < successThreshold then return Test Passed
  5: else return Test Failed
  6: end if

Prescription: (1) Increase the proportion of the total heap size allocated to the map outputs during the copy phase of the shuffle. (2) Increase the threshold usage proportion of the map-output buffer at which merging the outputs and spilling to disk begins. (3) Increase the threshold number of map outputs at which merging the outputs and spilling to disk begins.

Parameters: SuccessThreshold:

6.4 Intermediate Data Process Time

Motivation: When map tasks finish, the map outputs stay on the local disk of the tasktracker that ran the map tasks. A reduce task needs to copy the map output for its particular partition from the map tasks across the cluster, which is called the copy phase of the reduce task. After the map outputs are copied, the reduce tasks start to merge the map outputs while maintaining the sorted order, which is called the merge phase; the merging is done in rounds. The time spent manipulating intermediate data can vary by a good amount for different MapReduce configurations.

Algorithm 4: calculating the intermediate data process time
  1: lastMapTaskFinishTime <- max over MapTasks of finishTime
  2: firstRedTaskStartTime <- min over RedTasks of startTime
  3: intermDataProTime <- firstRedTaskStartTime - lastMapTaskFinishTime
  4: ratio <- intermDataProTime / jobExecutionTime
  5: if ratio < successThreshold then return Test Passed
  6: else return Test Failed
  7: end if

Prescription: (1) Increase the number of threads used to copy map outputs to the reducer. (2) Increase the maximum number of streams to merge at once when sorting files.

Parameters: SuccessThreshold:
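The ratio in Algorithm 4 can be computed directly from the per-task timestamps, as in the sketch below; the timestamp arrays and names are illustrative stand-ins for what the job history provides.

// Sketch of the intermediate-data process time check (Algorithm 4).
final class IntermediateDataTimeCheck {
    static String check(long[] mapFinishTimes, long[] reduceStartTimes,
                        long jobExecutionTime, double successThreshold) {
        if (mapFinishTimes.length == 0 || reduceStartTimes.length == 0) {
            return "Test Passed"; // no intermediate phase to measure
        }
        long lastMapFinish = Long.MIN_VALUE;
        for (long t : mapFinishTimes) {
            lastMapFinish = Math.max(lastMapFinish, t);
        }
        long firstReduceStart = Long.MAX_VALUE;
        for (long t : reduceStartTimes) {
            firstReduceStart = Math.min(firstReduceStart, t);
        }
        long intermDataProTime = firstReduceStart - lastMapFinish;
        double ratio = jobExecutionTime == 0 ? 0.0 : (double) intermDataProTime / jobExecutionTime;
        return ratio < successThreshold ? "Test Passed" : "Test Failed";
    }
}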
6.5 Map Task Failure Solution

Motivation: In real scenarios there will be different kinds of failures on the map side. The reason may be bugs in the user's code, a hardware problem, or corrupted records in the file system. It is also possible that the bug is in a third-party library that cannot easily be fixed. In addition, sometimes users do not want to abort the whole job if a few tasks have failed. So this test aims to help the user detect the specific failure and find the corresponding solution efficiently, without significantly affecting the final result and performance.

Algorithm 5: provide a solution for map task failure
  1: badRecMp <- 0
  2: totalMaps <- launchedMaps()
  3: totalFMaps <- failedMaps()
  4: badRecMp <- badRecInEachMap()
  5: badRecFRatio <- badRecMp / totalMaps
  6: totalFRatio <- totalFMaps / totalMaps
  7: ratio <- 1 - totalFRatio
  8: if ratio >= successThreshold then return Test Passed
  9: else return Test Failed
  10: end if

Prescription: (1) If the total number of failed tasks is small, you can set the maximum percentage of tasks that are allowed to fail without triggering job failure by using the mapred.max.map.failures.percent property. (2) If the number of failed tasks is still bearable, enable SkipBadRecords automatically for map tasks and increase the maximum number of task attempts, via mapred.map.max.attempts, to give skipping mode enough attempts to detect and skip all the bad records. (3) If too many task failures occur, check the code as well as the hardware.

Parameters: SuccessThreshold: 0.8; smapfratio: 0.10; mmapfratio:

6.6 Combiner Efficiency Test

Motivation: A combiner is a function that performs local aggregation of the map outputs before shuffling them as inputs to the reducer. Because it reduces the amount of data transferred to the reducer, it can improve the performance of the MapReduce job significantly. However, calling the combiner brings additional overhead for serialization and deserialization of intermediate data. So this test notifies the user whether it is appropriate to call the combiner on the intermediate data.

Algorithm 6: determine the efficiency of the combiner
  1: cmbIptRec <- cmbIptRecInMaps()
  2: cmbOptRec <- cmbOptRecInMaps()
  3: mapCmbEff <- 1 - cmbOptRec / cmbIptRec
  4: if mapCmbEff <= effThreshold then
  5:   impact <- 1.0
  6: end if
  7: if impact < successThreshold then return Test Passed
  8: else return Test Failed
  9: end if

Prescription: (1) If the efficiency of the combiner on the map side is lower than 0.2, do not use a combiner on that side, because there is a time cost for the serialization and deserialization of map-output records.

Parameters: SuccessThreshold: 0.5; EffThreshold: 0.2

References

[1] Apache.
[2] Apache. docs/current/vaidya.html.
[3] Apache. docs/current/gridmix.html.
[4] Apache. docs/r0.22.0/rumen.html.
[5] Apache.
[6] D. Borthakur. The Hadoop distributed file system: Architecture and design. Hadoop Project Website.
[7] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (OSDI), Volume 6, pages 10-10, Berkeley, CA, USA. USENIX Association.
[8] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. In ACM SIGOPS Operating Systems Review, volume 37. ACM.
[9] Impetus.opensource. Hadoop Performance Monitoring.
[10] J. Tan, X. Pan, S. Kavulya, R. Gandhi, and P. Narasimhan. Mochi: visual log-analysis based tools for debugging Hadoop. In Proceedings of the 2009 Conference on Hot Topics in Cloud Computing (HotCloud '09), Berkeley, CA, USA. USENIX Association.
