Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop

Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan
Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA

Abstract

Mochi, a new visual, log-analysis-based debugging tool, correlates Hadoop's behavior in space, time and volume, and extracts a causal, unified control- and data-flow model of Hadoop across the nodes of a cluster. Mochi's analysis produces visualizations of Hadoop's behavior with which users can reason about and debug performance issues. We provide examples of Mochi's value in revealing a Hadoop job's structure, in optimizing real-world workloads, and in identifying anomalous Hadoop behavior, on the Yahoo! M45 Hadoop cluster.

1 Introduction

MapReduce (MR) [7] is a programming paradigm and framework introduced by Google for data-intensive cloud computing on commodity clusters. Hadoop [9], an open-source Java implementation of MapReduce, is used by Yahoo! and Facebook, and is available on Amazon's pay-as-you-use EC2 cloud-computing infrastructure. Debugging the performance of Hadoop programs is difficult because of their scale and distributed nature. Hadoop can be debugged by examining the local (node-specific) logs of its execution. These logs can be large, and must be manually stitched across nodes to debug system-wide problems. Current Java debugging/profiling tools (jstack, hprof) target programming abstractions to help debug local code-level errors rather than distributed problems across multiple nodes [16]. In addition, these tools do not provide insights at the higher level of abstraction (e.g., Maps and Reduces) that is more natural to MR users and programmers. Similarly, path-tracing tools [8] for distributed systems produce fine-grained views at the language level rather than at the MR level of abstraction. Our survey of the Hadoop users' mailing list indicates that the most frequent performance-related questions are indeed at the level of MR abstractions.
We examined the 3400 posts on this mailing list over a 6-month period (10/2008 to 4/2009), and classified the 30-odd explicit performance-related posts² (some posts had multiple categories) in Table 1. These posts focused on MR-specific aspects of Hadoop program behavior. The primary response to these posts involved suggestions to use Java profilers, which do not capture dynamic MR-specific behavior, such as relationships in time (e.g., orders of execution), space (which tasks ran on which nodes), and the volumes of data in various program stages. This motivated us to extract and analyze time-, space- and volume-related Hadoop behavior.

The MR framework affects program performance at the macro-scale through task scheduling and data distribution. This macro behavior is hard to infer from low-level language views because of the glut of detail, and because this behavior results from the framework, outside of user code. For effective debugging, tools must expose MR-specific abstractions. This motivated us to capture Hadoop's distributed data- and execution-related behavior that impacts MR performance. Finally, given the scale (number of nodes, tasks, interactions, durations) of Hadoop programs, there is also a need to visualize a program's distributed execution to support debugging and to make it easier for users to detect deviations from expected program behavior/performance.

To the best of our knowledge, Mochi is the first debugging tool for Hadoop to extract (from Hadoop's own logs) both control- and data-flow views, and to then analyze and visualize these views in a distributed, causal manner. We provide concrete examples where Mochi has assisted us and other Hadoop users in understanding Hadoop's behavior and unearthing problems.

¹ This research was partially funded by the Defence Science & Technology Agency, Singapore, via the DSTA Overseas Scholarship, and sponsored in part by the National Science Foundation, via CAREER grant CCR and grant CNS.
² As expected of mailing lists, most of the 3400 posts were from users learning about and configuring Hadoop (note that misconfigurations can also lead to performance problems). We filtered out, and focused on, the 30-odd posts that explicitly raised a performance issue.

2 Problem Statement

Our previously developed log-analysis tool, SALSA [18], extracted various statistics (e.g., durations of Map and Reduce tasks) of system behavior from Hadoop's logs on individual nodes. Mochi aims to go beyond SALSA, to (i) correlate Hadoop's behavior in space, time and volume, and (ii) extract causal, end-to-end, distributed Hadoop behavior that factors in both computation and data across the nodes of the cluster. From our interactions with real Hadoop users (of the Yahoo! M45 [11] cluster), a third need has emerged: to provide helpful visualizations of Hadoop's behavior so that users can reason about and debug performance issues themselves.

Goals. Mochi's goals are:
- To expose MapReduce-specific behavior that results from the MR framework's automatic execution, that affects program performance but is neither visible nor exposed to user Map/Reduce code, e.g., when Maps/Reduces are executed and on which nodes, and from/to where data inputs/outputs flow and from which Maps/Reduces. Existing Java profilers do not capture such information.

- To expose aggregate and dynamic behavior that can provide different insights. For instance, Hadoop system views in time can be instantaneous or aggregated across an entire job; views in space can be of individual Maps/Reduces or aggregated at nodes.

- To be transparent, so that Mochi does not require any modifications to Hadoop, or to the way that Hadoop users write/compile/load their programs today. This also makes Mochi amenable to deployment in production Hadoop clusters, as is our objective.

Non-goals. Our focus is on exposing MR-specific aspects of programs rather than behavior within each Map or Reduce. Thus, the execution specifics or correctness of code within a Map/Reduce is outside our scope. Also, Mochi does not discover the root cause of performance problems, but aids in the process through useful visualizations and analysis that Hadoop users can exploit.

Table 1: Common queries on the users' mailing list

  Category          Question                                                        Fraction
  Configuration     How many Maps/Reduces are efficient? Did I set a wrong
                    number of Reduces?                                              50%
  Data behavior     My Maps have lots of output; are they beating up nodes
                    in the shuffle?                                                 30%
  Runtime behavior  Must all mappers complete before reducers can run? What is
                    the performance impact of setting X? What are the execution
                    times of program parts?                                         50%

3 Mochi's Approach

MapReduce programs, or jobs, consist of Map tasks followed by Reduce tasks; multiple identical but distinct instances of tasks operate on distinct data segments in parallel across nodes in a cluster. The framework has a single master node (running the NameNode and JobTracker daemons) that schedules Maps and Reduces on multiple slave nodes.
The framework also manages the inputs and outputs of Maps, Reduces, and Shuffles (the moving of Map outputs to Reduces). Hadoop provides a distributed filesystem (HDFS) that implements the Google File System [10]. Each slave node runs a TaskTracker (execution) daemon and a DataNode (HDFS) daemon. Hadoop programs read and write data from HDFS. Each Hadoop node generates logs that record the local execution of tasks and HDFS data accesses.

3.1 Mochi's Log Analysis

Mochi constructs cluster-wide views of the execution of MR programs from Hadoop-generated system logs. Mochi builds on our log-analysis capabilities to extract local (node-centric) Hadoop execution views [18]. Mochi then correlates these views across nodes, and also between HDFS and the execution layer, to construct a unique end-to-end representation that we call a Job-Centric Data-flow (JCDF), which is a distributed, causal, conjoined control- and data-flow.

Figure 2: Swimlanes, detailed states: Sort workload (4 nodes).

Mochi parses Hadoop's logs³ based on SALSA's [18] state-machine abstraction of Hadoop's execution. In its log analysis, Mochi extracts (i) a time-stamped, cross-node, control-flow model by seeking string tokens that identify TaskTracker-log messages signaling the starts and ends of activities (e.g., Map, Reduce), and (ii) a time-stamped, cross-node, data-flow model by seeking string tokens that identify DataNode-log messages signaling the movement/access of data blocks, and by correlating these accesses with Maps and Reduces running at the same time. Mochi assumes that clocks are synchronized across nodes using NTP, as is common in production Hadoop clusters. Mochi then correlates the execution of the TaskTrackers and DataNodes in time (e.g., co-occurring Maps and block reads in HDFS) to identify when data was read from or written to HDFS.
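In outline, the control-flow extraction pairs start/end tokens into per-task time intervals, and the data-flow correlation checks which block accesses co-occur in time with which tasks. A minimal sketch of that idea follows; the log-message formats here are hypothetical stand-ins (real TaskTracker and DataNode messages differ across Hadoop versions), and this illustrates the token-matching approach rather than Mochi's actual SALSA-based parser:

```python
import re
from datetime import datetime

# Hypothetical message formats, for illustration only; actual Hadoop
# TaskTracker/DataNode log messages differ.
TASK_START = re.compile(r"(?P<ts>\S+ \S+) .*LaunchTaskAction: (?P<task>task_\S+)")
TASK_END = re.compile(r"(?P<ts>\S+ \S+) .*Task (?P<task>task_\S+) is done")

def parse_ts(s):
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")

def extract_intervals(lines, start_pat, end_pat):
    """Pair start/end log tokens into (id, start, end) execution intervals."""
    pending, intervals = {}, []
    for line in lines:
        m = start_pat.search(line)
        if m:
            pending[m.group("task")] = parse_ts(m.group("ts"))
            continue
        m = end_pat.search(line)
        if m and m.group("task") in pending:
            task = m.group("task")
            intervals.append((task, pending.pop(task), parse_ts(m.group("ts"))))
    return intervals

def co_occurs(a, b):
    """True if two (start, end) intervals overlap in time, e.g. a task and a
    block access that may belong to the same causal flow."""
    return a[0] < b[1] and b[0] < a[1]
```

The same interval pairing, run over DataNode messages for block reads and writes, yields the data-flow side; intervals that co-occur on the same node become candidates for control/data correlation.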
This completes the causal path of the data being read from HDFS, processed in the Hadoop framework, and written to HDFS, creating a JCDF, which is a directed graph with vertices representing processing stages and data items, and edges annotated with durations and volumes (Figure 1). Finally, we extract all Realized Execution Paths (REPs) from the JCDF graph (unique paths from a parent node to a leaf node) using a depth-first search. Each REP is a distinct end-to-end, causal flow in the system.

³ Mochi uses SALSA to parse TaskTracker and DataNode logs. We can extend this to parse NameNode and JobTracker logs as well.
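The REP extraction just described is an exhaustive enumeration of root-to-leaf paths. A minimal sketch, assuming the JCDF is given as an adjacency map from each vertex to its successors (the duration/volume edge annotations are omitted for brevity):

```python
def realized_execution_paths(graph, roots):
    """Enumerate all root-to-leaf paths (REPs) in a JCDF-like DAG via
    depth-first search. Each returned path is one end-to-end causal flow."""
    paths = []

    def dfs(node, path):
        children = graph.get(node, [])
        if not children:          # leaf: a complete causal flow
            paths.append(path)
            return
        for child in children:
            dfs(child, path + [child])

    for root in roots:
        dfs(root, [root])
    return paths
```

On a JCDF, each enumerated path carries the per-stage durations and volumes along its edges, which is exactly what the REP plots later cluster and display.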
Figure 1: Single instance of a Realized Execution Path (REP), showing vertices and edge annotations (size and time for each stage: block read, Map output, Shuffle, wait for Shuffles, CopyReceive input, and block write).

Figure 3: MIROS: Sort workload (volumes in bytes); Map inputs and outputs per node, and Shuffle volumes from source hosts (from Mappers) to destination hosts (towards Reducers).

Figure 4: REP plot for the Sort workload; clusters of similar paths, each showing per-state volumes (ReadBlockVol, MapVol, ShuffleVol, InputVol, OutputVol, WriteBlockVol) and durations (ReadTime, MapTime, ShuffleTime, CopyReceiveTime, ReduceTime, WriteTime).

Thus, Mochi automatically generates, and then correlates, the cross-node data- and control-flow models of Hadoop's behavior, resulting in a unified, causal, cluster-wide execution+data-flow model.

3.2 Mochi's Visualization

Mochi's distributed data- and control-flows capture MR programs in three dimensions: space (nodes), time (durations, times, sequences of execution), and volume (of data processed). We use Mochi's analysis to drive visualizations that combine these dimensions at various aggregation levels. In this section, we describe the form of these visualizations, without discussing actual experimental data or drawing any conclusions from the visualizations (although the visualizations are based on real experimental data). We describe the actual workloads and case studies in §4.

Swimlanes: task progress in time and space. In such a visualization, the x-axis represents wall-clock time, and each horizontal line corresponds to an execution state (e.g., Map, Reduce) running in the marked time interval. Figure 2 shows a sample detailed view with all states across all nodes.
Figure 6 shows a sample summarized view (Maps and Reduces only) collapsed across nodes, while Figure 7 shows summarized views with tasks grouped by nodes. Figure 5 shows Swimlanes for a 49-slave-node cluster. Swimlanes are useful in capturing dynamic Hadoop execution, showing where the job and nodes spend their time.

MIROS plots: data-flows in space. MIROS (Map Inputs, Reduce Outputs, Shuffles; Figure 3) visualizations show data volumes into all Maps and out of all Reduces on each node, and between Maps and Reduces on nodes. These volumes are aggregated over the program's run and over nodes. MIROS is useful in highlighting skewed data flows that can result in bottlenecks.

REP: volume-duration correlations. For each REP flow, we show the time taken for a causal flow, and the volume of inputs and outputs, along that flow (Figure 4). Each REP is broken down into time spent and volume processed in each state. We use the k-means clustering algorithm to group similar paths for scalable visualization. For each group, the top bar shows volumes, and the bottom bar shows durations. This visualization is useful in (i) checking that states that process larger volumes also take longer, and (ii) tracing problems back to any previous stage or data.

4 Examples of Mochi's Value

We demonstrate the use of Mochi's visualizations (using mainly Swimlanes, due to space constraints). All data is derived from log traces from the Yahoo! M45 [11] production cluster. The examples in §4.1 and §4.2 involve 4-slave and 49-slave clusters, and the example in §4.3 is from a 24-slave cluster.

4.1 Understanding Hadoop Job Structure

Figure 6 shows the Swimlanes plots from the Sort and RandomWriter benchmark workloads (both part of the Hadoop distribution). RandomWriter writes random key/value pairs to HDFS and has only Maps, while Sort reads key/value pairs in Maps, and aggregates, sorts, and outputs them in Reduces.
From these visualizations, we see that RandomWriter has only Maps, while the Reduces in Sort take significantly longer than the Maps, showing that most of the work occurs
in the Reduces. The REP plot in Figure 4 shows that a significant fraction (about 2/3) of the time along the critical paths (Cluster 5) is spent waiting for Map outputs to be shuffled to the Reduces, suggesting that this is a bottleneck.

Figure 5: Swimlanes plot for the 49-node Matrix-Vector Multiplication job; top plot: tasks sorted by node; bottom plot: tasks sorted by time.

4.2 Finding Opportunities for Optimization

Figure 7 shows the Swimlanes from the Matrix-Vector Multiplication job of the HADI [12] graph-mining application for Hadoop. This workload contains two MR programs, as seen from the two batches of Maps and Reduces. Before optimization, the second node and the first node do not run any Reduces in the first and second jobs, respectively. The number of Reduces was then increased to twice the number of slave nodes, after which every node ran two Reduces (the maximum concurrent Reduces permitted), and the job completed 13.5% faster.

4.3 Debugging: Delayed Java Socket Creation

We ran a no-op ("Sleep") Hadoop job, with 2400 idle Maps and Reduces that sleep for 100ms, to characterize idle Hadoop behavior, and found tasks with unusually long durations. On inspecting the Swimlanes, we found that the delayed tasks ran for 3 minutes (Figure 8). We traced this problem to a delayed socket call in Hadoop, and found a fix described at [1]. We resolved this issue by forcing Java to use IPv4 through a JVM option, and Sleep ran in 270 seconds instead of 520.

5 Related Work

Distributed tracing and failure diagnosis. Recent tools for tracing distributed program execution have focused on building instrumentation to trace causal paths [3], and on inferring causality across components [15] and networks [8]. They produce fine-grained views at the language level but not at the MR level of abstraction.
Figure 6: Summarized Swimlanes plots for RandomWriter (top) and Sort (bottom), both on 4 nodes.

Our work correlates system views from existing instrumentation (Hadoop system logs) to build views at a higher level of abstraction for MR. Other uses of distributed execution tracing [2, 5, 13] for diagnosis operate at the language level, rather than at a higher level of abstraction (e.g., MR), since they are designed for general systems.

Diagnosis for MR. [14] collected trace events in Hadoop's execution generated by custom instrumentation; these are akin to language-level views. Their abstractions do not account for the volume dimension which we provide (§3.2), and they do not correlate data with the MR level of abstraction. [19] only showed how outlier events can be identified in DataNode logs; we utilize information from the TaskTracker logs as well, and we build a complete abstraction of all execution events. [17] diagnosed anomalous nodes in MR clusters by identifying nodes with OS-level performance counters that deviated from those of other nodes. [18] demonstrated how to extract state-machine views of Hadoop's execution from its logs; we have expanded on the per-node state-machine views in [18] by building the JCDF and extracting REPs (§3.1) to generate novel system views.

Visualization tools. Artemis [6] provides a pluggable framework for distributed log collection, data analysis, and visualization. We have presented specific MR abstractions and ways to build them, and our techniques can be implemented as Artemis plugins. The machine-usage data plots in [6] resemble Swimlanes; REP shows both data and computational dependencies, while the critical-path analysis in [6] considers only computation. [4] visualized web server access patterns and the output of anomaly detection algorithms, while we showed
MR-level system execution patterns.

Figure 7: Matrix-Vector Multiplication workload (4 nodes), before optimization (above) and after optimization (below).

Figure 8: Sleep workload (24 nodes), with delayed socket creation (above) and without (below).

6 Conclusion and Future Work

Mochi extracts and visualizes information about MR programs at the MR level of abstraction, based on Hadoop's system logs. We show how Mochi's analysis produces a distributed, causal, control+data-flow model of Hadoop's behavior, and then show the use of the resulting visualizations for understanding and debugging the performance of Hadoop jobs in the Yahoo! M45 production cluster. We intend to implement our (currently offline) Mochi analysis and visualization to run online, to evaluate the resulting performance overheads and benefits. We also intend to support the regression testing of Hadoop programs against new Hadoop versions, and the debugging of more problems, e.g., misconfigurations.

Acknowledgements

The authors would like to thank Christos Faloutsos and U Kang for discussions on the HADI Hadoop workload and for providing log data.

References

[1] Creating socket in java takes 3 minutes.
[2] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance Debugging for Distributed Systems of Black Boxes. In SOSP, Oct.
[3] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for Request Extraction and Workload Modelling. In OSDI, Dec.
[4] P. Bodik, G. Friedman, L. Biewald, H. Levine, G. Candea, K. Patel, G. Tolle, J. Hui, A. Fox, M. Jordan, and D. Patterson. Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization. In ICAC, Jun.
[5] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer.
Pinpoint: Problem Determination in Large, Dynamic Internet Services. In DSN, Jun.
[6] G. Cretu-Ciocarlie, M. Budiu, and M. Goldszmidt. Hunting for Problems with Artemis. In Workshop on Analysis of System Logs, Dec.
[7] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, Dec.
[8] R. Fonseca, G. Porter, R. Katz, S. Shenker, and I. Stoica. X-Trace: A Pervasive Network Tracing Framework. In NSDI, Apr.
[9] The Apache Software Foundation. Hadoop.
[10] S. Ghemawat, H. Gobioff, and S. Leung. The Google File System. In SOSP, Oct.
[11] Yahoo! Inc. Yahoo! reaches for the stars with M45 supercomputing project.
[12] U. Kang, C. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec. HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop. CMU ML Tech Report CMU-ML.
[13] E. Kiciman and A. Fox. Detecting Application-level Failures in Component-based Internet Services. IEEE Trans. on Neural Networks, 16(5), Sep.
[14] A. Konwinski, M. Zaharia, R. Katz, and I. Stoica. X-tracing Hadoop. Hadoop Summit, Mar.
[15] E. Koskinen and J. Jannotti. BorderPatrol: Isolating Events for Black-box Tracing. In EuroSys, Apr.
[16] A. Murthy. Hadoop: Tuning and Debugging.
[17] X. Pan, J. Tan, S. Kavulya, R. Gandhi, and P. Narasimhan. Ganesha: Black-Box Diagnosis of MapReduce Systems. In HotMetrics, Seattle, WA, Jun.
[18] J. Tan, X. Pan, S. Kavulya, R. Gandhi, and P. Narasimhan. SALSA: Analyzing Logs as StAte Machines. In Workshop on Analysis of System Logs, Dec.
[19] W. Xu, L. Huang, A. Fox, D. Patterson, and M. Jordan. Mining Console Logs for Large-scale System Problem Detection. In SysML, Dec.
Designing Algorithms for MapReduce MapReduce: Algorithm Design Patterns Need to adapt to a restricted model of computation Goals Scalability: adding machines will make the algo run faster Efficiency: resources
Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,[email protected]
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT Samira Daneshyar 1 and Majid Razmjoo 2 1,2 School of Computer Science, Centre of Software Technology and Management (SOFTEM),
PerfCompass: Toward Runtime Performance Anomaly Fault Localization for Infrastructure-as-a-Service Clouds
PerfCompass: Toward Runtime Performance Anomaly Fault Localization for Infrastructure-as-a-Service Clouds Daniel J. Dean, Hiep Nguyen, Peipei Wang, Xiaohui Gu {djdean2,hcnguye3,pwang7}@ncsu.edu,[email protected]
A Hadoop MapReduce Performance Prediction Method
A Hadoop MapReduce Performance Prediction Method Ge Song, Zide Meng, Fabrice Huet, Frederic Magoules, Lei Yu and Xuelian Lin University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France Ecole Centrale
Analyzing Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environmen
Analyzing Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environmen Anil G, 1* Aditya K Naik, 1 B C Puneet, 1 Gaurav V, 1 Supreeth S 1 Abstract: Log files which
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National
MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
ASDF: Automated, Online Fingerpointing for Hadoop (CMU-PDL-08-104)
Research Centers and Institutes Parallel Data Laboratory Carnegie Mellon University Year 2008 ASDF: Automated, Online Fingerpointing for Hadoop (CMU-PDL-08-104) Keith Bare Michael P. Kasick Soila Kavulya
MapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) [email protected] http://www.cse.buffalo.edu/faculty/bina Partially
Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.
Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated
Hadoop Parallel Data Processing
MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for
The Hadoop Framework
The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen [email protected] Abstract. The Hadoop Framework offers an approach to large-scale
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
Evaluating HDFS I/O Performance on Virtualized Systems
Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang [email protected] University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
Copyright Shraddha Pradip Sen 2011 ALL RIGHTS RESERVED
LOG ANALYSIS TECHNIQUE: PICVIZ by SHRADDHA PRADIP SEN DR. YANG XIAO, COMMITTEE CHAIR DR. XIAOYAN HONG DR. SUSAN VRBSKY DR. SHUHUI LI A THESIS Submitted in partial fulfillment of the requirements for the
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
From GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu [email protected] MapReduce/Hadoop
Accelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
NoSQL Data Base Basics
NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS
GraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)
UC BERKELEY Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II) Anthony D. Joseph LASER Summer School September 2013 My Talks at LASER 2013 1. AMP Lab introduction 2. The Datacenter
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Implement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
Graph Processing and Social Networks
Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
Research Article Hadoop-Based Distributed Sensor Node Management System
Distributed Networks, Article ID 61868, 7 pages http://dx.doi.org/1.1155/214/61868 Research Article Hadoop-Based Distributed Node Management System In-Yong Jung, Ki-Hyun Kim, Byong-John Han, and Chang-Sung
HDFS Space Consolidation
HDFS Space Consolidation Aastha Mehta*,1,2, Deepti Banka*,1,2, Kartheek Muthyala*,1,2, Priya Sehgal 1, Ajay Bakre 1 *Student Authors 1 Advanced Technology Group, NetApp Inc., Bangalore, India 2 Birla Institute
Introduction to DISC and Hadoop
Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and
Evaluating Task Scheduling in Hadoop-based Cloud Systems
2013 IEEE International Conference on Big Data Evaluating Task Scheduling in Hadoop-based Cloud Systems Shengyuan Liu, Jungang Xu College of Computer and Control Engineering University of Chinese Academy
