Ganesha: Black-Box Diagnosis of MapReduce Systems
Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan
Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA

ABSTRACT
Ganesha aims to diagnose faults transparently (in a black-box manner) in MapReduce systems, by analyzing OS-level metrics. Ganesha's approach is based on peer-symmetry under fault-free conditions, and can diagnose faults that manifest asymmetrically at nodes within a MapReduce system. We evaluate Ganesha by diagnosing Hadoop problems for the Gridmix Hadoop benchmark on 10-node and 50-node MapReduce clusters on Amazon's EC2. We also candidly highlight faults that escape Ganesha's diagnosis.

1. INTRODUCTION
Performance problems in distributed systems can be hard to diagnose and to localize to a specific node or a set of nodes. There are many challenges in problem localization (i.e., tracing the problem back to the culprit node) and root-cause analysis (i.e., tracing the problem further to the underlying code-level fault or bug, e.g., memory leak, deadlock). As we show, performance problems can originate at one node in the system and then start to manifest at other nodes as well, due to the inherent communication across components; this can make it hard to discover the original culprit node. A black-box diagnostic approach aims to discover the culprit node by analyzing performance data from the OS or network, without having to instrument the application or to understand its semantics. The most interesting problems to diagnose are not necessarily the outright crash (fail-stop) failures, but rather those that result in a limping-but-alive system, i.e., the system continues to operate, but with degraded performance. We describe Ganesha, our black-box diagnostic approach that we apply to diagnose such performance problems in Hadoop [3], the open-source implementation of MapReduce [1]. Ganesha is based on our hypothesis (borne out by observation) that fault-free nodes in MapReduce behave similarly. Ganesha looks for asymmetric behavior across nodes to perform its diagnosis. Inevitably, this black-box approach will not have complete coverage: faults that do not result in a consistent asymmetry across nodes will escape Ganesha's diagnosis. Black-box diagnosis is not new. Other black-box diagnostic techniques [14, 12, 13] determine the root-cause of a problem, given the knowledge that a problem exists in the system (the techniques differ in how they know that a problem exists). In a MapReduce system with its potentially long-running jobs, the system might not provide us with quick indications of a job experiencing a problem. Thus, in contrast with other techniques, Ganesha attempts to determine, for itself, whether a problem exists and, if so, traces the problem to the culprit node(s). In this paper, we describe Ganesha's black-box diagnostic approach, based on our (experimentally substantiated) hypotheses of a MapReduce system's behavior. We evaluate our diagnosis approach on the well-accepted, multi-workload Gridmix Hadoop benchmark on a 10-node and a 50-node MapReduce cluster on Amazon's EC2. Our evaluation is performed by injecting and studying real-world problems that have been reported either in the Hadoop issue tracker or on the Hadoop users mailing list. We demonstrate the black-box diagnosis of faults that manifest asymmetrically at peer Hadoop slave nodes in the system.
We equally discuss our experiences with faults that escape Ganesha's diagnosis, and suggest how Ganesha can be synthesized with white-box metrics extracted by SALSA, our previously developed Hadoop log-analysis tool [9].

2. TARGET SYSTEM: MAPREDUCE
Hadoop [3] is an open-source implementation of Google's MapReduce [1] framework that enables distributed, data-intensive, parallel applications by decomposing a massive job into smaller tasks and a massive data-set into smaller partitions, such that each task processes a different partition in parallel. Hadoop uses the Hadoop Distributed File System (HDFS), an implementation of the Google File System [2], to share data amongst the distributed tasks in the system. The Hadoop framework has a single master node running the NameNode (which provides the HDFS namespace) and JobTracker (which schedules Maps and Reduces on multiple slave nodes) daemons. Each slave node runs a TaskTracker (execution) daemon and a DataNode (HDFS) daemon. We evaluate Ganesha's problem-diagnosis approach on the Gridmix Hadoop benchmark on a 10-node and a 50-node MapReduce cluster on EC2. Gridmix is an increasingly well-accepted Hadoop benchmark that is used, for instance, to validate performance across different Hadoop upgrades. Gridmix models a cluster workload by generating random data and submitting MapReduce jobs that mimic the observed data-access patterns in user jobs. Gridmix comprises 5 different jobs, ranging from an interactive workload to a large sort of compressed data. We scaled down the size of the dataset to 2MB of compressed data for the 10-node cluster and 200MB for the 50-node cluster so that the benchmark would complete in 30 minutes. For lack of space, we omit our results with other Hadoop benchmarks (that are simpler than Gridmix), such as RandWriter, Sort, Nutch and Pig.

3. PROBLEM STATEMENT & APPROACH
Figure 1: Architecture of Hadoop, showing our instrumentation points (the master node runs the JobTracker and NameNode daemons; each slave node runs a TaskTracker and a DataNode on top of the OS, from which the sadc-vectors are gathered).

Table 1: Gathered black-box metrics (sadc-vector).
user: % CPU time in user-space
system: % CPU time in kernel-space
iowait: % CPU time waiting for I/O
ctxt: Context switches per second
runq-sz: Number of processes waiting to run
plist-sz: Total number of processes and threads
ldavg-1: System load average for the last minute
eth-rxbyt: Network bytes received per second
eth-txbyt: Network bytes transmitted per second
pgpgin: KBytes paged in from disk per second
pgpgout: KBytes paged out to disk per second
fault: Page faults (major+minor) per second
bread: Total bytes read from disk per second
bwrtn: Total bytes written to disk per second

We seek to understand whether Ganesha can localize performance problems accurately and non-invasively, and what the limitations of black-box diagnosis for Hadoop are.

Hypotheses. We hypothesize that (1) Hadoop slave nodes exhibit a small number of distinct behaviors, from the perspective of black-box metrics; in a short interval (e.g., 1s) of time, the system's performance tends to be dominated by one of these behaviors; and (2) under fault-free operation, Hadoop slave nodes will exhibit similar behavior (peer-symmetry) over moderately long durations. We exploit these hypotheses for Ganesha's fault diagnosis. We make no claims about peer-symmetry or lack thereof under faulty conditions. (That is, we claim that fault-free operation results in peer-symmetry, but that faulty conditions may or may not also result in peer-symmetry.) Both our hypotheses are grounded in our experimental observations of Hadoop's behavior on the Sort, RandWriter, Pig, Nutch and Gridmix benchmarks for Hadoop clusters of 10 and 50 nodes on EC2.

Goals. Ganesha should run transparently to, and not require any modifications of, both Hadoop and its applications. Ganesha should be usable in production environments, where administrators might not have the luxury of instrumenting applications but could instead leverage other (black-box) data. Ganesha should produce low false-positive rates, in the face of a variety of workloads for the system under diagnosis, and more importantly, even if these workloads fluctuate (see footnote 1), as with Gridmix. Ganesha's data collection should impose minimal instrumentation overheads on the system under diagnosis.

Non-Goals. Ganesha currently aims for (coarse-grained) problem diagnosis by identifying the culprit slave node(s). Clearly, this differs from (fine-grained) root-cause analysis, which would aim to identify the underlying fault or bug, possibly even down to the offending line of code. While Ganesha can be supported online, this paper is intentionally focused on Ganesha's offline analysis for problem diagnosis. We also do not target faults on the master node.

Assumptions. We assume that Hadoop and its applications are the dominant source of activity on every node. We assume that a majority of the Hadoop nodes are problem-free and that all nodes are homogeneous in hardware.

Footnote 1: Workload fluctuations can often be mistaken for anomalous behavior, if the system's behavior is characterized in terms of OS metrics alone. Ganesha, however, can discriminate between the two because fault-free peer nodes track each other under workload fluctuations.
4. DIAGNOSTIC APPROACH
For our problem diagnosis, we gather and analyze black-box (i.e., OS-level) performance metrics, without requiring any modifications to Hadoop, its applications or the OS. For black-box data collection, we use sysstat's sadc program [5] to periodically gather a number of metrics (14, to be exact, as listed in Table 1) from /proc, every second. We use the term sadc-vector to denote a vector of values of these metrics, all sampled at the same instant of time. We collect the time-series of sadc-vectors from each slave node and then perform our analyses to determine whether there is a performance problem in the system, and then to trace the problem back to the culprit Hadoop slave node.

4.1 Approach
MapReduce nodes exhibit a small number of distinct behaviors, from the perspective of black-box metrics. In a short interval (e.g., 1s) of time, the system's performance tends to be dominated by one of these behaviors. Hadoop's performance, over a short interval of time, can thus be classified into K distinct profiles corresponding to these distinct behaviors. Effectively, these profiles are a way to classify the observed sadc-vectors into K clusters (or centroids). While the profiles do not represent semantically meaningful information, they are motivated by our observation that, over a short interval of time, each Hadoop slave node performs specific resource-related activities, e.g., computation (CPU-intensive), transferring data (network-intensive), disk access (I/O-intensive). The profiles, thus, represent a way to capture Hadoop's different behaviors, as manifested simultaneously on all of the 14 metrics. We use n to denote the number of slave nodes. There are two phases to our approach, training and deployment, as shown in Figure 2. In the training phase, we learn the profiles of Hadoop by analyzing sadc-vector samples from slave nodes, gathered over multiple jobs in the fault-free case. In the deployment phase, we determine whether there is a problem for a given job, and if so, which slave node is the culprit. We validate Ganesha's approach by injecting various faults at one of the Hadoop slave nodes, and then determining whether we can indeed diagnose the culprit node correctly. The results of our validation are described in Section 5.1.
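The authors gather all 14 metrics of Table 1 with sysstat's sadc [5]. Purely as an illustration of where such numbers come from, the following is a minimal Python sketch (our own, not part of Ganesha) that samples a handful of the Table 1 metrics directly from /proc once per second; the field positions follow the standard Linux /proc formats, and the helper names are ours.

import time

def read_cpu_jiffies():
    """Return (user, system, iowait, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    with open("/proc/stat") as f:
        fields = f.readline().split()   # ['cpu', user, nice, system, idle, iowait, irq, softirq, ...]
    vals = list(map(int, fields[1:]))
    user, system, iowait = vals[0], vals[2], vals[4]
    return user, system, iowait, sum(vals)

def read_net_bytes():
    """Return total (rx_bytes, tx_bytes) summed over eth* interfaces from /proc/net/dev."""
    rx = tx = 0
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:          # first two lines are headers
            iface, data = line.split(":", 1)
            if iface.strip().startswith("eth"):
                cols = data.split()
                rx += int(cols[0])              # received bytes
                tx += int(cols[8])              # transmitted bytes
    return rx, tx

def sample_sadc_vector(interval=1.0):
    """Sample a partial sadc-vector (a subset of Table 1) over one interval."""
    u0, s0, w0, t0 = read_cpu_jiffies()
    rx0, tx0 = read_net_bytes()
    time.sleep(interval)
    u1, s1, w1, t1 = read_cpu_jiffies()
    rx1, tx1 = read_net_bytes()
    dt = max(t1 - t0, 1)                        # total elapsed jiffies across all CPUs
    with open("/proc/loadavg") as f:
        ldavg1 = float(f.read().split()[0])     # 1-minute load average
    return {
        "user":      100.0 * (u1 - u0) / dt,    # % CPU in user space
        "system":    100.0 * (s1 - s0) / dt,    # % CPU in kernel space
        "iowait":    100.0 * (w1 - w0) / dt,    # % CPU waiting on I/O
        "ldavg-1":   ldavg1,
        "eth-rxbyt": (rx1 - rx0) / interval,    # network bytes received per second
        "eth-txbyt": (tx1 - tx0) / interval,    # network bytes transmitted per second
    }

if __name__ == "__main__":
    print(sample_sadc_vector())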
Table 2: Injected faults, and the reported failures that they emulate. HADOOP-xxxx represents a Hadoop bug-database entry.
- [Hadoop users mailing list, Sep] CPU bottleneck resulting from running master and slave daemons on the same machine; emulated by [CPUHog]: a CPU-intensive task that consumes 70% CPU utilization.
- [Hadoop users mailing list, Sep] Excessive messages logged to file during startup; emulated by [DiskHog]: a sequential disk workload that writes 20GB of data to the filesystem.
- [HADOOP-2956] Degraded network connectivity between DataNodes results in long block transfer times; emulated by [PacketLoss5/50]: induce 5% and 50% packet losses by dropping all incoming/outgoing packets with probabilities of 0.05 and 0.5.
- [HADOOP-1036] Hang at TaskTracker due to an unhandled exception from a task terminating unexpectedly; the offending TaskTracker sends heartbeats although the task has terminated. Emulated by [HANG-1036]: revert to an older version and trigger the bug by throwing a NullPointerException.
- [HADOOP-1152] Reduces at TaskTrackers hang due to a race condition when a file is deleted between a rename and an attempt to call getLength() on it. Emulated by [HANG-1152]: emulate the race by flagging a renamed file as being flushed to disk and throwing exceptions in the filesystem code.

Figure 2: Ganesha's approach.

Training. We apply machine-learning techniques to learn the K profiles that capture Hadoop's behavior. We model our training data (a collection of sadc-vector time-series from fault-free experimental runs) as a mixture of K Gaussian distributions. The fault-free training data is used to compute the parameters (means and covariance matrices) of the K Gaussians. We enforce an equal prior over the K Gaussians, since the prior distributions of the K Gaussians may differ over different workloads. We do not assume that our training data is labeled, i.e., we do not know, a priori, which of the K Gaussians each gathered sadc-vector is associated with. Instead, we use the expectation-maximization (EM) algorithm [6] to learn the values of the unknown parameters in the mixture of Gaussians. Since the convergence time of the EM algorithm depends on the goodness of the initial values of these unknown parameters, we use K-means clustering to determine the initial values for the EM algorithm. In fact, we run the K-means clustering multiple times, with different initializations, in order to choose the best resulting centroid values (i.e., those with minimum distortion) as the initial values for the EM algorithm. The output of the EM algorithm consists of the means and covariance matrices, (µi, Σi), of each of the K Gaussians. We chose a value of K = 7 in our experiments (see footnote 2).

Deployment. Our test data consists of sadc-vectors collected from the n slave nodes for a single job. At every sampling interval, Ganesha classifies the test sadc-vector samples from each slave node into one of the K profiles, i.e., each test sadc-vector is mapped to the best Gaussian, (µi, Σi). If the test sadc-vector differs significantly from all of the K Gaussians, it is classified as unknown. For each of the n slave nodes, we maintain a histogram of all of the Gaussian labels seen so far. Upon receiving a new classified sadc-vector for a slave node j, Ganesha incrementally updates the associated histogram, Hj, as follows: the histogram count values of all the labels are multiplied by an exponential decay factor, and 1 is added to the count value of the label that classifies the current sadc-vector.
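The training and classification steps above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: we assume scikit-learn's GaussianMixture (the paper does not name a library), which performs K-means initialization followed by EM but, unlike the paper, does not constrain the K component priors to be equal; the "unknown" cutoff and the decay factor are placeholder choices.

import numpy as np
from sklearn.mixture import GaussianMixture

K = 7          # number of profiles (chosen via the elbow rule, see footnote 2)
UNKNOWN = K    # extra label for sadc-vectors that fit none of the K Gaussians

def train_profiles(train_vectors, n_init=10):
    """Fit K Gaussians to fault-free sadc-vectors (rows = samples, columns = the 14 metrics).
    K-means initialization and EM are both handled by GaussianMixture; note that, unlike the
    paper, the component priors are not constrained to be equal in this sketch."""
    gmm = GaussianMixture(n_components=K, covariance_type="full",
                          init_params="kmeans", n_init=n_init, random_state=0)
    gmm.fit(train_vectors)
    # Use a low quantile of the training log-likelihood as an "unknown" cutoff
    # (illustrative; the paper only says "differs significantly from all K Gaussians").
    cutoff = np.quantile(gmm.score_samples(train_vectors), 0.01)
    return gmm, cutoff

def classify(gmm, cutoff, sadc_vector):
    """Map one sadc-vector to the best of the K profiles, or to UNKNOWN."""
    x = np.asarray(sadc_vector, dtype=float).reshape(1, -1)
    if gmm.score_samples(x)[0] < cutoff:
        return UNKNOWN
    return int(gmm.predict(x)[0])

def update_histogram(hist, label, decay=0.95):
    """Exponentially decayed histogram update for one slave node (Section 4.1):
    all counts decay, then the bin of the newly observed label is incremented."""
    hist *= decay
    hist[label] += 1.0
    return hist

# Usage sketch: one histogram of length K+1 (K profiles + UNKNOWN) per slave node.
# hists = [np.zeros(K + 1) for _ in range(n)]
# hists[j] = update_histogram(hists[j], classify(gmm, cutoff, latest_vector_from_node_j))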
From our peer-symmetry hypothesis, slave nodes should exhibit similar behavior over moderately long durations; thus, we expect the histograms to be similar across all of the n slave nodes. If a slave node's histogram differs from those of the other nodes in a statistical sense, then Ganesha can indict that odd-slave-out as the culprit. To accomplish this, at each time instant, we perform a pairwise comparison of the histogram, Hj, with the remaining histograms, Hl (l ≠ j), of the other slave nodes. The square root of the Jensen-Shannon divergence, which is a symmetric version of the Kullback-Leibler divergence and is known to be a metric (see footnote 3), is used as the distance measure to compute the pairwise histogram distance between slave nodes. An alarm is raised for a slave node if its pairwise distance is more than a threshold value with more than (n-1)/2 of the slave nodes. An alarm is treated merely as a suspicion; repeated alarms are needed for indicting a node. Thus, Ganesha maintains an exponentially weighted alarm count for each of the slave nodes in the system. Ganesha indicts a node as the culprit if that node's exponentially weighted alarm count exceeds a predefined threshold value.

Footnote 2: We discovered that, as we increased K, the mean squared error of K-means decreases rapidly for K < 7 and slowly for K > 7. Our choice of K = 7 is guided by the elbow rule-of-thumb: an additional cluster does not add significant information.
Footnote 3: A distance between two objects is a metric if it exhibits symmetry, the triangle inequality, and non-negativity.
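A sketch of the peer-comparison and indictment logic just described, again as our own illustrative reconstruction rather than Ganesha's code. SciPy's jensenshannon conveniently returns the square root of the Jensen-Shannon divergence, which is exactly the distance used here; the distance threshold, decay factor and indictment threshold below are placeholders, since the paper does not report its settings.

import numpy as np
from scipy.spatial.distance import jensenshannon  # returns sqrt(JS divergence)

def raise_alarms(hists, dist_threshold):
    """Return one boolean alarm per node: True if the node's label histogram lies farther
    than dist_threshold from more than (n-1)/2 of its peers."""
    n = len(hists)
    # Normalize histograms to probability distributions before comparing.
    dists = [h / h.sum() if h.sum() > 0 else np.full_like(h, 1.0 / len(h)) for h in hists]
    alarms = []
    for j in range(n):
        far_peers = sum(1 for l in range(n)
                        if l != j and jensenshannon(dists[j], dists[l]) > dist_threshold)
        alarms.append(far_peers > (n - 1) / 2)
    return alarms

def update_alarm_counts(alarm_counts, alarms, decay=0.9):
    """Exponentially weighted alarm count per node: repeated alarms are needed to indict."""
    for j, alarmed in enumerate(alarms):
        alarm_counts[j] = decay * alarm_counts[j] + (1.0 if alarmed else 0.0)
    return alarm_counts

def indict(alarm_counts, indict_threshold):
    """Indict the node(s) whose exponentially weighted alarm count exceeds the threshold."""
    return [j for j, c in enumerate(alarm_counts) if c > indict_threshold]

# Usage sketch, once per sampling interval (threshold values are illustrative only):
# alarms = raise_alarms(hists, dist_threshold=0.3)
# alarm_counts = update_alarm_counts(alarm_counts, alarms)
# culprits = indict(alarm_counts, indict_threshold=5.0)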
Figure 3: Diagnosis results for Ganesha on faults injected in Hadoop, for fault-workload pairs (TP and FP ratios per injected fault, on the 10-node and 50-node clusters).

5. EXPERIMENTAL VALIDATION
We analyzed system metrics from two clusters (with 10 and 50 slave nodes, respectively) running Hadoop. Each node consisted of an AMD Opteron 1220 dual-core CPU with 4GB of memory, Gigabit Ethernet, and a dedicated 320GB disk for Hadoop, running amd64 Debian/GNU Linux 4.0. We selected our candidate faults from real-world problems reported by Hadoop users and developers in: (i) the Hadoop issue tracker [4] from February 2007 to February 2009, and (ii) 40 postings from the Hadoop users mailing list from September to November. We describe our results for the injection of the six specific faults listed in Table 2. For each injected fault, we collected 20 traces on the 10-node cluster and 6 traces on the 50-node cluster, with Hadoop's speculative execution enabled for half of the traces. We demonstrate that Ganesha is able to detect three of the injected faults (CPUHog, DiskHog and HANG-1036), and discuss Ganesha's shortcomings at detecting the other three faults (PacketLoss5, PacketLoss50 and HANG-1152).

5.1 Results
We evaluated Ganesha's approach using the true-positive (TP) and false-positive (FP) ratios across all runs for each fault. Figure 3 summarizes our results. A node with an injected fault that is correctly indicted is a true-positive, while a node without an injected fault that is incorrectly indicted is a false-positive. Thus, the true-positive and false-positive ratios are computed as:

TP = (# faulty nodes correctly indicted) / (# nodes with injected faults)
FP = (# nodes without faults incorrectly indicted) / (# nodes without injected faults)

Speculative execution did not have a significant effect on our results; thus, we present results without discriminating experiments by whether speculative execution was enabled. Figure 3 demonstrates that Ganesha was able to achieve almost perfect TP ratios of 1.0 and FP ratios of 0.0 for CPUHog, DiskHog, and HANG-1036 on all cluster sizes. For PacketLoss50, Ganesha achieved high TP and FP ratios. Conversely, both TP and FP ratios for HANG-1152 and PacketLoss5 were low. We discuss these results in Section 6. In addition, we compute the false-alarm rate to be the proportion of slave nodes indicted in fault-free runs; this turns out to be 0.03 in both the 10-node and 50-node cases. These low false-alarm rates across both cluster sizes suggest that, when nodes are indicted by Ganesha, a fault is truly present in the system, albeit not necessarily at the node(s) indicted by Ganesha.
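For concreteness, the TP and FP ratios amount to the following counting over sets of node identifiers (a trivial sketch; the function and variable names are ours).

def tp_fp_ratios(indicted, faulty, all_nodes):
    """Compute the TP and FP ratios defined above from sets of node identifiers:
    indicted = nodes indicted by Ganesha, faulty = nodes with injected faults."""
    fault_free = all_nodes - faulty
    tp = len(indicted & faulty) / len(faulty) if faulty else 0.0
    fp = len(indicted & fault_free) / len(fault_free) if fault_free else 0.0
    return tp, fp

# Example: 10 nodes, fault injected at node 3, Ganesha indicts nodes 3 and 7:
# tp_fp_ratios({3, 7}, {3}, set(range(10)))  ->  (1.0, 1/9)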
6. DISCUSSIONS
Here, we candidly discuss some of the injected faults that Ganesha was less effective at diagnosing. Ganesha achieved low true-positive ratios at diagnosing HANG-1152, indicating that our view of the faulty node was obscured such that, under the injected fault, the behavior of the faulty node did not deviate significantly from that of the fault-free nodes. This is due to the relationship between the workload and the injected fault: specifically, the fault occurs late in the processing cycle, during the later part of a Reduce. The manifestations of HANG-1152 are internal to the processing of each Reduce, and are triggered only near the end of each Map-to-Reduce processing cycle, by which point the jobs would have ended, thus squelching the manifestation of the fault before it can cause significant deviation in behavior on the faulty node. This is in contrast to HANG-1036, a hang triggered at the beginning of Map processing, which we detect with an almost perfect true-positive ratio. This is a limitation of diagnosing using only black-box operating-system performance counters, which prevents us from utilizing application-specific information to obtain finer-grained views. However, we have had more success at diagnosing application hangs using finer-grained white-box information from Hadoop's natively generated logs [9], and we intend to study how white-box and black-box information can be jointly synthesized to diagnose failures more effectively. Ganesha also achieved a high FP ratio for PacketLoss50, reflecting the highly correlated nature of the fault. Nodes that attempted to communicate with the culprit node also exhibited a slowdown in activity as they waited on the culprit node. A non-culprit slave node appears anomalous, as compared to other slave nodes, whenever it communicates with the culprit node. Hence, Ganesha indicts a significant number of non-culprit nodes as well. However, by using white-box metrics extracted from Hadoop's logs, we show in [11] that we can identify the culprit node with high accuracy, even under such correlated faults. On the other hand, Ganesha achieves low TP and FP ratios for PacketLoss5, because TCP is resilient against minor packet losses (though not against major packet losses). Hence, the minor packet losses are masked by TCP's own reliable-delivery mechanisms, and the faulty node does not deviate significantly from other nodes.
7. RELATED WORK
We compare Ganesha with two sets of related work. First, we discuss other work on performance debugging for MapReduce systems. We then discuss other failure-diagnosis techniques, most of which target multi-tiered enterprise systems.

Performance debugging for MapReduce. X-Trace [7] is an integrated tracing framework that requires modifications to code and network protocols. It has been applied to Hadoop to build and visualize request paths. Artemis [8] is a framework for collecting, storing and visualizing logs, and allows for plug-in statistical modules for analysis. Ganesha could potentially be implemented as one such module. However, [7, 8] do not automatically diagnose failures in MapReduce systems; instead, they present information that aids the user in debugging performance problems. Our prior work, SALSA [9], also performs automated failure diagnosis, but uses white-box metrics extracted from Hadoop's logs. SALSA and Ganesha are complementary approaches; [11] shows how the two may be integrated to achieve better diagnosis results. Mochi [10] builds on the state-machine models extracted by SALSA to build conjoined causal control- and data-flow paths of Hadoop's execution.

Other failure-diagnosis work. Ganesha collects black-box performance counters from the OS. It models normal behavior as that observed on the majority of slave nodes; that is, Ganesha performs peer comparison for failure diagnosis. Like Ganesha, [12, 13] also collect resource-utilization metrics. However, this is done at the thread level in [12]. Further, [12] captures events at the kernel, middleware and application layers. An anomaly is detected when a new request does not match any previously observed cluster of requests. [13] requires additional measurement of whether service-level objectives are violated. Problems are identified by comparing their signatures to previously observed clusters of signatures. However, MapReduce systems allow for arbitrary user code and thus do not lend themselves well to the historical comparisons used in [12, 13]. [14, 15] collect only request-path information. [14] infers paths from unmodified messaging-layer messages; [15] collects messages at the middleware layer. [14] aims to isolate performance bottlenecks rather than detect anomalies. [15] performs historical and peer comparison of path shapes to detect anomalies. However, paths in MapReduce follow the same Map-to-Reduce shape (except in straightforward cases of outright crashes or errors). Performance problems are not likely to result in changes in path shape, thus rendering a technique based on path shapes less useful in a MapReduce setting than in the multi-tiered enterprise systems targeted in [15].

8. CONCLUSION AND FUTURE WORK
We describe Ganesha, a black-box diagnosis technique that examines OS-level metrics to detect and diagnose faults in MapReduce systems. Ganesha relies on experimental observations of MapReduce behavior to diagnose faults. Two experimentally based key hypotheses, peer-symmetry of slave nodes and a small number of profiles of MapReduce behavior, drive Ganesha's diagnosis algorithm. We are currently extending Ganesha to diagnose correlated faults by using data- and control-flow dependencies extracted by our Hadoop log-analysis tools [9]. Also, we are integrating Ganesha with other diagnosis techniques to enable improved diagnosis [11].
In addition, Ganesha's learning phase assumes metrics with Gaussian distributions; we plan to investigate whether the diagnosis can be improved by using other, possibly non-parametric, forms of clustering. We also expect to run Ganesha online by deploying its algorithms within our ASDF online problem-diagnosis framework [16].

9. REFERENCES
[1] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, San Francisco, CA, Dec.
[2] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. In SOSP, pp. 29-43, Lake George, NY, Oct.
[3] Hadoop.
[4] Apache's JIRA issue tracker.
[5] S. Godard. SYSSTAT.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38.
[7] R. Fonseca, G. Porter, R. Katz, S. Shenker, and I. Stoica. X-Trace: A pervasive network tracing framework. In NSDI, Cambridge, MA, Apr.
[8] G. F. Cretu-Ciocarlie, M. Budiu, and M. Goldszmidt. Hunting for problems with Artemis. In USENIX Workshop on Analysis of System Logs, San Diego, CA, Dec.
[9] J. Tan, X. Pan, S. Kavulya, R. Gandhi, and P. Narasimhan. SALSA: Analyzing logs as state machines. In USENIX Workshop on Analysis of System Logs, San Diego, CA, Dec.
[10] J. Tan, X. Pan, S. Kavulya, R. Gandhi, and P. Narasimhan. Mochi: Visual log-analysis based tools for debugging Hadoop. In HotCloud, San Diego, CA, Jun.
[11] X. Pan. Blind Men and the Elephant: Piecing Together Hadoop for Diagnosis. Master's thesis, Technical Report CMU-CS, Carnegie Mellon University, May 2009.
[12] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for request extraction and workload modelling. In OSDI, San Francisco, CA, Dec.
[13] I. Cohen, S. Zhang, M. Goldszmidt, J. Symons, T. Kelly, and A. Fox. Capturing, indexing, clustering, and retrieving system history. In SOSP, Brighton, U.K., Oct.
[14] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance debugging for distributed systems of black boxes. In SOSP, pp. 74-89, Bolton Landing, NY, Oct.
[15] E. Kiciman and A. Fox. Detecting application-level failures in component-based internet services. IEEE Trans. on Neural Networks, 16(5), Sep.
[16] K. Bare, M. Kasick, S. Kavulya, E. Marinelli, X. Pan, J. Tan, R. Gandhi, and P. Narasimhan. ASDF: Automated online fingerpointing for Hadoop. Technical Report CMU-PDL-08-104, Carnegie Mellon University, May 2008.