SALSA: Analyzing Logs as StAte Machines
Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi and Priya Narasimhan
Electrical & Computer Engineering Department, Carnegie Mellon University
{jiaqit, xinghaop, spertet, rgandhi, priyan}@andrew.cmu.edu

Abstract

SALSA examines system logs to derive state-machine views of the system's execution, along with control-flow and data-flow models and related statistics. Exploiting SALSA's derived views and statistics, we can effectively construct higher-level useful analyses. We demonstrate SALSA's approach by analyzing system logs generated in a Hadoop cluster, and then illustrate SALSA's value by developing visualization and failure-diagnosis techniques, for three different Hadoop workloads, based on our derived state-machine views and statistics.

1 Introduction

Most software systems collect logs of programmer-generated messages for various uses, such as troubleshooting, tracking user requests (e.g., HTTP access logs), etc. These logs typically contain unstructured free-form text, making them relatively harder to analyze than numerical system data (e.g., CPU usage). However, logs often contain semantically richer information than numerical system/resource-utilization statistics, since the log messages often capture the intent of the programmer of the system to record events of interest.

SALSA, our approach to automated system-log analysis, involves examining logs to trace control-flow and data-flow execution in a distributed system, and to derive state-machine-like views of the system's execution on each node. Figure 1 depicts the core of SALSA's approach. As log data is only as accurate as the programmer who implemented the logging points in the system, we can only infer the state-machines that execute within the target system. We cannot (from the logs), and do not, attempt to verify whether our derived state-machines faithfully capture the actual ones executing within the system.
Instead, we leverage these derived state-machines to support different kinds of useful analyses: to understand/visualize the system's execution, to discover data-flows in the system, to discover bugs, and to localize performance problems and failures. To the best of our knowledge, SALSA is the first log-analysis technique that aims to derive state-machine views from unstructured text-based logs, to support visualization, failure-diagnosis and other uses.¹

Figure 1: SALSA's approach. System logs (from all nodes) yield control-flow and data-flow event traces, from which SALSA derives state-machine views of the system's control- and data-flows; these views support failure diagnosis and visualization.

In this paper, we apply SALSA's approach to the logs generated by Hadoop [7], the open-source implementation of MapReduce [5]. Concretely, our contributions are (i) a log-analysis approach that extracts state-machine views of a distributed system's execution, with both control-flow and data-flow, (ii) a usage scenario where SALSA is beneficial in preliminary failure diagnosis for Hadoop, and (iii) a second usage scenario where SALSA enables the visualization of Hadoop's distributed behavior.

¹This work is supported in part by the NSF CAREER Award CCR , NSF Award CCF , and the Army Research Office grant number DAAD ("Perpetually Available and Secure Information Systems") to the Center for Computer and Communications Security at Carnegie Mellon University.

2 SALSA's Approach

SALSA aims to analyze the target system's logs to derive the control-flow on each node, the data-flow across nodes, and the state-machine execution of the system on each node. When parsing the logs, SALSA also extracts key statistics (state durations, inter-arrival times of events, etc.) of interest. To demonstrate SALSA's value, we exploit the SALSA-derived state-machine views and their related statistics for visualization and failure diagnosis. SALSA does not require any modification of the hosted applications, middleware or operating system.
To describe SALSA's high-level operation, consider a distributed system with many producers, P1, P2, ..., and many consumers, C1, C2, .... Many producers and consumers can be running on any host at any point in time. Consider one execution trace of two tasks, P1 and C1, on a host X (and task P2 on host Y), as captured by a sequence of time-stamped log entries at host X:

[t1] Begin Task P1
[t2] Begin Task C1
[t3] Task P1 does some work
[t4] Task C1 waits for data from P1 and P2
[t5] Task P1 produces data
[t6] Task C1 consumes data from P1 on host X
[t7] Task P1 ends
[t8] Task C1 consumes data from P2 on host Y
[t9] Task C1 ends

From the log, it is clear that the executions (control-flows) of P1 and C1 interleave on host X. It is also clear that the log captures a data-flow for C1 with P1 and P2.

SALSA interprets this log of events/activities as a sequence of states. For example, SALSA considers the period [t1, t7] to represent the duration of state P1 (where a state has well-defined entry and exit points corresponding to the start and the end, respectively, of task P1). Other states that can be derived from this log include the state C1, the data-consume state for C1 (the period during which C1 is consuming data from its producers, P1 and P2), etc. Based on these derived state-machines (in this case, one for P1 and another for C1), SALSA can derive interesting statistics, such as the durations of states. SALSA can then compare these statistics and the sequences of states across hosts in the system. In addition, SALSA can extract data-flow models, e.g., the fact that C1 depends on data from its local host, X, as well as a remote host, Y. The data-flow model can be useful to visualize and examine any data-flow bottlenecks or dependencies that can cause failures to escalate across hosts.

Non-Goals. We do not seek to validate or improve the accuracy or the completeness of the logs, nor to validate our derived state-machines against the actual ones of the target system. Rather, our focus has been on the analyses that we can perform on the logs in their existing form. It is not our goal, either, to demonstrate complete use cases for SALSA. For example, while we demonstrate one application of SALSA for failure diagnosis, we do not claim that this failure-diagnosis technique is complete or perfect. It is merely illustrative of the types of useful analyses that SALSA can support. Finally, while we can support an online version of SALSA that would analyze log entries generated as the system executes, the goal of this paper is not to describe such an online log-analysis technique or its runtime overheads.
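To make the state interpretation of the example trace concrete, the sketch below pairs each task's begin/end entries into state intervals. The simplified log tuples and the `derive_states` helper are our own illustrative names, not part of SALSA:

```python
import re

# Simplified, hypothetical entries mirroring the example trace on host X.
LOG = [
    (1, "Begin Task P1"),
    (2, "Begin Task C1"),
    (3, "Task P1 does some work"),
    (4, "Task C1 waits for data from P1 and P2"),
    (5, "Task P1 produces data"),
    (6, "Task C1 consumes data from P1 on host X"),
    (7, "Task P1 ends"),
    (8, "Task C1 consumes data from P2 on host Y"),
    (9, "Task C1 ends"),
]

def derive_states(log):
    """Pair each task's 'Begin'/'ends' entries into a state interval."""
    begin, states = {}, {}
    for t, msg in log:
        m = re.match(r"Begin Task (\w+)", msg)
        if m:
            begin[m.group(1)] = t
        m = re.match(r"Task (\w+) ends", msg)
        if m:
            task = m.group(1)
            states[task] = (begin.pop(task), t)  # (entry time, exit time)
    return states

print(derive_states(LOG))  # {'P1': (1, 7), 'C1': (2, 9)}
```

The interval for P1 spans its entry and exit points, and the durations of such intervals are exactly the per-state statistics that SALSA compares across hosts.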
In this paper, we use SALSA in an offline manner, to analyze logs incrementally.

Assumptions. We assume that the logs faithfully capture events and their causality in the system's execution. For instance, if the log declares that event X happened before event Y, we assume that this is indeed the case as the system executes. We assume that the logs record each event's timestamp with integrity, and as close in time (as possible) to when the event actually occurred in the sequence of the system's execution. Again, we recognize that, in practice, the preemption of the system's execution might cause a delay between the occurrence of an event X and the corresponding log message (and timestamp generation) for entry into the log. We do not expect the occurrence of an event and the recording of its timestamp/log-entry to be atomic. However, we do assume that clocks are loosely synchronized across hosts for correlating events across logs from different hosts.

3 Related Work

Event-based analysis. Many studies of system logs treat them as sources of failure events. Log analysis of system errors typically involves classifying log messages based on the preset severity level of the reported error, and on tokens and their positions in the text of the message [14] [11]. More sophisticated analysis has included the study of the statistical properties of reported failure events to localize and predict faults [15] [11] [9], and mining patterns from multiple log events [8]. Our treatment of system logs differs from such techniques that treat logs as purely a source of events: we impose additional semantics on the log events of interest, to identify durations in which the system is performing a specific activity. This provides context of the temporal state of the system that a purely event-based treatment of logs would miss, and this context alludes to the operational context suggested in [14], albeit at the level of the control-flow context of the application rather than a managerial one.
Also, since our approach takes log semantics into consideration, we can produce views of the data that can be intuitively understood. However, we note that our analysis is amenable only to logs that capture both normal system-activity events and errors.

Request tracing. Our view of system logs as providing a control-flow perspective of system execution, when coupled with log messages that have unique identifiers for the relevant request or processing task, allows us to extract request-flow views of the system. Much work has been done to extract request-flow views of systems, and these request-flow views have then been used to diagnose and debug performance problems in distributed systems [2] [1]. However, [2] used instrumentation in the application and middleware to track requests and explicitly monitor the states that the system goes through, while [1] extracted causal flows from messages in a distributed system using J2EE instrumentation developed by [4]. Our work differs from these request-flow tracing techniques in that, given system logs, we can causally extract request flows of the system without added instrumentation, as described in Section 7.

Log-analysis tools. Splunk [10] treats logs as searchable text indexes, and generates visualizations of the log; Splunk treats logs similarly to other log-analysis techniques, considering each log entry as an event. There exist commercial and open-source [3] tools for visualizing the data in logs based on standardized logging mechanisms, such as log4j [12]. To the best of our knowledge, none of these tools derive the control-flow, data-flow and state-machine views that SALSA does.
Figure 2: Architecture of Hadoop, showing the locations of the system logs of interest to us. The master host runs the JobTracker and NameNode daemons; each slave host runs a TaskTracker and a DataNode, with the DataNodes serving HDFS data blocks. Logging statements in the Hadoop source code, e.g.,

LOG.info("LaunchTaskAction " + t.getTaskId());
LOG.info(reduceId + " Copying " + loc.getTaskId() + " output from " + loc.getHost() + ".");

generate the TaskTracker log on each slave.

4 Hadoop's Architecture

Hadoop [7] is an open-source implementation of Google's MapReduce [5] framework that enables distributed, data-intensive, parallel applications by decomposing a massive job into smaller tasks and a massive data-set into smaller partitions, such that each task processes a different partition in parallel. The main abstractions are (i) Map tasks that process the partitions of the data-set using key/value pairs to generate a set of intermediate results, and (ii) Reduce tasks that merge all intermediate values associated with the same intermediate key. Hadoop uses the Hadoop Distributed File System (HDFS), an implementation of the Google Filesystem [16], to share data amongst the distributed tasks in the system. HDFS splits and stores files as fixed-size blocks (except for the last block).

Hadoop has a master-slave architecture (Figure 2), with a unique master host and multiple slave hosts, typically configured as follows. The master host runs two daemons: (1) the JobTracker, which schedules and manages all of the tasks belonging to a running job; and (2) the NameNode, which manages the HDFS namespace, and regulates access to files by clients (which are typically the executing tasks).
Each slave host runs two daemons: (1) the TaskTracker, which launches tasks on its host, based on instructions from the JobTracker; the TaskTracker also keeps track of the progress of each task on its host; and (2) the DataNode, which serves data blocks (that are stored on its local disk) to HDFS clients.

4.1 Logging Framework

Hadoop uses the Java-based log4j logging utility to capture logs of Hadoop's execution on every host. log4j is a commonly used mechanism that allows developers to generate log entries by inserting statements into the code at various points of execution. By default, Hadoop's log4j configuration generates a separate log for each of the daemons (the JobTracker, NameNode, TaskTracker and DataNode), each log being stored on the local file-system of the executing daemon (typically, 2 logs on each slave host and 2 logs on the master host).

Figure 3: log4j-generated TaskTracker log entries; dependencies on task execution on local and remote hosts are captured by the TaskTracker log (timestamps elided):

INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction task_0001_m_000003_0
INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r__0 Copying task_0001_m_000001_0 output from fp30.pdl.cmu.local

Figure 4: log4j-generated DataNode log; local and remote data dependencies are captured. Logging statements in the Hadoop source code, such as

LOG.debug("Number of active connections is " + xceiverCount);
LOG.info("Received block " + b + " from " + s.getInetAddress() + " and mirrored to " + mirrorTarget);
LOG.info("Served block " + b + " to " + s.getInetAddress());

produce DataNode-log entries such as (timestamps, block IDs and addresses elided):

INFO org.apache.hadoop.dfs.DataNode: Number of active connections is ...
INFO org.apache.hadoop.dfs.DataNode: Received block blk_... from /... and mirrored to /...
INFO org.apache.hadoop.dfs.DataNode: Served block blk_... to /...
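Entries of this form can be split into their fields with a small parser. The sketch below assumes a "date,millis level class: message" layout like the log4j output shown in Figures 3 and 4; the sample timestamp is purely illustrative:

```python
import re

# Assumed layout: "2008-09-15 10:20:30,466 INFO org.apache.hadoop.mapred.TaskTracker: message"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<ms>\d{3}) "
    r"(?P<level>[A-Z]+) (?P<class>\S+): (?P<msg>.*)$"
)

def parse_entry(line):
    """Split one log4j-style line into named fields, or None if it doesn't match."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None

entry = parse_entry(
    "2008-09-15 10:20:30,466 INFO org.apache.hadoop.mapred.TaskTracker: "
    "LaunchTaskAction task_0001_m_000003_0"
)
print(entry["level"], entry["msg"])  # INFO LaunchTaskAction task_0001_m_000003_0
```

Parsing each entry into (timestamp, level, class, message) fields is the first step before the token matching of Section 5.2 can be applied to the message text.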
Typically, logs (such as syslogs) record events in the system, as well as error messages and exceptions. Hadoop's logging framework is somewhat different since it also checkpoints execution: it captures the execution status (e.g., what percentage of a Map or a Reduce has been completed so far) of all Hadoop jobs and tasks on every host. Hadoop's default log4j configuration generates time-stamped log entries with a specific format. Figure 3 shows a snippet of a TaskTracker log, and Figure 4 a snippet of a DataNode log.

5 Log Analysis

To demonstrate SALSA's approach, we focus on the logs generated by Hadoop's TaskTracker and DataNode daemons. The number of these daemons (and, thus, the number of corresponding logs) increases with the size of a Hadoop cluster, inevitably making it more difficult to analyze the associated set of logs manually. Thus, the TaskTracker and DataNode logs are attractive first targets for SALSA's automated log-analysis.

Figure 5: Derived control-flow for Hadoop's execution. The TaskTracker records events for all Map and Reduce tasks on its node. Each Map task's control-flow: [t] Launch Task; [t] Copy Map outputs (to Reduce tasks on this or other nodes); [t] Task Done. Each Reduce task's control-flow: [t] Launch Task; [t] Idle, waiting for Map outputs; [t] repeat until all Map outputs are copied: [t] Start Copy (of a completed Map's output), [t] Finish Copy; [t] MergeCopy; [t] Merge Sort; [t] User Reduce; [t] Task Done. The Copies of incoming Map outputs for a task can be interleaved.

At a high level, each TaskTracker log records events/activities related to the TaskTracker's execution of Map and Reduce tasks on its local host, as well as any dependencies between locally executing Reduces and Map outputs from other hosts. On the other hand, each DataNode log records events/activities related to the reading or writing (by both local and remote Map and Reduce tasks) of HDFS data-blocks that are located on the local disk. This is evident in Figures 3 and 4.

5.1 Derived Control-Flow

TaskTracker log. The TaskTracker spawns a new JVM for each Map or Reduce task on its host. Each Map task is associated with a Reduce task, with the Map's output being consumed by its associated Reduce. The Map and Reduce tasks are synchronized at the Copy activities in each of the two types of tasks, when the Map task's output is copied from its host to the host executing the associated Reduce. The Maps on one node can thus be synchronized with a Reduce on a different node; SALSA derives this distributed control-flow across all Hadoop hosts in the cluster by collectively parsing all of the hosts' TaskTracker logs. Based on the TaskTracker log, SALSA derives a state-machine for each unique Map or Reduce in the system. Each log-delineated activity within a task corresponds to a state.

DataNode log.
The DataNode daemon runs three main types of data-related threads: (i) ReadBlock, which serves blocks to HDFS clients, (ii) WriteBlock, which receives blocks written by HDFS clients, and (iii) WriteBlock_Replicated, which receives blocks written by HDFS clients that are subsequently transferred to another DataNode for replication. The DataNode daemon runs in its own independent JVM, and the daemon spawns a new JVM thread for each thread of execution. Based on the DataNode log, SALSA derives a state-machine for each of the unique data-related threads on each host. Each log-delineated activity within a data-related thread corresponds to a state.

5.2 Tokens of Interest

SALSA can uniquely delineate the starts and ends of key activities (or states) in the TaskTracker logs. Table 1 lists the tokens that we use to identify states in the TaskTracker log. [MapID] and [ReduceID] denote the identifiers used by Hadoop in the TaskTracker logs to uniquely identify Maps and Reduces. The starts and ends of the Sort and User Reduce states in the Reduce task were not identifiable from the TaskTracker logs; the log entries only identified that these states were in progress, but not when they had started or ended. Additionally, the Map-side Copy processing activity is reported by Hadoop's logs as part of the Map task, and is currently indistinguishable from it.

SALSA was able to identify the starts and ends of the data-related threads in the DataNode logs with a few provisions: (i) Hadoop had to be reconfigured to use DEBUG instead of its default INFO logging level, in order for the starts of states to be generated, and (ii) all states were assumed to complete in a First-In First-Out (FIFO) ordering. Each data-related thread in the DataNode log is identified by the unique identifier of the HDFS data block. The log messages identifying the ends of states in the DataNode logs are listed in Table 2.

5.3 Data-Flow in Hadoop

A data-flow dependency exists between two hosts when an activity on one host requires transferring data to/from another node.
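The FIFO provision of Section 5.2 can be realized by pairing the i-th start message of each thread type with its i-th end message. A minimal sketch, where the event tuples are simplified stand-ins for parsed DataNode-log entries (it assumes every end has a matching prior start):

```python
from collections import defaultdict, deque

def match_fifo(events):
    """events: list of (timestamp, kind, 'start'|'end') tuples.
    Pairs starts and ends of each kind in FIFO order; returns per-kind durations."""
    pending = defaultdict(deque)    # unmatched start timestamps, per kind
    durations = defaultdict(list)
    for t, kind, edge in events:
        if edge == "start":
            pending[kind].append(t)
        else:  # 'end': match the oldest unmatched start of this kind
            start = pending[kind].popleft()
            durations[kind].append(t - start)
    return durations

events = [
    (0, "ReadBlock", "start"),
    (2, "ReadBlock", "start"),
    (5, "ReadBlock", "end"),   # matches the start at t=0 (FIFO)
    (6, "ReadBlock", "end"),   # matches the start at t=2
]
print(dict(match_fifo(events)))  # {'ReadBlock': [5, 4]}
```

In practice the HDFS block identifier in each message can disambiguate most pairings, with the FIFO rule resolving the remainder.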
The DataNode daemon acts as a server, receiving blocks from clients that write to its disk, and sending blocks to clients that read from its disk. Thus, data-flow dependencies exist between each DataNode and each of its clients, for each of the ReadBlock and WriteBlock states. SALSA is able to identify the data-flow dependencies on a per-DataNode basis by parsing the hostnames jointly with the log messages in the DataNode log. Data exchanges also occur to transfer the outputs of completed Maps to their associated Reduces in the Copy phases. This dependency is captured, along with the hostnames of the source and destination hosts involved in the Map-output transfer. Tasks also act as clients of the DataNode in reading Map inputs and writing Reduce outputs to HDFS; however, these activities are not recorded in the TaskTracker logs, so these data-flow dependencies are not captured.

Processing Activity | Start Token | End Token
Map | LaunchTaskAction [MapID] | Task [MapID] is done.
Map Copy | N/A | N/A
Reduce Idle | LaunchTaskAction [ReduceID] | Task [TaskID] is done.
Reduce Copy | [ReduceID] Copying [MapID] output from [Hostname] | [ReduceID] done copying [TaskID] output from [Hostname].
MergeCopy | [ReduceID] Thread started: Thread for merging in memory files | [ReduceID] Merge of the 3 files in InMemoryFileSystem complete. Local file is [Filename]
Sort | N/A | N/A
User Reduce | N/A | N/A

Table 1: Tokens in TaskTracker-log messages for identifying the starts and ends of states.

Processing Activity | End Token
ReadBlock | Served block [BlockID] to [Hostname]
WriteBlock | Received block [BlockID] from [Hostname]
WriteBlock_Replicated | Received [BlockID] from [Hostname] and mirrored to [Hostname]

Table 2: Tokens in DataNode-log messages for identifying the ends of data-related threads.

5.4 Extracted Metrics & Data

We extract multiple statistics from the log data, based on SALSA's derived state-machine approach. We extract statistics for the Map, Reduce, Copy and MergeCopy states:

- histograms and average durations of unidentified, concurrent states, with events coalesced by time, allowing events to superimpose each other in a time-series;
- histograms and exact task-specific durations of states, with events identified by task identifier in a time-series;
- durations of the completed-so-far execution of ongoing task-specific states.

We cannot get average times for the User Reduce and Sort states because these have no well-defined start and termination events in the log. For each DataNode and TaskTracker log, we can determine the number of each of the states being executed on the particular node at each point in time.
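To make the token matching concrete, here is a minimal sketch of how start/end tokens in the style of Table 1 delineate states and yield per-state durations. The regexes and messages are illustrative stand-ins, not Hadoop's verbatim log text:

```python
import re

# Illustrative (start-regex, end-regex) pairs per state; group 1 is the task ID.
TOKENS = {
    "Map": (re.compile(r"LaunchTaskAction (task_\S+_m_\S+)"),
            re.compile(r"Task (task_\S+_m_\S+) is done")),
}

def delineate(entries, tokens=TOKENS):
    """entries: list of (timestamp, message). Returns {(state, task_id): duration}."""
    starts, durations = {}, {}
    for t, msg in entries:
        for state, (start_re, end_re) in tokens.items():
            m = start_re.search(msg)
            if m:
                starts[(state, m.group(1))] = t
            m = end_re.search(msg)
            if m:
                key = (state, m.group(1))
                durations[key] = t - starts.pop(key)
    return durations

entries = [
    (100, "LaunchTaskAction task_0001_m_000003_0"),
    (160, "Task task_0001_m_000003_0 is done"),
]
print(delineate(entries))  # {('Map', 'task_0001_m_000003_0'): 60}
```

The resulting (state, task) durations are exactly the raw material for the histograms and time-series described above.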
We can also compute the durations of each of the occurrences of each of the following states: (i) Map, Reduce, Copy and MergeCopy for the TaskTracker log, and (ii) ReadBlock, WriteBlock and WriteBlock_Replicated for the DataNode log. On the data-flow side, for each of the ReadBlock and WriteBlock states, we can identify the endpoint host involved in the state, and, for each of the Copy states, the host whose Map output was involved. However, we are unable to compute durations for the User Reduce and Sort states because these have no well-defined start and termination events in the logs.

6 Data Collection & Experimentation

We analyzed traces of system logs from a 6-node (5-slave, 1-master) Hadoop cluster. Each node consisted of an AMD Opteron 1220 dual-core CPU with 4GB of memory, Gigabit Ethernet, and a dedicated 320GB disk for Hadoop, and ran the amd64 version of Debian GNU/Linux 4.0. We used three candidate workloads, of which the first two are commonly used to benchmark Hadoop:

- RandWriter: write 32 GB of random data to disk;
- Sort: sort 3 GB of records;
- Nutch: open-source distributed web crawler for Hadoop [13], representative of a real-world workload.

Each experiment iteration consisted of a Hadoop job lasting approximately 20 minutes. We set the logging level of Hadoop to DEBUG, cleared Hadoop's system logs before each experiment iteration, and collected the logs after the completion of each experiment iteration. In addition, we collected system metrics from /proc to provide ground truth for our experiments.

Target failures. To illustrate the value of SALSA for failure diagnosis in Hadoop, we injected failures into Hadoop, as described in Table 3. A persistent failure was injected into 1 of the 5 slave nodes midway through each experiment iteration. We surveyed real-world Hadoop problems reported by users and developers in 40 postings from the Hadoop users' mailing list from Sep to Nov. We selected two candidate failures from that list to demonstrate the use of SALSA for failure-diagnosis.
7 Use Case 1: Visualization

We present automatically generated visualizations of Hadoop's aggregate control-flow and data-flow dependencies, as well as a conceptualized temporal control-flow chart. These views were generated offline from logs collected for the Sort workload in our experiments. Such visualization of logs can help operators quickly explain and analyze distributed-system behavior.

Symptom | Reported Failure [Source] | Failure Injected [Failure Name]
Processing | CPU bottleneck resulted from running master and slave daemons on the same machine [Hadoop users' mailing list, Sep ] | Emulate a CPU-intensive task that consumes 70% CPU utilization [CPUHog]
Disk | Excessive messages logged to file during startup [Hadoop users' mailing list, Sep ] | Sequential disk workload wrote 20GB of data to filesystem [DiskHog]

Table 3: Failures injected, the resource-symptom category they correspond to, and the reported problem they simulate.

Figure 6: Visualization of aggregate control-flow for Hadoop's execution. Each vertex represents a TaskTracker. Edges are labeled with the number of Copies from the source to the destination vertex.

Figure 7: Visualizing Hadoop's control- and data-flow over time (T1 through T4). Map and Reduce tasks are created as part of the Job, on the same node and on other nodes; the Reduce idles while the Map outputs it requires start to become available; Copies of required Map outputs arrive from other nodes, and from the same node (if required); once all Map outputs required by the Reduce are gathered, the Reduce proceeds through MergeCopy, Sort and User Reduce, until all Maps and all Reduces related to the Job have completed.

Aggregate control-flow dependencies (Figure 6). The key point at which inter-host dependencies arise in Hadoop's derived control-flow model for the TaskTracker log is the Copy state: the Copy on the destination host of a Map's output is started only once the source Map has completed, so the Copy depends on the source copying its Map output.
This visualization captures dependencies among TaskTrackers in a Hadoop cluster, with the number of such Copy dependencies between each pair of nodes aggregated across the entire Hadoop run. As an example, this aggregate view can reveal hotspots of communication, highlighting particular key nodes (if any) on which the overall control-flow of Hadoop's execution hinges. This also visually captures the equity (or lack thereof) of the distribution of tasks in Hadoop.

Aggregate data-flow dependencies (Figure 8). The data-flows in Hadoop can be characterized by the number of blocks read from and written to each DataNode. This visualization is based on an entire run of the Sort workload on our cluster, and summarizes the bulk transfers of data between each pair of nodes. This view would reveal any imbalances of data accesses to any DataNode in the cluster, and also provides hints as to the equity (or lack thereof) of the distribution of workload amongst the Maps and Reduces.

Figure 8: Visualization of aggregate data-flow for Hadoop's execution. Each vertex represents a DataNode, and edges are labeled with the number of each type of block operation (i.e., read, write, or write_replicated) that traversed that path.

Temporal control-flow dependencies (Figure 7). The control-flow view of Hadoop extracted from its logs can be visualized in a manner that correlates state occurrences causally. This visualization provides a time-based view of Hadoop's execution on each node, and also shows the control-flow dependencies amongst nodes. Such views allow for detailed, fine-grained tracing of Hadoop's execution through time, and allow for inter-temporal causality tracing.

8 Use Case 2: Failure Diagnosis

8.1 Algorithm

Intuition. For each task and data-related thread, we can compute the histogram of the durations of its different states in the derived state-machine view.
We have observed that the histograms of a specific state's durations tend to be similar across failure-free hosts, while those on failure-injected hosts tend to differ from those of failure-free nodes. Thus, we hypothesize that failures can be diagnosed by comparing the probability distributions of the durations (as estimated from their histograms) for a given state across hosts, assuming that a failure affects fewer than n/2 hosts in a cluster of n slave hosts.

Figure 9: Failure-diagnosis results of the Distribution-Comparison algorithm for (workload, injected-failure) combinations, across the RandWriter, Sort and Nutch workloads, the CPUHog and DiskHog faults, and the MergeCopy, ReadBlock and WriteBlock metrics; TP = true-positive rate, FP = false-positive rate.

Algorithm. First, for a given state on each node, probability density functions (PDFs) of the distributions of durations are estimated from their histograms using kernel density estimation with a Gaussian kernel [17], to smooth the discrete boundaries in histograms. Then, the difference between these distributions from each pair of nodes is computed as the pair-wise distance between their estimated PDFs. The distance used was the square root of the Jensen-Shannon divergence, a symmetric version of the Kullback-Leibler divergence [6], a commonly used distance metric in information theory to compare PDFs. Then, we constructed the matrix distmatrix, where distmatrix(i, j) is the distance between the estimated distributions on nodes i and j. The entries in distmatrix are compared to a threshold, threshold_p. Each distmatrix(i, j) > threshold_p indicates a potential problem at nodes i, j, and a node is indicted if at least half of its entries distmatrix(i, j) exceed threshold_p.

Algorithm tuning. threshold_p is used for the peer-comparison of PDFs across hosts; for higher values of threshold_p, greater differences must be observed between PDFs before they are flagged as anomalous. By increasing threshold_p, we can reduce false-positive rates, but may suffer a reduction in true-positive rates as well. threshold_p is kept constant for each (workload, metric) combination, and is tuned independently of the failure injected.
8.2 Results & Evaluation

We evaluated our initial failure-diagnosis techniques, based on our derived models of Hadoop's behavior, by examining the rates of true- and false-positives of the diagnosis on hosts in our fault-injected experiments, as described in Section 6. True-positive rates are computed as

count_i(fault injected on node i AND node i indicted) / count_i(fault injected on node i),

i.e., the proportion of failure-injected hosts that were correctly indicted. False-positive rates are computed as

count_i(fault not injected on node i AND node i indicted) / count_i(fault not injected on node i),

i.e., the proportion of failure-free hosts that were incorrectly indicted as faulty. A perfect failure-diagnosis algorithm would achieve a true-positive rate of 1.0 at a false-positive rate of 0.0.

Figure 9 summarizes the performance of our algorithm. By using different metrics, we achieved varied results in diagnosing different failures for different workloads. Much of the difference is due to the fact that the manifestation of the failures on particular metrics is workload-dependent. In general, for each (workload, failure) combination, there are metrics that diagnose the failure with a high true-positive and a low false-positive rate.

We describe some of the (metric, workload) combinations that fared poorly. We did not indict any nodes using ReadBlock durations on RandWriter. By design, the RandWriter workload has no ReadBlock states, since its only function is to write data blocks; hence, it is not possible to perform any diagnosis using ReadBlock states on the RandWriter workload. Also, MergeCopy on RandWriter is a disk-intensive operation that has minimal processing requirements. Thus, CPUHog does not significantly affect the MergeCopy operation, as there is little contention for the CPU between the failure and the MergeCopy operations. However, the MergeCopy operation is disk-intensive, and is affected by the DiskHog.
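These rate computations reduce to simple counting over hosts; a sketch (the host sets in the example are hypothetical):

```python
def diagnosis_rates(hosts, injected, indicted):
    """hosts: all slave hosts; injected: fault-injected hosts; indicted: flagged hosts.
    Returns (true-positive rate, false-positive rate)."""
    tp = sum(1 for h in hosts if h in injected and h in indicted)
    fp = sum(1 for h in hosts if h not in injected and h in indicted)
    tpr = tp / max(len(injected), 1)                # fraction of faulty hosts caught
    fpr = fp / max(len(hosts) - len(injected), 1)   # fraction of healthy hosts flagged
    return tpr, fpr

# 5 slaves, fault injected on host 3, algorithm indicts hosts {3, 4}.
print(diagnosis_rates(range(5), {3}, {3, 4}))  # (1.0, 0.25)
```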
We found that DiskHog and CPUHog could manifest in a correlated manner on some metrics. For the Sort workload, if a failure-free host attempted to read a data block from the failure-injected node, the failure would manifest on the ReadBlock metric at the failure-free node. By augmenting this analysis with the data-flow model, we improved results for DiskHog and CPUHog on Sort, as discussed next.

8.3 Correlated Failures: Data-flow Augmentation

Peer-comparison techniques are poor at diagnosing correlated failures across hosts; e.g., ReadBlock durations failed to diagnose DiskHog on the Sort workload. In such cases, our original algorithm often indicted failure-free nodes, but not the failure-injected nodes. We augmented our algorithm using previously observed states with anomalously long durations, superimposing the data-flow model. For a Hadoop job, we identify a state as an outlier by comparing the state's duration with the PDF of previous durations of the state, as estimated from past histograms. Specifically, we check whether the state's duration is greater than the threshold_h-percentile of this estimated PDF. Since each DataNode state is associated with a host performing a read and another (not necessarily different) host performing the corresponding write, we can count the number of anomalous states that each host was associated with. A host is then indicted by this technique if it was associated with at least half of all the anomalous states seen across all slave hosts. Hence, by augmenting the diagnosis with data-flow information, we were able to improve our diagnosis results for correlated failures. We achieved true- and false-positive rates, respectively, of (0.7, 0.1) for the CPUHog and (0.8, 0.05) for the DiskHog failures on the ReadBlock metric.

9 Conclusion and Future Work

SALSA analyzes system logs to derive state-machine views, distributed control-flow and data-flow models, and statistics of a system's execution. These different views of log data can be useful for a variety of purposes, such as visualization and failure diagnosis. We present SALSA and apply it concretely to Hadoop, to visualize its behavior and to diagnose documented failures of interest. We also initiated some early work to diagnose correlated failures by superimposing the derived data-flow models on the control-flow models. For our future directions, we intend to correlate numerical OS/network-level metrics with log data, in order to analyze them jointly for failure diagnosis and workload characterization. We also intend to automate the visualization of the causality graphs for the distributed control-flow and data-flow models.
Finally, we aim to generalize the format/structure/content of logs that are amenable to SALSA's approach, so that we can develop a log-parser/processing framework that accepts a high-level definition of a system's logs, using which it then generates the desired set of views.