Automated Diagnosis of Chronic Performance Problems in Production Systems
Soila P. Kavulya
Christos Faloutsos (CMU), Greg Ganger (CMU), Matti Hiltunen (AT&T Labs), Priya Narasimhan (CMU, Advisor)
Motivation
Major outages are rare in production systems
  Such blackouts are also detected by alarms
Chronic performance problems (brownouts)
  System still works, but with degraded performance
  Problem typically affects a subset of users/requests
  Admins often unaware until a user complains
Chronics are fairly common
  Production VoIP system: source of 42% of failed calls in the worst month of outages
  Production Hadoop cluster (OpenCloud): source of 78% of reported problems in an 11-month period
Challenges (1)
Scale: thousands of network elements, millions of calls
Under the radar: symptoms often not severe enough to trigger system alarms
Multiple ongoing problems: difficult to disambiguate
Never seen before: first indication is a cryptic customer complaint
[Diagram: ISP VoIP architecture: customers with IP and non-IP service, IP base elements, application servers, gateway servers, PSTN]
Challenges (2)
Labeled failure-data is not always available
  Difficult to diagnose problems not encountered before
Desired level of instrumentation might not be possible
  Existing vendor instrumentation with limited control
  Cost of adding instrumentation might be high
  Instrumentation might be diverse, collected at different sampling rates
Outline
Thesis statement
Approach: instrumentation, anomaly detection, problem localization
Experimental evaluation: fault injection, case studies
Extensions
Conclusion
Thesis Statement
Diagnosis of chronic performance problems in production systems is possible through the analysis of common white-box logs to extract local behavior and system-wide dependencies, coupled with the analysis of common black-box metrics to identify the resource at fault.
Goals and Non-goals
Goals of approach
  Diagnosis using existing instrumentation in production systems
  Anomaly detection in the absence of labeled failure-data
  Differentiation of workload changes from anomalies
Non-goals
  Diagnosis of system-wide outages
  Diagnosis of value faults and transient faults
  Root-cause analysis at code level
Assumptions
Majority of the system is working correctly
Problems manifest as observable behavioral changes
  Exceptions or performance degradations
  Visible to the end-user
All instrumentation is locally time-stamped
  Clocks are synchronized to enable system-wide correlation of data
Instrumentation faithfully captures system behavior
Target Systems for Validation
VoIP system at a large ISP
  10s of millions of calls per day
  1000s of network elements with heterogeneous hardware
  24x7 Ops team uses alarm correlation to diagnose outages
  Separate team troubleshoots long-term chronics
  Labeled traces available
Hadoop: open-source implementation of MapReduce
  Diverse kinds of data-intensive workloads (graph mining, language translation)
  Hadoop clusters have homogeneous hardware
  400-node Yahoo! M45 and 64-node OpenCloud clusters
  Controlled experiments in an Amazon EC2 cluster
  Long-running jobs (> 100s): hard to label failures
Contributions (VoIP | Hadoop)
Anomaly detection: heuristics-based | peer comparison without labeled data
Problem localization: localize to customer/network element/resource/error code | localize to node/task/resource
Types of chronics: exceptions, performance degradation, single-source, multiple-source | exceptions, performance degradation, single-source, multiple-source
Experimental evaluation: production VoIP system, 1000s of network elements | OpenCloud, 64 nodes
Publications: SLAML 11, OSR 11, DSN 12 | WASL 08, HotMetrics 09, ISSRE 09, ICDCS 10, NOMS 10, CCGRID 10
Outline
Thesis statement
Approach: instrumentation, anomaly detection, problem localization
Experimental evaluation: fault injection, case studies
Extensions
Conclusion
Overview of Approach
[Pipeline diagram: white-box instrumentation -> White-box Analysis -> end-to-end flows -> Anomaly Detection -> labeled end-to-end flows -> Problem Localization -> list of problems ranked by severity. Black-box instrumentation -> Black-box Analysis -> normalized black-box metrics -> Problem Localization. Visualization of anomalous nodes supports root-cause inference.]
Black-Box Instrumentation
For both Hadoop and VoIP
  Resource-usage metrics collected periodically from the OS
  Monitoring interval varies from 1s to 15min
Examples of metrics
  CPU utilization, CPU run-queue size
  Pages in, pages out
  Memory used, memory free
  Context switches
  Packets received, packets sent
  Disk blocks read, disk blocks written
White-Box Instrumentation
Each node logs each request that passes through it
  Timestamp, IP address, request duration/size, phone number, etc.
  Log formats vary across components
  Application-specific parsers extract relevant attributes
Construction of end-to-end traces
  Extract control-flow information
    Control flow captures the sequence of events executed, e.g., dependencies between Maps and Reduces in Hadoop
  Extract data-flow information
    Data flow captures the transfer of data between components
  Stitch end-to-end flows using the control- and data-flow information
Target Systems Instrumentation
[Diagram: instrumentation points in both systems. Hadoop clusters (OpenCloud, Yahoo! M45): master (JobTracker, NameNode) and slaves (TaskTracker running Maps and Reduces, DataNode/HDFS), with jobs submitted to the master. ISP's VoIP system: customers, IP base elements, application servers, gateway servers, PSTN.]
White-Box Analysis
[Pipeline diagram repeated from the overview]
Questions
  How do we extract local control- and data-flow?
  How do we infer dependencies with other components?
  How do we deal with missing dependency information?
White-Box Logs
Hadoop logs (first line: control flow; second line: control and data flow):
  2011-10-12 23:59:55,625 INFO org.apache.hadoop.mapred.tasktracker: attempt_201106031747_9630_m_013846_0 0.027189009% Records R/W=208/1
  2011-10-12 23:59:55,943 INFO org.apache.hadoop.mapred.tasktracker: Sent out 43000 bytes for reduce: 37 from map: attempt_201106031747_9630_m_013677_0 given 43000
VoIP Call Detail Record (control and data flow):
  Mon Apr 29 07:30:14 2013
  NAS-Identifier = other
  Acct-Status-Type = Accounting-off
  NAS-IP-Address = 172.30.11.36
  Client-IP-Address = 172.30.11.36
  Acct-Input-Packets = 5
  Acct-Output-Packets = 5
  Acct-Input-Octets = 100
  Acct-Output-Octets = 2789
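As an illustration of the application-specific parsing step, a minimal sketch that extracts attributes from the TaskTracker shuffle line shown above; the regex and field names are illustrative, not the thesis's actual parser.

    import re
    from datetime import datetime

    # Matches the shuffle log line shown above; the pattern is illustrative.
    SHUFFLE_RE = re.compile(
        r"^(?P<ts>\S+ \S+) INFO \S+: Sent out (?P<bytes>\d+) bytes for "
        r"reduce: (?P<reduce>\d+) from map: (?P<map_attempt>\S+) given \d+")

    def parse_shuffle(line):
        m = SHUFFLE_RE.match(line)
        if m is None:
            return None
        return {
            "ts": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f"),
            "bytes": int(m.group("bytes")),
            "reduce": int(m.group("reduce")),
            "map_attempt": m.group("map_attempt"),  # e.g. attempt_..._m_013677_0
        }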
White-Box Analysis: Hadoop
1. Each node logs task (or block) info locally.
   TaskTracker 6: "10:03:59, MAP LaunchTaskAction task_188_m_98"
2. Domain-specific knowledge is used to extract attributes of interest.
3. Infer dependencies using an exact key match on the task ID.
   TaskTracker 8: "10:03:59, SHUFFLE task_656_r_900 copying task_188_m_98", "10:03:59, REDUCE task_656_r_900 192.168.22.3"
4. Match on IP address within a given time window.
   Datanode 34: "10:04:01, BLOCK_WRITE blk_8987676 192.168.22.3 to 192.168.22.6"
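A minimal sketch of the two matching rules above, assuming events have already been parsed into dictionaries with illustrative field names (task_id, copied_from, ip, from_ip, ts); the 5-second window is an assumed parameter.

    from datetime import timedelta

    WINDOW = timedelta(seconds=5)  # assumed time window for the IP-based match

    def stitch(maps, shuffles, reduces, block_writes):
        flows = []
        # Rule 3: exact key match on task ID (a shuffle names the map it copied from).
        maps_by_id = {m["task_id"]: m for m in maps}
        for sh in shuffles:
            if sh["copied_from"] in maps_by_id:
                flows.append((maps_by_id[sh["copied_from"]], sh))
        # Rule 4: match reduce output to HDFS block writes by IP address
        # within a given time window.
        for rd in reduces:
            for bw in block_writes:
                if bw["from_ip"] == rd["ip"] and abs(bw["ts"] - rd["ts"]) <= WINDOW:
                    flows.append((rd, bw))
        return flows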
White-Box Analysis: VoIP
1. Each node logs call outcomes locally in a Call Detail Record.
   IP Base Element 3 (IPBE CDR): "10:03:59, START, 973-123-8888 to 409-555-5555, 192.156.1.2 to 11.22.34.1 ... 10:04:02, STOP"
2. Domain-specific knowledge is used to extract attributes of interest.
3. Infer dependencies using an exact key match on the phone number.
   Application Server 5 (AS CDR): "10:04:05, ATTEMPT, 973-123-8888 to 409-555-5555"
4. Approximate match on the partial phone number and time.
   Gateway Server 12 (GS CDR): "10:04:10, ATTEMPT, 973-123-xxxx to 409-555-xxxx"
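A minimal sketch of the approximate match in step 4, assuming CDRs parsed into dictionaries with illustrative caller/callee/ts fields; the gateway CDR masks the last four digits, so the match uses the phone-number prefix plus a time window (window size is an assumption).

    def approx_match(as_cdr, gs_cdr, window_secs=30):
        # Compare on the partial phone number (area code + exchange) because the
        # gateway CDR masks the last four digits, plus a time window.
        def prefix(number):                 # "973-123-8888" -> "973-123"
            return number.rsplit("-", 1)[0]
        close_in_time = abs((as_cdr["ts"] - gs_cdr["ts"]).total_seconds()) <= window_secs
        return (close_in_time
                and prefix(as_cdr["caller"]) == prefix(gs_cdr["caller"])
                and prefix(as_cdr["callee"]) == prefix(gs_cdr["callee"]))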
Output of White-box Analysis
Inferred dependencies between components, e.g., a Reduce linked to a Block Write:
  Reduce: Timestamp: 10:04:01, Task: task_656_r_900, From IP: 192.168.22.3
  Block Write: Timestamp: 10:04:01, Type: BLOCK_WRITE, BlockID: blk_8987676, From IP: 192.168.22.3, To IP: 192.168.22.6
Unstructured logs are transformed into a structured log
Anomaly Detection
[Pipeline diagram repeated from the overview]
Questions
  How to detect performance problems in the absence of labeled data?
  How to distinguish legitimate application behavior from problems?
Anomaly Detection
Some user-visible problems manifest as errors
  Detected by extracting error codes from failed flows, or
  By applying domain-specific heuristics
Performance problems can be harder to detect
  Exploit the notion of peers to detect performance problems
  Determine what system behaviors can be considered equivalent ("peers") under normal conditions
  Significant deviation from peers is regarded as anomalous
rika (Swahili), noun: peer, contemporary, age-set; undergoing rites of passage (marriage) at similar times.
Anomaly Detection: Hadoop (1)
Extract exceptions from failed and canceled tasks
Detect performance problems using peers
  Empirical analysis of production data to identify peers
    219,961 successful jobs (Yahoo! M45 and OpenCloud)
    89% of jobs had low variance in their Map durations
    65% of jobs had low variance in their Reduce durations
  Designate tasks belonging to the same job as peers
Anomaly Detection: Hadoop (2)
At the same time, behavior amongst peers can legitimately diverge due to various application factors
  Identified 12 such factors on OpenCloud
  Example: HDFS bytes written/read
Exploit regression to automatically learn the factors influencing task durations (see the sketch below)
Flag tasks that do not fit the regression model as anomalous
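A minimal sketch of the regression step, assuming task durations and a feature matrix of application factors (e.g., HDFS bytes read/written); the model and cutoff used in the thesis may differ.

    import numpy as np

    def flag_anomalous_tasks(factors, durations, k=3.0):
        # factors: (n_tasks, n_factors) array; durations: (n_tasks,) array.
        X = np.column_stack([np.ones(len(durations)), factors])  # add intercept
        coef, *_ = np.linalg.lstsq(X, durations, rcond=None)     # least-squares fit
        residuals = durations - X @ coef
        cutoff = k * np.median(np.abs(residuals))                # robust spread estimate
        return np.flatnonzero(np.abs(residuals) > cutoff)        # indices of anomalous tasks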
Anomaly Detection: VoIP
Detecting blocked or dropped calls
  Extract defect codes (e.g., timeout) from failed calls
  Exploit domain-specific heuristics to detect problems
    Examples: callback soon after call end, zero talk-time
Detecting performance problems
  Designate peers to be calls belonging to the same service
  Simple statistical technique to detect when peers deviate, e.g., packet loss exceeds the 90th percentile of calls
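A minimal sketch of this peer-deviation check, assuming per-call packet-loss values for calls in the same service; the 90th percentile is the threshold named above (the case study later uses the 85th).

    import numpy as np

    def flag_high_loss_calls(packet_loss, pct=90):
        # packet_loss: per-call packet-loss values for peers (same service).
        threshold = np.percentile(packet_loss, pct)
        return np.flatnonzero(np.asarray(packet_loss) > threshold)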
Output of Anomaly Detection
Labeled end-to-end flows, e.g.:
  FAIL: Reduce, Block Write, Block Read
  SUCCESS: Map, Block Write
Problem Localization
[Pipeline diagram repeated from the overview]
Questions
  How to identify problems due to a combination of factors?
  How to distinguish multiple ongoing problems?
  How to find the resource that caused the problem?
  How to handle noise due to flawed anomaly detection?
Identify Suspect Attributes (1)
STEP 1: Find individual attributes most likely to occur in failed flows
Labeled end-to-end traces generated by anomaly detection:
  Task1: 09:31am, SUCCESS, Server1, Map2, BlockRead3, ...
  Task2: 09:32am, FAIL, Server1, Map3, Server8, BlockRead5, ...
10s of millions of flows, 10s of thousands of attributes
Binary representation supports scalable representation of attributes as a sparse table:
          Server1  Server8  Map2  Map3  Outcome
  Task1   1        0        1     0     SUCCESS
  Task2   1        1        0     1     FAIL
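A minimal sketch of the sparse binary encoding, assuming each flow is a dictionary with an attribute list and an outcome; the field names and use of scipy are illustrative.

    from scipy.sparse import lil_matrix

    def encode_flows(flows, attribute_index):
        # attribute_index: dict mapping attribute name -> column number.
        X = lil_matrix((len(flows), len(attribute_index)), dtype=bool)
        failed = []
        for row, flow in enumerate(flows):
            for attr in flow["attributes"]:
                X[row, attribute_index[attr]] = True
            failed.append(flow["outcome"] == "FAIL")
        return X.tocsr(), failed       # compact even with 10,000s of columns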
Identify Suspect Attributes (2)
STEP 1 (contd): Estimate conditional probability distributions
  Prob(Success | Attribute) vs. Prob(Failure | Attribute)
  Update the belief on each distribution with each flow seen
  Anomaly score: distance between the two distributions (see the sketch below)
[Plot: degree of belief vs. probability for the "Success | Server8" and "Failure | Server8" distributions]
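A minimal sketch of the per-attribute score, assuming a Beta-Bernoulli belief over P(Failure | attribute) vs. P(Success | attribute) and KL divergence as the distance between the two posteriors; the exact scoring used in the thesis may differ.

    from scipy.special import betaln, digamma

    def kl_beta(a1, b1, a2, b2):
        # KL( Beta(a1, b1) || Beta(a2, b2) ), standard closed form.
        return (betaln(a2, b2) - betaln(a1, b1)
                + (a1 - a2) * digamma(a1)
                + (b1 - b2) * digamma(b1)
                + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

    def attribute_score(n_fail, n_ok):
        # n_fail / n_ok: failed and successful flows containing the attribute.
        # Start from a uniform Beta(1, 1) prior; each flow seen updates the belief.
        failure_belief = (1 + n_fail, 1 + n_ok)
        success_belief = (1 + n_ok, 1 + n_fail)
        return kl_beta(*failure_belief, *success_belief)

The score grows both as failures dominate among flows carrying the attribute and as the evidence (number of flows seen) accumulates, which is the behavior the two belief curves on the slide illustrate.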
Find Attribute Combinations
STEP 2: Determine if the chronic is triggered by a combination of factors
  Find attribute combinations that maximize the anomaly score
  A greedy, iterative search limits the combinations explored (see the sketch below)
    Step 1: Score candidate combinations over all flows and indict the path with the highest anomaly score (e.g., Server8 -> Block Read)
    Step 2: Remove all flows that match Signature 1. Repeat.
[Search-tree illustration: candidate combinations such as Server6, Server8, Block Read, Map 65, Shuffles, annotated with anomaly scores]
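A minimal sketch of the greedy, iterative search; score(flows, signature) is a stand-in for the anomaly score above evaluated on the flows matching a candidate signature, and the length and problem-count limits are assumed parameters.

    def best_signature(flows, attributes, score, max_len=3):
        sig, best = [], float("-inf")
        for _ in range(max_len):
            gains = {a: score(flows, sig + [a]) for a in attributes if a not in sig}
            if not gains:
                break
            attr, s = max(gains.items(), key=lambda kv: kv[1])
            if s <= best:
                break                    # no attribute improves the score further
            sig, best = sig + [attr], s
        return sig, best

    def localize(flows, attributes, score, max_problems=10):
        problems = []
        for _ in range(max_problems):
            sig, s = best_signature(flows, attributes, score)
            if not sig:
                break
            problems.append((sig, s))    # indict the path with the highest score
            # Remove all flows matching the indicted signature, then repeat.
            flows = [f for f in flows if not set(sig) <= set(f["attributes"])]
        return problems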
Rank Problems by Severity
STEP 3: Rank the list of identified problems by severity
  Rank problems based on the number of flows affected
  Spurious attributes introduced by noise receive a low rank
[UI screenshot: ranked list of chronics identified (1. chronic signature 1: Server8, BlockRead; 2. chronic signature 2: Server64), with failed flows plotted against time of day (GMT)]
Visualization allows operators to identify recurrent problems
Fusing Black-box Metrics
STEP 4: Determine if resource-usage metrics are affected
  Annotate flows associated with culprit nodes (and their peers) with the mean resource usage on the node during the event's duration
Example:
  Culprit node, Server 8: Time: 10:03:59, Map ID: task_188_m_98, Bytes Read: 7867, Duration: 25 seconds, Status: FAILED; Mean CPU: 70.4%, Mean Memory: 500MB, Mean DiskUtil: 30KB
  Peer, Server 10: Time: 10:03:59, Map ID: task_188_m_76, Bytes Read: 7867, Duration: 3 seconds, Status: SUCCESS; Mean CPU: 12.4%, Mean Memory: 430MB, Mean DiskUtil: 32KB
  Peer, Server 13: Time: 10:03:59, Map ID: task_188_m_85, Bytes Read: 6863, Duration: 2 seconds, Status: SUCCESS; Mean CPU: 15.4%, Mean Memory: 480MB, Mean DiskUtil: 23KB
Culprit Black-Box Metrics
STEP 4 (contd): Compare the distribution of each black-box metric for successful vs. failed flows
[Plot: degree of belief vs. CPU usage (%) for successful flows and failed flows]
Indict a metric if the difference between the distributions is statistically significant
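A minimal sketch of the indictment test, using a two-sample Kolmogorov-Smirnov test as the significance check; this is an illustrative stand-in, since the slide compares belief distributions in the same style as the white-box step, and the sample layout and threshold are assumptions.

    from scipy.stats import ks_2samp

    def indict_metrics(success_samples, failure_samples, alpha=0.01):
        # success_samples / failure_samples: dict metric name -> list of values
        # observed on the culprit node and its peers during the flows' durations.
        culprits = []
        for name, failed in failure_samples.items():
            stat, p_value = ks_2samp(success_samples[name], failed)
            if p_value < alpha:
                culprits.append((name, stat))
        return sorted(culprits, key=lambda c: c[1], reverse=True)  # most divergent first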
Outline
Thesis statement
Approach: instrumentation, anomaly detection, problem localization
Experimental evaluation: fault injection, case studies
Extensions
Conclusion
Experimental Evaluation
Fault injection (Hadoop)
  Workload: Gridmix cluster benchmark
  Injected faults: resource hogs/task hangs, 10 iterations per fault
  Experimental setup: 10-node EC2 cluster, 2 x 1.2GHz cores, 7GB RAM per node
Fault injection (VoIP)
  Workload: one week of the ISP's call logs
  Injected faults: server/customer problems, 1000 simulated faults in total
  Experimental setup: 1 simulation node, 2 x 2.4GHz cores, 16GB RAM
Case studies (Hadoop)
  Production system: OpenCloud
  Status: post-mortem offline analysis of real incidents
Case studies (VoIP)
  Production system: production VoIP system at the ISP
  Status: used in production for 2 years, since 2011
Expt #1: Impact of Dependencies
QUESTION: Does knowledge of dependencies affect diagnosis?
METHOD: Hadoop EC2 cluster, 10 nodes, fault injection. Apply problem localization with white-box metrics. Compare against the same approach without knowledge of dependencies.
[Bar chart: flow-level diagnosis accuracy, with vs. without dependencies, for disk hog, packet loss, map hang, and reduce hang faults]
Knowledge of dependencies improves diagnosis
Expt #2: Impact of Fusion
QUESTION: Does fusion of metrics provide insight on the root-cause?
METHOD: Hadoop EC2 cluster, 10 nodes, fault injection. Apply problem localization with fused white/black-box metrics.
Top metrics indicted per injected fault (white-box | black-box):
  Disk hog: Maps | Disk
  Packet loss: Shuffles | -
  Map hang (Hang1036): Maps | -
  Reduce hang (Hang1152): Reduces | -
Fusion of metrics provides insight on most injected faults
Expt #3: Impact of Fault Probability
QUESTION: Can we effectively diagnose low-probability faults?
METHOD: One week of the ISP's call logs, 1 node, fault simulation. Randomly label 1-10% of flows with the attributes of interest as faulty. Apply problem localization with white-box metrics.
Robust to changes in fault probability
Expt #4: Impact of Noise
QUESTION: Does flawed anomaly detection (noise) impact diagnosis?
METHOD: One week of the ISP's call logs, 1 node, fault simulation. Mislabel 5-20% of failed flows. Apply problem localization with white-box metrics.
Recall is robust to noisy labels. Precision drops due to spurious attributes.
Case #1: Multiple Hardware Issues
INCIDENT: Multiple hardware problems in the OpenCloud cluster
  User experienced multiple job failures with cryptic exceptions.
  Administrators initially suspected a memory configuration issue.
  Took a week to resolve: bad disk and bad NIC on two nodes.
DIAGNOSIS APPLIED
  Applied the problem-localization approach with white-box metrics.
  Correctly identified the nodes with bad hardware in the top-10 ranked list.
  Identified multiple simultaneous problems affecting the user's job.
Case #2: Quality-of-Service (QoS) Violations
INCIDENT: Calls in the VoIP system experiencing high packet loss
  Operators suspected an issue with network elements at the ISP.
  Took weeks to resolve.
DIAGNOSIS APPLIED
  Flagged calls whose packet loss exceeded the 85th percentile as faulty.
  Applied the problem-localization approach with white-box metrics.
  Showed most QoS issues were tied to specific customers.
  Operators alerted customers of the problem on their site.
  Customers fixed the problem and the QoS violations were resolved.
Case #3: Performance Problem
INCIDENT: Intermittent performance problem in the VoIP system
  Intermittent performance problem with two application servers.
  Affected 0.1% of all calls passing through the application servers.
DIAGNOSIS APPLIED
  Applied problem localization using white/black-box metrics.
  Post-mortem analysis confirmed the issue with the application servers.
  Also flagged anomalous CPU and memory usage.
  Successfully identified a low-severity fault affecting multiple servers.
Outline
Thesis statement
Approach: instrumentation, anomaly detection, problem localization
Experimental evaluation: fault injection, case studies
Extensions
Conclusion
Lessons Learned (1)
Synthesis of end-to-end causal traces is possible
  Local logs capture local control- and data-flow information
  Approximate matching infers implicit dependencies
In the absence of labeled data, peer comparison is a feasible approach for anomaly detection
  Peers can be tasks (Hadoop), end-to-end flows, or calls within the same service (VoIP)
Regression can help differentiate between
  Legitimate application behavior (more bytes read/written) vs.
  Anomalous behavior (task taking longer to run for other, unexplained reasons)
Lessons Learned (2)
Important to analyze both successful and failed flows
  Limiting analysis to only failed flows might elevate common elements over causal elements
Fusion of white- and black-box data can provide more insight into the source of a problem
Ranking problems by severity helps tolerate noise
  Spurious labels receive a lower ranking
Limitations
No diagnosis for the Master node of a Hadoop cluster
  Problems at the master typically result in system-wide issues
Peer groups are defined statically
  Need to automate the identification of peers
False positives occur if the root cause is not in the logs
  Algorithm tends to implicate adjacent network elements
  Need to incorporate more data to improve visibility
Does not detect dormant problems that do not impact user-perceived system behavior
  Example: blacklisted nodes in Hadoop
Extensions (Future Work)
Visualization in heterogeneous systems
  User study on diagnosis interfaces in Hadoop [CHIMIT11]
  Visual signatures of problems in Hadoop [LISA12]
  Future work: visual signatures of problems in heterogeneous systems; an extensible visualization framework for diagnosis
Online monitoring and diagnosis
  Generic framework for monitoring and diagnosis [WADS09]
  Streaming implementation of problem localization [DSN12]
  Future work: a scalable monitoring and diagnostic framework
Visualization
[Pipeline diagram repeated from the overview]
Questions
  How to develop compact visualizations for large clusters?
  Can visualizations help spot/discriminate different anomalies?
Theia: Visual Signatures of Problems
Maps observed anomalies to broad problem classes
  Hardware failures, application issues, data skew
Supports interactive data exploration
  Users drill down from cluster- to job-level displays
  Hovering over the visualization gives more context
Compact representation for scalability
  Can support clusters with 100s of nodes
[Heatmap: Job ID (sorted by start time) vs. Node ID, showing performance degradation due to a failing disk]
USENIX LISA 2012 Best Student Paper Award
Conclusion
Approach for diagnosis of chronic problems
  Amenable for use in production systems
  Infers dependencies from existing white-box logs
  Uses heuristics and peer comparison to detect anomalies
  Localizes the source of the problem using a statistical approach
  Incorporates both white-box and black-box logs
Demonstrated on two production systems
  VoIP system at an ISP (approach deployed for 2 years now)
  OpenCloud Hadoop cluster
Initial progress on extensions (visualization)
Collaborators & Thanks
VoIP (AT&T): Matti Hiltunen, Kaustubh Joshi, Scott Daniels
Hadoop visualization: Christos Faloutsos, U Kang, Elmer Garduno, Jason Campbell (Intel), HCI 05-610 team
OpenCloud: Greg Ganger, Garth Gibson, Julio Lopez, Kai Ren, Mitch Franzos, Michael Stroucken
Hadoop diagnosis: Jiaqi Tan, Xinghao Pan, Rajeev Gandhi, Keith Bare, Michael Kasick, Eugene Marinelli
Publications (1)
(Categories: diagnosis in VoIP; visualization, user studies, surveys; white-box diagnosis; black-box diagnosis)
1. S. P. Kavulya, S. Daniels, K. Joshi, M. Hiltunen, R. Gandhi, P. Narasimhan. Draco: Statistical Diagnosis of Chronic Problems in Large Distributed Systems. IEEE Dependable Systems and Networks (DSN 12), Boston, MA, Jun 2012.
2. S. P. Kavulya, K. Joshi, M. Hiltunen, S. Daniels, R. Gandhi, P. Narasimhan. Practical Experiences with Chronics Discovery in Large Telecommunications Systems. Best Papers from SLAML in Operating Systems Review (OSR 12), 2012.
3. S. P. Kavulya, K. Joshi, M. Hiltunen, S. Daniels, R. Gandhi, P. Narasimhan. Practical Experiences with Chronics Discovery in Large Telecommunications Systems. Workshop on Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques (SLAML 11), 2011.
4. E. Garduno, S. Kavulya, J. Tan, R. Gandhi, P. Narasimhan. Theia: Visual Signatures for Problem Diagnosis in Large Hadoop Clusters. Large Installation System Administration Conference (LISA 12), San Diego, CA, Dec 2012. Best Student Paper Award.
5. S. P. Kavulya, K. Joshi, F. Di Giandomenico, P. Narasimhan. Failure Diagnosis of Complex Systems. Book on Resilience Assessment and Evaluation (RAE 12). Wolter, 2012.
6. J. Campbell, A. Ganesan, B. Gotow, S. Kavulya, J. Mulholland, P. Narasimhan, S. Ramasubramanian, M. Shuster, J. Tan. Understanding and Improving the Diagnostic Workflow of MapReduce Users. 5th ACM Symposium on Computer Human Interaction for Management of Information Technology (CHIMIT 11), Boston, MA, Dec 2011.
7. S. Kavulya, J. Tan, R. Gandhi, P. Narasimhan. An Analysis of Traces from a Production MapReduce Cluster. 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 10), Melbourne, Victoria, Australia, May 2010.
8. J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. Visual, Log-based Causal Tracing for Performance Debugging of MapReduce Systems. 30th IEEE International Conference on Distributed Computing Systems (ICDCS 10), Genoa, Italy, Jun 2010.
9. J. Tan, X. Pan, S. Kavulya, R. Gandhi, P. Narasimhan. Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop. USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 09), San Diego, CA, Jun 2009.
10. J. Tan, X. Pan, S. Kavulya, R. Gandhi, P. Narasimhan. SALSA: Analyzing Logs as State Machines. USENIX Workshop on Analysis of System Logs (WASL 08), San Diego, CA, Dec 2008.
11. J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. Lightweight Black-box Failure Detection for Distributed Systems. Workshop on Management of Big Data Systems (MBDS 12), co-located with the International Conference on Autonomic Computing, San Jose, CA, Sep 2012.
12. X. Pan, S. Kavulya, J. Tan, R. Gandhi, P. Narasimhan. Ganesha: Black-Box Diagnosis for MapReduce Systems. Workshop on Hot Topics in Measurement & Modeling of Computer Systems (HotMetrics 09), Seattle, WA, Jun 2009.
Publications (2)
(Category: black-box + white-box diagnosis)
13. J. Tan, X. Pan, S. Kavulya, E. Marinelli, R. Gandhi, P. Narasimhan. Kahuna: Problem Diagnosis for MapReduce-based Cloud Computing Environments. 12th IEEE/IFIP Network Operations and Management Symposium (NOMS 10), Osaka, Japan, Apr 2010.
14. X. Pan, J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. Blind Men and the Elephant: Piecing Together Hadoop for Diagnosis. 20th IEEE International Symposium on Software Reliability Engineering (ISSRE 09), Industrial Track, Mysuru, India, Nov 2009.
15. S. Kavulya, R. Gandhi, P. Narasimhan. Gumshoe: Diagnosing Performance Problems in Replicated File-Systems. IEEE Symposium on Reliable Distributed Systems (SRDS 08), Naples, Italy, Oct 2008.
16. S. Pertet, R. Gandhi, P. Narasimhan. Fingerpointing Correlated Failures in Replicated Systems. SysML, Apr 2007.
Questions?
Related Work (1)
M. Attariyan, M. Chow, J. Flinn. X-ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software. USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), Hollywood, CA, Oct 2012.
P. Bodík, M. Goldszmidt, A. Fox, D. B. Woodard, H. Andersen. Fingerprinting the Datacenter: Automated Classification of Performance Crises. EuroSys 2010.
[Cohen05] I. Cohen, S. Zhang, M. Goldszmidt, J. Symons, T. Kelly, A. Fox. Capturing, Indexing, Clustering, and Retrieving System History. SOSP 2005.
[Kandula09] S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, P. Bahl. Detailed Diagnosis in Enterprise Networks. SIGCOMM 2009.
[Kasick10] M. P. Kasick, J. Tan, R. Gandhi, P. Narasimhan. Black-Box Problem Diagnosis in Parallel File Systems. FAST 2010.
Related Work (2)
[Kiciman05] E. Kiciman, A. Fox. Detecting Application-Level Failures in Component-Based Internet Services. IEEE Trans. on Neural Networks, 2005.
[Mahimkar09] A. A. Mahimkar, Z. Ge, A. Shaikh, J. Wang, J. Yates, Y. Zhang, Q. Zhao. Towards Automated Performance Diagnosis in a Large IPTV Network. SIGCOMM 2009.
[Sambasivan11] R. R. Sambasivan, A. X. Zheng, M. De Rosa, E. Krevat, S. Whitman, M. Stroucken, W. Wang, L. Xu, G. R. Ganger. Diagnosing Performance Changes by Comparing Request Flows. NSDI 2011.
[Tati12] S. Tati, B.-J. Ko, G. Cao, A. Swami, T. F. La Porta. Adaptive Algorithms for Diagnosing Large-Scale Failures in Computer Networks. IEEE Conference on Dependable Systems and Networks (DSN 12), Boston, MA, Jun 2012.