Error Log Processing for Accurate Failure Prediction
Felix Salfner (ICSI Berkeley)
Steffen Tschirpke (Humboldt-Universität zu Berlin)
Introduction
Context of work: error-based online failure prediction. Error events observed within a data window up to the present time are used to predict whether a failure is imminent.
(figure: timeline with error events A, B, C inside a data window ending at the present time, followed by the failure prediction)
System studied: a commercial telecommunication system with 200 components and 2,000 classes; error and failure logs.
In this talk we present the data-preprocessing concepts we applied to obtain accurate failure-prediction results.
Contents
- Key facts on the data
- Overview of the online failure prediction and data-preprocessing process
- Detailed description of major preprocessing concepts:
  - Assigning IDs to error messages
  - Failure sequence clustering
  - Noise filtering
- Experiments and results
Key Facts on the Data
Experimental setup: a telecommunication system reporting response times to a call tracker, together with error logs and a failure log.
- 200 days of data from a 273-day period
- 26,991,314 error log records
- 1,560 failures of two types
Failure definition (performance failures): if within a 5-minute interval more than 0.01% of calls experience a response time above 250 ms, the interval counts as a failure.
(figure: timeline of 5-minute intervals; an interval with more than 0.01% slow calls is marked as a failure)
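The failure definition above can be checked per window with a few lines of code. This is a minimal sketch; the function name and signature are my own, not from the slides:

```python
from typing import Sequence

def is_performance_failure(response_times_ms: Sequence[float],
                           threshold_ms: float = 250.0,
                           max_slow_fraction: float = 0.0001) -> bool:
    """Apply the slide's failure definition to one 5-minute window:
    a performance failure occurs if more than 0.01% of calls in the
    window have a response time above 250 ms."""
    if not response_times_ms:
        return False
    slow = sum(1 for t in response_times_ms if t > threshold_ms)
    return slow / len(response_times_ms) > max_slow_fraction
```

Note that the 0.01% bound is exclusive: a window with exactly 0.01% slow calls is not counted as a failure.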
Online Failure Prediction
Approach: pattern recognition using hidden semi-Markov models (HSMMs). Error sequences preceding failures (failure sequences) and sequences taken elsewhere in the logs (non-failure sequences) are extracted; one HSMM is trained on the failure sequences and one on the non-failure sequences.
(figure: example failure and non-failure sequences of error symbols A, B, C preceding a failure F, with data window t_d and lead time t_l)
Objectives for data preprocessing:
- Create a data set to train HSMM models exposing key properties of the system
- Identify how to process incoming data during runtime
Tasks:
- Machine-processable data → error-ID assignment
- Separate sequences for inherent failure mechanisms → clustering
- Distinguishing, noise-free sequences → noise filtering
Training Data Preprocessing
(figure: preprocessing pipeline) The error and failure logs pass through error-ID assignment, tupling, and timestamp extraction; sequence extraction then yields non-failure sequences (which train model 0) and failure sequences, which are clustered into groups 1…u; each group passes through its own noise filtering (1…u) before training models 1…u.
Error ID Assignment
Problem: the error logs contain no message IDs. Example message of a log record:
  process 34: end of buffer reached
Task: assign an ID to each message that characterizes what has happened.
Approach, in two steps:
1. Remove numbers: "process xx: end of buffer reached"
2. Assign IDs based on Levenshtein's edit distance with a constant threshold

  Data              No. of distinct messages
  Original          1,695,160
  Without numbers   12,533
  Levenshtein       1,435
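The two steps above can be sketched as follows. The slides only specify number removal plus Levenshtein distance with a constant threshold; the greedy first-match grouping and the threshold value of 10 are assumptions for illustration:

```python
import re

def strip_numbers(message: str) -> str:
    """Step 1: replace digit runs with a placeholder so messages that
    differ only in PIDs, addresses, or counters collapse to one string."""
    return re.sub(r"\d+", "xx", message)

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def assign_ids(messages, threshold=10):
    """Step 2 (assumed greedy variant): give each number-stripped message
    the ID of the first representative within the edit-distance threshold,
    or a fresh ID if none is close enough."""
    reps, ids = [], []
    for msg in (strip_numbers(m) for m in messages):
        for rid, rep in enumerate(reps):
            if levenshtein(msg, rep) <= threshold:
                ids.append(rid)
                break
        else:
            reps.append(msg)
            ids.append(len(reps) - 1)
    return ids
```

On the slide's example, "process 34: …" and "process 7: …" collapse to the same string after number removal and therefore receive the same ID.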
Failure Sequence Clustering
(figure: the preprocessing pipeline from the previous slide, with the clustering stage highlighted)
Failure Sequence Clustering (2)
Goal: divide the set of training failure sequences into subsets, grouped according to sequence similarity.
Approach:
- Train a small HSMM M_i for each failure sequence F_i
- Apply each HSMM to all sequences; the sequence log-likelihoods express pairwise similarities
(figure: three failure sequences F_1, F_2, F_3 and the resulting 3×3 matrix of log-likelihoods under models M_1, M_2, M_3)
- Make the matrix symmetric
- Apply a standard clustering algorithm
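The last two steps can be illustrated once the log-likelihood matrix is available. The slide does not fix the symmetrization rule or the clustering algorithm; averaging the two directions and a single-linkage-style union-find grouping are assumptions in this sketch:

```python
def symmetrize(loglik):
    """Turn the (generally asymmetric) matrix of sequence log-likelihoods
    loglik[i][j] -- sequence j scored by the model trained on sequence i --
    into a symmetric dissimilarity matrix by averaging the two directions
    and negating (higher likelihood means smaller distance)."""
    n = len(loglik)
    return [[-(loglik[i][j] + loglik[j][i]) / 2 for j in range(n)]
            for i in range(n)]

def cluster(dist, cutoff):
    """Minimal single-linkage grouping via union-find: sequences closer
    than `cutoff` end up in the same cluster (a stand-in for any standard
    clustering algorithm)."""
    n = len(dist)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < cutoff:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]   # cluster label per sequence
```

Sequences whose models score each other well receive a small mutual distance and fall into the same group.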
Failure Sequence Clustering (3)
(figure: clustering result on the failure sequences)
Noise Filtering
(figure: the preprocessing pipeline from the earlier overview slide, with the noise-filtering stages highlighted)
Noise Filtering (2)
Problem: clustered failure sequences contain many unrelated errors; the main reason is parallelism in the system.
Assumption: indicative events occur more frequently prior to a failure than within other sequences. A statistical test is applied to quantify what "more frequently" means.
(figure: failure sequences F_1 … F_5 are clustered into groups; within each group, filtering removes unrelated error symbols, yielding noise-free training sequences for each failure mechanism)
Noise Filtering (3)
Testing variable derived from a goodness-of-fit test:
  chi²_i = (n_i − n · p_i)² / (n · p_i)
where
- n_i denotes the number of occurrences of error i in the time window,
- n denotes the total number of errors in the time window,
- p_i denotes the prior probability of occurrence of error i.
Keep events of type i in the sequence if chi²_i exceeds a threshold.
Three ways to estimate the priors from the training data set:
- the entire data set
- the training sequences
- the failure training sequences
(figure: filtering results for groups G_1 … G_4 under the different prior estimates)
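A per-symbol filter based on this test variable can be sketched as follows. Keeping only over-represented symbols (n_i above its expectation) is an assumption consistent with the slide's "more frequently prior to a failure" motivation, since the raw chi-square term is also large for under-represented symbols:

```python
from collections import Counter

def filter_noise(sequence, priors, threshold):
    """Keep an error symbol only if it occurs significantly more often in
    this window than its prior predicts, using the per-symbol
    goodness-of-fit term chi2_i = (n_i - n*p_i)**2 / (n*p_i), where n_i
    counts symbol i in the window, n is the window's total error count,
    and p_i is the symbol's prior probability of occurrence."""
    n = len(sequence)
    counts = Counter(sequence)
    keep = set()
    for sym, n_i in counts.items():
        expected = n * priors[sym]
        chi2 = (n_i - expected) ** 2 / expected
        if chi2 > threshold and n_i > expected:  # over-represented only (assumption)
            keep.add(sym)
    return [s for s in sequence if s in keep]
```

With the common 5%-significance threshold of 3.84 (one degree of freedom), a symbol occurring 8 times where 2 were expected survives, while one occurring 2 times where 5 were expected is filtered out.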
Experiments and Results
Objective: predict upcoming failures as accurately as possible.
Metrics used:
- Precision: number of correct alarms relative to the total number of alarms
- Recall: number of correct alarms relative to the total number of failures
- F-measure: harmonic mean of precision and recall
Failure prediction is achieved by comparing the sequence likelihoods of an incoming sequence computed from the failure and non-failure models; classification involves a customizable decision threshold.
(figure: an incoming error sequence over data window t_d is scored by both the failure-sequence HSMM and the non-failure-sequence HSMM; the two sequence likelihoods are compared to classify the sequence and predict failures)

Maximum F-measure:
  Data               Max. F-measure   Relative quality
  Optimal results    0.66             100%
  Without grouping   0.5097           77%
  Without filtering  0.3601           55%
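The metrics above translate directly into code; this is the standard definition, with a hypothetical function name:

```python
def precision_recall_f(correct_alarms, total_alarms, total_failures):
    """Precision = correct alarms / all alarms raised;
    recall = correct alarms / all failures that occurred;
    F-measure = harmonic mean of precision and recall."""
    precision = correct_alarms / total_alarms
    recall = correct_alarms / total_failures
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Because the harmonic mean is dominated by the smaller of the two values, a predictor cannot reach a high F-measure by raising alarms indiscriminately (high recall, low precision) or by staying silent (high precision, low recall).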
Conclusions
We have presented the data-preprocessing techniques that we applied for online failure prediction in a commercial telecommunication system. The presented techniques include:
- assignment of IDs to error messages using Levenshtein's edit distance,
- failure sequence clustering, and
- noise filtering based on a statistical test.
Using error and failure logs of the commercial telecommunication system, we showed that elaborate data preprocessing is an essential step to achieve accurate failure predictions.
Backup
Tupling
Goal: remove multiple reporting of the same issue.
Approach: combine messages of the same type if they occur closer in time to each other than a threshold ε.
Problem: determine the threshold value ε.
Solution suggested by Tsao and Siewiorek: observe the number of tuples for various values of ε and apply the elbow rule.
(figure: number of tuples plotted over ε; the elbow of the curve marks a suitable threshold)
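The tupling step can be sketched as a single pass over time-sorted events. Extending a tuple from its most recent member (rather than its first) is a design assumption the slide leaves open:

```python
def tuple_messages(events, epsilon):
    """Collapse repeated reports: merge consecutive events of the same
    error type whose timestamps are closer than `epsilon` seconds.
    `events` is a list of (timestamp, error_id) pairs sorted by time;
    the returned list keeps one representative per tuple."""
    last_seen = {}   # error_id -> timestamp of its latest occurrence
    tuples = []
    for ts, eid in events:
        if eid in last_seen and ts - last_seen[eid] < epsilon:
            last_seen[eid] = ts          # still within the current tuple
        else:
            tuples.append((ts, eid))     # gap exceeded: start a new tuple
            last_seen[eid] = ts
    return tuples
```

To apply the elbow rule, one would run this for a range of ε values, plot len(tuple_messages(events, eps)) over ε, and pick the ε where the curve flattens.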
HSMM Model Structure for Failure Sequence Clustering
(figure: a small HSMM with states s_1 … s_5 and a failure state F)
Cluster Distance Metrics
- Single linkage: distance between two clusters is the minimum pairwise distance between their members
- Complete linkage: distance between two clusters is the maximum pairwise distance between their members
- Average linkage: distance between two clusters is the mean pairwise distance between their members
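The three standard linkage rules listed above differ only in how pairwise distances are aggregated; a minimal sketch, with clusters given as index lists into a distance matrix:

```python
def single_linkage(d, c1, c2):
    """Cluster distance = minimum pairwise distance between members."""
    return min(d[i][j] for i in c1 for j in c2)

def complete_linkage(d, c1, c2):
    """Cluster distance = maximum pairwise distance between members."""
    return max(d[i][j] for i in c1 for j in c2)

def average_linkage(d, c1, c2):
    """Cluster distance = mean pairwise distance between members."""
    return sum(d[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))
```

Single linkage tends to chain clusters together, complete linkage produces compact clusters, and average linkage is a compromise between the two.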
Online Failure Prediction
(figure: runtime pipeline) Incoming error messages pass through error-ID assignment and tupling; sequence extraction produces the current error sequence, which is filtered once per failure mechanism (filtering 1 … u) and scored by models 0 … u; the resulting sequence likelihoods 0 … u are classified to yield the failure prediction.
Comparison of Techniques
(figure: bar chart comparing periodic prediction, DFT, Eventset, SVD-SVM, and HSMM in terms of precision, recall, F-measure, and false positive rate; values range from 0.0 to 0.7)
Hidden Semi-Markov Model
(figure: a chain of states 1, 2, 3, …, N−1, F with time-dependent transition distributions g_ij(t) and per-state symbol distributions b_i(A), b_i(B), b_i(C))
- Discrete-time Markov chain (DTMC): states (1, …, N−1, F) with transition probabilities
- Hidden Markov model (HMM): each state can additionally generate (error) symbols (A, B, C, F) according to a discrete probability distribution of symbols per state, b_i(X)
- Hidden semi-Markov model (HSMM): the transition probabilities become time-dependent, g_ij(t)
Proactive Fault Management
(figure: measurements from the running system feed a prediction model for online failure prediction, which in turn triggers failure avoidance and preparation for failure)