Error Log Processing for Accurate Failure Prediction. Humboldt-Universität zu Berlin



Similar documents
Error Log Processing for Accurate Failure Prediction

Proactive Fault Management

Using Hidden Semi-Markov Models for Effective Online Failure Prediction

Online Failure Prediction in Cloud Datacenters

Discovering process models from empirical data

Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations

Numerical Field Extraction in Handwritten Incoming Mail Documents

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach

LogMaster: Mining Event Correlations in Logs of Large-scale Cluster Systems

W6.B.1. FAQs CS535 BIG DATA W6.B If the distance of the point is additionally less than the tight distance T 2, remove it from the original set

Alignment and Preprocessing for Data Analysis

Supply chain management by means of FLM-rules

Document Image Retrieval using Signatures as Queries

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Practical Online Failure Prediction for Blue Gene/P: Period-based vs Event-driven

Mining the Software Change Repository of a Legacy Telephony System

System Management and Operation for Cloud Computing Systems

FALSE ALARMS IN FAULT-TOLERANT DOMINATING SETS IN GRAPHS. Mateusz Nikodem

Implementing Heuristic Miner for Different Types of Event Logs

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

How To Use Neural Networks In Data Mining

GETTING STARTED WITH LABVIEW POINT-BY-POINT VIS

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Ericsson T18s Voice Dialing Simulator

POMPDs Make Better Hackers: Accounting for Uncertainty in Penetration Testing. By: Chris Abbott

Failure Prediction in IBM BlueGene/L Event Logs

Less naive Bayes spam detection

Problem Detection and Diagnosis under Sporadic Operations and Changes

Discovering Structured Event Logs from Unstructured Audit Trails for Workflow Mining

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

Finding soon-to-fail disks in a haystack

Discrete Optimization

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

Proactive self-healing mechanisms for IP networks. IETF89/NMRG33 London Bruno Vidalenc, Laurent Ciavaglia

Inferring Probabilistic Models of cis-regulatory Modules. BMI/CS Spring 2015 Colin Dewey

Nino Pellegrino October the 20th, 2015

User Authentication/Identification From Web Browsing Behavior

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Introduction. A. Bellaachia Page: 1

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Using Trace Clustering for Configurable Process Discovery Explained by Event Log Data

A NEW HYBRID ALGORITHM FOR BUSINESS INTELLIGENCE RECOMMENDER SYSTEM

Failures of software have been identified as the single largest source of unplanned downtime and

Database Marketing, Business Intelligence and Knowledge Discovery

Convention Paper Presented at the 135th Convention 2013 October New York, USA

1. Classification problems

Three Methods for ediscovery Document Prioritization:

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining 5. Cluster Analysis

Examining Case Management Demand using Event Log Complexity Metrics

Visual-based ID Verification by Signature Tracking

Azure Machine Learning, SQL Data Mining and R

Predicting Flight Delays

DNA: An Online Algorithm for Credit Card Fraud Detection for Games Merchants

The Basics of Graphical Models

Web Usage Mining: Identification of Trends Followed by the user through Neural Network

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

Big Data & Scripting Part II Streaming Algorithms

Using Data Mining for Mobile Communication Clustering and Characterization

ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan

Discrete Hidden Markov Model Training Based on Variable Length Particle Swarm Optimization Algorithm

TELEMETRY NETWORK INTRUSION DETECTION SYSTEM

The CUSUM algorithm a small review. Pierre Granjon

TIETS34 Seminar: Data Mining on Biometric identification

Approximate Search Engine Optimization for Directory Service

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

A Statistical Text Mining Method for Patent Analysis

Probabilistic Model Checking at Runtime for the Provisioning of Cloud Resources

Introduction to Algorithmic Trading Strategies Lecture 2

Intelligent Log Analyzer. André Restivo

Classification of Household Devices by Electricity Usage Profiles

Probabilistic Latent Semantic Analysis (plsa)

Globally Optimal Crowdsourcing Quality Management

A Framework for Intelligent Online Customer Service System

NC STATE UNIVERSITY Exploratory Analysis of Massive Data for Distribution Fault Diagnosis in Smart Grids

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

How To Cluster Of Complex Systems

Exploiting Evidence from Unstructured Data to Enhance Master Data Management

Enhancing Quality of Data using Data Mining Method

EXPLORING SPATIAL PATTERNS IN YOUR DATA

Tutorial for proteome data analysis using the Perseus software platform

Seion. A Statistical Method for Alarm System Optimisation. White Paper. Dr. Tim Butters. Data Assimilation & Numerical Analysis Specialist

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Categorical Data Visualization and Clustering Using Subjective Factors

PayLess: A Low Cost Network Monitoring Framework for Software Defined Networks

Business Process Modeling

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach

Performance Metrics for Graph Mining Tasks

Mining a Corpus of Job Ads

Statistical machine learning, high dimension and big data

Multi-Algorithm Ontology Mapping with Automatic Weight Assignment and Background Knowledge

Data Mining Using Neural Network Approaches Using SAS and Java

Standardization and Its Effects on K-Means Clustering Algorithm

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

Copyright Shraddha Pradip Sen 2011 ALL RIGHTS RESERVED

TS3: an Improved Version of the Bilingual Concordancer TransSearch

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

Second International Workshop on Preservation of Evolving Big Data - Panel on Big Data Quality

Transcription:

Error Log Processing for Accurate Failure Prediction Felix Salfner ICSI Berkeley Steffen Tschirpke Humboldt-Universität zu Berlin

Introduction Context of work: Error-based online failure prediction: error events Data used: C A C B data window present time prediction failure? t Commercial telecommunication system 200 components, 2000 classes Error- and failure logs In this talk we present the data preprocessing concepts we applied to obtain accurate failure prediction results 2

Contents Key facts on the data Overview of online failure prediction and data preprocessing process Detailed description of major preprocessing concepts Assigning IDs to Error Messages Failure Sequence Clustering Noise Filtering Experiments and Results 3

Key Facts on the Data Experimental setup: Telecommunication System response times Call Tracker error logs failure log 200 days of data from a 273 days period 26,991,314 error log records 1,560 failures of two types Failure Definition: response time calls If within a 5 min interval more than 0.01% of calls experience a response time > 250ms 250ms Performance Failures 5 min 5 min t 0.01% > 0.01% Failure! 4

Online Failure Prediction Approach: Pattern recognition using Hidden Semi-Markov Models Failure Sequence 1 Non-Failure Sequence 1 Failure Sequence 2 Non-Failure Sequence 2 B C A F A B C B A B F C B A t t d t l t d t d t l t d HSMM for Failure Ssequences HSMM for Non-Failure Sequences Objectives for data preprocessing: Create a data set to train HSMM models exposing key properties of system Identify how to process incoming data during runtime Tasks: Machine-processable data Error-ID assignment Separate sequences for inherent failure mechanisms Clustering Distinguishing, noise-free sequences Noise Filtering 5

Training Data Preprocessing Error Log Failure Log Error-ID assignment Tupling Timestamp extraction Sequence Extraction Non-Failure Sequences Failure Sequences Clustering Noise Filtering 1 Noise Filtering u Sequences for Failure-Mechanism 1 Sequences for Failure-Mechanism u Model 0 Model 1 Model u 6

Error ID Assignment Problem: Error logs contain no message IDs Example message of a log record: process 34: end of buffer reached Task: Assign an ID to message to characterize what has happened Approach: Two steps: Remove numbers process xx: end of buffer reached ID assignment based on Levenshtein's edit distance with constant threshold Data No of Messages Reduction Original 1,695,160 Without numbers 12,533 Levenshtein 1,435 7

Failure Sequence Clustering Error Log Failure Log Error-ID assignment Tupling Timestamp extraction Sequence Extraction Non-Failure Sequences Failure Sequences Clustering Noise Filtering 1 Noise Filtering u Sequences for Failure-Mechanism 1 Sequences for Failure-Mechanism u Model 0 Model 1 Model u 8

Failure Sequence Clustering (2) Goal: Divide set of training failure sequences into subsets Group according to sequence similarity Approach: A F 1 B A A C F 2 B A B F 3 A B Train a small HSMM for each sequence Apply each HSMM to all sequences Sequence log-likelihoods express similarities M 1 M 2-2.1-2.6-4.2-1.3-9.7-10.2 M 3-7.8-6.9-1.2 Make matrix symmetric by Apply standard clustering algorithm 9

Failure Sequence Clustering (3) 10

Noise Filtering Error Log Failure Log Error-ID assignment Tupling Timestamp extraction Sequence Extraction Non-Failure Sequences Failure Sequences Clustering Noise Filtering 1 Noise Filtering u Sequences for Failure-Mechanism 1 Sequences for Failure-Mechanism u Model 0 Model 1 Model u 11

Noise Filtering (2) Problem: Clustered failure sequences contain many unrelated errors Main reason: parallelism in the system Assumption: Indicative events occur more frequently prior to a failure than within other sequences Apply a statistical test to quantify what more frequently is A F 1 B A A C F 2 B A B F 3 A B A F 4 A B A C F 5 B A Clustering Filtering Group 1 F 3 F 5 B C A B B A Filtering Group n F 1 F 2 F 4 A B A A C B A A AB A A A C B A B A t t A A B A t Training Sequences for Failure Mechanism 1 Training Sequences for Failure Mechanism n time of failure 12

Noise Filtering (3) Testing variable derived from goodness-of-fit test: denotes the number of occurrences of error denotes the total number of errors in the time window. denotes the prior probability of occurrence of error Keep events in the sequence if Three ways to estimate priors from training data set Entire dataset Training sequences G 1 G 3 G 2 G 4 Failure training sequences Results 13

Experiments and Results Objective: Predict upcoming failures as accurate as possible Metric used: F-Measure: Precision: relative number of correct alarms to total number of alarms Recall: relative number of correct alarms to total number of failures F-Measure: harmonic mean of precision and recall Failure prediction is achieved by comparing sequence likelihood of an incoming sequence computed from failure and non-failure models Classification involves a customizable decision threshold Maximum F-Measure Data Max. F- Measure Relative Quality Optimal Results 0.66 100% Without grouping 0.5097 77% Without filtering 0.3601 55% B HSMM for failure sequences Sequence likelihood C t d classification Failure prediction A HSMM for non-failure sequences Sequence likelihood t 14

Conclusions We have presented the data preprocessing techniques that we have applied for online failure prediction in a commercial telecommunication system The presented techniques include: Assignment of IDs to error messages using Levenshtein's edit distance Failure sequence clustering Noise filtering based on a statistical test Using error and failure logs of the commercial telecommunication system, we showed that elaborate data preprocessing is an essential step to achieve accurate failure predictions

Backup 16

Tupling Goal: Remove multiple reporting of the same issue Approach: Problem: Combine messages of the same type if they occur closer in time to each other than a threshold ε. Determine the threshold value ε Solution suggested by Tsao and Siewiorek: Observe the number of tuples for various values of ε and apply the elbow rule ε

HSMM Model Structure for Failure Sequence Clustering s 2 s 1 s 3 F s 5 s 4

Cluster Distance Metrics Single linkage complete linkage Average linkage

Online Failure Prediction Error messages Error ID assignment error message Tuplingsequence Sequence Extraction Filtering 1 Filtering u Model 0 Model 1 Model u Sequence Likelihood 0 Sequence Likelihood 1 Sequence Likelihood u Classification Failure Prediction

Comparison of Techniques 0.7 0.0 0.2 0.4 0.6 periodic DFT Eventset SVD-SVM HSMM precision recall F-measure false positive rate 21

Hidden Semi-Markov Model g 13 (t) t 1 2 3 N-1 F b 1 (A) b 1 (B) b 1 (C) 0 g 12 (t) t b 2 (A) b 2 (B) b 2 (C) 0 g 23 (t) b 3 (A) b 3 (B) Discrete time Markov chain (DTMC) States (1,, N-1,F) Transition probabilities Hidden Markov Model (HMM) t b 3 (C) 0 b N-1 (A) b N-1 (B) b N-1 (C) 0 0 0 0 1 Each state can generate (error) symbols (A,B,C,F) Discrete probability distribution of symbols per state b i (X) Hidden Semi-Markov Model (HSMM) Time-dependent transition probabilities g ij (t) 22

Proactive Fault Management Running System Measurements Failure Avoidance Preparation for Failure Prediction Model Online Failure Prediction 23