Developing Process Mining Tools
An Implementation of Sequence Clustering for ProM

Gabriel Martins Veiga

Dissertation for the degree of Master of Science in Information Systems and Computer Engineering

Jury
President: Prof. José Tribolet
Supervisor: Prof. Diogo R. Ferreira
Committee: Prof. Andreas Wichert

September 2009

Acknowledgments

To my family, especially my parents and my brother, who always supported me and made my academic path possible. To Prof. Diogo Ferreira for his excellent supervision and availability to help; his suggestions and guidance throughout this year greatly improved the value of this dissertation. To the other members of our research group, namely Pedro Martins and Gil Aires, for their support and for the exchange of ideas during this past year. Also to all my friends, especially the ones who accompanied me during my years in college.


Abstract

The goal of process mining is to extract useful information from event logs that record the activities an organization performs. There are many process mining techniques to discover a process model based on some event log. These techniques perform well on structured processes, but have problems with less structured ones, where the logs are very confusing and contain large amounts of noise, making it difficult to extract useful information. The models generated from such logs tend to be difficult to read and to contain unrelated behavior. In this work we present an approach that aims at overcoming these difficulties by extracting only the useful data and presenting it in an understandable manner. For this purpose, sequence clustering algorithms are used to partition the log into smaller logs (clusters) that correspond to sets of related cases. For each cluster, a model in the form of a Markov chain is presented. A preprocessing stage was also developed to clean the log of certain irrelevant elements that complicate the generated models. The approach was fully implemented in the ProM framework and all the experiments were performed in that environment. Taking into account the results achieved in a real-world case study and the results of several experiments, we conclude that the approach is capable of dealing with complex logs, eliminating unnecessary behavior and partitioning different types of behavior into more understandable models. We also conclude that the sequence clustering algorithm compares favorably with other clustering methods for dividing sequences in a process mining context.

Keywords
Process Mining, Preprocessing, Sequence Clustering, ProM, Markov Chains, Event Logs, Hierarchical Clustering, Process Models


Resumo

The goal of process mining is to obtain relevant information from the event logs that record the activities executed in an organization. Several techniques in this area generate process models from such logs. These techniques produce good results on well-structured processes, but have problems when applied to poorly structured ones. In those cases the logs are very confusing and contain a large amount of noise, hindering the extraction of useful information. For these logs, the generated model is difficult to understand and may include behavior from quite distinct cases. In this work we present an approach that aims to overcome these difficulties by extracting only the relevant information and presenting it in a readable form. To that end, sequence clustering algorithms are used to divide the log into smaller logs (clusters) that correspond to sets of related cases. For each cluster, a model in the form of a Markov chain is presented. A preprocessing stage was also developed, to clean the log of elements that could unnecessarily complicate the obtained models. The approach was implemented in the ProM tool and all experiments were carried out in that environment. Taking into account the results obtained in a real case study and the results of several experiments, we conclude that the approach is capable of dealing with complex logs, eliminating unnecessary behavior and dividing different types of behavior into more comprehensible models. We also conclude that the sequence clustering algorithm produces good results when compared with other clustering algorithms for dividing sequences in the context of process mining.

Palavras Chave
Process Mining, Preprocessing, Sequence Clustering, ProM, Markov Chains, Event Logs, Hierarchical Clustering, Process Models


Contents

1 Introduction
1.1 Process Mining
1.2 Motivation
1.3 Organization
2 Process Mining Tools
2.1 ProM
2.2 Mining Tools
2.3 Analysis Tools
2.4 Process Mining with Clustering
2.5 Trace Clustering Approach for Process Mining
2.6 Conclusion
3 Sequence Clustering for ProM
3.1 Sequence Clustering
3.2 Applications of Sequence Clustering
3.3 Preprocessing
3.4 Implementation within ProM
3.4.1 Preprocessing Stage
3.4.2 Sequence Clustering Stage
3.5 Hierarchical Clustering
3.6 Conclusion
4 Experiments and Evaluation
4.1 Issue Handling Process
4.2 Patient Treatment Process
4.3 Telephone Repair Process: Comparing Clustering Methods
5 Case Study: Application Server Logs
5.1 Case study description
5.2 Log Structure
5.3 Preprocessing stage
5.4 Sequence Clustering results
6 Conclusion
6.1 Main contributions
6.2 Future work
A Published paper

List of Figures

2.1 Overview of the ProM Framework (adapted from [1])
2.2 MXML Snapshot
2.3 Process Model of the example log
2.4 DWS mining result
2.5 Trace Clustering result
2.6 Process Model for cluster (1,1)
3.1 Example of a cluster model (Markov chain) displayed in the sequence clustering plug-in
3.2 Markov chain matrix representation
3.3 Sequence Clustering plug-in in the ProM framework
3.4 Preprocessing stage for the Sequence Clustering plug-in
3.5 Cluster Inspection in the Sequence Clustering plug-in
3.6 Cluster model with no threshold
3.7 Cluster model with an edge threshold of 0.06
4.1 Model for the initial log of the issue handling process
4.2 Sequences present in the log of the Issue Handling Process
4.3 Cluster 1: Issue Handling Process
4.4 Cluster 2: Issue Handling Process
4.5 Cluster 3: Issue Handling Process
4.6 Cluster 4: Issue Handling Process
4.7 Cluster 3.1: Issue Handling Process
4.8 Cluster 3.2: Issue Handling Process
4.9 Model for the initial log of the Patient Treatment Process
4.10 Events present in the log of the Patient Treatment Process
4.11 Sequences present in the log of the Patient Treatment Process
4.12 Cluster 1: Patient Treatment Process
4.13 Cluster 2: Patient Treatment Process
4.14 Cluster 3: Patient Treatment Process
5.1 System infrastructure of a public institution
5.2 Application Server Logs Snapshot
5.3 Spaghetti model obtained from the application server logs using the heuristics miner
5.4 Events related with exceptions in the application server logs
5.5 Some of the behavioral patterns discovered from the application server logs using the sequence clustering plug-in

List of Tables

2.1 Example of an event log with 70 process instances, for the process of patient care in a hospital (A: Register patient, B: Contact family doctor, C: Treat patient, D: Give prescription to the patient, E: Discharge patient)
4.1 Correspondence between letters and events
4.2 Complexity metrics of the process models from the clusters generated by the three different clustering methods


CHAPTER 1 Introduction

The growing demand for faster and more structured procedures in organizations has resulted in the proliferation of information systems. However, the existence of an information system to accomplish a given task does not ensure the most efficient way to execute that task, especially when several systems are required to execute it. Performance issues are a common problem that organizations face, and therefore optimization is frequently a priority. Optimizing the way an organization performs its processes increases efficiency, adding value to the organization. To optimize a process, an organization must first understand how that process is being executed; traditionally this involves a long period of analysis, including interviews with all the people responsible for a given part of the process. The appearance and proliferation of Process-Aware Information Systems [2] (such as ERP, WFM, CRM and SCM systems) has opened the door for a more efficient method to study the execution of processes, called process mining [3]. These systems typically record the events executed during a business process execution, and analyzing these logs can yield important knowledge to improve the execution of processes and the quality of the organization's services. This is where process mining comes in.

1.1 Process Mining

The process mining area is concerned with the discovery, monitoring and improvement of real processes (not assumed processes) by extracting information from event logs.

Process mining techniques can generally be grouped into three types: (1) discovery of process knowledge such as process models [4, 5, 6]; (2) conformance checking, i.e. measuring the conformance between modeled behavior (defined process models) and observed behavior (process executions present in logs) [7, 8]; and (3) extension of a process model with information extracted from event logs (such as identifying bottlenecks in a process model). The main application of process mining is the discovery of process models, and much research has therefore been done to improve the models produced. However, there are still issues that complicate the discovery of comprehensible models and that need to be addressed. For processes with many different cases and a high diversity of behavior, the generated models tend to be very confusing and difficult to understand; these models are usually called spaghetti models [9]. Clustering techniques have been investigated as a means to deal with this complexity by dividing cases into clusters, leading to less confusing models. However, results still suffer from the presence of certain unusual cases that include noise and ad-hoc behavior, which are common in real-world environments. This type of behavior can have different origins, such as human error in executing a given process, incomplete executions of a process, or errors produced by the systems. Known types of noise are, for example, the inversion of the order of activities, the presence of unrelated activities, or the absence of needed activities. Usually this type of behavior is not relevant to understanding a process, and it unnecessarily complicates the discovered models.

1.2 Motivation

In this dissertation we present an approach that is able to deal with these problems by means of sequence clustering techniques. This is a kind of model-based clustering that partitions the cases according to the order in which events occurred. For the purpose of this work the model used to represent each cluster is a first-order Markov chain. The fact that this clustering is probabilistic makes it suitable for logs containing many different types of behavior, possibly non-recurrent behavior as well. When sequence clustering is applied, the log is divided into a number of clusters and the corresponding Markov chains are generated. Additionally, the approach comprises a preprocessing stage, whose goal is to clean the log of certain events that would only complicate the clustering method and its results. If the models are still confusing after both techniques are applied, sequence clustering can be re-applied hierarchically within each cluster until understandable results are obtained. The approach has been implemented in ProM [1], an extensible framework for process mining that already includes many techniques to address challenges in this area.

1.3 Organization

This document is organized as follows: Chapter 2 provides an overview of existing work involving clustering and process mining. The framework in which the work was developed is presented, along with some of the most important techniques implemented in that framework. Chapter 3 presents the proposed approach, including the preprocessing stage and the sequence clustering algorithm. The implementation of these techniques in ProM is discussed, including the inputs needed and the outputs produced.

Chapter 4 presents three experiments that demonstrate the use of the implemented techniques and compare the results with other clustering methods. Chapter 5 demonstrates the approach in a real-world case study where the goal was to understand the typical behavior of faults in an application server. Finally, in Chapter 6 we draw conclusions about this thesis and suggest some future work.


CHAPTER 2 Process Mining Tools

Process mining techniques aim at the analysis of business processes by extracting information from event logs, and are especially useful when little information is available about a given process and obtaining that information is complicated. Most of these techniques are available in the ProM framework. In this chapter we present some work related to the concepts approached in this dissertation, focusing on process mining techniques available in the ProM framework. First we present the framework, then we give an overall view of the different types of process mining tools available in ProM, and finally we explore in greater detail two of those tools that apply clustering methods to process mining.

2.1 ProM

The environment on which this dissertation is based is the ProM Framework [1, 10]. ProM is an extensible framework aimed at process mining that is issued under an open source license, so the development of plug-ins is possible and encouraged. (For more information and to download ProM visit www.processmining.org.) Many plug-ins resulting from research work have been developed, in three major categories: mining, analysis and conversion. Figure 2.1 presents an overview of the ProM Framework architecture; it shows the relations between the framework, the plug-ins and the event log. The event log that usually serves as input to the plug-ins has a specific format based on XML and defined for this framework, called MXML [1]. This format follows a specified schema definition, which means the log does not consist of random and disorganized information; rather, it contains all the elements needed by the plug-ins at a known location.

Figure 2.1: Overview of the ProM Framework (adapted from [1])

In Figure 2.2 a snapshot of an MXML log is presented. Each Process Instance corresponds to one execution of a given process and has a set of Audit Trail Entries associated with it. These entries correspond to the events that occurred during the execution of the process instance, and are composed of several attributes, like the WorkflowModelElement that represents the name of the event and the EventType that classifies the event according to its state (Start or Complete). There are also other attributes that identify the originator of a given event and the timestamp at which the event was executed. As shown in Figure 2.1, these event logs are generated by Process-Aware Information Systems (PAIS) [2] and are read by the ProM framework using the Log Filter, which can also perform some filtering on those logs before any other task is performed. The Import plug-ins are used to load many different kinds of models, like Aris graphs, and the Mining plug-ins perform some kind of mining, storing the results as Frames. These Frames can be used to visualize a Petri net [12] or a social network [13], for example. Analysis plug-ins can perform further analysis, like checking the conformance between a process model and a log. Finally, Conversion plug-ins can transform a mining result into another format, and the Export plug-ins can store the results outside the framework in different kinds of format. Next we further explore the mining and analysis plug-ins available in ProM, in particular the ones relating to our work.
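To make the log structure concrete, the following minimal sketch (ours, purely for illustration; ProM itself is written in Java, and this is not part of its code base) reads the event sequences out of an MXML file, assuming the tag names visible in the snapshot of Figure 2.2 below:

```python
import xml.etree.ElementTree as ET

def read_sequences(mxml_path):
    """Return one list of (event name, event type) pairs per process instance."""
    tree = ET.parse(mxml_path)
    sequences = []
    # Tag names as in the MXML snapshot (Figure 2.2); iter() finds them at any depth.
    for instance in tree.iter("ProcessInstance"):
        events = [(entry.findtext("WorkflowModelElement"),
                   entry.findtext("EventType"))
                  for entry in instance.iter("AuditTrailEntry")]
        sequences.append(events)
    return sequences

# Example: sequences = read_sequences("log.mxml")
```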

<ProcessInstance id="0" description="">
  <AuditTrailEntry>
    <WorkflowModelElement>A</WorkflowModelElement>
    <EventType>complete</EventType>
  </AuditTrailEntry>
  <AuditTrailEntry>
    <WorkflowModelElement>C</WorkflowModelElement>
    <EventType>complete</EventType>
  </AuditTrailEntry>
  <AuditTrailEntry>
    <WorkflowModelElement>E</WorkflowModelElement>
    <EventType>complete</EventType>
  </AuditTrailEntry>
</ProcessInstance>
<ProcessInstance id="1" description="">
  <AuditTrailEntry>
    <WorkflowModelElement>C</WorkflowModelElement>
    <EventType>complete</EventType>
  </AuditTrailEntry>
  <AuditTrailEntry>
    <WorkflowModelElement>A</WorkflowModelElement>
    <EventType>complete</EventType>
  </AuditTrailEntry>
  <AuditTrailEntry>
    <WorkflowModelElement>B</WorkflowModelElement>
    <EventType>complete</EventType>
  </AuditTrailEntry>
  <AuditTrailEntry>
    <WorkflowModelElement>D</WorkflowModelElement>
    <EventType>complete</EventType>
  </AuditTrailEntry>
</ProcessInstance>

Figure 2.2: MXML Snapshot

2.2 Mining Tools

Mining tools are implementations of mining algorithms in ProM. They can be divided into three major types: (1) control-flow discovery, (2) organizational perspective and (3) data perspective. Some of the control-flow discovery tools include:

α-algorithm plug-in — it implements the α-algorithm [4], constructing a Petri net that models the workflow of the process. It establishes a set of relations between tasks and assumes that a log is complete (all possible behavior is present). This algorithm has some shortcomings, namely it is not robust to noise and it cannot mine processes with short loops or duplicate tasks. Some work has been done to extend this algorithm, for instance to mine short loops [14] and to detect implicit dependencies [15].

Heuristics miner plug-in [5] — it implements a heuristics-driven algorithm that is especially useful for dealing with noise, by expressing only the main behavior present in a log. This means that not all details are shown to the user, and exceptions are ignored. To illustrate what kind of graph is presented by this tool, we created the simple example log shown in Table 2.1 (real-life logs generate much more complex and confusing models; this example is only used to present some concepts).

Id  Process Instance  Frequency
1   ABCE              20
2   ACE               12
3   CABD              10
4   CAB               14
5   CDBE              14

Table 2.1: Example of an event log with 70 process instances, for the process of patient care in a hospital (A: Register patient, B: Contact family doctor, C: Treat patient, D: Give prescription to the patient, E: Discharge patient)

Using the ProM implementation of this algorithm on this log, we obtain the result shown in Figure 2.3. This tool can also be used when searching for long-distance dependency relations.

Genetic algorithm plug-in [6] — it uses genetic algorithms to calculate the best possible process model for a log. Every individual is assigned a fitness measure that evaluates how well the individual can reproduce the behavior present in the input log. In this context, individuals are possible process models. Candidate individuals are generated using genetic operators like crossover and mutation, and then the fittest are selected. This algorithm was proposed to deal with some issues involving the logs, like noise and incompleteness.

The organizational perspective aims at understanding the different types of business relations established within an organization. Some of the mining tools available in ProM that approach this subject are:

Social network miner plug-in [16] — it takes a log file and determines a social network of people. By using this tool we can identify roles and interactions in an organization, for example who usually works together or who hands over work to whom.

Organizational miner plug-in — from an event log containing originator information, it presents to the user a graph associating activities and originators.

Tools that deal with the data perspective make use of additional data attributes present in logs. Here is an example:

Decision miner [17] — this tool analyzes how data attributes of process instances or activities (such as timestamps or performance indicators) influence the routing of a process instance. To accomplish this, every decision point in the process model is analyzed and, if possible, linked to properties of individual cases (process instances) or activities.
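As an aside, the core idea of the heuristics miner can be illustrated on the log of Table 2.1. The sketch below (ours, not the plug-in's code) counts direct-follows relations and computes the dependency measure a ⇒ b = (|a>b| − |b>a|) / (|a>b| + |b>a| + 1) from the heuristics miner literature [5]; values close to 1 indicate a reliable dependency:

```python
from collections import Counter

# The example log of Table 2.1 as (sequence, frequency) pairs.
log = [("ABCE", 20), ("ACE", 12), ("CABD", 10), ("CAB", 14), ("CDBE", 14)]

# Count how often each event directly follows another, weighted by case frequency.
follows = Counter()
for seq, freq in log:
    for a, b in zip(seq, seq[1:]):
        follows[(a, b)] += freq

def dependency(a, b):
    """Heuristics-miner dependency measure a => b, in the range (-1, 1)."""
    ab, ba = follows[(a, b)], follows[(b, a)]
    return (ab - ba) / (ab + ba + 1)

print(round(dependency("A", "B"), 3))  # 0.978: A is very likely a cause of B
```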

There are also some plug-ins that deal with less structured processes:

Fuzzy miner [18] — the process models of less structured processes tend to be very confusing and hard to read (usually referred to as spaghetti models). The objective of this tool is to emphasize graphically the most relevant behavior, by calculating the relevance of activities and their relations. To achieve this, two metrics are used: (1) significance, which measures the level of interest we have in events (for example by calculating their frequency in the log), and (2) correlation, which determines how closely related two events that follow each other are, so that highly related events can be aggregated.

Figure 2.3: Process Model of the example log

2.3 Analysis Tools

Analysis plug-ins have a variety of purposes, like performing some property analysis on a previously obtained mining result or comparing a process log with a predefined model of how a process should be executed. Next we present only a few of those that we consider most relevant:

Conformance checker — one important question that organizations would like to have answered is: are our processes being executed as we planned? Answering this question has been an active field of research [7, 8], and this tool was implemented in ProM to address the problem. It analyzes the gap between a model and the real world, detecting violations (bad executions of a process) and ensuring transparency (the model might be outdated). To measure conformance this tool uses two concepts: (1) fitness, which checks if the event log complies with the control flow specified by the process model, and (2) appropriateness, which checks if the model describes the behavior present in the event log.

Basic performance analysis — the objective of this tool is to calculate performance measures such as the execution time of a process or the waiting time. The tool then presents the results with several different kinds of graphs.

LTL checker [19] — this plug-in checks whether a log satisfies some Linear Temporal Logic (LTL) formula. For example, it can check if a given activity is executed by the person that should be executing it, or whether an activity A that has to be executed after B is indeed always executed at the correct moment.

2.4 Process Mining with Clustering

When generating process models like the one in Figure 2.3, conventional control-flow techniques tend to over-generalize. In the attempt to represent all the different behavior present in the log, these techniques create models that allow for more behavior than the one actually observed. When a log has very different process instances, the generated models are even more complex and confusing. Clustering has been approached as a way to overcome this problem [20]. The approach was implemented in ProM as the Disjunctive Workflow Schema (DWS) mining plug-in. In the methodology developed, first the complete log is examined and a model is generated using the Heuristics Miner [5]. Then the log is compared to the model to measure the model's quality. If the generated model is optimal and no over-generalization is detected the approach stops; otherwise the log is divided into clusters using the K-means clustering method and their models are tested. If the cluster models still allow for too much behavior, the clusters are repartitioned, and so on until optimal models are achieved. The result of this methodology is the set of all the models created and the over-generalization points. Let's apply this methodology to our example log described in Table 2.1. By analyzing the model shown in Figure 2.3, we can conclude that it allows for behavior not present in the log, for example the sequence BCA. By running the DWS plug-in available in ProM we come to the result shown in Figure 2.4: the model is presented (right), along with the navigational tree of generated models where we can choose the one to view (top-left) and the over-generalization points detected (bottom-left, marked in red), where the first one refers to the sequence we had identified, stating that A was never executed after BC. Other clustering methods investigated in the process mining area are presented in the next section.

Figure 2.4: DWS mining result

2.5 Trace Clustering Approach for Process Mining

Trace Clustering [21] is another approach investigated in the process mining area as a way to partition the log, grouping similar sequences together. The motivation behind this work was the existence of flexible environments, where the execution of processes does not follow a rigid set of rules; although the notion of process is present, the actors are able to execute it differently according to each case. An example of such an environment is healthcare, where strictly following a process is not a priority compared to providing the best care for patients.

In these environments, and particularly when a large number of cases (process instances) is recorded in the log, the main problem is diversity: single process instances differ significantly from one another, so there are several different types of sequences and the models generated by conventional techniques are very confusing (spaghetti-like models). This approach addresses the issue using distance-based clustering along with profiles, with the purpose of reducing the diversity by lowering the number of cases analyzed at once. Each profile is formed by a set of items that describe a case from a particular perspective. Every item is a metric that assigns a numeric value to each case, and therefore a profile can be viewed as a vector containing the values of all the different items (profiles can be combined, resulting in aggregate vectors). These vectors are then used to calculate the distance between two cases, using distance metrics (like the Euclidean distance or the Hamming distance). Examples of such profiles are:

Transition — the items in this profile are the direct-following relations of the sequence (that forms a process instance). For any two events (A, B) there is an item measuring how often B has directly followed A.

Case Attributes — the items in this profile are the data attributes of the process instance. When process instances are annotated with meta-information, comparing that information can be an efficient way to compare the instances.

Finally, clustering methods such as the following can be applied to group closely related cases in the same cluster:

K-means Clustering — one of the most used clustering methods; it constructs k clusters by dividing the data into k groups.

Self-Organizing Map (SOM) — a neural network technique used to map high-dimensional data onto low-dimensional spaces. Similar cases are mapped close to each other in the SOM.

In Figure 2.5 we can see the resulting output of this method (in ProM) when applied to our example log. Three clusters were generated, and in the map we can analyze the similarity between different cases and different clusters. If the cases are close together in the map and the color separating them is light, then those cases are very similar. These algorithms are available in the ProM framework via the trace clustering plug-in.

Figure 2.5: Trace Clustering result

Figure 2.6 shows the process model (generated by the Heuristics Miner) for one of the clusters created when applying the SOM method. We can now clearly identify a type of sequence (no diversity present in that subset of the original log); it refers to a case (CDBE) where the patient was not registered. That is, when analyzing the clusters we can discover different types of behavior, including types of sequences that are not being executed as they should.

Figure 2.6: Process Model for cluster (1,1)

Recent work has been done to improve the results produced by trace clustering. A context-aware approach based on a generic edit distance was presented in [22]. In this work a method was defined to automatically derive a cost function for the costs of edit operations, which takes into account the context of an event within a sequence. Considering the context can be valuable, given that the events present in the sequences and the order in which they occur have semantic relevance. To understand the usefulness of this kind of clustering, a comparison was made between different trace clustering methods [22].
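To make the profile idea concrete, here is a small sketch (ours, for illustration; the trace clustering plug-in offers many more profiles) of the transition profile described above, together with the Euclidean distance used to compare cases:

```python
from collections import Counter
from itertools import product
from math import sqrt

def transition_profile(seq, alphabet):
    """One item per ordered event pair (A, B): how often B directly follows A."""
    counts = Counter(zip(seq, seq[1:]))
    return [counts[(a, b)] for a, b in product(alphabet, repeat=2)]

def euclidean(u, v):
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

alphabet = "ABCDE"
# Distance between two cases of the example log (Table 2.1).
print(euclidean(transition_profile("ABCE", alphabet),
                transition_profile("CDBE", alphabet)))
```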

Comparing the results produced by one trace clustering approach to another is not trivial, due to the difficulty in understanding whether one cluster is better formed than another. A better formed cluster is one in which the sequences have a higher degree of similarity, and consequently the models for those clusters are easier to understand. Therefore a process mining perspective was proposed to evaluate the goodness of clusters by analyzing their models. Fitness and comprehensibility metrics were used to evaluate the complexity of the models. By comparing these metrics, the approach proved to generate less complex cluster models, indicating that better formed clusters were achieved.

2.6 Conclusion

In this chapter we have introduced some important concepts relating to our work. The framework used throughout this dissertation was presented, along with the types of tools available in that framework. The continuous growth of ProM is due to the importance that process mining has gained in recent years, resulting in extensive research performed by people around the world. We presented in greater detail two solutions involving the application of clustering techniques to process mining. The Trace Clustering approach is particularly relevant to our work, since they share the common goal of subdividing the initial log into smaller logs, so as to facilitate the detection of patterns.

The difference between the two approaches lies in the techniques used to achieve that goal. The emphasis of our solution is on the problem of noise and ad-hoc behavior, which complicates both the identification of patterns in logs originated by real-world information systems and the results produced by the clustering methods presented. To accomplish this we combine different techniques that are presented in the next chapter.

CHAPTER 3 Sequence Clustering for ProM

Like the clustering techniques described in the previous chapter, sequence clustering can take a set of sequences and group them into clusters, so that similar types of sequences are placed in the same cluster. However, this type of clustering is performed directly on the sequences, as opposed to being performed on features extracted from those sequences. Sequence clustering has been extensively used in the field of bioinformatics, for example to classify large protein datasets into different families [23]. Process mining also deals with sequences, but instead of amino acids the sequences contain events that have occurred during the execution of a given process. Sequence clustering techniques are therefore a natural candidate for clustering workflow logs. In this chapter the techniques that form our solution are explored, and the way these techniques were implemented is presented, including the outputs produced.

3.1 Sequence Clustering

The sequence clustering algorithm used here is based on first-order Markov chains [24, 25]. Each cluster is represented by the corresponding Markov chain and by all the sequences assigned to it. A Markov chain is composed of a set of states and the transition probabilities between them. In a first-order Markov chain the probability of a given transition to a future state depends only on the current state. For the purpose of process mining it becomes useful to augment the simple Markov chain model with two dummy states: the input state and the output state. This is necessary in order to represent the probability of a given event being the first or the last event of the chain, which may become useful to distinguish between some types of sequences. Figure 3.1 shows a simple example of such a chain, depicted in ProM via the sequence clustering plug-in developed in this work.

Figure 3.1: Example of a cluster model (Markov chain) displayed in the sequence clustering plug-in

In this model, darker elements (both states and transitions) are more recurrent than lighter ones. By analyzing the color of the elements and the probability associated with each transition it is possible to decide which elements should be kept for analysis and which elements can be discarded. For example, one may choose to remove transitions that have very low probabilities, so that only the most typical behavior can be analyzed. Figure 3.2 corresponds to the matrix representation of the Markov chain shown in Figure 3.1, where each column and each line correspond to an event and the matrix is ordered alphabetically from a to e (the first and last rows and columns correspond to the two dummy states, denoted here as in and out):

        in    a     b      c      d     e      out
 in     0.0   1.0   0.0    0.0    0.0   0.0    0.0
 a      0.0   0.0   0.892  0.108  0.0   0.0    0.0
 b      0.0   0.0   0.0    0.0    1.0   0.0    0.0
 c      0.0   0.0   0.0    0.0    1.0   0.0    0.0
 d      0.0   0.0   0.0    0.0    0.0   0.368  0.632
 e      0.0   0.0   0.0    0.0    0.0   0.0    1.0
 out    0.0   0.0   0.0    0.0    0.0   0.0    0.0

Figure 3.2: Markov chain matrix representation

In this representation the matrix values are the transition probabilities; for example, the transition from a to c has a 10.8% probability of occurring. Every line in the matrix must be normalized, i.e. the sum of all the transition probabilities originating in a given state must equal one. Notice that in the first column and in the last line all values are zero, and this will be so in every Markov chain generated by our solution, because the input state is a dummy first state that is never transitioned to and the output state is a dummy final state that never transitions anywhere.
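As a quick illustration (our own sketch; the sparse dictionary representation is chosen for readability, not taken from the plug-in), the probability of a complete path through this matrix is simply the product of the transition probabilities along it, including the two dummy states:

```python
# Transition probabilities of the chain in Figure 3.2; "in"/"out" are the dummy states.
P = {("in", "a"): 1.0,
     ("a", "b"): 0.892, ("a", "c"): 0.108,
     ("b", "d"): 1.0, ("c", "d"): 1.0,
     ("d", "e"): 0.368, ("d", "out"): 0.632,
     ("e", "out"): 1.0}

def path_probability(events, P):
    """Product of transition probabilities from the input to the output state."""
    path = ["in"] + list(events) + ["out"]
    prob = 1.0
    for src, dst in zip(path, path[1:]):
        prob *= P.get((src, dst), 0.0)  # absent transitions have probability zero
    return prob

print(path_probability("abd", P))  # 1.0 * 0.892 * 1.0 * 0.632 = 0.563744
```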

As said before, these are first-order Markov chains; there are also nth-order chains, where the probability of a transition to a future state depends on the previous n states. An example of recent work with higher-order Markov chains can be found in [26]. The assignment of sequences to clusters is based on the probability of each cluster producing the given sequence. In general, any given sequence will be assigned to the cluster that is able to produce it with the highest probability. Let $\circ$ and $\bullet$ denote the input and output states, respectively. To calculate the probability of a sequence $x = \{\circ, x_1, x_2, \ldots, x_L, \bullet\}$ being produced by cluster $c_k$ the following formula is used:

$$p(x \mid c_k) = p(x_1 \mid \circ;\, c_k) \left[ \prod_{i=2}^{L} p(x_i \mid x_{i-1};\, c_k) \right] p(\bullet \mid x_L;\, c_k) \tag{3.1}$$

where $p(x_i \mid x_{i-1};\, c_k)$ is the transition probability from $x_{i-1}$ to $x_i$ in the Markov chain associated with cluster $c_k$. This formula handles the input and output states in the same way as any regular state that corresponds to an event. The goal of sequence clustering is to estimate these parameters for all clusters $c_k$, with $k = 1, 2, \ldots, K$, based on a set of input sequences. For that purpose, the algorithm relies on an Expectation-Maximization procedure [27] to improve the model parameters iteratively. For a given number of clusters K the algorithm proceeds as follows:

1. Initialize randomly the state transition probabilities of the Markov chains associated with each cluster.
2. Assign each sequence to the cluster that can produce it with the highest probability according to equation (3.1).
3. Recompute the state transition probabilities of the Markov chain of each cluster, considering the sequences that were assigned to that cluster in the previous step.
4. Repeat steps 2 and 3 until the assignment of sequences to clusters does not change, and hence the cluster models do not change either.

In other words, first we randomly distribute the sequences into the clusters (steps 1 and 2), then in step 3 we re-estimate the cluster models (Markov chains and their probabilities) according to the sequences assigned to each cluster. After this first iteration we re-assign the sequences to clusters and again re-estimate the cluster models (steps 2 and 3). These two steps are executed repeatedly until the algorithm converges. The result is a set of Markov models that describe the behavior of each cluster. The random initialization of the transition probabilities is an important feature of this algorithm that introduces a certain level of uncertainty in the results achieved. The sequence clustering algorithm can therefore generate a different set of clusters for the same sequences. Throughout this dissertation we minimized the impact of this uncertainty by applying the algorithm several times (usually five) and choosing the result that occurred most often.
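The following sketch (ours; a simplification of the algorithm above, with the random initialization of probabilities replaced by a random first assignment and a small probability floor for transitions unseen in a cluster) shows the whole loop end to end:

```python
import random
from collections import defaultdict

def seq_prob(seq, chain, floor=1e-6):
    """Equation (3.1): probability of a sequence under one cluster's chain."""
    path = ["in"] + list(seq) + ["out"]
    p = 1.0
    for src, dst in zip(path, path[1:]):
        p *= chain.get((src, dst), floor)  # floor keeps unseen transitions comparable
    return p

def estimate_chain(seqs):
    """Step 3: normalized transition counts of the sequences in one cluster."""
    counts, totals = defaultdict(float), defaultdict(float)
    for seq in seqs:
        path = ["in"] + list(seq) + ["out"]
        for src, dst in zip(path, path[1:]):
            counts[(src, dst)] += 1.0
            totals[src] += 1.0
    return {t: c / totals[t[0]] for t, c in counts.items()}

def sequence_clustering(sequences, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Steps 1-2, simplified: a random initial assignment of sequences to clusters.
    assignment = [rng.randrange(k) for _ in sequences]
    for _ in range(max_iter):
        chains = [estimate_chain([s for s, c in zip(sequences, assignment) if c == i])
                  for i in range(k)]
        # Step 2: reassign each sequence to its most probable cluster.
        new = [max(range(k), key=lambda i: seq_prob(s, chains[i])) for s in sequences]
        if new == assignment:  # Step 4: converged.
            break
        assignment = new
    return assignment, chains

# Example: labels, models = sequence_clustering(["ABCE", "ACE", "CDBE"] * 10, k=2)
```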

3.2 Applications of Sequence Clustering

Sequence clustering algorithms have been an active field of research in the area of bioinformatics [23, 28], as mentioned earlier. Although this has been the area primarily associated with sequence clustering, some work has been done with this type of algorithm in other areas. In [24], the goal was to analyze the navigation patterns on a website; these patterns consisted of sequences of URL categories followed by users. Sequence clustering was the approach chosen to partition site users, placing users with similar navigation paths in the same cluster. The behavior of the users present in each cluster is then displayed and can be analyzed to understand the particular interests of different types of user. In the field of process mining, sequence clustering has also been investigated [25]. The motivation behind that work was the fact that an event log can contain events originating from different processes; i.e. the idea was not to assume that an event log only contains events of one process, but that it can instead be a mixture of different processes, without any information stating which events correspond to which processes. The goal was to develop an approach able to extract sequences of related events (relating to the same case) from those chaotic logs. After identifying the sequences, the Microsoft Sequence Clustering Algorithm (available in SQL Server [29]) is applied to group similar sequences in the same cluster, without the need for any business logic information. The environment developed to test this approach was an application that executed sequences of actions over a database and recorded these actions in logs. After extracting the sequences from the event log with some methodology, the sequence clustering algorithm was applied with a specific number of clusters, so as to generate a new cluster for each of the different types of sequences identified. Consequently, the model generated for each cluster constitutes a deterministic graph (the transition probabilities all equal 1.0) and the visualization of each model leads to the identification of a sequence type executed in the environment tested.

3.3 Preprocessing

Although the sequence clustering algorithm described above is robust to noise, all sequences must ultimately be assigned to a cluster. However, if a sequence is very uncommon and different from all the others, it will affect the probabilistic model of its cluster and in the end make that cluster's model harder to interpret. To avoid this problem, some preprocessing must be done on the input sequences prior to applying sequence clustering. This preprocessing can be seen as a way to clean the dataset of undesired states (events) and also as a way to eliminate undesirable sequences. For example, undesired events can be events that occur rarely, and undesired sequences can be single-step sequences that only have one state. Some of the steps that can be performed during preprocessing are described in [30] and include, for example, dropping events and sequences with low support. In this work we have extended these steps by allowing not only the least but also the most recurring events and sequences to be discarded. This was motivated by the fact that in some real-world applications the log is filled with some very frequent but irrelevant events (such as debug messages) that must be removed in order to allow the analysis to focus on the relevant behavior.

Spaghetti models are often cluttered with events that occur very often but only contribute to obscuring the process model one aims to discover. The preprocessing steps implemented within the sequence clustering plug-in are optional and configurable; a sketch of the pipeline is given after the list. They focus on the following features:

1. Event type — The events recorded in an MXML log file may represent different points in the lifetime of workflow activities, such as the start or completion of a given activity. For sequence clustering what matters is the order of activity execution, so we retain only one type of event, usually the completion event for each activity. Therefore only events of type complete are kept after this step.

2. Event support — Some events may be so infrequent that they are not relevant for the purpose of discovering typical behavior. These events should be removed in order to facilitate analysis. On the other hand, some events may be so frequent that they too become irrelevant, and even undesirable if they hide the behavior one aims to discover. Therefore, this step can remove events with either very low or very high support.

3. Consecutive repetitions — Sequence clustering is a means to analyze the transitions between states in a process. If an event is followed by an equal event then it should be considered only once, since the state of the process has not changed. Consecutive repetitions are therefore removed; for example, the sequence A C C D becomes A C D.

4. Sequence length — After the previous preprocessing steps, some sequences may collapse to only a few events or even to a single event. This step provides the possibility to discard those sequences. It also provides the possibility to discard exceedingly long sequences, which can have undesirable effects on the analysis results. Sequence length can therefore be limited to a certain range.

5. Sequence support — Some sequences may be rather unique, so that they hardly contribute to the discovery of typical behavior. In principle the previous preprocessing steps will prevent the existence of such sequences at this stage but, as with events, sequences that occur very rarely can be removed from the dataset. In some applications, such as fault detection, it may be useful to actually discard the most common sequences and focus instead on the less frequent ones, so sequence support can also be limited to a certain range.

The order presented is the order in which the preprocessing steps should be applied, because applying them in a different order may change the results. For example, rare sequences should only be removed at the final stage, because previous steps may transform them into common sequences. Imagine we have the rare sequence A B C D; if in step 2 event B is considered to have low support and is removed, the sequence becomes A C D. This new sequence might not be a rare sequence and therefore should not be removed.
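A compact sketch of steps 2-5 (ours, with arbitrary illustrative bounds rather than the plug-in's actual parameters; step 1 is assumed to have been applied when reading the log) could look as follows:

```python
from collections import Counter
from itertools import groupby

def preprocess(sequences, ev_bounds=(0.01, 0.95), length_bounds=(2, 50), seq_min=2):
    """Steps 2-5 of the preprocessing stage, applied in the prescribed order."""
    # Step 2: drop events whose relative support is too low or too high.
    total = sum(len(s) for s in sequences)
    support = Counter(e for s in sequences for e in s)
    keep = {e for e, c in support.items() if ev_bounds[0] <= c / total <= ev_bounds[1]}
    out = [[e for e in s if e in keep] for s in sequences]
    # Step 3: collapse consecutive repetitions (A C C D -> A C D).
    out = [[e for e, _ in groupby(s)] for s in out]
    # Step 4: keep only sequences within the allowed length range.
    out = [s for s in out if length_bounds[0] <= len(s) <= length_bounds[1]]
    # Step 5: finally, drop sequences that occur too rarely (a maximum could
    # likewise be enforced for applications such as fault detection).
    freq = Counter(tuple(s) for s in out)
    return [s for s in out if freq[tuple(s)] >= seq_min]
```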

Figure 3.3: Sequence Clustering plug-in in the ProM framework

3.4 Implementation within ProM

The above preprocessing steps and the sequence clustering algorithm have been implemented as a new plug-in for the process mining framework ProM [1], which offers an environment suitable for extension. Figure 3.3 presents a general view of our solution inserted in that environment, and is discussed in detail throughout this section. In particular, we discuss the interaction between the developed techniques and ProM, including the inputs needed and the outputs produced. The starting point, as for the majority of the plug-ins in ProM, is an event log. We assume that the log contains a variety of process instances corresponding to one process, and that each of these process instances contains a set of audit trail entries. These entries correspond to a given event executed within the process instance and have some attributes, like the name, the type and the entity responsible. An event usually marks the beginning or the end of an activity, so these two concepts are closely related and are therefore used interchangeably throughout this dissertation. The set of all the entries of a given process instance is considered the sequence of events that were performed in that process instance. Different process instances (of the same process) may be composed of different sequences, which represent alternative ways in which the process was executed. The main goal of our solution is to group sequences that are somehow related. With an event log as input we now have several sequences available, so we can start the implementation of the techniques described.

3.4.1 Preprocessing Stage

In this stage the log is cleaned of certain elements that might negatively influence the usefulness of the final results; the objective is not to alter the format of the log, but to prepare the sequences so that the sequence clustering algorithm can group them afterwards. The preprocessing stage receives an input log in MXML format. If the log we intend to analyze is not in this format, a framework called ProM Import [31] can be used to restructure and convert the log to the accepted format (both the schema definition for MXML and the ProM Import framework can be downloaded from www.processmining.org). The other input at this stage is a set of options provided by the user, which specify the parameters to be used in the preprocessing steps described earlier. Figure 3.4 presents a screenshot of this stage; the options can be seen in the top-right corner: (1) the minimum percentage of occurrence of an event, (2) the maximum percentage of occurrence of an event, (3) the minimum size of a sequence, (4) the maximum size of a sequence, (5) the minimum occurrence of a sequence and (6) the maximum occurrence of a sequence. After the user specifies which elements to keep, the sequences present in the log (left in the figure) are altered or removed. The view is then refreshed to show the sequences of the preprocessed log.

Figure 3.4: Preprocessing stage for the Sequence Clustering plug-in

When implementing this technique and the sequence clustering technique in ProM, the original log is never changed; instead, filters are used to create a new log. Rather than modifying the original log, the filters construct a new log based on the original one and on the results produced by those techniques. Filters are an existing component of ProM, with different types that can be used prior to any mining tool; to support our techniques, new filters were created. The result produced at this stage is a filtered log. This log is made available to the ProM framework, so that it may be analyzed with other plug-ins if desired; instead of acting just as a first stage for sequence clustering, the preprocessing stage can thus also be used together with other types of analysis available in the framework.

3.4.2 Sequence Clustering Stage

The sequence clustering stage receives the filtered log as input from the preprocessing stage, along with the desired number of clusters. In general the plug-in will generate a solution with the provided number of clusters, except when some clusters turn out to be empty. Each cluster can be used again as an event log in ProM, so it becomes possible to further subdivide it into clusters, or for another process mining plug-in to analyze it. These features allow the user to drill down through the behavior of clusters. The plug-in provides special functionalities for visualizing the results, both in terms of the sequences that belong to each cluster and in terms of the Markov chain of each cluster. In the first type of visualization, depicted in Figure 3.5, the clusters generated are shown (left-hand side), the set of instances present in each cluster is presented (middle), and the sequence of events that composes each type of instance can also be inspected (right-hand side). As shown in Figure 3.5, the sequences within each cluster are aggregated according to their types, which means we can immediately identify how many different types of sequence were assigned to a cluster and how many sequences there are of each type.

As an example, we can see in the figure the inspection of Cluster 2, to which fifteen types of process instances were assigned. There are forty instances of the type highlighted in the image, and the sequence of events that composes this type of process instance is the one shown on the right-hand side, formed by six events. This visualization is especially useful to identify the frequency of occurrence of different types of behavior; for example, one can conclude that the last type of process instance seen in the figure, which occurs only once, is a rare sequence of events, probably originating from noise or ad-hoc behavior and therefore not relevant to understanding the behavior of the process being analyzed. The second type of visualization available makes this plug-in a mixture between a mining plug-in and an analysis plug-in. On one hand, sequence clustering is a mining plug-in that extracts models of behavior for the different behavioral patterns found in an input event log. Figure 3.6 shows the type of results that the plug-in is able to present. When visualizing the results, the user can adjust thresholds that correspond to the minimum and maximum probability of both edges and nodes (right-hand side of Figure 3.6). This allows the user to adjust what is shown in the graphical model by removing elements (both states and transitions) that are either too frequent or too rare. This feature facilitates the understanding of spaghetti models by taking advantage of the probabilistic nature of sequence clustering, and without having to re-run the algorithm; it can be seen as a post-processing of the cluster models achieved. An example of the usefulness of this feature is the difference between Figure 3.6, which represents all the behavior present in a cluster (this cluster results from the division into two clusters of a log available in the ProM website), and Figure 3.7, which only represents transitions occurring above a threshold of 0.06. The difference in the complexity of the two models after eliminating less recurrent behavior is noticeable.
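This post-processing amounts to filtering the transition dictionary of a cluster's Markov chain; a minimal sketch (ours, reusing the sparse representation from Section 3.1):

```python
def apply_thresholds(chain, min_p=0.06, max_p=1.0):
    """Keep only transitions whose probability lies within the chosen thresholds,
    as in the difference between Figure 3.6 and Figure 3.7."""
    return {edge: p for edge, p in chain.items() if min_p <= p <= max_p}

# Example with the chain P from Section 3.1: apply_thresholds(P, min_p=0.2)
# drops the rare a -> c transition (probability 0.108) from the displayed model.
```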

Figure 3.5: Cluster Inspection in the Sequence Clustering plug-in

Figure 3.6: Cluster model with no threshold