Developing Process Mining Tools

Size: px
Start display at page:

Download "Developing Process Mining Tools"

Transcription

1 Developing Process Mining Tools An Implementation of Sequence Clustering for ProM Gabriel Martins Veiga Dissertation for the degree of Master of Science in Information Systems and Computer Engineering Jury President: Supervisor: Committee: Prof. José Tribolet Prof. Diogo R. Ferreira Prof. Andreas Wichert September 2009

2

3 Acknowledgments To my family, especially my parents and my brother who always supported me and made my academic path possible. To Prof. Diogo Ferreira for his excellent orientation and availability to help. His suggestions and guidance given throughout this year greatly improved the value of this dissertation. To the other members of our research group, namely Pedro Martins and Gil Aires for the support given and for the exchange of ideas that occurred during this past year. Also to all my friends especially the ones that accompanied me during the years spent in college. i

4 ii

5 Abstract The goal of process mining is to extract useful information from event logs that record the activities an organization performs. There are many process mining techniques to discover a process model, based on some event log. These techniques perform well on structured processes, but have problems with less structured ones. In this case the logs are very confusing and have high quantities of noise, making it difficult to extract useful information. The models generated for such logs tend to be difficult to read and contain unrelated behavior. In this work we present an approach that aims at overcoming these difficulties by extracting only the useful data and presenting it in an understandable manner. For this purpose sequence clustering algorithms are used to cluster the log into smaller logs (clusters) that correspond to a set of related cases. For each cluster, a model in the form of a Markov chain is presented. A preprocessing stage was also developed, to clean the log of certain irrelevant elements that complicate the models generated. The approach was fully implemented in the ProM framework and all the experiments were performed in that environment. Taking into account the results achieved for a real-world case study and the results of several experiments, we conclude that the approach is capable of dealing with complex logs, eliminating unnecessary behavior and partitioning different types of behavior into more understandable models. We also conclude that the sequence clustering algorithm provides good results when compared to other clustering methods to divide sequences in a process mining context. Keywords Process Mining, Preprocessing, Sequence Clustering, ProM, Markov Chains, Event Logs, Hierarchical Clustering, Process Models iii

6 iv

7 Resumo O objectivo da extracção de processos é obter informação relevante a partir dos logs de eventos que registam as actividades executadas numa organização. Existem várias técnicas nesta área que a partir desses logs geram modelos de processos. Estas técnicas apresentam bons resultados em processos bem estruturados, mas têm problemas quando aplicadas a processos pouco estruturados. Nestes casos os logs são muito confusos e têm uma grande quantidade de ruído, dificultando a extracção de informação útil. Para estes logs, o modelo gerado é difícil de compreender e poderá incluir comportamento de casos bastante distintos. Neste trabalho apresentamos uma abordagem que visa ultrapassar estas dificuldades, extraindo apenas a informação relevante e apresentando-a de forma legível. Para isso algoritmos de clustering de sequências são utilizados para dividir o log em logs mais pequenos (clusters) que correspondem a um conjunto de casos relacionados. Para cada cluster, um modelo em forma de cadeia de Markov é apresentado. Também se desenvolveu uma fase de pré-processamento, para limpar o log de elementos que poderão complicar desnecessariamente os modelos obtidos. A abordagem foi implementada na ferramenta ProM e todas as experiências foram executadas nesse ambiente. Tendo em conta os resultados obtidos num caso de estudo real e os resultados de diversas experiências, conclui-se que a abordagem é capaz de lidar com logs complexos, eliminando comportamento desnecessário e dividindo diferentes tipos de comportamento em modelos mais compreensíveis. Também se conclui que o algoritmo de clustering de sequências apresenta bons resultados quando comparado a outros algoritmos de clustering ao dividir sequências no contexto da extracção de processos. Palavras Chave Extracção de Processos, Pré-processamento, Clustering de Sequências, ProM, Cadeias de Markov, Logs de Eventos, Clustering Hierárquico, Modelos de Processos v

8 vi

9 Contents Introduction. Process Mining Motivation Organization Process Mining Tools 5 2. ProM Mining Tools Analysis Tools Process Mining with Clustering Trace Clustering Approach for Process Mining Conclusion Sequence Clustering for ProM 5 3. Sequence Clustering Applications of Sequence Clustering Preprocessing Implementation within ProM Preprocessing Stage Sequence Clustering Stage vii

10 3.5 Hierarchical Clustering Conclusion Experiments and Evaluation Issue Handling Process Patient Treatment Process Telephone Repair Process: Comparing Clustering Methods Case Study: Application Server Logs Case study description Log Structure Preprocessing stage Sequence Clustering results Conclusion Main contributions Future work A Published paper 55 viii

11 List of Figures 2. Overview of the ProM Framework (adapted from []) MXML Snapshot Process Model of the example log DWS mining result Trace Clustering result Process Model for cluster (,) Example of a cluster model (Markov Chain) displayed in the sequence clustering plug-in Markov chain Matrix representation Sequence Clustering plug-in in the ProM framework Preprocessing stage for the Sequence Clustering plug-in Cluster Inspection in the Sequence Clustering plug-in Cluster model with no threshold Cluster model with an edge threshold of Model for the initial log of the issue handling process Sequences present in the log of the Issue Handling Process Cluster : Issue Handling Process Cluster 2: Issue Handling Process Cluster 3: Issue Handling Process ix

12 4.6 Cluster 4: Issue Handling Process Cluster 3.: Issue Handling Process Cluster 3.2: Issue Handling Process Model for the initial log of the Patient Treatment Process Events present in the log of the Patient Treatment Process Sequences present in the log of the Patient Treatment Process Cluster : Patient Treatment Process Cluster 2: Patient Treatment Process Cluster 3: Patient Treatment Process System infrastructure of a public institution Application Server Logs Snapshot Spaghetti model obtained from the application server logs using the heuristics miner Events related with exceptions in the application server logs Some of the behavioral patterns discovered from the application server logs using the sequence clustering plug-in x

13 List of Tables 2. Example of an event log with 70 process instances, for the process of patient care in a hospital (A: Register patient, B: Contact family doctor, C: Treat patient, D: Give prescription to the patient, E: Discharge patient) Correspondence between letters and events Complexity metrics of the process models from the clusters generated by the three different clustering methods xi

14 xii

15 CHAPTER Introduction The growing demand for faster and more structured procedures in organizations has resulted in the proliferation of information systems. However, the existence of an information system to accomplish a given task does not ensure the most efficient way to execute that task, especially when several systems are required to execute it. Performance issues are a common problem that organizations face and therefore optimization is frequently a priority. Optimizing the way an organization performs its processes leads to an increase of efficiency, adding value to the organization. To optimize a process the organizations must understand how a process is being executed, usually this involved a long period of analysis, including interviews with all the persons responsible for a given part of the process. The appearance and proliferation of Process-Aware Information Systems [2] (such as ERP, WFM, CRM and SCM systems) has opened the door for a more efficient type of method to study the execution of processes, called process mining [3]. These systems typically record events executed during a business process execution and analyzing these logs can yield important knowledge to improve the execution of processes and improve the quality of the organization s services. This is where process mining comes in.. Process Mining The process mining area is concerned with the discovery, the monitoring and the improvement of real processes (not assumed processes) by extracting information from event logs. Process mining techniques can generally be grouped into three types: () discovery of process knowledge like process models [4, 5, 6], (2) conformance checking, i.e. measure the conformance between modeled

16 behavior (defined process models) and observed behavior (process execution present in logs) [7, 8] and (3) extension of a process model with information extracted from event logs (like identifying bottlenecks in a process model). The main application of process mining is the discovery of process models. Therefore, much investigation has been performed in order to improve the models produced. However there are still issues that complicate the discovery of comprehensible models and that need to be addressed. For processes with a lot of different cases and high diversity of behavior, the models generated tend to be very confusing and difficult to understand. These models are usually called spaghetti models [9]. Clustering techniques have been investigated as the means to deal with this complexity by dividing cases into clusters, leading to less confusing models. However, results still suffer from the presence of certain unusual cases that include noise and ad-hoc behavior, which are common in real-world environments. This type of behavior can have different origins, like human error in executing a given process, incomplete executions of a process or errors produced by the systems. Known types of noise are for example the inversion in the order of activities, existence of unrelated activities or lack of needed activities. Usually this type of behavior is not relevant to understand a process and it unnecessarily complicates the discovered models..2 Motivation In this dissertation we present an approach that is able to deal with these problems by means of sequence clustering techniques. This is a kind of model-based clustering that partitions the cases according to the order in which events occurred. For the purpose of this work the model used to represent each cluster is a first-order Markov Chain. The fact that this clustering is probabilistic makes it suitable to deal with logs containing many different types of behavior, possibly nonrecurrent behavior as well. When sequence clustering is applied, the log is divided into a number of clusters and the correspondent Markov Chains are generated. Additionally, the approach also comprises a preprocessing stage, where the goal is to clean the log of certain events that will only complicate the clustering method and its results. If after both techniques are applied the models are still confusing, sequence clustering can be re-applied hierarchically within each cluster until understandable results are obtained. The approach has been implemented in ProM [], an extensible framework for process mining that already includes many techniques to address challenges in this area..3 Organization This document is organized as follows: Chapter 2 provides an overview of existing work involving clustering and process mining. The framework in which the work was developed is presented, along with some of the most important techniques implemented in that framework. Chapter 3 presents the proposed approach, including the preprocessing stage and the sequence clustering algorithm. The implementation of these techniques in ProM is discussed, including the 2

17 inputs needed and the outputs produced. Chapter 4 presents three experiments using the techniques implemented that demonstrate the use of the techniques and compare the results with other clustering methods. Chapter 5 demonstrates the approach in a real-world case study where the goal was to understand the typical behavior of faults in an application server. Finally in chapter 6 we draw conclusions about this thesis and suggest some future work. 3

18 4

19 CHAPTER 2 Process Mining Tools Process mining techniques aim at the analysis of business processes by extracting information from event logs and are especially useful when little information is available about a given process and obtaining that information is complicated. Most of these techniques are available in the ProM framework. In this chapter we present some work related to concepts approached in this dissertation, focusing on process mining techniques available in the ProM framework. First we present the framework, then we show an overall view of the different types of process mining tools available in ProM and finally we explore in greater detail two of those tools that focus on applying clustering methods to process mining. 2. ProM The environment in which this dissertation is based is the ProM Framework [, 0]. ProM is an extensible framework aimed at process mining, that is issued under an open source license, therefore the development of plug-ins is possible and encouraged. Many plug-ins resulting from investigation work have been developed in three major categories: mining, analysis and conversion. Figure 2. presents an overview of the ProM Framework architecture, it shows the relations between the framework, the plug-ins and the event log. The event log that usually serves as input to the plug-ins has a specific format based on XML and defined for this framework, it is called MXML []. This format follows a specified schema definition, which means the log does not consist of random and disorganized information, rather For more information and to download ProM visit 5

20 Figure 2.: Overview of the ProM Framework (adapted from []) it contains all the elements needed by the plug-ins at a known location. In Figure 2.2 a snapshot of a MXML log is presented. Each Process Instance corresponds to one execution of a given process and has a set of Audit Trial Entries associated. These entries correspond to the events that occurred during the execution of the process instance and are composed by several attributes like the WorkflowModelElement that represents the name of the event and the EventType that classifies the event according to its state (Start or Complete). There are also other attributes that identify the originator of a given event and the timestamp in which the event was executed. As shown in Figure 2., these event logs are generated by Process-aware Information Systems (PAIS) [2] and are read by the ProM framework using the Log Filter, that can also perform some filtering to those logs before any other task is performed. The Import plug-ins are used to load many different kinds of models like Aris graphs and the Mining plug-ins perform some kind of mining, storing the results as Frames. These Frames can be used to visualize a Petri Net [2] or a Social Network [3] for example. Analysis plug-ins can perform further analysis like checking the conformance of a process model and a log for example. Finally Conversion plug-ins can transform a mining result into another format and the Export plug-ins can store the results outside the framework in different kinds of format. Next we further explore the mining and analysis plug-ins available in ProM, in particular the ones relating to our work. 6

21 <ProcessInstance id="0" description=""> <AuditTrailEntry> <WorkflowModelElement>A</WorkflowModelElement> <EventType>complete</EventType> </AuditTrailEntry> <AuditTrailEntry> <WorkflowModelElement>C</WorkflowModelElement> <EventType>complete</EventType> </AuditTrailEntry> <AuditTrailEntry> <WorkflowModelElement>E</WorkflowModelElement> <EventType>complete</EventType> </AuditTrailEntry> </ProcessInstance> <ProcessInstance id="" description=""> <AuditTrailEntry> <WorkflowModelElement>C</WorkflowModelElement> <EventType>complete</EventType> </AuditTrailEntry> <AuditTrailEntry> <WorkflowModelElement>A</WorkflowModelElement> <EventType>complete</EventType> </AuditTrailEntry> <AuditTrailEntry> <WorkflowModelElement>B</WorkflowModelElement> <EventType>complete</EventType> </AuditTrailEntry> <AuditTrailEntry> <WorkflowModelElement>D</WorkflowModelElement> <EventType>complete</EventType> </AuditTrailEntry> </ProcessInstance> 2.2 Mining Tools Figure 2.2: MXML Snapshot Mining tools are an implementation of a mining algorithm in ProM. They can be divided into three major types: () Control-flow discovery, (2) Organizational perspective and (3) Data perspective. Some of the control-flow discovery tools include: α-algorithm plug-in it implements the α-algorithm [4], constructing a Petri net that models the workflow of the process. It establishes a set of relations between tasks and assumes that a log is complete (all possible behavior is present). This algorithm presents some shortcomings, namely it is not robust to noise and it cannot mine processes with short-loops or duplicate tasks. Some work has been done to extend this algorithm, for instance to be able to mine short-loops [4] and to detect implicit dependencies [5]. Heuristics miner plug-in [5] it implements a heuristics driven algorithm, that is especially useful for dealing with noise, by only expressing the main behavior present in a log. This means that not all details are shown to the user and exceptions are ignored. To illustrate what kind of graph is presented by this tool, we created a simple example log 2 shown in 2 Real life logs generate much more complex and confusing models, this example is only used to present some concepts. 7

22 Id Process Instance Frequency ABCE 20 2 ACE 2 3 CABD 0 4 CAB 4 5 CDBE 4 Table 2.: Example of an event log with 70 process instances, for the process of patient care in a hospital (A: Register patient, B: Contact family doctor, C: Treat patient, D: Give prescription to the patient, E: Discharge patient) Table 2. and used the ProM implementation of this algorithm to come to the result shown in Figure 2.3. This tool can also be used when searching for long distance dependency relations. Genetic algorithm plug-in [6] it uses genetic algorithms to calculate the best possible process model for a log. Every individual is assigned a fitness measure that evaluates how well the individual can reproduce the behavior present in the input log. In this context, individuals are possible process models. Candidate individuals are generated using genetic operators like crossover and mutation and then the fittest are selected. This algorithm was proposed to deal with some issues involving the logs, like noise and incompleteness. Organizational perspective aims at understanding the different types of business relations established within an organization, some of the mining tools available in ProM that approach this subject are: Social network miner plug-in [6] it takes a log file and determines a social network of people. By using this tool we can identify roles and interactions in an organization, for example who usually works together or who hands over work to whom. Organizational miner plug-in from an event log containing originator information, it presents to the user a graph associating activities and originators. Tools that deal with the data perspective make use of additional data attributes present in logs, here is an example: Decision miner [7] this tool analyzes how data attributes of process instances or activities (such as timestamps or performance indicators) influence the routing of a process instance. To accomplish this, every decision point in the process model is analyzed and if possible linked to properties of individual cases (process instances) or activities. There are also some plug-ins that deal with less structured processes: Fuzzy miner [8] the process models of less structured processes, tend to be very confusing and hard to read (usually referred to as spaghetti models). This tool objective is to emphasize 8

23 Figure 2.3: Process Model of the example log graphically the most relevant behavior, by calculating the relevance of activities and their relations. To achieve this two metrics are used: () significance that measures the level of interest we have in events (for example by calculating their frequency on the log) and (2) correlation that determines how closely related two events that follow each other are, so that events highly related can be aggregated. 2.3 Analysis Tools Analysis plug-ins have a variety of purposes, like implementing some property analysis on a previously achieved mining result or comparing a process log and a predefined model of how a process should be executed. Next we present only a few of those that we consider more relevant: Conformance checker one important question that organizations would like to have answered is: Are our processes being executed as we planned? Answering this question has been an active field of investigation [7, 8]. This tool was implemented in ProM to address this problem. It analyzes the gap between a model and the real world, detecting violations (bad executions of a process) and ensuring transparency (the model might be outdated). To measure the conformance this tool uses two concepts: () fitness that checks if the event log complies with the control flow specified by the process model and (2) appropriateness that checks if the model describes the behavior present in the event log. 9

24 Basic performance analysis the objective of this tool is to calculate performance measures such as the execution time of a process or the waiting time. The tool then presents the results with several different kinds of graphs. LTL checker [9] this plug-in checks whether a log satisfies some Linear Temporal Logic (LTL) formula. For example it can check if a given activity is executed by the person that should be executing it or check whether an activity A that has to be executed after B is indeed always executed at the correct moment. 2.4 Process Mining with Clustering When generating process models like the one on Figure 2.3, conventional control-flow techniques tend to over-generalize. In the attempt to represent all the different behavior present in the log these techniques create models that allow for more behavior than the one actually observed. When a log has very different process instances the models generated are even more complex and confusing. Clustering was approached as a way to overcome this problem [20]. The approach was implemented in ProM as the Disjunctive Workflow Schema (DWS) mining plugin. In the methodology developed, first the complete log is examined and a model is generated using the Heuristics Miner [5]. Then the log is compared to the model to measure the model s quality. If the model generated is optimal and no over-generalization is detected the approach stops, otherwise the log is divided into clusters using the K-means clustering method and their models are tested. If the cluster models still allow for too much behavior the clusters are repartitioned and so on until optimal models are achieved. The result of this methodology is the set of all the models created and the over-generalization points. Let s apply this methodology to our example log described in Table 2.. By analyzing the model shown in Figure 2.3, we can conclude that it allows for behavior not present in the log, for example the sequence BCA. By running the DWS plug-in available in ProM we come to the result shown in Figure 2.4. The model is presented (right), the navigational tree of models generated where we can choose the one to view (top-left) and the over-generalization points detected, where the first one refers to the sequence we had identified, stating that A was never executed after BC (bottom-left marked in red). Other clustering methods investigated in the process mining area are presented in the next section. 2.5 Trace Clustering Approach for Process Mining Trace Clustering [2] is another approach investigated in the process mining area as a way to partition the log, grouping similar sequences together. The motivation behind this work was the existence of flexible environments, where the execution of processes does not follow a rigid set of rules; although the notion of process is present, the actors are able to execute it differently according to each case. An example of such environments is the healthcare, where strictly following a process is not a priority compared to providing the best care for patients. 0

25 Figure 2.4: DWS mining result In these environments and particularly when a large number of cases (process instances) is recorded in the log, the main problem is the diversity; i.e. single process instances differ significantly from one another, therefore there are several different types of sequences and the models generated by conventional techniques are very confusing (spaghetti-like models). This approach addresses the issue using distance-based clustering along with profiles, with the purpose of reducing the diversity by lowering the number of cases analyzed at once. Each profile is formed by a set of items that describe a case from a particular perspective. Every item is a metric that assigns a numeric value to each case and therefore a profile can be viewed as a vector containing the values of all the different items (profiles can be combined resulting in aggregate vectors). These vectors are then used to calculate the distance between two cases, using distance metrics (like the Euclidean distance or the Hamming distance). Examples of such profiles are: Transition The items in this profile are direct following relations of the sequence (that forms a process instance). For any two events (A, B) there is an item measuring how often B as directly followed A. Case Attributes The items in this profile are the data attributes of the process instance. When process instances are annotated with meta-information, comparing that information can be an efficient way to compare the instances. Finally clustering methods such as the ones presented next can be applied to group closely related

26 Figure 2.5: Trace Clustering result cases in the same cluster: K-means Clustering It is one of the most used clustering methods and constructs k clusters by dividing the data into k groups. Self-Organizing Map (SOM) It is a neural network technique, which is used to map high dimensional data onto low dimensional spaces. Similar cases are mapped close to each other in the SOM. In Figure 2.5 we can see the resulting output of this method (in ProM) when applied to our example log. Three clusters were generated and in the map we can analyze the similarity between different cases and different clusters. If the cases are close together in the map and the color separating them is light then those cases are very similar. These algorithms are available in the ProM framework via the trace clustering plug-in. Figure 2.6 shows the process model (generated by the Heuristics Miner) for one of the clusters created when applying the SOM method. We can now clearly identify a type of sequence (no diversity present in that subset of the original log), it refers to a case (CDBE) where the patient was not registered; i.e. when analyzing the clusters we can discover different types of behavior, including types of sequences that are not being executed as they should. Recent work has been done to improve the results produced by trace clustering. A context aware approach based on generic edit distance was presented in [22]. In this work a method was defined to automatically derive a cost function to calculate the costs of edit operations, which takes into account the context of an event within a sequence. Considering the context can be valuable given that the events present in the sequences and the order in which they occur have a semantic relevance. To understand the usefulness of this kind of clustering a comparison was made between different 2

27 Figure 2.6: Process Model for cluster (,) trace clustering methods [22]. Comparing the results produced by one trace clustering approach to another is not trivial, due to the difficulty in understanding if one cluster is better formed than another. A better formed cluster is one in which the sequences have a higher degree of similarity and consequently the models for those clusters are easier to understand. Therefore a process mining perspective was proposed to evaluate the goodness of clusters by analyzing the models of those clusters. Fitness and comprehensibility metrics were used to evaluate the complexity of the models. By comparing these metrics the approach proved to generate less complex cluster models, indicating that better formed clusters were achieved. 2.6 Conclusion In this chapter we have introduced some important concepts relating to our work. The framework used throughout this dissertation was presented and also the types of tools available in that framework. The continuous growth of ProM is due to the importance that process mining has gained in recent years, resulting in numerous research performed by people around the world. We presented in greater detail two solutions involving the application of clustering techniques to process mining. The Trace Clustering approach is particularly relevant to our work, being that they share the common goal of sub-dividing the initial log in smaller logs, as to facilitate the detection of patterns. 3

28 The difference between the two approaches are the techniques used to achieve that goal. The emphasis of our solution is to approach the problem of noise and ad-hoc behavior that complicate the identification of patterns in logs originated by real-world information systems and the results produced by the clustering methods presented. To accomplish this we combine different techniques that are presented in the next chapter. 4

29 CHAPTER 3 Sequence Clustering for ProM Like the clustering techniques described in the previous chapter, sequence clustering can take a set of sequences and group them into clusters, so that similar types of sequence are placed in the same cluster. However, this type of clustering is performed directly on the sequences, as opposed to being performed on features extracted from those sequences. Sequence clustering has been extensively used in the field of bioinformatics, for example to classify large protein datasets into different families [23]. Process mining also deals with sequences, but instead of aminoacids the sequences contain events that have occurred during the execution of a given process. Sequence clustering techniques are therefore a natural candidate to perform clustering on workflow logs. In this chapter the techniques that form our solution are explored and the way these techniques were implemented is presented, including the outputs produced. 3. Sequence Clustering The sequence clustering algorithm used here is based on first-order Markov chains [24, 25]. Each cluster is represented by the corresponding Markov chain and by all the sequences assigned to it. A Markov chain is composed by a set of states and by the transition probabilities between them. In first-order Markov chains the probability of a given transition to a future state depends only on the current state. For the purpose of process mining it becomes useful to augment the simple Markov chain model with two dummy states: the input and the output state. This is necessary in order to represent the probability of a given event being the first or the last event of the chain, which may become useful to distinguish between some types of sequences. Figure 3. shows a simple example of such a chain depicted in ProM via the sequence clustering 5

30 Figure 3.: Example of a cluster model (Markov Chain) displayed in the sequence clustering plug-in a b c d e a b c d e Figure 3.2: Markov chain Matrix representation plug-in developed in this work. In this model, darker elements (both states and transitions) are more recurrent than lighter ones. By analyzing the color of the elements and the probability associated with each transition it is possible to decide which elements should be kept for analysis, and which elements can be discarded. For example, one may choose to remove transitions that have very low probabilities, so that only the most typical behavior can be analyzed. Figure 3.2 corresponds to the matrix representation of the Markov chain shown in Figure 3., where each column and each line correspond to an event and the matrix is ordered alphabetically from a to e (considering that the first and the last state correspond to the two dummy states added). In this representation the matrix values are the transition probabilities, for example the transition from a to c has a 0.8% probability of occurring. Every line in the matrix must be normalized; i.e. the sum of all the transition probabilities originating on a given state must equal one. Notice that in the first column and in the last line all values are zero and will be so in every Markov chain generated by our solution, because the input state is a dummy first state that is never transitioned to and the output state is a dummy final state that never transitions anywhere. 6

31 As said before these are first order Markov chains, there are also n th order chains where the probability of transition to a future state depends on the previous n states. An example of recent work developed with higher-order Markov chains can be found in [26]. The assignment of sequences to clusters is based on the probability of each cluster producing the given sequence. In general, any given sequence will be assigned to the cluster that is able to produce it with higher probability. Let and denote the input and output states, respectively. To calculate the probability of a sequence x = {, x, x 2,, x L, } being produced by cluster c k the following formula is used: [ L ] p (x c k ) = p (x ; c k ) i=2 p (x i x i ; c k ) p ( x L ; c k ) (3.) where p (x i x i ; c k ) is the transition probability from x i to x i in the Markov chain associated with cluster c k. This formula handles the input and output states in the same way as any other regular state that corresponds to an event. The goal of sequence clustering is to estimate these parameters for all clusters c k with k =, 2,, K based on a set of input sequences. For that purpose, the algorithm relies on an Expectation Maximization procedure [27] to improve the model parameters iteratively. For a given number of clusters K the algorithm proceeds as follows:. Initialize randomly the state transition probabilities of the Markov chains associated with each cluster. 2. Assign each sequence to the cluster that can produce it with higher probability according to equation (3.). 3. Recompute the state transition probabilities of the Markov chain of each cluster, considering the sequences that were assigned to that cluster in the previous step. 4. Repeat steps 2 and 3 until the assignment of sequences to clusters does not change, and hence the cluster models do not change either. In other words, first we randomly distribute the sequences into the clusters (steps and 2), then in step 3 we re-estimate the cluster models (Markov chain and its probabilities) according to the sequences assigned to each cluster. After this first iteration we re-assign the sequences to clusters and again re-estimate the cluster models (steps 2 and 3). These two steps are executed repeatedly until the algorithm converges. The result is a set of Markov models that describe the behavior of each cluster. The random initialization of the transition probabilities is an important feature of this algorithm that introduces a certain level of uncertainty in the results achieved. The sequence clustering algorithm can therefore generate a different set of clusters for the same sequences. Throughout this dissertation we minimized the impact of this uncertainty by applying the algorithm several times (usually five) and choosing the result that occurred more often. 7

32 3.2 Applications of Sequence Clustering Sequence clustering algorithms have been an active field of investigation in the area of bioinformatics [23, 28], as mentioned earlier. Although this has been the area primarily associated with sequence clustering, some work has been done with this type of algorithms in other areas. In [24], the goal was to analyze the navigation patterns on a website, these patterns consisted of sequences of URL categories followed by users. Sequence clustering was the approach chosen to partition site users, placing users with similar navigation paths in the same cluster. The behavior of the users present in each cluster is then displayed and can be analyzed to understand the particular interests of different types of user. In the field of process mining, sequence clustering has also been investigated [25]. The motivation behind that work was the fact that an event log can contain events originating from different processes; i.e. the idea was to not make the assumption that an event log only contains events of one process, but instead can be a mixture of different processes without any information stating which events correspond to what processes. The goal was to develop an approach that would be able to extract sequences of related events (relating to the same case) from those chaotic logs. After identifying the sequences, the Microsoft Sequence Clustering Algorithm (available in SQL Server [29]) is applied to group similar sequences in the same cluster, without the need for any business logic information. The environment developed to test this approach was an application that executed sequences of actions over a database and recorded these actions in logs. After extracting the sequences from the event log with some methodology, the sequence clustering algorithm was applied with a specific number of clusters as to generate a new cluster for each of the different types of sequences identified. Consequently, the model generated for each cluster constitutes a deterministic graph (the transition probabilities all equal.0) and the visualization of each model leads to the identification of a sequence type executed in the environment tested. 3.3 Preprocessing Although the sequence clustering algorithm described above is robust to noise, all sequences must ultimately be assigned to a cluster. However, if a sequence is very uncommon and different from all the others it will affect the probabilistic model of that cluster and in the end will make it harder to interpret the model of that cluster. To avoid this problem, some preprocessing must be done to the input sequences prior to applying sequence clustering. This preprocessing can be seen as a way to clean the dataset of undesired states (events) and also as a way to eliminate undesirable sequences. For example, undesired events can be events that occur rarely and undesired sequences can be single step sequences that only have one state. Some of the steps that can be performed during preprocessing are described in [30] and include, for example, dropping events and sequences with low support. In this work we have extended these steps by allowing not only the least but also the most recurring events and sequences to be discarded. This was motivated by the fact that in some real-world applications the log is filled 8

33 with some very frequent but irrelevant events (such as debug messages) that must be removed in order to allow the analysis to focus on the relevant behavior. Spaghetti models are often cluttered with events that occur very often but only contribute to obscure the process model one aims to discover. The preprocessing steps implemented within the sequence clustering plug-in are optional and configurable. They focus on the following features:. Event type The events recorded in a MXML log file may represent different points in the lifetime of workflow activities, such as the start or completion of a given activity. For sequence clustering what is important is the order of activity execution, so we retain only one type of event and that is usually the completion event for each activity. Therefore only events of type complete are kept after this step. 2. Event support Some events may be so infrequent that they are not relevant for the purpose of discovering typical behavior. These events should be removed in order to facilitate analysis. On the other hand, some events may be so frequent that they too become irrelevant and even undesirable if they hide the behavior one aims to discover. Therefore, this preprocessing can remove events both with very low and too high support. 3. Consecutive repetitions Sequence clustering is a means to analyze the transitions between states in a process. If an event is followed by an equal event then it should be considered only once, since the state of the process has not changed. Consecutive repetitions are therefore removed, for example: the sequence A C C D becomes A C D. 4. Sequence length After the previous preprocessing steps, it may happen that some sequences collapse to only a few events or even to a single event. This preprocessing step provides the possibility to discard those sequences. It also provides the possibility to discard exceedingly long sequences which can have undesirable effects in the analysis results. Sequence length can therefore be limited to a certain range. 5. Sequence support Some sequences may be rather unique so that they hardly contribute to the discovery of typical behavior. In principle the previous preprocessing steps will prevent the existence of such sequences at this stage but, as with events, sequences that occur very rarely can be removed from the dataset. In some applications such as fault detection it may be useful to actually discard the most common sequences and focus instead on the less frequent ones, so sequence support can also be limited to a certain range. The order presented is the order in which the preprocessing steps should be applied, because if the steps are applied in a different order the results may differ. For example, rare sequences should only be removed at the final stage, because previous steps may transform them into common sequences. Imagining we have the rare sequence A B C D, but in step 2 state B is considered to have low support and is removed, then it becomes A C D. This new sequence might not be a rare sequence and therefore should not be removed. 9

34 Figure 3.3: Sequence Clustering plug-in in the ProM framework 3.4 Implementation within ProM The above preprocessing steps and the sequence clustering algorithm have been implemented as a new plug-in for the process mining framework ProM [], which offers an environment suitable for extension. Figure 3.3 presents a general view of our solution inserted in that environment and is discussed in detail throughout this section. We particularly approach the interaction between the techniques developed and ProM, including the inputs needed and the outputs produced. The starting point as for the majority of the plug-ins in ProM is an event log. We assume that the log contains a variety of process instances corresponding to one process and each of these process instances contains a set of audit trail entries. These entries correspond to a given event executed within the process instance and have some attributes like the name, the type and the entity responsible. An event usually marks the beginning or the end of an activity, so these two concepts are closely related and therefore are used interchangeably throughout this dissertation. The set of all the entries of a given process instance is considered the sequence of events that were performed in that process instance. Different process instances (of the same process) may be composed by a different sequence, which represent alternative ways in which the process was executed. The main goal of our solution is to group sequences that are somehow related, with an event log as an input we now have several sequences available, so we can start the implementation of the techniques described. 20

35 3.4. Preprocessing Stage In this stage the log is cleaned of certain elements that might influence negatively the usefulness of the final results, the objective is not to alter the format of the log, it is to prepare the sequences for the sequence clustering algorithm to group them afterwards. The preprocessing stage receives an input log in MXML format [], which means that the elements we need are already available at a known location. If the log we intend to analyze is not in the format mentioned, a framework called ProM Import [3] can be used to restructure and convert the log to the accepted format. The other input at this stage are some options provided by the user, which specify the parameters to be used in the preprocessing steps described earlier. Figure 3.4 presents a screenshot of this stage and the options can be seen in the top-right corner: () the minimum percentage of occurrence of an event, (2) the maximum percentage of occurrence of an event, (3) the minimum size of a sequence, (4) the maximum size of a sequence, (5) the minimum occurrence of a sequence and (6) the maximum occurrence of a sequence. After the user specifies which elements to keep, the sequences present in the log (left in the figure) are altered or removed. The view is then refreshed to show the sequences of the preprocessed log. When implementing this technique and the sequence clustering technique in ProM, the original log is never changed, instead filters are used to create a new log. Rather than modifying the original log, what filters do is construct a new log based on the original one and on the results produced by those techniques. This is an existing component of ProM that includes different types of filters that can be used prior to any mining tool. To support our techniques new filters were created. The result produced at this stage is a filtered log. This log is made available to the ProM framework, so that it may be analyzed with other plug-ins if desired, so instead of acting just as a first stage to sequence clustering the preprocessing stage can also be used together with other types of analysis available in the framework Sequence Clustering Stage The sequence clustering stage receives the filtered log as input from the preprocessing stage and also the desired number of clusters. In general the plug-in will generate a solution with the provided number of clusters except when some clusters turn out to be empty. Each cluster can be used again as an event log in ProM, so it becomes possible to further subdivide it into clusters, or for another process mining plug-in to analyze it. These features allow the user to drill-down through the behavior of clusters. The plug-in provides special functionalities for visualizing the results, both in terms of sequences that belong to each cluster and in terms of the Markov chain for each cluster. In the first type of visualization depicted in Figure 3.5, the clusters generated are shown (left-hand side), the set of instances present in each cluster are presented (middle) and finally the sequence of events that compose each type of instance can also be inspected (right-hand side). As shown in Figure 3.5 the sequences within each cluster are aggregated according to their types, which means Both the schema definition for MXML and the ProM Import framework can be downloaded from 2

36 Figure 3.4: Preprocessing stage for the Sequence Clustering plug-in we can immediately identify how many different types of sequence were assigned to a cluster and how many sequences are there of each type. As an example we can see in the figure the inspection of Cluster 2, to which fifteen types of process instances were assigned. There are forty instances of the type highlighted in the image and the sequence of events that compose this type of process instance is the one shown in the right-hand side and is formed by six events. This visualization is especially useful to identify the frequency of occurrence for different types of behavior, for example one can conclude that the last type of process instance seen in the figure that occurs only once is a rare sequence of events, probably originating from noise or ad-hoc behavior and therefore not relevant to understand the behavior of the process being analyzed. The second type of visualization available makes this plug-in a mixture between a mining plug-in and an analysis plug-in. On one hand, sequence clustering is a mining plug-in that extracts models of behavior for the different behavioral patterns found in an input event log. Figure 3.6 shows the type of results that the plug-in is able to present. When visualizing the results, the user can adjust thresholds that correspond to the minimum and maximum probability of both edges and nodes (right-hand side of fig.3.6). This allows the user to adjust what is shown in the graphical model by removing elements (both states and transitions) that are either too frequent or too rare. This feature facilitates the understanding of spaghetti models by taking advantage of the probabilistic nature of sequence clustering and without having to re-run the algorithm again. This can be seen as a post-processing of the cluster models achieved. An example of the usefulness of this feature is the difference between figure 3.6 that represents all the behavior present in a cluster 2 and figure 3.7 that only represents transitions occurring above the threshold of The difference in the complexity of the two models after eliminating less recurrent behavior is noticeable. 2 This cluster results from the division into two clusters of a log available in the ProM website 22

37 Figure 3.5: Cluster Inspection in the Sequence Clustering plug-in Figure 3.6: Cluster model with no threshold 23

Social Network Analysis for Business Process Discovery

Social Network Analysis for Business Process Discovery Social Network Analysis for Business Process Discovery Cláudia Sofia da Costa Alves Dissertation for the degree of Master in Information Systems and Computer Engineering Supervisor: President: Vogal: Prof.

More information

Business Process Analysis in Healthcare Environments: a Methodology based on Process Mining

Business Process Analysis in Healthcare Environments: a Methodology based on Process Mining Business Process Analysis in Healthcare Environments: a Methodology based on Process Mining Álvaro Rebuge a, Diogo R. Ferreira b a Hospital de São Sebastião, EPE Rua Dr. Cândido de Pinho, 4520-211 Santa

More information

ProM Framework Tutorial

ProM Framework Tutorial ProM Framework Tutorial Authors: Ana Karla Alves de Medeiros (a.k.medeiros@.tue.nl) A.J.M.M. (Ton) Weijters (a.j.m.m.weijters@tue.nl) Technische Universiteit Eindhoven Eindhoven, The Netherlands February

More information

Master Thesis September 2010 ALGORITHMS FOR PROCESS CONFORMANCE AND PROCESS REFINEMENT

Master Thesis September 2010 ALGORITHMS FOR PROCESS CONFORMANCE AND PROCESS REFINEMENT Master in Computing Llenguatges i Sistemes Informàtics Master Thesis September 2010 ALGORITHMS FOR PROCESS CONFORMANCE AND PROCESS REFINEMENT Student: Advisor/Director: Jorge Muñoz-Gama Josep Carmona Vargas

More information

Process Mining and Network Analysis

Process Mining and Network Analysis Towards Comprehensive Support for Organizational Mining Minseok Song and Wil M.P. van der Aalst Eindhoven University of Technology P.O.Box 513, NL-5600 MB, Eindhoven, The Netherlands. {m.s.song, w.m.p.v.d.aalst}@tue.nl

More information

Trace Clustering in Process Mining

Trace Clustering in Process Mining Trace Clustering in Process Mining M. Song, C.W. Günther, and W.M.P. van der Aalst Eindhoven University of Technology P.O.Box 513, NL-5600 MB, Eindhoven, The Netherlands. {m.s.song,c.w.gunther,w.m.p.v.d.aalst}@tue.nl

More information

ProM 6 Tutorial. H.M.W. (Eric) Verbeek mailto:h.m.w.verbeek@tue.nl R. P. Jagadeesh Chandra Bose mailto:j.c.b.rantham.prabhakara@tue.

ProM 6 Tutorial. H.M.W. (Eric) Verbeek mailto:h.m.w.verbeek@tue.nl R. P. Jagadeesh Chandra Bose mailto:j.c.b.rantham.prabhakara@tue. ProM 6 Tutorial H.M.W. (Eric) Verbeek mailto:h.m.w.verbeek@tue.nl R. P. Jagadeesh Chandra Bose mailto:j.c.b.rantham.prabhakara@tue.nl August 2010 1 Introduction This document shows how to use ProM 6 to

More information

ProM 6 Exercises. J.C.A.M. (Joos) Buijs and J.J.C.L. (Jan) Vogelaar {j.c.a.m.buijs,j.j.c.l.vogelaar}@tue.nl. August 2010

ProM 6 Exercises. J.C.A.M. (Joos) Buijs and J.J.C.L. (Jan) Vogelaar {j.c.a.m.buijs,j.j.c.l.vogelaar}@tue.nl. August 2010 ProM 6 Exercises J.C.A.M. (Joos) Buijs and J.J.C.L. (Jan) Vogelaar {j.c.a.m.buijs,j.j.c.l.vogelaar}@tue.nl August 2010 The exercises provided in this section are meant to become more familiar with ProM

More information

Process Mining. ^J Springer. Discovery, Conformance and Enhancement of Business Processes. Wil M.R van der Aalst Q UNIVERS1TAT.

Process Mining. ^J Springer. Discovery, Conformance and Enhancement of Business Processes. Wil M.R van der Aalst Q UNIVERS1TAT. Wil M.R van der Aalst Process Mining Discovery, Conformance and Enhancement of Business Processes Q UNIVERS1TAT m LIECHTENSTEIN Bibliothek ^J Springer Contents 1 Introduction I 1.1 Data Explosion I 1.2

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Tutorial for proteome data analysis using the Perseus software platform

Tutorial for proteome data analysis using the Perseus software platform Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information

More information

BUsiness process mining, or process mining in a short

BUsiness process mining, or process mining in a short , July 2-4, 2014, London, U.K. A Process Mining Approach in Software Development and Testing Process: A Case Study Rabia Saylam, Ozgur Koray Sahingoz Abstract Process mining is a relatively new and emerging

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION 1.1 Research Motivation In today s modern digital environment with or without our notice we are leaving our digital footprints in various data repositories through our daily activities,

More information

Understanding Web personalization with Web Usage Mining and its Application: Recommender System

Understanding Web personalization with Web Usage Mining and its Application: Recommender System Understanding Web personalization with Web Usage Mining and its Application: Recommender System Manoj Swami 1, Prof. Manasi Kulkarni 2 1 M.Tech (Computer-NIMS), VJTI, Mumbai. 2 Department of Computer Technology,

More information

EMailAnalyzer: An E-Mail Mining Plug-in for the ProM Framework

EMailAnalyzer: An E-Mail Mining Plug-in for the ProM Framework EMailAnalyzer: An E-Mail Mining Plug-in for the ProM Framework Wil M.P. van der Aalst 1 and Andriy Nikolov 2 1 Department of Information Systems, Eindhoven University of Technology, P.O. Box 513, NL-5600

More information

Introduction. A. Bellaachia Page: 1

Introduction. A. Bellaachia Page: 1 Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.

More information

Discovering User Communities in Large Event Logs

Discovering User Communities in Large Event Logs Discovering User Communities in Large Event Logs Diogo R. Ferreira, Cláudia Alves IST Technical University of Lisbon, Portugal {diogo.ferreira,claudia.alves}@ist.utl.pt Abstract. The organizational perspective

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Investigating Clinical Care Pathways Correlated with Outcomes

Investigating Clinical Care Pathways Correlated with Outcomes Investigating Clinical Care Pathways Correlated with Outcomes Geetika T. Lakshmanan, Szabolcs Rozsnyai, Fei Wang IBM T. J. Watson Research Center, NY, USA August 2013 Outline Care Pathways Typical Challenges

More information

A Study of Web Log Analysis Using Clustering Techniques

A Study of Web Log Analysis Using Clustering Techniques A Study of Web Log Analysis Using Clustering Techniques Hemanshu Rana 1, Mayank Patel 2 Assistant Professor, Dept of CSE, M.G Institute of Technical Education, Gujarat India 1 Assistant Professor, Dept

More information

Contents. Dedication List of Figures List of Tables. Acknowledgments

Contents. Dedication List of Figures List of Tables. Acknowledgments Contents Dedication List of Figures List of Tables Foreword Preface Acknowledgments v xiii xvii xix xxi xxv Part I Concepts and Techniques 1. INTRODUCTION 3 1 The Quest for Knowledge 3 2 Problem Description

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Approaching Process Mining with Sequence Clustering: Experiments and Findings

Approaching Process Mining with Sequence Clustering: Experiments and Findings Approaching Process Mining with Sequence Clustering: Experiments and Findings Diogo Ferreira 1,3, Marielba Zacarias 2,3, Miguel Malheiros 3, Pedro Ferreira 3 1 IST Technical University of Lisbon, Taguspark,

More information

Business Process Modeling

Business Process Modeling Business Process Concepts Process Mining Kelly Rosa Braghetto Instituto de Matemática e Estatística Universidade de São Paulo kellyrb@ime.usp.br January 30, 2009 1 / 41 Business Process Concepts Process

More information

Life-Cycle Support for Staff Assignment Rules in Process-Aware Information Systems

Life-Cycle Support for Staff Assignment Rules in Process-Aware Information Systems Life-Cycle Support for Staff Assignment Rules in Process-Aware Information Systems Stefanie Rinderle-Ma 1,2 and Wil M.P. van der Aalst 2 1 Department Databases and Information Systems, Faculty of Engineering

More information

Model-Based Cluster Analysis for Web Users Sessions

Model-Based Cluster Analysis for Web Users Sessions Model-Based Cluster Analysis for Web Users Sessions George Pallis, Lefteris Angelis, and Athena Vakali Department of Informatics, Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece gpallis@ccf.auth.gr

More information

Feature. Applications of Business Process Analytics and Mining for Internal Control. World

Feature. Applications of Business Process Analytics and Mining for Internal Control. World Feature Filip Caron is a doctoral researcher in the Department of Decision Sciences and Information Management, Information Systems Group, at the Katholieke Universiteit Leuven (Flanders, Belgium). Jan

More information

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

Process Mining Tools: A Comparative Analysis

Process Mining Tools: A Comparative Analysis EINDHOVEN UNIVERSITY OF TECHNOLOGY Department of Mathematics and Computer Science Process Mining Tools: A Comparative Analysis Irina-Maria Ailenei in partial fulfillment of the requirements for the degree

More information

Workflow Simulation for Operational Decision Support

Workflow Simulation for Operational Decision Support Workflow Simulation for Operational Decision Support A. Rozinat 1, M. T. Wynn 2, W. M. P. van der Aalst 1,2, A. H. M. ter Hofstede 2, and C. J. Fidge 2 1 Information Systems Group, Eindhoven University

More information

Summary and Outlook. Business Process Intelligence Course Lecture 8. prof.dr.ir. Wil van der Aalst. www.processmining.org

Summary and Outlook. Business Process Intelligence Course Lecture 8. prof.dr.ir. Wil van der Aalst. www.processmining.org Business Process Intelligence Course Lecture 8 Summary and Outlook prof.dr.ir. Wil van der Aalst www.processmining.org Overview Chapter 1 Introduction Part I: Preliminaries Chapter 2 Process Modeling and

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

Evolutionary Detection of Rules for Text Categorization. Application to Spam Filtering

Evolutionary Detection of Rules for Text Categorization. Application to Spam Filtering Advances in Intelligent Systems and Technologies Proceedings ECIT2004 - Third European Conference on Intelligent Systems and Technologies Iasi, Romania, July 21-23, 2004 Evolutionary Detection of Rules

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Analytics on Big Data

Analytics on Big Data Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

More information

USING PROCESS MINING FOR ITIL ASSESSMENT: A CASE STUDY WITH INCIDENT MANAGEMENT

USING PROCESS MINING FOR ITIL ASSESSMENT: A CASE STUDY WITH INCIDENT MANAGEMENT USING PROCESS MINING FOR ITIL ASSESSMENT: A CASE STUDY WITH INCIDENT MANAGEMENT Diogo R. Ferreira 1,2, Miguel Mira da Silva 1,3 1 IST Technical University of Lisbon, Portugal 2 Organizational Engineering

More information

Process Mining: Using CPN Tools to Create Test Logs for Mining Algorithms

Process Mining: Using CPN Tools to Create Test Logs for Mining Algorithms Mining: Using CPN Tools to Create Test Logs for Mining Algorithms Ana Karla Alves de Medeiros and Christian W. Günther 2 Outline Introduction Mining The need for synthetic logs MXML format Using the Logging

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

EDIminer: A Toolset for Process Mining from EDI Messages

EDIminer: A Toolset for Process Mining from EDI Messages EDIminer: A Toolset for Process Mining from EDI Messages Robert Engel 1, R. P. Jagadeesh Chandra Bose 2, Christian Pichler 1, Marco Zapletal 1, and Hannes Werthner 1 1 Vienna University of Technology,

More information

Delivering Business Intelligence With Microsoft SQL Server 2005 or 2008 HDT922 Five Days

Delivering Business Intelligence With Microsoft SQL Server 2005 or 2008 HDT922 Five Days or 2008 Five Days Prerequisites Students should have experience with any relational database management system as well as experience with data warehouses and star schemas. It would be helpful if students

More information

Practical Applications of DATA MINING. Sang C Suh Texas A&M University Commerce JONES & BARTLETT LEARNING

Practical Applications of DATA MINING. Sang C Suh Texas A&M University Commerce JONES & BARTLETT LEARNING Practical Applications of DATA MINING Sang C Suh Texas A&M University Commerce r 3 JONES & BARTLETT LEARNING Contents Preface xi Foreword by Murat M.Tanik xvii Foreword by John Kocur xix Chapter 1 Introduction

More information

Decision Mining in Business Processes

Decision Mining in Business Processes Decision Mining in Business Processes A. Rozinat and W.M.P. van der Aalst Department of Technology Management, Eindhoven University of Technology P.O. Box 513, NL-5600 MB, Eindhoven, The Netherlands {a.rozinat,w.m.p.v.d.aalst}@tm.tue.nl

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

Process Modelling from Insurance Event Log

Process Modelling from Insurance Event Log Process Modelling from Insurance Event Log P.V. Kumaraguru Research scholar, Dr.M.G.R Educational and Research Institute University Chennai- 600 095 India Dr. S.P. Rajagopalan Professor Emeritus, Dr. M.G.R

More information

Discovering process models from empirical data

Discovering process models from empirical data Discovering process models from empirical data Laura Măruşter (l.maruster@tm.tue.nl), Ton Weijters (a.j.m.m.weijters@tm.tue.nl) and Wil van der Aalst (w.m.p.aalst@tm.tue.nl) Eindhoven University of Technology,

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

Final Project Report

Final Project Report CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

More information

Using Data Analytics to Detect Fraud

Using Data Analytics to Detect Fraud Using Data Analytics to Detect Fraud Fundamental Data Analysis Techniques 2016 Association of Certified Fraud Examiners, Inc. Discussion Question For each data analysis technique discussed in this section,

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

Using Trace Clustering for Configurable Process Discovery Explained by Event Log Data

Using Trace Clustering for Configurable Process Discovery Explained by Event Log Data Master of Business Information Systems, Department of Mathematics and Computer Science Using Trace Clustering for Configurable Process Discovery Explained by Event Log Data Master Thesis Author: ing. Y.P.J.M.

More information

K-means Clustering Technique on Search Engine Dataset using Data Mining Tool

K-means Clustering Technique on Search Engine Dataset using Data Mining Tool International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 6 (2013), pp. 505-510 International Research Publications House http://www. irphouse.com /ijict.htm K-means

More information

D A T A M I N I N G C L A S S I F I C A T I O N

D A T A M I N I N G C L A S S I F I C A T I O N D A T A M I N I N G C L A S S I F I C A T I O N FABRICIO VOZNIKA LEO NARDO VIA NA INTRODUCTION Nowadays there is huge amount of data being collected and stored in databases everywhere across the globe.

More information

Clustering Connectionist and Statistical Language Processing

Clustering Connectionist and Statistical Language Processing Clustering Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised

More information

Business Process Discovery

Business Process Discovery Sandeep Jadhav Introduction Well defined, organized, implemented, and managed Business Processes are very critical to the success of any organization that wants to operate efficiently. Business Process

More information

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy

More information

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria

More information

Dotted Chart and Control-Flow Analysis for a Loan Application Process

Dotted Chart and Control-Flow Analysis for a Loan Application Process Dotted Chart and Control-Flow Analysis for a Loan Application Process Thomas Molka 1,2, Wasif Gilani 1 and Xiao-Jun Zeng 2 Business Intelligence Practice, SAP Research, Belfast, UK The University of Manchester,

More information

Web Usage Mining: Identification of Trends Followed by the user through Neural Network

Web Usage Mining: Identification of Trends Followed by the user through Neural Network International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 617-624 International Research Publications House http://www. irphouse.com /ijict.htm Web

More information

Mining and Tracking Evolving Web User Trends from Large Web Server Logs

Mining and Tracking Evolving Web User Trends from Large Web Server Logs Mining and Tracking Evolving Web User Trends from Large Web Server Logs Basheer Hawwash and Olfa Nasraoui* Knowledge Discovery and Web Mining Laboratory, Department of Computer Engineering and Computer

More information

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

More information

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP Data Warehousing and End-User Access Tools OLAP and Data Mining Accompanying growth in data warehouses is increasing demands for more powerful access tools providing advanced analytical capabilities. Key

More information

Outlines. Business Intelligence. What Is Business Intelligence? Data mining life cycle

Outlines. Business Intelligence. What Is Business Intelligence? Data mining life cycle Outlines Business Intelligence Lecture 15 Why integrate BI into your smart client application? Integrating Mining into your application Integrating into your application What Is Business Intelligence?

More information

Chapter 3: Data Mining Driven Learning Apprentice System for Medical Billing Compliance

Chapter 3: Data Mining Driven Learning Apprentice System for Medical Billing Compliance Chapter 3: Data Mining Driven Learning Apprentice System for Medical Billing Compliance 3.1 Introduction This research has been conducted at back office of a medical billing company situated in a custom

More information

Intelligent Log Analyzer. André Restivo <andre.restivo@portugalmail.pt>

Intelligent Log Analyzer. André Restivo <andre.restivo@portugalmail.pt> Intelligent Log Analyzer André Restivo 9th January 2003 Abstract Server Administrators often have to analyze server logs to find if something is wrong with their machines.

More information

DATA WAREHOUSE E KNOWLEDGE DISCOVERY

DATA WAREHOUSE E KNOWLEDGE DISCOVERY DATA WAREHOUSE E KNOWLEDGE DISCOVERY Prof. Fabio A. Schreiber Dipartimento di Elettronica e Informazione Politecnico di Milano DATA WAREHOUSE (DW) A TECHNIQUE FOR CORRECTLY ASSEMBLING AND MANAGING DATA

More information

IT services for analyses of various data samples

IT services for analyses of various data samples IT services for analyses of various data samples Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky Technical

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

Technical Report. The KNIME Text Processing Feature:

Technical Report. The KNIME Text Processing Feature: Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold Killian.Thiel@uni-konstanz.de Michael.Berthold@uni-konstanz.de Copyright 2012 by KNIME.com AG

More information

Big Data Text Mining and Visualization. Anton Heijs

Big Data Text Mining and Visualization. Anton Heijs Copyright 2007 by Treparel Information Solutions BV. This report nor any part of it may be copied, circulated, quoted without prior written approval from Treparel7 Treparel Information Solutions BV Delftechpark

More information

Human-Readable BPMN Diagrams

Human-Readable BPMN Diagrams Human-Readable BPMN Diagrams Refactoring OMG s E-Mail Voting Example Thomas Allweyer V 1.1 1 The E-Mail Voting Process Model The Object Management Group (OMG) has published a useful non-normative document

More information

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer

More information

Data Catalogs for Hadoop Achieving Shared Knowledge and Re-usable Data Prep. Neil Raden Hired Brains Research, LLC

Data Catalogs for Hadoop Achieving Shared Knowledge and Re-usable Data Prep. Neil Raden Hired Brains Research, LLC Data Catalogs for Hadoop Achieving Shared Knowledge and Re-usable Data Prep Neil Raden Hired Brains Research, LLC Traditionally, the job of gathering and integrating data for analytics fell on data warehouses.

More information

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya Chapter 6 Basics of Data Integration Fundamentals of Business Analytics Learning Objectives and Learning Outcomes Learning Objectives 1. Concepts of data integration 2. Needs and advantages of using data

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19 PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

A SURVEY ON GENETIC ALGORITHM FOR INTRUSION DETECTION SYSTEM

A SURVEY ON GENETIC ALGORITHM FOR INTRUSION DETECTION SYSTEM A SURVEY ON GENETIC ALGORITHM FOR INTRUSION DETECTION SYSTEM MS. DIMPI K PATEL Department of Computer Science and Engineering, Hasmukh Goswami college of Engineering, Ahmedabad, Gujarat ABSTRACT The Internet

More information

Building Data Cubes and Mining Them. Jelena Jovanovic Email: jeljov@fon.bg.ac.yu

Building Data Cubes and Mining Them. Jelena Jovanovic Email: jeljov@fon.bg.ac.yu Building Data Cubes and Mining Them Jelena Jovanovic Email: jeljov@fon.bg.ac.yu KDD Process KDD is an overall process of discovering useful knowledge from data. Data mining is a particular step in the

More information

The basic data mining algorithms introduced may be enhanced in a number of ways.

The basic data mining algorithms introduced may be enhanced in a number of ways. DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Nintex Workflow 2010 Help Last updated: Friday, 26 November 2010

Nintex Workflow 2010 Help Last updated: Friday, 26 November 2010 Nintex Workflow 2010 Help Last updated: Friday, 26 November 2010 1 Workflow Interaction with SharePoint 1.1 About LazyApproval 1.2 Approving, Rejecting and Reviewing Items 1.3 Configuring the Graph Viewer

More information

IMAN: DATA INTEGRATION MADE SIMPLE YOUR SOLUTION FOR SEAMLESS, AGILE DATA INTEGRATION IMAN TECHNICAL SHEET

IMAN: DATA INTEGRATION MADE SIMPLE YOUR SOLUTION FOR SEAMLESS, AGILE DATA INTEGRATION IMAN TECHNICAL SHEET IMAN: DATA INTEGRATION MADE SIMPLE YOUR SOLUTION FOR SEAMLESS, AGILE DATA INTEGRATION IMAN TECHNICAL SHEET IMAN BRIEF Application integration can be a struggle. Expertise in the form of development, technical

More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

Modeling Guidelines Manual

Modeling Guidelines Manual Modeling Guidelines Manual [Insert company name here] July 2014 Author: John Doe john.doe@johnydoe.com Page 1 of 22 Table of Contents 1. Introduction... 3 2. Business Process Management (BPM)... 4 2.1.

More information

Discovering Interacting Artifacts from ERP Systems (Extended Version)

Discovering Interacting Artifacts from ERP Systems (Extended Version) Discovering Interacting Artifacts from ERP Systems (Extended Version) Xixi Lu 1, Marijn Nagelkerke 2, Dennis van de Wiel 2, and Dirk Fahland 1 1 Eindhoven University of Technology, The Netherlands 2 KPMG

More information

Data Mining Solutions for the Business Environment

Data Mining Solutions for the Business Environment Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania ruxandra_stefania.petre@yahoo.com Over

More information

Clustering & Visualization

Clustering & Visualization Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Analytics for Business Intelligence and Decision Support Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing

More information

Integrity 10. Curriculum Guide

Integrity 10. Curriculum Guide Integrity 10 Curriculum Guide Live Classroom Curriculum Guide Integrity 10 Workflows and Documents Administration Training Integrity 10 SCM Administration Training Integrity 10 SCM Basic User Training

More information

HYBRID INTELLIGENT SUITE FOR DECISION SUPPORT IN SUGARCANE HARVEST

HYBRID INTELLIGENT SUITE FOR DECISION SUPPORT IN SUGARCANE HARVEST HYBRID INTELLIGENT SUITE FOR DECISION SUPPORT IN SUGARCANE HARVEST FLÁVIO ROSENDO DA SILVA OLIVEIRA 1 DIOGO FERREIRA PACHECO 2 FERNANDO BUARQUE DE LIMA NETO 3 ABSTRACT: This paper presents a hybrid approach

More information

Pentaho Data Mining Last Modified on January 22, 2007

Pentaho Data Mining Last Modified on January 22, 2007 Pentaho Data Mining Copyright 2007 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at www.pentaho.org

More information

Mining a Change-Based Software Repository

Mining a Change-Based Software Repository Mining a Change-Based Software Repository Romain Robbes Faculty of Informatics University of Lugano, Switzerland 1 Introduction The nature of information found in software repositories determines what

More information