Methods for the specification and verification of business processes MPB (6 cfu, 295AA)

Similar documents

Chapter 5 Process Discovery: An Introduction

Data Science. Research Theme: Process Mining

Using Process Mining to Bridge the Gap between BI and BPM

Summary and Outlook. Business Process Intelligence Course Lecture 8. prof.dr.ir. Wil van der Aalst.

Process Mining. ^J Springer. Discovery, Conformance and Enhancement of Business Processes. Wil M.R van der Aalst Q UNIVERS1TAT.

Process Mining: Making Knowledge Discovery Process Centric

Master Thesis September 2010 ALGORITHMS FOR PROCESS CONFORMANCE AND PROCESS REFINEMENT

Implementing Heuristic Miner for Different Types of Event Logs

Process Modelling from Insurance Event Log

Process Mining and Visual Analytics: Breathing Life into Business Process Models

Lluis Belanche + Alfredo Vellido. Intelligent Data Analysis and Data Mining. Data Analysis and Knowledge Discovery

Process Mining Data Science in Action

Using Trace Clustering for Configurable Process Discovery Explained by Event Log Data

ProM 6 Exercises. J.C.A.M. (Joos) Buijs and J.J.C.L. (Jan) Vogelaar {j.c.a.m.buijs,j.j.c.l.vogelaar}@tue.nl. August 2010

Process Mining Using BPMN: Relating Event Logs and Process Models

Process Mining. Data science in action

Business Intelligence and Process Modelling

CHAPTER 1 INTRODUCTION

Categorical Data Visualization and Clustering Using Subjective Factors

Business Process Modeling

Mining Process Models with Non-Free-Choice Constructs

ProM Framework Tutorial

Analysis of Service Level Agreements using Process Mining techniques

Business Process Quality Metrics: Log-based Complexity of Workflow Patterns

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Compact Representations and Approximations for Compuation in Games

BIS 3106: Business Process Management. Lecture Two: Modelling the Control-flow Perspective

A Capability Model for Business Analytics: Part 2 Assessing Analytic Capabilities

Process Mining and Fraud Detection

Methods for the specification and verification of business processes MPB (6 cfu, 295AA)

Chapter 4 Getting the Data

BUsiness process mining, or process mining in a short

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT?

Mining Configurable Process Models from Collections of Event Logs

Model Discovery from Motor Claim Process Using Process Mining Technique

Enhanced data mining analysis in higher educational system using rough set theory

Runtime Verification - Monitor-oriented Programming - Monitor-based Runtime Reflection

Process Mining The influence of big data (and the internet of things) on the supply chain

EDIminer: A Toolset for Process Mining from EDI Messages

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

Notes on Complexity Theory Last updated: August, Lecture 1

Mathematics for Computer Science/Software Engineering. Notes for the course MSM1F3 Dr. R. A. Wilson

Introduction. A. Bellaachia Page: 1

Protein Protein Interaction Networks

Towards Cross-Organizational Process Mining in Collections of Process Models and their Executions

Continued Fractions and the Euclidean Algorithm

Pragmatic guidelines for Business Process Modeling

A Model-driven Approach to Predictive Non Functional Analysis of Component-based Systems

Decision Mining in Business Processes

Load Balancing and Switch Scheduling

Information Management course

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

OPRE 6201 : 2. Simplex Method

9.2 Summation Notation

Row Ideals and Fibers of Morphisms

Dotted Chart and Control-Flow Analysis for a Loan Application Process

2) Write in detail the issues in the design of code generator.

The Goldberg Rao Algorithm for the Maximum Flow Problem

Measuring the Performance of an Agent

Globally Optimal Crowdsourcing Quality Management

Declaration of Conformity 21 CFR Part 11 SIMATIC WinCC flexible 2007

not possible or was possible at a high cost for collecting the data.

Solution to Homework 2

Investigating Clinical Care Pathways Correlated with Outcomes

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

The Graphical Method: An Example

Chapter 6. The stacking ensemble approach

DATA ANALYSIS II. Matrix Algorithms

RSA VIA LIFECYCLE AND GOVERNENCE: ROLE MANAGEMENT BEST PRACTICES

Regular Languages and Finite Automata

Social Media Mining. Data Mining Essentials

Data Mining Algorithms Part 1. Dejan Sarka

11 Ideals Revisiting Z

Kirsten Sinclair SyntheSys Systems Engineers

Text Analytics. A business guide

Life-Cycle Support for Staff Assignment Rules in Process-Aware Information Systems

Module 10. Coding and Testing. Version 2 CSE IIT, Kharagpur

Data Mining: A Preprocessing Engine

4 Testing General and Automated Controls

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

Modeling Guidelines Manual

Just the Factors, Ma am

A Business Process Services Portal

Anomaly Detection in Predictive Maintenance

Nonlinear Optimization: Algorithms 3: Interior-point methods

IEOR 6711: Stochastic Models I Fall 2012, Professor Whitt, Tuesday, September 11 Normal Approximations and the Central Limit Theorem

How To Check For Differences In The One Way Anova

Information, Entropy, and Coding

Transcription:

Methods for the specification and verification of business processes MPB (6 cfu, 295AA) Roberto Bruni http://www.di.unipi.it/~bruni 24 - Process Mining 1

Object We overview the key principles of process mining 2

Process Mining Process mining is a relative young research discipline that sits between machine learning and data mining on the one hand and process modeling and analysis on the other hand. The idea of process mining is to discover, monitor and improve real processes (i.e., not assumed processes) by extracting knowledge from event logs readily available in today s systems. 3

Processes, Cases, Events, Attributes A process consists of cases. A case consists of events such that each event relates to precisely one case. Events within a case are ordered. Events can have attributes. Examples of typical attribute names are activity, time, costs, and resource. 4

Event Logs Let us assume that it is possible to sequentially record events such that each event: refers to an activity (i.e., a well-defined step in the process) and is related to a particular case (i.e., a process instance). 5

Event Log Example 1.4 Analyzing an Example Log 13 Table 1.1 A fragment of some event log: each line corresponds to an event Case id Event id Properties Timestamp Activity Resource Cost... 1 35654423 30-12-2010:11.02 Register request Pete 50... 35654424 31-12-2010:10.06 Examine thoroughly Sue 400... 35654425 05-01-2011:15.12 Check ticket Mike 100... 35654426 06-01-2011:11.18 Decide Sara 200... 35654427 07-01-2011:14.24 Reject request Pete 200... 2 35654483 30-12-2010:11.32 Register request Mike 50... 35654485 30-12-2010:12.12 Check ticket Mike 100... 35654487 30-12-2010:14.16 Examine casually Pete 400... 35654488 05-01-2011:11.22 Decide Sara 200... 35654489 08-01-2011:12.05 Pay compensation Ellen 200... 3 35654521 30-12-2010:14.32 Register request Pete 50... 35654522 30-12-2010:15.06 Examine casually Mike 400... 6

Mining Scheme 1.3 Process Mining 9 Fig. 1.4 Positioning of the three main types of process mining: discovery, conformance, and engiovedì 13 dicembre 2012

Discovery A discovery technique takes an event log and produces a model without using any a-priori information. If the event log contains information about resources, one can also discover resource-related models, e.g., a social network showing how people work together in an organization. 8

Conformance An existing process model is compared with an event log of the same process. Conformance checking can be used to check if reality, as recorded in the log, conforms to the model and vice versa. Conformance checking may be used to detect, locate and explain deviations, and to measure the severity of these deviations. 9

Enhancement The idea is to extend/improve an existing process model using information about the actual process recorded in some event log. Whereas conformance checking measures the alignment between model and reality, this third type of process mining aims at changing or extending the a-priori model. 10

Enhancement: Repair One type of enhancement is repair, i.e., modifying the model to better reflect reality. For example, if two activities are modeled sequentially but in reality can happen in any order, then the model may be corrected to reflect this. 11

Four Perspectives 12

Control-Flow Perspective The control-flow perspective focuses on the control-flow, i.e., the ordering of activities. The goal of mining this perspective is to find a good characterization of all possible paths, e.g., expressed in terms of a Petri net or some other notation (e.g., EPC, BPMN, and UML AD). We shall focus on this perspective 13

Organizational Perspective The organizational perspective focuses on information about resources hidden in the log, i.e., which actors (e.g., people, systems, roles, and departments) are involved and how are they related. The goal is to either structure the organization by classifying people in terms of roles and organizational units or to show the social network. 14

Case Perspective The case perspective focuses on properties of cases. Obviously, a case can be characterized by its path in the process or by the originators working on it. However, cases can also be characterized by the values of the corresponding data elements. For example, if a case represents a replenishment order, it may be interesting to know the supplier or the number of products ordered. 15

Time Perspective The time perspective is concerned with the timing and frequency of events (performance checking). When events bear timestamps it is possible to discover bottlenecks, measure service levels, monitor the utilization of resources, and predict the remaining processing time of running cases. 16

Play-in, Play-out, Replay 17

Play-in 1.5 Play-in, Play-out, and Replay 19 18

1.5 Play-in, Play-out, and Replay 19 Play-out 19

Replay Fig. 1.8 Three ways of relating event logs (or other sources of information containing example behavior) and process models: Play-in, Play-out, and Replay than 56 cigarettes tend to die young ) and association rules ( people that buy diapers also buy beer ). Unfortunately, it is not possible to use conventional data mining techniques to Play-in process models. 20 Only recently, process mining techniques have become readily available to discover process models based on event

An Example 21

Event Log Example 1.4 Analyzing an Example Log 13 Table 1.1 A fragment of some event log: each line corresponds to an event Case id Event id Properties Timestamp Activity Resource Cost... 1 35654423 30-12-2010:11.02 Register request Pete 50... 35654424 31-12-2010:10.06 Examine thoroughly Sue 400... 35654425 05-01-2011:15.12 Check ticket Mike 100... 35654426 06-01-2011:11.18 Decide Sara 200... 35654427 07-01-2011:14.24 Reject request Pete 200... 2 35654483 30-12-2010:11.32 Register request Mike 50... 35654485 30-12-2010:12.12 Check ticket Mike 100... 35654487 30-12-2010:14.16 Examine casually Pete 400... 35654488 05-01-2011:11.22 Decide Sara 200... 35654489 08-01-2011:12.05 Pay compensation Ellen 200... 3 35654521 30-12-2010:14.32 Register request Pete 50... 35654522 30-12-2010:15.06 Examine casually Mike 400... 22

Table 1.1 A fragment of some event log: each line corresponds to an event Case id Event id Properties Timestamp Activity Resource Cost... 1 35654423 30-12-2010:11.02 Register request Pete 50... Event Log Example Table 1.1 (Continued) 35654487 30-12-2010:14.16 Examine casually Pete 400... 35654533 15-01-2011:10.45 Pay compensation Ellen 200... 4 35654641 06-01-2011:15.02 Register request Pete 50... 35654643 07-01-2011:12.06 Check ticket Mike 100... 35654644 08-01-2011:14.43 Examine thoroughly Sean 400... 35654645 09-01-2011:12.02 Decide Sara 200... 35654647 12-01-2011:15.44 Reject request Ellen 200... 5 35654711 06-01-2011:09.02 Register request Ellen 50... 35654712 07-01-2011:10.16 Examine casually Mike 400... 35654714 08-01-2011:11.22 Check ticket Pete 100... 35654715 10-01-2011:13.28 Decide Sara 200... 35654716 11-01-2011:16.18 Reinitiate request Sara 200... 35654718 14-01-2011:14.33 Check ticket Ellen 100... 35654719 16-01-2011:15.50 Examine casually Mike 400... 35654720 19-01-2011:11.18 Decide Sara 200... 35654721 20-01-2011:12.48 Reinitiate request Sara 200... 35654722 21-01-2011:09.06 Examine casually Sue 400... 35654724 21-01-2011:11.34 Check ticket Pete 100... 35654725 23-01-2011:13.12 Decide Sara 200... 35654726 24-01-2011:14.56 Reject request Mike 200... 23 Table 1.1 (Continued) Case id Event id Properties 6 35654871 06-01-2011:15.02 R 35654873 06-01-2011:16.06 E Timestamp Activity Resource Cost... 35654874 07-01-2011:16.22 C 14 1 Introduction 35654875 07-01-2011:16.52 D 6 35654871 06-01-2011:15.02 Register request Mike 50... 35654424 31-12-2010:10.06 Examine thoroughly Sue 400... 35654873 06-01-2011:16.06 Examine casually Ellen 400... 35654425 05-01-2011:15.12 Check ticket Mike 100... 35654874 07-01-2011:16.22 Check ticket Mike 100... 35654426 06-01-2011:11.18 Decide Sara 200... 35654875 07-01-2011:16.52 Decide Sara 200... 35654427 07-01-2011:14.24 Reject request Pete 200... Case id Event id Properties 35654877 16-01-2011:11.47 Pay compensation Mike 200... 2 35654483 30-12-2010:11.32 Register request Mike 50........................ 35654485 30-12-2010:12.12 Check ticket Mike 100 Timestamp... Activity Resource Cost... 35654877 16-01-2011:11.47 P.......... 35654488 05-01-2011:11.22 Decide Sara 200... 6 35654871 06-01-2011:15.02 Table 1.2 A more compact 35654489 08-01-2011:12.05 Pay compensation Ellen 200... Table Register representation of log shown Case 1.2request id A more compact MikeTrace 50... 3 35654521 30-12-2010:14.32 Register request Pete 50... in Table 1.1: a = register representation of log shown Case id 35654873 06-01-2011:16.06 Examine casually Ellen 400... request, b = examine 1 a, b, d, e, h 35654522 30-12-2010:15.06 Examine casually Mike 400... thoroughly, c = examine in Table 1.1: a = register 35654524 30-12-2010:16.34 Check ticket 35654874 Ellen 100 07-01-2011:16.22... Check2 ticket Mike a, d, c, e, g 100... casually, d = check ticket, request, 35654525 06-01-2011:09.18 Decide Sara 200... 3 b = examine 1 a, c, d, e, f, b, d, e, g 35654875 07-01-2011:16.52 e = decide, f = reinitiate 35654526 06-01-2011:12.18 Reinitiate request Sara 200... request, g = pay thoroughly, Decide4 c = examine Sara 200... a, d, b, e, 2h 35654527 06-01-2011:13.06 Examine thoroughly 35654877 Sean 400 16-01-2011:11.47... compensation, and h = casually, Pay rejectcompensation 5 d = check ticket, Mike a, c, d, e, f, d, 200 c, e, f, c, d, e,... h request 35654530 08-01-2011:11.43 Check ticket Pete 100... 6 a, c, d, e, 3g e = decide, f = reinitiate 35654531 09-01-2011:09.55 Decide Sara 200........................ request, g = pay 4...... compensation, and h = reject 5 request Table 1.2 A more compact representation of log shown in Table 1.1: a = register request, b = examine thoroughly, c = examine casually, d = check ticket, e = decide, f = reinitiate request, g = pay compensation, and h = reject request Case id Trace 1 a, b, d, e, h 2 a, d, c, e, g Fig. 1.5 The process model discovered by the α-algorithm [103] based on the set of traces { a, b, d, e, h, a, d, c, e, g, a, c, d, e, f, b, d, e, g, a, d, b, e, h, a, c, d, e, f, d, c, e, f, c, d, e, h, a, c, d, e, g } After executing h, the case ends in the desired final marking with just a token in place end. Similarly, it can be checked that the other five traces shown in Table 1.2 6... 3 a, c, d, e, f, b, d, e, g 4 a, d, b, e, h 5 a, c, d, e, f, d, c, e, f, c, d, e, h 6 a, c, d, e, g......

e = decide, f = reinitiate request, g = pay compensation, and h = reject request 4 a, d, b, e, h 5 a, c, d, e, f, d, c, e, f, c, d, e, h 14 6 a, c, d, e, g 1 Introduction Discovery...... Example Table 1.1 (Continued) Case id Event id Properties Timestamp Activity Resource Cost... 6 35654871 06-01-2011:15.02 Register request Mike 50... 35654873 06-01-2011:16.06 Examine casually Ellen 400... 35654874 07-01-2011:16.22 Check ticket Mike 100... 35654875 07-01-2011:16.52 Decide Sara 200... 35654877 16-01-2011:11.47 Pay compensation Mike 200........................ Table 1.2 A more compact Fig. 1.5 The process model discovered representation by theofα-algorithm log shown [103] Case based id on the set of traces Trace { a, b, d, e, h, a, d, c, e, g, a, inc, Table d, e, f, 1.1: b, d, a e, = g, register a, d, b, e, h, a, c, d, e, f, d, c, e, f, c, d, e, h, a, c, d, e, g } request, b = examine 1 a, b, d, e, h thoroughly, c = examine 2 a, d, c, e, g casually, d = check ticket, After executing h, the case ends in the desired final marking 3 with just a token in a, c, d, e, f, b, d, e, g e = decide, f = reinitiate place end. Similarly, it can request, be checked g = that pay the other five traces 4 shown in Table 1.2 a, d, b, e, h are also possible in the model compensation, and that alland of these h = reject traces result 5 in the marking with a, c, d, e, f, d, c, e, f, c, d, e, h just a token in place end. request 6 a, c, d, e, g...... 24

e = decide, f = reinitiate request, g = pay compensation, and h = reject request 4 a, d, b, e, h 5 a, c, d, e, f, d, c, e, f, c, d, e, h 14 6 a, c, d, e, g 1 Introduction Discovery...... Example Table 1.1 (Continued) Case id Event id Properties Timestamp Activity Resource Cost... 6 35654871 06-01-2011:15.02 Register request Mike 50... 35654873 06-01-2011:16.06 Examine casually Ellen 400... 35654874 07-01-2011:16.22 Check ticket Mike 100... 35654875 07-01-2011:16.52 Decide Sara 200... 35654877 16-01-2011:11.47 Pay compensation Mike 200........................ Table 1.2 A more compact All cases start Fig. 1.5 The process model discovered representation with by theofα-algorithm log a shown [103] Case based id on the set of traces Trace { a, b, d, e, h, a, d, c, e, g, a, inc, Table d, e, f, 1.1: b, d, a e, = g, register a, d, b, e, h, a, c, d, e, f, d, c, e, f, c, d, e, and h, a, c, end d, e, g } with either g or h. request, b = examine 1 a, b, d, e, h thoroughly, c = examine 2 a, d, c, e, g casually, d = check ticket, After executing h, the case ends in the desired final marking 3 with just a token in a, c, d, e, f, b, d, e, g e = decide, f = reinitiate place end. Similarly, it can request, be checked g = that pay the other five traces 4 shown in Table 1.2 a, d, b, e, h are also one possible of inthe model examination compensation, and that alland of these h = reject traces result 5 in the marking with a, c, d, e, f, d, c, e, f, c, d, e, h just a token in place end. request activities (b or c). 6 a, c, d, e, g...... 25 Every e is preceded by d and

e = decide, f = reinitiate request, g = pay compensation, and h = reject request 4 a, d, b, e, h 5 a, c, d, e, f, d, c, e, f, c, d, e, h 14 6 a, c, d, e, g 1 Introduction Discovery...... Example Table 1.1 (Continued) Case id Event id Properties Timestamp Activity Resource Cost... 6 35654871 06-01-2011:15.02 Register request Mike 50... 35654873 06-01-2011:16.06 Examine casually Ellen 400... 35654874 07-01-2011:16.22 Check ticket Mike 100... 35654875 07-01-2011:16.52 Decide Sara 200... 35654877 16-01-2011:11.47 Pay compensation Mike 200........................ Table 1.2 A more compact Moreover, e Fig. 1.5 The process model discovered representation followed by theofα-algorithm log shown [103] Case based id on the set of traces Trace { a, b, d, e, h, a, d, c, e, g, a, inc, Table d, e, f, 1.1: b, d, a e, = g, register a, d, b, e, h, a, c, d, e, f, d, c, e, f, c, d, e, h, a, c, d, e, by g } f, g, or h. request, b = examine 1 a, b, d, e, h thoroughly, c = examine 2 a, d, c, e, g casually, d = check ticket, After executing h, the case ends in the desired final marking 3 with just a token in a, c, d, e, f, b, d, e, g The repeated e = execution decide, f = reinitiate place end. Similarly, it can request, be checked g = that pay the other five traces 4 shown in Table 1.2 a, d, b, e, h are of also b possible or c, ind, the model and compensation, and e that suggests alland of these h = reject traces result 5 in the marking with a, c, d, e, f, d, c, e, f, c, d, e, h just a token in place end. request the presence of a loop. 6 a, c, d, e, g...... 26

e = decide, f = reinitiate request, g = pay compensation, and h = reject request 4 a, d, b, e, h 5 a, c, d, e, f, d, c, e, f, c, d, e, h 14 6 a, c, d, e, g 1 Introduction Discovery...... Example Table 1.1 (Continued) Case id Event id Properties Timestamp Activity Resource Cost... 6 35654871 06-01-2011:15.02 Register request Mike 50... 35654873 06-01-2011:16.06 Examine casually Ellen 400... 35654874 07-01-2011:16.22 Check ticket Mike 100... 35654875 07-01-2011:16.52 Decide Sara 200... 35654877 16-01-2011:11.47 Pay compensation Mike 200........................ Table 1.2 A more compact These characteristics Fig. 1.5 process model discovered representation by theofα-algorithm logare shown [103] Case based id on the set of traces Trace { a, b, d, e, h, a, d, c, e, g, a, inc, Table d, e, f, 1.1: b, d, a e, = g, register a, d, b, e, h, a, c, d, e, f, d, c, e, f, c, d, e, h, a, c, d, e, g } request, b = examine 1 a, b, d, e, h thoroughly, c = examine 2 a, d, c, e, g casually, d = check ticket, After executing h, the case ends in the desired final marking 3 with just a token in a, c, d, e, f, b, d, e, g e = decide, f = reinitiate place end. Similarly, it can request, be checked g = that pay the other five traces 4 shown in Table 1.2 a, d, b, e, h are also possible in the model compensation, and that alland of these h = reject traces result 5 in the marking with a, c, d, e, f, d, c, e, f, c, d, e, h just a token in place end. request be a good balance between 6 a, c, d, e, g...... 27 dequately captured by the net. When comparing the event log and the model, there seems to overfitting and underfitting.

Overfitting and Underfitting One of the challenges of process mining is to balance between overfitting (the model is too specific and only allows for the accidental behavior observed) and underfitting (the model is too general and allows for behavior unrelated to the behavior observed). 28

Discussion The Petri net shown also allows for traces not in the log. For example, other possible traces are <a, d, c, e, f, b, d, e, g> and <a, c, d, e, f, c, d, e, f, c, d, e, f, c, d, e, f, b, d, e, g> This is a desired phenomenon as the goal is not to represent just the particular set of example traces in the event log. Process mining algorithms need to generalize the behavior contained in the log to show the most likely underlying model that is not invalidated by the next set of observations 29

Mining Other Models We used Petri nets to represent the discovered process models, because Petri nets are a succinct way of representing processes and have unambiguous but intuitive semantics. However, some mining techniques are independent of the 1.2 Limitations of Modeling 5 desired representation. Fig. 1.2 The same process modeled in terms of BPMN 30

Another Discovery 14 1 Introduction Table 1.1 (Continued) Example Case id Event id Properties 1.4 Analyzing an Example Log 15 Timestamp Activity Resource Cost... 6 35654871 06-01-2011:15.02 Register request Mike 50... 35654873 06-01-2011:16.06 Examine casually Ellen 400... 35654874 07-01-2011:16.22 Check ticket Mike 100... 35654875 07-01-2011:16.52 Decide Sara 200... 35654877 16-01-2011:11.47 Pay compensation Mike 200........................ Table 1.2 A more compact Fig. 1.6 The process model representation discovered byofthe logα-algorithm shown Case based id on Cases 1 and 4, i.e., Trace the set of traces { a, b, d, e, h, a, d, b, ine, Table h } 1.1: a = register request, b = examine 1 a, b, d, e, h thoroughly, c = examine 2 a, d, c, e, g The Petri net shown incasually, Fig. 1.5 d = also check allows ticket, for traces 3 not present in Table a, 1.2. c, d, For e, f, b, d, e, g e = decide, f = reinitiate example, the traces a, d, request, c, e, f, g = b, pay d, e, g and a, 4 c, d, e, f, c, d, e, f, c, a, d, d, e, b, f, e, c, h d, e, f, b, d, e, g are also compensation, possible. and This h = is reject a desired 5 phenomenon as the a, c, goal d, e, is f, d, c, e, f, c, d, e, h not to represent just the request particular set of example 6 traces in the event log. a, Process c, d, e, g mining algorithms need to generalize the behavior... contained in the log to. show.. the most likely underlying model that is not invalidated by the next set of observations. 31 One of the challenges of process mining is to balance between overfitting (the

14 1 Introduction Table 1.1 (Continued) Case id Event id Properties Conformance Example Timestamp Activity Resource Cost... 6 35654871 06-01-2011:15.02 Register request Mike 50... 35654873 06-01-2011:16.06 Examine casually Ellen 400... 35654874 07-01-2011:16.22 Check ticket Mike 100... 35654875 07-01-2011:16.52 Decide Sara 200... 35654877 16-01-2011:11.47 Pay compensation Mike 200... 1.4... Analyzing an Example Log............... 15... Table 1.2 A more compact representation of log shown Case id Trace in Table 1.1: a = register request, b = examine 1 a, b, d, e, h thoroughly, c = examine 2 a, d, c, e, g casually, d = check ticket, Table 1.3 Another event log: 3 a, c, d, e, f, b, d, e, g e = decide, f = reinitiate request, g = pay 4 Cases 7, 8, and a, 10 d, b, are e, h not compensation, Fig. 1.6 The process and h = model rejectdiscovered 5 possible by the α-algorithm according based a, onc, Cases d, to e, f, 1Fig. and d, c, 4, e, i.e., 1.5 f, c, the d, set e, h of request traces { a, b, d, e, h, a, d, b, e, 6 h } a, c, d, e, g 16 1 Introduction...... The Petri net shown in Fig. 1.5 also allows for traces not present in Table 1.2. For example, the traces a, d, c, e, f, b, d, e, g and a, c, d, e, f, c, d, e, f, c, d, e, f, c, d, e, f, b, d, e, g are also possible. This is a desired phenomenon as the goal is not to represent just the particular set of example traces in the event log. Process mining algorithms need to generalize the behavior contained in the log to show the most likely underlying model that is not invalidated by the next set of observations. One of the challenges of process mining is to balance between overfitting (the model is too specific and only allows for the accidental behavior observed) and underfitting (the model is too general and allows for behavior unrelated to the behavior observed). When comparing the event log and the model, there seems to be a good balance between overfitting and underfitting. All cases start with a and end with either g or h. Every e is preceded by d and one of the examination activities (b or c). Moreover, e is followed by f, g, or h. The repeated execution of b or c, d, and e suggests the presence of a loop. These characteristics are adequately captured by Fig. 1.5 The process model discovered by the α-algorithm [103] based on the set of traces the net of Fig. 1.5. { a, b, d, e, h, a, d, c, e, g, a, c, d, e, f, b, d, e, g, a, d, b, e, h, a, c, d, e, f, d, c, e, f, c, d, e, h, Let a, usc, now d, e, g } consider an event log consisting of only two traces a, b, d, e, h and a, d, b, e, h, i.e., Cases 1 and 4 of the original log. For this log, the α-algorithm constructs the Petri net shown in Fig. 1.6. This model only allows for two traces After and these executing are exactly h, the the case ones ends in the in the small desired eventfinal log. bmarking and d are with modeled just a token as being Case id 32 Trace 1 a, b, d, e, h 2 a, d, c, e, g 3 a, c, d, e, f, b, d, e, g 4 a, d, b, e, h 5 a, c, d, e, f, d, c, e, f, c, d, e, h 6 a, c, d, e, g 7 a, b, e, g 8 a, b, d, e 9 a, d, c, e, f, d, c, e, f, b, d, e, h 10 a, c, d, e, f, b, d, g

Process Discovery: α-algorithm 33

Process Discovery Process discovery is the activity that combines Discovery with the Control-flow Perspective. The general problem: A process discovery algorithm is a function that maps an event log L onto a process model M such that the model M is representative for the behavior seen in the event log L. We focus on simple event logs and Petri net models (possibly sound workflow nets). 34

etri net that can replay event log L 1. Ideally, the Petri net is a sound WF-net efined in Sect. 2.2.3. Based on these choices, we reformulate the process discove roblem and make it more concrete. Simple Event Log efinition 5.2 (Specific process discovery problem) A process discovery algorith s a function γ that maps a log L B(A ) onto F-net 1 discovered for L 1 =[ a, b, c, d 3 a marked Petri, a, c, b, d 2 net γ (L) = (N, M deally, N is a sound WF-net and all traces in L correspond, a, to e, possible d ] firing s uences of (N, M). Let A be a set of activities. make things more concrete, we define the target to be a Petri ne A simple trace over A is a finite sequence of activities. Function γ defines a so-called Play-in technique as described in Chap. 1. Base, nwe L 1, ause process a simple discovery event algorithm log as γ could input discover (cf. Definition the WF-net shown 4.4). in AFig. simp 5..e., γ (L 1 A ) = simple (N 1, [start]). event Each log trace over ina Lis 1 corresponds a multiset to of atraces. possible firing s uence of WF-net N 1 shown in Fig. 5.1. Therefore, it is easy to see that the WF-n an indeed replay all traces [ in the event log. In fact, each of the three L 1 = a, b, c, d 3, a, c, b, d 2 ] possible firin equences of WF-net N 1 appears in L 1., a, e, d Let us now consider another event log: multi-set of traces over some set of activities A, i.e., L B(A ple log describing the history of six cases. The goal is now to di L 2 = [ a, b, c, d 3, a, c, b, d 4, a, b, c, e, f, b, c, d 2, a, b, c, e, f, c, b, d, a, c, b, e, f, b, c, d 2, a, c, b, e, f, b, c, e, f, c, b, d ] hat can replay event log L 1. Ideally, the Petri net is a sound W Sect. 2.2.3. Based on these choices, we reformulate the process d nd make it more concrete. 35 2 is a simple event log consisting of 13 cases represented by 6 different trace

Challenges 5.4 Challenges 151 Simple structure Other behaviours allowed No completely unrelated behaviour Fig. 5.22 Balancing the four quality dimensions: fitness, simplicity, precision, and generalization made. For example: What is the penalty if 36 a step needs to be skipped and what is the penalty if tokens remain in the WF-net after replay? Later, we will give concrete

Appropriateness 5.4 Challenges Fig. 5.22 Balancing the four quality dimensions: fitness, simplicity, precision, and gen 37 made. For example: What is the penalty if a step needs to be skipped an the penalty if tokens remain in the WF-net after replay? Later, we will giv definitions for fitness. In Sect. 3.6.1, we defined performance measures like error, accurac fp-rate, precision, recall, and F1 score. Recall, also known as the tp-rate, the proportion of positive instances indeed classified as positive (tp/p). T in the log are positive instances. When such an instance can be replay model, then the instance is indeed classified as positive. Hence, the variou of fitness can be seen as variants of the recall measure. Most of the notion in Sect. 3.6.1 cannot be used because there are no negative examples, i.e. are unknown (see Fig. 3.14). Since the event log does not contain informa events that could not happen at a particular point in time, other notations ar The simplicity dimension refers to Occam s Razor. This principle wa discussed in Sect. 3.6.3. In the context of process discovery, this mean simplest model that can explain the behavior seen in the log, is the be The complexity of the model could be defined by the number of nodes in the underlying graph. Also more sophisticated metrics can be used, e.g that take the structuredness or entropy of the model into account. Se an empirical evaluation of the model complexity metrics defined in lite Sect. 3.6.3, we also mentioned that this principle can be operationalized

α-algorithm The α-algorithm was one of the first process discovery algorithms that could adequately deal with concurrency. It has several limitations, but it provides a good introduction into the topic: The α-algorithm is simple and many of its ideas have been embedded in more complex and robust techniques. The α-algorithm scans the event log for particular patterns, called log-based ordering relations, to create a footprint of the log. 38

(c, d), (e, d) d # L1 L1 b L1 # L1 L1 L1 # L1 c L1 L1 # L1 L1 # L1 Log-based Ordering d # L1 L1 L1 # L1 L1 e L1 # L1 # L1 L1 # L1 Relations Definition 5.3 (Log-based ordering relations) L L B(A B(A ). Let a, ). b Let A : a, b A : Let L be an event log over A, i.e., a, a> L b, b if c, and d. onlyhowever, if there is a trace d σ = t 1,t 2,t 3,...,t n and i {1,...,n 1} such that σ L and t i = a and t i+1 = b L1 c because c never hea log. such L b if and that L1 onlyσ contains if a> L band ball t i L = pairs a a andoft i+1 activities = b in a # L b if and only if a L b and b L a a a L b if and L b if and only if a> only if a> L b and b> L a L b and b L a Consider for instance L 1 =[ a, b, c, d 3, a, c, b, d 2, a, e, d ] again. For this and a sometimes event log, the L b if and only the other if a> way following log-based ordering L b and around. b> relations can be found L b a# L1 e > L1 = { (a, b), (a, c), (a, e), (b, c), (c, b), (b, d), (c, d), (e, d) } L1 = { (a, b), (a, c), (a, e), (b, d), (c, d), (e, d) } e L1 # L1, (c, c), (c, e), (d, a), (d, d), (e, b), (e, c), (e, e) } Definition 5.3 (Log-based ordering relations) tivities in a directly follows relation. c> L1 d Let L b a> L b if and only if there is a trace σ = t 1,t 2,t 3,.. d because sometimes d directly follows c and d and a # L d b if L1 and c). only b L1 if ac because L b andb> b L1L ac and Consider for instance L 1 =[ a, b, c, d 3, a, c, b, d A : x L y, y L x, x # L y, or x L y, event log, the following log-based ordering relations ca holds 39 # L1 = { for any pair of activities. Therefore, the (a, a), (a, d), (b, b), (b, e), (c, c), (c, e), (d, a), (d, d), (e, b), (e, c), (e, e) }

, b A : c L1 L1 # L1 L1 # L1 e L1 # L1 # L1 L1 # L1 d # L1 L1 L1 # L1 L1 Log-based e L1 # L1 # L1 Ordering L1 # L1 nly if there is a trace σ = t 1,t 2,t 3,...,t n and i {1,..., Definition 5.3 (Log-based ordering ordering relations) relations) L Let a, b A : Land B(At i = ). Let a and a, b ta i+1 : = b Let L be an event log over A, i.e., Relations: Example a> L b if and only if there is a trace σ = t 1,t 2,t 3,...,t n and i {1,...,n 1} such that σ L and t i = a and t i+1 = b a L b if and only if a> L b and b L a a # L b if and only if a L b and b L a a L b if and only if a> L b and b> L a nly if a> L b and b L a > L1 = { (a, b), (a, c), (a, e), (b, c), (c, b), (b, d), (c, d), (e, d) } L1 = { (a, b), (a, c), (a, e), (b, d), (c, d), (e, d) } # L1 = { (a, a), (a, d), (b, b), (b, e), (c, c), (c, e), (d, a), (d, d), (e, b), (e, c), (e, e) } L1 = { (b, c), (c, b) } Relation > L1 contains all pairs of activities in a directly follows relation. c> L1 d because d directly follows c in trace a, b, c, d. However, d L1 c because c never directly follows d in any trace in the log. L1 contains all pairs of activities in a causality relation, e.g., c L1 d because sometimes d directly follows c and never the other way around (c > L1 d and d L1 c). b L1 c because b> L1 c and c> L1 b, i.e., sometimes c follows b and sometimes the other way around. b # L1 e, d), because (b, b L1 b), e and(b, e L1 b. e), (c, c), (c, e), (d, a), (d, d), (e, b), (e, c), For any log L over A and x, y A : x L y, y L x, x # L y, or x L y, 40 i.e., } precisely one of these relations holds for any pair of activities. Therefore, the footprint of a log can be captured in a matrix as shown in Table 5.1. Let L be an event log over A, i.e., a> L b if and only if there is a trace σ = t 1,t 2,t 3,...,t n and i {1,...,n 1} lysuch if athat σ L b Land t i b= a and L at i+1 = b a L b if and only if a> L b and b L a ly if a> a # Consider L b if and L b and b> for instance only Lif 1 =[ a, a L b, b L a c, d and 3, a, bc, b, L d a 2, a, e, d ] again. For this a event L log, b ifthe and following onlylog-based if a> ordering L b and relations b> can L abe found stance L 1 =[ a, b, c, d 3, a, c, b, d 2, a, e, d ] again. F Consider for instance L 1 =[ a, b, c, d 3, a, c, b, d 2, a, e, d ] again. For this event wing log, log-based the followingordering log-based ordering relations relations can canbe found > L1 = { (a, b), (a, c), (a, e), (b, c), (c, b), (b, d), (c, d), (e, d) }, c), (a, e), (b, c), (c, b), (b, d), (c, d), (e, d) } L1 = { (a, b), (a, c), (a, e), (b, d), (c, d), (e, d) }, c), (a, e), (b, d), (c, d), (e, d) } # L1 = { (a, a), (a, d), (b, b), (b, e), (c, c), (c, e), (d, a), (d, d), (e, b), (e, c), (e, e) } L1 = { (b, c), (c, b) } Relation > L1 contains all pairs of activities in a directly follows relation. c> L1 d

and t i = a and t i+1 = b only if a> L b and b L a Footprint Matrix: nly if a L b and b L a nly if a> L b and b> L a Example instance L 1 =[ a, b, c, d 3, a, c, b, d 2, a, e, d ] again. F lowing log-based ordering relations can be found a, c), (a, e), (b, c), (c, b), (b, d), (c, d), (e, d) } a, c), (a, e), (b, d), (c, d), (e, d) } 5 Process Discovery: An Introduction a b c d e a # L1 L1 L1 # L1 L1 b L1 # L1 L1 L1 # L1 c L1 L1 # L1 L1 # L1 a, d), (b, b), (b, e), (c, c), (c, e), (d, a), (d, d), (e, b), (e, c) d } # L1 L1 L1 # L1 L1 c, b) e L1 # L1 # L1 L1 # L1 tains all pairs of activities in a directly follows relation. c 41

c, d 2, c, e, f, c, b # c # Patterns d # # e # # Footprints are f useful to # discover typical patterns of activities # in the corresponding process model 42

d # # # e # # # f # # Patterns Footprints are useful to discover typical patterns of activities in the corresponding process model 43

Patterns Footprints are useful to discover typical patterns of activities in the corresponding process model. 5.4 Typical process patterns and the footprints they leave in the event log 44

5.2.2 Algorithm After showing the basic idea and some examples, we describe the α-algorithm [103]. Definition 5.4 (α-algorithm) as follows. The Algorithm Let L be an event log over T A. α(l) is defined (1) T L ={t T σ L t σ } transitions (2) T I ={t T σ L t = first(σ )} start event (3) T O ={t T σ L t = last(σ )} end event (4) X L ={(A, B) A T L A = B T L B = a A b B a L b a1,a 2 A a 1 # L a 2 b1,b 2 B b 1 # L b 2 } decision point (5) Y L ={(A, B) X L (A,B ) X L A A B B = (A, B) = (A,B )} (6) P L ={p (A,B) (A, B) Y L } { i L,o L } places (7) F L ={(a, p (A,B) ) (A, B) Y L a A} { (p (A,B), b) (A, B) Y L b B} {(i L, t) t T I } {(t, o L ) t T O } arcs (8) α(l) = (P L,T L,F L ) net max decision point L is an event log over some set T of activities. In Step 1, it is checked which activities do appear in the log (T L ). These will correspond to the transitions of the generated WF-net. T I is the set of start activities, 45 i.e., all activities that appear first in some trace (Step 2). T O is the set of end activities, i.e., all activities that appear last in

The Core of the Algorithm: Steps 4, 5 How to identify L? Rearrange the lumns ing to,...,a m } and,...,b n } and other rows and m the footprint a 1 a 2... a m b 1 b 2... b n a 1 # #... #... a 2 # #... #.............................. a m # #... #... b 1... # #... # b 2... # #... #........................... b n... # #... # consider L 1 again. Clearly, A ={a} and B ={b, 46 e} meet the requirements

The Core of the Algorithm: Step 4, 5 5 Process Discovery: An Introduction (A,B) sitions in set ns in set B to identify arrange the s a 1 a 2... a m b 1 b 2... b n 47

rows and columns corresponding to A ={a 1,a 2,...,a m } and B ={b 1,b 2,...,b n } and remove the other rows and columns from the footprint The Algorithm: 5 Process Discovery: Example An Introduction nt oflet L 1 : us consider L a L1 c, a 1 again. Clearly, b 2 A ={a} and... B ={b, e} # meet the # requirements... # b c d e stated in Step 4. Also A ={a}.. and. B... ={b}... meet... the same... requirements........ X.. L is. the.. set of all such pairs that meet the b a # n requirements just... mentioned. # In this # case:... # L1 L1 L1 # L1 L1 X b L1 # L1 L1 L1 # L1 L1 = {( {a}, {b} ), ( {a}, {c} ), ( {a}, {e} ), ( {a}, {b, e} ), ( {a}, {c, e} ), c L1 L1 # L1 L1 # Let us consider ( L L1 1 again. ) ( Clearly, ) ( A ={a} ) and ( B ={b, e} ) meet ( the requirements )} stated ind {b}, Step 4. Also # {d}, {c}, L1 A ={a} {d}, {e}, {d} and L1 B ={b}, {b, e}, meet L1 the same # {d}, {c, e}, L1 requirements. {d} L1 X L is the set of allesuch pairs that L1 meet the # L1 requirements # L1 just mentioned. L1 In this# case: If one would insert a place for any element in X L1 L1, there would be too many places. Therefore, only the X L1 = {( maximal {a}, {b} ), ( pairs {a}, {c} ) (A,, ( B) should {a}, {e} ), ( be included. {a}, {b, e} ), ( Note that {a}, {c, e} ) for any, pair (A, B) X L, nonempty ( ) ( set A ) A, ( and nonempty ) ( set ) B ( B, it)} is implied that (A,B ) X L. In {b}, Step {d} 5,, all {c}, nonmaximal Let {d} L, {e}, be {d} anpairs event, {b, are e}, log removed, {d} over, {c, A thus e},, {d} i.e., yielding: Log-based ordering relations) a, b A : b 2... # #... # a 1 # #... #.............................. a 2 # #... #... b n... # #... #........................... a m # #... #... b 1... # #... # If one would Y insert a place for any element in X L1, there would be too many places. L1 = {( {a}, {b, e} ), ( {a}, {c, e} ), ( {b, e}, {d} ), ( {c, e}, {d} )} Therefore, only the maximal pairs (A, B) should be included. Note that for any pair (A, B) X L, nonempty set A A, and nonempty set B B, it is implied Step 5 can that (A,B also be understood in terms the footprint matrix. Consider Table 5.4 ) X L. In Step 5, all nonmaximal pairs are removed, thus yielding: only if there is a trace σ = t 1,t 2,t 3,...,t n and i {1,...,n 1} L and t i = a and t i+1 = b d only if a> L b and b L a only giovedì 13 dicembre if a 2012 L b and b{( L a ) ( ) ( ) ( )} and let A and B be such that A 48 A and B B. Removing rows and columns A B \ (A B ) results in a matrix still having the pattern shown in

X L1 = {a}, {b}, {a}, {c}, {a}, {e}, {a}, {b, e}, {a}, {c, e}, ( ) ( ) ( ) ( ) ( )} {b}, {d}, {c}, {d}, {e}, {d}, {b, e}, {d}, {c, e}, {d} only if a> L b and b L a nly if a L b and b L a The Algorithm: Example nly if a> L b and b> L a If one would insert a place for any element in X L1, there would be too many places. Therefore, only the maximal pairs (A, B) should be included. Note that for any pair (A, B) X L, nonempty set A A, and nonempty set B B, it is implied that (A,B ) X L. In Step 5, all nonmaximal pairs are removed, thus yielding: instance L 1 =[ a, b, c, d 3, a, c, b, d 2, a, e, d ] again. F lowing log-based ordering relations can be found Y L1 = {( {a}, {b, e} ), ( {a}, {c, e} ), ( {b, e}, {d} ), ( {c, e}, {d} )} 126 5 Process Discovery: An Introduction Step 5 can also be understood in terms the footprint matrix. Consider Table 5.4 and let A and B be such that A A and B B. Removing rows and columns A B \ (A B ) results in a matrix still having the pattern shown in Table 5.4. Therefore, we only consider maximal matrices for constructing Y L. a, c), (a, e), (b, c), (c, b), (b, d), (c, d), (e, d) } a, c), (a, e), (b, d), (c, d), (e, d) } a, d), (b, b), (b, e), (c, c), (c, e), (d, a), (d, d), (e, b), (e, c) } c, b) Fig. 5.1 WF-net N 1 discovered for L 1 =[ a, b, c, d 3, a, c, b, d 2, a, e, d ] tains all pairs of activities in a directly follows relation. c ments. To make things more concrete, we define the target to be a Petri net model. Moreover, we use a simple event log as 49 input (cf. Definition 4.4). A simple event log L is a multi-set of traces over some set of activities A, i.e., L B(A ). For y follows c in trace a, b, c, d. However, d L c because

i.e., γ (L 1 ) = (N 1, [start]). Each trace in L 1 corresponds to a possible firing sequence of WF-net N 1 shown in Fig. 5.1. Therefore, it is easy to see that the WF-net can indeed replay all traces in the event log. In fact, each of the three possible firing sequences of WF-net N 1 appears in L 1. Let us now consider another event log: Other Examples L 2 = [ a, b, c, d 3, a, c, b, d 4, a, b, c, e, f, b, c, d 2, a, b, c, e, f, c, b, d, a, c, b, e, f, b, c, d 2, a, c, b, e, f, b, c, e, f, c, b, d ] Process Discovery 131 L 2 is a asimple bevent log c consisting d of e 13 cases f represented by 6 different traces. Based on event log L 2, some γ could discover WF-net N 2 shown in Fig. 5.2. This a # # # # WF-net can indeed replay all traces in the log. However, not all firing sequences of b # N 2 correspond to traces in L 2. For example, the firing sequence a, c, b, e, f, c, b, d c # does not appear in L 2. In fact, there are infinitely many firing sequences because of d # # # # the loop construct in N 2. Clearly, 5.1 these Problemcannot Statement all appear in the event log. Therefore, e # # # Definition 5.2 does not require all firing sequences of (N, M) to be traces in L. f # # # 50

g. 5.6 WF-net N 4 derived from L 4 =[ a, c, d 45, b, c, d 42, a, c, e 38, b, c, e 22 ] Other Examples L 3 = [ a, b, c, d, e, f, b, d, c, e, g, a, b, d, c, e, g 2, a, b, c, d, e, f, b, c, d, e, f, b, d, c, e, g ] d from L 3 =[ a, b, c, d, e, f, b, d, c, e, g, a, b, d, c, e, g 2, a, b, c,, g ] a b c d e f g e α-algorithm constructs WF-net N 3 based on L 3 (see Fig. 5.5). a # # # # # # Table b 5.3 shows # the footprint # of L 3. Note # that the patterns in the model inde atchc the# log-based # ordering relations # extracted # from the event log. Consider, d # # # # ample, the process fragment involving b, c, d, and e. Obviously, this fragm e # # # 132 5 Process Discovery: An Introduction n f be constructed # # based# on b L3 # c, b# L3 d, c L3 d, c L3 e, and d L3 e g choice # following # # e is# revealed by # e # L3 f, e L3 g, and f # L3 g. Etc. Another example is shown in Fig. 5.6. WF-net N 4 can be derived from L 4 L 4 = [ a, c, d 45, b, c, d 42, a, c, e 38, b, c, e 22] from L 4 =[ a, c, d 45, b, c, d 42, a, c, e 38, b, c, e 22 ] 51 Fig. 5.5 WF-net N derived from L =[ a, b, c, d, e, f, b, d, c, e, g, a, b, d, c, e, g 2, a, b, c,

match the log-based ordering relations extracted from the event log. Consider, for example, the process fragment involving b, c, d, and e. Obviously, this fragment can be constructed based on b L3 c, b L3 d, c L3 d, c L3 e, and d L3 e. The choice following e is revealed by e L3 f, e L3 g, and f # L3 g. Etc. Another example is shown in Fig. 5.6. WF-net N 4 can be derived from L 4 Other Examples L 4 = [ Fig. 5.5 a, c, d 45 WF-net N 3 derived, b, c, d 42 from L 3 =[ a,, a, c, e 38 b, c, d, e, f, b,, b, c, e 22] d, c, e, g, a, b, d, c, e, g 2, a, d, e, f, b, c, d, e, f, b, d, c, e, g ] Table 5.3 Footprint of L 3 a b c d e f a b c d e a # # # # b # # # # c # d # # # # e # # # # a # # # # # b # # c # # # d # # # e # # # f # # # # g # # # # # Fig. 5.6 WF-net N 4 derived 52 from L 4 =[ a, c, d 45, b, c, d 42, a, c, e 38, b, c, e 22 ]

3 4 shows that indeed α(l 3 ) = N 3 and α(l 4 ) = N 4. In Figs. 5.5 and 5.6, the p named based on the sets Y L3 and Y L4. Moreover, α(l 1 ) = N 1 and α(l 2 ) = ulo renaming of places (because different place names are used in Figs. 5.1 renaming of places (because different place names are used in Figs. 5.1 and 5. ese examples show that the α-algorithm is indeed able to discover WF-nets bas These examples show that the α-algorithm is indeed able to discover WF-n event logs. on event logs. Let us now consider event log L 5 : Other Examples Let us now consider event log L 5 : L 5 = [ a, b, e, f 2, a, b, e, c, d, b, f 3, a, b, c, d, b, 2 a, b, c, d, e, b, f 4, a, e, b, c, d, b, f 3], a, b, c, d, e, b, f 4, a, e, b, c, d, b, f 3] r Process Discovery 135 a b c d e f le 5.5 shows the footprint of the log. T a # # # # I ={a} Let us now apply the 8 steps of the algorithm for L = L 5 : b # c # # # T L ={a, b, c, d, e, f } d # # # e # T I ={a} f # # # 5 Process Discovery: # An Introduction Fig. 5.8 WF-net N 5 derived from L 5 =[ a, b, e, f 2, a, b, e, c, d, b, f 3, a, b, c, e, d, b a, b, c, d, e, b, f 4, a, e, b, c, d, b, f 3 ] T I ={f } L 5 = [ a, b, e, f 2, a, b, e, c, d, b, f 3, a, b, c, e, d, b, f 2, 136 5 Process Discovery: An Introdu Table 5.5 shows the footprint of the log. Let us now apply the 8 steps of the algorithm for L = L 5 : T L ={a, b, c, d, e, f } T I ={f } X L = {( {a}, {b} ), ( {a}, {e} ), ( {b}, {c} ), ( {b}, {f } ), ( {c}, {d} ), ( {d}, {b} ), ( {e}, {f } ), ( {a, d}, {b} ), ( {b}, {c, f } )} Y L = {( {a}, {e} ), ( {c}, {d} ), ( {e}, {f } ), ( {a, d}, {b} ), ( {b}, {c, f } )} P L = { p ({a},{e}),p ({c},{d}),p ({e},{f }),p ({a,d},{b}),p ({b},{c,f }),i L,o L } B) Y L corresponds to a place p (A,B) connecting transi- In addition, P L X L = {( also contains a unique {a}, {b} ), ( source place {a}, {e} ) i L, ( and F L = { (a, {b}, {c} ) p, ( ({a},{e}) ), (p ({a},{e}) {b}, {f } ), (, e), (c, p ({c},{d}) {c}, {d} ) ), (p ({c},{d}), d), f. Step 6). Remember that the goal is to create a WF-net. 1 Nevertheless, 1 the (e, α-algorithm p may construct a Petri net that is not a WF-net (see, fo ({e},{f }) ), (p ({e},{f }), f ), (a, p ({a,d},{b}) ), (d, p, ({a,d},{b}) ), the WF-net are generated. All start transitions in T Fig. I have 5.12). Later, we will discuss such problems in detail. (p all end transitions( T O have o L as ) output ( place. All ) places ({a,d},{b}), b), (b, p ({b},{c,f }) ), (p ({b},{c,f }), c), (p ({b},{c,f }), f ), ( ) ( )} odes and B as output {d}, nodes. {b} The result, {e}, is a Petri {f net } α(l), {a, = es the behavior d}, (i {b} L, a),, (f, o {b}, L ) } {c, f } 53 Y L = {( seen in event log {a}, {e} ) L. four logs and four WF-nets. Application, ( of the α-algorithm {c}, {d} ), ( α(l) = (P {e}, {f } ) L,T ( L,F L ) {a, d}, {b} ), ( {b}, {c, f } )} 5.8 WF-net N = N 3 and α(l 5 derived from L 4 ) = N 4. In 5 =[ a, b, e, f Figs. 5.5 and 2, a, b, e, c, d, b, f 5.6, the places 3, a, b, c, e, d, b, f are 2,, c, d, e, b, f 4, a, e, b, c, d, b, f 3 ] Figure 5.8 shows N 5 = α(l 5 ), i.e., the model just computed. N 5 can in

tly followed by b. Consequently, a footprint like the one shown in Table 5.5 ed to be valid. Limitation: We revisit the notion of completeness Implicit later in this chapter. en if we assume that the log is complete, the α-algorithm has some problem e are many different WF-nets that have the same possible behavior, i.e., tw ls can be structurally differentplaces but trace equivalent. Consider, for instance, th ing event log: L 6 = [ a, c, e, g 2, a, e, c, g 3, b, d, f, g 2, b, f, d, g 4] 5.2 A Simple Algorithm for Process Discovery 137 ) is shown in Fig. 5.9. Although the model is able to generate the observe ior, the resulting WF-net is needlessly complex. Two of the input places of dundant, i.e., they can be removed without changing the behavior. The place ted as p 1 and p 2 are so-called implicit places and can be removed witho Fig. 5.9 WF-net N 6 derived from L 6 =[ a, c, e, g 2, a, e, c, g 3, b, d, f, g 2, b, f, d, g 4 ]. The two highlighted places are redundant, i.e., removing them will simplify the model without changing its behavior p1 and p2 are redundant Fig. 5.10 Incorrect WF-net N 7 derived from L 7 =[ a, c 2, a, b, c 3, 54

possible trace equivalent WF-nets. e original α-algorithm (as presented in Sect. 5.2.2) has problems dealing loops, i.e., loops of length one or two. For a loop of length one, this Limitation: Short Loop ed by WF-net N 7 in Fig. 5.10, which shows the result of applying the b thm to L 7. t N 6 derived from L 6 =[ a, c, e, g 2, a, e, c, g 3, b, d, f, g 2, b, f, d, g 4 ]. The places are redundant, i.e., removing them will simplify the model without changing L 7 = [ a, c 2, a, b, c 3, a, b, b, c 2, a, b, b, b, b, c 1] e resulting model is not a WF-net as transition b is disconnected from the rect WF-net model. The models allows for the execution of b before a and after c. Th nsistent, b, c 3, with the event log. This problem can be addressed easily as show b, b, sing an improved version of the Fig. α-algorithm, 5.9 WF-net N 6 derived from L 6 =[ a, one c, e, g 2, can a, e, c, g 3 discover, b, d, f, g 2, b, f, d, g 4 the ]. The WF two highlighted places are redundant, i.e., removing them will simplify the model without changing in Fig. 5.11. e problem with loops of length two is illustrated by Petri net N 8 in Fig. L 7 =[ a, c 2, a, b, c 3, et N 7 having rt-loop of 5.2 A Simple Algorithm for Process Discovery 137 its behavior Fig. 5.10 Incorrect WF-net N 7 derived from a, b, b, c 2, a, b, b, b, b, c 1 ] b is disconnected from the model shows the result of applying the basic algorithm to L 8. [ L 8 = a, b, d 3 Fig. 5.11 WF-net, a, b, c, b, d 2 N ] 7 having a so-called short-loop of length one, a, b, c, b, c, b, d Expected net: 55 affecting the set of possible firing sequences. In fact, Fig. 5.9 shows only one of many possible trace equivalent WF-nets. The original α-algorithm (as presented in Sect. 5.2.2) has problems dealing with

ig. 5.11. blem with loops of length two is illustrated by Petri net N 8 in Fi Limitation: Short Loop ws the result of applying the basic algorithm to L 8. L 8 = [ a, b, d 3, a, b, c, b, d 2, a, b, c, b, c, b, d ] Fig. 5.13 Corrected WF-net N 8 having a so-called short-loop of length two 138 5 Process Discovery: An Introduction rrected WF-net N 8 having a so-called short-loop of length two og: a L8 b, that The b and following c log-based ordering relations are derived from this event log: a L8 b, g. 5.12 b is L8 not d, and b L8 c. Hence, the basic algorithm incorrectly assumes that b and c ng log-based he extension are in parallel ordering because relations they follow are derived one another. from this The event model log: shown a in L8 Fig. b, 5.12 is not nd b F-net even L8 shown a c. WF-net, Hence, because the basic c is algorithm not on a path incorrectly from source assumes to sink. that Using b and the c extension lel because they follow one another. The model shown in Fig. 5.12 is not net, able inbecause to Fig. deal 5.13. c is not on a path from source to sink. Using the extension n [30], c is disconnected from the model lternatives There the improved to are various α-algorithm ways to improve correctlythe discovers basic α-algorithm the WF-netto shown be able to deal. t. 5.2.2. with loops. The The α + -algorithm described in [30] is one of several alternatives to re various phase address ways deals problems to improve related tothethebasic original α-algorithm topresented be able in to Sect. deal 5.2.2. The The ps α of + α length -algorithm + -algorithm usesdescribed a pre andin postprocessing [30] is one of phase. several The alternatives preprocessing tophase deals blems withrelated loops of tolength the original two whereas algorithm the preprocessing presented phase Sect. inserts 5.2.2. The loops of length Expected net: m uses or more. one. a pre and postprocessing phase. The preprocessing phase deals For Fig. 5.13 Corrected WF-net N 8 having a so-called short-loop two rency of length can Thebe two basicwhereas algorithmthehas preprocessing no problems phase mininginserts loops of loops length of three lengthor more. For 56 Fig. 5.13 Corrected WF-net N 8 having a so-called short-loop of length two L b, a b> loop of L c, involving at least three activities (say a, b, and c), concurrency can be described Fig. 5.12 in [30], Incorrect thewf-net improved N 8 derived α-algorithm from L 8 =[ a, correctly b, d 3, a, discovers b, c, b, d 2, the a, b, WF-net c, b, c, b, shown d ] 138 5 Process Discovery: An In Fig. 5.12 Incorrect WF-net N 8 derived from L 8 =[ a, b, d 3, a, b, c, b, d 2, a, b, c, b

Limitation: Noise We use the term noise to refer to rare and infrequent behavior rather than errors related to event logging. For example, frequencies are not taken into account by the α-algorithm (discard less frequent traces?). 57

Limitation: Noise 58 6 Advanced Process Discovery Techniques 58

Limitation: Noise 59 Fig. 6.1 Overview of the challenges that process discovery techniques need to address

Limitation: Incompleteness Whereas noise refers to the problem of having too much data (describing rare behavior), (in)completeness refers to the problem of having too little data. Process models typically allow for an exponential or even infinite number of different traces (in case of loops). Moreover, some traces may have a much lower probability than others. Therefore, it is unrealistic to assume that every possible trace is present in the event log. 60

Limitation: Incompleteness The α-algorithm uses a local completeness notion: if there are two activities a and b, and a can be directly followed by b, then this should be observed at least once in the log. 61

Conformance Checking 62

Two Angles Conformance check is based on the comparison between an event log and a process model. (Un)desirable deviations can be detected. First viewpoint (the model is supposed to be descriptive): the model does not capture the real behavior ( the model is wrong, how to improve it? ) Second viewpoint (the model is normative) reality deviates from the desired model ( the event log is wrong, how to impose control? ). 63

Measures and Diagnostic 192 7 Conformance Checking Fig. 7.1 Conformance checking: comparing observed behavior with modeled behavior. Global conformance measures quantify the overall conformance of the model and log. Local diagnostics are given by highlighting the nodes in the model where model and log disagree. Cases that do not fit are highlighted in the visualization of the log 64

Business Alignment The goal of business alignment is to make sure that the information systems and the real business processes are well aligned. People should be supported by the information system rather than work behind its back to get things done. Process mining can assist in improving the alignment of information systems, business processes, and the organization. By analyzing the real processes and diagnosing discrepancies, new insights can be gathered showing how to improve the support by information systems. 65

Auditing The term auditing refers to the evaluation of organizations and their processes. Audits are performed to ascertain the validity and reliability of information about these organizations and associated processes. This is done to check whether business processes are executed within certain boundaries set by managers, governments, and other stakeholders. Rules violations may indicate fraud, malpractice, risks, and inefficiencies. 66

New Forms of Auditing However, today detailed information about processes is being recorded in the form of event logs, audit trails, transaction logs, databases, data warehouses, etc. All events in a business process can be evaluated and this can be done while the process is still running. The availability of log data and advanced process mining techniques enables new forms of auditing, and conformance checking in particular, provide the means to do so. 67

Quality Criteria We have seen four quality criteria: fitness, precision, generalization, and simplicity. In an example shown, for each of these models, a subjective judgment is given with respect to the four quality criteria. As the models are rather extreme, the scores +/- for the various quality criteria are evident. We discuss how the notion of fitness can be quantified. 68

Appropriateness 5.4 Challenges Fig. 5.22 Balancing the four quality dimensions: fitness, simplicity, precision, and gen 69 made. For example: What is the penalty if a step needs to be skipped an the penalty if tokens remain in the WF-net after replay? Later, we will giv definitions for fitness. In Sect. 3.6.1, we defined performance measures like error, accurac fp-rate, precision, recall, and F1 score. Recall, also known as the tp-rate, the proportion of positive instances indeed classified as positive (tp/p). T in the log are positive instances. When such an instance can be replay model, then the instance is indeed classified as positive. Hence, the variou of fitness can be seen as variants of the recall measure. Most of the notion in Sect. 3.6.1 cannot be used because there are no negative examples, i.e. are unknown (see Fig. 3.14). Since the event log does not contain informa events that could not happen at a particular point in time, other notations ar The simplicity dimension refers to Occam s Razor. This principle wa discussed in Sect. 3.6.3. In the context of process discovery, this mean simplest model that can explain the behavior seen in the log, is the be The complexity of the model could be defined by the number of nodes in the underlying graph. Also more sophisticated metrics can be used, e.g that take the structuredness or entropy of the model into account. Se an empirical evaluation of the model complexity metrics defined in lite Sect. 3.6.3, we also mentioned that this principle can be operationalized

Measuring Fitness However, in a more realistic setting it is much more difficult to judge the quality of a model. Fitness measures the proportion of behavior in the event log possible according to the model. Of the four quality criteria, fitness is most related to conformance. A naïve approach toward conformance checking would be to count the fraction of cases that can be parsed completely (i.e., the proportion of cases corresponding to firing sequences leading from [start] to [end]). 70

1391 cases Table 7.1 Event log L full : a = register request, b = examine thoroughly, c = examine casually, d = check ticket, e = decide, f = reinitiate request, g = pay compensation, and h = reject request Example Frequency Reference Trace 455 σ 1 a, c, d, e, h 191 σ 2 a, b, d, e, g 177 σ 3 a, d, c, e, h 144 σ 4 a, b, d, e, h 111 σ 5 a, c, d, e, g 82 σ 6 a, d, c, e, g 56 σ 7 a, d, b, e, h 47 σ 8 a, c, d, e, f, d, b, e, h 38 σ 9 a, d, b, e, g 33 σ 10 a, c, d, e, f, b, d, e, h 14 σ 11 a, c, d, e, f, b, d, e, g 11 σ 12 a, c, d, e, f, d, b, e, g 9 σ 13 a, d, c, e, f, c, d, e, h 8 σ 14 a, d, c, e, f, d, b, e, h 5 σ 15 a, d, c, e, f, b, d, e, g 3 σ 16 a, c, d, e, f, b, d, e, f, d, b, e, g 2 σ 17 a, d, c, e, f, d, b, e, g 2 σ 18 a, d, c, e, f, b, d, e, f, b, d, e, g 1 σ 19 a, d, c, e, f, d, b, e, f, b, d, e, h 1 σ 20 a, d, b, e, f, b, d, e, f, d, b, e, g 1 σ 21 71 a, d, c, e, f, d, b, e, f, c, d, e, f, d, b, e, g

fore checking the ticket (activity d). Clearly, N able 7.1. For example, σ 3 = a, d, c, e, h is not Example N1 F-net N 3 has no choices, e.g., the request 7 Conformance is Che al e 7.1 cannot be replayed by this model, e.g., σ cording to WF-net N 3. WF-net N 4 is a varian quirement is that traces need to start with a and Table 7.1 can be replayed by N 4. A naïve approach toward conformance check action of cases that can be parsed completely ( onding to firing sequences leading from [start] tness naïve offitness N 1 is 1391 1391 = 1, i.e., all 1391 cases in L f N 1 ( can be replayed ). 72 The fitness of N 2 is

d). Clearly, N 2 does not allow for all traces in, c, e, h is not possible according to WF-net N 2. Example N2 e request is always rejected. Many traces in Tamodel, e.g., σ 2 = a, b, d, e, g is not possible N 4 is a variant of the flower model : the only tart with a and end with g or h. Clearly, all traces. rmance checking would be to simply count the d completely (i.e., the proportion of cases correg from [start] to [end]). Using this approach the 391 cases in L full correspond to a firing a, sequence d, c, e, g 82 tness of N 2 is 948 naïve fitness 443 cases do not correspond to a firing sequence 1391 a, d, c, e, h 177 a, d, b, e, h 56 = 0.6815 because 948... cases 443 cases do not correspond to a firing sequence 73

as no choices, e.g., the request is always rejected ot be replayed by this model, e.g., σ 2 = a, b, d, WF-net N 3. WF-net N 4 is a variant of the flow naïve fitness Example N3 is that traces need to start with a and end with g or an be replayed by N 4. pproach toward conformance checking would be ses that can be parsed completely (i.e., the propo ring sequences leading from [start] to [end]). Usi is 1391 1391 = 1, i.e., all 1391 cases in L correspond full be replayed ). The fitness of N 2 is 948 1391 a, b, d, = e, g 0.6815 191 ed correctly 759 whereas cases do not correspond 443 cases to a firing do sequence not a, correspond b, d, e, h 144 a, c, d, e, g 111 tness of N 3 is 632 = 0.4543: only 632... cases have 1391 g sequence of N 2. The fitness of N 74 4 is 1391 1391 = 1

fore checking the ticket (activity d). Clearly, N able 7.1. For example, σ 3 = a, d, c, e, h is not Example N4 F-net N 3 has no choices, e.g., the request is al e 7.1 cannot be replayed by this model, e.g., σ cording to WF-net N 3. WF-net N 4 is a varian quirement is that traces need to start with a and Table 7.1 can be replayed by N 4. A naïve approach toward conformance check action of cases that can be parsed completely ( flower model (poorly structured). onding 7.2 Four WF-nets: to firing N 1, N 2, Nsequences 3 and N 4 leading from [start] tness naïve offitness N 1 is 1391 e this example to introduce 1391 = 1, i.e., all 1391 cases in L f the notation. Figure 7.3 shows the various stages N 1 ( can be replayed ). The fitness of N 2 is 75 lay. Four counters are shown at each stage: p (produced tokens), c (consum

which places p1 and p2 are merged into a Almost Fitting Traces fitness of 0 1391 This naïve fitness notion seems to be too strict as traces can differ only slightly and not be counted at all. Now consider a model that cannot replay σ, but that can replay 99 of the 100 events in σ. Then, consider another model that can only replay 10 of the 100 events in σ. = 0, because none of the tr seems to be too strict as most of the model This is especially the case for larger proces σ = a 1,a 2,...,a 100 in some log L. Now but that can replay 99 of the 100 events in σ consider another model that can only repla is not fitting at all). Using the naïve fitness fied as nonfitting for both models without in one model and in complete disagreemen Using the naïve fitness metric, the trace would simply be a fitness notion defined at the level of even classified as nonfitting for both models without acknowledging that σ was almost fitting In the naïve fitness computation just d in one model and in complete disagreement with the other. once we encounter a problem and mark it 76

Missing and Remaining Tokens We introduce a fitness notion defined at the level of events rather than full traces. In the naïve fitness computation just described, we stopped replaying a trace once we encounter a problem (and mark it as nonfitting). Let us instead just continue replaying the trace on the model but record all situations where a transition is forced to fire without being enabled, i.e., we count all missing tokens. Moreover, we record the tokens that remain at the end. 77

Four Counters p (produced tokens) c (consumed tokens) m (missing tokens) r (remaining tokens) 7 Conformance Ch counter is incremented by 2. Therefore, p = 3 and c = 1 after. Then we replay the second event (c). Firing transition c results in p fter replaying the third event (i.e. d) p = 5 and c = 3. They we rep sumes two tokens and produces one, the result is p = 6 and c = 5. he last event (h). Firing h results in p = 7 and c = 6. At the en t consumes a token from place end. Hence the final result is p = 0. Clearly, there are no problems when replaying the σ 1, i.e., the or remaining tokens (m = r = 0). ss of a case with trace σ on WF-net N is defined as follows: ( ) ( ) fitness(σ, N) = 1 2 1 m c + 1 2 1 r p rts computes the fraction of missing tokens relative to the numb okens. 1 m c = 1 if there are no 78 missing tokens (m = 0) and 1 m

Example 197 the environment produces a token for place start Fig. 7.2 Four WF-nets: N 1, N 2, N 3 and N 4 ), use this example to introduce the notation. F replay. Four counters are shown at each stag tokens), m (missing tokens), and r (remainin Initially, p = c = 0 and all places are emp token for place start. Therefore, the p counte to replay σ 1 = a, c, d, e, h, i.e., we first fi a consumes one token and produces two to 79

), Example replaying a is possible one token is consumed, two produced Fig. 7.2 Four WF-nets: N 1, N 2, N 3 and N 4 use this example to introduce the notation. F replay. Four counters are shown at each stag tokens), m (missing tokens), and r (remainin Initially, p = c = 0 and all places are emp token for place start. Therefore, the p counte to replay σ 1 = a, c, d, e, h, i.e., we first fi a consumes one token and produces two to 80

replaying c is possible one token is consumed, one produced Example Fig. 7.2 Four WF-nets: N 1, N 2, N 3 and N 4 use this example to introduce the notation. F replay. Four counters are shown at each stag tokens), m (missing tokens), and r (remainin Initially, p = c = 0 and all places are emp token for place start. Therefore, the p counte to replay σ 1 = a, c, d, e, h, i.e., we first fi a consumes one token and produces two to 81

replaying d is possible one token is consumed, one produced Example Fig. 7.2 Four WF-nets: N 1, N 2, N 3 and N 4 use this example to introduce the notation. F replay. Four counters are shown at each stag tokens), m (missing tokens), and r (remainin Initially, p = c = 0 and all places are emp token for place start. Therefore, the p counte to replay σ 1 = a, c, d, e, h, i.e., we first fi a consumes one token and produces two to 82

replaying e is possible two tokens are consumed, one produced Example Fig. 7.2 Four WF-nets: N 1, N 2, N 3 and N 4 use this example to introduce the notation. F replay. Four counters are shown at each stag tokens), m (missing tokens), and r (remainin Initially, p = c = 0 and all places are emp token for place start. Therefore, the p counte to replay σ 1 = a, c, d, e, h, i.e., we first fi a consumes one token and produces two to 83

the third event (i.e. d) p = 5 and c = 3. They we replay e. ens and produces one, the result is p = 6 and c = 5. Then (h). Firing h results in p = 7 and c = 6. At the end, the token from place end. Hence Example the final result is p = c = 7 here are no problems when replaying the σ 1, i.e., there are tokens (m = r = 0). replaying h is possible one ith token traceis σconsumed, WF-net one produced N is defined as follows: s(σ, N) = 1 ( 1 m ) + 1 ( 1 r ) 2 c 2 p At the end, the environment consumes a token from place end. Fig. 7.2 Four WF-nets: N 1, N 2, N 3 and N 4 use this example to introduce the notation. F the fraction of missing tokens relative to the number of = 1 if there are no missing tokens (m = 0) and 1 m c = 0 ed were missing (m = c). Similarly, 1 p r = 1 if there nd 1 r p = 0 if none of the produced tokens was actually al penalty for missing and remaining tokens. By definition: our example, fitness(σ 1,N 1 ) = 1 2 (1 0 7 ) + 1 2 (1 0 7 ) = 1 ing or remaining tokens. trace that cannot be replayed properly. Figure 7.4 shows 3 = a, d, c, e, h on WF-net N 2. Initially, p = c = 0 and giovedì the 13 dicembre environment 2012 produces a token for place start and the replay. Four counters are shown at each stag tokens), m (missing tokens), and r (remainin Initially, p = c = 0 and all places are emp token for place start. Therefore, the p counte to replay σ 1 = a, c, d, e, h, i.e., we first fi a consumes one token and produces two to 84

Example: Missing Token the environment produces a 7.2 token Tokenfor Replay place start 199 g. 7.4 Replaying σ 3 = a, d, c, e, h on top of WF-net e token is remaining (r = 1). The r-tag and m-tag high 85

7.2 Token Replay 199 Example: Missing Token replaying a is possible one token is consumed, one produced g. 7.4 Replaying σ 3 = a, d, c, e, h on top of WF-net e token is remaining (r = 1). The r-tag and m-tag high 86

Example: Missing Token replaying d is NOT possible one token is missing, one produced, one consumed g. 7.4 Replaying σ 3 = a, d, c, e, h on top of WF-net e token is remaining (r = 1). The r-tag and m-tag high 87

Example: Missing Token replaying c is possible one token is produced, one consumed g. 7.4 Replaying σ 3 = a, d, c, e, h on top of WF-net e token is remaining (r = 1). The r-tag and m-tag high 88

Example: Missing Token replaying e is possible one token is produced, one consumed g. 7.4 Replaying σ 3 = a, d, c, e, h on top of WF-net e token is remaining (r = 1). The r-tag and m-tag high giovedì 13 Fig. dicembre 7.42012 Replaying σ 3 = a, d, c, e, h on top of WF-net N 2 : one token is missing (m = 1) and 89

ave p = 2, c = 1, m = 0, and r = 0. Now we try to replay the second event. This is ot possible, because transition d is not enabled. To fire d, we need to add a token o place p2 and record the missing token, i.e., the m counter is incremented. The p nd c counter Example: are updated as usual. Missing Therefore, after firing Token d, we have p = 3, c = 2, = 1, and r = 0. We also tag place p2 to remember that a token was missing. Then e replay the next three events (c, e, h). The corresponding transitions are enabled. herefore, replaying we only h is need possible to update p and c counters. After At the replaying end, the last event, one token is produced, one consumed the environment consumes a token from place end. e have p = 6, c = 5, m = 1, and r = 0. In the final state [p2, end], the environment onsumes the token from place end. A token remains in place p2. Therefore, place 2 is tagged and the r counter is incremented. Hence, the final result is p = c = 6 nd m = r = 1. Figure 7.4 shows diagnostic information that helps to understand he nature of non-conformance. There was a situation in which d occurred but could ot happen according to the model (m-tag) and there was a situation in which d was upposed to happen but did not occur according to the log (r-tag). Moreover, we can ompute the fitness of trace σ 3 on WF-net N 2 based on the values of p, c, m, and r: fitness(σ 3,N 2 ) = 1 ( 1 1 ) + 1 ( 1 1 ) 2 6 2 6 Fig. 7.4 Replaying σ 3 = a, d, c, e, h on top of WF-net N 2 : one token is missing (m = 1) and one token is remaining (r = 1). The r-tag and m-tag highlight the place where = 0.8333 σ 3 and the model diverge As a third example, we replay σ 2 = a, b, d, e, g on top of WF-net N 3. Now the ituation is slightly different because N 3 does not contain all activities appearing in he event log. In such a situation it seems reasonable to abstract from these events. 90 ence, we effectively replay σ 2 = a, d, e. Figure 7.5 shows the process of replaygiovedì 13 dicembre 2012 g. 7.4 Replaying σ 3 = a, d, c, e, h on top of WF-net e token is remaining (r = 1). The r-tag and m-tag high

Example: Event Removal 200 7 Conformance Checking = a, b, d, e, g on top of WF-net N 3, all events not correspo ay removed σ 2 = a, first. b, d, Replaying e, g on top σ 2 = of a, WF-net d, e shows N 3, allthat events two not token co are remaining (r = 2) thus resulting = in a, ad, fitness e shows of 0.6that two del are removed first. Replaying σ 91 2

Example: Event Removal = a, b, d, e, g on top of WF-net N 3, all events not correspo ay removed σ 2 = a, first. b, d, Replaying e, g on top σ 2 = of a, WF-net d, e shows N 3, allthat events two not token co are remaining (r = 2) thus resulting = in a, ad, fitness e shows of 0.6that two del are removed first. Replaying σ 92 2

Example: Event Removal Token Replay 2 d also place end gets an m-tag. Moreover, two tokens are remaining: one in pla and one in place p5. The places are tagged with an r-tag, and the two remainin ens are recorded r = 2. This way we find a fitness of 0.6 for trace σ 2 and WF-n based on the values p = 5, c = 5, m = 2, and r = 2: Fig. 7.5 To replay σ 2 = a, b, d, e, g on top of WF-net N 3, all events not corresponding to activities in the model are removed first. ( Replaying σ ) 2 = a, d, ( e shows that ) two tokens are missing (m = 2) and two tokens are remaining (r = 2) thus resulting in a fitness of 0.6 fitness(σ 2,N 3 ) = 1 2 1 2 5 + 1 2 1 2 5 = 0.6 reover, Fig. 7.5 clearly shows the cause of this poor conformance: c was su sed to happen according to the model but did not happen, e happened but was n removed first. Replaying σ ssible according to the model, and h2 was supposed to happen but did not happe fire, place p3 is still empty and e is not enabled. The missing token is recorded (m = 1) and place p3 gets an m-tag. After replaying σ 2, the resulting marking is [p1,p5]. Now the environment needs to consume the token from place end. However, place end is not marked. Therefore, another missing token is recorded (m = 2) = a, b, d, e, g on top of WF-net N 3, all events not correspo ay σ 2 = a, b, d, e, g on top= of a, WF-net d, e shows N 3, allthat events two not token co del Figures are are remaining 7.3, removed 7.4, 7.5 (r first. illustrate = 2) Replaying thus howresulting to93analyze σ 2 = in the a, afitness d, e of shows aofsingle 0.6that case. two Th

but also shows that the relation between94the proportion of events that cannot be re- same as fitness approach depends can be used on missing to analyzeand the fitness remaining of a logtokens consisting rather of many than cases ev Simply take the sums of all produced, consumed, missing, and remaining tokens and apply the same formula. Let p N,σ denote the number of produced tokens when replaying σ on N. Fitness c N,σ, m N,σ, r N,σ are of defined a in Log a similar fashion, e.g., m N,σ is results the number in one of missing tokens andwhen onereplaying remaining σ ontoken. N. NowThis we can seems definerea the fitness of an event log L on WF-net N: fitness(l, N) = 1 ( 1 σ L L(σ ) m ) N,σ 2 σ L L(σ ) c + 1 ( 1 σ L L(σ ) r ) N,σ N,σ 2 σ L L(σ ) p N,σ Note that laying theσ L entire L(σ ) event m N,σlog, is total we number can now of missing compute tokens when the fitness replaying ofthe e entire event log, because L(σ ) is the frequency of trace σ and m N,σ is the number of missing tokens for a single instance of σ. The value of fitness(l, N) is between 0 (very poor fitness; none of the produced tokens is consumed and all of the consumed tokens are missing) fitness(l and 1 (perfect full,nfitness; 1 ) = 1all cases can be replayed withou any problems). Althoughfitness(L fitness(l, N) is a measure focusing on tokens in places, we will interpret it as a measure on events. full,n The 2 ) = 0.9504 intuition of fitness(l, N) = 0.9 is tha about 90% of the eventsfitness(l can be replayed correctly. 1 full,n 3 ) = 0.8797 This is only an informal characterization as fitness depends on missing and remaining tokens rather than events. For instance, a transition that fitness(l is forced full to,n fire 4 during ) = 1replay may have multiple empty input places. Note that if two subsequent events are swapped in a sequential process, this results in one missing and one remaining token. This seems reasonable a transition that is forced to fire during replay may have multip ces. Note that if two subsequent events are swapped in a sequen hows that the relation between the proportion of events that cann rrectly and the proportion of tokens that are missing or remaining he four models in Fig. 7.2 ainder of this book, we often use this intuitive characterization of fitness, alth

Diagnostic Information 202 7 Conformance Checking Fig. 7.6 Diagnostic information showing the deviations (fitness(l full,n 2 ) = 0.9504) 95

Diagnostic Information 7.2 Token Replay 203 Fig. 7.7 Diagnostic information showing the deviations (fitness(l full,n 3 ) = 0.8797) 96

Drill Down An event log can be split into two sublogs: one event log containing only fitting cases and one event log containing only non-fitting cases. The second event log can be used to discover a different process model. Also other data and process mining techniques can be used. For instance, it is interesting to know which people handled the deviating cases and whether these cases took longer or were more costly. In case fraud is suspected, one may create a social network based on the event log with deviating cases. 97

Drill Down 204 7 Conformance Checking 98

Comparing Footprints 99

Footprint from Play-out Given a workflow net, the play-out technique can be used to extract a local complete set of traces. If we see the set of traces as an event log (without multiplicities), then we can derive the relation >. Then, we can construct the footprint (i.e. a matrix showing causal dependencies between events) of the net model based on such relation >. (From the viewpoint of a footprint matrix, an event log is complete if and only if all activities that can follow one another do so at least once in the log.) 100

Footprint-based Conformance Footprints are available for logs and models (nets). This allows for: log vs model conformance (do the log and the model agree on the ordering of activities?) model vs model conformance (quantification of their similarities) log vs log comparison (concept drift: how does the work changes in sub-logs?) 101

Conformance based on footprints The conformance based on footprints can be computed by taking: n: total number of cells in the footprint matrix d: number of cells with different content between the two matrices 1 d n 102

Example 196 7 Conformance Checking 7 Conformance Checking 7.2 Footprint of L full 1 a b c d e f g h a # # # # # b # # # # c # # # # d # # # e # # f # # # # g # # # # # # # h # # # # # # # 103 206 Also Table 7.2 Footprint of L full and N 1

7.2 Footprint of L full 1 a b c d e f g h a # # # # # Example b # # # # c # # # # d # # # e # # f # # # # g # # # # # # # h # # # # # # # 7.3 Footprint of N 2 in Fig. 7.2 a b c d e f g h a # # # # # # b # # # # # c # # # # # d # # # # # e # # # # f # # # # # g # # # # # # # h # # # # # # # 104

7.2 Footprint of L full 1 a b c d e f g h a # # # # # Example b # # # # c # # # # d # # # e # # f # # # # g # # # # # # # h # # # # # # # 7 Conformance Checking 27.3 Footprint of of L full N 2 in Fig. 7.2 a a b b c c d d e e f f g g h h a a # # # # # # # # # # # b b # # # # # # # # # c c # # # # # # # # # d d # # # # # # # # e e # # # # # # f f # # # # # # # # # g g # # # # # # # # # # # # # # h h # # # # # # # # # # # # # # 105

this is not possible in N 2. The relation between 7.3 Footprint of N 2 in Fig. 7.2 a b c d e f g h reflects that in WF-net N 2 both activities are no a # # # # # # Example b # # # # # detailed diagnostics, Table 7.4 can also be used t c # # # # # ce, 12 of the d 64 # cells differ. # Hence, # # # one could s e # # # # e footprints is 1 12 64 = 0.8125. f # # # # # g # # # # # # # ance analysis h # based # # on # footprints # # is # only mean respect to the directly follows relation > L. T 7.4 Differences en the footprints of L a b c d e f g h s-validation full (see Sect. 3.6.2). 2. The event log and the disagree on 12 of the a : # ls ingly, of the footprint bothb models and event : : # logs have footpri c : : # omparisons d as : # just : described, : i.e., : # it can be che e : # : # f : # g h 106

Conclusion 107

Requirements gone bad 108

Requirements gone bad 108

Requirements gone bad 108

Requirements gone bad 108

Requirements gone bad 108

Requirements gone bad 108

Requirements gone bad 108

Requirements gone bad 108

Requirements gone bad 108

Requirements gone bad 108

Conclusion We have overviewed the iceberg tip of business process management more notation, theory, technology, tools, methodology, encoding, validation, verification, research lie down there, more or less deep, below the surface......for all of us to explore 109